AI · Product Building · Coding Tools

Verification is the single highest-leverage practice for agent-assisted coding

Giving an agent a way to verify its own work multiplies output quality by 2-3x — without verification, you're shipping blind

Boris Cherny + Anthropic Official Best Practices · 24 connections

Both the creator of Claude Code (Boris Cherny) and Anthropic’s official documentation converge on the same claim: verification is the single highest-leverage thing you can do. Not better prompts, not more context, not smarter models — just giving the agent a way to check its own work. The quality multiplier is 2-3x.

Verification takes many forms: running test suites, executing bash commands, taking screenshots and comparing to designs, using subagents as reviewers. The key insight is that agents with verification enter a self-correcting loop — they detect their own errors and fix them before presenting results. Without it, plausible-looking output hides edge case failures.
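The self-correcting loop can be sketched in a few lines. This is a minimal illustration, not any particular product's implementation: `agent_propose` is a hypothetical stand-in for an LLM coding agent (stubbed here so the example runs), and the verifier simply executes the candidate code against known cases and feeds failures back.

```python
# Sketch of a verification loop: propose, check, feed failures back.
# `agent_propose` is a hypothetical agent call, stubbed deterministically
# so the loop's behavior is visible without a real model.

def agent_propose(task, feedback):
    # Stub: a real agent would call a model with the task and feedback.
    # Here the "agent" fixes its off-by-one only after seeing the failure.
    if "expected 55" in feedback:
        return "def total(xs): return sum(xs)"
    return "def total(xs): return sum(xs[:-1])"  # first attempt is buggy

def run_checks(code):
    # The verifier: execute the candidate and assert against known cases.
    ns = {}
    exec(code, ns)
    try:
        got = ns["total"](list(range(1, 11)))
        assert got == 55, f"total(1..10) returned {got}, expected 55"
        return True, "all checks passed"
    except AssertionError as e:
        return False, str(e)

def verified_edit(task, max_attempts=3):
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        code = agent_propose(task, feedback)
        ok, feedback = run_checks(code)
        if ok:
            return code, attempt  # error detected and fixed before returning
    raise RuntimeError(f"no passing patch in {max_attempts} attempts: {feedback}")

code, attempts = verified_edit("implement total(xs)")
print(attempts)  # → 2: the first patch failed, and the failure message drove the fix
```

Without `run_checks`, the first (plausible-looking but buggy) patch would have been presented as the result; with it, the failure becomes feedback and the loop converges.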

This connects to Declarative beats imperative when working with agents — tests ARE declarative success criteria. Write the test first, then let the agent pass it: the test is both the specification and the verification. It also strengthens the case for Compound engineering makes each unit of work improve all future work: the review phase (40% of time) is systematic verification that catches what the implementation phase missed.

Elvis Sun's Multi-model code review creates adversarial robustness — each model catches what others miss compounds the effect: if verification multiplies quality by 2-3x, verification by three diverse models compounds further — each model has different blindspots, creating emergent coverage no single verifier achieves. For systematic coverage beyond hand-picked test cases, Property-based testing explores agent input spaces that example-based tests miss applies generative testing to explore input spaces that example-based verification misses. And The 80/99 gap is where AI products die — demo accuracy and production reliability are infinitely far apart shows why verification isn't optional: without it, you can't know where you sit on the 80-to-99 accuracy spectrum.

Taken to its extreme, verification becomes the entire basis for autonomous operation: A mediocre agent inside a strong harness outperforms a stronger agent inside a messy one shows exactly that, and Rollback safety nets enable autonomous iteration — not model intelligence identifies automatic rollback as the mechanism that makes this safe — failures cost nothing when they're instantly undone. But verification itself has a structural limit: Verification is a Red Queen race — optimizing against a fixed eval contaminates it — any fixed eval suite degrades the moment you optimize against it, making verification a permanent race rather than a solved state.
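The property-based idea can be shown with a hand-rolled sketch. A real setup would use a library like Hypothesis; here `normalize_path` is a hypothetical function under test, and the "property" is an invariant checked across hundreds of generated inputs rather than a few hand-picked examples.

```python
import random

# Hand-rolled sketch of property-based verification, assuming a
# hypothetical `normalize_path` under test. Instead of fixed examples,
# generate many random inputs and assert invariants that must hold
# for all of them.

def normalize_path(p):
    # Toy implementation under test: collapse runs of slashes.
    while "//" in p:
        p = p.replace("//", "/")
    return p

def check_property(trials=500, seed=0):
    rng = random.Random(seed)  # seeded for reproducible runs
    for _ in range(trials):
        # Generate an arbitrary path-like string.
        p = "".join(rng.choice("ab/") for _ in range(rng.randint(0, 20)))
        out = normalize_path(p)
        # Invariants: normalization is idempotent, and no "//" survives.
        assert normalize_path(out) == out, (p, out)
        assert "//" not in out, (p, out)
    return trials

print(check_property())  # → 500
```

An example-based test would only cover the inputs someone thought to write down; the generated inputs probe corners (leading, trailing, and long slash runs) that hand-picked cases routinely miss.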

Connected Insights

Referenced by (16)

Autonomous coding loops need small stories and fast feedback to work
Production agents route routine cases through decision trees, reserving humans for complexity
Evaluate agent tools with real multi-step tasks, not toy single-call examples
Multi-model code review creates adversarial robustness — each model catches what others miss
Harness engineering — humans steer, agents execute, documentation is the system of record
Excessive self-regard makes fixable failures persist — people excuse poor performance instead of correcting it
Weaponize sycophancy with adversarial agent ensembles instead of fighting it
Property-based testing explores agent input spaces that example-based tests miss
The 80/99 gap is where AI products die — demo accuracy and production reliability are infinitely far apart
A mediocre agent inside a strong harness outperforms a stronger agent inside a messy one
Rollback safety nets enable autonomous iteration — not model intelligence
Every optimization has a shadow regression — guard commands make the shadow visible
Verification is a Red Queen race — optimizing against a fixed eval contaminates it
Adversarial branch-walking beats review for planning — walk every design branch until resolved
Auto-generated narrow monitors beat handwritten broad checks — a tight mesh over the exact shape of the code
Speed without feedback amplifies errors — agents lack the self-correction mechanism that constrains human mistakes