AI Product Building Coding Tools

Verification is the single highest-leverage practice for agent-assisted coding

Giving an agent a way to verify its own work 2-3x the quality of output — without verification, you're shipping blind

Boris Cherny + Anthropic Official Best Practices · Jan 15, 2025 · 28 connections

Both the creator of Claude Code (Boris Cherny) and Anthropic’s official documentation converge on the same claim: verification is the single highest-leverage thing you can do. Not better prompts, not more context, not smarter models — just giving the agent a way to check its own work. The quality multiplier is 2-3x.

Verification takes many forms: running test suites, executing bash commands, taking screenshots and comparing to designs, using subagents as reviewers. The key insight is that agents with verification enter a self-correcting loop — they detect their own errors and fix them before presenting results. Without it, plausible-looking output hides edge case failures.

This connects to Declarative beats imperative when working with agents — tests ARE declarative success criteria. Write the test first, then let the agent pass it. The test is both the specification and the verification. It also strengthens the case for Compound engineering makes each unit of work improve all future work: the review phase (40% of time) is systematic verification that catches what the implementation phase missed. Elvis Sun’s Multi-model code review creates adversarial robustness — each model catches what others miss compounds the effect: if verification 2-3x quality, verification by three diverse models compounds further — each model has different blindspots, creating emergent coverage no single verifier achieves. For systematic coverage beyond hand-picked test cases, Property-based testing explores agent input spaces that example-based tests miss applies generative testing to explore input spaces that example-based verification misses. And The 80/99 gap is where AI products die — demo accuracy and production reliability are infinitely far apart shows why verification isn’t optional: without it, you can’t know where you sit on the 80-to-99 accuracy spectrum. Taken to its extreme, verification becomes the entire basis for autonomous operation: A mediocre agent inside a strong harness outperforms a stronger agent inside a messy one shows that a mediocre agent inside a strong verification harness outperforms a smarter agent without one, and Rollback safety nets enable autonomous iteration — not model intelligence identifies automatic rollback as the mechanism that makes this safe — failures cost nothing when they’re instantly undone. But verification itself has a structural limit: Verification is a Red Queen race — optimizing against a fixed eval contaminates it — any fixed eval suite degrades the moment you optimize against it, making verification a permanent race rather than a solved state. For long-horizon agents specifically, The first-draft pattern is the killer app for long-horizon agents — agents produce, humans review describes the practical verification pattern: agents produce comprehensive first drafts (PRs, research reports, incident analyses), humans verify the output. This is verification applied at the workflow level rather than the code level. At the human-research level the same instinct is A loss curve is reassurance, not analysis — pull a hundred failures and read every one — pulling a hundred failures and reading every one is verification by hand, and a metric you’ve never traced back to transcripts is the human equivalent of shipping blind.

Connected Insights

References (10)

→ Compound engineering makes each unit of work improve all future work → Declarative beats imperative when working with agents → The first-draft pattern is the killer app for long-horizon agents — agents produce, humans review → A mediocre agent inside a strong harness outperforms a stronger agent inside a messy one → Multi-model code review creates adversarial robustness — each model catches what others miss → Property-based testing explores agent input spaces that example-based tests miss → Rollback safety nets enable autonomous iteration — not model intelligence → A loss curve is reassurance, not analysis — pull a hundred failures and read every one → The 80/99 gap is where AI products die — demo accuracy and production reliability are infinitely far apart → Verification is a Red Queen race — optimizing against a fixed eval contaminates it

Referenced by (18)

← Research speed is mostly the speed at which you discover you're wrong — which makes tooling a first-class research activity ← Verification is a Red Queen race — optimizing against a fixed eval contaminates it ← Autonomous coding loops need small stories and fast feedback to work ← Multi-model code review creates adversarial robustness — each model catches what others miss ← Every optimization has a shadow regression — guard commands make the shadow visible ← Production agents route routine cases through decision trees, reserving humans for complexity ← Evaluate agent tools with real multi-step tasks, not toy single-call examples ← Rollback safety nets enable autonomous iteration — not model intelligence ← Harness engineering — humans steer, agents execute, documentation is the system of record ← A mediocre agent inside a strong harness outperforms a stronger agent inside a messy one ← Excessive self-regard makes fixable failures persist — people excuse poor performance instead of correcting it ← Weaponize sycophancy with adversarial agent ensembles instead of fighting it ← The first-draft pattern is the killer app for long-horizon agents — agents produce, humans review ← Property-based testing explores agent input spaces that example-based tests miss ← The 80/99 gap is where AI products die — demo accuracy and production reliability are infinitely far apart ← Adversarial branch-walking beats review for planning — walk every design branch until resolved ← Auto-generated narrow monitors beat handwritten broad checks — a tight mesh over the exact shape of the code ← Speed without feedback amplifies errors — agents lack the self-correction mechanism that constrains human mistakes