
Evals are behavioral pressure vectors, not neutral measurements — poorly chosen evals distort agent development

Each eval shapes agent behavior like a selection pressure. Accumulating tests without strategic purpose creates 'an illusion of improving your agent' while distorting development in unproductive directions. Correctness alone also misleads, because agents that succeed inefficiently incur hidden costs.

LangChain — How We Build Evals for Deep Agents · Apr 6, 2026 · 11 connections

LangChain’s Deep Agents eval methodology starts from a counterintuitive premise: “More evals ≠ better agents.” Each eval acts as a behavioral pressure vector — it doesn’t just measure the agent, it shapes it. Poorly chosen evals distort development in unproductive directions, creating “an illusion of improving your agent” while the eval suite doesn’t reflect production-relevant capabilities.

Two practical principles emerge. First, correctness alone misleads: an agent that succeeds in 6 steps with 14 seconds of latency receives the same correctness score as one that succeeds in 4 steps with 8 seconds. Only step ratio, tool call ratio, and latency ratio, each measured against “ideal trajectories,” reveal the operational difference. Second, taxonomy beats aggregation: grouping evals by what they test (file operations, retrieval, tool use) rather than by where they came from gives actionable visibility that sits between the extremes of a single aggregate score and overwhelming per-test noise.
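Both principles can be illustrated with a minimal Python sketch. It is not LangChain's implementation: the `EvalRun` record, its field names, and the two helper functions are hypothetical, and the sketch assumes the 4-step, 8-second run from the example is the ideal trajectory. The tool call counts are invented for illustration, since the source gives none.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class EvalRun:
    """One eval result. Every field name here is an illustrative assumption."""
    name: str
    category: str            # what the eval tests, e.g. "file operations"
    correct: bool
    steps: int
    tool_calls: int
    latency_s: float
    ideal_steps: int         # cost of the hand-written ideal trajectory
    ideal_tool_calls: int
    ideal_latency_s: float


def efficiency_ratios(run: EvalRun) -> dict[str, float]:
    """Observed cost divided by ideal cost; 1.0 means the run matched the ideal."""
    return {
        "step_ratio": run.steps / run.ideal_steps,
        "tool_call_ratio": run.tool_calls / run.ideal_tool_calls,
        "latency_ratio": run.latency_s / run.ideal_latency_s,
    }


def summarize_by_category(runs: list[EvalRun]) -> dict[str, dict[str, float]]:
    """Group runs by what they test and average correctness and efficiency."""
    grouped: dict[str, list[EvalRun]] = defaultdict(list)
    for run in runs:
        grouped[run.category].append(run)

    summary = {}
    for category, members in grouped.items():
        ratios = [efficiency_ratios(r) for r in members]
        row = {"pass_rate": mean(1.0 if r.correct else 0.0 for r in members)}
        for key in ("step_ratio", "tool_call_ratio", "latency_ratio"):
            row[key] = mean(r[key] for r in ratios)
        summary[category] = row
    return summary


# Both runs are correct, so a correctness-only score cannot tell them apart.
# Tool call counts (5 and 4) are made up for this example.
slow = EvalRun("edit_config", "file operations", True, 6, 5, 14.0, 4, 4, 8.0)
fast = EvalRun("rename_file", "file operations", True, 4, 4, 8.0, 4, 4, 8.0)

print(efficiency_ratios(slow))
# {'step_ratio': 1.5, 'tool_call_ratio': 1.25, 'latency_ratio': 1.75}

summary = summarize_by_category([slow, fast])
print(summary["file operations"])
```

The per-category rows are the middle layer the second principle asks for: coarser than individual test results, but still specific enough to show that, for example, file operations pass reliably while taking more steps than the ideal.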

This reframes “Verification is a Red Queen race” from a defensive problem (evals decay) into an offensive one (evals actively shape behavior). The point is not only that optimizing against a fixed eval contaminates it; the choice of which evals to include is itself a design decision that steers agent behavior. The efficiency measurement connects to “Time-bounded evaluation forces optimization for real-world usefulness instead of idealized performance”: Deep Agents measures not just whether the agent solved the problem but whether it solved it within practical resource bounds. The “include SDK unit tests = no signal” finding reinforces “Self-improving agents overfit to eval metrics”: noise in the eval suite does not just waste time, it actively degrades the optimization signal. Two further notes follow from the same logic. “Eval suites must shrink, not just grow” applies because stale or saturated evals exert pressure without providing signal, and “Holdout eval sets are the generalization gate for autonomous harness optimization” describes the structural defense when evals are used to autonomously improve harnesses.
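The last two connections can be made concrete with a small Python sketch. Everything in it is an assumption made for illustration: the function names, the rule that an eval counts as saturated once it has passed every one of its last `window` runs, and the rule that a harness change is accepted only if the mean holdout score does not drop. The source notes do not specify these thresholds.

```python
from statistics import mean


def saturated_evals(history: dict[str, list[bool]], window: int = 5) -> list[str]:
    """Return evals that passed in each of their last `window` runs.

    Assumption: such evals no longer discriminate between agent versions,
    so they are candidates for retirement rather than automatic deletions.
    """
    return sorted(
        name
        for name, results in history.items()
        if len(results) >= window and all(results[-window:])
    )


def passes_holdout_gate(
    before: dict[str, float],
    after: dict[str, float],
    holdout: set[str],
    tolerance: float = 0.0,
) -> bool:
    """Accept a change only if the mean score on holdout evals does not regress.

    Scores on the optimization set are deliberately ignored here: those are
    the scores the improvement loop can overfit to.
    """
    names = [n for n in holdout if n in before and n in after]
    if not names:
        return False  # no holdout evidence, so do not accept the change
    return mean(after[n] for n in names) >= mean(before[n] for n in names) - tolerance


# Hypothetical pass/fail history and scores, invented for this example.
history = {
    "read_file": [True, True, True, True, True],
    "multi_hop_retrieval": [True, False, True, True, False],
}
print(saturated_evals(history))  # ['read_file']

before = {"opt_a": 0.5, "opt_b": 0.5, "hold_1": 0.75, "hold_2": 0.75}
overfit = {"opt_a": 1.0, "opt_b": 1.0, "hold_1": 0.5, "hold_2": 0.5}
general = {"opt_a": 0.75, "opt_b": 0.75, "hold_1": 0.75, "hold_2": 1.0}
holdout = {"hold_1", "hold_2"}

print(passes_holdout_gate(before, overfit, holdout))  # False
print(passes_holdout_gate(before, general, holdout))  # True
```

In the `overfit` case the optimization-set scores double while the holdout scores fall, which is the pattern the holdout gate exists to catch.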

Connected Insights

References (5)

→ Eval suites must shrink, not just grow — spring cleaning prevents stale behavioral pressure
→ Holdout eval sets are the generalization gate for autonomous harness optimization — without them, the loop overfits
→ Self-improving agents overfit to eval metrics — the meta-agent games rubrics unless structurally constrained
→ Time-bounded evaluation forces optimization for real-world usefulness instead of idealized performance
→ Verification is a Red Queen race — optimizing against a fixed eval contaminates it

Referenced by (6)

← Private evals should measure business outcomes that matter — not external benchmarks
← Verification is a Red Queen race — optimizing against a fixed eval contaminates it
← Eval suites must shrink, not just grow — spring cleaning prevents stale behavioral pressure
← Evals are the gradient signal for harness engineering — the same data quality rigor from ML training applies
← LLM-as-judge must be calibrated against human judgment — uncalibrated judges are worse than no judges
← The context flywheel is a Day 90 moat — Day 0 comparisons are misleading