Evals are behavioral pressure vectors, not neutral measurements — poorly chosen evals distort agent development
Each eval shapes agent behavior the way a selection pressure does. Accumulating tests without a strategic purpose creates 'an illusion of improving your agent' while steering development in unproductive directions. Correctness alone also misleads: an agent that succeeds inefficiently carries hidden costs that a pass/fail score does not show.
LangChain — How We Build Evals for Deep Agents · 11 connections
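
The point about correctness hiding inefficiency can be made concrete with a small scoring sketch. This is a minimal illustration, not the method from the source article: the `EvalRun` record, its fields, and the idea of a per-case step budget are all assumptions introduced here. It reports the plain pass rate next to the share of runs that were both correct and within budget, so a gap between the two numbers exposes the hidden cost.

```python
from dataclasses import dataclass


@dataclass
class EvalRun:
    """One agent attempt at one eval case (hypothetical record, not a real API)."""
    case_id: str
    passed: bool        # did the agent reach a correct final result?
    steps: int          # tool calls or turns the agent actually used
    budget_steps: int   # assumed positive: steps a reasonable solution needs


def summarize(runs: list[EvalRun]) -> dict[str, float]:
    """Report correctness alongside two simple efficiency signals."""
    if not runs:
        raise ValueError("no runs to summarize")
    passed = [r for r in runs if r.passed]
    # Share of all runs that ended with a correct result.
    pass_rate = len(passed) / len(runs)
    # Share of all runs that were correct AND stayed within the step budget.
    efficient = [r for r in passed if r.steps <= r.budget_steps]
    efficient_pass_rate = len(efficient) / len(runs)
    # Mean steps-to-budget ratio over correct runs (1.0 means exactly on budget).
    if passed:
        mean_step_ratio = sum(r.steps / r.budget_steps for r in passed) / len(passed)
    else:
        mean_step_ratio = float("nan")
    return {
        "pass_rate": pass_rate,
        "efficient_pass_rate": efficient_pass_rate,
        "mean_step_ratio": mean_step_ratio,
    }


if __name__ == "__main__":
    runs = [
        EvalRun("a", passed=True, steps=3, budget_steps=3),
        EvalRun("b", passed=True, steps=12, budget_steps=4),  # correct but wasteful
        EvalRun("c", passed=True, steps=5, budget_steps=5),
        EvalRun("d", passed=False, steps=9, budget_steps=4),
    ]
    print(summarize(runs))
```

In this made-up data, three of four runs pass, but only two pass within budget. A suite that tracked correctness alone would count run "b" as a clean success.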
Connected Insights
References (5)
→ Eval suites must shrink, not just grow — spring cleaning prevents stale behavioral pressure
→ Holdout eval sets are the generalization gate for autonomous harness optimization — without them, the loop overfits
→ Self-improving agents overfit to eval metrics — the meta-agent games rubrics unless structurally constrained
→ Time-bounded evaluation forces optimization for real-world usefulness instead of idealized performance
→ Verification is a Red Queen race — optimizing against a fixed eval contaminates it
Referenced by (6)
← Private evals should measure business outcomes that matter — not external benchmarks
← Verification is a Red Queen race — optimizing against a fixed eval contaminates it
← Eval suites must shrink, not just grow — spring cleaning prevents stale behavioral pressure
← Evals are the gradient signal for harness engineering — the same data quality rigor from ML training applies
← LLM-as-judge must be calibrated against human judgment — uncalibrated judges are worse than no judges
← The context flywheel is a Day 90 moat — Day 0 comparisons are misleading