Eval suites must shrink, not just grow — spring cleaning prevents stale behavioral pressure

Saturated evals waste compute without providing signal; more intelligent models or changed desired behaviors make old evals irrelevant, requiring regular pruning alongside addition

@Vtrivedy10 (Viv), “Better Harness: A Recipe for Harness Hill-Climbing with Evals”

Eval suites shouldn’t grow monotonically; spring cleaning is good. Regularly assessing whether an eval is still useful, given more intelligent models or different desired agent behaviors, is essential maintenance. Evals that every model already passes (saturated) waste compute; evals that test for behaviors you no longer want exert the wrong behavioral pressure. As the related insight “Evals are behavioral pressure vectors, not neutral measurements” puts it, poorly chosen evals distort agent development.
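One way to make this review concrete is a small triage pass over the suite. The sketch below is illustrative only: the `EvalRecord` type, the `triage` function, the 0.98 saturation threshold, and the eval and model names are all assumptions, not something from the source. It flags an eval as obsolete if the behavior it tests is no longer wanted, and as saturated if even the weakest model you track passes it almost every time.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    """Hypothetical record for one eval in the suite."""
    name: str
    pass_rates: dict[str, float]  # model name -> pass rate in [0, 1]
    still_desired: bool = True    # does the tested behavior still match product goals?


def triage(evals, saturation_threshold=0.98):
    """Sort evals into keep / saturated / obsolete buckets.

    The threshold is an assumed default; pick one that fits your noise level.
    """
    report = {"keep": [], "saturated": [], "obsolete": []}
    for ev in evals:
        if not ev.still_desired:
            # Tests a behavior you no longer want: wrong behavioral pressure.
            report["obsolete"].append(ev.name)
        elif ev.pass_rates and min(ev.pass_rates.values()) >= saturation_threshold:
            # Every tracked model passes: compute spent, no signal gained.
            report["saturated"].append(ev.name)
        else:
            report["keep"].append(ev.name)
    return report


# Made-up example data, for illustration only.
suite = [
    EvalRecord("basic_tool_selection", {"model_a": 1.0, "model_b": 0.99}),
    EvalRecord("multi_step_planning", {"model_a": 0.62, "model_b": 0.81}),
    EvalRecord("always_ask_confirmation", {"model_a": 0.9, "model_b": 0.4},
               still_desired=False),
]
report = triage(suite)
print(report)
```

Saturated evals are candidates for removal or replacement with a harder variant, not automatic deletions; a human review still decides whether one is worth keeping as a cheap regression check.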

This connects to “Build for the model six months from now, not the model of today”: if today’s scaffolding becomes tech debt against the next model, today’s eval suite does too. An eval designed around GPT-4-level tool selection may be trivially passed by Claude Sonnet 4.6, contributing zero signal while consuming budget. The decision is to treat eval curation as an ongoing discipline, not a one-time setup, with regular reviews that remove, replace, and rebalance the suite as the models and the product evolve.