
Holdout eval sets are the generalization gate for autonomous harness optimization — without them, the loop overfits

Autonomous harness hill-climbing tends to overfit to the optimization set; splitting evals into optimization and holdout categories — mirroring ML train/test splits — is the structural defense

@Vtrivedy10 (Viv) — Better Harness: A Recipe for Harness Hill-Climbing with Evals · Apr 9, 2026 · 5 connections

Autonomous hill-climbing tends to overfit to its tasks: the loop just wants to “make number go up” and has no notion of generalization. A holdout set serves as the proxy for true generalization, checking that learned optimizations also work on data the loop never optimized against. This is the practical defense against the problem identified in Self-improving agents overfit to eval metrics, where meta-agents game rubrics unless structurally constrained.
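A minimal sketch of the split and the holdout gate, under stated assumptions: the names `split_evals`, `mean_score`, `accept_candidate`, and the `run_task` callback are hypothetical and not from the original post, and the hash-based split is just one way to get a stable partition (splitting by eval category, as the summary describes, would work equally well).

```python
import hashlib
from typing import Callable, List, Tuple

# Hypothetical callback: run_task(harness, task_id) returns a score in [0, 1].
RunTask = Callable[[str, str], float]


def split_evals(task_ids: List[str], holdout_fraction: float = 0.3) -> Tuple[List[str], List[str]]:
    """Deterministically assign each task to the optimization or holdout set."""
    optimization, holdout = [], []
    for task_id in task_ids:
        # Hash the ID so the assignment is stable across runs and list orderings.
        digest = hashlib.sha256(task_id.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2**64
        (holdout if bucket < holdout_fraction else optimization).append(task_id)
    return optimization, holdout


def mean_score(harness: str, task_ids: List[str], run_task: RunTask) -> float:
    """Average score of one harness over a non-empty list of tasks."""
    return sum(run_task(harness, t) for t in task_ids) / len(task_ids)


def accept_candidate(current: str, candidate: str, optimization: List[str],
                     holdout: List[str], run_task: RunTask,
                     tolerance: float = 0.0) -> bool:
    """Accept only if the optimization score improves AND holdout does not regress."""
    opt_gain = (mean_score(candidate, optimization, run_task)
                - mean_score(current, optimization, run_task))
    holdout_delta = (mean_score(candidate, holdout, run_task)
                     - mean_score(current, holdout, run_task))
    return opt_gain > 0 and holdout_delta >= -tolerance
```

The loop only ever sees optimization-set failures when proposing changes; the holdout score is used solely as a yes/no gate. Consulting the holdout on every iteration still leaks some signal into selection, so checking it less often is a reasonable variation.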

The approach pairs holdout evaluation with human review as a second signal. Human reviewers catch overfit instructions that technically don’t hurt holdout scores but waste tokens, a subtler failure mode that metrics alone miss. This dual gate (automated holdout plus human review) connects to Verification is a Red Queen race: optimizing against a fixed eval contaminates it, because the eval degrades as the optimization loop adapts to it. The holdout set is therefore the canary for whether improvements are real or illusory.
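The dual gate can be sketched as a small decision function. This is an illustration only: `Candidate`, `Verdict`, and `dual_gate` are hypothetical names, and the source does not specify how the human verdict is recorded.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    ACCEPT = "accept"
    REJECT = "reject"
    NEEDS_REVIEW = "needs_review"


@dataclass
class Candidate:
    # Candidate holdout score minus current holdout score.
    holdout_delta: float
    # None until a human has reviewed the change; False if flagged (e.g. wasted tokens).
    reviewer_approved: Optional[bool] = None


def dual_gate(candidate: Candidate, tolerance: float = 0.0) -> Verdict:
    """Automated holdout check first, then human review as the second signal."""
    if candidate.holdout_delta < -tolerance:
        return Verdict.REJECT          # fails the automated gate outright
    if candidate.reviewer_approved is None:
        return Verdict.NEEDS_REVIEW    # passes metrics, still awaiting a human
    return Verdict.ACCEPT if candidate.reviewer_approved else Verdict.REJECT
```

The case the text highlights is the last branch: a change with a holdout delta of zero passes the automated gate, yet a reviewer can still reject it for adding instructions that do nothing useful.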

Connected Insights

References (3)

→ Evals are the gradient signal for harness engineering — the same data quality rigor from ML training applies
→ Self-improving agents overfit to eval metrics — the meta-agent games rubrics unless structurally constrained
→ Verification is a Red Queen race — optimizing against a fixed eval contaminates it

Referenced by (2)

← Evals are behavioral pressure vectors, not neutral measurements — poorly chosen evals distort agent development
← Long-horizon evals test compounding behavior, not point-in-time accuracy