Evals are the gradient signal for harness engineering — the same data quality rigor from ML training applies
The analogy between ML training and agent development is structural: evals encode desired behavior the way training data encodes ground truth, so the same principles (data quality, curation, train/test splits) determine outcomes.
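To make the train/test analogy concrete, here is a minimal sketch of how an eval suite might be curated and split before it is used to tune a harness. It is an illustration under stated assumptions, not the method from the source: `EvalCase`, `dedupe`, `split`, and the 20% holdout fraction are all hypothetical names and values chosen for this example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    # Hypothetical schema; a real eval suite will carry more fields.
    case_id: str   # stable identifier for the case
    prompt: str    # input given to the agent
    expected: str  # description of the desired behavior


def dedupe(cases):
    """Curation step: drop cases whose prompt repeats an earlier one
    (compared case-insensitively, with whitespace collapsed)."""
    seen = set()
    unique = []
    for case in cases:
        key = " ".join(case.prompt.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(case)
    return unique


def split(cases, holdout_fraction=0.2):
    """Assign each case to an 'optimize' or 'holdout' set by hashing its id.

    Hashing keeps the assignment stable across runs, so a case never
    drifts from the holdout set into the set the harness is tuned on.
    """
    optimize, holdout = [], []
    for case in cases:
        digest = hashlib.sha256(case.case_id.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
        (holdout if bucket < holdout_fraction else optimize).append(case)
    return optimize, holdout
```

In this sketch the harness would be iterated against the `optimize` set only, and the `holdout` set would be scored just to check that gains generalize, mirroring a train/test split.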
@Vtrivedy10 (Viv) — Better Harness: A Recipe for Harness Hill-Climbing with Evals · 8 connections
Connected Insights
References (3)
→ Agents learn at three distinct layers — model weights, harness code, and context configuration
→ Evals are behavioral pressure vectors, not neutral measurements — poorly chosen evals distort agent development
→ A mediocre agent inside a strong harness outperforms a stronger agent inside a messy one
Referenced by (5)
← Eval suites must shrink, not just grow — spring cleaning prevents stale behavioral pressure
← Holdout eval sets are the generalization gate for autonomous harness optimization — without them, the loop overfits
← Meta-agents that autonomously optimize task agents beat hand-engineered harnesses on production benchmarks
← The trace→eval→harness flywheel compounds agent quality — every production interaction generates its own training data
← Long-horizon evals test compounding behavior, not point-in-time accuracy