Izzy Miller is building “Metric City”, a long-horizon eval benchmark built around a fake company (Shorelane Commerce) that simulates 90 days of operations: daily data changes, incoming tickets, and an agent that accumulates knowledge along the way. Day 0 accuracy: ~4%. Day 90 with Sonnet 4.6: 24% (target: 100% if the agent demonstrates ideal compounding behavior). He plans to open-source it.
Most data benchmarks test SQL syntax or needle-in-a-haystack retrieval, which Izzy calls “syntactic” evaluation. The more interesting target is “behavioral” evaluation: does the agent exhibit analytical skepticism? His favorite failing eval: “I introduced a fan-out bug making every AE look like they’re at 900%+ quota. Every agent says ‘best quarter ever!’ None catch the bug. But if you then say ‘that doesn’t seem right,’ it takes 10 seconds.” This gap between capability and skepticism is the frontier.
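For readers unfamiliar with the term, a fan-out bug is a join against a finer-grained table that duplicates rows, so any sum taken afterwards is inflated. The sketch below is a minimal illustration of that failure and of one skeptical check (comparing row counts before and after the join). It is not Metric City's actual schema or bug: the table names, the sample rows, and the helpers `build_demo_db`, `attainment`, and `fanout_factor` are all hypothetical.

```python
import sqlite3

def build_demo_db():
    """Create a tiny in-memory database (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE quotas (ae TEXT PRIMARY KEY, quota REAL);
        CREATE TABLE deals (deal_id INTEGER PRIMARY KEY, ae TEXT, amount REAL);
        CREATE TABLE deal_line_items (item_id INTEGER PRIMARY KEY, deal_id INTEGER);
        INSERT INTO quotas VALUES ('alice', 100.0), ('bob', 100.0);
        INSERT INTO deals VALUES (1, 'alice', 60.0), (2, 'alice', 30.0), (3, 'bob', 80.0);
        INSERT INTO deal_line_items (deal_id)
            VALUES (1), (1), (1), (2), (2), (3), (3), (3), (3);
    """)
    return conn

# Buggy: joining line items repeats each deal once per line item,
# so SUM(d.amount) counts the same deal several times.
BUGGY_SQL = """
    SELECT d.ae, SUM(d.amount) / q.quota
    FROM deals d
    JOIN deal_line_items li ON li.deal_id = d.deal_id
    JOIN quotas q ON q.ae = d.ae
    GROUP BY d.ae, q.quota
"""

# Fixed: aggregate at the deal grain, with no join to a finer-grained table.
FIXED_SQL = """
    SELECT d.ae, SUM(d.amount) / q.quota
    FROM deals d
    JOIN quotas q ON q.ae = d.ae
    GROUP BY d.ae, q.quota
"""

def attainment(conn, sql):
    """Run a quota-attainment query and return {ae: attainment_ratio}."""
    return {ae: ratio for ae, ratio in conn.execute(sql)}

def fanout_factor(conn):
    """Rows after the join divided by rows in the base table.

    A value above 1.0 means the join duplicated deals, which is the
    signal a skeptical analyst would check before trusting the sums.
    """
    (base_rows,) = conn.execute("SELECT COUNT(*) FROM deals").fetchone()
    (joined_rows,) = conn.execute(
        "SELECT COUNT(*) FROM deals d "
        "JOIN deal_line_items li ON li.deal_id = d.deal_id"
    ).fetchone()
    return joined_rows / base_rows

if __name__ == "__main__":
    conn = build_demo_db()
    print("buggy:", attainment(conn, BUGGY_SQL))
    print("fixed:", attainment(conn, FIXED_SQL))
    print("fan-out factor:", fanout_factor(conn))
```

The behavioral point is that nothing in the buggy query fails: it runs and returns numbers. The error only surfaces if the agent chooses to question a result that looks too good, for example by checking the join's grain as `fanout_factor` does.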
This extends “Evals are the gradient signal for harness engineering — the same data quality rigor from ML training applies” into a temporal dimension: static evals measure the harness at a single point in time, whereas Metric City measures the learning trajectory. It also provides a concrete implementation of “Holdout eval sets are the generalization gate for autonomous harness optimization — without them, the loop overfits at the platform level”: the simulated 90-day data stream acts as a holdout set that can’t be gamed by prompt engineering. The methodology validates “The trace→eval→harness flywheel compounds agent quality — every production interaction generates its own training data”, because the eval literally measures whether the trace-to-improvement flywheel works.
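One way to make the point-versus-trajectory distinction concrete is to ask how much of the distance between the starting score and the target the agent has covered. The sketch below applies that idea to the figures quoted above (~4% at day 0, 24% at day 90, target 100%). The `gap_closed` function is a hypothetical summary metric introduced for illustration, not Metric City's scoring method, and the day-0 figure is approximate in the source.

```python
def gap_closed(day0_acc, day_n_acc, target=1.0):
    """Fraction of the day-0-to-target gap closed by day n.

    0.0 means no improvement over day 0; 1.0 means the target was reached.
    """
    if target <= day0_acc:
        raise ValueError("target must exceed the day-0 accuracy")
    return (day_n_acc - day0_acc) / (target - day0_acc)

if __name__ == "__main__":
    # Figures from the note: ~4% at day 0, 24% at day 90.
    day0, day90 = 0.04, 0.24
    print(f"point score at day 90: {day90:.0%}")
    print(f"share of gap closed:   {gap_closed(day0, day90):.1%}")
```

A static eval would report only the day-90 number. The trajectory view shows that, on these figures, the agent covered roughly a fifth of the distance to ideal compounding behavior.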