AutoAgent provides the first concrete evidence that agents can autonomously beat manual harness tuning on production benchmarks. The meta-agent experiments on a task agent's harness (tweaking prompts, adding tools, refining orchestration) while spinning up thousands of parallel sandboxes to test improvements. The key architectural insight: being good at a domain and being good at improving at that domain are different capabilities, so the meta/task split lets each specialize.
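The loop described above can be sketched as a simple hill-climbing search over harness configurations, with parallel sandbox evaluation. Everything here is hypothetical: the `Harness` fields, the `propose_edit` move set, and the toy scoring in `run_in_sandbox` are illustrative stand-ins, not AutoAgent's actual implementation.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, replace

# Hypothetical harness config: prompt + tool list + one orchestration knob.
@dataclass(frozen=True)
class Harness:
    prompt: str
    tools: tuple
    max_subagents: int

def run_in_sandbox(harness, task):
    # Stand-in for executing the task agent on one benchmark task in an
    # isolated sandbox; returns a score in [0, 1]. Toy scoring here.
    score = 0.1 * len(harness.tools) + 0.05 * harness.max_subagents
    score += 0.2 if "verify" in harness.prompt else 0.0
    return min(score, 1.0)

def evaluate(harness, tasks, workers=8):
    # Parallel sandboxes: one per benchmark task.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(lambda t: run_in_sandbox(harness, t), tasks))
    return sum(scores) / len(scores)

def propose_edit(harness, rng):
    # Meta-agent move: tweak the prompt, add a tool, or change orchestration.
    moves = [
        lambda h: replace(h, prompt=h.prompt + " Always verify before submitting.")
                  if "verify" not in h.prompt else h,
        lambda h: replace(h, tools=h.tools + ("search",))
                  if "search" not in h.tools else h,
        lambda h: replace(h, max_subagents=h.max_subagents + 1),
    ]
    return rng.choice(moves)(harness)

def meta_loop(harness, tasks, iters=20, seed=0):
    # Outer loop: propose a harness edit, benchmark it, keep improvements.
    rng = random.Random(seed)
    best, best_score = harness, evaluate(harness, tasks)
    for _ in range(iters):
        candidate = propose_edit(best, rng)
        score = evaluate(candidate, tasks)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score
```

In a real system the proposal step would itself be an LLM reasoning over transcripts of the task agent's failures rather than a fixed move list, but the structure (propose, evaluate in parallel sandboxes, keep what wins) is the same.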
This operationalizes "Tool design is continuous observation" and "see like an agent" at scale: instead of a human observing agent behavior and iterating on tool design, a meta-agent does it autonomously. The emergent behaviors are striking. The meta-agent independently discovered spot-checking, forced verification loops, test-writing, progressive disclosure, and sub-agent orchestration, all patterns that "A mediocre agent inside a strong harness outperforms a stronger agent inside a messy one" and "Autonomous coding loops need small stories and fast feedback to work" describe as critical. The implication for companies: no team can hand-tune hundreds of domain-specific harnesses, but a meta-agent can.