Evaluate agent tools with real multi-step tasks, not toy single-call examples
Weak evaluation tasks hide tool design flaws — strong tasks require chained calls, ambiguity resolution, and verifiable outcomes
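A minimal sketch of the contrast, assuming a hypothetical eval harness where each task pairs a prompt with a programmatic outcome check. The `EvalTask` and `verify` names are illustrative, not from the source article:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalTask:
    prompt: str
    # Verifies the final environment state, not the model's transcript:
    # a task only counts as passed if its outcome is checkable in code.
    verify: Callable[[dict], bool]
    expected_min_tool_calls: int = 1

# Weak: a single tool call whose answer trivially echoes the prompt wording.
weak_task = EvalTask(
    prompt="Search for ticket #1234.",
    verify=lambda state: "1234" in state.get("last_result", ""),
)

# Strong: forces chained calls (search -> read log -> update), requires
# resolving ambiguity ("their open ticket" must be identified from context),
# and the check inspects verifiable end state rather than output text.
strong_task = EvalTask(
    prompt=(
        "A customer emailed that their refund never arrived. Find their "
        "open ticket, check the payment log for the failed transfer, and "
        "re-issue the refund with a note referencing the log entry."
    ),
    verify=lambda state: (
        state.get("refund_reissued") is True
        and state.get("ticket_note", "").startswith("payment-log:")
    ),
    expected_min_tool_calls=3,
)
```

The design choice that matters is where `verify` looks: grading final environment state instead of transcript text is what makes an outcome verifiable, and requiring several dependent calls is what surfaces tool flaws (bad error messages, ambiguous parameters) that a single-call task never exercises.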
Anthropic Engineering — Writing Effective Tools for Agents · 9 connections
Connected Insights
References (5)
→ Confluence of tendencies produces extreme outcomes — lollapalooza effects emerge when multiple psychological biases push the same direction
→ Verification is the single highest-leverage practice for agent-assisted coding
→ Similarity is not relevance — relevance requires reasoning
→ Compound engineering makes each unit of work improve all future work
→ Safety enforcement belongs in tool design, not system prompts
Referenced by (4)
← Autonomous coding loops need small stories and fast feedback to work
← Multi-model code review creates adversarial robustness — each model catches what others miss
← Weaponize sycophancy with adversarial agent ensembles instead of fighting it
← Property-based testing explores agent input spaces that example-based tests miss