All insights
AI Product Building AI Agents Coding Tools

Evaluate agent tools with real multi-step tasks, not toy single-call examples

Weak evaluation tasks hide tool design flaws — strong tasks require chained calls, ambiguity resolution, and verifiable outcomes

Anthropic Engineering — Writing Effective Tools for Agents · 9 connections

Anthropic’s tool evaluation methodology distinguishes weak tasks (simple single-call lookups) from strong tasks (e.g., “schedule a meeting with the product team next week” — which requires checking calendars, finding availability, resolving conflicts, and sending invitations). Strong tasks require multiple tool calls, ambiguity resolution, and produce verifiable outcomes. The evaluation tracks not just accuracy but tokens used, tool calls made, errors hit, and reasoning quality via interleaved thinking.
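The metrics above can be collected in a small harness. This is a minimal sketch, not Anthropic's actual evaluation code; the `Step`, `EvalTask`, and `run_eval` names are hypothetical, and the agent's transcript is modeled as a plain list of steps. The key point it illustrates: the verifier checks the final outcome, while the harness separately accumulates tokens, tool calls, errors, and interleaved reasoning.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Step:
    """One step of an agent transcript (hypothetical shape)."""
    is_tool_call: bool = False
    tokens: int = 0
    is_error: bool = False
    thinking: str = ""          # interleaved reasoning, if the model emits it

@dataclass
class EvalTask:
    """A strong task: realistic goal plus a check on the final state,
    not just on whether some tool returned results."""
    prompt: str
    verify: Callable[[Dict], bool]

@dataclass
class EvalResult:
    passed: bool = False
    tool_calls: int = 0
    tokens_used: int = 0
    errors: int = 0
    reasoning: List[str] = field(default_factory=list)

def run_eval(task: EvalTask, steps: List[Step], final_state: Dict) -> EvalResult:
    """Score one task: accumulate efficiency/error metrics from the
    transcript, then verify the outcome against the world state."""
    result = EvalResult()
    for step in steps:
        result.tool_calls += step.is_tool_call
        result.tokens_used += step.tokens
        result.errors += step.is_error
        if step.thinking:
            result.reasoning.append(step.thinking)
    result.passed = task.verify(final_state)
    return result
```

A usage sketch for the meeting-scheduling example: `verify` might check that an invite was actually sent, so a run that made many calls but never sent the invite fails even if every individual tool call "succeeded".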

This extends Verification is the single highest-leverage practice for agent-assisted coding from code verification to tool design verification: the same principle (give agents a way to check their work) applies to the tools themselves. What agents omit in their reasoning matters as much as what they say, which connects to Similarity is not relevance — relevance requires reasoning: surface-level tool success (it returned results) doesn't mean the tool is well-designed (it returned the right results efficiently).

The four-step process (prototype, generate eval tasks, run programmatic evals, collaborate with agents to fix failures) is a tool-specific instance of Compound engineering makes each unit of work improve all future work: each evaluation cycle improves the tool for all future uses. Multi-step evaluations also surface Confluence of tendencies produces extreme outcomes — lollapalooza effects emerge when multiple psychological biases push the same direction in agent behavior: compound failures from chained tool calls only appear under realistic complexity, not in isolated single-step tests.

Beyond functionality, evaluations should test safety boundaries: Safety enforcement belongs in tool design, not system prompts shows that tool-enforced safety (blocking destructive operations at the API level) scales far more reliably than behavioral compliance with system prompts.
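The tool-enforced safety idea can be sketched in a few lines. This is an illustrative pattern, not code from the source: `database_tool`, `DESTRUCTIVE_OPS`, and `SafetyViolation` are hypothetical names. The enforcement lives in the tool function itself, so an evaluation task that tries a destructive operation can assert the tool refuses regardless of what the system prompt says.

```python
# Operations the tool refuses unless explicitly unlocked by the caller
# (the host application, not the model).
DESTRUCTIVE_OPS = {"delete", "drop", "truncate"}

class SafetyViolation(Exception):
    """Raised when a destructive operation is blocked at the tool boundary."""

def database_tool(operation: str, table: str, allow_destructive: bool = False) -> dict:
    """Hypothetical agent tool: safety is enforced at the API level,
    so prompt injection or model drift cannot talk its way past it."""
    if operation.lower() in DESTRUCTIVE_OPS and not allow_destructive:
        raise SafetyViolation(f"{operation!r} on {table!r} blocked at tool level")
    return {"status": "ok", "operation": operation, "table": table}
```

An evaluation can then include a "strong" safety task: instruct the agent to clean up a table and verify that the run ends with a `SafetyViolation` rather than a dropped table, testing the boundary itself instead of the model's compliance.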