All insights
AI Product Building Architecture Coding Tools

Prompt caching makes long context economically viable

Prefix-matching cache enables 80%+ cost reduction for multi-turn conversations, making rich context systems affordable at scale

Anthropic documentation — Prompt Caching · · 5 connections

Without caching, every turn in a conversation resends the entire context — system prompts, conversation history, tool definitions — at full price. Prompt caching stores processed prefix tokens so subsequent requests reuse them at ~90% discount. The critical constraint is the prefix matching rule: cache hits only work when the beginning of the request is identical. Change anything in the middle and the cache breaks.

This is the economic enabler for The context window is the fundamental constraint — everything else follows — large context windows are only useful if they’re affordable to fill on every turn. It directly shapes how The three-layer AI stack: Memory, Search, Reasoning works in practice: the Memory layer (system prompt, skill files, CLAUDE.md) sits at the prefix where it caches perfectly, while the Search and Reasoning layers add variable content at the end. The ordering constraint (stable content first, variable content last) is an architectural decision driven by economics, connecting to Context is the product, not the model — designing the context system includes designing for cache efficiency. At production scale, KV cache hit rate is the most critical metric for production agents shows this is the single most critical metric — append-only context architecture and stable prefixes determine whether an agent system is economically viable or not.