AI Product Building Architecture AI Agents

KV cache hit rate is the most critical metric for production agents

Maintaining stable prompt prefixes and append-only context architecture maximizes cache reuse, dramatically reducing both cost and latency for agentic workflows

@nicbstme — The LLM Context Tax: Best Tips for Tax Avoidance · Mar 5, 2026 · 6 connections

Production agent economics hinge on KV cache hits. When a prompt prefix matches a cached computation, the model reuses the stored key/value states instead of re-processing those tokens, which cuts both cost and latency. The critical architectural implication: context must be append-only. Modifying earlier content invalidates the cache from the point of the edit onward, so the Manus team masks token logits during decoding to constrain the available actions rather than dynamically removing tool definitions from the prompt.
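To make the logit-masking idea concrete, here is a minimal sketch in plain Python. It is not the Manus implementation: the action names, the scores, and the `mask_logits` helper are all hypothetical, and a real decoder would apply the mask to token-level logits inside the inference loop. The point it illustrates is that the full tool list stays in the prompt (so the prefix never changes) while unavailable actions are given zero probability at selection time.

```python
import math

# Hypothetical action vocabulary. Every tool definition stays in the prompt;
# only the decoder's choices are constrained at each step.
ACTIONS = ["browser_open", "browser_click", "shell_exec", "file_write"]

def mask_logits(logits, allowed):
    """Return a copy of logits with disallowed actions set to -inf."""
    if not any(name in allowed for name in ACTIONS):
        raise ValueError("at least one action must remain allowed")
    return [
        logit if name in allowed else -math.inf
        for name, logit in zip(ACTIONS, logits)
    ]

def softmax(logits):
    """Numerically stable softmax; -inf entries end up with probability 0."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores: the unmasked model would prefer shell_exec.
raw_logits = [1.2, 0.3, 2.5, 0.9]

# Assumed agent state: only browser actions are valid right now.
allowed_now = {"browser_open", "browser_click"}

probs = softmax(mask_logits(raw_logits, allowed_now))
choice = ACTIONS[probs.index(max(probs))]
```

With the mask applied, `shell_exec` and `file_write` receive probability 0 and the highest-scoring allowed action is chosen, even though the prompt text that defines the tools was never edited.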

Even small design choices matter: a timestamp with second-level precision changes the prefix on every request and destroys cache benefits, while hour-level precision preserves them. Cache lifetimes are short to begin with (by default about 5 minutes for Anthropic and roughly 5 to 10 minutes for OpenAI). This connects to "Context inefficiency compounds three penalties: cost, latency, and quality degradation": cache misses compound all three at once. Combined with crossing the 200K input token pricing cliff (which doubles per-token costs), cache-unaware architectures can be up to 10x more expensive than cache-optimized ones running identical workloads.

The concrete design pattern for keeping the prefix stable is "Order the system prompt by volatility to keep prompt prefixes cache-friendly": stable identity first, turn-by-turn data last.

Cache economics are one lever inside a larger one, "Inference-time compute makes cost-per-outcome a choice — and that's the application layer's counterattack on the labs": once you can spend 10x compute for a better answer, cost per outcome becomes a deliberate allocation decision, and cache efficiency is part of how an application argues it can spend the customer’s tokens better than the lab would.
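The timestamp and volatility-ordering points can be checked with a small experiment. The sketch below is illustrative only: `build_prompt`, `shared_prefix_len`, and the sample strings are hypothetical, and it compares characters as a stand-in for tokens (real provider caches match on tokens, typically in blocks). It builds two consecutive prompts 24 seconds apart and measures how much of the first survives as an exact prefix of the second.

```python
from datetime import datetime, timezone

def build_prompt(identity, tools, now, history, precision="hour"):
    """Assemble a prompt from the most stable content to the most volatile."""
    if precision == "hour":
        stamp = now.replace(minute=0, second=0, microsecond=0)
    else:  # "second": keep full precision down to the second
        stamp = now.replace(microsecond=0)
    parts = [
        identity,                               # stable identity
        tools,                                  # tool definitions, rarely change
        f"Current time: {stamp.isoformat()}",   # changes hourly or every second
        *history,                               # append-only, grows each turn
    ]
    return "\n".join(parts)

def shared_prefix_len(previous, current):
    """Count leading characters that are identical in both prompts."""
    count = 0
    for a, b in zip(previous, current):
        if a != b:
            break
        count += 1
    return count

IDENTITY = "You are a research agent."
TOOLS = "Tools: search(query), read(url)"
turn_1 = ["user: find the Q3 report"]
turn_2 = turn_1 + ["assistant: searching", "user: summarize it"]

# Two requests in the same hour, 24 seconds apart.
t1 = datetime(2026, 3, 5, 14, 20, 7, tzinfo=timezone.utc)
t2 = datetime(2026, 3, 5, 14, 20, 31, tzinfo=timezone.utc)

def reuse(precision):
    """Fraction of the first prompt that is reusable as a prefix of the second."""
    first = build_prompt(IDENTITY, TOOLS, t1, turn_1, precision)
    second = build_prompt(IDENTITY, TOOLS, t2, turn_2, precision)
    return shared_prefix_len(first, second) / len(first)

hour_reuse = reuse("hour")
second_reuse = reuse("second")
```

With hour-level precision the entire first prompt is a prefix of the second, so `hour_reuse` is 1.0. With second-level precision the match stops inside the timestamp, and the conversation history after it can no longer be served from cache.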

Connected Insights

References (3)

→ Context inefficiency compounds three penalties: cost, latency, and quality degradation
→ Inference-time compute makes cost-per-outcome a choice — and that's the application layer's counterattack on the labs
→ Order the system prompt by volatility to keep prompt prefixes cache-friendly

Referenced by (3)

← Order the system prompt by volatility to keep prompt prefixes cache-friendly
← Prompt caching makes long context economically viable
← Observability is the missing discipline for agent systems — you can't improve what you can't measure