Production agent economics hinge on KV cache hits. When a prompt prefix matches a cached computation, the model skips re-processing those tokens entirely, saving both cost and latency. The critical architectural implication: context must be append-only. Modifying earlier content invalidates the cache from that point forward, so the Manus team masks token logits during decoding to constrain available actions rather than dynamically removing tool definitions from the prompt.
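To make the masking idea concrete, here is a minimal sketch in Python. It is not the Manus implementation: real logit masking operates on individual tokens inside the decoder, whereas this toy version works on whole tool names, and the tool names and scores are made up for illustration. The point it shows is that the tool list in the prompt never changes; only the set of actions the sampler may pick does.

```python
import math

# Hypothetical tool names; the full list stays in the prompt on every turn,
# so the cached prefix is never modified.
TOOLS = ["browser_open", "browser_click", "shell_exec", "file_write"]

def mask_logits(logits, allowed):
    """Set the logit of every disallowed tool to -inf so it cannot be sampled."""
    return {name: (score if name in allowed else -math.inf)
            for name, score in logits.items()}

def softmax(logits):
    """Convert logits to probabilities; a -inf logit maps to probability 0.

    Assumes at least one tool is allowed (at least one finite logit).
    """
    peak = max(logits.values())
    exps = {name: math.exp(score - peak) for name, score in logits.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}

# Made-up scores standing in for the model's raw preferences over tools.
raw_logits = {"browser_open": 2.0, "browser_click": 1.5,
              "shell_exec": 3.0, "file_write": 0.5}

# In this agent state only browser tools are permitted.
allowed_now = {"browser_open", "browser_click"}

probs = softmax(mask_logits(raw_logits, allowed_now))
best = max(probs, key=probs.get)
```

Without the mask, `shell_exec` has the highest raw score and would be the likeliest choice. With the mask it gets probability zero, and the prompt, including its tool definitions, is byte-for-byte unchanged.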
Even small design choices matter. A timestamp with second-level precision changes the prompt prefix on every request, so nothing after it can be reused and the cache benefit is lost (cache durations are 5 minutes for Anthropic, 10 minutes for OpenAI); hour-level precision keeps the prefix stable and preserves it. This connects to "Context inefficiency compounds three penalties: cost, latency, and quality degradation": cache misses compound all three penalties simultaneously. Combined with crossing the 200K input token pricing cliff (which doubles per-token costs), cache-unaware architectures can be up to 10x more expensive than cache-optimized ones running identical workloads. The concrete design pattern for keeping that prefix stable is to "Order the system prompt by volatility to keep prompt prefixes cache-friendly": stable identity first, turn-by-turn data last. Cache economics are one lever inside a larger one: "Inference-time compute makes cost-per-outcome a choice, and that's the application layer's counterattack on the labs". Once you can spend 10x compute for a better answer, cost per outcome becomes a deliberate allocation decision, and cache efficiency is part of how an application argues it can spend the customer’s tokens better than the lab would.
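The timestamp and ordering points can be sketched together. The following Python is an illustration under assumptions, not any provider's API: `build_prompt`, `hour_stamp`, and the section layout are hypothetical names, and shared leading characters stand in as a rough proxy for the token prefix a provider cache could reuse.

```python
import os
from datetime import datetime, timezone

def hour_stamp(now):
    """Truncate a timestamp to the hour so every request in that hour matches."""
    return now.replace(minute=0, second=0, microsecond=0).isoformat()

def build_prompt(identity, tools, now, history, user_turn):
    """Assemble sections from most stable to most volatile (hypothetical layout)."""
    sections = [
        identity,                            # changes almost never
        "Tools:\n" + "\n".join(tools),       # changes on deploys
        "Current time: " + hour_stamp(now),  # changes once per hour
        "\n".join(history),                  # grows by appending
        user_turn,                           # new on every request
    ]
    return "\n\n".join(sections)

def shared_prefix_len(a, b):
    """Number of leading characters two prompts have in common."""
    return len(os.path.commonprefix([a, b]))

identity = "You are a careful research agent."
tools = ["search(query)", "fetch(url)"]
history = ["user: hi", "assistant: hello"]

# Two requests 35 seconds apart within the same hour.
t1 = datetime(2025, 1, 1, 9, 15, 7, tzinfo=timezone.utc)
t2 = datetime(2025, 1, 1, 9, 15, 42, tzinfo=timezone.utc)

p1 = build_prompt(identity, tools, t1, history, "user: first question")
p2 = build_prompt(identity, tools, t2, history, "user: second question")
shared = shared_prefix_len(p1, p2)
```

Here the two prompts diverge only inside the final user turn, so everything before it is a candidate for reuse. If `hour_stamp` returned `now.isoformat()` instead, the prompts would diverge at the seconds field and the history after it would have to be reprocessed on every request.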