
Context inefficiency compounds three penalties: cost, latency, and quality degradation

Every wasted token in an LLM context window doesn't just cost money — it slows responses and degrades output quality, creating a triple tax on production agents

@nicbstme — The LLM Context Tax: Best Tips for Tax Avoidance

The “context tax” isn’t just financial. Every unnecessary token sent to an LLM creates three compounding penalties: higher API costs (output tokens cost roughly 5x input), slower response times from processing irrelevant information, and degraded output quality as the model’s attention dilutes across noise. This reframes “The context window is the fundamental constraint — everything else follows” from a capacity problem to an economics problem: even when content fits in the window, bloated context still degrades performance.
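As a rough illustration of the cost component, a minimal Python sketch. The per-million-token prices here are hypothetical placeholders chosen only to reflect the ~5x output/input ratio, not any real provider's rates:

```python
def context_cost(input_tokens, output_tokens,
                 input_price_per_m=1.0, output_price_per_m=5.0):
    """Estimate per-call cost in dollars.

    Prices are hypothetical placeholders reflecting the ~5x
    output/input ratio mentioned above, not real API rates.
    """
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# A bloated 50k-token prompt vs. a trimmed 10k-token prompt, with the
# same 1k-token answer: the 40k wasted input tokens add cost on every
# call, before counting the latency and quality penalties.
bloated = context_cost(50_000, 1_000)
trimmed = context_cost(10_000, 1_000)
print(f"bloated: ${bloated:.4f}, trimmed: ${trimmed:.4f}")
```

The same arithmetic scales linearly with call volume, which is why the tax compounds for agents that make hundreds of calls per task.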

Practical mitigations map to architectural decisions: preprocessing data before tokenization (converting HTML to Markdown yields 90%+ token reduction), delegating to subagents for 67% fewer tokens through context isolation, and storing tool outputs as files rather than embedding them in context (“Files are the universal interface between humans and agents”; Cursor’s approach reduced tokens by 46.9%). At the skill level, “A skill's folder structure is its context architecture — the file system is a form of context engineering” applies the same principle: splitting a monolithic skill file into a workflow outline plus reference subfolders means the agent only loads detailed content when it reaches the relevant step, avoiding the triple tax on every invocation.
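To make the preprocessing idea concrete, here is a minimal sketch using only Python's standard-library `html.parser`. A real pipeline would use a proper HTML-to-Markdown converter; this only illustrates where the token savings come from, since markup, scripts, and styles carry no information the model needs:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Keep visible text, drop tags plus <script>/<style> bodies.

    Minimal sketch of preprocessing before tokenization; a production
    pipeline would convert to Markdown to preserve structure too.
    """
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())

raw = ('<div class="post"><script>track()</script>'
       '<h1>Title</h1><p>Body <b>text</b> here.</p></div>')
parser = TextExtractor()
parser.feed(raw)
text = "\n".join(parser.parts)
print(len(raw), len(text))  # the stripped text is far smaller
```

Character counts only approximate token counts, but the direction holds: everything removed here would otherwise be tokenized and billed on every call.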