Architecture
AI Product Building · 55 insights in this topic
Context is the product, not the model
Anyone can call the API — differentiation comes from the data you access, skills you build, UX you design, and domain knowledge you encode
Decision traces are the missing data layer — a trillion-dollar gap
Systems store what happened but not why; capturing the reasoning behind decisions creates searchable precedent and a new system of record
A mediocre agent inside a strong harness outperforms a stronger agent inside a messy one
The surrounding machinery — metrics, rollback, scoping, observability — determines autonomous system performance more than model capability
Files are the universal interface between humans and agents
Markdown and YAML files on disk beat databases because agents already know file operations and humans can inspect everything
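A minimal sketch of the pattern, assuming a skill stored as a markdown file with a simple YAML-style frontmatter block (the skill name and fields are invented for illustration): the agent needs nothing beyond plain file reads, and a human can open the same file in any editor.

```python
import tempfile
from pathlib import Path

# Hypothetical skill file: YAML-style frontmatter plus a markdown body.
SKILL = """---
name: summarize-ticket
trigger: when the user pastes a support ticket
---
Read the ticket, extract product area and severity, then draft a summary.
"""

def parse_skill(text: str) -> dict:
    """Split the frontmatter metadata from the markdown body."""
    _, front, body = text.split("---", 2)
    meta = {}
    for line in front.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return {"meta": meta, "body": body.strip()}

path = Path(tempfile.mkdtemp()) / "summarize-ticket.md"
path.write_text(SKILL)                  # humans can inspect this file directly
skill = parse_skill(path.read_text())   # agents read it with plain file ops
print(skill["meta"]["name"])
```

No database client, no schema migration: the file on disk is both the storage layer and the inspection UI.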
The three-layer AI stack: Memory, Search, Reasoning
The emerging AI product architecture has three layers — Memory (who is this user), Search (find the right information), Reasoning (navigate complex information) — all running on PostgreSQL
Agents that store error patterns learn continuously without fine-tuning or retraining
Dash's 'GPU-poor continuous learning' separates validated knowledge from error-driven learnings — five lines of code replaces expensive retraining
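A hedged sketch of the shape of this idea, not Dash's actual implementation (the store names and wording are invented): validated knowledge and error-derived learnings live in separate stores, failures append a correction, and both layers are prepended to future prompts with no weight updates.

```python
# Two separate layers: curated truth vs. learnings harvested from failures.
validated_knowledge = ["Invoices are finalized on the 1st of each month."]
error_learnings = []  # populated from observed failures, never hand-curated

def record_failure(task: str, error: str, correction: str) -> None:
    """Turn an observed failure into a reusable learning."""
    error_learnings.append(f"When doing '{task}', avoid: {error}. Instead: {correction}")

def build_context() -> str:
    # Both layers are injected into the prompt; no retraining involved.
    return "\n".join(validated_knowledge + error_learnings)

record_failure(
    task="monthly billing report",
    error="summing draft invoices",
    correction="filter status == 'finalized' before aggregating",
)
print(build_context())
```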
Structure plus reasoning beats flat similarity for complex domains
Across documents, code, and skills, the same pattern holds: structured knowledge navigated by reasoning outperforms flat indexes searched by similarity
In agent-native architecture, features are prompts — not code
The shift from coding specific functions to describing outcomes that agents achieve by composing atomic tools
Production agents route routine cases through decision trees, reserving humans for complexity
Handle exact matches and known patterns without AI; invoke the model for ambiguity, and route genuinely complex cases to human judgment
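The routing tiers can be sketched as a small function (the rule table and the confidence threshold are invented for illustration): deterministic rules first, the model only for ambiguity, humans only for genuine complexity.

```python
# Known patterns handled without any model call.
KNOWN_PATTERNS = {
    "reset password": "send_reset_link",
    "cancel subscription": "start_cancellation_flow",
}

def route(request: str, model_confidence: float) -> str:
    text = request.lower().strip()
    if text in KNOWN_PATTERNS:            # exact match: no AI involved
        return f"auto:{KNOWN_PATTERNS[text]}"
    if model_confidence >= 0.8:           # ambiguous but the model is confident
        return "model:handle"
    return "human:escalate"               # genuinely complex: human judgment

print(route("reset password", 0.0))       # auto:send_reset_link
print(route("my invoice looks odd", 0.9))
print(route("legal threat about data residency", 0.3))
```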
Markdown skill files may replace expensive fine-tuning
A SKILL.md file that teaches an agent how to do something specific can match domain-specific fine-tuned models — at zero training cost
Observability is the missing discipline for agent systems — you can't improve what you can't measure
Agent systems need telemetry (token usage, latency, error rates, cost per task) as a first-class engineering concern, not an afterthought bolted on after production failures
Verification is a Red Queen race — optimizing against a fixed eval contaminates it
Eval suites degrade the moment you use them to improve an agent — the agent adapts to the distribution, and the eval stops measuring what it was designed to measure
Agentic search beats RAG for live codebases
Claude Code abandoned RAG and its vector DB in favor of letting the agent grep/glob/read — reasoning about where to look outperforms pre-indexed similarity search for code
Similarity is not relevance — relevance requires reasoning
Vector search finds semantically similar content, but what users need is relevant content, and determining relevance requires LLM reasoning, not just pattern matching
Revealed preferences trump stated preferences — track what users do, not what they say
Users' actual behavior (what they click, skip, edit, redo) is the ground truth for product decisions; stated preferences in surveys and interviews systematically mislead
Boring tech wins for AI-native startups — simpler stack means faster AI-assisted shipping
React + Node + TypeScript + Postgres + Redis scales to $1M ARR with 3 engineers; monorepo is a superpower for AI coding assistants
Traces not scores enable agent improvement — without trajectories, improvement rate drops hard
When AutoAgent's meta-agent received only pass/fail scores without reasoning traces, the improvement rate dropped significantly; understanding why matters as much as knowing that
The UI moat collapses — API quality becomes the purchasing criterion
When agents are the primary users of software, beautiful dashboards stop mattering and API design becomes the competitive surface
Agent edits are automatic decision instrumentation — every human correction is a structured signal
When agents propose and humans edit, the delta between proposal and correction captures tacit judgment as first-class data without requiring manual logging
Auto-generated narrow monitors beat handwritten broad checks — a tight mesh over the exact shape of the code
1,000+ AI-generated monitors that each target specific code paths catch more bugs than 10 hand-written checks that cover general categories
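The "tight mesh" idea, sketched with trivial stand-ins (the spec table and order fields are invented): one narrow monitor per specific code path, each checking one exact property, rather than a few broad category checks.

```python
# Auto-generated narrow monitors: each targets one specific invariant.
SPECS = {
    "discount_total": lambda out: 0 <= out["discount"] <= out["subtotal"],
    "tax_region_us": lambda out: out["region"] != "US" or out["tax"] > 0,
    "free_shipping_floor": lambda out: out["subtotal"] < 50 or out["shipping"] == 0,
}

def run_monitors(output: dict) -> list[str]:
    """Return the name of every narrow monitor the output violates."""
    return [name for name, check in SPECS.items() if not check(output)]

order = {"subtotal": 80.0, "discount": 5.0, "tax": 0.0, "region": "US", "shipping": 4.99}
print(run_monitors(order))  # ['tax_region_us', 'free_shipping_floor']
```

Because each monitor names the exact invariant it watches, a failure points straight at the broken code path instead of a vague category.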
Context inefficiency compounds three penalties: cost, latency, and quality degradation
Every wasted token in an LLM context window doesn't just cost money — it slows responses and degrades output quality, creating a triple tax on production agents
Context layers must be living systems, not static artifacts
Unlike semantic layers that rot when maintainers leave, context layers need self-updating feedback loops where agent errors refine the context corpus
Hybrid search is the default, not the exception
Neither keyword nor semantic search alone is complete — combining BM25 and vector search with reranking is the baseline for production systems
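A hedged sketch of the fusion step, using Reciprocal Rank Fusion (RRF) as a simple stand-in for a trained reranker; the documents and rankings are toy values.

```python
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists; k=60 is the conventional damping constant."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_a", "doc_c", "doc_b"]     # keyword (BM25) results
vector_ranking = ["doc_b", "doc_a", "doc_d"]   # semantic (embedding) results
print(rrf([bm25_ranking, vector_ranking]))
```

Documents that rank well in both lists rise to the top; documents that only one retriever found still survive into the candidate set for a downstream reranker.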
Intelligence location — code vs prompts — determines system fragility and flexibility
Critical architectural fork: prompt-driven systems (Pal's 400-line routing prompt) are flexible but break when models change; code-driven systems (our validate-graph.js) are rigid but reliable — the best systems need both
Knowledge evolution is the biggest unsolved problem across all graph architectures
Almost nobody has solved how knowledge graphs grow without rotting — most are append-only, auto-decay is too aggressive, and even the best systems only add links without pruning, merging, or detecting contradictions
Knowledge systems need dual-layer storage — narrative depth and structured queries can't share a format
Every system beyond 'markdown files in a folder' discovers that narrative depth (rich prose, context, reasoning) and structured querying (filter, aggregate, cross-reference) need different storage layers with a routing mechanism between them
Metadata consumed by LLMs needs trigger specifications, not human summaries
When an LLM scans metadata to decide what to invoke, the description should specify when to activate — not summarize what the thing does — because LLMs are a fundamentally different consumer than humans
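The contrast can be illustrated with two descriptions of the same invented tool, plus a crude lexical stand-in for the LLM's activation decision (the tool name, wording, and matching heuristic are all assumptions):

```python
# Same tool, two descriptions: a human summary vs. a trigger specification.
human_summary = "export_report: generates PDF exports of analytics dashboards"
trigger_spec = (
    "export_report: use when the user asks to share, download, or email "
    "dashboard results, or mentions a deadline for a report"
)

def matches_trigger(spec: str, user_msg: str) -> bool:
    """Crude stand-in for the LLM's decision: does the message hit any
    activation cue named in the description?"""
    stop = {"use", "when", "the", "or", "a", "to"}
    cues = {w.strip(",.").lower() for w in spec.split()} - stop
    return any(cue in user_msg.lower() for cue in cues if len(cue) > 3)

msg = "Can you email me the Q3 numbers?"
print(matches_trigger(trigger_spec, msg))   # True: 'email' is a named cue
print(matches_trigger(human_summary, msg))  # False: the summary names no cues
```

The summary describes the tool accurately yet gives the scanning model nothing to match against; the trigger spec activates on the situations that actually warrant the tool.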
AI is the computer — orchestration across 19 models is the product, not any single model
Perplexity launched a unified agent system orchestrating 19 backend models that delegate tasks, manage files, execute code, and browse the web. The differentiation isn't the models — it's the orchestration. 'The computer is the orchestration system.'
Prompt caching makes long context economically viable
Prefix-matching cache enables 80%+ cost reduction for multi-turn conversations, making rich context systems affordable at scale
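A back-of-envelope sketch of the economics, ignoring the first-turn cache write; the prices are illustrative (cache hits billed at 10% of the base input rate, a common discount shape across providers, not any specific vendor's pricing).

```python
base_price = 3.00 / 1_000_000     # $ per uncached input token (assumed)
cached_price = base_price * 0.10  # $ per cache-hit token (assumed)

def turn_cost(prefix_tokens: int, new_tokens: int, cached: bool) -> float:
    """Cost of one turn: the shared prefix plus the newly appended tokens."""
    prefix_rate = cached_price if cached else base_price
    return prefix_tokens * prefix_rate + new_tokens * base_price

# A 50k-token system/context prefix reused across a 20-turn conversation:
without = sum(turn_cost(50_000, 500, cached=False) for _ in range(20))
with_cache = sum(turn_cost(50_000, 500, cached=True) for _ in range(20))
print(f"savings: {1 - with_cache / without:.0%}")
```

Under these assumed rates the rich-context conversation costs roughly a tenth of the uncached version, which is what moves long context from a demo luxury to a viable default.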
Agents eat your system of record — the rigid app was the constraint, not the schema
When agents can clone your entire CRM in seconds and become the real interface, the SaaS product becomes a dumb write endpoint. Data moats evaporate because agents eliminate the rigid app that demanded rigid schemas.
Scaffolding is tech debt against the next model — the bitter lesson applied to product building
Code built to extend model capability 10-20% becomes worthless when the next model ships, making most product scaffolding an ephemeral trade-off rather than a lasting investment
Trust boundaries must be externalized — not held in engineers' heads
Where an agent's behavior is well-understood vs. unknown should be mapped, made auditable, and connected to deployment gates — not left as implicit tribal knowledge
WebMCP turns websites into agent-native interfaces
Chrome's MCP integration lets websites expose structured tools to agents instead of agents scraping and guessing at UI elements
Context layers supersede semantic layers for agent autonomy
Traditional semantic layers handle metric definitions but agents need a superset: canonical entities, identity resolution, tribal knowledge instructions, and governance guidance
Data agent failures stem from missing business context, not SQL generation gaps
The industry initially blamed text-to-SQL capability for data agent failures, but the real blockers are undefined business definitions, ambiguous sources of truth, and missing tribal knowledge
Detect everything, notify selectively — the observability-to-notification ratio determines system trust
Watch every signal but ensure alerts reaching humans always mean something; teams ignore noisy monitors and noisy agents equally fast
Embeddings measure similarity, not truth — vector databases have a temporal blind spot
Vector search can't resolve contradictions or understand time; 'I love my job' and 'I'm quitting' retrieve with equal confidence
Navigation beats search for knowledge retrieval — let each data source keep its native query interface
Vector similarity search flattens everything into one embedding space, losing native query affordances; better to let SQL be SQL, files be files, and build a routing layer that picks the right source per question type
Permissioned inference is harder than permissioned retrieval — enterprise context graphs need reasoning-level access control
Controlling who sees data is solved; controlling whose history shapes reasoning for others is the unsolved trust layer enterprise context graphs require
PostgreSQL scales further than you think
OpenAI runs ChatGPT on one PostgreSQL primary plus ~50 read replicas handling millions of QPS — no sharding of PostgreSQL itself, just excellent operations
Response UX should match retrieval intelligence
If your system uses semantic search to find results, the display should reflect that intelligence — keyword highlighting on semantic results creates a confusing mismatch
Safety enforcement belongs in tool design, not system prompts
At scale, embedding safety constraints in the tool's API (blocking destructive operations by default) beats relying on behavioral compliance with system prompt instructions
Shadow execution enables safe trace learning — replay write operations without touching production data
By replaying actions that would write to external apps in a shadow path, agents can learn from realistic end-to-end flows without impacting customer data
Tiered retrieval prevents context overload — summaries first, details on demand
Reading category summaries first, then drilling to items, then raw resources only if needed keeps memory retrieval within token budgets
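A sketch of the tiers under a token budget; the store layout, the four-characters-per-token estimate, and the drill-down rule are all assumptions for illustration.

```python
# Tier 1: category summaries. Tier 2: items inside the matching category.
STORE = {
    "billing": {
        "summary": "Invoicing rules, refund policy, tax handling.",
        "items": ["Refunds allowed within 30 days.", "EU orders add VAT."],
    },
    "onboarding": {
        "summary": "Signup flow, welcome emails, trial limits.",
        "items": ["Trials last 14 days.", "Welcome email sends at signup."],
    },
}

def retrieve(query: str, token_budget: int) -> list[str]:
    tokens = lambda s: len(s) // 4          # rough token estimate
    context, spent = [], 0
    for name, cat in STORE.items():         # tier 1: always read summaries
        context.append(cat["summary"])
        spent += tokens(cat["summary"])
        if query in name:                   # tier 2: drill into one category
            for item in cat["items"]:
                if spent + tokens(item) > token_budget:
                    return context          # budget exhausted: stop drilling
                context.append(item)
                spent += tokens(item)
    return context

print(retrieve("billing", token_budget=40))
```

Raw resources (tier 3) would only be fetched if the drilled-down items still leave the question unanswered and budget remains, which keeps the common case cheap.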
Time-bounded evaluation forces optimization for real-world usefulness instead of idealized performance
A fixed wall-clock budget per experiment makes results comparable, normalizes across hardware, and forces agents to optimize for improvement per unit time
Virtual filesystems replace sandboxes for agent navigation — intercept commands instead of provisioning infrastructure
Mintlify's ChromaFs intercepts Unix commands and translates them into database queries, cutting boot time from 46 seconds to 100ms and cost from $70k/year to near-zero
Evaluations must augment trace data in place — divergent copies drift by design
The moment you export traces to a separate eval system, the copy diverges from where annotations run; evals, annotations, and traces should share a single source of truth
Inference capability lowers input fidelity requirements — smart listeners make imprecise input work
When the consumer of input has strong inference ability, the quality bar for that input drops — voice works not because transcription improved, but because the listener got smarter
KV cache hit rate is the most critical metric for production agents
Maintaining stable prompt prefixes and append-only context architecture maximizes cache reuse, dramatically reducing both cost and latency for agentic workflows
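A toy model of why this works (the hit-rate arithmetic is illustrative, treating each message as one cache unit): the KV cache covers the longest shared prefix with the previous request, so appending keeps the hit rate high while editing an early turn destroys it.

```python
def common_prefix_tokens(prev: list[str], curr: list[str]) -> int:
    """Length of the shared prefix, i.e. what the KV cache can reuse."""
    n = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        n += 1
    return n

def hit_rate(prev: list[str], curr: list[str]) -> float:
    return common_prefix_tokens(prev, curr) / len(curr)

history = ["sys", "user1", "asst1", "user2", "asst2"]
appended = history + ["user3"]                                 # append-only
edited = ["sys", "user1-reworded"] + history[2:] + ["user3"]   # early edit

print(f"append-only hit rate: {hit_rate(history, appended):.0%}")
print(f"edited-turn hit rate: {hit_rate(history, edited):.0%}")
```

One reworded early message invalidates everything after it, which is why mutating system prompts or rewriting history mid-conversation quietly multiplies both cost and latency.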
Lakebases decouple compute from storage — databases become elastic infrastructure
Third-generation databases separate compute and storage entirely, putting data in open formats on cloud object stores; the database becomes a serverless layer that scales to zero
Latent demand is the strongest product signal — make the thing people already do easier
People will only do things they already do; you can't get them to do a new thing, but you can make their existing behavior frictionless
Reasoning evaporation permanently destroys agent decision chains when the context window closes
An agent's multi-step reasoning exists only in the context window; when the session ends, the output survives but the decision chain — why each step was taken — is gone forever
Stronger models expand the verification gap, not close it
More capable models increase the deployment surface and raise the stakes of failures, making verification infrastructure more valuable rather than less
Teacher-student trace distillation with consensus validation beats single-oracle learning
A single high-reasoning teacher trace isn't reliable enough for enterprise learning; comparing multiple student traces under production constraints with consensus validation produces trustworthy strategies
AI trace data has an indefinite useful lifespan — SaaS observability's 30-day retention model destroys institutional knowledge
Infrastructure metrics expire quickly but AI conversations and reasoning traces gain value over time; 30-day retention windows erase the very data that reveals failure patterns and training signals