Architecture
AI Product Building
90 insights in this topic
Context is the product, not the model
Anyone can call the API — differentiation comes from the data you access, skills you build, UX you design, and domain knowledge you encode
Decision traces are the missing data layer — a trillion-dollar gap
Systems store what happened but not why; capturing the reasoning behind decisions creates searchable precedent and a new system of record
A mediocre agent inside a strong harness outperforms a stronger agent inside a messy one
The surrounding machinery — metrics, rollback, scoping, observability — determines autonomous system performance more than model capability
Files are the universal interface between humans and agents
Markdown and YAML files on disk beat databases because agents already know file operations and humans can inspect everything
Production agents route routine cases through decision trees, reserving humans for complexity
Handle exact matches and known patterns without AI; invoke the model for ambiguity, and route genuinely complex cases to human judgment
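A minimal sketch of that three-way routing, assuming a hypothetical lookup table (`KNOWN_ANSWERS`), a hypothetical `model` callable that returns an answer plus a confidence score, and an arbitrary 0.8 threshold. None of these names come from the source; they only illustrate the order of the checks.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Decision:
    route: str             # "rule", "model", or "human"
    answer: Optional[str]  # None when the case is escalated


# Hypothetical table of exact matches that need no model call.
KNOWN_ANSWERS = {"reset password": "Send the password-reset link."}


def route_case(
    text: str,
    model: Callable[[str], Tuple[str, float]],
    min_confidence: float = 0.8,  # assumed threshold, tune per domain
) -> Decision:
    # 1. Exact matches and known patterns: answer without AI.
    key = text.strip().lower()
    if key in KNOWN_ANSWERS:
        return Decision("rule", KNOWN_ANSWERS[key])

    # 2. Ambiguous input: ask the model, keep the answer only if confident.
    answer, confidence = model(text)
    if confidence >= min_confidence:
        return Decision("model", answer)

    # 3. Everything else goes to a person.
    return Decision("human", None)
```

The ordering is the design choice: the cheapest, most predictable path runs first, and the human queue only sees what both earlier tiers declined.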
Scaffolding is tech debt against the next model — the bitter lesson applied to product building
Code built to extend model capability 10-20% becomes worthless when the next model ships, making most product scaffolding an ephemeral trade-off rather than a lasting investment
Agents that store error patterns learn continuously without fine-tuning or retraining
Dash's 'GPU-poor continuous learning' separates validated knowledge from error-driven learnings; five lines of code replace expensive retraining
The three-layer AI stack: Memory, Search, Reasoning
The emerging AI product architecture has three layers — Memory (who is this user), Search (find the right information), Reasoning (navigate complex information) — all running on PostgreSQL
Verification is a Red Queen race — optimizing against a fixed eval contaminates it
Eval suites degrade the moment you use them to improve an agent — the agent adapts to the distribution, and the eval stops measuring what it was designed to measure
The context flywheel is a Day 90 moat — Day 0 comparisons are misleading
Point-in-time capability benchmarks miss the compounding advantage: on Day 0 a raw model matches your product, but by Day 90 accumulated context creates an unbridgeable gap
Observability is the missing discipline for agent systems — you can't improve what you can't measure
Agent systems need telemetry (token usage, latency, error rates, cost per task) as a first-class engineering concern, not an afterthought bolted on after production failures
Markdown skill files may replace expensive fine-tuning
A SKILL.md file that teaches an agent how to do something specific can match domain-specific fine-tuned models — at zero training cost
Structure plus reasoning beats flat similarity for complex domains
Across documents, code, and skills, the same pattern holds: structured knowledge navigated by reasoning outperforms flat indexes searched by similarity
Evals are behavioral pressure vectors, not neutral measurements — poorly chosen evals distort agent development
Each eval shapes agent behavior like a selection pressure; accumulating tests without strategic purpose creates 'an illusion of improving your agent' while distorting development in unproductive directions. Correctness alone also misleads, because agents that succeed inefficiently create hidden cost
In agent-native architecture, features are prompts — not code
The shift from coding specific functions to describing outcomes that agents achieve by composing atomic tools
Revealed preferences trump stated preferences — track what users do, not what they say
Users' actual behavior (what they click, skip, edit, redo) is the ground truth for product decisions; stated preferences in surveys and interviews systematically mislead
Agentic search beats RAG for live codebases
Claude Code abandoned RAG and vector DB in favor of letting the agent grep/glob/read — reasoning about where to look outperforms pre-indexed similarity search for code
Similarity is not relevance — relevance requires reasoning
Vector search finds semantically similar content, but what users need is relevant content, and determining relevance requires LLM reasoning, not just pattern matching
Safety enforcement belongs in tool design, not system prompts
At scale, embedding safety constraints in the tool's API (blocking destructive operations by default) beats relying on behavioral compliance with system prompt instructions
Detect everything, notify selectively — the observability-to-notification ratio determines system trust
Watch every signal but ensure alerts reaching humans always mean something; teams ignore noisy monitors AND noisy agents equally fast
Evals are the gradient signal for harness engineering — the same data quality rigor from ML training applies
The analogy between ML training and agent development is structural: evals encode desired behavior like training data encodes ground truth, and the same principles (data quality, curation, train/test splits) determine outcomes
Inference-time compute makes cost-per-outcome a choice — and that's the application layer's counterattack on the labs
No prior software had a dial where 10x more compute buys a better answer; a 10-second and a 10-minute query on the same model are different products at different prices. Margin depends on the system's judgment of where to spend tokens, not on model pricing — the lab wants to expand usage, the application wants to spend only where the outcome is worth it
AI is the computer — orchestration across 19 models is the product, not any single model
Perplexity launched a unified agent system orchestrating 19 backend models that delegate tasks, manage files, execute code, and browse the web. The differentiation isn't the models — it's the orchestration. 'The computer is the orchestration system.'
Open harnesses with customer-owned databases are the antidote to model-provider lock-in
An open, model-agnostic harness that stores memory in a database you control (Postgres, Mongo, Redis) keeps both model choice and memory portable
The trace→eval→harness flywheel compounds agent quality — every production interaction generates its own training data
Production traces where agents fail become eval cases; better evals improve the harness; better harnesses produce better traces — creating a self-reinforcing improvement loop
Traces not scores enable agent improvement — without trajectories, improvement rate drops hard
When AutoAgent's meta-agent received only pass/fail scores without reasoning traces, the improvement rate dropped significantly; understanding why matters as much as knowing that
Boring tech wins for AI-native startups — simpler stack means faster AI-assisted shipping
React + Node + TypeScript + Postgres + Redis scales to $1M ARR with 3 engineers; monorepo is a superpower for AI coding assistants
Sand vs Stone — if models double in capability tomorrow, what washes away and what remains?
Framework for evaluating AI product durability: context flywheels and domain expertise are stone; model workarounds and clever engineering are sand
Sessions are runtime infrastructure, not just resumable transcripts
Hermes stores sessions in SQLite with search and lineage so CLI, messaging platforms, and scheduled jobs all attach to one session plane — routing can resolve before the model even runs
The learning loop becomes the firm's new IP — a hill-climbing machine that compounds unlike any other asset
Every improved workflow generates better training signal, which accelerates the accumulation of tacit knowledge unique to the firm; companies that build this loop early gain an advantage that's hard to replicate regardless of any new model capability
The UI moat collapses — API quality becomes the purchasing criterion
When agents are the primary users of software, beautiful dashboards stop mattering and API design becomes the competitive surface
Agent edits are automatic decision instrumentation — every human correction is a structured signal
When agents propose and humans edit, the delta between proposal and correction captures tacit judgment as first-class data without requiring manual logging
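One way to picture the delta as data is a word-level diff between the agent's proposal and the human's final text, using Python's standard `difflib`. This is a sketch: `EditSignal` and `capture_edit_signals` are hypothetical names, and word-level granularity is an assumption.

```python
import difflib
from dataclasses import dataclass
from typing import List


@dataclass
class EditSignal:
    op: str         # "replace", "delete", or "insert"
    proposed: str   # words the agent wrote
    corrected: str  # words the human left in their place


def capture_edit_signals(proposal: str, final: str) -> List[EditSignal]:
    """Return one record per span the human changed."""
    a, b = proposal.split(), final.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    signals = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":  # unchanged spans carry no correction signal
            signals.append(EditSignal(op, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return signals
```

For example, `capture_edit_signals("Ship the order by Friday", "Ship the order by Monday")` yields a single `replace` record pairing "Friday" with "Monday", with no manual logging by the editor.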
AI is steel for organizations — when software carries the context, human communication stops being the load-bearing wall
Before steel, buildings capped at six or seven floors because iron buckled under its own weight; AI that maintains context across workflows removes human communication (meetings, messages) as the structure that caps how far an org can scale before it degrades
Context inefficiency compounds three penalties: cost, latency, and quality degradation
Every wasted token in an LLM context window doesn't just cost money — it slows responses and degrades output quality, creating a triple tax on production agents
Context layers must be living systems, not static artifacts
Unlike semantic layers that rot when maintainers leave, context layers need self-updating feedback loops where agent errors refine the context corpus
Policy enforcement must run independently of model cooperation — hooks, not prompt instructions
Hermes runs lifecycle hooks that block, rewrite, or audit operations at fixed events, so policy and side-effects never depend on the model choosing to comply
Evolved harnesses transfer across models — a single optimized harness improves five different LLMs
Meta-Harness discovered a retrieval harness that improved math reasoning by 4.7 percentage points average across five held-out models it was never optimized for, suggesting harness quality is model-agnostic
Intelligence location — code vs prompts — determines system fragility and flexibility
Critical architectural fork: prompt-driven systems (Pal's 400-line routing prompt) are flexible but break when models change; code-driven systems (our validate-graph.js) are rigid but reliable — best systems need both
Knowledge evolution is the biggest unsolved problem across all graph architectures
Almost nobody has solved how knowledge graphs grow without rotting — most are append-only, auto-decay is too aggressive, and even the best systems only add links without pruning, merging, or detecting contradictions
KV cache hit rate is the most critical metric for production agents
Maintaining stable prompt prefixes and append-only context architecture maximizes cache reuse, dramatically reducing both cost and latency for agentic workflows
Routing across the whole model market — and absorbing every migration — is a defense the labs can't copy
A vertical company picks the best model per sub-task across all vendors, absorbs eval/migration work on every upgrade, and sells the lowest cost for the exact intelligence each step needs
Permissioned inference is harder than permissioned retrieval — enterprise context graphs need reasoning-level access control
Controlling who sees data is solved; controlling whose history shapes reasoning for others is the unsolved trust layer enterprise context graphs require
Prompt caching makes long context economically viable
Prefix-matching cache enables 80%+ cost reduction for multi-turn conversations, making rich context systems affordable at scale
The gains come from redesigning work around AI, not bolting AI onto human workflows
Like factory owners who first swapped waterwheels for steam engines and changed nothing else (modest gains), today's orgs bolt chatbots onto human-designed workflows — the explosion comes only when the work is redesigned around agents
A loss curve is reassurance, not analysis — pull a hundred failures and read every one
Experiments throw off far more information than you consume — transcripts, failure cases, the strange tail — and most of it dies unread. Most ML bugs live in the data and fail silently; Ng's move is to pull 100 failures, sort them into piles, and attack the biggest pile
Traces are the universal substrate for agent learning — all three layers consume the same execution logs
Whether updating model weights, improving harness code, or refining context/memory, agent learning flows start from the same raw material: traces capturing the full execution path of what an agent did
Agent harnesses are persistent infrastructure, not scaffolding models will absorb
As models improve, old scaffolding disappears but new scaffolding replaces it — harnesses aren't going away, they're evolving
Auto-generated narrow monitors beat handwritten broad checks — a tight mesh over the exact shape of the code
1,000+ AI-generated monitors that each target specific code paths catch more bugs than 10 hand-written checks that cover general categories
Compression should be a forking lifecycle event, not a destructive rewrite
Instead of repeatedly overwriting one transcript, Hermes seeds a child session from each summary and records parent-child lineage — producing an auditable chain of compressions
Context centralization is why coding AI works — git is a solved context repository, knowledge work has no equivalent
Engineering AI leads because git centralizes all context in one versioned repository; knowledge work fails on three axes: distributed, unstructured, unverifiable
Guardrails aren't just safety — they're what the customer is paying for
Per-use-case, per-customer, continuously-audited governance is the product in a regulated vertical; becoming the compliance control plane is a moat a horizontal player can't credibly hold
Hybrid search is the default, not the exception
Neither keyword nor semantic search alone is complete — combining BM25 and vector search with reranking is the baseline for production systems
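As an illustration of the combining step, the sketch below merges two ranked result lists with reciprocal rank fusion (RRF). RRF is one common fusion method chosen here for brevity, not something the source prescribes. The inputs are assumed to be document IDs already ranked by a BM25 index and a vector index, and the reranking stage is left out.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked ID lists; k=60 is a conventional default."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes more for documents it ranks higher.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["a", "b", "c"]    # hypothetical keyword results
vector_hits = ["b", "d", "a"]  # hypothetical semantic results
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Documents that appear in both lists ("a" and "b") rise above those found by only one retriever, which is the behavior a hybrid baseline relies on before a reranker sees the candidates.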
Knowledge systems need dual-layer storage — narrative depth and structured queries can't share a format
Every system beyond 'markdown files in a folder' discovers that narrative depth (rich prose, context, reasoning) and structured querying (filter, aggregate, cross-reference) need different storage layers with a routing mechanism between them
Memory is a harness responsibility, not a pluggable component
Managing context — what enters, what survives compaction, what's queryable — is a core capability of the harness itself, not an add-on service
Model compensations become liabilities as capabilities advance — yesterday's fixes hobble today's agent
Engineering workarounds for earlier model limitations accumulate as technical debt that actively degrades agent performance when models improve
Metadata consumed by LLMs needs trigger specifications, not human summaries
When an LLM scans metadata to decide what to invoke, the description should specify when to activate — not summarize what the thing does — because LLMs are a fundamentally different consumer than humans
The sovereignty test — can you swap out a generalist model without losing your 'company veteran' expertise?
A firm controls its IP only if it can switch the underlying generalist model while keeping the company-veteran expertise built into its learning system; that portability is the test of control and sovereignty in the AI era
Private evals should measure business outcomes that matter — not external benchmarks
A firm's learning loop runs on private evals tied to real business outcomes and private RL environments built from internal traces, so the model improves against what the company cares about rather than public leaderboards
Reasoning evaporation permanently destroys agent decision chains when the context window closes
An agent's multi-step reasoning exists only in the context window; when the session ends, the output survives but the decision chain — why each step was taken — is gone forever
Agents eat your system of record — the rigid app was the constraint, not the schema
When agents can clone your entire CRM in seconds and become the real interface, the SaaS product becomes a dumb write endpoint. Data moats evaporate because agents eliminate the rigid app that demanded rigid schemas.
Separate tool registration from tool exposure — install broadly, reveal narrowly
Hermes registers all tools into a central registry at import time but a separate layer decides what each run actually shows the model, scoped by platform and scenario
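The split can be sketched in a few lines. This is not Hermes's actual code: `REGISTRY`, `register_tool`, and `exposed_tools` are hypothetical names, and scoping by a `platforms` set plus an optional allowlist is an assumed policy.

```python
from typing import Callable, Dict, List, Optional, Set

# Central registry, filled as a side effect of importing tool modules.
REGISTRY: Dict[str, dict] = {}


def register_tool(name: str, platforms: Set[str]) -> Callable:
    def decorator(fn: Callable) -> Callable:
        REGISTRY[name] = {"fn": fn, "platforms": platforms}
        return fn
    return decorator


@register_tool("read_file", platforms={"cli", "chat"})
def read_file(path: str) -> str:
    with open(path, encoding="utf-8") as handle:
        return handle.read()


@register_tool("run_shell", platforms={"cli"})
def run_shell(command: str) -> str:
    raise NotImplementedError("placeholder for illustration")


def exposed_tools(platform: str, allowlist: Optional[Set[str]] = None) -> List[str]:
    """Exposure layer: decide which registered tools this run may see."""
    names = [n for n, spec in REGISTRY.items() if platform in spec["platforms"]]
    if allowlist is not None:
        names = [n for n in names if n in allowlist]
    return sorted(names)
```

Here `exposed_tools("chat")` returns only `read_file`, even though `run_shell` is installed; the registry stays broad while each run's tool list stays narrow.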
Time-bounded evaluation forces optimization for real-world usefulness instead of idealized performance
A fixed wall-clock budget per experiment makes results comparable, normalizes across hardware, and forces agents to optimize for improvement per unit time
Trust boundaries must be externalized — not held in engineers' heads
Where an agent's behavior is well-understood vs. unknown should be mapped, made auditable, and connected to deployment gates — not left as implicit tribal knowledge
Unattended agent jobs must run through the same permission machinery as interactive sessions
Hermes makes cron a first-class subsystem — scheduled jobs are gated by the same permissions, delivered through the same paths, and isolated per profile, instead of living as peripheral scripts
The data flywheel is a UX problem — only vertical workflow surfaces can capture the knowledge
Two stacked flywheels (across-customer pattern recognition + within-customer tacit rules) accrue only through workflow-specific capture surfaces that horizontal tools structurally cannot shape
WebMCP turns websites into agent-native interfaces
Chrome's MCP integration lets websites expose structured tools to agents instead of agents scraping and guessing at UI elements
Causal triage must gate automated fixes — statistical regression detection alone can't distinguish your bugs from external failures
Raw error-rate spikes after deployment can't tell you whether YOUR code broke or a third-party API went down; a triage agent that establishes causal links between code changes and observed errors must gate any automated fixing agent
Context layers supersede semantic layers for agent autonomy
Traditional semantic layers handle metric definitions but agents need a superset: canonical entities, identity resolution, tribal knowledge instructions, and governance guidance
Data agent failures stem from missing business context, not SQL generation gaps
The industry initially blamed text-to-SQL capability for data agent failures, but the real blockers are undefined business definitions, ambiguous sources of truth, and missing tribal knowledge
Embeddings measure similarity, not truth — vector databases have a temporal blind spot
Vector search can't resolve contradictions or understand time; 'I love my job' and 'I'm quitting' retrieve with equal confidence
Navigation beats search for knowledge retrieval — let each data source keep its native query interface
Vector similarity search flattens everything into one embedding space, losing native query affordances; better to let SQL be SQL, files be files, and build a routing layer that picks the right source per question type
Order the system prompt by volatility to keep prompt prefixes cache-friendly
Hermes composes the system prompt in three tiers — stable, context, volatile — so the unchanging prefix stays cacheable while turn-by-turn data lives at the end
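A small sketch of why the ordering matters for prefix caching. The tier names follow the description above; `compose_system_prompt`, `shared_prefix_length`, and the example strings are hypothetical, not taken from Hermes.

```python
from typing import List


def compose_system_prompt(stable: List[str], context: List[str], volatile: List[str]) -> str:
    # Least-changing text first, turn-by-turn data last.
    return "\n\n".join(stable + context + volatile)


def shared_prefix_length(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


stable = ["You are a helpful agent."]
context = ["Project: billing service"]
turn_1 = compose_system_prompt(stable, context, ["Time: 09:00"])
turn_2 = compose_system_prompt(stable, context, ["Time: 09:05"])
reusable = shared_prefix_length(turn_1, turn_2)
```

Because the timestamp sits at the end, the two prompts share everything up to it, so a prefix-matching cache can reuse the stable and context tiers. Putting the timestamp first would reduce the shared prefix to zero.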
PostgreSQL scales further than you think
OpenAI runs ChatGPT on one PostgreSQL primary plus ~50 read replicas handling millions of QPS — no sharding of PostgreSQL itself, just excellent operations
Response UX should match retrieval intelligence
If your system uses semantic search to find results, the display should reflect that intelligence — keyword highlighting on semantic results creates a confusing mismatch
Shadow execution enables safe trace learning — replay write operations without touching production data
By replaying actions that would write to external apps in a shadow path, agents can learn from realistic end-to-end flows without impacting customer data
Tiered retrieval prevents context overload — summaries first, details on demand
Reading category summaries first, then drilling to items, then raw resources only if needed keeps memory retrieval within token budgets
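The three tiers can be sketched as a single function that stops adding material once a token budget is spent. The memory layout, the function names, and the four-characters-per-token estimate are all assumptions made for illustration.

```python
from typing import Dict, List


def estimate_tokens(text: str) -> int:
    # Rough heuristic (about 4 characters per token); an assumption, not a tokenizer.
    return max(1, len(text) // 4)


def tiered_retrieve(memory: Dict[str, dict], category: str, item: str, budget: int) -> List[str]:
    context: List[str] = []
    used = 0

    def add(text: str) -> bool:
        nonlocal used
        cost = estimate_tokens(text)
        if used + cost > budget:
            return False
        context.append(text)
        used += cost
        return True

    # Tier 1: one summary line per category.
    for name, entry in memory.items():
        if not add(f"[{name}] {entry['summary']}"):
            return context

    # Tier 2: item titles inside the chosen category.
    chosen = memory.get(category)
    if chosen is None:
        return context
    for title in chosen["items"]:
        if not add(f"- {title}"):
            return context

    # Tier 3: the raw resource, only if it still fits.
    raw = chosen["items"].get(item)
    if raw is not None:
        add(raw)
    return context


memory = {
    "billing": {
        "summary": "Invoices and refunds",
        "items": {"refund policy": "Refunds are issued within 14 days of purchase."},
    },
    "auth": {"summary": "Login and SSO", "items": {}},
}
```

With a generous budget the call returns summaries, titles, and the raw policy text; with a tight one it returns only the cheaper tiers, which is the point of reading summaries first.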
AI trace data has an indefinite useful lifespan — SaaS observability's 30-day retention model destroys institutional knowledge
Infrastructure metrics expire quickly but AI conversations and reasoning traces gain value over time; 30-day retention windows erase the very data that reveals failure patterns and training signals
Traces replace code as the source of truth for agent systems — debugging shifts from 'show me the code' to 'send me the trace'
In agent systems, execution traces replace source code as the primary debugging and collaboration artifact — you can't predict step 14's context from reading the code
Virtual filesystems replace sandboxes for agent navigation — intercept commands instead of provisioning infrastructure
Mintlify's ChromaFs intercepts Unix commands and translates them into database queries, cutting boot time from 46 seconds to 100ms and cost from $70k/year to near-zero
Where inference runs decides who captures margin, owns the context, and earns trust
Value won't all accrue to the cloud — inference moves to wherever it's cheapest without breaking the product: cloud for frontier reasoning, edge for latency, on-device for privacy. Privacy matters more than in SaaS because the model isn't just storing data, it's reasoning over the user's context, memory, code, and permissions
Agentic UX (AUX) is a distinct design problem — agents don't want to use software the way humans do
AUX (Agentic User Experience) is neither human UX adapted for agents nor raw APIs — it's a third design discipline for how agents want to consume software
Evaluations must augment trace data in place — divergent copies drift by design
The moment you export traces to a separate eval system, the copy diverges from where annotations run; evals, annotations, and traces should share a single source of truth
Inference capability lowers input fidelity requirements — smart listeners make imprecise input work
When the consumer of input has strong inference ability, the quality bar for that input drops — voice works not because transcription improved, but because the listener got smarter
Latent demand is the strongest product signal — make the thing people already do easier
People will only do things they already do; you can't get them to do a new thing, but you can make their existing behavior frictionless
Lakebases decouple compute from storage — databases become elastic infrastructure
Third-generation databases separate compute and storage entirely, putting data in open formats on cloud object stores; the database becomes a serverless layer that scales to zero
LLM-as-judge must be calibrated against human judgment — uncalibrated judges are worse than no judges
An LLM judge without human-labeled calibration data produces false confidence; the bridge is humans labeling traces, then training the judge to replicate those labels
Long-horizon evals test compounding behavior, not point-in-time accuracy
Hex's Metric City benchmark simulates 90 days of agent use with evolving data to measure whether the agent gets smarter over time — Day 0: 4%, Day 90: 24%
Stronger models expand the verification gap, not close it
More capable models increase the deployment surface and raise the stakes of failures, making verification infrastructure more valuable rather than less
Teacher-student trace distillation with consensus validation beats single-oracle learning
A single high-reasoning teacher trace isn't reliable enough for enterprise learning; comparing multiple student traces under production constraints with consensus validation produces trustworthy strategies
Conflicting context causes agent collapse, not graceful degradation
When an LLM encounters contradictory information in its context, it enters extended deliberation loops rather than choosing one interpretation — production finding from Hex