AI Agents
AI Product Building: 67 insights in this topic
The context window is the fundamental constraint — everything else follows
Every best practice in AI coding (subagents, /clear, focused tasks, specs files) traces back to managing a single scarce resource: context
Autonomous coding loops need small stories and fast feedback to work
The Ralph pattern ships 13 user stories in 1 hour by decomposing into context-window-sized tasks with explicit acceptance criteria and test-based feedback
A mediocre agent inside a strong harness outperforms a stronger agent inside a messy one
The surrounding machinery — metrics, rollback, scoping, observability — determines autonomous system performance more than model capability
Treat AI like a distributed team, not a single assistant
Running 15 parallel Claude streams with specialized roles (writer, reviewer, architect) produces better results than one perfect conversation
Persistent agent memory preserves institutional knowledge that walks out the door with employees
When agents maintain daily changelogs, decision logs, and work preferences, organizational knowledge survives personnel changes
The three-layer AI stack: Memory, Search, Reasoning
The emerging AI product architecture has three layers — Memory (who is this user), Search (find the right information), Reasoning (navigate complex information) — all running on PostgreSQL
Declarative beats imperative when working with agents
Give agents success criteria and watch them go — don't tell them what to do step by step
Agents that store error patterns learn continuously without fine-tuning or retraining
Dash's 'GPU-poor continuous learning' separates validated knowledge from error-driven learnings — five lines of code replaces expensive retraining
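A minimal sketch of the error-driven learning loop described above, assuming a JSONL file as the store and simple keyword retrieval (the file name, schema, and matching logic are illustrative, not Dash's actual implementation):

```python
import json

ERROR_LEARNINGS = "learnings.jsonl"    # illustrative store; any database works

def record_failure(task: str, error: str, lesson: str):
    """Append an error-driven learning; no fine-tuning or retraining involved."""
    with open(ERROR_LEARNINGS, "a") as f:
        f.write(json.dumps({"task": task, "error": error, "lesson": lesson}) + "\n")

def learnings_for_prompt(task_keyword: str, limit: int = 5) -> str:
    """Retrieve relevant past lessons to inject into the agent's next prompt."""
    try:
        entries = [json.loads(line) for line in open(ERROR_LEARNINGS)]
    except FileNotFoundError:
        return ""
    hits = [e["lesson"] for e in entries if task_keyword in e["task"]][-limit:]
    return "Avoid past mistakes:\n" + "\n".join(f"- {h}" for h in hits) if hits else ""
```

The agent calls `record_failure` whenever a task fails and prepends `learnings_for_prompt` to future prompts, so the system improves without touching model weights.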
Skill graphs enable progressive disclosure for complex domains
Single skill files hit a ceiling — complex domains need interconnected knowledge that agents navigate progressively from index to description to links to sections to full content
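The index-to-full-content navigation can be sketched as a loader that returns only as much of a skill as the agent asks for. The folder layout here (`INDEX.md`, `DESCRIPTION.md`, `full/*.md`) is an assumption for illustration, not a standard:

```python
from pathlib import Path

def load_skill(skill_dir: str, depth: str = "index") -> str:
    """Progressive disclosure: reveal a skill graph one layer at a time.

    Assumed (hypothetical) layout:
        <skill_dir>/INDEX.md        - one-line summaries linking onward
        <skill_dir>/DESCRIPTION.md  - a paragraph per sub-skill
        <skill_dir>/full/*.md       - complete reference sections
    """
    root = Path(skill_dir)
    if depth == "index":
        return (root / "INDEX.md").read_text()
    if depth == "description":
        return (root / "DESCRIPTION.md").read_text()
    # depth == "full": concatenate every detailed section
    return "\n\n".join(p.read_text() for p in sorted((root / "full").glob("*.md")))
```

The agent starts at `depth="index"` and only pays the context cost of deeper layers when its current step requires them.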
Domain-specific skill libraries are the real agent moat, not core infrastructure
An elite team can replicate any agent's tool architecture in months, but accumulated domain workflows (LBO modeling, compliance, bankruptcy) represent years of domain expertise
Structure plus reasoning beats flat similarity for complex domains
Across documents, code, and skills, the same pattern holds: structured knowledge navigated by reasoning outperforms flat indexes searched by similarity
In agent-native architecture, features are prompts — not code
The shift from coding specific functions to describing outcomes that agents achieve by composing atomic tools
Production agents route routine cases through decision trees, reserving humans for complexity
Handle exact matches and known patterns without AI; invoke the model for ambiguity, and route genuinely complex cases to human judgment
Markdown skill files may replace expensive fine-tuning
A SKILL.md file that teaches an agent how to do something specific can match domain-specific fine-tuned models — at zero training cost
Every optimization has a shadow regression — guard commands make the shadow visible
When optimizing metric A, metric B silently degrades unless you run a separate invariant check (a guard) alongside the primary verification
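A minimal sketch of the guard idea: run the primary verification and a separate invariant check together, and fail the change if either fails. The commands are caller-supplied; nothing here is specific to any one build system:

```python
import subprocess

def verify_with_guard(primary_cmd: list, guard_cmd: list) -> bool:
    """Run the primary verification AND an invariant guard; both must pass.

    primary_cmd checks the metric you are optimizing; guard_cmd checks the
    metric that tends to silently regress (the 'shadow')."""
    primary = subprocess.run(primary_cmd, capture_output=True)
    if primary.returncode != 0:
        print("primary verification failed")
        return False
    guard = subprocess.run(guard_cmd, capture_output=True)
    if guard.returncode != 0:
        print("guard failed: optimizing metric A regressed invariant B")
        return False
    return True
```

For example, if the primary command benchmarks latency, the guard might re-run the correctness suite, so a latency win that breaks behavior is rejected automatically.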
Observability is the missing discipline for agent systems — you can't improve what you can't measure
Agent systems need telemetry (token usage, latency, error rates, cost per task) as a first-class engineering concern, not an afterthought bolted on after production failures
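The four signals named above can be captured in a small first-class telemetry object; the per-token prices are assumptions that depend on the model being used:

```python
import time
from dataclasses import dataclass, field

@dataclass
class AgentTelemetry:
    """Minimal per-task telemetry: tokens, latency, errors, cost (fields illustrative)."""
    prompt_tokens: int = 0
    completion_tokens: int = 0
    errors: int = 0
    started: float = field(default_factory=time.monotonic)
    price_in: float = 3.0    # assumed USD per 1M prompt tokens
    price_out: float = 15.0  # assumed USD per 1M completion tokens

    def record(self, prompt_tokens: int, completion_tokens: int, error: bool = False):
        self.prompt_tokens += prompt_tokens
        self.completion_tokens += completion_tokens
        self.errors += int(error)

    def summary(self) -> dict:
        return {
            "latency_s": round(time.monotonic() - self.started, 3),
            "tokens": self.prompt_tokens + self.completion_tokens,
            "errors": self.errors,
            "cost_usd": round(self.prompt_tokens / 1e6 * self.price_in
                              + self.completion_tokens / 1e6 * self.price_out, 6),
        }
```

Wiring one of these into every agent task from day one is what makes regressions in cost or latency visible before they become production incidents.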
An orchestrator agent that manages other agents solves the parallel coordination problem without human bottleneck
Instead of humans managing AI agents, a meta-agent spawns specialized agents, routes tasks by model strength, and monitors progress — turning agent swarms into autonomous dev teams
Verification is a Red Queen race — optimizing against a fixed eval contaminates it
Eval suites degrade the moment you use them to improve an agent — the agent adapts to the distribution, and the eval stops measuring what it was designed to measure
Evaluate agent tools with real multi-step tasks, not toy single-call examples
Weak evaluation tasks hide tool design flaws — strong tasks require chained calls, ambiguity resolution, and verifiable outcomes
Evolving summaries beat append-only memory — rewrite profiles, don't accumulate facts
An evolve_summary() function that rewrites category profiles with new information handles contradictions naturally, unlike append-only logs
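A sketch of the `evolve_summary()` contract. The real version would prompt an LLM to reconcile the profile with the new fact; the fallback merge here is only a stand-in so the sketch runs without a model:

```python
def evolve_summary(current: str, new_fact: str, llm=None) -> str:
    """Rewrite a category profile in light of new information.

    The key contract: the profile is REWRITTEN as one coherent summary, not
    appended to, so contradictions are resolved rather than accumulated."""
    prompt = (
        "Rewrite this user profile as one coherent paragraph.\n"
        "If the new fact contradicts the profile, the new fact wins.\n\n"
        f"Profile: {current}\nNew fact: {new_fact}\nRewritten profile:"
    )
    if llm is not None:
        return llm(prompt)   # caller-supplied model client (assumed interface)
    # Stub so the sketch runs standalone: naive merge favoring the new fact.
    return f"{new_fact} (previously: {current})"
```

Contrast with an append-only log, where "prefers email" and "prefers Slack" would coexist forever and the retrieval layer would have to guess which is current.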
Tool design is continuous observation — see like an agent
Designing effective agent tools requires iterating by watching actual model behavior, not specifying upfront; tools that helped weaker models may constrain stronger ones
The intelligence-to-judgment ratio determines which professions AI automates first
Intelligence work (complex but rule-based) is already automatable; judgment (experience, taste, intuition) remains human — software engineering crossed the threshold first
Multi-model code review creates adversarial robustness — each model catches what others miss
Using 3 different LLMs to review the same PR exploits the fact that models have different failure modes, creating emergent coverage no single model achieves
Parallel agents create a management problem, not a coding problem
When AI agents can work on multiple projects simultaneously, the bottleneck shifts from writing code to coordinating parallel workstreams
Tools are a new kind of software — contracts between deterministic systems and non-deterministic agents
Agent tools must be designed for how agents think (context-limited, non-deterministic, description-dependent), not how programmers think
Rollback safety nets enable autonomous iteration — not model intelligence
The minimum viable safety net for autonomy is a quantifiable metric, atomic changes, and automatic rollback — these make cheap failure possible, which makes aggressive exploration safe
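The three ingredients (quantifiable metric, atomic change, automatic rollback) can be sketched as a single try/keep/revert loop. The snapshot-by-copy mechanism here is a stand-in for a real VCS checkpoint such as `git stash`; `apply_change` and `metric` are caller-supplied:

```python
import shutil, tempfile

def try_change(repo: str, apply_change, metric) -> bool:
    """Apply one atomic change; keep it only if the metric does not regress.

    apply_change(repo) mutates the working tree; metric(repo) returns a
    score where higher is better."""
    snapshot = tempfile.mkdtemp()
    shutil.copytree(repo, snapshot, dirs_exist_ok=True)   # checkpoint
    baseline = metric(repo)
    apply_change(repo)
    if metric(repo) >= baseline:
        shutil.rmtree(snapshot)
        return True                      # keep the change
    shutil.rmtree(repo)                  # automatic rollback
    shutil.copytree(snapshot, repo)
    shutil.rmtree(snapshot)
    return False
```

Because every failed change costs only one cheap revert, the agent can explore aggressively; the safety comes from the harness, not from the model being smart.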
Traces not scores enable agent improvement — without trajectories, improvement rate drops hard
When AutoAgent's meta-agent received only pass/fail scores without reasoning traces, the improvement rate dropped significantly; understanding why matters as much as knowing that
Agent edits are automatic decision instrumentation — every human correction is a structured signal
When agents propose and humans edit, the delta between proposal and correction captures tacit judgment as first-class data without requiring manual logging
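Capturing that delta can be as simple as diffing proposal against final and logging the result; the file name and record schema below are illustrative:

```python
import difflib, json, time

def record_correction(proposal: str, final: str, log_path: str = "corrections.jsonl"):
    """Store the human edit delta as a structured signal, with no manual logging.

    The unified diff between what the agent proposed and what the human
    shipped is the tacit-judgment data."""
    diff = list(difflib.unified_diff(
        proposal.splitlines(), final.splitlines(),
        fromfile="agent_proposal", tofile="human_final", lineterm=""))
    entry = {"ts": time.time(), "changed": bool(diff), "diff": diff}
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

Over time the log separates proposals humans accept untouched from the ones they systematically rewrite, which is exactly the signal a fine-tuning or prompt-improvement loop needs.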
Auto-generated narrow monitors beat handwritten broad checks — a tight mesh over the exact shape of the code
1,000+ AI-generated monitors that each target specific code paths catch more bugs than 10 hand-written checks that cover general categories
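The shape of a narrow-monitor mesh can be sketched as a registry of tiny path-specific predicates; the monitor names and checked invariants below are hypothetical, and in the auto-generated setting a model would emit hundreds of these:

```python
MONITORS = []

def monitor(description: str):
    """Register one narrow check targeting a specific code path."""
    def register(fn):
        MONITORS.append((description, fn))
        return fn
    return register

@monitor("billing.compute_invoice: discount never exceeds subtotal")
def check_discount(result):
    return result["discount"] <= result["subtotal"]

@monitor("billing.compute_invoice: total is subtotal minus discount")
def check_total(result):
    return result["total"] == result["subtotal"] - result["discount"]

def run_monitors(result) -> list:
    """Return the description of every monitor that failed."""
    return [desc for desc, fn in MONITORS if not fn(result)]
```

Each monitor is trivial on its own; the coverage comes from having one per code path rather than a few broad category checks, so a failure pinpoints exactly which invariant broke.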
Context layers must be living systems, not static artifacts
Unlike semantic layers that rot when maintainers leave, context layers need self-updating feedback loops where agent errors refine the context corpus
Cross-user knowledge transfer works without fine-tuning — just a database and prompt engineering
When one person teaches an agent something, another person benefits automatically — no RLHF, no training infrastructure, just structured storage and retrieval
Intelligence location — code vs prompts — determines system fragility and flexibility
Critical architectural fork: prompt-driven systems (Pal's 400-line routing prompt) are flexible but break when models change; code-driven systems (our validate-graph.js) are rigid but reliable — best systems need both
Malleable software — a tiny core that writes its own plugins — replaces fixed-feature applications
Instead of adapting your workflow to the tool, the tool observes your workflow and extends itself to match it
AI is the computer — orchestration across 19 models is the product, not any single model
Perplexity launched a unified agent system orchestrating 19 backend models that delegate tasks, manage files, execute code, and browse the web. The differentiation isn't the models — it's the orchestration. 'The computer is the orchestration system.'
Property-based testing explores agent input spaces that example-based tests miss
Generative tests that produce random or adversarial inputs discover edge cases in agent behavior that hand-written examples never cover — verification over testing means proving properties, not checking cases
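A stdlib sketch of the idea (a library like Hypothesis is the idiomatic choice; plain `random` keeps this self-contained). The function under test and its three properties are illustrative:

```python
import random, string

def normalize_command(cmd: str) -> str:
    """Toy agent input handler under test: collapse whitespace, lowercase."""
    return " ".join(cmd.split()).lower()

def test_properties(trials: int = 500, seed: int = 0) -> int:
    """Property-based check: generate random inputs, assert invariants
    that must hold for EVERY input, instead of checking picked examples."""
    rng = random.Random(seed)
    alphabet = string.ascii_letters + "  \t\n"
    for _ in range(trials):
        s = "".join(rng.choice(alphabet) for _ in range(rng.randrange(0, 40)))
        out = normalize_command(s)
        assert out == out.lower()                 # property: lowercased
        assert "  " not in out                    # property: no double spaces
        assert normalize_command(out) == out      # property: idempotent
    return trials
```

Hand-written examples would likely never include a string of tabs and newlines; the generator produces them constantly, which is how this style surfaces edge cases in agent input handling.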
Self-improving agents overfit to eval metrics — the meta-agent games rubrics unless structurally constrained
AutoAgent's meta-agent gets lazy, inserting rubric-specific prompting so the task agent can game metrics; defense requires forcing self-reflection on generalizability
Treat an agent as an operating system, not a stateless function
Agents need RAM (conversation context), a hard drive (persistent memory), garbage collection (decay/pruning), and I/O management (tools) — the OS mental model unlocks architectural clarity
Tribal knowledge is the irreducible human input that enables agent automation
Automated context construction handles most of the corpus, but the most critical context is implicit, conditional, and historically contingent — only humans can provide it
Trust boundaries must be externalized — not held in engineers' heads
Where an agent's behavior is well-understood vs. unknown should be mapped, made auditable, and connected to deployment gates — not left as implicit tribal knowledge
WebMCP turns websites into agent-native interfaces
Chrome's MCP integration lets websites expose structured tools to agents instead of agents scraping and guessing at UI elements
Accumulated agent traces produce emergent world models — discovered, not designed
When agent decision trajectories accumulate over time, they form a context graph that reveals entities, relationships, and constraints nobody explicitly modeled
Agent trust transfers from human credibility — colleagues adopt agents operated by people they trust
When a human's agent consistently performs well, other team members inherit that trust and willingly depend on the agent, creating a credibility chain
Compilation scales but curation compounds — two camps for knowledge graph construction
LLM-compiled systems (Karpathy, Pal) grow fast by feeding raw content through model judgment; human-curated systems (our graph, brainctl) grow slowly but every node is validated — compilation scales linearly, curation compounds through connections
Data agent failures stem from missing business context, not SQL generation gaps
The industry initially blamed text-to-SQL capability for data agent failures, but the real blockers are undefined business definitions, ambiguous sources of truth, and missing tribal knowledge
Deputies and Sheriffs — distributed agent teams with hierarchical authority replace centralized software
Individual employees train specialized 'Deputy' agents while organizational 'Sheriff' agents manage permissions, rules, and onboarding across the team
Detect everything, notify selectively — the observability-to-notification ratio determines system trust
Watch every signal but ensure alerts reaching humans always mean something; teams ignore noisy monitors AND noisy agents equally fast
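The detect-everything/notify-selectively split can be sketched as a watcher that logs every signal but pages a human only on sustained, severe anomalies; the threshold and sustain window are illustrative knobs:

```python
from collections import deque

class SignalWatcher:
    """Record every signal; notify only when severity stays high for
    `sustain` consecutive observations."""
    def __init__(self, severity_threshold: float = 0.9, sustain: int = 3):
        self.log = []                        # detect everything
        self.recent = deque(maxlen=sustain)  # sliding window of breaches
        self.severity_threshold = severity_threshold
        self.sustain = sustain

    def observe(self, name: str, severity: float) -> bool:
        """Return True only when a human should actually be paged."""
        self.log.append((name, severity))
        self.recent.append(severity >= self.severity_threshold)
        return len(self.recent) == self.sustain and all(self.recent)
```

The full log stays available for debugging and trend analysis, but a page always means something, which is what keeps the team from tuning the alerts out.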
Meta-agents that autonomously optimize task agents beat hand-engineered harnesses on production benchmarks
AutoAgent's meta-agent hit #1 on SpreadsheetBench (96.5%) and TerminalBench (55.1%) by autonomously iterating on a task agent's harness for 24+ hours — every other leaderboard entry was hand-engineered
One session per contract beats long-running agent sessions
Fresh context per task contract outperforms 24-hour agent sessions because cross-contract context bloat degrades performance by construction
Personal software grows through relationship, not configuration
Unlike traditional SaaS where users adapt to the tool, personal software agents grow personality and skills in response to their user through ongoing interaction
Same-model meta-task pairings outperform cross-model — agents understand their own architecture better than humans or other models do
Claude meta-agent + Claude task agent outperformed Claude meta-agent + GPT task agent because the meta-agent shares weights and implicitly understands how the inner model reasons
Safety enforcement belongs in tool design, not system prompts
At scale, embedding safety constraints in the tool's API (blocking destructive operations by default) beats relying on behavioral compliance with system prompt instructions
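A sketch of moving the constraint into the tool's API: the statement list and `execute` stub are illustrative, but the pattern (the refusal lives in code the model cannot talk its way past) is the point:

```python
DESTRUCTIVE = {"DROP", "DELETE", "TRUNCATE", "ALTER"}

def run_sql_tool(query: str, allow_writes: bool = False) -> str:
    """A database tool that blocks destructive statements by default.

    The safety constraint is enforced in the tool layer, not requested in a
    system prompt the model might ignore under pressure."""
    first_word = query.strip().split()[0].upper() if query.strip() else ""
    if first_word in DESTRUCTIVE and not allow_writes:
        return "REFUSED: destructive statements are blocked by tool policy"
    return execute(query)

def execute(query: str) -> str:
    return f"executed: {query}"   # stand-in for the real database layer
```

Unlike a prompt instruction, this guarantee holds for every call at scale, and `allow_writes` makes the escalation path explicit and auditable.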
Shadow execution enables safe trace learning — replay write operations without touching production data
By replaying actions that would write to external apps in a shadow path, agents can learn from realistic end-to-end flows without impacting customer data
A skill's folder structure is its context architecture — the file system is a form of context engineering
Skills are not just markdown files but folders where scripts, references, and assets enable progressive disclosure — the agent reads deeper files only when it reaches the relevant step
Time-bounded evaluation forces optimization for real-world usefulness instead of idealized performance
A fixed wall-clock budget per experiment makes results comparable, normalizes across hardware, and forces agents to optimize for improvement per unit time
Two-tier agent memory separates organizational workflow knowledge from individual user preferences
Deployment-level memory captures shared tool strategies and sequencing patterns; user-level memory captures personal templates and communication styles — skipping the user-level tier at first caused a significant performance hit
Vertical models beat frontier models in their domain — specialization wins on every metric
Intercom's Apex, a specialized customer service LLM, beat every frontier model including Anthropic and OpenAI on resolution rate, latency, hallucination rate, and cost
Virtual filesystems replace sandboxes for agent navigation — intercept commands instead of provisioning infrastructure
Mintlify's ChromaFs intercepts Unix commands and translates them into database queries, cutting boot time from 46 seconds to 100ms and cost from $70k/year to near-zero
KV cache hit rate is the most critical metric for production agents
Maintaining stable prompt prefixes and append-only context architecture maximizes cache reuse, dramatically reducing both cost and latency for agentic workflows
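Why append-only helps can be sketched with a context structure whose prefix is never edited; the message layout is illustrative, but the invariant (mutating earlier messages invalidates the KV cache from the edit point onward, appending preserves the shared prefix) is the mechanism:

```python
class AppendOnlyContext:
    """Keep the prompt prefix byte-stable so the serving layer can reuse KV cache."""
    def __init__(self, system_prompt: str):
        self.messages = [("system", system_prompt)]   # stable prefix, never edited

    def append(self, role: str, content: str):
        self.messages.append((role, content))         # only ever grow the tail

    def shared_prefix_len(self, other: "AppendOnlyContext") -> int:
        """Leading messages two requests share: the cacheable portion."""
        n = 0
        for a, b in zip(self.messages, other.messages):
            if a != b:
                break
            n += 1
        return n
```

Two agent turns built this way share everything up to the newest message, so the cache hit covers almost the whole prompt; inject timestamps or dynamic state at the tail, never into the system prompt, or the shared prefix collapses to zero.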
Reasoning evaporation permanently destroys agent decision chains when the context window closes
An agent's multi-step reasoning exists only in the context window; when the session ends, the output survives but the decision chain — why each step was taken — is gone forever
Separate research from implementation to preserve agent context for execution
Mixing research and implementation pollutes context with irrelevant alternatives — split them into separate agent sessions so the implementer gets only the chosen approach
Teacher-student trace distillation with consensus validation beats single-oracle learning
A single high-reasoning teacher trace isn't reliable enough for enterprise learning; comparing multiple student traces under production constraints with consensus validation produces trustworthy strategies
AI trace data has an indefinite useful lifespan — SaaS observability's 30-day retention model destroys institutional knowledge
Infrastructure metrics expire quickly but AI conversations and reasoning traces gain value over time; 30-day retention windows erase the very data that reveals failure patterns and training signals
Uncorrelated context windows are a form of test time compute — fresh perspectives multiply capability
Multiple agents with independent context windows avoid polluting each other's reasoning, and throwing more context at a problem from different angles increases capability
Unfocused agents develop path dependency — without a specific mission, they explore the same paths repeatedly
Agents given broad mandates (like 'find bugs') converge on familiar exploration paths, catching high-radius issues but missing narrow situational problems
Weaponize sycophancy with adversarial agent ensembles instead of fighting it
Deploy bug-finder, adversary, and referee agents with scoring incentives that exploit each agent's eagerness to please — triangulating truth from competing biases
Agents need workflow-level tool strategies, not individual tool instructions — the hard part is how tools combine
In enterprise environments, the challenge isn't finding the right tool but understanding how tools work together; intentionally narrow strategies that capture workflow patterns generalize better than broad abstractions
AI's self-improvement loop means each generation builds the next one faster
GPT-5.3-Codex was instrumental in creating itself — recursive improvement compresses timelines and explains why building for obsolescence is the only safe strategy