AI Product Building Architecture Knowledge Systems

A loss curve is reassurance, not analysis — pull a hundred failures and read every one

Experiments throw off far more information than you consume (transcripts, failure cases, the strange tail), and most of it dies unread. Most ML bugs live in the data and fail silently; Andrew Ng's move is to pull 100 failures, read all of them, sort them into piles, and attack the biggest pile.

@itsreallyvivek (vivek) — how to be good at research · Jun 15, 2026 · 6 connections

Vivek’s warning: “a descending loss curve is not analysis, it’s reassurance. your experiments throw off far more information than you consume: transcripts, failure cases, the strange tail of the distribution. most of it dies unread in a logs folder.” Karpathy’s recipe “starts before any training code gets written, with hours spent on the raw data by hand,” because “most ml bugs live in the data, and they fail silently. nothing crashes. you simply get a mediocre model and a wrong theory about why.” Andrew Ng’s decade-old move still wins: “pull a hundred failures, read all of them, sort them into piles, attack the biggest pile” — and it applies to evals too, where “a benchmark you’ve never read transcripts from is a benchmark you don’t actually understand.”
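The "pull a hundred failures, sort them into piles, attack the biggest pile" loop can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record fields `passed`, `transcript`, and `pile` and the function names are hypothetical, not from the source. The reading and labeling step stays manual; the code only draws the sample and tallies the piles afterwards.

```python
import random
from collections import Counter


def sample_failures(records, k=100, seed=0):
    """Return up to k failed records, drawn uniformly at random.

    Assumes each record is a dict with a boolean "passed" field
    (a hypothetical schema chosen for this sketch).
    """
    failures = [r for r in records if not r["passed"]]
    rng = random.Random(seed)  # fixed seed so the same batch can be re-read later
    return rng.sample(failures, min(k, len(failures)))


def rank_piles(labeled):
    """Count hand-assigned "pile" labels and return them largest first."""
    return Counter(r["pile"] for r in labeled).most_common()


if __name__ == "__main__":
    # Toy eval results standing in for a real run.
    records = [
        {"id": i, "passed": i % 3 == 0, "transcript": f"transcript {i}"}
        for i in range(600)
    ]
    batch = sample_failures(records, k=100)
    for r in batch:
        # In practice a person reads r["transcript"] and writes the label.
        # The rule below is a placeholder so the sketch runs end to end.
        r["pile"] = "placeholder-a" if r["id"] % 2 else "placeholder-b"
    for pile, count in rank_piles(batch):
        print(pile, count)
```

The first entry returned by `rank_piles` is the pile to attack first. Sampling at random rather than taking the first hundred failures keeps the strange tail in view instead of whatever happened to be logged earliest.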

This is the research-craft root of why “Traces replace code as the source of truth for agent systems — debugging shifts from 'show me the code' to 'send me the trace'”: the transcript, not the aggregate metric, is where understanding lives, and reading the strange tail is the actual analysis. It’s the human discipline that “Observability is the missing discipline for agent systems — you can't improve what you can't measure” tries to systematize. The “benchmark you’ve never read transcripts from” line is exactly why “LLM-as-judge must be calibrated against human judgment — uncalibrated judges are worse than no judges”: a score you haven’t traced back to behavior is unanchored. And it generalizes “Revealed preferences trump stated preferences — track what users do, not what they say”: the failure cases reveal what your system actually does, versus what the loss curve says it does. Aggregate similarity hides this, which is why “Similarity is not relevance — relevance requires reasoning.”

Connected Insights

References (5)

→ LLM-as-judge must be calibrated against human judgment — uncalibrated judges are worse than no judges
→ Observability is the missing discipline for agent systems — you can't improve what you can't measure
→ Revealed preferences trump stated preferences — track what users do, not what they say
→ Similarity is not relevance — relevance requires reasoning
→ Traces replace code as the source of truth for agent systems — debugging shifts from 'show me the code' to 'send me the trace'

Referenced by (1)

← Verification is the single highest-leverage practice for agent-assisted coding