Evaluation: From Vibes to Metrics
If you can't measure quality, you can't improve it. Build an eval system before you need it.
"It feels better" is not a shipping criterion. Evals are how you go from vibes to numbers — and they're the difference between teams that iterate confidently and teams that ship regressions.
The eval hierarchy
Three levels, in order of effort and reliability:
- Vibe checks — eyeball 5-10 outputs. Fast, biased, fine for exploration.
- Golden sets — hand-labelled input/output pairs, scored programmatically. Reliable, slow to build.
- Live A/B tests — ship two versions, measure user behaviour. Definitive but slow.
You graduate from one to the next as the stakes grow.
Building a golden set
The cheapest worthwhile eval system:
- Collect 50-200 real inputs from your product
- For each, write or label the ideal output
- Run candidate models / prompts on the inputs
- Score each output against the ideal
Distribution matters. Don't just include happy paths — sample the long tail: edge cases, adversarial inputs, ambiguous queries, multi-turn complications. A golden set without adversarial examples will lull you into false confidence.
Scoring methods
Four common approaches, from cheap to expensive:
- Exact match / regex — works for classification, extraction, JSON conformance. Brittle for free-form text.
- BLEU / ROUGE — n-gram overlap with reference. Cheap, weak for modern generation tasks.
- Embedding similarity — cosine similarity between output embedding and ideal embedding. Better than n-gram but blunt.
- LLM-as-judge — use a strong model to score outputs against criteria. Most flexible, requires careful prompt engineering of the judge.
LLM-as-judge is now the standard. Use a frontier model (Claude 4 Opus, GPT-4.1) as the judge, score on specific rubrics ("Does the output cite a source? 0=no, 1=yes"), and validate the judge against human scores on a sample.
What to measure
Different tasks need different metrics. Some examples:
- Q&A: answer correctness, hallucination rate, citation accuracy
- Summarisation: faithfulness, completeness, conciseness, factual accuracy
- Classification: precision, recall, F1
- Code generation: pass@1 (does generated code pass tests on first try?), compilation rate
- Agents: task completion rate, average steps, cost per task, error recovery rate
Regression eval as CI
Once your golden set exists, run it on every prompt change. Score thresholds become a CI gate: "block deployment if accuracy drops > 2%."
Tools that help: Promptfoo, Braintrust, Phoenix Arize, LangSmith. Or build it yourself — most teams need <500 lines.
The dirty secret
Evals are the single highest-ROI investment in any AI product, and the most consistently neglected. Teams ship for months on vibes, then can't explain why their metrics dropped. The teams that win build evals on day one. You don't have an AI product until you have an eval set.