LLMAtlas — The Open Ecosystem Workspace for LLMs

"It feels better" is not a shipping criterion. Evals are how you go from vibes to numbers — and they're the difference between teams that iterate confidently and teams that ship regressions.

The eval hierarchy

Three levels, in increasing order of effort and reliability:

Vibe checks — eyeball 5-10 outputs. Fast, biased, fine for exploration.
Golden sets — hand-labelled input/output pairs, scored programmatically. Reliable, slow to build.
Live A/B tests — ship two versions, measure user behaviour. Definitive but slow.

You graduate from one to the next as the stakes grow.

Building a golden set

The cheapest worthwhile eval system:

Collect 50-200 real inputs from your product
For each, write or label the ideal output
Run candidate models / prompts on the inputs
Score each output against the ideal
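
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `GoldenExample`, `run_golden_set`, `exact_match`, and `toy_candidate` are hypothetical names, and `toy_candidate` is a hard-coded stand-in for whatever model or prompt you are actually evaluating.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenExample:
    input: str   # a real input collected from the product
    ideal: str   # the hand-written or hand-labelled ideal output


def exact_match(output: str, ideal: str) -> float:
    """Score 1.0 when output and ideal agree after trimming and lowercasing."""
    return 1.0 if output.strip().lower() == ideal.strip().lower() else 0.0


def run_golden_set(
    examples: list[GoldenExample],
    candidate: Callable[[str], str],
    score: Callable[[str, str], float],
) -> float:
    """Run the candidate on every input and return the mean score."""
    if not examples:
        raise ValueError("golden set is empty")
    scores = [score(candidate(ex.input), ex.ideal) for ex in examples]
    return sum(scores) / len(scores)


# Toy golden set (assumed data): two happy paths and one long-tail input.
golden = [
    GoldenExample("I want a refund for my order", "billing"),
    GoldenExample("Where is my parcel?", "shipping"),
    GoldenExample("asdf ???", "unclear"),
]


def toy_candidate(text: str) -> str:
    # Stand-in for a model call; it never answers "unclear".
    if "refund" in text:
        return "billing"
    if "parcel" in text:
        return "shipping"
    return "billing"


accuracy = run_golden_set(golden, toy_candidate, exact_match)
print(f"accuracy: {accuracy:.2f}")  # prints "accuracy: 0.67"
```

The long-tail example is the one the toy candidate gets wrong, which is the kind of failure a happy-path-only set would hide.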

Distribution matters. Don't just include happy paths — sample the long tail: edge cases, adversarial inputs, ambiguous queries, multi-turn complications. A golden set without adversarial examples will lull you into false confidence.

Scoring methods

Four common approaches, from cheap to expensive:

Exact match / regex — works for classification, extraction, JSON conformance. Brittle for free-form text.
BLEU / ROUGE — n-gram overlap with reference. Cheap, weak for modern generation tasks.
Embedding similarity — cosine similarity between output embedding and ideal embedding. Better than n-gram but blunt.
LLM-as-judge — use a strong model to score outputs against criteria. Most flexible, requires careful prompt engineering of the judge.
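
To make the first three approaches concrete, here is a rough standard-library sketch. The function names are hypothetical, `unigram_f1` is a simplified ROUGE-1-style overlap rather than a full BLEU or ROUGE implementation, and `cosine_similarity` assumes you already have embedding vectors from some embedding model (the short vectors in the example are made up).

```python
import math
import re
from collections import Counter


def regex_match(output: str, pattern: str) -> float:
    """1.0 when the whole (trimmed) output matches the pattern, else 0.0."""
    return 1.0 if re.fullmatch(pattern, output.strip()) else 0.0


def unigram_f1(output: str, reference: str) -> float:
    """Simplified ROUGE-1-style F1 over lowercased whitespace tokens."""
    out_tokens = output.lower().split()
    ref_tokens = reference.lower().split()
    if not out_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(out_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(out_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


date_ok = regex_match("2024-01-31", r"\d{4}-\d{2}-\d{2}")
overlap_score = unigram_f1("the cat sat", "the cat sat down")
similarity = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # made-up vectors
```

Note how `unigram_f1` rewards shared words regardless of meaning, which is why n-gram metrics are weak for free-form generation.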

LLM-as-judge is now the standard for free-form outputs. Use a frontier model (Claude Opus 4, GPT-4.1) as the judge, score on specific rubrics ("Does the output cite a source? 0=no, 1=yes"), and validate the judge against human scores on a sample before trusting its numbers.
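
A sketch of the rubric-plus-validation loop might look like the following. Everything here is illustrative: `call_judge` is a hypothetical callable you would wire to your model provider, `fake_judge` is a keyword-matching stand-in so the example runs offline, and the outputs and human labels are invented. Agreement is reported both as a raw rate and as Cohen's kappa, which corrects for agreement expected by chance.

```python
from typing import Callable, Sequence

RUBRIC = "Does the output cite a source? Answer with a single digit: 0=no, 1=yes."


def build_judge_prompt(output: str, rubric: str = RUBRIC) -> str:
    return f"{rubric}\n\nOutput to grade:\n{output}\n\nScore:"


def parse_score(reply: str) -> int:
    """Accept only replies that start with 0 or 1; fail loudly otherwise."""
    cleaned = reply.strip()
    if cleaned[:1] not in ("0", "1"):
        raise ValueError(f"unparseable judge reply: {reply!r}")
    return int(cleaned[0])


def judge_outputs(outputs: Sequence[str], call_judge: Callable[[str], str]) -> list[int]:
    return [parse_score(call_judge(build_judge_prompt(o))) for o in outputs]


def agreement_rate(judge: Sequence[int], human: Sequence[int]) -> float:
    if not judge or len(judge) != len(human):
        raise ValueError("score lists must be non-empty and the same length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohen_kappa(judge: Sequence[int], human: Sequence[int]) -> float:
    """Chance-corrected agreement for binary (0/1) scores."""
    n = len(judge)
    observed = agreement_rate(judge, human)
    p_judge = sum(judge) / n
    p_human = sum(human) / n
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:
        return 1.0  # both raters gave one identical constant label
    return (observed - expected) / (1 - expected)


def fake_judge(prompt: str) -> str:
    # Offline stand-in for a real model call: treats any URL as a citation.
    graded = prompt.split("Output to grade:\n", 1)[1]
    return "1" if "http" in graded else "0"


outputs = [
    "Paris is the capital of France (https://example.com/france).",
    "Paris is the capital of France.",
    "More at https://example.com",
    "Probably Lyon.",
]
human_scores = [1, 0, 0, 0]  # invented labels; the human rejects the bare link

judge_scores = judge_outputs(outputs, fake_judge)
agreement = agreement_rate(judge_scores, human_scores)
kappa = cohen_kappa(judge_scores, human_scores)
```

In this toy run the judge and the human disagree on the third output, so raw agreement is 0.75 and kappa is 0.5. A gap like that is a signal to tighten the rubric before relying on the judge.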

What to measure

Different tasks need different metrics. Some examples:

Q&A: answer correctness, hallucination rate, citation accuracy
Summarisation: faithfulness, completeness, conciseness, factual accuracy
Classification: precision, recall, F1
Code generation: pass@1 (does generated code pass tests on first try?), compilation rate
Agents: task completion rate, average steps, cost per task, error recovery rate
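
Two of these metrics are simple enough to compute by hand. The sketch below uses hypothetical helper names and made-up labels: it computes precision, recall, and F1 for one positive class, and pass@1 as the fraction of problems whose first generated solution passed its tests.

```python
from typing import Sequence


def precision_recall_f1(
    predicted: Sequence[str], actual: Sequence[str], positive: str
) -> tuple[float, float, float]:
    """Precision, recall, and F1 for a single positive class."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must be the same length")
    pairs = list(zip(predicted, actual))
    tp = sum(p == positive and a == positive for p, a in pairs)
    fp = sum(p == positive and a != positive for p, a in pairs)
    fn = sum(p != positive and a == positive for p, a in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def pass_at_1(first_attempt_passed: Sequence[bool]) -> float:
    """Fraction of problems whose first generated solution passed the tests."""
    if not first_attempt_passed:
        raise ValueError("no results to score")
    return sum(first_attempt_passed) / len(first_attempt_passed)


# Made-up classifier results: 2 true positives, 1 false positive, 1 false negative.
predicted = ["spam", "spam", "ham", "ham", "spam"]
actual = ["spam", "ham", "ham", "spam", "spam"]
precision, recall, f1 = precision_recall_f1(predicted, actual, positive="spam")

# Made-up code-generation results: 3 of 4 first attempts passed.
code_pass_rate = pass_at_1([True, False, True, True])
```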

Regression eval as CI

Once your golden set exists, run it on every prompt or model change. Score thresholds become a CI gate: "block deployment if accuracy drops by more than 2 percentage points."
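
A gate of that kind can be a very small script. The sketch below is one possible shape, with assumptions: it reads the 2% threshold as an absolute drop in accuracy (0.02 on a 0 to 1 scale), and the names `within_threshold` and `ci_exit_code` plus the accuracy figures are invented for illustration.

```python
def within_threshold(baseline: float, candidate: float, max_drop: float = 0.02) -> bool:
    """True when the candidate's accuracy drop stays within max_drop (absolute)."""
    # The small tolerance keeps a drop of exactly max_drop from failing
    # because of floating-point rounding.
    return (baseline - candidate) <= max_drop + 1e-9


def ci_exit_code(baseline: float, candidate: float, max_drop: float = 0.02) -> int:
    """Return 0 to allow the deployment, 1 to block it."""
    if within_threshold(baseline, candidate, max_drop):
        print(f"PASS: accuracy {baseline:.2f} -> {candidate:.2f}")
        return 0
    print(f"FAIL: accuracy {baseline:.2f} -> {candidate:.2f} exceeds allowed drop {max_drop:.2f}")
    return 1


small_regression = ci_exit_code(baseline=0.91, candidate=0.90)  # allowed
large_regression = ci_exit_code(baseline=0.91, candidate=0.88)  # blocked
```

In a real pipeline the script would end with `sys.exit(ci_exit_code(...))`, since most CI systems treat a non-zero exit status as a failed step.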

Tools that help: Promptfoo, Braintrust, Arize Phoenix, LangSmith. Or build it yourself: most teams need fewer than 500 lines of code.

The dirty secret

Evals are the single highest-ROI investment in any AI product, and the most consistently neglected. Teams ship for months on vibes, then can't explain why their metrics dropped. The teams that win build evals on day one. You don't have an AI product until you have an eval set.