Evaluation: Stop Vibing, Start Measuring
The single biggest reason AI features fail in production is the lack of evals. "Vibe checking" works for demos and dies at scale.
The eval stack
- A golden dataset — 20-200 representative inputs with expected outputs.
- A grader — code (regex / JSON match), another LLM, or a human.
- A baseline — what's the score today on your current model?
- A regression loop — re-run when you change the prompt or model.
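The four pieces above can be wired together in a few lines. The following is a minimal sketch, not a reference implementation: the dataset rows, the two stub "models" (plain lookup functions standing in for real model calls), and the names `run_eval` and `check_regression` are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical golden dataset: representative inputs with expected outputs.
GOLDEN: List[Dict[str, str]] = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "3 * 3", "expected": "9"},
]


def exact_grader(output: str, expected: str) -> float:
    """Code grader: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(model: Callable[[str], str], dataset, grader) -> float:
    """Return the mean grader score of `model` over the dataset."""
    scores = [grader(model(row["input"]), row["expected"]) for row in dataset]
    return sum(scores) / len(scores)


def check_regression(baseline: float, candidate: float, tolerance: float = 0.0) -> bool:
    """True if the candidate has not dropped below baseline minus tolerance."""
    return candidate >= baseline - tolerance


# Stub models (assumptions): replace with real calls to your provider.
def current_model(prompt: str) -> str:
    answers = {"2 + 2": "4", "capital of France": "Paris", "3 * 3": "6"}
    return answers.get(prompt, "")


def candidate_model(prompt: str) -> str:
    answers = {"2 + 2": "4", "capital of France": "paris", "3 * 3": "9"}
    return answers.get(prompt, "")


# Baseline: today's score. Regression loop: re-run after every change.
baseline_score = run_eval(current_model, GOLDEN, exact_grader)
candidate_score = run_eval(candidate_model, GOLDEN, exact_grader)
passed = check_regression(baseline_score, candidate_score)
```

In practice the regression check would run in CI whenever the prompt or model changes, with the baseline score stored alongside the golden dataset.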
Types of graders
- Exact match — for structured outputs (JSON, code).
- LLM-as-judge — a stronger model rates the response 1-5 with a rubric.
- Embedding similarity — for "is this semantically close enough?"
- Human — slow but the gold standard. Use sparingly to calibrate the others.
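The three automated grader types might look like the sketch below. It makes several assumptions: embeddings are passed in as plain lists of floats (how you obtain them depends on your embedding provider), the 0.8 similarity threshold is an arbitrary starting point to calibrate against human labels, and the judge prompt format with its `SCORE:` line is a hypothetical convention, not a standard. The actual call to the judge model is left out.

```python
import json
import math
import re
from typing import Optional, Sequence


def json_exact_match(output: str, expected: str) -> float:
    """Exact match on parsed JSON, so key order and whitespace are ignored."""
    try:
        return 1.0 if json.loads(output) == json.loads(expected) else 0.0
    except json.JSONDecodeError:
        return 0.0  # unparseable output counts as a miss


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def embedding_grader(out_vec, exp_vec, threshold: float = 0.8) -> float:
    """1.0 if the two embeddings are at least `threshold` similar (assumed cutoff)."""
    return 1.0 if cosine_similarity(out_vec, exp_vec) >= threshold else 0.0


# Hypothetical judge prompt: rubric plus a fixed output format that is easy to parse.
JUDGE_TEMPLATE = (
    "Rate the response from 1 to 5 using this rubric:\n{rubric}\n\n"
    "Question: {question}\nResponse: {response}\n\n"
    "Explain briefly, then end with a line of the form 'SCORE: <1-5>'."
)


def build_judge_prompt(rubric: str, question: str, response: str) -> str:
    """Fill the template; send the result to the judge model of your choice."""
    return JUDGE_TEMPLATE.format(rubric=rubric, question=question, response=response)


def parse_judge_score(reply: str) -> Optional[int]:
    """Extract the 1-5 rating from the judge's reply, or None if it is missing."""
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None
```

Returning `None` for an unparseable judge reply, rather than a default score, keeps malformed judgments visible so they can be retried or routed to a human.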
Custom benchmarks in LLMAtlas
Upload a CSV of {input, expected} pairs. Pick three candidate models. Run. Get a scored leaderboard. The whole loop takes a few minutes and gives you a defensible answer to "which model should we use?"
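To make the CSV shape and the leaderboard idea concrete, here is a local sketch of the same loop. It does not use or represent the LLMAtlas API: the `input` and `expected` column names follow the pair format described above, while the three "models" are hypothetical lookup stubs and the scoring is a simple case-insensitive exact match.

```python
import csv
import io

# Example CSV content with the two columns described above.
CSV_TEXT = """input,expected
2 + 2,4
capital of France,Paris
opposite of hot,cold
"""


def load_pairs(csv_file):
    """Read (input, expected) pairs from a CSV file object with a header row."""
    return [(row["input"], row["expected"]) for row in csv.DictReader(csv_file)]


def score_model(model, pairs) -> float:
    """Fraction of pairs where the model output matches, ignoring case and padding."""
    hits = sum(
        1 for inp, exp in pairs
        if model(inp).strip().lower() == exp.strip().lower()
    )
    return hits / len(pairs)


def leaderboard(models, pairs):
    """Return (name, score) tuples sorted from best to worst."""
    scored = [(name, score_model(fn, pairs)) for name, fn in models.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def make_stub(answers):
    """Hypothetical stand-in for a model call: a fixed lookup table."""
    return lambda prompt: answers.get(prompt, "")


# Three candidate "models" (assumptions, named only for illustration).
models = {
    "model-a": make_stub({"2 + 2": "4", "capital of France": "Paris", "opposite of hot": "cold"}),
    "model-b": make_stub({"2 + 2": "4", "capital of France": "Lyon", "opposite of hot": "cold"}),
    "model-c": make_stub({"2 + 2": "5"}),
}

pairs = load_pairs(io.StringIO(CSV_TEXT))
board = leaderboard(models, pairs)
for name, score in board:
    print(f"{name}: {score:.2f}")
```

With a real file, `load_pairs(open("benchmark.csv", newline=""))` would replace the in-memory string; the filename is a placeholder.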