How to Choose a Model
There are over 1,000 publicly tracked LLMs. You will never benchmark all of them. Use this five-axis framework instead.
The five axes
| Axis | Question | Where it matters |
|---|---|---|
| Cost | $ per million tokens? | High-volume apps |
| Latency | Tokens/sec, time-to-first-token? | User-facing chat |
| Quality | MMLU / GSM8K / your eval? | Anything quality-sensitive |
| Context | How many tokens fit in the context window? | Long docs, RAG |
| Compliance | Where does data go? | Regulated industries |
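The cost axis is easiest to judge as a monthly figure rather than a per-token price. Below is a minimal sketch of that arithmetic. The function name `monthly_cost_usd`, the traffic numbers, and both prices are hypothetical placeholders, not quotes from any provider; substitute your own volumes and the current price sheet.

```python
def monthly_cost_usd(requests_per_day, input_tokens, output_tokens,
                     input_price_per_m, output_price_per_m, days=30):
    """Estimate monthly spend from per-million-token prices."""
    # Cost of one request: tokens times price, scaled from "per million tokens".
    per_request = (input_tokens * input_price_per_m
                   + output_tokens * output_price_per_m) / 1_000_000
    return requests_per_day * days * per_request


# Hypothetical workload and prices, for illustration only:
# 10,000 requests/day, 1,500 input + 500 output tokens each,
# $0.50 per million input tokens, $1.50 per million output tokens.
estimate = monthly_cost_usd(10_000, 1_500, 500, 0.50, 1.50)
print(f"${estimate:,.2f} per month")  # prints "$450.00 per month"
```

Running the same function for each shortlisted model puts their prices on a common scale for your actual traffic.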
The shortlist heuristic
Pick 3 candidates, run 5 of your real prompts through each, and rate every output from 1 to 5. The model with the highest average rating on your own prompts is the one to use. Public benchmarks are a starting point, not a verdict.
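The tallying step can be sketched in a few lines. This assumes you have already collected the 1-5 ratings by hand; the model names, the scores, and the helper `pick_winner` are all hypothetical and only illustrate the bookkeeping.

```python
from statistics import mean


def pick_winner(ratings):
    """ratings maps a model name to its list of 1-5 scores, one per prompt."""
    averages = {model: mean(scores) for model, scores in ratings.items()}
    # The candidate with the highest average rating wins.
    winner = max(averages, key=averages.get)
    return winner, averages


# Hypothetical ratings: 3 candidates x 5 real prompts.
ratings = {
    "model-a": [4, 5, 3, 4, 4],
    "model-b": [3, 4, 4, 3, 5],
    "model-c": [5, 4, 4, 5, 4],
}

winner, averages = pick_winner(ratings)
print(winner, averages)
```

With only five prompts, a gap of a few tenths between averages is weak evidence, so treat a near-tie as a reason to add prompts rather than as a decision.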
When in doubt
- Need it free, fast, and good? → Llama 3.3 70B via Groq.
- Need long context? → Gemini 1.5 Flash (1M-token context window, free tier available).
- Need top-tier reasoning? → DeepSeek V3 or R1 distills.
- Need code? → Qwen 2.5 Coder 32B.
Open the Comparison Lab to test any three of these on your prompts in under a minute.