Level 4 · Production Engineering
9 min

Choosing the Right Model

Frontier ≠ best for your job. The selection framework that actually works.

Picking the wrong model is one of the most expensive mistakes an AI team can make, in both dollars and delivery time. Yet many teams default to "use GPT-4" or "use Claude" without evaluating alternatives. Here's a better framework.

Cost vs Quality — the Pareto frontier of frontier models

[Figure: scatter plot with axes "Relative cost ($/1M output)" and "Quality". Models plotted: Llama 3.3 70B (Groq), Gemini 2.5 Flash, DeepSeek V3, Claude 4 Haiku, GPT-4.1 mini, Claude 4 Sonnet, Gemini 2.5 Pro, GPT-4.1, Claude 4 Opus, o3.]

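The frontier (dashed) connects the models for which no alternative is both cheaper and better. Models below the frontier are dominated: some other model delivers equal or higher quality at equal or lower cost. Pick the model on the frontier that matches your quality bar.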
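As a minimal sketch of what "dominated" means in practice, the frontier can be computed from a list of (cost, quality) points. The `ModelPoint` type, the `pareto_frontier` function, and the numbers below are illustrative placeholders, not measurements of any real model.

```python
from typing import NamedTuple


class ModelPoint(NamedTuple):
    name: str
    cost: float     # e.g. $ per 1M output tokens
    quality: float  # score on your own evaluation; higher is better


def pareto_frontier(points: list[ModelPoint]) -> list[ModelPoint]:
    """Return the non-dominated models, ordered from cheapest to most expensive."""
    frontier: list[ModelPoint] = []
    best_quality = float("-inf")
    # Walk from cheapest to priciest; on a cost tie, look at the better model first.
    for point in sorted(points, key=lambda p: (p.cost, -p.quality)):
        # Keep a model only if it beats every model that costs the same or less.
        if point.quality > best_quality:
            frontier.append(point)
            best_quality = point.quality
    return frontier


# Placeholder numbers for illustration only, not real measurements.
candidates = [
    ModelPoint("model-a", cost=1.0, quality=60.0),
    ModelPoint("model-b", cost=2.0, quality=55.0),  # dominated by model-a
    ModelPoint("model-c", cost=4.0, quality=75.0),
    ModelPoint("model-d", cost=8.0, quality=74.0),  # dominated by model-c
    ModelPoint("model-e", cost=20.0, quality=90.0),
]
frontier_names = [p.name for p in pareto_frontier(candidates)]
print(frontier_names)
```

With these placeholder points, model-b and model-d drop out because a cheaper model already scores higher.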

The five axes

Every model can be scored on five axes, and your job is to match weights to your use case:

  1. Quality — How accurate, coherent, capable is it? (Benchmarks: MMLU, HumanEval, MATH, IFEval)
  2. Speed — Tokens per second + time-to-first-token (TTFT)
  3. Cost — Input $/1M + output $/1M tokens
  4. Context — How much can it read in one call?
  5. Openness — Open weights vs API-only

A customer support bot needs: medium quality, high speed, low cost. A legal document analyser needs: top quality, slow OK, big context. A code copilot needs: high quality, low latency, code-specialised.
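One way to make "match weights to your use case" concrete is a weighted average over the five axes. This is a rough sketch under stated assumptions: every axis is pre-normalised to the range 0 to 1 with 1 being best (so a cost score of 1 means cheapest), and the candidate names, scores, and weights are invented for illustration.

```python
AXES = ("quality", "speed", "cost", "context", "openness")


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-axis scores, each in [0, 1] where 1 is best."""
    total_weight = sum(weights[axis] for axis in AXES)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[axis] * weights[axis] for axis in AXES) / total_weight


# Hypothetical candidates with made-up, pre-normalised scores.
candidates = {
    "small_fast": {"quality": 0.60, "speed": 0.9, "cost": 0.9, "context": 0.5, "openness": 1.0},
    "large_slow": {"quality": 0.95, "speed": 0.4, "cost": 0.2, "context": 0.9, "openness": 0.0},
}

# Hypothetical weight profiles for two of the use cases described above.
support_bot = {"quality": 2, "speed": 3, "cost": 3, "context": 1, "openness": 0}
legal_analyser = {"quality": 5, "speed": 0, "cost": 1, "context": 3, "openness": 0}


def best_for(weights: dict[str, float]) -> str:
    """Name of the candidate with the highest weighted score."""
    return max(candidates, key=lambda name: weighted_score(candidates[name], weights))


print(best_for(support_bot), best_for(legal_analyser))
```

The same two candidates rank differently under the two profiles, which is the point of weighting: the score depends on the job, not only on the model.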

The benchmark trap

Don't trust headline benchmarks blindly. They're useful as a coarse sort, then irrelevant. Real selection requires running your data through candidate models:

  1. Build a golden set of 50–200 real examples of your task
  2. Hand-write the ideal outputs
  3. Run each candidate model on the golden set
  4. Score the outputs (LLM-as-judge, regex match, manual review)
  5. Plot quality vs cost vs latency

Top of the leaderboard often isn't top on your data. A model that's 7th on MMLU might be 1st on your support tickets.
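The five steps above can be sketched as a small harness. This is an illustrative outline, not a full evaluation framework: each model is assumed to be a plain function from prompt to text (in practice a wrapper around an API call), scoring is a normalised exact match standing in for the regex or LLM-as-judge options, and the golden set and stub models are made up.

```python
import re
import time
from typing import Callable

GoldenSet = list[tuple[str, str]]  # (input, hand-written ideal output)


def normalise(text: str) -> str:
    """Collapse whitespace and case so trivial differences do not count as errors."""
    return re.sub(r"\s+", " ", text).strip().lower()


def evaluate(model: Callable[[str], str], golden: GoldenSet) -> dict[str, float]:
    """Run one model over the golden set and report accuracy and mean latency."""
    correct = 0
    latencies = []
    for prompt, ideal in golden:
        start = time.perf_counter()
        output = model(prompt)
        latencies.append(time.perf_counter() - start)
        if normalise(output) == normalise(ideal):
            correct += 1
    return {
        "accuracy": correct / len(golden),
        "mean_latency_s": sum(latencies) / len(latencies),
    }


# Tiny made-up golden set: support tickets and their ideal routing labels.
golden_set: GoldenSet = [
    ("Reset my password", "account"),
    ("Where is my order?", "shipping"),
    ("I was charged twice", "billing"),
]


def keyword_model(prompt: str) -> str:
    """Stub candidate: stands in for a real model call."""
    text = prompt.lower()
    if "password" in text:
        return "Account"
    if "order" in text:
        return "shipping"
    return "billing"


def constant_model(prompt: str) -> str:
    """Stub candidate that always gives the same answer."""
    return "billing"


results = {
    "keyword_model": evaluate(keyword_model, golden_set),
    "constant_model": evaluate(constant_model, golden_set),
}
print(results)
```

Swapping the stubs for real API wrappers and adding a cost column per model gives the quality, cost, and latency table that step 5 asks you to plot.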

The cost-quality frontier

For most tasks, several models sit on a Pareto frontier: more cost = more quality, with no free lunch. Examples in 2026:

  • Cheap end: Gemini 2.5 Flash (free tier), Llama 3.3 70B on Groq (free tier), DeepSeek V3 ($0.27/$1.10 per 1M input/output tokens)
  • Mid tier: Claude 4.5 Haiku ($0.80/$4), GPT-4.1 mini ($0.40/$1.60)
  • Frontier: Claude 4 Opus ($15/$75), GPT-4.1 ($2/$8), Gemini 2.5 Pro ($1.25/$5)

If a cheap model gets you 90% of the way, the frontier model often isn't worth 50× the cost.
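To see what that multiple means for a bill, here is a back-of-the-envelope calculation using the DeepSeek V3 and Claude 4 Opus prices listed above. The workload (1,500 input tokens and 500 output tokens per request, 100,000 requests per month) is an assumed example, not a figure from any real system.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost in USD of one request; prices are in $ per 1M tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# (input $/1M, output $/1M), taken from the list above.
PRICES = {
    "DeepSeek V3": (0.27, 1.10),
    "Claude 4 Opus": (15.0, 75.0),
}

# Assumed workload, for illustration only.
INPUT_TOKENS = 1_500
OUTPUT_TOKENS = 500
REQUESTS_PER_MONTH = 100_000

monthly = {
    name: request_cost(INPUT_TOKENS, OUTPUT_TOKENS, in_price, out_price) * REQUESTS_PER_MONTH
    for name, (in_price, out_price) in PRICES.items()
}
ratio = monthly["Claude 4 Opus"] / monthly["DeepSeek V3"]

for name, cost in monthly.items():
    print(f"{name}: ${cost:,.2f} per month")
print(f"ratio: {ratio:.1f}x")
```

Under these assumptions the monthly bill is about $95.50 for DeepSeek V3 versus $6,000 for Claude 4 Opus, roughly a 63× gap. The exact multiple shifts with the input/output mix, so it is worth rerunning with your own token counts.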

Strategy: model routing

Production systems often use multiple models:

  • Cheap model handles simple cases (classification, simple Q&A)
  • Frontier model handles hard cases (multi-step reasoning, ambiguous queries)
  • A small classifier decides which one to call

Companies like Martian, OpenRouter, and NotDiamond sell routing as a service. Or roll your own with a few-shot classifier.
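A roll-your-own router can start very small. The sketch below is an assumption-heavy illustration: `looks_hard` is a hypothetical keyword-and-length heuristic standing in for the small classifier (a few-shot LLM classifier would replace it), and both models are stub functions rather than real API clients.

```python
from typing import Callable

Model = Callable[[str], str]


def looks_hard(query: str) -> bool:
    """Hypothetical stand-in for the small classifier that labels a query as hard."""
    markers = ("why", "compare", "step by step", "explain")
    text = query.lower()
    return len(query.split()) > 40 or any(marker in text for marker in markers)


def route(query: str, cheap: Model, frontier: Model,
          is_hard: Callable[[str], bool] = looks_hard) -> str:
    """Send hard queries to the frontier model and everything else to the cheap one."""
    model = frontier if is_hard(query) else cheap
    return model(query)


# Stub models so the example runs without any API key.
def cheap_model(query: str) -> str:
    return f"[cheap] {query}"


def frontier_model(query: str) -> str:
    return f"[frontier] {query}"


simple_answer = route("What are your opening hours?", cheap_model, frontier_model)
hard_answer = route("Compare plan A and plan B and explain the trade-offs",
                    cheap_model, frontier_model)
print(simple_answer)
print(hard_answer)
```

Passing the classifier in as a parameter keeps the routing logic unchanged when the heuristic is later swapped for a trained or few-shot classifier.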

Practical decision tree

A quick decision tree for picking a default:

  • Task involves code? → Claude 4 Sonnet or Qwen Coder
  • Task involves long documents (>200K tokens)? → Gemini 2.5 Pro
  • Task is multi-step reasoning? → DeepSeek R1 (cheap) or o3 (paid)
  • Task is high-volume simple? → Gemini 2.5 Flash or Llama 3.3 on Groq
  • Task is creative writing? → Claude 4 Sonnet or GPT-4.1
  • Task needs the absolute best? → Claude 4 Opus

Then validate on your golden set before committing.
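The tree above can also be written down as a lookup function, which makes the default explicit and easy to review. This is only a sketch: the `Task` fields are hypothetical names, the checks run in the order the list gives them with the first match winning (an assumption, since the list does not say how to break ties), and an empty result means no default applies.

```python
from dataclasses import dataclass


@dataclass
class Task:
    # Hypothetical task description; field names are illustrative.
    involves_code: bool = False
    max_input_tokens: int = 0
    multi_step_reasoning: bool = False
    high_volume_simple: bool = False
    creative_writing: bool = False
    needs_absolute_best: bool = False


def default_candidates(task: Task) -> list[str]:
    """Return starting candidates for the golden-set evaluation, first match wins."""
    if task.involves_code:
        return ["Claude 4 Sonnet", "Qwen Coder"]
    if task.max_input_tokens > 200_000:
        return ["Gemini 2.5 Pro"]
    if task.multi_step_reasoning:
        return ["DeepSeek R1", "o3"]
    if task.high_volume_simple:
        return ["Gemini 2.5 Flash", "Llama 3.3 on Groq"]
    if task.creative_writing:
        return ["Claude 4 Sonnet", "GPT-4.1"]
    if task.needs_absolute_best:
        return ["Claude 4 Opus"]
    return []  # no default applies; go straight to the golden set


long_doc_picks = default_candidates(Task(max_input_tokens=350_000))
code_picks = default_candidates(Task(involves_code=True, creative_writing=True))
print(long_doc_picks, code_picks)
```

The output is a shortlist to evaluate, not a final choice.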

Knowledge Check

Q1. What's the most reliable way to choose between candidate models?

Q2. What is model routing?

Q3. When is a frontier model NOT worth its 50× cost premium?


LLMAtlas — The Open Ecosystem Workspace for LLMs