Reasoning Models
DeepSeek R1, o3, QwQ — a new class of model that thinks before it speaks.
Across 2024 and 2025 a new category of model emerged that sidestepped the familiar scale-up recipe. Instead of being made bigger, reasoning models were trained to think longer, spending more compute at inference time. The result: dramatic gains on math, code, and multi-step problems, sometimes 30+ percentage points over their base models.
Capability vs compute — and the regime shift with reasoning models
Old laws: more compute → smooth capability gains. New regime: RL training on verifiable rewards unlocks step-changes.
What makes a reasoning model different
A reasoning model is post-trained with reinforcement learning on verifiable rewards. The training loop:
- Give the model a math/code problem with a known answer
- Let it generate a long chain of thought + final answer
- If the answer is correct, reward the chain
- Repeat for millions of problems
The model learns to search and verify within its own context window. It generates 5,000-30,000 tokens of "thinking" — backtracking, checking work, exploring alternatives — before emitting a final answer.
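The training loop above can be sketched in a few lines. This is a toy illustration, not any lab's actual pipeline: `toy_policy` is a stand-in for the model, the `Answer:` marker is a hypothetical output convention, and the policy-gradient update that a real trainer would run on the collected rewards is omitted.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Problem:
    prompt: str
    answer: str  # known ground-truth answer used for verification


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the first line after the last 'Answer:' marker, or None.

    The 'Answer:' marker is a hypothetical convention for this sketch.
    """
    _, marker, tail = completion.rpartition("Answer:")
    tail = tail.strip()
    if not marker or not tail:
        return None
    return tail.splitlines()[0].strip()


def verifiable_reward(problem: Problem, completion: str) -> float:
    """Reward the whole chain with 1.0 only if the final answer is correct."""
    return 1.0 if extract_final_answer(completion) == problem.answer else 0.0


def toy_policy(problem: Problem, rng: random.Random) -> str:
    """Stand-in for the model: writes some 'thinking', then guesses a digit."""
    guess = rng.randint(0, 9)
    return f"Try {guess}. Let me check that... Answer: {guess}"


def collect_rollouts(
    problems: List[Problem], samples_per_problem: int = 4, seed: int = 0
) -> List[Tuple[str, str, float]]:
    """Sample completions and score them; a real trainer would then
    update the policy toward the high-reward chains."""
    rng = random.Random(seed)
    rollouts = []
    for problem in problems:
        for _ in range(samples_per_problem):
            completion = toy_policy(problem, rng)
            reward = verifiable_reward(problem, completion)
            rollouts.append((problem.prompt, completion, reward))
    return rollouts
```

Note that the reward looks only at the final answer. Nothing grades the intermediate steps directly, which is why behaviors like backtracking and self-checking have to emerge from the model rather than being scripted.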
The visible chain-of-thought
These models expose their reasoning to varying degrees. DeepSeek R1 emits a <think>...</think> block with its scratchwork before answering, and QwQ does the same. o3 keeps its raw chain of thought hidden and returns reasoning summaries via the API. Either way, you can watch some version of the model reasoning.
Sometimes this is fascinating ("ah, I made an error, let me reconsider..."). Sometimes it's embarrassing ("the user is asking about X but I'll pretend to know..."). Either way, it's a new layer of observability.
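If you receive the raw text with the <think>...</think> block inline, separating the scratchwork from the final answer is a small parsing job. The sketch below assumes the tags arrive verbatim in a single string; some serving stacks strip them or return the reasoning in a separate field, in which case this is unnecessary. The function name `split_reasoning` is made up for illustration.

```python
import re
from typing import Tuple

# Non-greedy match so only the first <think> block is captured;
# DOTALL lets the reasoning span multiple lines.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw: str) -> Tuple[str, str]:
    """Return (reasoning, answer) from raw model output.

    If no <think> block is present, reasoning is the empty string.
    """
    match = THINK_RE.search(raw)
    if match is None:
        return "", raw.strip()
    reasoning = match.group(1).strip()
    # The answer is whatever text remains once the block is cut out.
    answer = (raw[: match.start()] + raw[match.end():]).strip()
    return reasoning, answer
```

Keeping the two parts separate makes it easy to log the reasoning for debugging while showing users only the answer.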
When reasoning models win
They dominate when:
- The task has a verifiable answer (math, code, logic puzzles)
- Multi-step reasoning is required
- The problem can be decomposed
They tie or lose when:
- The task is creative writing
- Speed matters (they're 10-100× slower)
- The task is simple Q&A or summary
Cost and latency trade-offs
A reasoning model spends 5,000-30,000 tokens of internal thought. At $15/M tokens that's $0.075 to $0.45 per query. Compare to GPT-4.1 at $0.005 for the same query: reasoning models are roughly 15-90× more expensive per request, depending on how long they think.
Latency: 30-90 seconds typical for hard problems vs 2-5 seconds for standard models. Not a chat UX — more like an async tool you queue work for.
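The cost arithmetic is worth running yourself with your own prices. The sketch below uses only the figures quoted in this section ($15 per million tokens, 5,000-30,000 thinking tokens, a $0.005 baseline query); the function and constant names are illustrative, and it ignores prompt tokens and the final answer for simplicity.

```python
def thinking_cost_usd(thinking_tokens: int, price_per_million_usd: float) -> float:
    """Cost of the thinking tokens alone, in US dollars."""
    return thinking_tokens / 1_000_000 * price_per_million_usd


PRICE_PER_MILLION = 15.0   # $/M tokens, as quoted in the text
BASELINE_QUERY_USD = 0.005  # standard-model cost for the same query, as quoted

for tokens in (5_000, 30_000):
    cost = thinking_cost_usd(tokens, PRICE_PER_MILLION)
    ratio = cost / BASELINE_QUERY_USD
    print(f"{tokens:>6} thinking tokens: ${cost:.3f} per query, {ratio:.0f}x baseline")
```

Because cost is linear in thinking tokens, the length of the chain of thought is the main lever: a query that resolves in 5,000 tokens costs a sixth of one that needs 30,000.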
The major reasoning models in 2026
- OpenAI o3 — best overall, priced at the premium end at launch ($40/M output, since reduced). Multimodal.
- OpenAI o4-mini — 90% of o3 capability, 10% of the cost.
- DeepSeek R1 (0528 refresh) — open weights, free via OpenRouter, frontier-tier on math/code.
- Google Gemini 2.5 Pro Thinking — strong, generous free tier.
- Alibaba QwQ-32B — open weights, strong reasoning at small scale.
- Anthropic Claude Opus 4 (extended thinking) — reasoning is an optional mode you toggle on.
When to use them in production
Reasoning models go in your expensive lane: hard customer questions that need a real answer, code generation for non-trivial tasks, math/finance/science workflows. Use a routing classifier to send only the hard 10% of queries to a reasoning model; the easy 90% go to a fast cheap model.
The trick: spend reasoning model compute only where it matters.
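A router along these lines can start out very simple. The sketch below is a keyword-and-length heuristic, not a trained classifier, and everything in it is an assumption for illustration: the model names are placeholders, and the hint list and word threshold would need tuning against your own traffic.

```python
import re
from dataclasses import dataclass

# Placeholder model identifiers; substitute whatever your provider uses.
FAST_MODEL = "fast-cheap-model"
REASONING_MODEL = "reasoning-model"

# Hypothetical hints that a query needs multi-step reasoning.
HARD_HINTS = re.compile(
    r"\b(prove|derive|debug|optimi[sz]e|refactor|calculate|algorithm)\b",
    re.IGNORECASE,
)


@dataclass
class Route:
    model: str
    reason: str


def route(query: str, max_easy_words: int = 60) -> Route:
    """Send a query to the reasoning lane only if it looks hard."""
    if HARD_HINTS.search(query):
        return Route(REASONING_MODEL, "hard-task keyword")
    if len(query.split()) > max_easy_words:
        return Route(REASONING_MODEL, "long query")
    return Route(FAST_MODEL, "default")


if __name__ == "__main__":
    for q in ("What are your opening hours?",
              "Prove that the sum of two even numbers is even."):
        r = route(q)
        print(f"{r.model:<18} ({r.reason}): {q}")
```

In practice the heuristic would likely be replaced by a small classifier trained on labeled queries, but the shape stays the same: a cheap decision up front, and the expensive model only behind it.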