Temperature, top-p, top-k Explained
Three knobs control randomness. Knowing them stops you guessing.
After the model computes the next-token probability distribution, sampling picks which token to actually emit. Three parameters control how this happens, and they are easy to misconfigure if you don't know what each one does.
Temperature
Temperature reshapes the probability distribution before sampling:
- Temperature = 0 — Always pick the highest-probability token. Deterministic (mostly), often repetitive.
- Temperature = 1 — Sample from the raw distribution. Creative, varied, occasionally wrong.
- Temperature = 2 — Flattens the distribution; low-probability tokens become more likely. Chaotic.
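Mathematically: p_i = exp(logit_i / T) / sum_j exp(logit_j / T), that is, a softmax over the logits after dividing each one by T. Lower T sharpens the distribution; higher T flattens it. T = 0 would divide by zero, so implementations treat it as a special case and simply take the highest-probability token.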
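To make the formula concrete, here is a minimal sketch in plain Python. The function name softmax_with_temperature and the three logits are made up for illustration; real inference stacks do this on tensors over the whole vocabulary.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing each logit by the temperature."""
    if temperature <= 0:
        # T = 0 cannot be plugged into the formula; handle it as greedy argmax elsewhere.
        raise ValueError("temperature must be > 0")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max before exponentiating, for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical logits for three candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Running it shows the top token's share shrinking as T rises, while the ranking of the tokens never changes: temperature changes how much probability each token gets, not their order.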
Rules of thumb:
- Factual Q&A, code, structured output → 0 to 0.3
- Drafting prose, brainstorming → 0.6 to 0.9
- Creative fiction, poetry → 0.9 to 1.2
- Above 1.3 → output usually breaks down into incoherent text
Top-p (nucleus sampling)
Top-p limits sampling to the smallest set of tokens whose cumulative probability reaches p, then renormalizes over that set:
- top_p = 1.0 — Consider all tokens (no filter).
- top_p = 0.9 — Consider only the top tokens that together make up 90% of probability mass.
- top_p = 0.1 — Very aggressive filter; almost greedy.
Top-p adapts to the distribution shape: when the model is confident, only a few tokens qualify; when it's uncertain, more do. That adaptivity is its advantage over top-k, which keeps a fixed number of tokens no matter how the probability is spread.
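The adaptive behaviour is easier to see on two toy distributions. This is a sketch only: top_p_filter is a hypothetical helper, and the probabilities are invented for the example.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p,
    then renormalize. Returns {token_index: probability}."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

confident = [0.92, 0.05, 0.02, 0.01]  # one token dominates
uncertain = [0.30, 0.25, 0.25, 0.20]  # probability is spread out

print(len(top_p_filter(confident, 0.9)))  # 1
print(len(top_p_filter(uncertain, 0.9)))  # 4
```

With the same top_p = 0.9, the confident distribution keeps a single token and the uncertain one keeps all four.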
Top-k
Top-k caps consideration to the k most likely tokens:
- top_k = 1 — Greedy. Always pick the most likely.
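- top_k = 50 — A common default in open-source tooling (Hugging Face transformers uses it, for example); many hosted APIs don't expose top_k at all. Reasonable.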
- top_k = 0 — Usually means "disabled" — no filter.
Top-k is a hard cutoff; top-p is adaptive. Top-p is generally preferred.
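For comparison, a minimal top-k sketch on toy numbers. The helper name top_k_filter is hypothetical, and treating top_k <= 0 as "disabled" is an assumed convention that matches the list above; check what your library does.

```python
def top_k_filter(probs, top_k):
    """Keep the top_k most likely tokens and renormalize.
    top_k <= 0 is treated as 'disabled' (assumed convention)."""
    if top_k <= 0 or top_k >= len(probs):
        kept = list(range(len(probs)))
    else:
        order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        kept = order[:top_k]
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

confident = [0.92, 0.05, 0.02, 0.01]
uncertain = [0.30, 0.25, 0.25, 0.20]

print(sorted(top_k_filter(confident, 2)))  # [0, 1]
print(sorted(top_k_filter(uncertain, 2)))  # [0, 1]
```

With k = 2, both distributions keep exactly two tokens. In the confident case that lets through a token the model gave only 5%; in the uncertain case it discards 45% of the probability mass. That is the hard cutoff in action.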
How they combine
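When you set multiple sampling parameters, they're applied in sequence, and the sequence depends on the implementation. One ordering filters with top-k first, then top-p, then applies temperature to what's left and draws a sample. Others, such as Hugging Face transformers, apply temperature first and then filter with top-k and top-p. The order matters: if temperature comes first, a high temperature flattens the distribution and lets more tokens through the top-p filter. Check your library's documentation rather than assuming either order.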
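Here is a minimal sketch of one such pipeline, using the top-k, then top-p, then temperature order. That ordering is an assumption for illustration, since some libraries apply temperature first. The function sample_next_token and the logits are hypothetical.

```python
import math
import random

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(logits, temperature=0.7, top_p=0.95, top_k=0, rng=random):
    """Pick a token index: top-k, then top-p, then temperature, then draw.
    The step order is an assumption; real libraries differ."""
    if temperature < 0:
        raise ValueError("temperature must be >= 0")
    # Rank token indices from most to least likely.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if temperature == 0:
        return order[0]  # greedy: treat T = 0 as argmax
    # 1. top-k: keep the k highest-logit tokens (0 means disabled).
    if top_k > 0:
        order = order[:top_k]
    # 2. top-p: keep the smallest prefix whose probability mass reaches top_p.
    probs = softmax([logits[i] for i in order])
    kept, cumulative = [], 0.0
    for idx, p in zip(order, probs):
        kept.append(idx)
        cumulative += p
        if cumulative >= top_p:
            break
    # 3. temperature: rescale the surviving logits and renormalize.
    weights = softmax([logits[i] / temperature for i in kept])
    # 4. draw one token from what is left.
    return rng.choices(kept, weights=weights, k=1)[0]

logits = [3.0, 2.0, 1.0, -1.0]  # made-up logits for four tokens
rng = random.Random(0)
print([sample_next_token(logits, rng=rng) for _ in range(10)])
print(sample_next_token(logits, temperature=0))  # 0
```

With these logits and top_p = 0.95, the last token never survives the filter, so temperature only redistributes probability among the first three.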
In practice, tune either top-p or top-k, not both aggressively: stacking two tight filters makes it hard to tell which one is shaping the output. A common safe default:
temperature: 0.7
top_p: 0.95
top_k: 0 (disabled)
When determinism matters
For evals, regression tests, and reproducible workflows, set temperature = 0. Even then, true determinism isn't guaranteed — GPU non-determinism, KV cache differences, and batching can introduce small variations. For strict reproducibility, also pin the model version, the seed (if supported), and the API revision.
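The seed part can be illustrated locally. This sketch uses Python's own random number generator with a hypothetical sample_tokens helper; it shows only why a fixed seed makes sampling repeatable, and does not model the hardware and batching effects mentioned above.

```python
import random

def sample_tokens(probs, n, seed=None):
    """Draw n token indices from probs; a fixed seed makes the draws repeatable."""
    rng = random.Random(seed)
    return rng.choices(range(len(probs)), weights=probs, k=n)

probs = [0.5, 0.3, 0.2]  # made-up next-token distribution
run_a = sample_tokens(probs, 20, seed=1234)
run_b = sample_tokens(probs, 20, seed=1234)
print(run_a == run_b)  # True: same seed, same draws
```

Whether a hosted API accepts a seed, and how strongly it honours one, is provider-specific.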
What max_tokens actually does
max_tokens is the output length cap, measured in generated tokens. It doesn't make the model think harder or longer about the answer; it just sets a ceiling on how much it can produce before being cut off. Set it generously enough to hold a complete answer, tightly enough to prevent runaway generation.
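A toy generation loop shows the cap at work. Everything here is hypothetical: generate, make_toy_model, the "<eos>" stop token, and the "stop" / "length" labels are stand-ins for whatever your API reports.

```python
def generate(next_token, max_tokens, stop_token="<eos>"):
    """Call next_token() until it returns stop_token or max_tokens is reached."""
    output = []
    for _ in range(max_tokens):
        token = next_token()
        if token == stop_token:
            return output, "stop"  # the model finished on its own
        output.append(token)
    return output, "length"  # cut off by the cap, possibly mid-answer

def make_toy_model():
    """A fake model that always emits the same short answer."""
    tokens = iter(["The", "answer", "is", "42", ".", "<eos>"])
    return lambda: next(tokens)

print(generate(make_toy_model(), max_tokens=3))
# (['The', 'answer', 'is'], 'length')
print(generate(make_toy_model(), max_tokens=50))
# (['The', 'answer', 'is', '42', '.'], 'stop')
```

The cap never changes which tokens the model produces, only where the loop stops. A larger cap gives the same answer with room to finish.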