LLMAtlas — The Open Ecosystem Workspace for LLMs

After the model computes the next-token probability distribution, sampling picks which token to actually emit. Three parameters control how this happens. Most people set them wrong.

Temperature

Temperature reshapes the probability distribution before sampling:

Temperature = 0 — Always pick the highest-probability token. Deterministic (mostly), often repetitive.
Temperature = 1 — Sample from the raw distribution. Creative, varied, occasionally wrong.
Temperature = 2 — Flattens the distribution; low-probability tokens become more likely. Chaotic.

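Mathematically: p_i = exp(logit_i / T) / sum_j exp(logit_j / T), that is, a softmax over the logits divided by T. Lower T sharpens the distribution; higher T flattens it. T = 0 is undefined in this formula (division by zero), so implementations treat it as a special case and pick the argmax.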
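To make the formula concrete, here is a minimal sketch in plain Python. The function name `softmax_with_temperature` and the three logits are made up for illustration; a real model produces one logit per vocabulary entry.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing each logit by the temperature."""
    if temperature <= 0:
        # The formula divides by T, so T = 0 has to be handled as greedy argmax elsewhere.
        raise ValueError("temperature must be > 0")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtracting the max avoids overflow in exp()
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

logits = [2.0, 1.0, 0.0]  # illustrative values only
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}:", [round(p, 3) for p in probs])
```

Running it shows the top token taking almost all of the mass at T = 0.2 and the three tokens moving closer together at T = 2.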

Rules of thumb:

Factual Q&A, code, structured output → 0 to 0.3
Drafting prose, brainstorming → 0.6 to 0.9
Creative fiction, poetry → 0.9 to 1.2
Above 1.3 → output usually degrades into incoherent text

Top-p (nucleus sampling)

Top-p limits sampling to the smallest set of highest-probability tokens whose cumulative probability reaches p, then renormalizes over that set:

top_p = 1.0 — Consider all tokens (no filter).
top_p = 0.9 — Consider only the top tokens that together make up 90% of probability mass.
top_p = 0.1 — Very aggressive filter; almost greedy.

Top-p adapts to the distribution shape: when the model is confident, only a few tokens qualify; when it's uncertain, more do. A fixed k cannot make that adjustment.
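The adaptive behavior is easy to see in a small sketch. The helper `top_p_filter` and both example distributions are hypothetical; the cutoff here uses "reaches p" (>=), and real libraries differ slightly in how they treat the boundary token.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top-ranked tokens whose cumulative mass reaches p.

    Returns {token_index: renormalized_probability} for the kept tokens.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

confident = [0.95, 0.03, 0.01, 0.01]            # one dominant token
uncertain = [0.3, 0.25, 0.2, 0.17, 0.05, 0.03]  # mass spread over several tokens

print(len(top_p_filter(confident, 0.9)))  # 1 token survives
print(len(top_p_filter(uncertain, 0.9)))  # 4 tokens survive
```

The same top_p = 0.9 keeps one candidate in the confident case and four in the uncertain one.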

Top-k

Top-k caps consideration to the k most likely tokens:

top_k = 1 — Greedy. Always pick the most likely.
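top_k = 50 — A common default in open-source tooling (Hugging Face transformers, for example); many hosted APIs don't expose top_k at all. Reasonable.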
top_k = 0 — Usually means "disabled" — no filter.

Top-k is a hard cutoff; top-p is adaptive. Top-p is generally preferred.
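A short sketch of the hard cutoff, assuming a hypothetical helper `top_k_filter` and made-up distributions. It treats k = 0 as "disabled", matching the convention described above.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize; k <= 0 disables the filter."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if k > 0:
        order = order[:k]
    mass = sum(probs[i] for i in order)
    return {i: probs[i] / mass for i in order}

confident = [0.95, 0.03, 0.01, 0.01]            # one dominant token
uncertain = [0.3, 0.25, 0.2, 0.17, 0.05, 0.03]  # mass spread over several tokens

# k = 3 keeps exactly three tokens no matter how the mass is distributed.
print(sorted(top_k_filter(confident, 3)))  # [0, 1, 2]
print(sorted(top_k_filter(uncertain, 3)))  # [0, 1, 2]
```

In the confident case k = 3 lets two very unlikely tokens through; in the uncertain case it drops token 3 even though it holds 17% of the mass. That rigidity is the usual argument for top-p.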

How they combine

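When you set multiple sampling parameters, they're applied in sequence, and the sequence depends on the implementation. A common order (used by Hugging Face transformers, for example) is: temperature rescales the logits first, then top-k filters, then top-p filters the survivors, then a sample is drawn. Some runtimes apply temperature after the filters instead. The order matters because temperature changes how much probability mass the top tokens hold, and therefore how many tokens fit inside the top-p nucleus.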

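In practice, tune one knob at a time: adjust temperature or top_p, and avoid setting both (or top_p and top_k together) to aggressive values, because the filters compound. A common safe default: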

temperature: 0.7
top_p: 0.95
top_k: 0 (disabled)
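Here is a sketch of a full sampling step using those defaults. It assumes one particular order (temperature, then top-k, then top-p, then the draw); runtimes differ on this, so treat the ordering as an assumption, not a rule. The function name `sample_token` and the logits are hypothetical.

```python
import math
import random

def sample_token(logits, temperature=0.7, top_p=0.95, top_k=0, rng=random):
    """Return the index of one sampled token. Assumed order: temperature, top-k, top-p."""
    if temperature == 0:
        # Greedy special case: no randomness involved.
        return max(range(len(logits)), key=lambda i: logits[i])

    # 1. Temperature-scaled softmax (max subtracted for numerical stability).
    peak = max(logits)
    weights = [math.exp((z - peak) / temperature) for z in logits]
    total = sum(weights)
    probs = [w / total for w in weights]

    # 2. Top-k: keep the k most likely tokens (0 means disabled).
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    mass = sum(probs[i] for i in order)

    # 3. Top-p over the renormalized survivors: stop once the nucleus reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break

    # 4. Draw one token; random.choices normalizes the weights itself.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

logits = [3.0, 2.5, 1.0, -1.0, -4.0]  # illustrative values only
rng = random.Random(0)
print([sample_token(logits, rng=rng) for _ in range(10)])
```

Passing a `random.Random` instance as `rng` keeps the draws independent of any other code that uses the global generator.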

When determinism matters

For evals, regression tests, and reproducible workflows, set temperature = 0. Even then, true determinism isn't guaranteed: floating-point non-determinism in GPU kernels, KV cache differences, and batching can shift the logits slightly, and a near-tie between two tokens can then resolve differently. For strict reproducibility, also pin the model version, the seed (if supported), and the API revision.
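What a seed does and does not buy you can be sketched with a toy sampler. The helper `draw_tokens` and the distribution are made up; the example covers only the random draw, not numerical noise in the logits themselves.

```python
import random

def draw_tokens(probs, n, seed):
    """Draw n token indices from a fixed distribution using a private, seeded generator."""
    rng = random.Random(seed)  # private generator: unaffected by other users of `random`
    return rng.choices(range(len(probs)), weights=probs, k=n)

probs = [0.5, 0.3, 0.2]  # illustrative values only
run_a = draw_tokens(probs, 20, seed=1234)
run_b = draw_tokens(probs, 20, seed=1234)
print(run_a == run_b)  # True: same seed and same probabilities give the same draws
```

The guarantee holds only while the probabilities are bit-for-bit identical between runs, which is exactly what GPU and batching effects can break.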

What max_tokens actually does

max_tokens is the output length cap, measured in generated tokens. It doesn't make the model think harder or longer about the answer; it just sets a ceiling on how much it can produce before being cut off. Set it generously enough to hold a complete answer, tightly enough to prevent runaway generation.
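The ceiling behavior can be sketched with a toy generation loop. Everything here is hypothetical: `generate`, the scripted stand-in for a model, and the "stop" / "length" labels, which mirror the finish reasons some APIs report.

```python
def generate(next_token, max_tokens, eos="<eos>"):
    """Call next_token until it emits eos or max_tokens tokens have been produced."""
    output = []
    for _ in range(max_tokens):
        token = next_token(output)
        if token == eos:
            return output, "stop"    # the model finished on its own
        output.append(token)
    return output, "length"          # the cap cut generation off

def scripted_model(so_far):
    """Stand-in for a real model: replays a fixed answer one token at a time."""
    script = ["The", "answer", "is", "42", "<eos>"]
    return script[len(so_far)]

print(generate(scripted_model, max_tokens=10))  # (['The', 'answer', 'is', '42'], 'stop')
print(generate(scripted_model, max_tokens=2))   # (['The', 'answer'], 'length')
```

The model emits the same tokens in both runs; the smaller cap only truncates the result. When a response ends for the "length" reason, the cap was too tight for the answer.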