Pretraining → Fine-tuning → RLHF
How an LLM is born, taught manners, and learns to follow instructions.
Building a modern chat LLM is a three-stage pipeline. Each stage has a different goal, a different dataset, and a different cost.
Stage 1: Pretraining
The model is dropped into a sea of text (basically the public internet, plus books, code, and papers) and trained on the next-token prediction game. That's it. No hand-written labels and no human supervision: the text itself supplies the "right answer," because the target at each position is simply the token that actually comes next. Just: read everything, predict the next token.
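To make the next-token game concrete, here is a minimal sketch of the loss it minimises, in plain Python. The 4-token vocabulary, the token ids, and the hand-written logits are all made up for illustration; a real model would produce the logits from a neural network over a vocabulary of tens of thousands of tokens.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, token_ids):
    """Average cross-entropy of predicting token t+1 from position t."""
    targets = token_ids[1:]  # the "label" at each position is just the next token
    losses = []
    for logits, target in zip(logits_per_position, targets):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Hypothetical toy data: vocabulary of 4 tokens, a 3-token document.
token_ids = [0, 2, 1]

# A model that knows nothing assigns equal scores to every token.
uniform = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
# A model that has learned the document scores the true next token highly.
confident = [[0.0, 0.0, 5.0, 0.0], [0.0, 5.0, 0.0, 0.0]]

loss_uniform = next_token_loss(uniform, token_ids)      # equals log(4)
loss_confident = next_token_loss(confident, token_ids)  # much closer to 0

print(loss_uniform, loss_confident)
```

Training nudges the weights so that this number goes down, averaged over the whole corpus.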
This stage is where the model learns:
- Grammar, syntax, style
- Facts (encoded statistically into its weights)
- Code patterns
- Common-sense relationships
It's also where almost all the cost lives. Pretraining a frontier model costs tens to hundreds of millions of dollars in GPU time. Meta reports training Llama 3.1 405B on over 16,000 H100 GPUs, for months. The result is a base model: useful, but not chat-shaped. Ask a base model "what is the capital of France?" and it might respond "What is the capital of Germany? What is the capital of Italy?" It just continues the document instead of answering.
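A rough back-of-envelope calculation shows how a cluster of that size reaches the "tens of millions" range. The 90-day duration and the $2 per GPU-hour price below are assumptions chosen for illustration, not reported figures; real costs depend on hardware contracts, utilisation, and failed runs.

```python
def pretraining_cost_usd(num_gpus, days, usd_per_gpu_hour):
    """GPU-hours multiplied by an hourly price; ignores storage, staff, and reruns."""
    gpu_hours = num_gpus * days * 24
    return gpu_hours * usd_per_gpu_hour

# Assumed values: 16,000 GPUs, 90 days, $2.00 per GPU-hour.
cost = pretraining_cost_usd(num_gpus=16_000, days=90, usd_per_gpu_hour=2.0)
print(f"${cost:,.0f}")
```

Under these assumptions the run comes to 34.56 million GPU-hours, or roughly $69 million.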
Stage 2: Supervised fine-tuning (SFT)
Now we teach manners. We collect tens of thousands of examples written by humans:
User: What is the capital of France?
Assistant: The capital of France is Paris.
The model fine-tunes on these examples — basically the same next-token training, but on a much smaller, curated corpus of "good behaviour." After SFT, the model can hold a conversation. But it might still be willing to help with harmful requests, or just be unhelpful in subtle ways.
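One common way to set this up (a sketch, not necessarily what any particular lab does) is to concatenate the user prompt and the assistant reply into one sequence and compute the loss only on the assistant's tokens. The token ids and per-token losses below are made up, and the `-100` marker is borrowed from the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE = -100  # marker for "do not train on this position" (assumed convention)

def build_sft_example(prompt_ids, response_ids):
    """Concatenate prompt and response; hide prompt positions from the loss."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

def masked_mean(per_token_losses, labels):
    """Average the loss over response tokens only."""
    kept = [loss for loss, y in zip(per_token_losses, labels) if y != IGNORE]
    return sum(kept) / len(kept)

# Hypothetical token ids for "User: ..." (prompt) and "Assistant: ..." (response).
input_ids, labels = build_sft_example(prompt_ids=[5, 6, 7], response_ids=[8, 9])

# Hypothetical per-position losses; the first three belong to the prompt.
sft_loss = masked_mean([2.0, 2.0, 2.0, 0.5, 0.3], labels)

print(labels, sft_loss)
```

The model still reads the prompt as context, but only the quality of the reply affects the gradient. The one-position shift between inputs and targets is left out here for brevity.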
Stage 3: RLHF (Reinforcement Learning from Human Feedback)
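This is where the model learns preferences. Annotators look at pairs of model responses to the same prompt and pick which one is better. Those preferences train a reward model: a separate neural net, often initialised from the LLM itself, that takes a prompt and a response and predicts "how good is this response?" Then the LLM is fine-tuned with reinforcement learning to maximise that reward, usually with a penalty for drifting too far from the SFT model so that it cannot simply exploit quirks of the reward model.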
The result: a model that's not just capable, but aligned with what users (and the lab) consider helpful, honest, and harmless.
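The reward-model step can be sketched with the pairwise (Bradley-Terry style) loss that is commonly used for it: the loss is small when the preferred response scores higher than the rejected one. The scores below are made-up numbers standing in for the output of a reward model.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))

def pairwise_reward_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)) for one annotated pair."""
    margin = reward_chosen - reward_rejected
    return softplus(-margin)

# Hypothetical reward scores for the response the annotator picked vs. the other one.
loss_agrees = pairwise_reward_loss(reward_chosen=2.0, reward_rejected=0.0)
loss_disagrees = pairwise_reward_loss(reward_chosen=0.0, reward_rejected=2.0)
loss_tied = pairwise_reward_loss(reward_chosen=1.0, reward_rejected=1.0)  # equals log(2)

print(loss_agrees, loss_tied, loss_disagrees)
```

Minimising this over many annotated pairs pushes the reward model to rank responses the way the annotators did.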
DPO and friends (the modern version)
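Old-school RLHF with PPO is complex to run and can be unstable and slow. Many labs now use simpler variants: DPO (Direct Preference Optimisation), ORPO (Odds Ratio Preference Optimisation), KTO (Kahneman-Tversky Optimisation). They sidestep the explicit reward model and learn directly from preference data: pairs of responses for DPO and ORPO, individual responses labelled good or bad for KTO. Comparable results in many settings, far simpler training.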
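As an illustration of how the reward model disappears, here is a minimal sketch of the DPO loss for a single preference pair. It compares how much more likely the model being trained (the policy) makes each response relative to a frozen reference model, typically the SFT model. The log-probabilities and the `beta` value below are invented for the example.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return softplus(-beta * (chosen_logratio - rejected_logratio))

# Hypothetical sequence log-probabilities under the reference model.
ref_chosen, ref_rejected = -12.0, -11.0

# Before any training the policy equals the reference, so the loss is log(2).
loss_start = dpo_loss(ref_chosen, ref_rejected, ref_chosen, ref_rejected)

# After training, the policy makes the chosen response more likely
# and the rejected response less likely than the reference does.
loss_trained = dpo_loss(-9.0, -15.0, ref_chosen, ref_rejected)

print(loss_start, loss_trained)
```

The `beta` parameter controls how strongly the policy is held close to the reference model, playing a role similar to the drift penalty in PPO-based RLHF.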
The takeaway
When you talk to ChatGPT, Claude, or Gemini, you're talking to a model that has been:
- Pretrained on humanity's text
- Fine-tuned on curated examples of good behaviour
- Aligned via preference learning to be helpful
Each stage changes the model's personality, and the later stages can be redone or replaced by starting again from the base model. That's why open-weight base models exist: you can do your own SFT and RLHF, on your own data, for your own purposes.