Pretraining → Fine-tuning → RLHF

How an LLM is born, taught manners, and learns to follow instructions.

Building a modern chat LLM is a three-stage pipeline. Each stage has a different goal, a different dataset, and a different cost.

Stage 1: Pretraining

The model is dropped into a sea of text (much of the public web, plus books, code, and papers) and trained on one game: next-token prediction. That's it. No human-written labels and no human supervision. The text itself supplies the right answer, because the target at every position is simply the token that actually comes next. Just: read everything, predict the next token.
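To make the next-token game concrete, here is a minimal sketch. It assumes a toy bigram counter as a stand-in for the neural network and whitespace splitting as a stand-in for a real tokenizer; the function names are invented for illustration. What it shares with real pretraining is the objective: the training targets come from the text itself, and the loss is the negative log-probability of the token that actually came next.

```python
import math
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, prev):
    """Turn the follow-up counts for `prev` into a probability distribution."""
    total = sum(counts[prev].values())
    return {tok: n / total for tok, n in counts[prev].items()}

def average_loss(counts, tokens):
    """Mean cross-entropy: -log p(actual next token), averaged over positions."""
    losses = []
    for prev, nxt in zip(tokens, tokens[1:]):
        # Tiny floor so an unseen continuation does not produce log(0).
        p = next_token_probs(counts, prev).get(nxt, 1e-9)
        losses.append(-math.log(p))
    return sum(losses) / len(losses)

# Toy corpus: the "labels" are just the same text shifted by one position.
corpus = "the cat sat on the mat . the cat ate .".split()
model = train_bigram(corpus)

print(next_token_probs(model, "the"))
print(average_loss(model, corpus))
```

In this corpus "the" is followed by "cat" twice and "mat" once, so the toy model assigns "cat" a probability of 2/3. A real LLM replaces the count table with a transformer and learns by gradient descent, but it is scored the same way.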

This stage is where the model learns:

  • Grammar, syntax, style
  • Facts (encoded statistically into its weights)
  • Code patterns
  • Common-sense relationships

It's also where almost all the cost lives. Pretraining a frontier model costs tens to hundreds of millions of dollars in GPU time. Llama 3.1 405B was trained on roughly 16,000 H100 GPUs for months. The result is a base model: useful, but not chat-shaped. Ask a base model "what is the capital of France?" and it might respond with "What is the capital of Germany? What is the capital of Italy?" because it just continues the document instead of answering it.
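A back-of-the-envelope check on the "16,000 GPUs for months" figure can be sketched as below. Only the GPU count comes from this chapter. The token count, the per-GPU peak throughput, the utilisation, and the "6 FLOPs per parameter per token" rule of thumb are all assumptions brought in for illustration, so treat the result as an order-of-magnitude estimate rather than a reported number.

```python
# Only GPUS comes from the chapter; every other constant is an assumption.
PARAMS = 405e9                 # parameters in Llama 3.1 405B
TOKENS = 15.6e12               # ASSUMED number of training tokens
GPUS = 16_000                  # GPU count quoted in the text
PEAK_FLOPS_PER_GPU = 989e12    # ASSUMED H100 dense BF16 peak, in FLOP/s
UTILISATION = 0.40             # ASSUMED fraction of peak actually achieved

def training_flops(params, tokens):
    """Rule of thumb: about 6 FLOPs per parameter per training token."""
    return 6 * params * tokens

def training_days(total_flops, gpus, peak_flops, utilisation):
    """Wall-clock days if the whole cluster runs at the given utilisation."""
    seconds = total_flops / (gpus * peak_flops * utilisation)
    return seconds / 86_400

flops = training_flops(PARAMS, TOKENS)
days = training_days(flops, GPUS, PEAK_FLOPS_PER_GPU, UTILISATION)
print(f"{flops:.2e} FLOPs, roughly {days:.0f} days")
```

Under these assumptions the estimate lands in the range of a couple of months, which is consistent with the "for months" claim. Halving the assumed utilisation doubles the time, which is why these estimates are only good to within a factor of two or so.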

Stage 2: Supervised fine-tuning (SFT)

Now we teach manners. We collect tens of thousands of example conversations written by humans, each pairing a prompt with the response we want:

User: What is the capital of France?
Assistant: The capital of France is Paris.

The model is fine-tuned on these examples with essentially the same next-token training, but on a much smaller, curated corpus of "good behaviour." After SFT, the model can hold a conversation. But it might still be willing to help with harmful requests, or be unhelpful in subtle ways.
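Here is a rough sketch of how one SFT example might be prepared. It assumes the "User:" / "Assistant:" markers from the example above as a hypothetical chat template, whitespace splitting as a stand-in for a real tokenizer, and an invented `<eos>` end marker. It also assumes a common (but not universal) choice the chapter does not spell out: the loss is computed only on the assistant's tokens, so the model learns to write answers instead of learning to write user prompts.

```python
def build_sft_example(user_text, assistant_text):
    """Return (tokens, loss_mask) for one chat example.

    The template and the "<eos>" marker are hypothetical; real models
    each define their own special tokens.
    """
    prompt = f"User: {user_text}\nAssistant:".split()
    answer = assistant_text.split() + ["<eos>"]
    tokens = prompt + answer
    # 0 = context only (ignored by the loss), 1 = token the model is trained on
    loss_mask = [0] * len(prompt) + [1] * len(answer)
    return tokens, loss_mask

tokens, mask = build_sft_example(
    "What is the capital of France?",
    "The capital of France is Paris.",
)
for tok, m in zip(tokens, mask):
    print(m, tok)
```

The training step itself is the same next-token loss as in pretraining, just averaged over the positions where the mask is 1.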

Stage 3: RLHF (Reinforcement Learning from Human Feedback)

This is where the model learns preferences. Annotators look at pairs of model responses to the same prompt and pick which one is better. Those preferences train a reward model: a separate neural net, often initialised from the LLM itself, that predicts "how good is this response?" Then the LLM is fine-tuned with reinforcement learning to maximise that reward, usually with a penalty that keeps it from drifting too far from the SFT model.

The result: a model that's not just capable, but aligned with what users (and the lab) consider helpful, honest, and harmless.
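The reward-model step described above is usually trained with a pairwise loss. The sketch below assumes the common Bradley-Terry formulation, where the loss is -log sigmoid(r_chosen - r_rejected). The function name and the example scores are made up for illustration, and in practice the two rewards would come from a neural net scoring each response.

```python
import math

def pairwise_loss(reward_chosen, reward_rejected):
    """Bradley-Terry style loss: -log sigmoid(reward_chosen - reward_rejected).

    Small when the preferred response already scores higher,
    large when the reward model ranks the pair the wrong way round.
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

print(pairwise_loss(2.0, 0.0))   # pair ranked correctly
print(pairwise_loss(0.0, 0.0))   # reward model is undecided
print(pairwise_loss(0.0, 2.0))   # pair ranked the wrong way round
```

Minimising this loss over many annotated pairs pushes the reward model to score preferred responses higher, which is the signal the reinforcement learning step then maximises.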

DPO and friends (the modern version)

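Classic RLHF with PPO is finicky and expensive: it needs a separate reward model, sampling from the LLM during training, and careful tuning to stay stable. Many labs now use simpler variants such as DPO (Direct Preference Optimisation), ORPO (Odds Ratio Preference Optimisation), and KTO (Kahneman-Tversky Optimisation). They skip the explicit reward model and train the LLM directly on preference data: pairs of better and worse responses for DPO and ORPO, individual responses labelled good or bad for KTO. Similar goal, far simpler training.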
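As a rough illustration of how DPO learns directly from a preference pair, the sketch below computes the DPO loss for a single example. It assumes you already have the summed log-probability of each response under the model being trained (the policy) and under a frozen copy of the SFT model (the reference). The function name, the example numbers, and the `beta` value are illustrative choices, not settings from any particular lab.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    All four inputs are sequence log-probabilities. The loss falls when the
    policy raises the chosen response, relative to the reference model,
    by more than it raises the rejected one.
    """
    chosen_gain = policy_chosen - ref_chosen
    rejected_gain = policy_rejected - ref_rejected
    logits = beta * (chosen_gain - rejected_gain)
    # Numerically stable form of -log sigmoid(logits).
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# Policy identical to the reference: no preference has been learned yet.
print(dpo_loss(-11.0, -11.0, -11.0, -11.0))
# Policy has shifted probability toward the chosen response.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

Note that no reward model appears anywhere: the difference in log-probability gains plays the role that the reward difference plays in classic RLHF.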

The takeaway

When you talk to ChatGPT, Claude, or Gemini, you're talking to a model that has been:

  1. Pretrained on humanity's text
  2. Fine-tuned on curated examples of good behaviour
  3. Aligned via preference learning to be helpful

Each stage changes the model's personality, and the later stages can be redone or replaced by starting again from the base model. That's why openly released base models exist: you can run your own SFT and preference tuning, on your own data, for your own purposes.

Knowledge Check

Q1. Which stage of training accounts for almost all the GPU cost?

Q2. Why is a raw pretrained 'base model' not directly usable as a chatbot?

Q3. What does RLHF actually optimise?

Q4. Why are DPO/ORPO replacing classical RLHF (PPO)?
