Level 5 · Frontier & Mastery
10 min

Fine-Tuning: LoRA, QLoRA, DPO, ORPO

When prompting isn't enough — change the model itself.

Fine-tuning updates a model's weights on your data. It's the biggest hammer in the AI toolbox — and the most commonly misused. Reach for it only when prompting and RAG have failed.

Full fine-tuning vs LoRA: similar results on many tasks, very different cost

Full fine-tuning: all weights updated

  • Roughly 8× the GPU memory needed to hold the model
  • Hours to days of training
  • Risk of catastrophic forgetting

LoRA: base model frozen, small adapters trained

  • Roughly 1× the GPU memory (frozen weights plus small adapters)
  • Minutes to hours of training
  • Adapters are swappable

LoRA freezes the large pre-trained model and trains small low-rank adapters instead, cutting memory and training cost sharply with little or no quality loss on most tasks.

When fine-tuning is the right answer

Three legit reasons to fine-tune:

  1. Format conformance: your output needs a very specific structure the model can't reliably hit with prompting alone
  2. Domain specialisation: fields such as finance, medicine or law, where vocabulary and style differ substantially from general training data
  3. Cost reduction at scale: a 7B model fine-tuned to match GPT-4 on your specific task can cut per-inference cost by as much as 100×

What it's not good for:

  • Adding knowledge (use RAG)
  • Improving general capability (fine-tuning shapes behaviour; it won't make the model broadly smarter)
  • Fixing hallucinations (often makes them worse)

Full fine-tuning vs PEFT

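Full fine-tuning updates every weight. It's expensive (roughly 8× the GPU memory needed just to hold the model, because every weight also carries gradients and optimiser state), risks catastrophic forgetting, and is rarely worth it for application teams today.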
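
Where does "8×" come from? A common rule of thumb for mixed-precision training with the Adam optimiser is about 16 bytes per trained parameter (2 for fp16 weights, 2 for fp16 gradients, 4 for an fp32 master copy, 8 for Adam's two fp32 moment estimates), against 2 bytes per parameter just to hold fp16 weights. The sketch below applies that accounting. It ignores activations, and the 0.5% trainable fraction for LoRA and the 0.5-byte "4-bit base" are illustrative assumptions, not measurements.

```python
# Back-of-envelope GPU memory for fine-tuning, ignoring activations and buffers.
# Assumption: mixed-precision Adam, ~16 bytes per *trained* parameter:
#   2 (fp16 weights) + 2 (fp16 grads) + 4 (fp32 master weights) + 8 (Adam m and v).
BYTES_PER_TRAINED_PARAM = 16
BYTES_FP16 = 2


def weights_only_gb(n_params: float, bytes_per_param: float = BYTES_FP16) -> float:
    """Memory just to hold the weights (roughly what inference needs)."""
    return n_params * bytes_per_param / 1e9


def full_finetune_gb(n_params: float) -> float:
    """Every parameter is trained, so every parameter carries full training state."""
    return n_params * BYTES_PER_TRAINED_PARAM / 1e9


def lora_gb(n_params: float, trainable_fraction: float,
            base_bytes: float = BYTES_FP16) -> float:
    """Frozen base (no grads, no optimiser state) plus full training state for adapters.

    base_bytes=0.5 approximates a 4-bit quantised base, as in QLoRA
    (ignoring the small overhead of quantisation constants)."""
    frozen = n_params * base_bytes
    adapters = n_params * trainable_fraction * BYTES_PER_TRAINED_PARAM
    return (frozen + adapters) / 1e9


n = 7e9
print(f"7B fp16 weights:   {weights_only_gb(n):.0f} GB")          # 14 GB
print(f"7B full fine-tune: {full_finetune_gb(n):.0f} GB")         # 112 GB (8x)
print(f"7B LoRA (0.5%):    {lora_gb(n, 0.005):.1f} GB")           # 14.6 GB
print(f"70B, 4-bit base:   {lora_gb(70e9, 0.005, 0.5):.1f} GB")   # 40.6 GB
```

Activations, sequence length and batch size add memory on top of every one of these numbers, so treat them as lower bounds.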

Parameter-Efficient Fine-Tuning (PEFT) trains only a small number of parameters and leaves the rest of the model frozen:

  • LoRA (Low-Rank Adaptation): freeze the base model, add small low-rank "adapter" matrices (~0.1-1% of the parameter count) and train only those. Trains in hours on consumer GPUs and, for most tasks, comes close to full fine-tuning quality.
  • QLoRA: the same as LoRA, but the frozen base model is quantised to 4 bits while the adapters train in higher precision. Lets you fine-tune a 70B model on a single 80 GB A100.
  • DoRA: a LoRA variant that decomposes each weight into magnitude and direction and applies the low-rank update to the direction. Usually a marginal improvement over LoRA.
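
Concretely, LoRA keeps a weight matrix W frozen and learns two small matrices, B (d_out × r) and A (r × d_in), so the layer computes Wx + (α/r)·B(Ax). B starts at zero, so training begins exactly at the base model's behaviour. The pure-Python sketch below shows that forward pass and the parameter count for one layer; the class name, rank and α values are illustrative, and there is no training loop.

```python
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]


class LoRALinear:
    """Frozen weight W (d_out x d_in) plus a trainable low-rank update scale * B @ A."""

    def __init__(self, W, r, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W  # frozen: never updated during fine-tuning
        # A (r x d_in) gets small random values; B (d_out x r) starts at zero,
        # so the adapted layer initially behaves exactly like the base layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d_out)]
        self.scale = alpha / r

    def forward(self, x):
        base = matvec(self.W, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]


def lora_param_count(d_out, d_in, r):
    """Trainable parameters for one adapted matrix: B has d_out*r, A has r*d_in."""
    return r * (d_out + d_in)


layer = LoRALinear([[1.0, 0.0], [0.0, 1.0]], r=1, alpha=2)
print(layer.forward([3.0, 4.0]))  # [3.0, 4.0]: identical to the frozen base at init

# A 4096 x 4096 projection at rank 8:
full = 4096 * 4096
lora = lora_param_count(4096, 4096, 8)
print(f"{lora:,} of {full:,} weights trainable ({lora / full:.2%})")  # 0.39%
```

Only A and B would receive gradients during training; W is read but never written, which is why its optimiser state never has to exist.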

LoRA-based fine-tuning is now the default. The adapters are tiny (10-100 MB), easy to deploy, easy to swap.
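
The 10-100 MB figure follows from applying the per-matrix count r·(d_out + d_in) across a whole model. This sketch assumes Llama-2-7B-style shapes (32 layers, hidden size 4096, MLP size 11008), rank 16 and adapters saved in 16-bit; other models, ranks and target modules will give different sizes.

```python
def lora_param_count(d_out, d_in, r):
    """Trainable parameters for one adapted matrix (B: d_out x r, A: r x d_in)."""
    return r * (d_out + d_in)


def adapter_megabytes(n_layers, shapes, r, bytes_per_param=2):
    """Size of the saved adapter if every listed matrix in every layer gets LoRA."""
    per_layer = sum(lora_param_count(d_out, d_in, r) for d_out, d_in in shapes)
    return n_layers * per_layer * bytes_per_param / 1e6


# Assumed Llama-2-7B-style shapes: hidden size 4096, MLP size 11008, 32 layers.
attention = [(4096, 4096)] * 4                        # q, k, v, o projections
mlp = [(11008, 4096), (11008, 4096), (4096, 11008)]  # gate, up, down projections

print(f"attention only, r=16:  {adapter_megabytes(32, attention, r=16):.1f} MB")        # 33.6 MB
print(f"attention + MLP, r=16: {adapter_megabytes(32, attention + mlp, r=16):.1f} MB")  # 80.0 MB
```

Because the count is linear in r, doubling the rank doubles the adapter size.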

Supervised fine-tuning (SFT)

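The basic form: collect 500-10,000 input/output pairs that demonstrate the behaviour you want and train on them with next-token prediction loss. With LoRA, a run like this typically takes a few hours (roughly 4-8) on a single GPU, depending on model size and sequence length.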
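
Next-token prediction loss for SFT is the average negative log-likelihood of the target tokens. A common choice (an assumption here, not a requirement) is to compute it only on the response tokens and mask out the prompt. The sketch uses -100 as the "ignore" label, the default `ignore_index` of PyTorch's cross-entropy loss; the token IDs and probabilities are made up, and `predicted[t]` stands in for the model's distribution over the token at position t given the tokens before it.

```python
import math

IGNORE_INDEX = -100  # PyTorch's CrossEntropyLoss skips this label by default


def build_example(prompt_ids, response_ids):
    """Concatenate prompt and response; only response positions get a target label."""
    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return input_ids, labels


def sft_loss(predicted, labels):
    """Mean negative log-likelihood over the non-ignored positions.

    predicted[t] maps token id -> probability for position t, conditioned on
    input_ids[:t]. Masked (prompt) positions are skipped, so their entries are unused."""
    nlls = [-math.log(dist[label])
            for dist, label in zip(predicted, labels)
            if label != IGNORE_INDEX]
    return sum(nlls) / len(nlls)


# Toy example: prompt tokens [1, 2], desired response [3, 4].
input_ids, labels = build_example([1, 2], [3, 4])
predicted = [{}, {}, {3: 0.5, 7: 0.5}, {4: 0.25, 9: 0.75}]  # made-up model outputs
print(labels)                       # [-100, -100, 3, 4]
print(sft_loss(predicted, labels))  # (ln 2 + ln 4) / 2, about 1.04
```

Training pushes this number down by raising the probability of each desired token; the prompt contributes context but no loss.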

Data is everything, and quality beats quantity: 500 well-curated examples can beat 50,000 noisy ones. Make sure your data covers edge cases and shows the exact output format you want.
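
A cheap first pass over a dataset catches exact duplicates and targets that don't match the format you're training for. The sketch assumes, hypothetically, that each example is a dict with `input` and `output` strings and that every output must be a JSON object with `label` and `reason` keys; swap in whatever checks your own format needs. It can't judge quality or edge-case coverage, which still needs human review.

```python
import json


def clean_sft_examples(examples, required_keys=("label", "reason")):
    """Split examples into (kept, rejected) using cheap mechanical checks.

    Hypothetical format: each example is {"input": str, "output": str} and every
    output must be a JSON object containing required_keys."""
    seen = set()
    kept, rejected = [], []
    for ex in examples:
        key = (ex["input"].strip(), ex["output"].strip())
        if key in seen:  # exact duplicate after trimming whitespace
            rejected.append((ex, "duplicate"))
            continue
        seen.add(key)
        try:
            parsed = json.loads(ex["output"])
        except json.JSONDecodeError:  # target isn't in the format we train for
            rejected.append((ex, "output is not valid JSON"))
            continue
        if not isinstance(parsed, dict) or any(k not in parsed for k in required_keys):
            rejected.append((ex, "missing required keys"))
            continue
        kept.append(ex)
    return kept, rejected


examples = [
    {"input": "Refund request, order 1", "output": '{"label": "refund", "reason": "wants money back"}'},
    {"input": "Refund request, order 1", "output": '{"label": "refund", "reason": "wants money back"}'},
    {"input": "Where is my parcel?", "output": "shipping"},
    {"input": "Cancel my plan", "output": '{"label": "cancel"}'},
]
kept, rejected = clean_sft_examples(examples)
print(len(kept), [reason for _, reason in rejected])
```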

Preference fine-tuning (DPO, ORPO, KTO)

What if you don't have "right answers", just judgements about which output is better? Modern preference methods learn directly from preference data, without training a separate reward model:

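  • DPO (Direct Preference Optimisation): given (prompt, chosen, rejected) triples, train the model to prefer the chosen response relative to a frozen reference copy of itself. Simple and stable, it replaces classical RLHF's reward model plus PPO loop with a single supervised-style loss.
  • ORPO (Odds Ratio Preference Optimisation): combines SFT and preference learning in one loss and needs no reference model, so it trains faster.
  • KTO (Kahneman-Tversky Optimisation): needs only binary "good"/"bad" labels on individual responses, not pairs.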
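
Both DPO and ORPO boil down to simple losses over log-probabilities. DPO's loss is -log σ(β·[(log π_θ(y_w|x) - log π_ref(y_w|x)) - (log π_θ(y_l|x) - log π_ref(y_l|x))]), comparing how far the policy has moved from the reference on the chosen (y_w) versus rejected (y_l) response. ORPO instead adds an odds-ratio term to the ordinary SFT loss, built from length-normalised likelihoods with no reference model. The sketch below follows those formulas as we understand them from the respective papers; it takes the log-probabilities as given numbers rather than computing them with a model, and β = 0.1 is just an example value.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """DPO loss for one (prompt, chosen, rejected) triple.

    Each argument is the summed log-probability of a full response given the prompt,
    under the model being trained (policy_*) or the frozen reference (ref_*)."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


def orpo_odds_ratio_loss(avg_logp_chosen: float, avg_logp_rejected: float) -> float:
    """ORPO's odds-ratio term; the full ORPO loss is SFT loss + lambda * this term.

    avg_logp_* is the mean per-token log-probability of a response (must be < 0)."""
    def log_odds(avg_logp: float) -> float:
        # log(p / (1 - p)) with p = exp(avg_logp), computed stably
        return avg_logp - math.log1p(-math.exp(avg_logp))
    return -log_sigmoid(log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected))


# With no preference signal yet, both losses equal log(2), about 0.693.
print(dpo_loss(-12.0, -12.0, -12.0, -12.0))
print(orpo_odds_ratio_loss(-1.0, -1.0))
# Policy has raised the chosen response and lowered the rejected one vs the reference:
print(dpo_loss(-10.0, -14.0, -12.0, -12.0))  # below log(2)
```

Minimising either loss widens the gap between chosen and rejected responses; β in DPO controls how strongly that gap is rewarded relative to staying close to the reference.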

For most teams: SFT to set behaviour, then DPO to refine on edge cases or alignment goals.

Cost reality

Fine-tuning costs in 2026:

  • LoRA fine-tuning of a 7B model on 5,000 examples: ~$5-20 on Together AI or RunPod
  • LoRA on a 70B model: ~$50-200
  • Hosted fine-tuning APIs (OpenAI, Anthropic, Google): more expensive but turn-key
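
A back-of-envelope way to sanity-check figures like these is to count training tokens and then ask how many requests it takes for per-request savings to repay the one-off cost. Many hosted services price training per token processed, but every price below is a placeholder, not a quote from any provider; the token counts and per-request costs are likewise illustrative.

```python
import math


def training_tokens(n_examples, avg_tokens_per_example, epochs):
    """Total tokens the trainer processes."""
    return n_examples * avg_tokens_per_example * epochs


def training_cost_usd(tokens, usd_per_million_tokens):
    return tokens / 1e6 * usd_per_million_tokens


def break_even_requests(one_off_cost, old_cost_per_request, new_cost_per_request):
    """Requests needed before per-request savings repay a one-off cost."""
    saving = old_cost_per_request - new_cost_per_request
    return math.inf if saving <= 0 else one_off_cost / saving


tokens = training_tokens(5_000, 600, epochs=3)                 # 9,000,000 tokens
cost = training_cost_usd(tokens, usd_per_million_tokens=2.0)  # placeholder price
print(f"{tokens:,} tokens -> ${cost:.2f}")
# Placeholder per-request costs: large hosted model vs the fine-tuned small one.
print(f"break-even after ~{break_even_requests(cost, 0.01, 0.001):,.0f} requests")
```

Here the one-off cost is training compute only; data preparation, evaluation runs and engineering time often matter more, so fold them into `one_off_cost` for a realistic figure.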

Compared to prompt engineering's near-zero cost, fine-tuning is a real investment. Make sure your eval system can prove the fine-tune is winning.

The deployment story

After training, you have either:

  • An adapter to load on top of the base model (LoRA) — small, fast to deploy
  • A new fine-tuned checkpoint (full fine-tune) — large

Open-weight models with LoRA adapters can be served with vLLM, Ollama, or any compatible inference engine. Closed models (OpenAI, Anthropic) host your fine-tunes for you at a higher per-token price.
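
Because a LoRA update is just (α/r)·BA, an adapter can either be applied at runtime on top of a shared frozen base, which is what makes swapping cheap, or folded into the weights once to produce a standalone checkpoint (Hugging Face PEFT exposes the latter as `merge_and_unload()`). The pure-Python sketch below uses hypothetical adapter names and toy 2×2 numbers to check that both paths give the same output.

```python
def matmul(X, Y):
    """Matrix product of two lists-of-rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def merge(W, A, B, scale):
    """W' = W + scale * (B @ A): a standalone checkpoint with the adapter baked in."""
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


def forward_with_adapter(W, A, B, scale, x):
    """Runtime path: base output plus the low-rank correction; W stays untouched."""
    return [b + scale * d for b, d in zip(matvec(W, x), matvec(B, matvec(A, x)))]


W = [[1.0, 0.0], [0.0, 1.0]]  # shared frozen base (toy 2x2)
adapters = {                  # hypothetical task adapters, rank 1, scale = alpha / r = 2
    "sql": ([[1.0, 2.0]], [[0.5], [0.0]]),
    "support": ([[0.0, 1.0]], [[0.0], [1.0]]),
}
x = [1.0, 1.0]
for name, (A, B) in adapters.items():
    runtime = forward_with_adapter(W, A, B, 2.0, x)
    merged = matvec(merge(W, A, B, 2.0), x)
    print(name, runtime, merged)  # sql: [4.0, 1.0] twice; support: [1.0, 3.0] twice
```

Serving many adapters over one base keeps a single copy of the large weights in memory; merging removes the small runtime overhead but gives up the ability to swap.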

Final advice

Most teams fine-tune too early. Try in order: (1) better prompts, (2) few-shot examples, (3) RAG, (4) prompt caching, (5) THEN consider fine-tuning. By the time you get there, you'll know exactly what you need.

Knowledge Check


Q1. Which of these is NOT a good reason to fine-tune?

Q2. What is LoRA?

Q3. What does DPO replace, and how?

Q4. What is the recommended order of techniques to try before fine-tuning?


LLMAtlas — The Open Ecosystem Workspace for LLMs