Fine-Tuning: LoRA, QLoRA, DPO, ORPO
When prompting isn't enough — change the model itself.
Fine-tuning updates a model's weights on your data. It's the biggest hammer in the AI toolbox — and the most commonly misused. Reach for it only when prompting and RAG have failed.
Full fine-tuning vs LoRA — same outcome, very different cost
Full fine-tuning
All weights updated
- 8× GPU memory
- Hours to days of training
- Catastrophic forgetting risk
LoRA
Base frozen + tiny adapters trained
- 1× GPU memory
- Minutes to hours
- Adapters swappable
LoRA freezes the large pre-trained model and trains only small low-rank adapters, which is much cheaper and faster, usually with little or no quality loss.
When fine-tuning is the right answer
Three legit reasons to fine-tune:
- Format conformance: your output needs a very specific structure the model can't reliably hit with prompting alone
- Domain specialisation: fields such as finance, medicine, or law, where vocabulary and style differ enough from general training data
- Cost reduction at scale: fine-tuning a 7B model to match GPT-4 on your one narrow task can cut per-inference cost by up to roughly 100×
What it's not good for:
- Adding knowledge (use RAG)
- Improving general capability (fine-tuning on a small dataset shapes behaviour; it does not make the model smarter overall)
- Fixing hallucinations (it can make them worse)
Full fine-tuning vs PEFT
Full fine-tuning updates every weight. It is expensive (mixed-precision training with Adam needs roughly 8× the GPU memory of the fp16 weights alone, before counting activations), it risks catastrophic forgetting, and few teams do it for LLMs today.
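Where does a figure like 8× come from? A back-of-the-envelope sketch, assuming mixed-precision training with the Adam optimiser and ignoring activations (both are assumptions for illustration, not something measured here):

```python
# Rough memory estimate: full fine-tuning vs. just holding the weights.
# Assumption: mixed-precision training with Adam; activations are ignored
# and would add more memory on top.
BYTES_PER_PARAM_TRAINING = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}
BYTES_PER_PARAM_WEIGHTS_ONLY = 2  # fp16 weights, nothing else


def full_finetune_memory_gb(num_params: float) -> float:
    """Memory for weights, gradients, and optimiser state, in GB."""
    return num_params * sum(BYTES_PER_PARAM_TRAINING.values()) / 1e9


def weights_only_memory_gb(num_params: float) -> float:
    """Memory for the fp16 weights alone, in GB."""
    return num_params * BYTES_PER_PARAM_WEIGHTS_ONLY / 1e9


num_params = 7e9  # a 7B model
train_gb = full_finetune_memory_gb(num_params)
weights_gb = weights_only_memory_gb(num_params)
ratio = train_gb / weights_gb
print(f"weights only:   {weights_gb:.0f} GB")
print(f"full fine-tune: {train_gb:.0f} GB ({ratio:.0f}x)")
```

Under these assumptions each parameter costs 16 bytes during training against 2 bytes for the weights alone, which is where the 8× ratio comes from.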
Parameter-Efficient Fine-Tuning (PEFT) updates only a tiny subset:
- LoRA (Low-Rank Adaptation): freeze the base model, add small "adapter" matrices (~0.1-1% of params), and train only the adapters. Trains in hours on consumer GPUs for smaller models, with quality close to full fine-tuning on most tasks.
- QLoRA: LoRA with the base model quantised to 4-bit. Lets you fine-tune 70B models on a single 80 GB A100.
- DoRA: a LoRA variant that decomposes each weight update into magnitude and direction. Usually marginally better.
LoRA-based fine-tuning is now the default. The adapters are tiny (10-100 MB), easy to deploy, easy to swap.
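To make "freeze the base, train small adapters" concrete, here is a minimal NumPy sketch of a single LoRA-style layer. The sizes, the `alpha` scaling value, and the function name `lora_forward` are illustrative assumptions; a real setup would use a library such as Hugging Face PEFT rather than hand-rolled matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, rank, alpha = 4096, 4096, 8, 16  # illustrative sizes

# Frozen pre-trained weight (random here, standing in for a real layer).
W = rng.normal(size=(d_in, d_out))

# Trainable adapter: A starts small and random, B starts at zero,
# so the adapted layer initially behaves exactly like the base layer.
A = rng.normal(scale=0.01, size=(d_in, rank))
B = np.zeros((rank, d_out))


def lora_forward(x, W, A, B, alpha, rank):
    """Base output plus the scaled low-rank update x @ A @ B."""
    return x @ W + (alpha / rank) * (x @ A @ B)


x = rng.normal(size=(4, d_in))
y = lora_forward(x, W, A, B, alpha, rank)

base_params = W.size
adapter_params = A.size + B.size
fraction = adapter_params / base_params
print(f"base params:    {base_params:,}")
print(f"adapter params: {adapter_params:,}")
print(f"trainable fraction for this layer: {fraction:.2%}")
```

Only `A` and `B` would receive gradients during training. The fraction printed is for one layer; the overall figure for a model depends on which layers get adapters and on the chosen rank.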
Supervised fine-tuning (SFT)
The basic form: collect 500-10,000 input/output pairs of the behaviour you want, then train on them with a next-token prediction loss. With LoRA this typically takes a few hours on a single GPU, depending on model and dataset size.
Data is everything. Quality > quantity. 500 well-curated examples beat 50,000 noisy ones. Make sure your data covers edge cases and includes the exact format you want.
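The next-token prediction loss is ordinary cross-entropy. A common refinement, assumed here rather than taken from the text above, is to compute it only on the response tokens so the model is not trained to reproduce the prompt. A minimal NumPy sketch, where the `-100` ignore value and the function name are illustrative choices:

```python
import numpy as np

IGNORE_INDEX = -100  # assumed convention: positions with this label are skipped


def masked_next_token_loss(logits, labels):
    """Mean cross-entropy over positions whose label is not IGNORE_INDEX.

    logits: (seq_len, vocab_size) scores for the next token at each position
    labels: (seq_len,) target token ids, IGNORE_INDEX on prompt positions
    """
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    keep = labels != IGNORE_INDEX
    picked = log_probs[np.arange(len(labels))[keep], labels[keep]]
    return float(-picked.mean())


# Toy example: vocabulary of 4 tokens, first position belongs to the prompt.
toy_logits = np.zeros((3, 4))              # uniform predictions
toy_labels = np.array([IGNORE_INDEX, 1, 2])
toy_loss = masked_next_token_loss(toy_logits, toy_labels)
print(f"loss on response tokens: {toy_loss:.3f}")
```

With uniform predictions over 4 tokens the loss equals ln 4, which makes a handy sanity check before plugging in real model outputs.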
Preference fine-tuning (DPO, ORPO, KTO)
What if you don't have "right answers," just preferences between two options? Modern preference methods learn directly from preference pairs:
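- DPO (Direct Preference Optimisation): given (prompt, chosen, rejected) triples, train the model to prefer the chosen response over the rejected one. Simple and stable, and for many teams a replacement for classical RLHF.
- ORPO (Odds Ratio Preference Optimisation): combines SFT and preference learning in one pass, with no separate reference model. Faster.
- KTO (Kahneman-Tversky Optimisation): needs only binary "good"/"bad" labels, not pairs.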
For most teams: SFT to set behaviour, then DPO to refine on edge cases or alignment goals.
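For a feel of what DPO optimises, here is a minimal NumPy sketch of the loss for a batch of (prompt, chosen, rejected) triples. It assumes you already have the summed log-probability of each response under the model being trained and under a frozen reference copy; the function name and the `beta=0.1` default are illustrative assumptions.

```python
import numpy as np


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Mean DPO loss over a batch of preference triples.

    Each argument is an array of summed log-probabilities of a full response.
    """
    chosen_margin = np.asarray(policy_chosen_logp) - np.asarray(ref_chosen_logp)
    rejected_margin = np.asarray(policy_rejected_logp) - np.asarray(ref_rejected_logp)
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) == log(1 + exp(-z)), computed stably.
    return float(np.mean(np.logaddexp(0.0, -z)))


# Before any training the policy equals the reference, so the loss is ln 2.
ref_chosen = np.array([-12.0, -8.0])
ref_rejected = np.array([-10.0, -9.0])
start_loss = dpo_loss(ref_chosen, ref_rejected, ref_chosen, ref_rejected)

# A policy that has shifted probability towards the chosen responses scores lower.
better_loss = dpo_loss(ref_chosen + 5.0, ref_rejected - 5.0, ref_chosen, ref_rejected)
print(f"start: {start_loss:.4f}, after shifting towards chosen: {better_loss:.4f}")
```

The loss falls as the model raises the likelihood of chosen responses relative to rejected ones, measured against the reference model rather than in absolute terms.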
Cost reality
Fine-tuning costs in 2026:
- LoRA fine-tuning of a 7B on 5,000 examples: ~$5-20 on Together AI or RunPod
- LoRA on 70B: ~$50-200
- Hosted fine-tuning APIs (OpenAI, Anthropic, Google): more expensive but turn-key
Compared to prompt engineering's near-zero cost, fine-tuning is a real investment. Make sure your eval system can prove the fine-tune is winning.
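One quick way to sanity-check the investment is a break-even calculation: how many requests before per-request savings repay the training bill? A small sketch, where every dollar figure is a hypothetical placeholder and the function name is illustrative:

```python
import math


def break_even_requests(training_cost_usd, baseline_cost_per_request_usd,
                        finetuned_cost_per_request_usd):
    """Requests needed before per-request savings repay the training cost."""
    saving = baseline_cost_per_request_usd - finetuned_cost_per_request_usd
    if saving <= 0:
        return math.inf  # the fine-tune never pays for itself on cost alone
    return math.ceil(training_cost_usd / saving)


# Hypothetical figures for illustration only.
requests_needed = break_even_requests(
    training_cost_usd=20.0,
    baseline_cost_per_request_usd=0.01,
    finetuned_cost_per_request_usd=0.0001,
)
print(f"break-even after {requests_needed} requests")
```

This covers compute cost only. Data curation, evaluation, and engineering time are usually the larger part of the bill and are not modelled here.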
The deployment story
After training, you have either:
- An adapter to load on top of the base model (LoRA) — small, fast to deploy
- A new fine-tuned checkpoint (full fine-tune) — large
Open-weight models with LoRA adapters can be served with vLLM, Ollama, or any compatible inference engine. Closed-model providers (OpenAI, Anthropic) host your fine-tunes for you, typically at a higher per-token price.
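Why are adapters so easy to swap? One frozen base weight can serve several tasks by choosing a different pair of low-rank factors per request. Folding an adapter into the base weight ahead of time, a common option not covered above, gives the same output. A toy NumPy sketch; the adapter names, sizes, and helper functions are hypothetical and this is not how any particular inference engine implements it:

```python
import numpy as np

rng = np.random.default_rng(0)
d, rank, scale = 64, 4, 2.0  # illustrative sizes; scale plays the role of alpha / rank

W_base = rng.normal(size=(d, d))  # shared, frozen base weight

# Two hypothetical task adapters, each a pair of low-rank factors (A, B).
adapters = {
    "support": (rng.normal(size=(d, rank)), rng.normal(size=(rank, d))),
    "legal": (rng.normal(size=(d, rank)), rng.normal(size=(rank, d))),
}


def forward_with_adapter(x, name):
    """Apply the base weight plus the named adapter at request time."""
    A, B = adapters[name]
    return x @ W_base + scale * (x @ A @ B)


def merge_adapter(name):
    """Fold the named adapter into a single dense weight matrix."""
    A, B = adapters[name]
    return W_base + scale * (A @ B)


x = rng.normal(size=(2, d))
on_the_fly = forward_with_adapter(x, "legal")
merged = x @ merge_adapter("legal")
print("merged matches on-the-fly:", np.allclose(on_the_fly, merged))
```

Keeping adapters separate lets one deployment serve many tasks from a single copy of the base model. Merging produces a plain checkpoint with no extra matrix products at inference, at the price of one full-size copy per task.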
Final advice
Most teams fine-tune too early. Try in order: (1) better prompts, (2) few-shot examples, (3) RAG, (4) prompt caching, (5) THEN consider fine-tuning. By the time you get there, you'll know exactly what you need.