LLMAtlas — The Open Ecosystem Workspace for LLMs

Fine-tuning updates a model's weights on your data. It's the biggest hammer in the AI toolbox — and the most commonly misused. Reach for it only when prompting and RAG have failed.

When fine-tuning is the right answer

Three legit reasons to fine-tune:

Format conformance — your output needs a very specific structure the model can't reliably hit with prompting alone
Domain specialisation — finance, medicine, legal — where vocabulary and style differ enough from general training data
Cost reduction at scale — fine-tuning a 7B model to match GPT-4 on your one narrow task can cut per-inference cost by up to roughly 100×

What it's not good for:

Adding knowledge (use RAG)
Improving general capability (a small task dataset shapes behaviour; it won't make the model broadly smarter)
Fixing hallucinations (training on facts the model doesn't already know often makes them worse)

Full fine-tuning vs PEFT

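Full fine-tuning updates every weight. It's expensive: with Adam in mixed precision, the weights, gradients, and optimizer state take roughly 16 bytes per parameter, about 8× the GPU memory of the fp16 model itself, before counting activations. It also risks catastrophic forgetting, so outside well-resourced labs few teams do it for LLMs today.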
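The 8× figure can be reproduced with a back-of-the-envelope estimate. This is a minimal sketch, assuming Adam with mixed-precision training and the per-parameter byte counts listed in the code (a common rule of thumb, not a measurement). Activations and framework overhead are ignored, so real usage is higher.

```python
# Rough memory estimate for full fine-tuning with Adam in mixed precision.
# The byte counts below are assumptions (a common rule of thumb), not measurements.
BYTES_PER_PARAM = {
    "fp16_weights": 2,
    "fp16_gradients": 2,
    "fp32_master_weights": 4,
    "adam_first_moment_fp32": 4,
    "adam_second_moment_fp32": 4,
}


def full_finetune_memory_gb(n_params: float) -> float:
    """Weights + gradients + optimizer state in GB (1 GB = 1e9 bytes), activations excluded."""
    return n_params * sum(BYTES_PER_PARAM.values()) / 1e9


def inference_memory_gb(n_params: float) -> float:
    """Memory for the fp16 weights alone, in GB."""
    return n_params * BYTES_PER_PARAM["fp16_weights"] / 1e9


n_params = 7e9  # a 7B model
train_gb = full_finetune_memory_gb(n_params)
infer_gb = inference_memory_gb(n_params)
ratio = train_gb / infer_gb

print(f"fp16 weights only: {infer_gb:.0f} GB")
print(f"full fine-tuning:  {train_gb:.0f} GB ({ratio:.0f}x)")
```

Under these assumptions a 7B model needs about 112 GB before activations, which is why full fine-tuning does not fit on a single consumer GPU.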

Parameter-Efficient Fine-Tuning (PEFT) updates only a tiny subset:

LoRA (Low-Rank Adaptation) — freeze the base model, add small low-rank "adapter" matrices (~0.1-1% of params), and train only the adapters. Small models train in hours on a consumer GPU, and quality is close to full fine-tuning on most tasks.
QLoRA — same as LoRA, but the frozen base model is stored 4-bit quantised. Lets you fine-tune 70B models on a single 80 GB A100.
DoRA (Weight-Decomposed Low-Rank Adaptation) — a LoRA variant that decomposes each weight into magnitude and direction and uses the low-rank update for the direction. Marginally better in reported results.

LoRA-based fine-tuning is now the default. The adapters are tiny (10-100 MB), easy to deploy, easy to swap.
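To see where the small percentages and adapter sizes come from, here is a rough parameter count. It is a sketch under stated assumptions: the model shape (32 layers, hidden size 4096), the choice of two adapted projections per layer, and rank 16 are all hypothetical values picked for illustration. The QLoRA line ignores quantisation constants and activations.

```python
def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight:
    B is d_out x rank, A is rank x d_in."""
    return rank * (d_out + d_in)


# Hypothetical 7B-class shape (assumption, not taken from any specific model card)
HIDDEN = 4096
LAYERS = 32
ADAPTED_PER_LAYER = 2  # e.g. two square attention projections per layer
RANK = 16

per_matrix = lora_params(HIDDEN, HIDDEN, RANK)
total_adapter_params = per_matrix * ADAPTED_PER_LAYER * LAYERS

# Share of each adapted matrix that is trainable
fraction_of_matrix = per_matrix / (HIDDEN * HIDDEN)

# Adapter file size if stored in fp16 (2 bytes per parameter)
adapter_mb = total_adapter_params * 2 / 1e6

# QLoRA: frozen base weights at 4 bits = 0.5 bytes per parameter
qlora_base_gb_70b = 70e9 * 0.5 / 1e9

print(f"adapter params: {total_adapter_params:,}")
print(f"trainable share per adapted matrix: {fraction_of_matrix:.2%}")
print(f"adapter size (fp16): {adapter_mb:.1f} MB")
print(f"70B base at 4-bit: {qlora_base_gb_70b:.0f} GB")
```

With these numbers the adapter is about 17 MB, and a 4-bit 70B base takes about 35 GB for weights, which leaves headroom on an 80 GB card for adapters, optimizer state, and activations.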

Supervised fine-tuning (SFT)

The basic form: collect 500-10,000 input/output pairs of the behaviour you want and train on them with next-token prediction loss. With LoRA on a 7B-class model, a run typically finishes in 4-8 hours on a single GPU.
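Next-token prediction loss is the mean cross-entropy of the correct next token at each position. The sketch below computes it in plain Python over toy logits. The `loss_mask` argument is an assumption added for illustration: a common practice is to compute the loss only on output tokens, not on the input prompt, but the text above does not require it.

```python
import math


def next_token_loss(logits_per_step, target_ids, loss_mask=None):
    """Mean cross-entropy of the target token over the positions kept by loss_mask.

    logits_per_step: one list of vocabulary logits per position
    target_ids:      index of the correct next token at each position
    loss_mask:       1 to count a position, 0 to skip it (default: count all)
    """
    if loss_mask is None:
        loss_mask = [1] * len(target_ids)
    total, count = 0.0, 0
    for logits, target, keep in zip(logits_per_step, target_ids, loss_mask):
        if not keep:
            continue
        # log-sum-exp with the max subtracted for numerical stability
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[target]  # equals -log softmax(logits)[target]
        count += 1
    if count == 0:
        raise ValueError("loss_mask removed every position")
    return total / count


# A model with no preference over a 4-token vocabulary scores log(4) per token.
uniform_loss = next_token_loss([[0.0] * 4, [0.0] * 4], [1, 3])

# Position 0 is a prompt token (masked out); position 1 is a confident, correct output token.
masked_loss = next_token_loss(
    [[0.0, 0.0, 0.0, 0.0], [10.0, 0.0, 0.0, 0.0]],
    [1, 0],
    loss_mask=[0, 1],
)
```

Training lowers this number by pushing probability mass onto the tokens in your example outputs.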

Data is everything. Quality > quantity. 500 well-curated examples beat 50,000 noisy ones. Make sure your data covers edge cases and includes the exact format you want.
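A small curation pass catches the most common noise before training. This is a minimal sketch, assuming each example is a dict with `input` and `output` strings and that the target format is a JSON object with a fixed set of keys. The field names, the `curate` helper, and the sample data are hypothetical.

```python
import json


def curate(examples, required_keys):
    """Keep only non-empty, non-duplicate pairs whose output is a JSON object
    with exactly the required keys."""
    seen_inputs = set()
    kept = []
    for ex in examples:
        inp = ex.get("input", "").strip()
        out = ex.get("output", "").strip()
        if not inp or not out:
            continue  # empty side
        if inp in seen_inputs:
            continue  # exact duplicate input
        try:
            parsed = json.loads(out)
        except json.JSONDecodeError:
            continue  # output is not valid JSON
        if not isinstance(parsed, dict) or set(parsed) != set(required_keys):
            continue  # wrong structure
        seen_inputs.add(inp)
        kept.append({"input": inp, "output": out})
    return kept


raw = [
    {"input": "Refund for order 12?", "output": '{"intent": "refund", "order_id": "12"}'},
    {"input": "Refund for order 12?", "output": '{"intent": "refund", "order_id": "12"}'},
    {"input": "Where is my parcel?", "output": "It is on the way."},
    {"input": "Cancel order 7", "output": '{"intent": "cancel"}'},
    {"input": "", "output": '{"intent": "other", "order_id": ""}'},
]
clean = curate(raw, ["intent", "order_id"])
print(f"kept {len(clean)} of {len(raw)} examples")
```

Automated checks like these only remove obvious defects; covering edge cases still takes a human reading the examples.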

Preference fine-tuning (DPO, ORPO, KTO)

What if you don't have "right answers," just preferences between two options? Modern preference methods learn directly from preference pairs:

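DPO (Direct Preference Optimisation) — given (prompt, chosen, rejected) triples, train the model to prefer chosen over rejected, measured against a frozen reference model. Simple and stable, with no separate reward model or RL loop, so it is a common replacement for classical RLHF.
ORPO (Odds Ratio Preference Optimisation) — combines SFT and preference learning in one pass, with no reference model. Faster.
KTO (Kahneman-Tversky Optimisation) — needs only binary "good"/"bad" labels, not pairs.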

For most teams: SFT to set behaviour, then DPO to refine on edge cases or alignment goals.
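The DPO objective for a single triple fits in a few lines. The sketch below assumes you already have the summed token log-probabilities of each response under the model being trained and under a frozen reference model. The value `beta=0.1` is a commonly used setting, chosen here as an assumption, and the example log-probabilities are made up.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (prompt, chosen, rejected) triple.

    Each argument is the log-probability of a full response given the prompt.
    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) = log(1 + exp(-z)), written to avoid overflow for large |z|
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Before any training the policy equals the reference, so the loss is log(2).
loss_at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Policy raised the chosen response and lowered the rejected one: loss falls.
loss_improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)

# Policy moved the wrong way: loss rises.
loss_worse = dpo_loss(-14.0, -13.0, -12.0, -15.0)
```

The loss only cares about how far the model has moved relative to the reference, which is what keeps a DPO-trained model from drifting arbitrarily far from its SFT starting point.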

Cost reality

Fine-tuning costs in 2026:

LoRA fine-tuning of a 7B on 5,000 examples: ~$5-20 on Together AI or RunPod
LoRA on 70B: ~$50-200
Hosted fine-tuning APIs (OpenAI, Google, Anthropic models via Amazon Bedrock): more expensive but turn-key; availability varies by provider and model

Compared to prompt engineering's near-zero cost, fine-tuning is a real investment. Make sure your eval system can prove the fine-tune is winning.
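One quick sanity check is the break-even volume: how many requests it takes for the cheaper inference to pay back the training run. This is a sketch with hypothetical prices (a $20 training run, $0.01 per request before, $0.0001 per request after). It ignores engineering time, evaluation, and retraining, which usually dominate.

```python
import math


def break_even_requests(training_cost, cost_per_request_before, cost_per_request_after):
    """Number of requests after which the fine-tune has paid for itself.

    Returns None when the fine-tuned model is not cheaper per request.
    """
    saving = cost_per_request_before - cost_per_request_after
    if saving <= 0:
        return None
    return math.ceil(training_cost / saving)


# Hypothetical figures, for illustration only
requests_needed = break_even_requests(
    training_cost=20.0,
    cost_per_request_before=0.01,
    cost_per_request_after=0.0001,
)
no_saving = break_even_requests(20.0, 0.001, 0.002)

print(f"break-even after {requests_needed} requests")
```

If your traffic will not reach that volume, or your evals cannot show the fine-tune matching the larger model, the investment does not pay off.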

The deployment story

After training, you have either:

An adapter to load on top of the base model (LoRA) — small, fast to deploy
A new fine-tuned checkpoint (full fine-tune) — large

Open-weight models with LoRA adapters can be served with vLLM, Ollama, or any compatible inference engine. Closed models are hosted for you by whichever provider offers fine-tuning (OpenAI, for example), typically at a higher per-token price than the base model.
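A LoRA adapter can also be folded into the base weights to produce a standalone checkpoint, and the two forms give the same output. The sketch below shows that on a toy 2×2 weight in plain Python, assuming the usual LoRA scaling of alpha / rank. The matrices and helper names are hypothetical, and real inference engines do this on tensors, not nested lists.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, b, a, alpha, rank):
    """Return W + (alpha / rank) * B @ A as a new matrix."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


W = [[1.0, 2.0], [3.0, 4.0]]  # frozen base weight (2 x 2)
B = [[0.5], [-1.0]]           # adapter matrix B (2 x rank)
A = [[2.0, 4.0]]              # adapter matrix A (rank x 2)
ALPHA, RANK = 2, 1
x = [[1.0], [1.0]]            # input as a column vector

# Option 1: keep the adapter separate and add its contribution at run time
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
scale = ALPHA / RANK
y_adapter = [[bo[0] + scale * ao[0]] for bo, ao in zip(base_out, adapter_out)]

# Option 2: merge once and serve a single weight matrix
W_merged = merge_lora(W, B, A, ALPHA, RANK)
y_merged = matmul(W_merged, x)
```

Keeping the adapter separate lets one base model serve many fine-tunes; merging removes the small extra compute per request but gives up easy swapping.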

Final advice

Most teams fine-tune too early. Try in order: (1) better prompts, (2) few-shot examples, (3) RAG, (4) prompt caching, (5) THEN consider fine-tuning. By the time you get there, you'll know exactly what you need.
