Self-Hosting the Open-Source Stack
When privacy, cost, or control matters — run your own LLMs.
You don't have to use a SaaS LLM. Open-weight models + the right inference engine can give you a frontier-tier API running on your own hardware. Here's how.
When self-hosting makes sense
Three legit reasons:
- Privacy/compliance — data can't leave your network (healthcare, defence, internal legal)
- Cost at scale — above 100M tokens/day, self-hosting frequently undercuts API pricing
- Latency/control — you need sub-100ms TTFT from a specific location, or guaranteed availability
For everything else, APIs are cheaper and easier.
The inference engines
This is the single most important choice:
- vLLM — open-source, fastest for batched inference, supports most architectures. The default for serious self-hosting.
- TensorRT-LLM — NVIDIA's official engine, fastest absolute throughput on H100/H200, harder to set up.
- SGLang — newer, optimised for complex prompts and tool use, gaining traction.
- Ollama — easiest to use, single-binary, optimised for local dev / single-user. Not for production at scale.
- llama.cpp — runs on CPU/Mac/ARM. Best for personal/edge deployments.
- Text Generation Inference (TGI) — Hugging Face's engine. Good but slower than vLLM in 2026.
Hardware planning
Rough guide for 2026 hardware:
| Model | Min GPU | Recommended | Throughput |
|---|---|---|---|
| Llama 3 8B | 1× RTX 4090 (24GB) | 1× A100 40GB | 100-200 tok/s |
| Llama 3.3 70B | 2× A100 80GB | 4× A100 80GB | 30-50 tok/s |
| Llama 4 Scout (MoE) | 4× H100 80GB | 8× H100 80GB | 50-80 tok/s |
| DeepSeek V3 (671B) | 8× H100 80GB | 16× H100 80GB | 30-60 tok/s |
For most use cases, Llama 3.3 70B on 2× A100 is the sweet spot of quality, cost, and operational simplicity.
Quantisation
Quantising a model (4-bit or 8-bit weights) shrinks memory by 2-4× with minimal quality loss for sizes ≥7B:
- AWQ — 4-bit, very good quality preservation, well-supported
- GPTQ — older 4-bit, slightly worse than AWQ
- FP8 — newer, supports H100 native FP8, best speed
- BitsAndBytes (NF4) — easy via QLoRA workflow, slower
Quality loss for 70B at 4-bit is typically < 1% on most benchmarks. For smaller models (≤8B), quantisation hurts more.
Deployment patterns
Single-tenant — one model, dedicated GPUs, simple Kubernetes deployment. Best for known workloads.
Multi-tenant — multiple models sharing GPU pools, dynamic loading. Engines like vLLM-routing and Aibrix handle this.
Edge — small models (3B-8B) deployed close to users via Cloudflare Workers AI, Modal, Replicate, or your own POPs.
The opex math
A reasonable self-hosting bill:
- 8× H100 server: $30-50K/month (rented) or ~$300K capex
- Power + ops: ~$5K/month
- Total: $35-55K/month for a serving rig that handles ~10M tokens/hour at 70B
Compare to APIs: 10M tokens × 30 days × 24h = 7.2B tokens/month at $0.50/M average = $3,600. Self-hosting only wins above massive volumes — or when you genuinely need the privacy/control.
What you've accomplished
You now understand the full LLM stack from transformer mechanics through self-hosting. You can:
- Choose the right model for any task
- Engineer prompts that actually work
- Build RAG and agent systems that ship
- Evaluate, monitor, and harden production AI
- Decide when to fine-tune, when to self-host, when to call an API
That's the entire 2026 AI engineering toolkit. Welcome to the frontier.