Level 5 · Frontier & Mastery
9 min

Self-Hosting the Open-Source Stack

When privacy, cost, or control matters — run your own LLMs.

You don't have to use a SaaS LLM. Open-weight models + the right inference engine can give you a frontier-tier API running on your own hardware. Here's how.

When self-hosting makes sense

Three legit reasons:

  1. Privacy/compliance — data can't leave your network (healthcare, defence, internal legal)
  2. Cost at scale — above 100M tokens/day, self-hosting frequently undercuts API pricing
  3. Latency/control — you need sub-100ms TTFT from a specific location, or guaranteed availability

For everything else, APIs are cheaper and easier.

The inference engines

This is the single most important choice:

  • vLLM — open-source, fastest for batched inference, supports most architectures. The default for serious self-hosting.
  • TensorRT-LLM — NVIDIA's official engine, fastest absolute throughput on H100/H200, harder to set up.
  • SGLang — newer, optimised for complex prompts and tool use, gaining traction.
  • Ollama — easiest to use, single-binary, optimised for local dev / single-user. Not for production at scale.
  • llama.cpp — runs on CPU/Mac/ARM. Best for personal/edge deployments.
  • Text Generation Inference (TGI) — Hugging Face's engine. Good but slower than vLLM in 2026.

Hardware planning

Rough guide for 2026 hardware:

ModelMin GPURecommendedThroughput
Llama 3 8B1× RTX 4090 (24GB)1× A100 40GB100-200 tok/s
Llama 3.3 70B2× A100 80GB4× A100 80GB30-50 tok/s
Llama 4 Scout (MoE)4× H100 80GB8× H100 80GB50-80 tok/s
DeepSeek V3 (671B)8× H100 80GB16× H100 80GB30-60 tok/s

For most use cases, Llama 3.3 70B on 2× A100 is the sweet spot of quality, cost, and operational simplicity.

Quantisation

Quantising a model (4-bit or 8-bit weights) shrinks memory by 2-4× with minimal quality loss for sizes ≥7B:

  • AWQ — 4-bit, very good quality preservation, well-supported
  • GPTQ — older 4-bit, slightly worse than AWQ
  • FP8 — newer, supports H100 native FP8, best speed
  • BitsAndBytes (NF4) — easy via QLoRA workflow, slower

Quality loss for 70B at 4-bit is typically < 1% on most benchmarks. For smaller models (≤8B), quantisation hurts more.

Deployment patterns

Single-tenant — one model, dedicated GPUs, simple Kubernetes deployment. Best for known workloads.

Multi-tenant — multiple models sharing GPU pools, dynamic loading. Engines like vLLM-routing and Aibrix handle this.

Edge — small models (3B-8B) deployed close to users via Cloudflare Workers AI, Modal, Replicate, or your own POPs.

The opex math

A reasonable self-hosting bill:

  • 8× H100 server: $30-50K/month (rented) or ~$300K capex
  • Power + ops: ~$5K/month
  • Total: $35-55K/month for a serving rig that handles ~10M tokens/hour at 70B

Compare to APIs: 10M tokens × 30 days × 24h = 7.2B tokens/month at $0.50/M average = $3,600. Self-hosting only wins above massive volumes — or when you genuinely need the privacy/control.

What you've accomplished

You now understand the full LLM stack from transformer mechanics through self-hosting. You can:

  • Choose the right model for any task
  • Engineer prompts that actually work
  • Build RAG and agent systems that ship
  • Evaluate, monitor, and harden production AI
  • Decide when to fine-tune, when to self-host, when to call an API

That's the entire 2026 AI engineering toolkit. Welcome to the frontier.

Knowledge Check

Score 70% or higher to mark this chapter complete.

Q1.Most common reason teams choose self-hosting over APIs?

Q2.Default inference engine for production self-hosting?

Q3.Typical quality impact of 4-bit quantisation on a 70B model?

Q4.At what scale does self-hosting typically beat API costs?

0 / 4 answered

LLMAtlas — The Open Ecosystem Workspace for LLMs