LLMAtlas — The Open Ecosystem Workspace for LLMs

You don't have to use a SaaS LLM. Open-weight models + the right inference engine can give you a frontier-tier API running on your own hardware. Here's how.

When self-hosting makes sense

Three legit reasons:

Privacy/compliance — data can't leave your network (healthcare, defence, internal legal)
Cost at scale — above 100M tokens/day, self-hosting frequently undercuts API pricing
Latency/control — you need sub-100ms TTFT from a specific location, or guaranteed availability

For everything else, APIs are cheaper and easier.

The inference engines

This is the single most important choice:

vLLM — open-source, fastest for batched inference, supports most architectures. The default for serious self-hosting.
TensorRT-LLM — NVIDIA's official engine, fastest absolute throughput on H100/H200, harder to set up.
SGLang — newer, optimised for complex prompts and tool use, gaining traction.
Ollama — easiest to use, single-binary, optimised for local dev / single-user. Not for production at scale.
llama.cpp — runs on CPU/Mac/ARM. Best for personal/edge deployments.
Text Generation Inference (TGI) — Hugging Face's engine. Good but slower than vLLM in 2026.

Hardware planning

Rough guide for 2026 hardware:

Model	Min GPU	Recommended	Throughput
Llama 3 8B	1× RTX 4090 (24GB)	1× A100 40GB	100-200 tok/s
Llama 3.3 70B	2× A100 80GB	4× A100 80GB	30-50 tok/s
Llama 4 Scout (MoE)	4× H100 80GB	8× H100 80GB	50-80 tok/s
DeepSeek V3 (671B)	8× H100 80GB	16× H100 80GB	30-60 tok/s

For most use cases, Llama 3.3 70B on 2× A100 is the sweet spot of quality, cost, and operational simplicity.

Quantisation

Quantising a model (4-bit or 8-bit weights) shrinks memory by 2-4× with minimal quality loss for sizes ≥7B:

AWQ — 4-bit, very good quality preservation, well-supported
GPTQ — older 4-bit, slightly worse than AWQ
FP8 — newer, supports H100 native FP8, best speed
BitsAndBytes (NF4) — easy via QLoRA workflow, slower

Quality loss for 70B at 4-bit is typically < 1% on most benchmarks. For smaller models (≤8B), quantisation hurts more.

Deployment patterns

Single-tenant — one model, dedicated GPUs, simple Kubernetes deployment. Best for known workloads.

Multi-tenant — multiple models sharing GPU pools, dynamic loading. Engines like vLLM-routing and Aibrix handle this.

Edge — small models (3B-8B) deployed close to users via Cloudflare Workers AI, Modal, Replicate, or your own POPs.

The opex math

A reasonable self-hosting bill:

8× H100 server: $30-50K/month (rented) or ~$300K capex
Power + ops: ~$5K/month
Total: $35-55K/month for a serving rig that handles ~10M tokens/hour at 70B

Compare to APIs: 10M tokens × 30 days × 24h = 7.2B tokens/month at $0.50/M average = $3,600. Self-hosting only wins above massive volumes — or when you genuinely need the privacy/control.

What you've accomplished

You now understand the full LLM stack from transformer mechanics through self-hosting. You can:

Choose the right model for any task
Engineer prompts that actually work
Build RAG and agent systems that ship
Evaluate, monitor, and harden production AI
Decide when to fine-tune, when to self-host, when to call an API

That's the entire 2026 AI engineering toolkit. Welcome to the frontier.

Self-Hosting the Open-Source Stack