Level 5 · Frontier & Mastery
8 min

Mixture-of-Experts Architectures

How Llama 4, DeepSeek V3, and Mixtral deliver frontier-level quality at a fraction of the inference cost.

A Mixture-of-Experts (MoE) model has a large total parameter count but activates only a fraction of those parameters for each token. This decouples model size from per-token inference compute, which is the breakthrough that made open-weight frontier models economically viable.

[Diagram] Mixture-of-Experts: router picks top-2 of 8 experts per token. The token "protein" enters the router (a learned gating network); experts E3 and E6 are ACTIVE, while E1, E2, E4, E5, E7, and E8 are idle.

For this token, only 2 of 8 experts run. Total params high; per-token compute low.

How MoE works

Each transformer layer's feed-forward network is split into N experts (small specialised networks). A learned router picks the top-k experts for each token. Only those k experts are activated; the rest sit idle.
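The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's implementation: the function names (`route_top_k`, `moe_layer`), the toy experts, and the choice to apply softmax over only the selected logits are assumptions made for the example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Assumption: gate weights are a softmax over the selected logits only.
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))

def moe_layer(x, router_logits, experts, k=2):
    """Weighted sum of the outputs of the selected experts; the rest never run."""
    out = [0.0] * len(x)
    for idx, w in route_top_k(router_logits, k):
        y = experts[idx](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out

# Toy experts (hypothetical): expert i simply scales its input by (i + 1).
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(8)]

# Hypothetical router scores for one token; indices 2 and 5 score highest,
# which corresponds to E3 and E6 when experts are numbered from 1.
logits = [0.1, -1.0, 2.0, 0.0, -0.5, 2.0, 0.3, -2.0]
chosen = route_top_k(logits, 2)
print(chosen)
print(moe_layer([1.0, 2.0], logits, experts, k=2))
```

In a real model the router logits come from a learned linear layer applied to the token's hidden state, and each expert is a full feed-forward block rather than a scalar multiply.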

  • Llama 4 Maverick: 128 routed experts plus a shared expert; each token goes to the shared expert and one routed expert. 17B "active" params out of 400B total.
  • DeepSeek V3: 256 experts, top-8 active. 37B active out of 671B total.
  • Mixtral 8x22B: 8 experts, top-2 active. 39B active out of 141B total.

The model behaves like a much smaller network at inference time while having access to a much larger pool of expertise.
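To make "much smaller network" concrete, the snippet below computes what share of each model's parameters is touched per token, using only the active and total counts listed above. The helper name `active_fraction` is made up for this sketch.

```python
# (active params, total params) in billions, taken from the list above.
models = {
    "Llama 4 Maverick": (17, 400),
    "DeepSeek V3": (37, 671),
    "Mixtral 8x22B": (39, 141),
}

def active_fraction(active_b, total_b):
    """Share of the model's parameters used for a single token."""
    return active_b / total_b

for name, (active_b, total_b) in models.items():
    share = active_fraction(active_b, total_b)
    print(f"{name}: {share:.1%} of parameters active per token")
```

The shares come out at roughly 4%, 5.5%, and 28% respectively, so the gap between total and active size varies a lot from model to model.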

Why MoE matters

Three big effects:

  1. Quality tracks total params: DeepSeek V3 (671B) plays in the GPT-4 weight class on benchmarks.
  2. Speed tracks active params: with 37B active, per-token compute is roughly that of a 37B dense model.
  3. Cost tracks active params: providers like Together AI tend to price MoE models closer to a dense model of the active size than to one of the total size, making them dramatically cheaper than equivalent dense models.

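This is why an open-weight frontier exists. A dense 671B model would need roughly 18× the compute per token (671B parameters touched instead of 37B), which makes it impractical for most production serving. MoE makes a model of that size affordable to serve.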

The downside

MoE models have caveats:

  • Memory: all 671B params must still be loaded in GPU memory, even though only 37B are active per token. The gap depends on the model: DeepSeek V3 needs roughly 18× the VRAM its active count suggests (671B / 37B), while Mixtral 8x22B needs about 3.6× (141B / 39B).
  • Routing instability: load imbalance across experts can hurt quality. Modern training techniques (such as auxiliary load-balancing losses) mostly fix this.
  • Quantisation is harder: MoE models can lose more quality from aggressive quantisation than dense models.
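The memory caveat is easy to quantify. The sketch below estimates the VRAM needed for the weights alone, assuming 2 bytes per parameter for BF16, 1 for FP8, and 0.5 for 4-bit formats. It ignores the KV cache, activations, and framework overhead, so real requirements are higher. The function and table names are hypothetical.

```python
# Assumed storage cost per parameter, in bytes.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billion, dtype="bf16"):
    """Gigabytes (1e9 bytes) needed just to hold the weights."""
    return params_billion * BYTES_PER_PARAM[dtype]

total_gb = weight_memory_gb(671)       # every expert must be resident
active_gb = weight_memory_gb(37)       # what the active count alone would suggest
print(f"all weights, bf16:    {total_gb:.0f} GB")
print(f"active weights, bf16: {active_gb:.0f} GB")
print(f"ratio: {total_gb / active_gb:.1f}x")
print(f"all weights, int4:    {weight_memory_gb(671, 'int4'):.1f} GB")
```

Under these assumptions the full DeepSeek V3 weights need about 1.3 TB in BF16, which is why self-hosting typically means a multi-GPU node even though each token only exercises 37B parameters.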

Reading MoE specs

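When you see "Llama 4 Maverick 17B-128E," decode it as:

  • 17B: active parameters per token (controls inference speed)
  • 128E: 128 experts in each MoE layer
  • Total params: ~400B. This is published separately and cannot be derived from the name. Multiplying 17B by the expert count overshoots badly, because the active count also includes attention layers and other weights that every token uses regardless of routing.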
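Names in this style can be decoded mechanically. Here is a minimal sketch, assuming the name contains a `<active>B-<experts>E` suffix; `parse_moe_spec` and its regular expression are hypothetical helpers written for this example, not part of any library. The total parameter count is not encoded in the name, so the parser does not return one.

```python
import re

# Assumed pattern: "<active params>B-<expert count>E", e.g. "17B-128E".
SPEC_RE = re.compile(r"(?P<active>\d+(?:\.\d+)?)B-(?P<experts>\d+)E")

def parse_moe_spec(name):
    """Extract active params (billions) and expert count, or None if absent."""
    match = SPEC_RE.search(name)
    if match is None:
        return None
    return {
        "active_params_b": float(match.group("active")),
        "experts": int(match.group("experts")),
    }

print(parse_moe_spec("Llama 4 Maverick 17B-128E"))
print(parse_moe_spec("Mixtral 8x22B"))  # different naming scheme, so no match
```

Mixtral's "8x22B" style follows a different convention (expert count times per-expert size), which is why the sketch returns None for it rather than guessing.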

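When you see "DeepSeek V3," read it as 671B total, 37B active. The quoted cost on Together AI is ~$0.27/$1.10 per million tokens (input/output), which is comparable to a 37B dense model while offering frontier-tier quality. Prices change, so check the provider's current rate card.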
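For budgeting, per-million-token prices translate into a per-request cost as shown below. The sketch assumes the two quoted figures are the input and output prices respectively; the function name and the example token counts are made up for illustration.

```python
def request_cost_usd(input_tokens, output_tokens,
                     input_price_per_m=0.27, output_price_per_m=1.10):
    """Cost of one request given prices in USD per million tokens."""
    cost = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return cost / 1_000_000

# Hypothetical request: 2,000 prompt tokens and 500 generated tokens.
cost = request_cost_usd(2_000, 500)
print(f"${cost:.5f} per request")
print(f"${cost * 100_000:.2f} per 100,000 such requests")
```

At these rates the example request costs about a tenth of a cent, with the output tokens accounting for roughly half of it despite being a quarter of the volume.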

Where MoE is going

The trend is toward MoE. Llama 4, DeepSeek V3, and most frontier open models are MoE in 2026, and GPT-4-class closed models are widely believed to be as well, although their architectures are not public. Dense models remain most common at the small end (≤30B), where the routing overhead isn't worth it.

For your purposes: when picking models, MoE on Together AI or self-hosted gives you the best quality-per-dollar in the open-weight world.

Knowledge Check

Q1. What's the key advantage of MoE?

Q2. In 'Llama 4 Maverick 17B-128E', what does the 17B refer to?

Q3. Why does MoE require so much VRAM despite low active params?

LLMAtlas — The Open Ecosystem Workspace for LLMs