Mixture-of-Experts Architectures
How Llama 4, DeepSeek V3, and Mixtral deliver frontier quality at a fraction of the cost.
A Mixture-of-Experts (MoE) model has a large total parameter count but uses only a fraction of those parameters for each token. This decouples model size from inference cost, the breakthrough that made open-weight frontier models economically viable.
Figure: Mixture-of-Experts routing. The router picks the top 2 of 8 experts per token. For this token, only 2 of 8 experts run, so total parameters are high but per-token compute is low.
How MoE works
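In an MoE transformer, the feed-forward network in some or all layers is replaced by N experts (smaller networks that specialise during training). A learned router scores the experts for each token and picks the top k. Only those k experts run for that token, and their outputs are combined using the router's weights; the rest sit idle.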
- Llama 4 Maverick: 128 routed experts plus one shared expert; each token uses the shared expert and 1 routed expert. 17B "active" params out of 400B total.
- DeepSeek V3: 256 experts, top-8 active. 37B active out of 671B total.
- Mixtral 8x22B: 8 experts, top-2 active. 39B active out of 141B total.
The model behaves like a much smaller network at inference time while having access to a much larger pool of expertise.
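The routing step described above can be sketched in a few lines of plain Python. This is a toy illustration, not any particular model's implementation: the names `route_top_k`, `moe_layer`, and `make_expert` are hypothetical, the "experts" are simple scaling functions standing in for feed-forward networks, and the softmax over only the selected experts is one common choice (other routers normalise differently).

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Assumption: weights are renormalised over the selected experts only.
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def make_expert(scale):
    """Stand-in for an expert network: it just scales its input."""
    return lambda token: [scale * x for x in token]


def moe_layer(token, router_weights, experts, k):
    """Run one token through a toy MoE layer; only k experts are evaluated."""
    # One router logit per expert: dot product of the token with that expert's row.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    chosen = route_top_k(logits, k)
    output = [0.0] * len(token)
    for index, weight in chosen:
        expert_out = experts[index](token)  # the other experts never run
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output, [index for index, _ in chosen]


random.seed(0)
dim, num_experts, k = 4, 8, 2
router_weights = [[random.gauss(0, 1) for _ in range(dim)]
                  for _ in range(num_experts)]
experts = [make_expert(i + 1) for i in range(num_experts)]
token = [0.5, -1.0, 0.25, 2.0]

output, used = moe_layer(token, router_weights, experts, k)
print(f"experts used: {used} ({k} of {num_experts})")
```

The loop over `chosen` is where the saving comes from: the cost of the layer depends on k, not on the number of experts.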
Why MoE matters
Three big effects:
- Quality scales with total params: DeepSeek V3 (671B) plays in the GPT-4 weight class on benchmarks.
- Speed scales with active params: 37B active means per-token compute close to that of a 37B dense model, though routing and memory traffic add some overhead.
- Cost scales with active params: providers like Together AI tend to price MoE models closer to their active size than their total size, making them dramatically cheaper than equivalent dense models.
This is why open-weight frontier models exist. A dense 671B model would have to run all 671B parameters for every token, which would make it far more expensive to serve. MoE cuts the per-token compute to the active 37B.
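To make the active-versus-total gap concrete, here is a small sketch using the parameter counts quoted earlier. The `flops_per_token` helper uses the common rule of thumb of about 2 FLOPs per active parameter per token for a forward pass; that is an approximation that ignores attention over long contexts, and the function names are illustrative.

```python
# (active params, total params), taken from the figures quoted in this article.
MODELS = {
    "Llama 4 Maverick": (17e9, 400e9),
    "DeepSeek V3": (37e9, 671e9),
    "Mixtral 8x22B": (39e9, 141e9),
}


def active_fraction(active_params, total_params):
    """Share of the model's parameters that run for a single token."""
    return active_params / total_params


def flops_per_token(active_params):
    """Rough forward-pass cost: ~2 FLOPs per active parameter (rule of thumb)."""
    return 2 * active_params


for name, (active, total) in MODELS.items():
    share = active_fraction(active, total)
    gflops = flops_per_token(active) / 1e9
    print(f"{name}: {share:.1%} of params active, ~{gflops:.0f} GFLOPs per token")
```

By this estimate a dense 671B model would need roughly 18 times the per-token compute of DeepSeek V3.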
The downside
MoE models have caveats:
- Memory: you still need all 671B params loaded in GPU memory, even if only 37B are active per token. For DeepSeek V3, self-hosting therefore needs roughly 18× more VRAM for weights than the active params suggest (671B / 37B); for Mixtral 8x22B the ratio is about 3.6× (141B / 39B).
- Routing instability: load imbalance across experts can hurt quality. Modern training (auxiliary load-balancing losses) mostly fixes this.
- Quantisation is harder: MoE models can lose more quality from aggressive quantisation than dense models.
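The memory caveat is easy to estimate. The sketch below counts weights only and ignores the KV cache, activations, and framework overhead, so real requirements are higher. The 80 GB per-GPU figure and the function names are assumptions chosen for illustration.

```python
import math

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(total_params, dtype):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params * BYTES_PER_PARAM[dtype] / 1e9


def min_gpus(total_params, dtype, gpu_memory_gb=80.0):
    """Lower bound on GPU count just to hold the weights (assumed 80 GB GPUs)."""
    return math.ceil(weight_memory_gb(total_params, dtype) / gpu_memory_gb)


total = 671e9  # DeepSeek V3: every expert must be resident, not just the active 37B
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(total, dtype):.1f} GB, "
          f"at least {min_gpus(total, dtype)} GPUs")
```

Note that the calculation uses total parameters, not active ones: the router can pick any expert for the next token, so all of them have to be loaded.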
Reading MoE specs
When you see "Llama 4 Maverick 17B-128E," decode it as:
- 17B — active parameters per token (controls inference speed)
- 128E — 128 experts in each MoE layer
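- Total params: ~400B. This is not stated in the name and cannot be derived from it: 17B × 128 would badly overshoot, because the 17B active count already includes attention, embeddings, and the shared expert, which every token uses.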
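Names in this style can be decoded mechanically. The sketch below is a hypothetical helper (`parse_moe_spec` and `MoESpec` are not from any library) that pulls the active-parameter count and expert count out of a name of the form "17B-128E". It assumes that exact pattern and returns `None` for names that use a different convention, such as "Mixtral 8x22B".

```python
import re
from typing import NamedTuple, Optional


class MoESpec(NamedTuple):
    active_params_b: float  # active parameters per token, in billions
    num_experts: int        # experts per MoE layer


# Matches e.g. "17B-128E": <active params>B-<expert count>E
_SPEC_PATTERN = re.compile(r"(\d+(?:\.\d+)?)B-(\d+)E")


def parse_moe_spec(model_name: str) -> Optional[MoESpec]:
    """Extract active params and expert count, or None if the pattern is absent."""
    match = _SPEC_PATTERN.search(model_name)
    if match is None:
        return None
    return MoESpec(float(match.group(1)), int(match.group(2)))


print(parse_moe_spec("Llama 4 Maverick 17B-128E"))
print(parse_moe_spec("Mixtral 8x22B"))
```

The total parameter count is deliberately not part of the result, since the name does not carry it.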
DeepSeek V3 does not put its sizes in the name: it has 671B total parameters and 37B active. The cost on Together AI is ~$0.27/$1.10 per million input/output tokens, comparable to a 37B dense model, with frontier-tier quality.
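Per-million-token prices translate into per-request costs with simple arithmetic. The sketch below uses the $0.27 and $1.10 figures quoted above as defaults, reading them as input and output prices; treat them as placeholders, since provider pricing changes. The function name is illustrative.

```python
def request_cost_usd(input_tokens, output_tokens,
                     input_price_per_m=0.27, output_price_per_m=1.10):
    """Cost of one request given prices in USD per million tokens."""
    input_cost = input_tokens / 1_000_000 * input_price_per_m
    output_cost = output_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost


# Example: a 2,000-token prompt with a 500-token reply.
cost = request_cost_usd(2_000, 500)
print(f"one request: ${cost:.5f}")
print(f"one million such requests: ${cost * 1_000_000:,.0f}")
```

Output tokens usually dominate the bill for long generations, so it is worth estimating the two sides separately.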
Where MoE is going
The future is increasingly MoE. Llama 4, DeepSeek V3, and most other frontier open models are MoE in 2026, and several GPT-4-class closed models are reported to be as well. Dense models remain most common at the small end (≤30B), where the routing overhead isn't worth it.
For your purposes: when picking models, an MoE model, whether served through Together AI or self-hosted, usually gives you the best quality-per-dollar in the open-weight world.