LLMAtlas — The Open Ecosystem Workspace for LLMs

A Mixture-of-Experts (MoE) model has a large total parameter count but activates only a fraction of those parameters for each token. This decouples model size from per-token inference compute, which is what made open-weight frontier models economically viable.

How MoE works

In an MoE layer, the transformer's feed-forward network is replaced by N experts (smaller feed-forward networks). A learned router scores the experts for each token and picks the top k. Only those k experts run for that token; the rest sit idle.
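The routing step can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's implementation: the experts are toy scalar functions, the router scores are made-up numbers, and renormalising the weights over the selected experts is one common choice that is assumed here.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Weights are renormalised so they sum to 1 over the selected experts
    (an assumption; real models differ in how they normalise).
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(token, experts, router_logits, k):
    """Weighted sum of the k selected experts' outputs; other experts never run."""
    out = 0.0
    for idx, weight in route_top_k(router_logits, k):
        out += weight * experts[idx](token)
    return out

# Four toy "experts": each one just scales a scalar token.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]  # hypothetical router scores for one token
selected = route_top_k(logits, k=2)
output = moe_forward(10.0, experts, logits, k=2)
```

In a real model the token is a vector, each expert is a feed-forward network, and the router logits come from a learned linear layer applied to the token.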

Llama 4 Maverick: 128 routed experts plus one shared expert; each token uses the shared expert and 1 routed expert. 17B "active" params out of 400B total.
DeepSeek V3: 256 experts, top-8 active. 37B active out of 671B total.
Mixtral 8x22B: 8 experts, top-2 active. 39B active out of 141B total.

The model behaves like a much smaller network at inference time while having access to a much larger pool of expertise.

Why MoE matters

Three big effects:

Quality scales with total params: DeepSeek V3 (671B) competes with GPT-4-class models on benchmarks.
Speed scales with active params: with 37B active, per-token compute is close to that of a 37B dense model.
Cost scales with active params: providers such as Together AI price MoE models closer to their active size than their total size, which makes them much cheaper than dense models of the same total size.

This is why the open-weight frontier exists. A dense 671B model would be prohibitively expensive to serve in production; MoE brings the per-token compute down to that of a much smaller model.
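To put rough numbers on the speed argument, the sketch below uses the active and total parameter counts listed earlier. It assumes the common rule of thumb that a forward pass costs about 2 FLOPs per parameter used per token; that is an approximation that ignores attention over long contexts and routing overhead, so treat the output as an order-of-magnitude guide.

```python
# (name, active params in billions, total params in billions)
MODELS = [
    ("Llama 4 Maverick", 17, 400),
    ("DeepSeek V3", 37, 671),
    ("Mixtral 8x22B", 39, 141),
]

def active_fraction(active_b, total_b):
    """Share of the weights that take part in processing one token."""
    return active_b / total_b

def flops_per_token(params_b):
    """Rule-of-thumb forward-pass cost: about 2 FLOPs per parameter used."""
    return 2 * params_b * 1e9

for name, active_b, total_b in MODELS:
    frac = active_fraction(active_b, total_b)
    # How much more compute a dense model of the same total size would need.
    ratio = flops_per_token(total_b) / flops_per_token(active_b)
    print(f"{name}: {frac:.1%} of weights active per token, "
          f"about {ratio:.1f}x less compute than a dense model of equal total size")
```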

The downside

MoE models have caveats:

Memory: you still need all 671B params loaded in GPU memory, even though only 37B are active per token. For DeepSeek V3, self-hosting requires roughly 18× more VRAM for weights than the active params suggest (671B / 37B).
Routing instability: load imbalance across experts can hurt quality. Modern training recipes (for example, auxiliary load-balancing losses) largely mitigate this.
Quantisation is harder: MoE models can lose more quality from aggressive quantisation than dense models.
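The memory caveat is easy to estimate. The sketch below counts weights only and ignores the KV cache, activations, and framework overhead, so real requirements are higher. The 80 GB GPU size is an assumed example, not a recommendation, and the function names are hypothetical.

```python
import math

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(total_params_b, precision="fp16"):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes).

    Billions of params times bytes per param gives GB directly.
    """
    return total_params_b * BYTES_PER_PARAM[precision]

def gpus_needed(total_params_b, precision="fp16", gpu_gb=80):
    """Lower bound on GPU count: weights only, assuming 80 GB cards by default."""
    return math.ceil(weight_memory_gb(total_params_b, precision) / gpu_gb)

# All 671B params must be resident, even though only 37B are used per token.
full_model_gb = weight_memory_gb(671)   # whole MoE model at fp16
active_only_gb = weight_memory_gb(37)   # what the active count alone would suggest
```

Under these assumptions the full model needs 1342 GB for weights at fp16, against 74 GB for a 37B dense model, which is where the roughly 18× gap comes from.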

Reading MoE specs

When you see "Llama 4 Maverick 17B-128E," decode it as:

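17B: active parameters per token (controls inference speed)
128E: 128 routed experts in each MoE layer
Total params: ~400B. The total cannot be computed from the name alone, because the 17B active figure includes attention, dense-layer, and shared-expert weights that are not multiplied by the expert count; take it from the model card.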
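Decoding this naming pattern can be automated with a small regular expression. The helper below is a hypothetical sketch that assumes the `<active>B-<experts>E` convention shown above; names that do not follow it simply return `None`.

```python
import re

# Matches suffixes such as "17B-128E": active params in billions, then expert count.
SPEC_RE = re.compile(r"(\d+(?:\.\d+)?)B-(\d+)E")

def parse_moe_spec(name):
    """Return (active_params_in_billions, experts_per_moe_layer), or None.

    Only handles the '<active>B-<experts>E' pattern; other naming schemes
    are not recognised.
    """
    match = SPEC_RE.search(name)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))

maverick = parse_moe_spec("Llama 4 Maverick 17B-128E")
unlabelled = parse_moe_spec("DeepSeek V3")
```

Here `maverick` holds the active size and expert count, while `unlabelled` is `None` because the name carries no size information.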

The name "DeepSeek V3" carries no size information: the model is 671B total, 37B active. The cost on Together AI is ~$0.27/$1.10 per M tokens, comparable to a 37B dense model, with frontier-tier quality.
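A per-request cost estimate follows directly from those prices. The sketch below assumes the two figures are the input and output prices per million tokens, which is the usual way such pairs are quoted; check the provider's current price list before relying on it. The function name and the token counts are illustrative.

```python
def request_cost_usd(input_tokens, output_tokens,
                     in_price_per_m=0.27, out_price_per_m=1.10):
    """Estimated cost in USD, assuming separate input and output prices per 1M tokens."""
    input_cost = input_tokens / 1e6 * in_price_per_m
    output_cost = output_tokens / 1e6 * out_price_per_m
    return input_cost + output_cost

# Hypothetical workload: 2M prompt tokens and 0.5M generated tokens.
cost = request_cost_usd(2_000_000, 500_000)
print(f"Estimated cost: ${cost:.2f}")
```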

Where MoE is going

The future is increasingly MoE. Llama 4, GPT-4-class architectures, and most frontier open models are MoE as of 2026. Dense models remain competitive mainly at the small end (≤30B), where the routing overhead isn't worth it.

For your purposes: when picking models, an MoE model, whether served through Together AI or self-hosted, usually gives you the best quality per dollar in the open-weight world.
