Training a dense 1-trillion-parameter model would require roughly 25 million GPU-hours — about $75M at current H100 pricing. MoE models achieve the same quality with 5-10x less compute by activating only a fraction of parameters per token. DeepSeek V3 (671B total, 37B activated) outperforms Llama 3 405B (dense) while training for a fraction of the cost.
This post covers the full MoE stack: router design, load balancing, fine-grained experts, expert parallelism systems, and the production serving challenges.
Why MoE: The Scaling Law Motivation
The Chinchilla scaling law shows that model quality improves predictably with both parameter count and training tokens. But FLOPs scale linearly with activated parameters — not total parameters. MoE exploits this gap:
A 671B MoE model activating 37B per token costs the same FLOPs-per-token as a 37B dense model, but has the representational capacity of a much larger model. The quality comes from having more specialized parameters to choose from, even though each token only uses a small fraction.
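A back-of-envelope check, using the common approximation of ~2 FLOPs per activated parameter per token (the constants here are illustrative, not exact):

```python
def flops_per_token(activated_params: float) -> float:
    # Rough estimate: ~2 FLOPs (one multiply + one add) per activated parameter
    return 2 * activated_params

# The 671B total parameters never enter the per-token cost;
# only the 37B activated parameters do.
dense_405b = flops_per_token(405e9)
moe_37b_active = flops_per_token(37e9)
ratio = dense_405b / moe_37b_active  # ~10.9x more compute per token for dense 405B
```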
Dense vs MoE Training Economics
| Model | Total Params | Activated/Token | Training Compute | Quality (MMLU) |
|---|---|---|---|---|
| Llama 3 70B (dense) | 70B | 70B | ~6.4M H100-hrs | 79.5 |
| Llama 3 405B (dense) | 405B | 405B | ~30.8M H100-hrs | 86.1 |
| Mixtral 8x7B | 47B | 13B | ~1.2M H100-hrs | 70.6 |
| DeepSeek V3 (MoE) | 671B | 37B | ~2.8M H800-hrs | 87.1 |
Router Design: How Tokens Choose Experts
The router (also called the gating network) is a small linear layer that takes a token’s hidden state and produces logits over all experts:

$$\text{logits} = h W_r^\top \in \mathbb{R}^{E}$$

where $h$ is the token’s hidden state and $E$ is the number of experts. The top-$k$ experts by logit value are selected, and their outputs are weighted by the corresponding gate values.
Top-K Routing
The standard approach. Each token picks its top-$k$ experts:
```python
import torch
import torch.nn.functional as F

def top_k_routing(hidden, router_weight, k=8):
    logits = hidden @ router_weight.T             # [batch, num_experts]
    topk_vals, topk_ids = logits.topk(k, dim=-1)  # [batch, k]
    gates = F.softmax(topk_vals, dim=-1)          # normalize over the selected experts
    return topk_ids, gates
```
Problem: Popular experts get overwhelmed with tokens (load imbalance), while unpopular experts are underutilized.
Expert Choice Routing
Inversion: instead of tokens choosing experts, experts choose tokens. Each expert picks its top-$c$ tokens from the batch, where $c$ is a fixed capacity. This guarantees perfect load balance by construction.
Problem: Some tokens may not be selected by any expert (token dropping) or selected by too many (compute waste).
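A minimal sketch of expert-choice routing, assuming a plain linear router (the function and variable names are mine, not from any particular implementation):

```python
import torch

def expert_choice_routing(hidden, router_weight, capacity):
    # Token-expert affinities: [batch, num_experts], softmax over experts
    probs = torch.softmax(hidden @ router_weight.T, dim=-1)
    # Transpose so each expert ranks all tokens: [num_experts, batch]
    scores = probs.T
    # Each expert independently takes its top-`capacity` tokens
    gate_vals, token_ids = scores.topk(capacity, dim=-1)
    return token_ids, gate_vals
```

Every expert processes exactly `capacity` tokens, so load is balanced by construction; the flip side is that a token absent from every expert’s top list is simply dropped.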
Group-Limited Gating (DeepSeek V3)
A hybrid: tokens choose experts, but with a constraint — no expert can receive more than a group capacity limit within a group of tokens. This approximates load balance without the rigidity of expert choice.
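The constraint can be illustrated with a greedy sketch: tokens keep their preference order, but an expert that hits its capacity is skipped in favor of the token’s next-best choice. (The greedy fallback here is my own illustration of the idea, not DeepSeek’s actual algorithm.)

```python
import torch

def capacity_limited_routing(logits, k, capacity):
    """Each token takes its best-scoring experts, but an expert that has
    reached `capacity` assignments is skipped for the next-best choice."""
    batch, num_experts = logits.shape
    load = [0] * num_experts
    assignments = []
    order = logits.argsort(dim=-1, descending=True)  # per-token preference lists
    for t in range(batch):
        chosen = []
        for e in order[t].tolist():
            if load[e] < capacity:
                chosen.append(e)
                load[e] += 1
            if len(chosen) == k:
                break
        assignments.append(chosen)
    return assignments, load
```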
Load Balancing: The Central Challenge
If all tokens route to the same 2 experts, the other 254 sit idle. Throughput drops to single-expert speed. Load balancing is the most important MoE systems challenge.
Auxiliary Loss Approach (Switch Transformer, GShard)
Add a differentiable loss term that penalizes imbalanced routing:

$$\mathcal{L}_{\text{aux}} = \alpha \, N \sum_{i=1}^{N} f_i P_i$$

where $f_i$ is the fraction of tokens routed to expert $i$, $P_i$ is the average router probability assigned to expert $i$, $N$ is the number of experts, and $\alpha$ is a small coefficient. The sum is minimized when routing is uniform, so this loss pushes the router toward a uniform distribution.
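A minimal sketch of this loss for top-1 dispatch, following the Switch Transformer formulation (function name and the bincount-based estimate of the routed fraction are illustrative):

```python
import torch
import torch.nn.functional as F

def switch_aux_loss(router_logits, expert_ids, num_experts, alpha=0.01):
    # f_i: fraction of tokens dispatched to expert i
    f = torch.bincount(expert_ids.flatten(), minlength=num_experts).float()
    f = f / expert_ids.numel()
    # P_i: mean router probability mass on expert i
    P = F.softmax(router_logits, dim=-1).mean(dim=0)
    return alpha * num_experts * (f * P).sum()
```

With perfectly uniform routing the loss evaluates to exactly `alpha`; any imbalance raises it.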
The problem: This loss directly competes with the language modeling objective. The router must balance between “send this token to the expert that would process it best” and “send it somewhere underutilized.” This tension reduces final model quality by 0.1-0.3% on benchmarks.
Auxiliary-Loss-Free Balancing (DeepSeek V3)
Instead of a loss, add a per-expert bias to the router logits used for expert selection:

$$s_i' = s_i + b_i$$

The biases $b_i$ are not trained by gradient descent. Instead, a simple control rule updates them from the observed load:

- If expert $i$ is overloaded: decrease $b_i$
- If expert $i$ is underloaded: increase $b_i$
This is a control system operating alongside training, with zero gradient interference. The result: perfect load balance with no quality compromise.
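The control rule can be sketched as a single non-gradient update per training step (the step size and comparison against the mean load are illustrative choices, not DeepSeek’s exact hyperparameters):

```python
import torch

def update_router_biases(bias, tokens_per_expert, step=0.001):
    """Nudge selection biases toward balanced load. Runs outside autograd:
    the biases are adjusted by a fixed step, never trained."""
    mean_load = tokens_per_expert.float().mean()
    overloaded = tokens_per_expert.float() > mean_load
    # Overloaded experts become less attractive for selection, underloaded more so
    return bias - step * torch.where(overloaded, 1.0, -1.0)
```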
At frontier scale, 0.1% quality improvement matters — it can be the difference between passing and failing a benchmark threshold. By eliminating the auxiliary loss, DeepSeek V3 gets both perfect balance AND maximum quality. This is one of their most impactful and easily portable innovations.
Fine-Grained Experts: 256 with Top-8
Why Not 8 Experts with Top-2?
Mixtral uses 8 experts with top-2 routing: $\binom{8}{2} = 28$ possible combinations per token. DeepSeek V3 uses 256 experts with top-8: $\binom{256}{8} \approx 4.1 \times 10^{14}$ combinations.
More combinations enable finer-grained specialization. Each expert can focus on a narrow knowledge domain (specific language patterns, code syntax, mathematical reasoning) because the combinatorial space is rich enough that tokens can almost always find a relevant subset.
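The counts are easy to verify:

```python
import math

mixtral_combos = math.comb(8, 2)     # expert subsets per token: top-2 of 8
deepseek_combos = math.comb(256, 8)  # top-8 of 256

assert mixtral_combos == 28
assert 4e14 < deepseek_combos < 4.2e14  # ~4.1e14 distinct subsets
```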
Shared Experts
One expert processes every token regardless of routing. This shared expert captures universal knowledge — function words, basic syntax, common patterns — preventing the routed experts from wasting capacity on things every token needs.
```python
import torch

def moe_forward(hidden, shared_expert, routed_experts, router, k=8):
    # Shared expert: processes every token
    shared_output = shared_expert(hidden)

    # Routing: [batch, k] expert ids and gate weights
    expert_ids, gates = router(hidden, k=k)

    # Routed experts: only selected ones fire. Iterate over experts and
    # gather the tokens assigned to each, so every expert runs one batch.
    routed_output = torch.zeros_like(hidden)
    for expert_id, expert in enumerate(routed_experts):
        # (token, slot) pairs that selected this expert
        token_idx, slot_idx = torch.where(expert_ids == expert_id)
        if token_idx.numel() == 0:
            continue
        expert_out = expert(hidden[token_idx])
        gate = gates[token_idx, slot_idx].unsqueeze(-1)
        routed_output.index_add_(0, token_idx, gate * expert_out)
    return shared_output + routed_output
```
Expert Parallelism: All-to-All Communication
With 256 experts across 64 GPUs (~4 experts per GPU), tokens must be dispatched to the GPU holding their selected experts, then combined back:
Dispatch Phase
- Router determines which experts each token needs
- Token hidden states are sent to the GPUs holding those experts
- This is an all-to-all communication pattern — every GPU potentially sends to every other GPU
Combine Phase
- Each expert processes its received tokens
- Expert outputs are sent back to the original GPUs
- Gated outputs are aggregated
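The dispatch bookkeeping can be sketched without any real communication. A contiguous expert-to-GPU placement (experts 0-3 on GPU 0, 4-7 on GPU 1, and so on) is assumed here to match the layout above:

```python
import torch

NUM_EXPERTS, NUM_GPUS = 256, 64
EXPERTS_PER_GPU = NUM_EXPERTS // NUM_GPUS  # 4, as in the text

def dispatch_counts(expert_ids):
    """How many (token, expert) pairs each GPU must receive,
    given [batch, k] routing decisions."""
    gpu_ids = expert_ids // EXPERTS_PER_GPU  # map expert -> owning GPU
    return torch.bincount(gpu_ids.flatten(), minlength=NUM_GPUS)
```

In a real system these counts drive the all-to-all: each rank exchanges its send counts, then transfers exactly that many hidden states to each peer.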
Communication Volume
For batch size $B$, hidden dim $d$, top-$k$ routing, and FP16 (2 bytes per element), the per-layer all-to-all volume is:

$$V \approx \underbrace{B k d \cdot 2}_{\text{dispatch}} + \underbrace{B k d \cdot 2}_{\text{combine}} = 4 B k d \ \text{bytes}$$

For $d = 7168$ and $k = 8$ (DeepSeek V3), that is $4 \cdot 8 \cdot 7168 \approx 229$ KB per token per layer.
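As a sanity check on the formula (function name is mine):

```python
def moe_comm_bytes(batch, hidden_dim, k, bytes_per_elem=2):
    """FP16 all-to-all volume per MoE layer: dispatch + combine."""
    return 2 * batch * k * hidden_dim * bytes_per_elem

# DeepSeek-V3-like settings: d=7168, top-8 routing
per_token = moe_comm_bytes(batch=1, hidden_dim=7168, k=8)
assert per_token == 229_376  # ~224 KiB per token per layer
```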
DeepEP: Optimized MoE Communication
DeepSeek’s DeepEP library provides two kernel types:
High-throughput kernels: Exploit asymmetric NVLink/RDMA bandwidth. NVLink intra-node: ~160 GB/s. RDMA inter-node: ~50 GB/s. The kernels schedule NVLink and RDMA transfers simultaneously, achieving near-peak utilization on both.
Low-latency kernels: Use pure RDMA without consuming any streaming multiprocessors. This is critical: if all SMs are busy with expert computation, communication kernels can’t launch. By using RDMA-only communication, the SMs remain free for compute, enabling true computation-communication overlap.
DeepEP Performance (H800, 8 nodes)
| Operation | NVLink BW | RDMA BW | Latency (dispatch) |
|---|---|---|---|
| High-throughput dispatch | 153-158 GB/s | 43-58 GB/s | ~2.1 ms |
| High-throughput combine | 153-158 GB/s | 43-58 GB/s | ~1.8 ms |
| Low-latency dispatch | N/A (RDMA only) | ~50 GB/s | 77 us |
| Low-latency combine | N/A (RDMA only) | ~50 GB/s | 68 us |
Expert Offloading: MoE on Consumer Hardware
When 256 experts don’t fit in GPU memory (e.g., serving on a single GPU), expert offloading moves inactive experts to CPU or SSD:
- Prediction: The router determines which experts are needed 1-2 layers ahead
- Prefetch: Predicted experts are loaded from CPU/SSD to GPU asynchronously
- Compute: Active experts process tokens while next experts are loading
- Eviction: Used experts are evicted to make room
This enables serving Mixtral 8x7B on a single 24GB GPU (only 2 experts active at once = ~13B parameters in GPU memory).
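The prefetch-and-evict bookkeeping amounts to an LRU cache of expert weights. A minimal sketch, where `load_fn` stands in for the asynchronous CPU/SSD-to-GPU copy and all names are illustrative:

```python
from collections import OrderedDict

class ExpertCache:
    """LRU cache of expert weights resident on the GPU (sketch)."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn
        self.resident = OrderedDict()  # expert_id -> weights

    def prefetch(self, expert_id):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)  # refresh LRU position
            return
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)     # evict least recently used
        self.resident[expert_id] = self.load_fn(expert_id)

    def get(self, expert_id):
        self.prefetch(expert_id)  # synchronous fallback on a cache miss
        return self.resident[expert_id]
```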
MoE vs Dense at Different Scales
Quality (MMLU) per Training FLOP
At small scale (activated params under 7B): dense often wins. Routing overhead (all-to-all communication, gating computation) dominates. Expert specialization doesn’t emerge with too few experts.
At medium scale (7-70B activated): MoE starts winning on quality-per-FLOP. Mixtral 8x7B (13B activated) approaches 70B dense quality.
At large scale (37B+ activated): MoE is the only practical path to frontier quality. DeepSeek V3 demonstrates that 671B MoE matches 405B dense at 1/10 the training cost.
Serving Challenges
MoE models present unique serving difficulties:
- Memory: All 256 experts must be accessible, even though only 8 fire per token. Total parameter footprint is 671B regardless of activation sparsity.
- Latency: All-to-all communication adds latency at every layer. With the low-latency kernels above (~77us dispatch + ~68us combine ≈ 0.15ms per layer), an 80-layer forward pass spends roughly 12ms on MoE communication alone. This is significant for latency-sensitive applications.
- Load imbalance at serving time: Unlike training, where batches are large, serving requests arrive unpredictably. Some experts may be overloaded while others idle; dynamic batch sizes exacerbate this.
- Expert parallelism overhead: Compared to tensor parallelism (2 all-reduces per layer at ~10us each on NVLink), expert parallelism requires all-to-all communication that is harder to overlap.
MoE models offer better quality-per-FLOP for training, but worse latency-per-token for serving compared to dense models of the same activated size. Choose MoE when training cost dominates your total cost (most scenarios). Choose dense when serving latency is the binding constraint and you can afford the training budget.
When to Choose MoE vs Dense
MoE vs Dense Decision Framework
| Factor | Favors MoE | Favors Dense |
|---|---|---|
| Training budget | Limited compute budget | Unlimited compute |
| Model quality target | Frontier quality needed | Moderate quality sufficient |
| Serving latency | Throughput-optimized | Latency-critical (p99 matters) |
| Infrastructure | Multi-node with fast interconnect | Single-node serving |
| Model size | 100B+ total params beneficial | Under 70B total is fine |
| Complexity tolerance | Can manage routing, EP, load balancing | Want simple TP/DP |
The trend is clear: as models grow, MoE becomes increasingly dominant. Nearly every frontier model in 2025 uses MoE (DeepSeek V3, Mixtral, Grok, Gemini). Dense models (Llama 3) persist primarily because Meta can afford the training compute.