Part of Series Transformer Anatomy 10 of 23

Training a dense 1-trillion-parameter model would require roughly 25 million GPU-hours — about $75M at current H100 pricing. MoE models achieve the same quality with 5-10x less compute by activating only a fraction of parameters per token. DeepSeek V3 (671B total, 37B activated) outperforms Llama 3 405B (dense) while training for a fraction of the cost.

This post covers the full MoE stack: router design, load balancing, fine-grained experts, expert parallelism systems, and the production serving challenges.

Why MoE: The Scaling Law Motivation

The Chinchilla scaling law shows that model quality improves predictably with both parameter count and training tokens. But FLOPs scale linearly with activated parameters — not total parameters. MoE exploits this gap:

$$\text{FLOPs per token} \propto P_{\text{activated}}, \qquad \text{Quality} \propto P_{\text{total}}^{0.5} \times D^{0.5}$$

A 671B MoE model activating 37B per token costs the same FLOPs-per-token as a 37B dense model, but has the representational capacity of a much larger model. The quality comes from having more specialized parameters to choose from, even though each token only uses a small fraction.
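To make the gap concrete, here is a back-of-envelope sketch using the standard estimate of ~6 FLOPs per activated parameter per training token (the 15T token count is illustrative, not the exact figure for either model):

```python
def train_flops(activated_params, tokens):
    # Standard estimate: ~6 FLOPs per parameter per token (forward + backward)
    return 6 * activated_params * tokens

# MoE with 37B activated vs a 405B dense model, same token budget
moe = train_flops(37e9, 15e12)
dense = train_flops(405e9, 15e12)
print(f"dense / MoE compute ratio: {dense / moe:.1f}x")  # ~10.9x
```

The ratio is simply 405/37: training compute tracks activated parameters, while total parameters ride along for free at training time (though not in memory).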


Dense vs MoE Training Economics

| Model | Total Params | Activated/Token | Training Compute | Quality (MMLU) |
|---|---|---|---|---|
| Llama 3 70B (dense) | 70B | 70B | ~6.4M H100-hrs | 79.5 |
| Llama 3 405B (dense) | 405B | 405B | ~30.8M H100-hrs | 86.1 |
| Mixtral 8x7B | 47B | 13B | ~1.2M H100-hrs | 70.6 |
| DeepSeek V3 (MoE) | 671B | 37B | ~2.8M H800-hrs | 87.1 |
Note: DeepSeek V3 achieves 405B-class quality at roughly 1/10 the training cost.

Router Design: How Tokens Choose Experts

The router (also called the gating network) is a small linear layer that takes a token's hidden state $h$ and produces logits over all experts:

$$g(h) = \text{softmax}(W_g \cdot h + b)$$

The top-$k$ experts by logit value are selected, and their outputs are weighted by the corresponding gate values.

Top-K Routing

The standard approach. Each token picks its top-$k$ experts:

```python
import torch
import torch.nn.functional as F

def top_k_routing(hidden, router_weight, k=8):
    logits = hidden @ router_weight.T             # [batch, num_experts]
    topk_vals, topk_ids = logits.topk(k, dim=-1)  # [batch, k]
    gates = F.softmax(topk_vals, dim=-1)          # normalize over the selected experts
    return topk_ids, gates
```

Problem: Popular experts get overwhelmed with tokens (load imbalance), while unpopular experts are underutilized.
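The imbalance is easy to quantify: count tokens per expert and compare the busiest expert to the uniform ideal (a hypothetical helper; `expert_ids` is the router output from the snippet above):

```python
import torch

def expert_load_stats(expert_ids, num_experts):
    # Tokens received per expert; imbalance = busiest expert / uniform ideal load
    counts = torch.bincount(expert_ids.flatten(), minlength=num_experts).float()
    return counts, (counts.max() / counts.mean()).item()
```

Perfectly balanced routing gives an imbalance of 1.0; all tokens landing on one of $N$ experts gives $N$.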

Expert Choice Routing

Inversion: instead of tokens choosing experts, experts choose tokens. Each expert picks its top-$C$ tokens from the batch. This guarantees perfect load balance by construction.
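A minimal sketch of this inversion (variable names are illustrative, not from any particular implementation):

```python
import torch

def expert_choice_routing(hidden, router_weight, capacity):
    # hidden: [num_tokens, d]; router_weight: [num_experts, d]
    probs = torch.softmax(hidden @ router_weight.T, dim=-1)  # [num_tokens, num_experts]
    # Each expert selects its top-`capacity` tokens: balance by construction
    gates, token_ids = probs.T.topk(capacity, dim=-1)        # [num_experts, capacity]
    return token_ids, gates
```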

Problem: Some tokens may not be selected by any expert (token dropping) or selected by too many (compute waste).

Group-Limited Gating (DeepSeek V3)

A hybrid: tokens still choose experts, but the candidate set is constrained. The routed experts are partitioned into groups (aligned with devices); each token first selects a few top-scoring groups, then picks its top-$k$ experts only within those groups. This bounds how many nodes a token's hidden state must be sent to, approximating balanced, communication-friendly routing without the rigidity of expert choice.
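One way to sketch group-limited selection (a simplification: group scoring rules vary, and scoring each group by its single best expert is assumed here):

```python
import torch

def group_limited_topk(scores, num_groups, topk_groups, k):
    # scores: [batch, num_experts]; experts split evenly into num_groups groups
    b, n = scores.shape
    grouped = scores.view(b, num_groups, n // num_groups)
    group_scores = grouped.max(dim=-1).values                    # [batch, num_groups]
    top_groups = group_scores.topk(topk_groups, dim=-1).indices  # [batch, topk_groups]
    keep = torch.zeros(b, num_groups, dtype=torch.bool)
    keep.scatter_(1, top_groups, True)
    # Experts outside the chosen groups are masked out before the final top-k
    masked = grouped.masked_fill(~keep.unsqueeze(-1), float("-inf")).view(b, n)
    return masked.topk(k, dim=-1)
```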

Load Balancing: The Central Challenge

If all tokens route to the same 2 experts, the other 254 sit idle. Throughput drops to single-expert speed. Load balancing is the most important MoE systems challenge.

Auxiliary Loss Approach (Switch Transformer, GShard)

Add a differentiable loss term that penalizes imbalanced routing:

$$L_{\text{balance}} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \cdot P_i$$

where $f_i$ is the fraction of tokens routed to expert $i$ and $P_i$ is the average router probability for expert $i$. This loss pushes the router toward a uniform distribution.
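The loss fits in a few lines (a sketch; the function name and the `alpha` default are illustrative — Switch Transformer used values around 0.01):

```python
import torch

def load_balance_loss(router_probs, expert_ids, alpha=0.01):
    # router_probs: [tokens, num_experts], full softmax over experts
    # expert_ids:   [tokens, k], experts actually selected per token
    num_experts = router_probs.size(-1)
    one_hot = torch.zeros_like(router_probs).scatter_(1, expert_ids, 1.0)
    f = one_hot.mean(dim=0)        # f_i: fraction of tokens routed to expert i
    p = router_probs.mean(dim=0)   # P_i: mean router probability for expert i
    return alpha * num_experts * (f * p).sum()
```

Because `f` is computed from hard assignments, the gradient flows only through `p`, which is what nudges the router toward uniformity.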

The problem: This loss directly competes with the language modeling objective. The router must balance between “send this token to the expert that would process it best” and “send it somewhere underutilized.” This tension reduces final model quality by 0.1-0.3% on benchmarks.

Auxiliary-Loss-Free Balancing (DeepSeek V3)

Instead of a loss term, add per-expert bias terms $b_i$ to the router logits (in DeepSeek V3 the bias influences only which experts are selected; the gate weights applied to expert outputs still come from the unbiased scores):

$$g_i = \text{softmax}(\text{logit}_i + b_i)$$

The biases are not trained by gradient descent. Instead, a simple control rule:

  • If expert $i$ is overloaded: decrease $b_i$
  • If expert $i$ is underloaded: increase $b_i$

This is a control system operating alongside training, with zero gradient interference. The result: perfect load balance with no quality compromise.
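The control rule fits in a few lines (a sketch; the fixed step size, called the bias update speed in the DeepSeek V3 paper, is a tunable hyperparameter):

```python
def update_router_biases(bias, tokens_per_expert, step=0.001):
    # Nudge each expert's bias toward the mean load: overloaded down, underloaded up
    mean_load = sum(tokens_per_expert) / len(tokens_per_expert)
    for i, load in enumerate(tokens_per_expert):
        if load > mean_load:
            bias[i] -= step
        elif load < mean_load:
            bias[i] += step
    return bias
```

Note that no gradients touch `bias`: it is pure feedback control on observed expert loads, run once per batch or step.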

Why Loss-Free Balancing Is a Big Deal

At frontier scale, 0.1% quality improvement matters — it can be the difference between passing and failing a benchmark threshold. By eliminating the auxiliary loss, DeepSeek V3 gets both perfect balance AND maximum quality. This is one of their most impactful and easily portable innovations.

Fine-Grained Experts: 256 with Top-8

Why Not 8 Experts with Top-2?

Mixtral uses 8 experts with top-2 routing: $\binom{8}{2} = 28$ possible combinations per token. DeepSeek V3 uses 256 routed experts with top-8: $\binom{256}{8} \approx 4 \times 10^{14}$ combinations.

More combinations enables finer-grained specialization. Each expert can focus on a narrow knowledge domain (specific language patterns, code syntax, mathematical reasoning) because the combinatorial space is rich enough that tokens always find a relevant subset.

Shared Experts

One expert processes every token regardless of routing. This shared expert captures universal knowledge — function words, basic syntax, common patterns — preventing the routed experts from wasting capacity on things every token needs.

```python
import torch

def moe_forward(hidden, shared_expert, routed_experts, router, k=8):
    # Shared expert: processes every token
    shared_output = shared_expert(hidden)

    # Routing: expert_ids [tokens, k], gates [tokens, k]
    expert_ids, gates = router(hidden, k=k)

    # Routed experts: each expert processes only the tokens assigned to it
    routed_output = torch.zeros_like(hidden)
    for e, expert in enumerate(routed_experts):
        token_idx, slot = (expert_ids == e).nonzero(as_tuple=True)
        if token_idx.numel() == 0:
            continue  # this expert received no tokens in the batch
        expert_out = expert(hidden[token_idx])
        routed_output[token_idx] += gates[token_idx, slot].unsqueeze(-1) * expert_out

    return shared_output + routed_output
```

Expert Parallelism: All-to-All Communication

With 256 experts across 64 GPUs (~4 experts per GPU), tokens must be dispatched to the GPU holding their selected experts, then combined back:

Dispatch Phase

  1. Router determines which experts each token needs
  2. Token hidden states are sent to the GPUs holding those experts
  3. This is an all-to-all communication pattern — every GPU potentially sends to every other GPU

Combine Phase

  1. Each expert processes its received tokens
  2. Expert outputs are sent back to the original GPUs
  3. Gated outputs are aggregated

Communication Volume

For batch size $B$, hidden dim $D$, top-$k$ routing, and FP16 activations:

$$\text{Dispatch bytes} = B \times k \times D \times 2 \quad \text{(each token sent to } k \text{ experts)}$$

For $B = 4096$, $k = 8$, $D = 7168$ (DeepSeek V3): $4096 \times 8 \times 7168 \times 2 \approx 448$ MiB per layer.
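As a quick check in code:

```python
def dispatch_bytes(batch, k, hidden_dim, bytes_per_elem=2):
    # Each token's hidden state is sent once per selected expert (FP16 = 2 bytes)
    return batch * k * hidden_dim * bytes_per_elem

print(dispatch_bytes(4096, 8, 7168) / 2**20)  # 448.0 (MiB per layer)
```

And this is dispatch only: the combine phase sends a comparable volume back the other way.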

DeepEP: Optimized MoE Communication

DeepSeek’s DeepEP library provides two kernel types:

High-throughput kernels: Exploit asymmetric NVLink/RDMA bandwidth. NVLink intra-node: ~160 GB/s. RDMA inter-node: ~50 GB/s. The kernels schedule NVLink and RDMA transfers simultaneously, achieving near-peak utilization on both.

Low-latency kernels: Use pure RDMA without consuming any streaming multiprocessors. This is critical: if all SMs are busy with expert computation, communication kernels can’t launch. By using RDMA-only communication, the SMs remain free for compute, enabling true computation-communication overlap.


DeepEP Performance (H800, 8 nodes)

| Operation | NVLink BW | RDMA BW | Latency |
|---|---|---|---|
| High-throughput dispatch | 153-158 GB/s | 43-58 GB/s | ~2.1 ms |
| High-throughput combine | 153-158 GB/s | 43-58 GB/s | ~1.8 ms |
| Low-latency dispatch | N/A (RDMA only) | ~50 GB/s | 77 µs |
| Low-latency combine | N/A (RDMA only) | ~50 GB/s | 68 µs |
Note: Low-latency kernels use zero SMs, enabling full overlap with expert computation.

Expert Offloading: MoE on Consumer Hardware

When 256 experts don’t fit in GPU memory (e.g., serving on a single GPU), expert offloading moves inactive experts to CPU or SSD:

  1. Prediction: The router determines which experts are needed 1-2 layers ahead
  2. Prefetch: Predicted experts are loaded from CPU/SSD to GPU asynchronously
  3. Compute: Active experts process tokens while next experts are loading
  4. Eviction: Used experts are evicted to make room

This enables serving Mixtral 8x7B on a single 24GB GPU (only 2 experts active at once = ~14B parameters in GPU memory).

MoE vs Dense at Different Scales

Quality (MMLU) per training FLOP:

  • 7B dense: 62
  • Mixtral 8x7B (MoE, 13B activated / 47B total): 71
  • 70B dense: 80
  • DeepSeek V3 (MoE, 37B activated / 671B total): 87
  • 405B dense: 86

At small scale (activated params under 7B): dense often wins. Routing overhead (all-to-all communication, gating computation) dominates. Expert specialization doesn’t emerge with too few experts.

At medium scale (7-70B activated): MoE starts winning on quality-per-FLOP. Mixtral 8x7B (13B activated) approaches 70B dense quality.

At large scale (37B+ activated): MoE is the only practical path to frontier quality. DeepSeek V3 demonstrates that 671B MoE matches 405B dense at 1/10 the training cost.

Serving Challenges

MoE models present unique serving difficulties:

  1. Memory: All 256 experts must be accessible, even though only 8 fire per token. Total parameter footprint is 671B regardless of activation sparsity.

  2. Latency: All-to-all communication adds latency at every MoE layer. With the low-latency kernels, inter-node RDMA costs ~77 µs per dispatch plus ~68 µs per combine, roughly 145 µs per layer, which sums to around 9 ms per forward pass across ~60 MoE layers at DeepSeek V3 scale. This is significant for latency-sensitive applications.

  3. Load imbalance at serving time: Unlike training where batches are large, serving requests arrive unpredictably. Some experts may be overloaded while others idle. Dynamic batch sizes exacerbate this.

  4. Expert parallelism overhead: Compared to tensor parallelism (2 all-reduces per layer at ~10us each on NVLink), expert parallelism requires all-to-all communication that’s harder to overlap.

⚠️ The Serving Tradeoff

MoE models offer better quality-per-FLOP for training, but worse latency-per-token for serving compared to dense models of the same activated size. Choose MoE when training cost dominates your total cost (most scenarios). Choose dense when serving latency is the binding constraint and you can afford the training budget.

When to Choose MoE vs Dense


MoE vs Dense Decision Framework

| Factor | Favors MoE | Favors Dense |
|---|---|---|
| Training budget | Limited compute budget | Unlimited compute |
| Model quality target | Frontier quality needed | Moderate quality sufficient |
| Serving latency | Throughput-optimized | Latency-critical (p99 matters) |
| Infrastructure | Multi-node with fast interconnect | Single-node serving |
| Model size | 100B+ total params beneficial | Under 70B total is fine |
| Complexity tolerance | Can manage routing, EP, load balancing | Want simple TP/DP |

The trend is clear: as models grow, MoE becomes increasingly dominant. Nearly every frontier model in 2025 uses MoE (DeepSeek V3, Mixtral, Grok, Gemini). Dense models like Llama 3 persist primarily because Meta can afford the training compute.