Part of Series: Inference Optimization Timeline (45 of 60)
The Definitive Guide to Distributed Parallelism: Data, Tensor, Pipeline, Expert, and Sequence Parallelism for Large-Scale Training

Modern large language models do not fit on a single GPU. Llama 3 70B in FP16 requires roughly 140 GB of memory just for its parameters — and a single A100 has 80 GB. A 405B model needs over 800 GB. Training is even worse: optimizer states, gradients, and activations can multiply the memory footprint by 4-16x. You must distribute the work across multiple GPUs.

But how you split the work matters enormously. A naive approach can leave most GPUs idle, drown the interconnect in communication, or introduce latency-killing pipeline bubbles. This guide covers every major parallelism strategy used in production today — from the simplest (data parallelism) to the most exotic (expert parallelism, bidirectional pipelines) — and shows you how to combine them for large-scale training runs across thousands of GPUs.

Why Parallelism Is Unavoidable

Let us start with the arithmetic that forces our hand.

📊

Memory Requirements for Modern LLMs (FP16 Parameters Only)

| Model | Parameters | FP16 Weight Memory | A100 80GB GPUs Needed (weights only) |
| --- | --- | --- | --- |
| Llama 3 8B | 8B | 16 GB | 1 |
| Llama 3 70B | 70B | 140 GB | 2 |
| Llama 3 405B | 405B | 810 GB | 11 |
| GPT-4 (estimated) | ~1.8T MoE | ~3.6 TB | 45+ |
| DeepSeek V3 | 671B MoE (37B active) | ~1.3 TB | 17+ |

Note: This is parameters only. Training adds optimizer states (2x for Adam), gradients (1x), and activations (variable). A 70B model at FP16 needs ~1 TB total during training.

For training, the memory multiplier is brutal. With Adam optimizer in mixed precision, each parameter costs:

  • 2 bytes for the FP16 parameter
  • 2 bytes for the FP16 gradient
  • 4 bytes for the FP32 master weight (for numerical stability)
  • 4 bytes for the FP32 first moment (Adam)
  • 4 bytes for the FP32 second moment (Adam)

That is 16 bytes per parameter. A 70B model needs 1.12 TB just for parameters + optimizer + gradients — before a single activation is stored. This is 14 A100s worth of memory, minimum.
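The byte counts above translate directly into a back-of-envelope calculator. A minimal sketch (function name is mine):

```python
def training_memory_gb(num_params: float) -> dict:
    """Per-component memory (GB) for FP16 mixed-precision training with Adam."""
    bytes_per_param = {
        "fp16_weights": 2,   # forward/backward copy
        "fp16_grads": 2,     # gradient buffer
        "fp32_master": 4,    # master weights for numerical stability
        "fp32_adam_m": 4,    # Adam first moment
        "fp32_adam_v": 4,    # Adam second moment
    }
    gb = {k: v * num_params / 1e9 for k, v in bytes_per_param.items()}
    gb["total"] = sum(gb.values())
    return gb

mem = training_memory_gb(70e9)
print(f"70B model: {mem['total']:.0f} GB before activations")  # 1120 GB = 1.12 TB
```

Dividing the total by 80 GB per A100 reproduces the "14 GPUs minimum" figure.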

⚠️ The Memory Wall

A single GPU cannot hold a 70B model for training. Even for inference, FP16 weights alone exceed one A100’s capacity. Parallelism is not optional — it is a hard requirement imposed by physics and economics.

The question is never “should I parallelize?” but rather “which combination of parallelism strategies minimizes total training time while fitting within my hardware constraints?”

Data Parallelism: The Foundation

Data parallelism is the simplest and most widely used strategy. The idea: replicate the entire model on every GPU, split the training data into shards, and have each GPU process a different mini-batch. After computing gradients locally, GPUs synchronize gradients so every replica takes the same optimizer step.

Distributed Data Parallel (DDP)

PyTorch’s DistributedDataParallel (DDP) is the workhorse of data parallelism. Each GPU (each “rank”) holds a complete copy of the model and processes a different portion of the mini-batch.

The training loop looks like:

  1. Each GPU receives a different mini-batch of data
  2. Each GPU runs forward pass independently
  3. Each GPU computes gradients independently
  4. All-reduce across all GPUs to average the gradients
  5. Each GPU applies the same optimizer step (parameters stay in sync)

The critical communication primitive is all-reduce: every GPU contributes its gradient tensor, and every GPU receives the sum (or average) of all contributions. Conceptually, each GPU sends M bytes and receives M bytes, where M is the size of the model’s gradients.

In practice, all-reduce is implemented as a ring all-reduce or tree all-reduce that requires each GPU to send and receive 2M·(N−1)/N bytes total (where N is the number of GPUs). For large N, this approaches 2M — the volume is independent of GPU count.
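The ring all-reduce volume is worth sanity-checking numerically. A small sketch (function name is mine):

```python
def ring_allreduce_bytes_per_gpu(grad_bytes: float, n_gpus: int) -> float:
    """Bytes each GPU sends (and receives) in a ring all-reduce:
    N-1 reduce-scatter steps plus N-1 all-gather steps, each moving M/N bytes."""
    return 2 * grad_bytes * (n_gpus - 1) / n_gpus

M = 140e9  # FP16 gradients of a 70B model, in bytes
for n in (2, 8, 64):
    print(n, "GPUs:", ring_allreduce_bytes_per_gpu(M, n) / 1e9, "GB per GPU")
# the per-GPU volume approaches 2M = 280 GB as N grows
```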

DDP Memory Layout per GPU

Every GPU holds a full copy of everything. N GPUs = N copies of the model.

| Component | Size | Notes |
| --- | --- | --- |
| FP16 parameters (full copy) | 2 bytes × num_params | Identical across all GPUs — redundant |
| FP16 gradients (full copy) | 2 bytes × num_params | Computed locally, then all-reduced |
| FP32 optimizer states (full copy) | 12 bytes × num_params (Adam: master weights + m + v) | Identical across all GPUs — massively redundant |
| Activations | Varies with batch size and sequence length | Unique per GPU (different data shards) |

The problem with DDP is obvious from the memory layout: every GPU holds a complete copy of all parameters, gradients, and optimizer states. With N GPUs, you have N redundant copies of the optimizer states alone. For a 70B model, that is N × 840 GB of optimizer state across the cluster — an enormous waste.

ℹ️ DDP Communication Pattern

DDP uses all-reduce for gradient synchronization. The total communication volume per step is approximately 2M bytes (where M is gradient size), independent of the number of GPUs. DDP overlaps communication with backward computation by bucketing gradients and launching all-reduce as soon as each bucket is ready.

DDP works well when the model fits on a single GPU. You get near-linear throughput scaling because the only overhead is gradient all-reduce, which overlaps with computation. But for models that exceed single-GPU memory, DDP is impossible — you cannot even load the model.

ZeRO: Eliminating Redundancy

The Zero Redundancy Optimizer (ZeRO), introduced by Microsoft DeepSpeed, attacks the memory redundancy in DDP. The key insight: if every GPU holds identical optimizer states, gradients, and parameters, why not partition them across GPUs so each GPU holds only 1/N of the total?

ZeRO comes in three stages, each eliminating more redundancy:

ZeRO Stage 1 (ZeRO-1): Partition Optimizer States

Instead of every GPU holding all of Adam’s m and v tensors, each GPU is responsible for 1/N of the optimizer states. After all-reduce of gradients, each GPU updates only its assigned parameter partition using its local optimizer states.

Memory per GPU for a model with P parameters:

\text{ZeRO-1 memory} = 2P + 2P + \frac{12P}{N} = 4P + \frac{12P}{N} \text{ bytes}

For N = 64 GPUs and a 70B model: 4 × 70B + (12 × 70B)/64 = 280 GB + 13.1 GB = 293 GB. Still too large for one GPU, but the optimizer state dropped from 840 GB to 13.1 GB per GPU.

ZeRO Stage 2 (ZeRO-2): Partition Gradients + Optimizer States

Each GPU holds only 1/N of the gradients in addition to 1/N of the optimizer states. Instead of all-reduce (which leaves a full gradient copy on every GPU), ZeRO-2 uses reduce-scatter: each GPU receives only the gradient shard it is responsible for.

\text{ZeRO-2 memory} = 2P + \frac{2P}{N} + \frac{12P}{N} = 2P + \frac{14P}{N} \text{ bytes}

For N = 64 and 70B: 140 GB + 15.3 GB = 155 GB. Getting closer, but it still does not fit on one A100.

ZeRO Stage 3 (ZeRO-3): Partition Everything

The parameters themselves are also partitioned. Each GPU holds only 1/N of the parameters, gradients, and optimizer states. When a layer needs the full parameter tensor for forward or backward computation, it performs an all-gather to collect all shards, computes, and then discards the non-local shards.

\text{ZeRO-3 memory} = \frac{2P}{N} + \frac{2P}{N} + \frac{12P}{N} = \frac{16P}{N} \text{ bytes}

For N = 64 and 70B: (16 × 70B)/64 = 17.5 GB per GPU. Now it fits comfortably on an A100.
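All three stage formulas fold into one helper. A sketch under the 2/2/12-byte accounting above (names are mine; stage 0 denotes plain DDP):

```python
def zero_memory_gb(params: float, n_gpus: int, stage: int) -> float:
    """Per-GPU memory (GB): FP16 params (2B) + FP16 grads (2B) + FP32 Adam (12B)."""
    p = params / 1e9  # bytes-per-param constants below then yield GB directly
    shard = lambda x: x / n_gpus
    if stage == 0:  # plain DDP: everything replicated
        return (2 + 2 + 12) * p
    if stage == 1:  # optimizer states sharded
        return (2 + 2) * p + shard(12 * p)
    if stage == 2:  # gradients + optimizer states sharded
        return 2 * p + shard((2 + 12) * p)
    if stage == 3:  # everything sharded
        return shard((2 + 2 + 12) * p)
    raise ValueError("stage must be 0-3")

for s in range(4):
    print(f"stage {s}: {zero_memory_gb(70e9, 64, s):.1f} GB/GPU")
```

Running it for a 70B model on 64 GPUs reproduces the table below (1120, 293, 155, and 17.5 GB).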

📊

ZeRO Memory Savings per GPU (70B Model, FP16 + Adam, N=64 GPUs)

| Strategy | Params (GB) | Gradients (GB) | Optimizer (GB) | Total (GB) | Fits on A100 80GB? |
| --- | --- | --- | --- | --- | --- |
| DDP (no ZeRO) | 140 | 140 | 840 | 1120 | No |
| ZeRO-1 | 140 | 140 | 13.1 | 293 | No |
| ZeRO-2 | 140 | 2.2 | 13.1 | 155 | No |
| ZeRO-3 | 2.2 | 2.2 | 13.1 | 17.5 | Yes |

Note: Excludes activation memory, which depends on batch size, sequence length, and activation checkpointing. ZeRO-3 numbers are approximate — the all-gathered parameters are transiently in memory during forward/backward.
ZeRO Communication Trade-offs

ZeRO-1 has the same communication volume as DDP (one all-reduce per step). ZeRO-2 replaces all-reduce with reduce-scatter (half the volume: M instead of 2M). ZeRO-3 adds all-gather operations for the forward and backward passes, bringing total communication to roughly 3M — about 1.5x DDP. The memory savings come at a communication cost.

FSDP: PyTorch’s Native ZeRO-3

Fully Sharded Data Parallelism (FSDP) is PyTorch’s native implementation of ZeRO-3. The semantics are identical: parameters are sharded across GPUs, all-gathered on demand for computation, and discarded after use.

FSDP’s communication pattern per training step:

  1. Forward pass: For each layer, all-gather parameter shards from all GPUs, compute the forward pass, then discard non-local shards
  2. Backward pass: For each layer, all-gather parameters again (or recompute via activation checkpointing), compute gradients, then reduce-scatter to distribute gradient shards
  3. Optimizer step: Each GPU updates only its local parameter shard using its local gradient shard and optimizer states

The communication primitives:

  • All-gather: Each GPU broadcasts its shard; every GPU receives the full tensor. Volume per GPU: M·(N−1)/N sent, M·(N−1)/N received.
  • Reduce-scatter: Each GPU sends partial gradients; each receives its reduced shard. Volume per GPU: M·(N−1)/N sent, M/N received.

Total communication volume per step for FSDP/ZeRO-3 is approximately 3M — about 1.5x that of DDP’s 2M all-reduce. The trade-off is dramatically lower memory usage.

Communication Volume Comparison (Normalized to DDP)

| Metric (relative to DDP) | DDP | ZeRO-1 | ZeRO-2 | ZeRO-3 / FSDP |
| --- | --- | --- | --- | --- |
| Communication volume | 1 | 1 | 0.5 | 1.5 |
| Memory usage | 1 | 0.26 | 0.14 | 0.016 |

When to Use DDP vs. FSDP/ZeRO

  • Model fits on one GPU: Use DDP. It has the lowest communication overhead and simplest implementation.
  • Model does not fit on one GPU: Use FSDP/ZeRO-3. The extra communication is unavoidable.
  • Model barely fits, but optimizer states blow memory: Use ZeRO-1 or ZeRO-2 for a middle ground.

Tensor Parallelism: Splitting Layers Across GPUs

Data parallelism replicates the model and splits data. Tensor parallelism does the opposite at the matrix level: it splits individual weight matrices across GPUs, so each GPU computes a portion of every layer’s output.

The Megatron-LM Column/Row Strategy

The breakthrough insight from the Megatron-LM paper is that you can split the two linear layers in a transformer’s FFN block in a complementary way that minimizes communication.

A standard transformer FFN block computes:

\text{FFN}(x) = W_{\text{down}} \cdot \sigma(W_{\text{up}} \cdot x)

where W_{\text{up}} \in \mathbb{R}^{H \times 4H} and W_{\text{down}} \in \mathbb{R}^{4H \times H} (with H being the hidden dimension).

Megatron-LM splits this across N GPUs as follows:

First linear (column-parallel): Split W_{\text{up}} along its output dimension (columns). Each GPU i holds W_{\text{up}}^{(i)} \in \mathbb{R}^{H \times 4H/N} and computes:

Y_i = x \cdot W_{\text{up}}^{(i)} \quad \text{(no communication needed)}

Activation function: Apply \sigma (SiLU/GELU) element-wise on each GPU’s local Y_i. Since the activation is element-wise, no communication is needed.

Second linear (row-parallel): Split W_{\text{down}} along its input dimension (rows). Each GPU i holds W_{\text{down}}^{(i)} \in \mathbb{R}^{4H/N \times H} and computes:

Z_i = \sigma(Y_i) \cdot W_{\text{down}}^{(i)}

Now Z_i is a partial sum — each GPU has computed part of the final output. An all-reduce sums these partials:

Z = \sum_{i=1}^{N} Z_i = \text{all\_reduce}(Z_i)

Megatron-LM Column/Row Parallel Strategy

FFN split: first linear is column-parallel (split output dim), second is row-parallel (split input dim)

  • Column-parallel linear (FFN up-project): weight split along the output dimension, W = [W_1 | W_2 | ... | W_N]. Each GPU computes Y_i = X @ W_i (no communication needed!)
  • Activation (SiLU/GELU): applied locally on each GPU's partition. No communication — element-wise on local data.
  • Row-parallel linear (FFN down-project): weight split along the input dimension, W = [W_1; W_2; ...; W_N]. Each GPU computes Z_i = Y_i @ W_i, then ALL-REDUCE to sum the partials.
Why This Split Pattern?

Column-parallel followed by row-parallel means the activation function is computed on already-partitioned data — no all-reduce needed between the two linears. The only synchronization point is after the row-parallel matmul, where partial sums from each GPU are combined via all-reduce. This halves the number of all-reduces compared to a naive split where you would need to gather outputs after every linear layer.
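That this split really needs only one all-reduce can be checked numerically. A toy NumPy sketch, simulating the N shards on a single process:

```python
import numpy as np

rng = np.random.default_rng(0)
H, N = 16, 4                        # hidden dim, simulated TP degree
x = rng.standard_normal((2, H))     # a tiny batch of activations
W_up = rng.standard_normal((H, 4 * H))
W_down = rng.standard_normal((4 * H, H))
silu = lambda y: y / (1 + np.exp(-y))  # SiLU = y * sigmoid(y)

# Reference: unsharded FFN
ref = silu(x @ W_up) @ W_down

# Sharded: column-split W_up, row-split W_down, sum partials ("all-reduce")
up_shards = np.split(W_up, N, axis=1)      # column-parallel
down_shards = np.split(W_down, N, axis=0)  # row-parallel
partials = [silu(x @ u) @ d for u, d in zip(up_shards, down_shards)]
out = sum(partials)

assert np.allclose(ref, out)  # one all-reduce reproduces the dense result
```

The element-wise activation is what makes this work: it commutes with the column split, so no gather is needed between the two matmuls.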

Attention Head Partitioning

For multi-head attention, tensor parallelism splits attention heads across GPUs. Each GPU handles num_heads/N query heads and the corresponding KV heads.

📊

Attention Head Distribution (Llama 3 70B: 64 Q heads, 8 KV heads)

| TP Degree | Q Heads per GPU | KV Heads per GPU | Notes |
| --- | --- | --- | --- |
| TP=1 | 64 | 8 | All heads on one GPU |
| TP=2 | 32 | 4 | Clean split |
| TP=4 | 16 | 2 | Clean split |
| TP=8 | 8 | 1 | Minimum: 1 KV head per GPU |

Note: The TP degree must divide the number of KV heads evenly. Llama 3 70B with 8 KV heads supports TP up to 8. Beyond that, KV heads must be replicated.

After the attention output projection (which is also split as a row-parallel linear), another all-reduce combines the partial results. So each transformer layer requires 2 all-reduce operations in the forward pass:

  1. One after the attention output projection
  2. One after the FFN down-projection

And correspondingly, 2 all-reduce operations in the backward pass, for a total of 4 all-reduces per layer per training step.

Communication Cost Analysis

📊

TP Communication per Transformer Layer (Llama 3 70B, FP16)

| Component | All-Reduces per Direction | Data per All-Reduce | Total per Layer |
| --- | --- | --- | --- |
| Attention output projection | 1 | B × L × 8192 × 2 bytes | B × L × 16 KB |
| FFN down-projection | 1 | B × L × 8192 × 2 bytes | B × L × 16 KB |
| Total per layer (forward) | 2 | -- | B × L × 32 KB |
| Total per model (80 layers, forward + backward) | 320 | -- | B × L × 5 MB |

Note: B = batch, L = seq_len. At B=32, L=1 (decode): 160 KB per layer, 12.8 MB total per step.

The real question is: how long do these all-reduces take relative to the compute?

For decode (L=1, B=32): 12.8 MB total communication per step. On A100 NVLink at 600 GB/s, this is about 21 microseconds — utterly negligible compared to the milliseconds of compute.

For prefill (L=2048, B=4): roughly 1.3 GB total. On NVLink 3.0 at 600 GB/s: about 2 ms. On PCIe Gen4 at 32 GB/s: about 40 ms. This is why tensor parallelism requires NVLink.

TP All-Reduce Overhead by Interconnect (Llama 3 70B, Prefill 2048 Tokens)

| Interconnect | All-Reduce Time | Verdict |
| --- | --- | --- |
| NVLink 4.0 (H100, 900 GB/s) | 1.4 ms | Negligible |
| NVLink 3.0 (A100, 600 GB/s) | 2.1 ms | Acceptable |
| PCIe Gen5 (64 GB/s) | 20 ms | Noticeable |
| PCIe Gen4 (32 GB/s) | 40 ms | Bottleneck |

TP Scaling Efficiency

📊

Tensor Parallelism Scaling (Llama 3 70B, A100 NVLink)

| TP Degree | Memory per GPU | Throughput (tok/s) | Scaling Efficiency | Notes |
| --- | --- | --- | --- | --- |
| TP=1 | 140+ GB (does not fit) | -- | -- | Requires model parallelism |
| TP=2 | ~75 GB | 680 | Baseline | Fits on 80 GB A100 |
| TP=4 | ~40 GB | 1250 | 92% | Good scaling |
| TP=8 | ~22 GB | 2200 | 81% | Communication starts to matter |

Note: Beyond TP=8, efficiency drops sharply. Each GPU does less compute but communication volume per all-reduce stays constant.

The practical limit for tensor parallelism is TP=8 — which conveniently matches the number of GPUs connected by NVLink within a single node (DGX A100 or DGX H100). Going beyond TP=8 would require crossing the node boundary, where bandwidth drops from 600+ GB/s (NVLink) to 50-100 GB/s (InfiniBand), making TP communication a severe bottleneck.

⚠️ NVLink Is Non-Negotiable for TP

At PCIe bandwidths (~50 GB/s), tensor parallel communication can easily consume 30-50% of total step time for large prefills. NVLink at 600-900 GB/s brings this down to 2-5%. If you do not have NVLink, do not use tensor parallelism beyond TP=2 — prefer pipeline parallelism instead.

Implementation Pattern

# Simplified Megatron-LM style TP for FFN
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributed as dist

class TensorParallelFFN(nn.Module):
    def __init__(self, d_model, d_ff, tp_rank, tp_world):
        super().__init__()
        # Column-parallel: each rank holds a 1/tp_world slice of the output dim
        self.up_proj = nn.Linear(d_model, d_ff // tp_world, bias=False)
        # Row-parallel: each rank holds a 1/tp_world slice of the input dim
        self.down_proj = nn.Linear(d_ff // tp_world, d_model, bias=False)
        self.tp_rank = tp_rank    # used when loading this rank's weight shard
        self.tp_world = tp_world

    def forward(self, x):
        h = F.silu(self.up_proj(x))  # local compute, no communication
        out = self.down_proj(h)      # local partial result
        dist.all_reduce(out)         # sum partials across the TP group
        return out

Time Breakdown: TP=4 Llama 3 70B Decode Step (A100 NVLink)

| Component | Time |
| --- | --- |
| Local GEMM compute | 3.8 ms |
| All-reduce communication | 0.6 ms |
| Other (norm, activation, sampling) | 0.4 ms |
| Total (87% compute efficiency) | 4.8 ms |

Pipeline Parallelism: Splitting Layers in Sequence

Pipeline parallelism (PP) takes a different approach: instead of splitting individual layers, it assigns entire groups of layers to different GPUs. GPU 0 gets layers 0-19, GPU 1 gets layers 20-39, GPU 2 gets layers 40-59, GPU 3 gets layers 60-79, and so on.

The appeal is that inter-stage communication is just passing activation tensors at the boundary — a point-to-point send/receive, not a collective all-reduce. This makes PP much more tolerant of lower-bandwidth interconnects like InfiniBand or even Ethernet, making it the natural choice for inter-node parallelism.

The Bubble Problem

The fundamental challenge with pipeline parallelism is pipeline bubbles. In a naive implementation:

  1. GPU 0 computes forward pass for micro-batch 1
  2. GPU 0 sends activations to GPU 1; GPU 0 is now idle
  3. GPU 1 computes forward on those activations; GPU 0 is still idle
  4. … and so on through the pipeline
  5. The backward pass propagates back in reverse

With P pipeline stages and M micro-batches, the bubble fraction (the fraction of time GPUs are idle) is:

\text{bubble\_fraction} = \frac{P - 1}{M + P - 1}

For P = 4 stages and M = 1 micro-batch, the bubble fraction is 3/4 = 75% — three out of four GPUs are idle at any given time. This is catastrophic.

Pipeline Bubble Fraction vs. Number of Micro-batches (P=4 Stages)

| Micro-batches (M) | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Bubble fraction | 0.75 | 0.60 | 0.43 | 0.27 | 0.16 | 0.09 | 0.05 |

The solution is to split the mini-batch into many micro-batches and pipeline them through the stages. With enough micro-batches, the bubbles become a small fraction of total time.
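The bubble formula is simple enough to script. A sketch (function name is mine):

```python
def bubble_fraction(stages: int, micro_batches: int) -> float:
    """Idle fraction of a naive pipeline: (P - 1) / (M + P - 1)."""
    return (stages - 1) / (micro_batches + stages - 1)

for m in (1, 4, 16, 64):
    print(f"P=4, M={m:2d}: {bubble_fraction(4, m):.0%} idle")
# M=1 gives 75% idle; M=64 brings it down to about 4%
```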

1F1B Schedule

The 1F1B (one-forward-one-backward) schedule, used in PipeDream and Megatron-LM, interleaves forward and backward passes of different micro-batches to reduce memory and improve efficiency.

The schedule works as follows:

  1. Warmup phase: Each stage performs forward passes for a sufficient number of micro-batches to fill the pipeline
  2. Steady state: Each stage alternates — one forward pass, one backward pass (1F1B)
  3. Cooldown phase: Remaining backward passes drain the pipeline

The advantage of 1F1B over the naive “all-forward-then-all-backward” approach: it limits the number of in-flight micro-batches, so each GPU only needs to store activations for P micro-batches (not M), dramatically reducing activation memory.

The bubble fraction with 1F1B is still:

\text{bubble\_fraction} = \frac{P - 1}{M}

(where M is the total number of micro-batches). To keep bubbles below 5%, you need M ≥ 20(P−1). For P = 4, that means M ≥ 60 micro-batches — which requires a very large global batch size.

Interleaved Pipeline (Megatron-LM v2)

Megatron-LM introduced an interleaved pipeline schedule where each GPU holds multiple non-contiguous groups of layers (called virtual stages). Instead of GPU 0 holding layers 0-19 contiguously, it might hold layers 0-4 and layers 40-44.

With V virtual stages per GPU, the bubble fraction becomes:

\text{bubble\_fraction} = \frac{P - 1}{V \cdot M}

Doubling V halves the bubble. The trade-off: more pipeline stages means more inter-stage communication (more activation transfers).

📊

Pipeline Bubble Fraction for Different Schedules (P=4, M=16)

| Schedule | Virtual Stages | Bubble Fraction | Inter-stage Comms per Step |
| --- | --- | --- | --- |
| Naive (all-forward-all-backward) | 1 | 18.8% | Low |
| 1F1B | 1 | 18.8% | Low |
| Interleaved (V=2) | 2 | 9.4% | 2x baseline |
| Interleaved (V=4) | 4 | 4.7% | 4x baseline |

Note: Interleaved schedules trade communication for less bubble. The extra communication is point-to-point and can often be overlapped.

DualPipe: Bidirectional Pipeline (DeepSeek V3)

DeepSeek V3 introduced DualPipe, a bidirectional pipeline schedule that processes micro-batches from both ends of the pipeline simultaneously. One stream flows forward through stages 0 to P−1, while another flows backward from stage P−1 to 0.

The key insight: while a forward-flowing micro-batch is in its backward pass on GPU i, a backward-flowing micro-batch can be in its forward pass on that same GPU. The computation from both streams fills what would otherwise be bubble time.

DualPipe achieves near-zero bubble fraction without requiring the enormous batch sizes that traditional schedules need. The cost is implementation complexity and the requirement that the model’s computation is roughly symmetric across stages.

ℹ️ Pipeline Parallelism Communication

PP communication is point-to-point (send/receive between adjacent stages), not collective. Each inter-stage transfer sends one activation tensor of size B × L × H × 2 bytes (for FP16). For Llama 3 70B with H = 8192, B = 4, L = 2048: that is 128 MB per transfer. On InfiniBand at 400 Gb/s (50 GB/s), this takes about 2.5 ms — acceptable if overlapped with computation.

When Pipeline Parallelism Wins

PP is the strategy of choice when:

  • You need to scale beyond a single NVLink domain (beyond 8 GPUs)
  • Your interconnect bandwidth is moderate (InfiniBand, not NVLink)
  • You can afford large batch sizes (many micro-batches to fill the pipeline)
  • Latency is not the primary concern (bubbles add latency)

PP is a poor choice for:

  • Low-latency inference (bubbles kill time-to-first-token)
  • Small batch sizes (bubble fraction is too high)
  • Within a node where NVLink is available (TP is strictly better intra-node)

Expert Parallelism: Distributing MoE Experts

Mixture-of-Experts (MoE) models like DeepSeek V3 (671B parameters, 37B active per token) and Mixtral use sparsely-activated expert layers. Each token is routed to a small number of experts (typically 2 out of 64 or more), making compute per token much cheaper than a dense model of the same total size.

Expert parallelism (EP) is the natural parallelism strategy for MoE models: distribute experts across GPUs, so each GPU holds a subset of the total experts.

How Expert Parallelism Works

  1. Router: A lightweight gating network decides which experts each token should be sent to
  2. All-to-all dispatch: Tokens are sent to the GPUs that hold their assigned experts. This is an all-to-all communication pattern — fundamentally different from the all-reduce in TP or the point-to-point in PP
  3. Expert compute: Each GPU processes the tokens that were routed to its local experts
  4. All-to-all combine: Results are sent back to the originating GPUs
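The dispatch step can be simulated with a toy top-K router. A NumPy sketch on a single process (all sizes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T, E, K, N = 128, 64, 2, 8     # tokens, experts, top-k, GPUs
experts_per_gpu = E // N

logits = rng.standard_normal((T, E))        # router scores
topk = np.argsort(logits, axis=1)[:, -K:]   # top-K expert ids per token
owner = topk // experts_per_gpu             # which GPU hosts each chosen expert

# Tokens each GPU receives in the all-to-all dispatch
tokens_per_gpu = np.bincount(owner.ravel(), minlength=N)
print(tokens_per_gpu, "sum =", tokens_per_gpu.sum())  # sum == T * K
```

With random routing the load is roughly even; in a real MoE the router’s skew is exactly what the load-balancing losses and routing constraints fight against.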

Expert Parallel Layout (64 Experts, 8 GPUs)

Each GPU holds 8 experts. Tokens are dispatched to the correct GPU via all-to-all communication.

  • GPU 0 (experts 0-7): holds weights for 8 experts; receives tokens routed to experts 0-7 from ALL GPUs
  • GPU 1 (experts 8-15): holds weights for 8 experts; receives tokens routed to experts 8-15 from ALL GPUs
  • ...: each GPU holds 8 of 64 experts; load balancing is critical — uneven routing wastes compute
  • GPU 7 (experts 56-63): holds weights for 8 experts; receives tokens routed to experts 56-63 from ALL GPUs

All-to-All Communication

The all-to-all primitive is the defining characteristic of expert parallelism. Unlike all-reduce (where every GPU contributes and receives the same quantity of data), all-to-all has non-uniform traffic patterns: a GPU might send many tokens to one expert GPU and few to another, depending on the router’s decisions.

The communication volume depends on the number of tokens, expert assignment, and token hidden dimension. For a sequence of T tokens with hidden dimension H, each dispatched to K experts across N GPUs, the total all-to-all volume is approximately:

\text{All-to-all volume} \approx 2 \times T \times K \times H \times 2 \text{ bytes (dispatch + combine)}

This is substantial for large batch sizes. DeepSeek V3 uses EP=64 across 64 GPUs, meaning every expert lives on exactly one GPU. The all-to-all communication is distributed across the entire cluster.
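Plugging numbers into the volume formula, as a sketch (function name and example sizes are mine):

```python
def moe_alltoall_gb(tokens: int, top_k: int, hidden: int, fp_bytes: int = 2) -> float:
    """Approximate MoE all-to-all volume in GB: dispatch + combine."""
    return 2 * tokens * top_k * hidden * fp_bytes / 1e9

# e.g. a 16K-token batch, top-2 routing, H = 8192, FP16 activations
print(f"{moe_alltoall_gb(16_384, 2, 8192):.2f} GB per MoE layer")
```

Multiply by the number of MoE layers per forward pass and the cost of not overlapping this communication becomes clear.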

DeepEP: Optimized Expert Communication

DeepSeek developed DeepEP, an optimized communication library for MoE expert parallelism. Key innovations include:

  • Hook-based overlap: Communication is overlapped with computation by hooking into the forward/backward pass
  • Optimized dispatch/combine kernels: Custom CUDA kernels that fuse routing decisions with communication setup
  • Adaptive routing: Load balancing across experts to prevent communication hotspots
Expert Parallelism vs. Tensor Parallelism for MoE

For dense models, TP splits every layer across GPUs. For MoE models, EP is more efficient because only the expert layers need inter-GPU communication — the shared attention layers can use TP within a smaller group. DeepSeek V3 combines EP for expert layers with TP for attention layers.

When Expert Parallelism Wins

EP is the right choice when:

  • Your model uses MoE architecture
  • You have enough experts to distribute meaningfully (typically 8+ experts)
  • You can tolerate the all-to-all latency (or overlap it with compute)
  • Load balancing across experts is achievable (via auxiliary loss or routing constraints)

EP is a poor choice for:

  • Dense models (there are no experts to distribute)
  • Very small numbers of experts (the all-to-all overhead is not amortized)
  • Latency-critical inference with small batch sizes (all-to-all has high fixed overhead)

Context and Sequence Parallelism: Handling Long Sequences

When sequences become very long — 32K, 128K, or even 1M tokens — the memory consumed by attention’s KV cache and the quadratic cost of attention computation become bottlenecks that no other parallelism strategy addresses directly.

Context parallelism (CP) and sequence parallelism (SP) split the sequence dimension across GPUs.

Ring Attention

Ring Attention distributes the sequence across GPUs: GPU i holds tokens [i·L/N, (i+1)·L/N) of the sequence. Each GPU computes attention for its local query tokens against the full KV cache.

The trick: the KV cache is passed around in a ring topology. At each step:

  1. GPU $i$ computes attention between its local queries and the current KV block
  2. GPU $i$ sends its KV block to GPU $(i+1) \bmod N$ and receives from GPU $(i-1) \bmod N$
  3. Repeat for $N$ steps until every GPU has seen all KV blocks

The key property: KV transfer between adjacent GPUs can be overlapped with the attention computation on the current KV block. If the compute is slower than the communication (which it usually is for long sequences), the ring communication is fully hidden.
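The ring schedule can be sanity-checked with a toy simulation in plain Python — block indices only, no actual attention math:

```python
# Toy simulation of the Ring Attention KV schedule: after N steps,
# every GPU has attended over every KV block exactly once.
N = 4  # number of GPUs in the ring

# Each GPU starts out holding its own KV block (identified by index).
kv_held = list(range(N))
seen = [set() for _ in range(N)]  # KV blocks each GPU has processed

for step in range(N):
    # 1. Each GPU attends its local queries against the KV block it holds.
    for gpu in range(N):
        seen[gpu].add(kv_held[gpu])
    # 2. Ring shift: GPU i sends its block to GPU (i+1) mod N,
    #    so GPU i now holds the block previously held by GPU (i-1) mod N.
    kv_held = [kv_held[(gpu - 1) % N] for gpu in range(N)]

# Every GPU has now seen all N KV blocks.
assert all(s == set(range(N)) for s in seen)
```

In the real algorithm, step 2's send/receive runs concurrently with step 1's attention compute on the current block, which is what hides the communication.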

Memory per GPU:

$$\text{KV cache per GPU} = \frac{L}{N} \times 2 \times H \times \text{num\_layers} \times 2\text{ bytes}$$

For a 1M token sequence with $H = 8192$, 80 layers, FP16: total KV cache is $\approx 2.6$ TB. With 8 GPUs: 328 GB per GPU (still needs further sharding or quantization). With 32 GPUs: 82 GB per GPU.
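A quick back-of-envelope helper for the formula above (the function name is ours, sizes in GB):

```python
# KV cache per GPU under context parallelism, from the formula above:
# (L/N) * 2 (K and V) * H * num_layers * bytes_per_elem
def kv_cache_per_gpu_gb(seq_len, n_gpus, hidden, n_layers, bytes_per_elem=2):
    return (seq_len / n_gpus) * 2 * hidden * n_layers * bytes_per_elem / 1e9

# 1M tokens, H = 8192, 80 layers, FP16:
print(kv_cache_per_gpu_gb(1_000_000, 1, 8192, 80))   # ≈2621 GB total (~2.6 TB)
print(kv_cache_per_gpu_gb(1_000_000, 8, 8192, 80))   # ≈328 GB per GPU
print(kv_cache_per_gpu_gb(1_000_000, 32, 8192, 80))  # ≈82 GB per GPU
```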

Ulysses Sequence Parallelism

DeepSpeed Ulysses takes a different approach: it splits the sequence dimension and uses all-to-all communication (similar to expert parallelism) to transpose between sequence-parallel and head-parallel views.

The workflow:

  1. Split input sequence across GPUs along the sequence dimension
  2. Before attention: all-to-all to redistribute from [sequence-split, all-heads] to [all-sequence, head-split]
  3. Compute attention (now each GPU has all sequence positions but only some heads)
  4. After attention: all-to-all to redistribute back to [sequence-split, all-heads]

Ulysses requires 4 all-to-all operations per transformer layer (2 for forward, 2 for backward), but each all-to-all is smaller than Ring Attention’s total communication.

📊

Context Parallelism Comparison

| Method | Communication Pattern | Communication Volume | Overlap Potential | Best For |
| --- | --- | --- | --- | --- |
| Ring Attention | Ring of point-to-point KV passes | ~KV cache size per ring step | High (overlap with attention compute) | Very long sequences, GQA/MQA models |
| Ulysses | All-to-all (sequence <-> head split) | ~4 all-to-all per layer | Moderate | Moderate sequences, many heads |
| Hybrid Ring + Ulysses | Both | Varies | High | Extremely long sequences |

When to Use Context Parallelism

Context parallelism becomes necessary when:

  • Sequence length exceeds 32K tokens and the KV cache does not fit on a single GPU (even with quantization)
  • You are training on very long documents (128K+) and activation memory for attention is the bottleneck
  • Other parallelism dimensions are already saturated

For most current workloads (sequences under 8K), context parallelism is unnecessary. Standard tensor and data parallelism handle the memory and compute requirements.

ℹ️ Sequence Parallelism (Megatron) vs. Context Parallelism

Megatron-LM’s “sequence parallelism” is different from context parallelism. Megatron SP splits the sequence dimension only for LayerNorm and Dropout operations (which are replicated in TP), converting redundant computation to distributed computation. It uses all-gather and reduce-scatter (not ring or all-to-all). This is a complement to TP, not a separate parallelism axis. Context parallelism (Ring Attention, Ulysses) splits the entire attention computation along the sequence dimension and is a separate parallelism axis.

Combining Parallelisms: 3D, 4D, and 5D Parallel Training

No single parallelism strategy is sufficient for training the largest models. Real production training runs combine multiple strategies, each operating at a different granularity and exploiting a different level of the hardware hierarchy.

The Hardware Hierarchy

Modern GPU clusters have a hierarchy of interconnects with dramatically different bandwidths:

📊

Interconnect Bandwidth Hierarchy (H100 DGX Cluster)

| Level | Connects | Bandwidth | Latency | Best Parallelism |
| --- | --- | --- | --- | --- |
| Intra-GPU | SMs within one GPU | ~3 TB/s (HBM) | ~ns | None needed |
| Intra-node NVLink | 8 GPUs within a node | 900 GB/s bidirectional | ~1 µs | Tensor Parallelism (TP) |
| Inter-node InfiniBand | Nodes in a cluster | 400 Gb/s = 50 GB/s | ~1-5 µs | Pipeline Parallelism (PP) |
| Inter-rack / multi-hop | Distant nodes | 400 Gb/s, higher latency | ~5-20 µs | Data Parallelism (DP/FSDP) |

The principle: match parallelism strategy to interconnect bandwidth.

  • TP requires the highest bandwidth (all-reduce every layer) → use within NVLink domain
  • PP requires moderate bandwidth (point-to-point at stage boundaries) → use across nearby nodes
  • DP/FSDP requires the least frequent communication (once per step for gradient sync) → use across the cluster
  • EP requires all-to-all (for MoE token dispatch) → typically inter-node, overlapped with compute

3D Parallelism (TP + PP + DP)

The classic combination for large dense models:

$$\text{Total GPUs} = \text{TP} \times \text{PP} \times \text{DP}$$

Example: Training Llama 3 70B on 256 A100 GPUs:

  • TP=8: Split each layer across 8 GPUs within a node (NVLink)
  • PP=4: Split the 80 layers into 4 pipeline stages across 4 nodes (InfiniBand)
  • DP=8: 8 independent pipeline replicas processing different data shards

This gives $8 \times 4 \times 8 = 256$ GPUs. Each GPU holds $1/8$ of each layer's weights (TP), $1/4$ of the total layers (PP), and processes $1/8$ of the global batch (DP).
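The same arithmetic in code — a trivial sanity check of the example configuration:

```python
# Sanity-checking the 3D-parallel Llama 3 70B example: total GPU count
# and each GPU's share of layers and weights.
TP, PP, DP = 8, 4, 8
total_gpus = TP * PP * DP
assert total_gpus == 256

params, layers = 70e9, 80
layers_per_stage = layers // PP                  # layers held per pipeline stage
fp16_weights_gb = params * 2 / (TP * PP) / 1e9   # each GPU holds 1/(TP*PP) of weights

print(layers_per_stage, round(fp16_weights_gb, 1))  # 20 4.4
```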

4D/5D Parallelism (Adding EP and CP)

For MoE models and long-context training, additional parallelism axes come into play:

$$\text{Total GPUs} = \text{TP} \times \text{PP} \times \text{DP} \times \text{EP} \times \text{CP}$$

DeepSeek V3 training configuration (2048 H800 GPUs):

  • EP=64 for expert layers (the 256 routed experts distributed across 64 GPUs, 4 experts per GPU)
  • TP=4 for attention layers within a node
  • PP=16 pipeline stages across nodes
  • DP=2 for data parallelism

Note that EP and DP can share the same GPU dimension — a GPU that handles data shard $i$ for dense layers handles expert group $j$ for MoE layers. The total GPU count is $\text{EP} \times \text{PP} \times \text{DP}_{\text{effective}} = 64 \times 16 \times 2 = 2048$.

📊

Real-World Multi-Dimensional Parallelism Configurations

| Model | Total GPUs | TP | PP | DP | EP | CP |
| --- | --- | --- | --- | --- | --- | --- |
| Llama 3 70B | 256 A100s | 8 | 4 | 8 | -- | -- |
| Llama 3 405B | 16384 H100s | 8 | 16 | 128 | -- | -- |
| DeepSeek V3 | 2048 H800s | 4 | 16 | 2 | 64 | -- |
| GPT-4 (estimated) | ~10000+ GPUs | 8 | ~64 | ~20+ | MoE | -- |
| Long-context training (128K) | 512 H100s | 8 | 4 | 4 | -- | 4 |

Note: Configurations are approximate and may vary by training phase. Many teams adjust parallelism dimensions dynamically.

How to Choose: A Decision Process

The order in which you add parallelism dimensions matters. Here is the recommended priority:

Step 1: Tensor Parallelism first (for latency)

If the model does not fit on one GPU, start with TP within a single NVLink-connected node. TP=8 for 8-GPU nodes. This gives the lowest latency because all-reduce on NVLink is fast and there are no pipeline bubbles.

Step 2: Pipeline Parallelism second (for inter-node scaling)

If TP=8 is not enough (model still too large, or you need more GPUs), add PP across nodes. PP tolerates the lower InfiniBand bandwidth. Choose the number of stages to keep bubble fraction manageable (below 5-10%).
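The bubble trade-off can be made concrete with the commonly used fill/drain formula for a simple 1F1B schedule, (p − 1)/(m + p − 1) for p stages and m micro-batches:

```python
# Bubble fraction for a simple 1F1B pipeline schedule:
# (p - 1) / (m + p - 1), with p stages and m micro-batches per step.
def bubble_fraction(stages, micro_batches):
    return (stages - 1) / (micro_batches + stages - 1)

# More micro-batches amortize the fill/drain bubble of a 4-stage pipeline:
for m in (4, 8, 16, 32, 64):
    print(m, round(bubble_fraction(4, m), 3))
```

Interleaved schedules (virtual pipeline stages) shrink the bubble further for the same micro-batch count, at the cost of more communication.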

Step 3: Data Parallelism third (for throughput scaling)

Once the model fits via TP+PP, scale throughput with DP/FSDP across remaining GPUs. DP communication (gradient sync) happens once per step and can overlap with computation.

Step 4: Expert Parallelism (for MoE models)

If using MoE, add EP to distribute experts. EP typically replaces some of the DP dimension for expert layers.

Step 5: Context Parallelism (for long sequences)

If training on very long sequences (over 32K), add CP to split the sequence dimension.

The Rule of Thumb

TP handles the model dimension (each layer is too big). PP handles the depth dimension (too many layers). DP handles the data dimension (need more throughput). EP handles the expert dimension (MoE). CP handles the sequence dimension (long context). Each addresses a different bottleneck with a communication pattern suited to a different level of the interconnect hierarchy.

Communication Primitives Reference

All parallelism strategies reduce to a small number of collective communication primitives. Understanding these is essential for reasoning about performance.

📊

Communication Primitives Used by Each Parallelism Strategy

| Primitive | Description | Volume per GPU | Used By |
| --- | --- | --- | --- |
| All-reduce | Sum tensors across all GPUs, result on all GPUs | ~2M (ring) | DDP gradients, TP layer outputs |
| All-gather | Each GPU broadcasts its shard, all GPUs get full tensor | ~M(N-1)/N | FSDP parameter gathering, Megatron SP |
| Reduce-scatter | Sum tensors, each GPU gets 1/N of result | ~M(N-1)/N | FSDP gradient reduction, Megatron SP |
| All-to-all | Each GPU sends different data to each other GPU | Variable | Expert parallelism dispatch/combine, Ulysses SP |
| Point-to-point (send/recv) | One GPU sends to one other GPU | One tensor | Pipeline parallelism stage boundaries, Ring Attention |

Note: All-reduce can be decomposed into reduce-scatter + all-gather. Volume M refers to the tensor size being communicated.

A useful identity: all-reduce = reduce-scatter + all-gather. This is exactly what FSDP exploits — instead of doing a full all-reduce and keeping the entire gradient on every GPU (wasteful), it does only the reduce-scatter, leaving each GPU with its reduced $1/N$ gradient shard, and performs the all-gather only when the full parameters are needed for the forward/backward pass.
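The identity is easy to verify on toy per-GPU gradients in plain Python:

```python
# Demonstrating all-reduce = reduce-scatter + all-gather on plain lists
# (N simulated GPUs, each holding a length-N gradient vector).
N = 4
grads = [[g * 10 + i for i in range(N)] for g in range(N)]  # per-GPU gradients

# Reduce-scatter: GPU g ends up with only the SUM of element g.
scattered = [sum(grads[g2][g] for g2 in range(N)) for g in range(N)]

# All-gather: every GPU collects all the scattered sums -> full reduced tensor.
gathered = list(scattered)

# Reference all-reduce: elementwise sum across GPUs, replicated everywhere.
allreduced = [sum(grads[g2][i] for g2 in range(N)) for i in range(N)]
assert gathered == allreduced
```

FSDP simply stops after the first half (each GPU keeps `scattered[g]`) and defers the second half until the full tensor is actually needed.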

When Each Parallelism Wins: A Decision Tree

Here is a practical decision tree for choosing parallelism strategies:

Does the model fit on a single GPU (including optimizer states for training)?

  • Yes: Use DDP for throughput scaling. This is the simplest and most efficient option.
  • No: Continue below.

Does the model fit on a single GPU with ZeRO-1 or ZeRO-2 (sharded optimizer)?

  • Yes: Use ZeRO-1/2 with DDP. You get memory savings with minimal communication overhead.
  • No: Continue below.

Do you have NVLink-connected GPUs (DGX-style node)?

  • Yes: Start with TP=8 within the NVLink domain. If the model still does not fit, add PP across nodes. Fill remaining GPUs with DP/FSDP.
  • No (PCIe only): Use FSDP/ZeRO-3 for sharding. Avoid TP beyond TP=2. Consider PP if you have multiple nodes.

Is the model MoE?

  • Yes: Add EP to distribute experts. Combine with TP for attention layers and PP for depth.
  • No: Stick with TP + PP + DP.

Is the sequence length over 32K tokens?

  • Yes: Add CP (Ring Attention or Ulysses) to split the sequence dimension.
  • No: Standard TP + PP + DP is sufficient.
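The decision tree can be encoded as a small helper — illustrative only; real planners weigh many more factors (memory headroom, topology, batch size), and all names here are ours:

```python
# Illustrative encoding of the decision tree above (not a real planner).
def choose_parallelism(fits_on_one_gpu, fits_with_zero12, has_nvlink,
                       is_moe=False, seq_len=4096):
    if fits_on_one_gpu:
        plan = ["DDP"]
    elif fits_with_zero12:
        plan = ["ZeRO-1/2 + DDP"]
    elif has_nvlink:
        plan = ["TP=8 (intra-node)", "PP (if still too large)", "DP/FSDP"]
    else:
        plan = ["FSDP/ZeRO-3", "PP across nodes", "TP<=2 only"]
    if is_moe:
        plan.append("EP for expert layers")
    if seq_len > 32_000:
        plan.append("CP (Ring Attention or Ulysses)")
    return plan

# A large MoE model with 128K context on NVLink-connected nodes:
print(choose_parallelism(False, False, True, is_moe=True, seq_len=128_000))
```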
📊

Parallelism Strategy Quick Reference

| Strategy | When to Use | Communication Pattern | Bandwidth Requirement | Latency Impact |
| --- | --- | --- | --- | --- |
| DDP | Model fits on 1 GPU, scale throughput | All-reduce (1x per step) | Low | None |
| FSDP / ZeRO-3 | Model does not fit, need memory savings | All-gather + reduce-scatter | Moderate | Slight increase |
| Tensor Parallel | Layer too large for 1 GPU, have NVLink | All-reduce (2x per layer) | Very High (NVLink) | Low |
| Pipeline Parallel | Model too deep, scaling beyond 1 node | Point-to-point at boundaries | Moderate | Bubbles add latency |
| Expert Parallel | MoE model, distribute experts | All-to-all dispatch/combine | Moderate-High | All-to-all latency |
| Context Parallel | Very long sequences (over 32K) | Ring or all-to-all | Moderate | Depends on overlap |

Practical Considerations

Activation Checkpointing

Regardless of parallelism strategy, activation checkpointing (gradient checkpointing) is almost always used for large model training. Instead of storing all activations from the forward pass for use in the backward pass, you recompute them. This trades ~33% extra compute for a dramatic reduction in activation memory.

With FSDP + activation checkpointing, the memory per GPU is approximately:

$$\text{Memory} \approx \frac{16P}{N} + \text{batch\_activations}$$

where the activation memory depends on batch size, sequence length, and how aggressively you checkpoint.
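As a quick sketch of the formula (the activation term is passed in directly, since it depends on batch size, sequence length, and checkpointing policy):

```python
# Per-GPU memory estimate for FSDP + activation checkpointing, following
# the formula above: 16 bytes/param (FP16 weights + grads plus FP32 Adam
# states) sharded across N GPUs, plus unsharded activation memory.
def fsdp_memory_gb(params, n_gpus, activations_gb):
    return 16 * params / n_gpus / 1e9 + activations_gb

# e.g. a 70B model on 64 GPUs with ~15 GB of checkpointed activations:
print(fsdp_memory_gb(70e9, 64, 15))  # → 32.5 (GB per GPU)
```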

Mixed Precision

Modern training uses BF16 or FP16 for forward/backward computation and FP32 for optimizer states and master weights. This halves the communication volume for TP all-reduces (since activations are in FP16) and halves the parameter memory. The optimizer states remain in FP32 for numerical stability, which is why ZeRO’s optimizer state sharding is so impactful.

Overlapping Communication and Computation

The key to high GPU utilization is overlapping communication with computation:

  • DDP: Gradient all-reduce is overlapped with backward computation (bucketed all-reduce starts as soon as each gradient bucket is ready)
  • FSDP: All-gather for the next layer can be prefetched during the current layer’s computation
  • TP: All-reduce can partially overlap with subsequent layer norm computation
  • PP: Activation transfer to the next stage overlaps with the current stage’s remaining computation
  • EP: Token dispatch can overlap with attention computation on the dense layers

When communication is fully overlapped with computation, it effectively “disappears” from the critical path. This is why real-world scaling efficiency is often better than what communication volume alone would suggest.

Fault Tolerance at Scale

With 2048+ GPUs, failures are not rare events — they are expected. A single GPU failure every few hours is normal at this scale. Production training systems need:

  • Checkpointing: Periodic saves of model state. With FSDP, each GPU saves its local shard (fast), and shards are assembled during recovery.
  • Elastic training: The ability to continue training with fewer GPUs (adjusting DP degree) when a node fails, then scale back up when it recovers.
  • Communication timeout handling: Detecting and recovering from NCCL communication hangs caused by GPU or network failures.

Putting It All Together: A Worked Example

Suppose you want to train a 70B dense model on 512 H100 GPUs (64 nodes of 8 GPUs each). Here is how you would configure parallelism:

Model analysis:

  • 70B parameters, 80 transformer layers
  • FP16 weights: 140 GB
  • With Adam optimizer (FP32): 16 bytes/param = 1.12 TB total
  • Target: global batch size of 4M tokens (1024 sequences of 4096 tokens)

Configuration:

  • TP=8: Within each 8-GPU node (NVLink). Each GPU holds $1/8$ of each layer's weights.
  • PP=4: Across 4 nodes. 20 layers per stage. Requires enough micro-batches per step (or an interleaved schedule) to keep the bubble fraction below 5%.
  • DP=16: 16 independent TP+PP groups, each processing $1/16$ of the global batch.

Verification: $8 \times 4 \times 16 = 512$ GPUs. Each GPU's memory:

  • Parameters: $140\text{ GB} / 8\text{ (TP)} / 4\text{ (PP)} = 4.4\text{ GB}$
  • Optimizer states (with ZeRO-1 across DP): $840\text{ GB} / 8 / 4 / 16 = 1.6\text{ GB}$
  • Activations (with checkpointing): ~10-20 GB (depends on micro-batch size and sequence length)
  • Total: ~20-30 GB per GPU — well within H100’s 80 GB
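The budget can be rechecked in a few lines:

```python
# Re-deriving the per-GPU memory numbers from the worked example.
TP, PP, DP = 8, 4, 16
assert TP * PP * DP == 512

params = 70e9
weights_gb = params * 2 / (TP * PP) / 1e9      # FP16 weights per GPU
optim_gb = params * 12 / (TP * PP * DP) / 1e9  # FP32 Adam states, ZeRO-1 over DP
activations_gb = 15                            # midpoint of the 10-20 GB estimate

total_gb = weights_gb + optim_gb + activations_gb
print(round(weights_gb, 1), round(optim_gb, 1), round(total_gb, 1))  # 4.4 1.6 21.0
assert total_gb < 80  # comfortably inside H100's 80 GB HBM
```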

Communication:

  • TP: 2 all-reduces per layer, 20 layers per stage = 40 all-reduces per micro-batch on NVLink (~900 GB/s) — tens of microseconds each
  • PP: 1 activation transfer per micro-batch at each stage boundary on InfiniBand (50 GB/s) — a few milliseconds, overlapped with compute
  • DP: 1 gradient all-reduce per step across 16 replicas on InfiniBand — overlapped with backward computation

This configuration achieves approximately 40-50% Model FLOPs Utilization (MFU), meaning 40-50% of the theoretical peak FLOPS are used for useful computation. The rest is overhead from communication, memory operations, and pipeline bubbles.

ℹ️ Model FLOPs Utilization (MFU)

MFU measures what fraction of the GPU’s theoretical peak compute is used for model computation (excluding communication, memory copies, etc.). State-of-the-art training runs achieve 40-55% MFU. DeepSeek V3 reportedly achieved ~55% MFU on their 2048-GPU cluster. Higher MFU means less money wasted on idle silicon.
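MFU can be estimated with the common approximation of 6 FLOPs per parameter per token for training (forward + backward, ignoring the attention term); the throughput figure below is hypothetical:

```python
# MFU = achieved model FLOPs per second / theoretical peak FLOPs per second.
# Training FLOPs are commonly approximated as 6 * params per token.
def mfu(params, tokens_per_sec, n_gpus, peak_tflops_per_gpu):
    model_flops_per_sec = 6 * params * tokens_per_sec
    peak_flops_per_sec = n_gpus * peak_tflops_per_gpu * 1e12
    return model_flops_per_sec / peak_flops_per_sec

# 70B model on 512 H100s (~989 dense BF16 TFLOPS each) at a hypothetical
# throughput of 500K tokens/sec:
print(round(mfu(70e9, 500_000, 512, 989), 3))  # → 0.415 (≈41% MFU)
```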

Summary

📊

Complete Parallelism Strategy Summary

| Dimension | What It Splits | Communication | Where to Use | Scales To |
| --- | --- | --- | --- | --- |
| Data (DDP) | Training data across replicas | All-reduce gradients | Everywhere | 1000s of GPUs |
| Data (FSDP/ZeRO) | Data + params + optimizer + grads | All-gather + reduce-scatter | Everywhere | 1000s of GPUs |
| Tensor (TP) | Weight matrices within each layer | All-reduce per layer | Intra-node (NVLink) | 8 GPUs per group |
| Pipeline (PP) | Layers across pipeline stages | Point-to-point activations | Inter-node (InfiniBand) | 16-64 stages |
| Expert (EP) | MoE experts across GPUs | All-to-all dispatch/combine | Across cluster | Num experts |
| Context (CP) | Sequence dimension | Ring or all-to-all | Across nodes | Sequence length / chunk |

The art of distributed training is choosing the right combination of these strategies for your specific model, hardware, and requirements. The general recipe:

  1. TP=8 intra-node to split layers across NVLink-connected GPUs
  2. PP across nodes to handle model depth beyond what TP covers
  3. DP/FSDP across the cluster to scale throughput
  4. EP for MoE models to distribute experts
  5. CP for long sequences when the sequence dimension is the memory bottleneck

Each parallelism dimension addresses a different bottleneck (layer size, model depth, throughput, expert count, sequence length) and maps to a different level of the hardware interconnect hierarchy. Getting this mapping right is the difference between 50% MFU and 20% MFU — the difference between a training run that finishes in weeks and one that takes months.