Modern large language models do not fit on a single GPU. Llama 3 70B in FP16 requires roughly 140 GB of memory just for its parameters — and a single A100 has 80 GB. A 405B model needs over 800 GB. Training is even worse: optimizer states, gradients, and activations can multiply the memory footprint by 4-16x. You must distribute the work across multiple GPUs.
But how you split the work matters enormously. A naive approach can leave most GPUs idle, drown the interconnect in communication, or introduce latency-killing pipeline bubbles. This guide covers every major parallelism strategy used in production today — from the simplest (data parallelism) to the most exotic (expert parallelism, bidirectional pipelines) — and shows you how to combine them for large-scale training runs across thousands of GPUs.
Why Parallelism Is Unavoidable
Let us start with the arithmetic that forces our hand.
Memory Requirements for Modern LLMs (FP16 Parameters Only)
| Model | Parameters | FP16 Weight Memory | A100 80GB GPUs Needed (weights only) |
|---|---|---|---|
| Llama 3 8B | 8B | 16 GB | 1 |
| Llama 3 70B | 70B | 140 GB | 2 |
| Llama 3 405B | 405B | 810 GB | 11 |
| GPT-4 (estimated) | ~1.8T MoE | ~3.6 TB | 45+ |
| DeepSeek V3 | 671B MoE (37B active) | ~1.3 TB | 17+ |
For training, the memory multiplier is brutal. With Adam optimizer in mixed precision, each parameter costs:
- 2 bytes for the FP16 parameter
- 2 bytes for the FP16 gradient
- 4 bytes for the FP32 master weight (for numerical stability)
- 4 bytes for the FP32 first moment (Adam)
- 4 bytes for the FP32 second moment (Adam)
That is 16 bytes per parameter. A 70B model needs 1.12 TB just for parameters + optimizer + gradients — before a single activation is stored. This is 14 A100s worth of memory, minimum.
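The per-parameter arithmetic is easy to check in code (a quick sketch; 1 GB = 10^9 bytes here):

```python
# Memory cost per parameter for mixed-precision Adam training,
# following the breakdown listed above.
BYTES_PER_PARAM = {
    "fp16_param": 2,    # FP16 parameter
    "fp16_grad": 2,     # FP16 gradient
    "fp32_master": 4,   # FP32 master weight
    "fp32_adam_m": 4,   # Adam first moment
    "fp32_adam_v": 4,   # Adam second moment
}

def training_memory_gb(n_params: float) -> float:
    """Total training-state memory in GB (1 GB = 1e9 bytes)."""
    return n_params * sum(BYTES_PER_PARAM.values()) / 1e9

mem = training_memory_gb(70e9)
print(f"70B model: {mem:.0f} GB, {mem / 80:.1f}x A100-80GB")
```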
A single GPU cannot hold a 70B model for training. Even for inference, FP16 weights alone exceed one A100’s capacity. Parallelism is not optional — it is a hard requirement imposed by physics and economics.
The question is never “should I parallelize?” but rather “which combination of parallelism strategies minimizes total training time while fitting within my hardware constraints?”
Data Parallelism: The Foundation
Data parallelism is the simplest and most widely used strategy. The idea: replicate the entire model on every GPU, split the training data into shards, and have each GPU process a different mini-batch. After computing gradients locally, GPUs synchronize gradients so every replica takes the same optimizer step.
Distributed Data Parallel (DDP)
PyTorch’s DistributedDataParallel (DDP) is the workhorse of data parallelism. Each GPU (each “rank”) holds a complete copy of the model and processes a different portion of the mini-batch.
The training loop looks like:
- Each GPU receives a different mini-batch of data
- Each GPU runs forward pass independently
- Each GPU computes gradients independently
- All-reduce across all GPUs to average the gradients
- Each GPU applies the same optimizer step (parameters stay in sync)
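The loop above can be sketched as a toy simulation: plain numpy arrays stand in for GPUs and for the all-reduce, and the "model" is a hypothetical linear regression, not real DDP.

```python
import numpy as np

# Toy simulation of the DDP loop: 4 "GPU" replicas train y = x @ w on
# different shards of the global batch, then average gradients (all-reduce).
rng = np.random.default_rng(0)
n_gpus, lr = 4, 0.1
replicas = [np.zeros(8) for _ in range(n_gpus)]  # identical initial weights

x = rng.normal(size=(n_gpus, 16, 8))  # global batch, one shard per rank
y = x.sum(axis=-1)                    # toy regression targets (true w = ones)

for step in range(50):
    # 1-3. each rank computes its local gradient independently
    grads = [2 * x[r].T @ (x[r] @ replicas[r] - y[r]) / 16 for r in range(n_gpus)]
    # 4. all-reduce: every rank ends up with the same averaged gradient
    g_avg = np.mean(grads, axis=0)
    # 5. identical optimizer step on every rank keeps replicas in sync
    for r in range(n_gpus):
        replicas[r] -= lr * g_avg

assert all(np.allclose(replicas[0], w) for w in replicas)
```

Because every rank applies the same averaged gradient, the replicas never drift apart, which is the invariant DDP maintains.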
The critical communication primitive is all-reduce: every GPU contributes its gradient tensor, and every GPU receives the sum (or average) of all contributions. Conceptually, each GPU sends G bytes and receives G bytes, where G is the size of the model's gradients.
In practice, all-reduce is implemented as a ring all-reduce or tree all-reduce that requires each GPU to send and receive 2G(N-1)/N bytes total (where N is the number of GPUs). For large N, this approaches 2G — the volume is independent of GPU count.
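A small helper makes the asymptotics concrete (a sketch of the standard ring all-reduce accounting, not tied to any particular library):

```python
def ring_all_reduce_bytes(grad_bytes: float, n_gpus: int) -> float:
    """Bytes each GPU sends (and receives) in a ring all-reduce:
    a reduce-scatter phase plus an all-gather phase, each moving
    (N-1)/N of the tensor."""
    return 2 * (n_gpus - 1) / n_gpus * grad_bytes

g = 140e9  # Llama 3 70B FP16 gradients: 140 GB
for n in (2, 8, 64):
    print(n, round(ring_all_reduce_bytes(g, n) / 1e9, 1), "GB per GPU")
# volume approaches 2G = 280 GB as N grows, independent of GPU count
```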
DDP Memory Layout per GPU
Every GPU holds a full copy of everything. N GPUs = N copies of the model.
The problem with DDP is obvious from the memory layout: every GPU holds a complete copy of all parameters, gradients, and optimizer states. With N GPUs, you have N redundant copies of the optimizer states alone. For a 70B model, that is N x 840 GB of optimizer state across the cluster — an enormous waste.
DDP uses all-reduce for gradient synchronization. The total communication volume per step is approximately 2G bytes (where G is the gradient size), independent of the number of GPUs. DDP overlaps communication with backward computation by bucketing gradients and launching all-reduce as soon as each bucket is ready.
DDP works well when the model fits on a single GPU. You get near-linear throughput scaling because the only overhead is gradient all-reduce, which overlaps with computation. But for models that exceed single-GPU memory, DDP is impossible — you cannot even load the model.
ZeRO: Eliminating Redundancy
The Zero Redundancy Optimizer (ZeRO), introduced by Microsoft DeepSpeed, attacks the memory redundancy in DDP. The key insight: if every GPU holds identical optimizer states, gradients, and parameters, why not partition them across GPUs so each GPU holds only 1/N of the total?
ZeRO comes in three stages, each eliminating more redundancy:
ZeRO Stage 1 (ZeRO-1): Partition Optimizer States
Instead of every GPU holding all of Adam's first- and second-moment tensors, each GPU is responsible for 1/N of the optimizer states. After the all-reduce of gradients, each GPU updates only its assigned parameter partition using its local optimizer states.
Memory per GPU for a model with Ψ parameters (in bytes):

2Ψ (FP16 params) + 2Ψ (FP16 grads) + 12Ψ/N (FP32 optimizer states)

For N = 64 GPUs and a 70B model: 140 GB + 140 GB + 13.1 GB ≈ 293 GB per GPU. Still too large for one GPU, but the optimizer state dropped from 840 GB to 13.1 GB per GPU.
ZeRO Stage 2 (ZeRO-2): Partition Gradients + Optimizer States
Each GPU holds only 1/N of the gradients in addition to 1/N of the optimizer states. Instead of all-reduce (which leaves a full gradient copy on every GPU), ZeRO-2 uses reduce-scatter: each GPU receives only the gradient shard it is responsible for.
For N = 64 and 70B: 140 + 2.2 + 13.1 ≈ 155 GB per GPU. Getting closer, but still does not fit on one A100.
ZeRO Stage 3 (ZeRO-3): Partition Everything
The parameters themselves are also partitioned. Each GPU holds only 1/N of the parameters, gradients, and optimizer states. When a layer needs the full parameter tensor for forward or backward computation, it performs an all-gather to collect all shards, computes, and then discards the non-local shards.
For N = 64 and 70B: 2.2 + 2.2 + 13.1 ≈ 17.5 GB per GPU. Now it fits comfortably on an A100.
ZeRO Memory Savings per GPU (70B Model, FP16 + Adam, N=64 GPUs)
| Strategy | Params (GB) | Gradients (GB) | Optimizer (GB) | Total (GB) | Fits on A100 80GB? |
|---|---|---|---|---|---|
| DDP (no ZeRO) | 140 | 140 | 840 | 1120 | No |
| ZeRO-1 | 140 | 140 | 13.1 | 293 | No |
| ZeRO-2 | 140 | 2.2 | 13.1 | 155 | No |
| ZeRO-3 | 2.2 | 2.2 | 13.1 | 17.5 | Yes |
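The table's per-GPU numbers follow from a simple accounting function (a sketch; `zero_memory_gb` and its 2/2/12-bytes-per-parameter split are assumptions matching the mixed-precision recipe described earlier):

```python
def zero_memory_gb(n_params: float, n_gpus: int, stage: int) -> dict:
    """Per-GPU memory (GB) for FP16 params/grads plus FP32 Adam states
    (12 bytes per parameter), under ZeRO stage 0 (plain DDP) through 3."""
    p, g, o = 2 * n_params, 2 * n_params, 12 * n_params  # bytes, unsharded
    if stage >= 1:
        o /= n_gpus   # ZeRO-1: shard optimizer states
    if stage >= 2:
        g /= n_gpus   # ZeRO-2: shard gradients too
    if stage >= 3:
        p /= n_gpus   # ZeRO-3: shard parameters as well
    return {"params": p / 1e9, "grads": g / 1e9, "optim": o / 1e9,
            "total": (p + g + o) / 1e9}

print(zero_memory_gb(70e9, 64, 3))  # matches the ZeRO-3 row above
```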
ZeRO-1 has the same communication volume as DDP (one all-reduce per step). ZeRO-2 replaces the all-reduce with a reduce-scatter (half the volume: G instead of 2G). ZeRO-3 adds all-gather operations for the forward and backward passes, bringing the total to roughly 3G — about 1.5x the communication volume of DDP. The memory savings come at a communication cost.
FSDP: PyTorch’s Native ZeRO-3
Fully Sharded Data Parallelism (FSDP) is PyTorch’s native implementation of ZeRO-3. The semantics are identical: parameters are sharded across GPUs, all-gathered on demand for computation, and discarded after use.
FSDP’s communication pattern per training step:
- Forward pass: For each layer, all-gather parameter shards from all GPUs, compute the forward pass, then discard non-local shards
- Backward pass: For each layer, all-gather parameters again (or recompute via activation checkpointing), compute gradients, then reduce-scatter to distribute gradient shards
- Optimizer step: Each GPU updates only its local parameter shard using its local gradient shard and optimizer states
The communication primitives:
- All-gather: Each GPU broadcasts its shard; every GPU receives the full tensor. Volume per GPU: ~M(N-1)/N sent and received, for a tensor of M bytes.
- Reduce-scatter: Each GPU sends partial gradients; each receives its reduced shard. Volume per GPU: ~M(N-1)/N sent and received.
Total communication volume per step for FSDP/ZeRO-3 is approximately 3G (two all-gathers plus one reduce-scatter) — about 1.5x the 2G of DDP's all-reduce. The trade-off is dramatically lower memory usage.
Communication Volume Comparison (Normalized to DDP)
| Metric | DDP | ZeRO-1 | ZeRO-2 | ZeRO-3 / FSDP |
|---|---|---|---|---|
| Communication volume (relative) | 1.0x | 1.0x | 0.5x | 1.5x |
| Memory usage (relative, 70B, N=64) | 1.0x | 0.26x | 0.14x | 0.016x |
When to Use DDP vs. FSDP/ZeRO
- Model fits on one GPU: Use DDP. It has the lowest communication overhead and simplest implementation.
- Model does not fit on one GPU: Use FSDP/ZeRO-3. The extra communication is unavoidable.
- Model barely fits, but optimizer states blow memory: Use ZeRO-1 or ZeRO-2 for a middle ground.
Tensor Parallelism: Splitting Layers Across GPUs
Data parallelism replicates the model and splits data. Tensor parallelism does the opposite at the matrix level: it splits individual weight matrices across GPUs, so each GPU computes a portion of every layer’s output.
The Megatron-LM Column/Row Strategy
The breakthrough insight from the Megatron-LM paper is that you can split the two linear layers in a transformer’s FFN block in a complementary way that minimizes communication.
A standard transformer FFN block computes:

Y = σ(X W1) W2

where W1 has shape [d, 4d] and W2 has shape [4d, d] (with d being the hidden dimension).
Megatron-LM splits this across N GPUs as follows:
First linear (column-parallel): Split W1 along its output dimension (columns). Each GPU i holds a shard W1_i of shape [d, 4d/N] and computes:

H_i = X W1_i

Activation function: Apply σ (SiLU/GeLU) element-wise on each GPU's local H_i. Since the activation is element-wise, no communication is needed.
Second linear (row-parallel): Split W2 along its input dimension (rows). Each GPU i holds a shard W2_i of shape [4d/N, d] and computes:

Y_i = σ(H_i) W2_i

Now Y_i is a partial sum — each GPU has computed part of the final output. An all-reduce sums these partials:

Y = Y_1 + Y_2 + ... + Y_N
Megatron-LM Column/Row Parallel Strategy
FFN split: first linear is column-parallel (split output dim), second is row-parallel (split input dim)
Column-parallel followed by row-parallel means the activation function is computed on already-partitioned data — no all-reduce needed between the two linears. The only synchronization point is after the row-parallel matmul, where partial sums from each GPU are combined via all-reduce. This halves the number of all-reduces compared to a naive split where you would need to gather outputs after every linear layer.
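The claim that the column/row split is exact, not an approximation, can be checked numerically with a toy sketch (numpy stands in for the TP group, and summing the partials plays the role of the all-reduce):

```python
import numpy as np

def silu(x):
    return x / (1 + np.exp(-x))

rng = np.random.default_rng(0)
d, d_ff, tp = 16, 64, 4
X = rng.normal(size=(8, d))
W1 = rng.normal(size=(d, d_ff))   # up projection
W2 = rng.normal(size=(d_ff, d))   # down projection

# Reference: unsplit FFN
Y_ref = silu(X @ W1) @ W2

# Column-parallel W1 (split output dim), row-parallel W2 (split input dim)
W1_shards = np.split(W1, tp, axis=1)
W2_shards = np.split(W2, tp, axis=0)

# Each "GPU" computes a partial output; the all-reduce is just the sum
partials = [silu(X @ W1_shards[i]) @ W2_shards[i] for i in range(tp)]
Y = sum(partials)

assert np.allclose(Y, Y_ref)
```

The equivalence holds because the element-wise activation acts independently on each column block, so no cross-shard terms are lost.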
Attention Head Partitioning
For multi-head attention, tensor parallelism splits attention heads across GPUs. With TP degree N, each GPU handles 1/N of the query heads and the corresponding KV heads.
Attention Head Distribution (Llama 3 70B: 64 Q heads, 8 KV heads)
| TP Degree | Q Heads per GPU | KV Heads per GPU | Notes |
|---|---|---|---|
| TP=1 | 64 | 8 | All heads on one GPU |
| TP=2 | 32 | 4 | Clean split |
| TP=4 | 16 | 2 | Clean split |
| TP=8 | 8 | 1 | Minimum: 1 KV head per GPU |
After the attention output projection (which is also split as a row-parallel linear), another all-reduce combines the partial results. So each transformer layer requires 2 all-reduce operations in the forward pass:
- One after the attention output projection
- One after the FFN down-projection
And correspondingly, 2 all-reduce operations in the backward pass, for a total of 4 all-reduces per layer per training step.
Communication Cost Analysis
TP Communication per Transformer Layer (Llama 3 70B, FP16)
| Component | All-Reduces per Direction | Data per All-Reduce | Total per Layer |
|---|---|---|---|
| Attention output projection | 1 | B x L x 8192 x 2 bytes | B x L x 16 KB |
| FFN down-projection | 1 | B x L x 8192 x 2 bytes | B x L x 16 KB |
| Total per layer (forward) | 2 | -- | B x L x 32 KB |
| Total per model (80 layers, forward + backward) | 320 | -- | B x L x 5 MB |
The real question is: how long do these all-reduces take relative to the compute?
For decode (L=1, B=32): roughly 80 MB of all-reduce traffic per forward pass (32 tokens x 2.5 MB). On NVLink at 600 GB/s (A100), that is well under a millisecond — negligible compared to the compute.
For prefill (L=2048, B=4): roughly 20 GB total (8192 tokens x 2.5 MB). On NVLink: about 35 ms, small next to the prefill compute. On PCIe Gen4 at 32 GB/s: over 600 ms. This is why tensor parallelism requires NVLink.
TP Scaling Efficiency
Tensor Parallelism Scaling (Llama 3 70B, A100 NVLink)
| TP Degree | Memory per GPU | Throughput (tok/s) | Scaling Efficiency | Notes |
|---|---|---|---|---|
| TP=1 | 140+ GB (does not fit) | -- | -- | Requires model parallelism |
| TP=2 | ~75 GB | 680 | Baseline | Fits on 80 GB A100 |
| TP=4 | ~40 GB | 1250 | 92% | Good scaling |
| TP=8 | ~22 GB | 2200 | 81% | Communication starts to matter |
The practical limit for tensor parallelism is TP=8 — which conveniently matches the number of GPUs connected by NVLink within a single node (DGX A100 or DGX H100). Going beyond TP=8 would require crossing the node boundary, where bandwidth drops from 600+ GB/s (NVLink) to 50-100 GB/s (InfiniBand), making TP communication a severe bottleneck.
At PCIe bandwidths (~50 GB/s), tensor parallel communication can easily consume 30-50% of total step time for large prefills. NVLink at 600-900 GB/s brings this down to 2-5%. If you do not have NVLink, do not use tensor parallelism beyond TP=2 — prefer pipeline parallelism instead.
Implementation Pattern
```python
# Simplified Megatron-LM style TP for the FFN block
import torch.nn as nn
import torch.nn.functional as F
import torch.distributed as dist

class TensorParallelFFN(nn.Module):
    def __init__(self, d_model, d_ff, tp_rank, tp_world):
        super().__init__()
        # Column-parallel: split output dim
        self.up_proj = nn.Linear(d_model, d_ff // tp_world, bias=False)
        # Row-parallel: split input dim
        self.down_proj = nn.Linear(d_ff // tp_world, d_model, bias=False)
        self.tp_world = tp_world

    def forward(self, x):
        h = F.silu(self.up_proj(x))  # Local compute, no comm
        out = self.down_proj(h)      # Local partial result
        dist.all_reduce(out)         # Sum partials across TP group
        return out
```
Pipeline Parallelism: Splitting Layers in Sequence
Pipeline parallelism (PP) takes a different approach: instead of splitting individual layers, it assigns entire groups of layers to different GPUs. GPU 0 gets layers 0-19, GPU 1 gets layers 20-39, GPU 2 gets layers 40-59, GPU 3 gets layers 60-79, and so on.
The appeal is that inter-stage communication is just passing activation tensors at the boundary — a point-to-point send/receive, not a collective all-reduce. This makes PP much more tolerant of lower-bandwidth interconnects like InfiniBand or even ethernet, making it the natural choice for inter-node parallelism.
The Bubble Problem
The fundamental challenge with pipeline parallelism is pipeline bubbles. In a naive implementation:
- GPU 0 computes forward pass for micro-batch 1
- GPU 0 sends activations to GPU 1; GPU 0 is now idle
- GPU 1 computes forward on those activations; GPU 0 is still idle
- … and so on through the pipeline
- The backward pass propagates back in reverse
With P pipeline stages and M micro-batches, the bubble fraction (the fraction of time GPUs are idle) is:

bubble fraction = (P - 1) / (M + P - 1)

For P = 4 stages and M = 1 micro-batch, the bubble fraction is 3/4 — three out of four GPUs are idle at any given time. This is catastrophic.
Pipeline Bubble Fraction vs. Number of Micro-batches (P=4 Stages)
| Micro-batches (M) | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|---|
| Bubble fraction | 75% | 60% | 42.9% | 27.3% | 15.8% | 8.6% | 4.5% |
The solution is to split the mini-batch into many micro-batches and pipeline them through the stages. With enough micro-batches, the bubbles become a small fraction of total time.
1F1B Schedule
The 1F1B (one-forward-one-backward) schedule, used in PipeDream and Megatron-LM, interleaves forward and backward passes of different micro-batches to reduce memory and improve efficiency.
The schedule works as follows:
- Warmup phase: Each stage performs forward passes for a sufficient number of micro-batches to fill the pipeline
- Steady state: Each stage alternates — one forward pass, one backward pass (1F1B)
- Cooldown phase: Remaining backward passes drain the pipeline
The advantage of 1F1B over the naive “all-forward-then-all-backward” approach: it limits the number of in-flight micro-batches, so each GPU only needs to store activations for at most P micro-batches (not all M), dramatically reducing activation memory.
The bubble fraction with 1F1B is still:

bubble fraction = (P - 1) / (M + P - 1)

(where M is the total number of micro-batches). For large M this is approximately (P - 1) / M, so keeping bubbles below 5% requires M of roughly 20(P - 1). For P = 16, that means about 300 micro-batches — which requires a very large global batch size.
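A small helper (hypothetical; it uses the exact-fraction form, while the schedule table later in this section quotes the large-M approximation (P - 1)/M) shows how quickly micro-batching shrinks the bubble:

```python
def bubble_fraction(p: int, m: int, v: int = 1) -> float:
    """Idle fraction for a 1F1B pipeline with p stages, m micro-batches,
    and v interleaved virtual stages per GPU (v=1 is plain 1F1B)."""
    warmup = (p - 1) / v          # fill/drain cost, in micro-batch units
    return warmup / (m + warmup)

print(f"{bubble_fraction(4, 1):.0%}")      # single micro-batch: 3 of 4 GPUs idle
print(f"{bubble_fraction(4, 16):.1%}")     # micro-batching amortizes the fill/drain
print(f"{bubble_fraction(16, 300):.1%}")   # deep pipelines need ~20*(p-1) micro-batches
```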
Interleaved Pipeline (Megatron-LM v2)
Megatron-LM introduced an interleaved pipeline schedule where each GPU holds multiple non-contiguous groups of layers (called virtual stages). Instead of GPU 0 holding layers 0-19 contiguously, it might hold layers 0-4 and layers 40-44.
With V virtual stages per GPU, the bubble fraction becomes approximately:

bubble fraction ≈ (P - 1) / (V x M)

Doubling V halves the bubble. The trade-off: more pipeline stages means more inter-stage communication (more activation transfers).
Pipeline Bubble Fraction for Different Schedules (P=4, M=16)
| Schedule | Virtual Stages | Bubble Fraction | Inter-stage Comms per Step |
|---|---|---|---|
| Naive (all-forward-all-backward) | 1 | 18.8% | Low |
| 1F1B | 1 | 18.8% | Low |
| Interleaved (V=2) | 2 | 9.4% | 2x baseline |
| Interleaved (V=4) | 4 | 4.7% | 4x baseline |
DualPipe: Bidirectional Pipeline (DeepSeek V3)
DeepSeek V3 introduced DualPipe, a bidirectional pipeline schedule that processes micro-batches from both ends of the pipeline simultaneously. One stream flows forward through stages 0 to P-1, while another flows through the stages in reverse, from P-1 to 0.
The key insight: while a forward-flowing micro-batch is in its backward pass on GPU i, a backward-flowing micro-batch can be in its forward pass on that same GPU. The computation from the two streams fills what would otherwise be bubble time.
DualPipe achieves near-zero bubble fraction without requiring the enormous batch sizes that traditional schedules need. The cost is implementation complexity and the requirement that the model’s computation is roughly symmetric across stages.
PP communication is point-to-point (send/receive between adjacent stages), not collective. Each inter-stage transfer sends one activation tensor of 2 x B x L x d bytes (for FP16). For Llama 3 70B with B = 4, L = 2048, d = 8192: that is 128 MB per transfer. On InfiniBand at 400 Gb/s (50 GB/s), this takes about 2.5 ms — acceptable if overlapped with computation.
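The transfer-size arithmetic, as a sketch (`pp_transfer_mb` is a hypothetical helper; MiB = 2^20 bytes):

```python
def pp_transfer_mb(batch: int, seq: int, d_model: int, bytes_per: int = 2) -> float:
    """Size of one inter-stage activation transfer (FP16 by default), in MiB."""
    return batch * seq * d_model * bytes_per / 2**20

size = pp_transfer_mb(4, 2048, 8192)
print(f"{size:.0f} MiB -> {size / 2**10 / 50 * 1000:.1f} ms at 50 GB/s")
```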
When Pipeline Parallelism Wins
PP is the strategy of choice when:
- You need to scale beyond a single NVLink domain (beyond 8 GPUs)
- Your interconnect bandwidth is moderate (InfiniBand, not NVLink)
- You can afford large batch sizes (many micro-batches to fill the pipeline)
- Latency is not the primary concern (bubbles add latency)
PP is a poor choice for:
- Low-latency inference (bubbles kill time-to-first-token)
- Small batch sizes (bubble fraction is too high)
- Within a node where NVLink is available (TP is strictly better intra-node)
Expert Parallelism: Distributing MoE Experts
Mixture-of-Experts (MoE) models like DeepSeek V3 (671B parameters, 37B active per token) and Mixtral use sparsely-activated expert layers. Each token is routed to a small number of experts (typically 2 out of 64 or more), making compute per token much cheaper than a dense model of the same total size.
Expert parallelism (EP) is the natural parallelism strategy for MoE models: distribute experts across GPUs, so each GPU holds a subset of the total experts.
How Expert Parallelism Works
- Router: A lightweight gating network decides which experts each token should be sent to
- All-to-all dispatch: Tokens are sent to the GPUs that hold their assigned experts. This is an all-to-all communication pattern — fundamentally different from the all-reduce in TP or the point-to-point in PP
- Expert compute: Each GPU processes the tokens that were routed to its local experts
- All-to-all combine: Results are sent back to the originating GPUs
Expert Parallel Layout (64 Experts, 8 GPUs)
Each GPU holds 8 experts. Tokens are dispatched to the correct GPU via all-to-all communication.
All-to-All Communication
The all-to-all primitive is the defining characteristic of expert parallelism. Unlike all-reduce (where every GPU contributes and receives the same quantity of data), all-to-all has non-uniform traffic patterns: a GPU might send many tokens to one expert GPU and few to another, depending on the router’s decisions.
The communication volume depends on the number of tokens, the expert assignment, and the hidden dimension. For T tokens with hidden dimension d, each dispatched to k experts across GPUs, the total all-to-all volume (dispatch plus combine, FP16) is approximately:

2 x T x k x d x 2 bytes
This is substantial for large batch sizes. DeepSeek V3 uses EP=64, so its routed experts are spread over 64 GPUs and each expert lives wholly on a single GPU. The all-to-all communication is distributed across the entire cluster.
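As a sketch (the helper is hypothetical; the example dimensions are illustrative, using DeepSeek V3's hidden size of 7168 and top-8 routing):

```python
def moe_all_to_all_gb(tokens: int, top_k: int, d_model: int,
                      bytes_per: int = 2) -> float:
    """Approximate dispatch + combine all-to-all volume (GB) for one MoE
    layer: each token's hidden vector crosses the network twice (out and
    back) per selected expert."""
    return 2 * tokens * top_k * d_model * bytes_per / 1e9

# e.g. a 16K-token batch with top-8 routing and hidden size 7168
print(round(moe_all_to_all_gb(16384, 8, 7168), 2), "GB per MoE layer")
```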
DeepEP: Optimized Expert Communication
DeepSeek developed DeepEP, an optimized communication library for MoE expert parallelism. Key innovations include:
- Hook-based overlap: Communication is overlapped with computation by hooking into the forward/backward pass
- Optimized dispatch/combine kernels: Custom CUDA kernels that fuse routing decisions with communication setup
- Adaptive routing: Load balancing across experts to prevent communication hotspots
For dense models, TP splits every layer across GPUs. For MoE models, EP is more efficient because only the expert layers need inter-GPU communication — the shared attention layers can use TP within a smaller group. DeepSeek V3 combines EP for expert layers with TP for attention layers.
When Expert Parallelism Wins
EP is the right choice when:
- Your model uses MoE architecture
- You have enough experts to distribute meaningfully (typically 8+ experts)
- You can tolerate the all-to-all latency (or overlap it with compute)
- Load balancing across experts is achievable (via auxiliary loss or routing constraints)
EP is a poor choice for:
- Dense models (there are no experts to distribute)
- Very small numbers of experts (the all-to-all overhead is not amortized)
- Latency-critical inference with small batch sizes (all-to-all has high fixed overhead)
Context and Sequence Parallelism: Handling Long Sequences
When sequences become very long — 32K, 128K, or even 1M tokens — the memory consumed by attention’s KV cache and the quadratic cost of attention computation become bottlenecks that no other parallelism strategy addresses directly.
Context parallelism (CP) and sequence parallelism (SP) split the sequence dimension across GPUs.
Ring Attention
Ring Attention distributes the sequence across N GPUs: GPU i holds L/N tokens of the sequence. Each GPU computes attention for its local query tokens against the full KV cache.
The trick: the KV cache is passed around in a ring topology. At each step:
- GPU i computes attention between its local queries and the current KV block
- GPU i sends its KV block to GPU i+1 and receives a new block from GPU i-1 (wrapping around the ring)
- Repeat for N - 1 steps until every GPU has seen all KV blocks
The key property: KV transfer between adjacent GPUs can be overlapped with the attention computation on the current KV block. If the compute is slower than the communication (which it usually is for long sequences), the ring communication is fully hidden.
Memory per GPU:

KV cache per GPU = 2 (K and V) x (L / N) x d x n_layers x 2 bytes (FP16)

For a 1M token sequence with d = 8192, 80 layers, FP16: the total KV cache is about 2.6 TB. With 8 GPUs: 328 GB per GPU (still needs further sharding or quantization). With 32 GPUs: 82 GB per GPU.
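A sketch of the accounting (assumes full multi-head attention with K and V each of width d; GQA divides the result by the query-to-KV-head ratio):

```python
def kv_cache_gb(seq_len: int, d_model: int, n_layers: int,
                n_gpus: int = 1, bytes_per: int = 2) -> float:
    """Per-GPU KV-cache size (GB) when the sequence is split across
    n_gpus. Factor of 2 covers K and V; bytes_per=2 assumes FP16."""
    return 2 * (seq_len / n_gpus) * d_model * n_layers * bytes_per / 1e9

print(round(kv_cache_gb(1_000_000, 8192, 80, 8), 1), "GB per GPU at N=8")
print(round(kv_cache_gb(1_000_000, 8192, 80, 32), 1), "GB per GPU at N=32")
```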
Ulysses Sequence Parallelism
DeepSpeed Ulysses takes a different approach: it splits the sequence dimension and uses all-to-all communication (similar to expert parallelism) to transpose between sequence-parallel and head-parallel views.
The workflow:
- Split input sequence across GPUs along the sequence dimension
- Before attention: all-to-all to redistribute from [sequence-split, all-heads] to [all-sequence, head-split]
- Compute attention (now each GPU has all sequence positions but only some heads)
- After attention: all-to-all to redistribute back to [sequence-split, all-heads]
Ulysses requires 4 all-to-all operations per transformer layer (2 for forward, 2 for backward), but each all-to-all is smaller than Ring Attention’s total communication.
Context Parallelism Comparison
| Method | Communication Pattern | Communication Volume | Overlap Potential | Best For |
|---|---|---|---|---|
| Ring Attention | Ring of point-to-point KV passes | ~KV cache size per ring step | High (overlap with attention compute) | Very long sequences, GQA/MQA models |
| Ulysses | All-to-all (sequence <-> head split) | ~4 all-to-all per layer | Moderate | Moderate sequences, many heads |
| Hybrid Ring + Ulysses | Both | Varies | High | Extremely long sequences |
When to Use Context Parallelism
Context parallelism becomes necessary when:
- Sequence length exceeds 32K tokens and the KV cache does not fit on a single GPU (even with quantization)
- You are training on very long documents (128K+) and activation memory for attention is the bottleneck
- Other parallelism dimensions are already saturated
For most current workloads (sequences under 8K), context parallelism is unnecessary. Standard tensor and data parallelism handle the memory and compute requirements.
Megatron-LM’s “sequence parallelism” is different from context parallelism. Megatron SP splits the sequence dimension only for LayerNorm and Dropout operations (which are replicated in TP), converting redundant computation to distributed computation. It uses all-gather and reduce-scatter (not ring or all-to-all). This is a complement to TP, not a separate parallelism axis. Context parallelism (Ring Attention, Ulysses) splits the entire attention computation along the sequence dimension and is a separate parallelism axis.
Combining Parallelisms: 3D, 4D, and 5D Parallel Training
No single parallelism strategy is sufficient for training the largest models. Real production training runs combine multiple strategies, each operating at a different granularity and exploiting a different level of the hardware hierarchy.
The Hardware Hierarchy
Modern GPU clusters have a hierarchy of interconnects with dramatically different bandwidths:
Interconnect Bandwidth Hierarchy (H100 DGX Cluster)
| Level | Connects | Bandwidth | Latency | Best Parallelism |
|---|---|---|---|---|
| Intra-GPU | SMs within one GPU | ~3 TB/s (HBM) | ~ns | None needed |
| Intra-node NVLink | 8 GPUs within a node | 900 GB/s bidirectional | ~1 us | Tensor Parallelism (TP) |
| Inter-node InfiniBand | Nodes in a cluster | 400 Gb/s = 50 GB/s | ~1-5 us | Pipeline Parallelism (PP) |
| Inter-rack / multi-hop | Distant nodes | 400 Gb/s, higher latency | ~5-20 us | Data Parallelism (DP/FSDP) |
The principle: match parallelism strategy to interconnect bandwidth.
- TP requires the highest bandwidth (all-reduce every layer) → use within NVLink domain
- PP requires moderate bandwidth (point-to-point at stage boundaries) → use across nearby nodes
- DP/FSDP requires the least frequent communication (once per step for gradient sync) → use across the cluster
- EP requires all-to-all (for MoE token dispatch) → typically inter-node, overlapped with compute
3D Parallelism (TP + PP + DP)
The classic combination for large dense models:
Example: Training Llama 3 70B on 256 A100 GPUs:
- TP=8: Split each layer across 8 GPUs within a node (NVLink)
- PP=4: Split the 80 layers into 4 pipeline stages across 4 nodes (InfiniBand)
- DP=8: 8 independent pipeline replicas processing different data shards
This gives 8 x 4 x 8 = 256 GPUs. Each GPU holds 1/8 of each layer's weights (TP), 1/4 of the total layers (PP), and processes 1/8 of the global batch (DP).
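A few lines of arithmetic confirm the layout (a sketch; the variable names are illustrative):

```python
# Sanity-check the 3D-parallel layout above: every GPU is identified by
# (tp_rank, pp_stage, dp_rank), and the axes multiply to the world size.
tp, pp, dp = 8, 4, 8
n_layers, n_params = 80, 70e9

world_size = tp * pp * dp               # 256 GPUs total
layers_per_stage = n_layers // pp       # 20 layers per pipeline stage
weight_shard = n_params / (tp * pp)     # parameters held per GPU

print(world_size, layers_per_stage, f"{weight_shard / 1e9:.2f}B params/GPU")
```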
4D/5D Parallelism (Adding EP and CP)
For MoE models and long-context training, additional parallelism axes come into play:
DeepSeek V3 training configuration (2048 H800 GPUs):
- EP=64 for expert layers (the routed experts distributed across 64 GPUs)
- TP=4 for attention layers within a node
- PP=16 pipeline stages across nodes
- DP=2 for data parallelism
Note that EP and DP can share the same GPU dimension — a GPU that handles one data shard for dense layers also handles one expert group for MoE layers. Because the EP groups overlap the other axes rather than multiplying them, the total GPU count still works out to 2048.
Real-World Multi-Dimensional Parallelism Configurations
| Model | Total GPUs | TP | PP | DP | EP | CP |
|---|---|---|---|---|---|---|
| Llama 3 70B | 256 A100s | 8 | 4 | 8 | -- | -- |
| Llama 3 405B | 16384 H100s | 8 | 16 | 128 | -- | -- |
| DeepSeek V3 | 2048 H800s | 4 | 16 | 2 | 64 | -- |
| GPT-4 (estimated) | ~10000+ GPUs | 8 | ~64 | ~20+ | MoE | -- |
| Long-context training (128K) | 512 H100s | 8 | 4 | 4 | -- | 4 |
How to Choose: A Decision Process
The order in which you add parallelism dimensions matters. Here is the recommended priority:
Step 1: Tensor Parallelism first (for latency)
If the model does not fit on one GPU, start with TP within a single NVLink-connected node. TP=8 for 8-GPU nodes. This gives the lowest latency because all-reduce on NVLink is fast and there are no pipeline bubbles.
Step 2: Pipeline Parallelism second (for inter-node scaling)
If TP=8 is not enough (model still too large, or you need more GPUs), add PP across nodes. PP tolerates the lower InfiniBand bandwidth. Choose the number of stages to keep bubble fraction manageable (below 5-10%).
Step 3: Data Parallelism third (for throughput scaling)
Once the model fits via TP+PP, scale throughput with DP/FSDP across remaining GPUs. DP communication (gradient sync) happens once per step and can overlap with computation.
Step 4: Expert Parallelism (for MoE models)
If using MoE, add EP to distribute experts. EP typically replaces some of the DP dimension for expert layers.
Step 5: Context Parallelism (for long sequences)
If training on very long sequences (over 32K), add CP to split the sequence dimension.
TP handles the model dimension (each layer is too big). PP handles the depth dimension (too many layers). DP handles the data dimension (need more throughput). EP handles the expert dimension (MoE). CP handles the sequence dimension (long context). Each addresses a different bottleneck with a communication pattern suited to a different level of the interconnect hierarchy.
Communication Primitives Reference
All parallelism strategies reduce to a small number of collective communication primitives. Understanding these is essential for reasoning about performance.
Communication Primitives Used by Each Parallelism Strategy
| Primitive | Description | Volume per GPU | Used By |
|---|---|---|---|
| All-reduce | Sum tensors across all GPUs, result on all GPUs | ~2M (ring) | DDP gradients, TP layer outputs |
| All-gather | Each GPU broadcasts its shard, all GPUs get full tensor | ~M(N-1)/N | FSDP parameter gathering, Megatron SP |
| Reduce-scatter | Sum tensors, each GPU gets 1/N of result | ~M(N-1)/N | FSDP gradient reduction, Megatron SP |
| All-to-all | Each GPU sends different data to each other GPU | Variable | Expert parallelism dispatch/combine, Ulysses SP |
| Point-to-point (send/recv) | One GPU sends to one other GPU | One tensor | Pipeline parallelism stage boundaries, Ring Attention |
A useful identity: all-reduce = reduce-scatter + all-gather. This is exactly what FSDP exploits — instead of doing an all-reduce and keeping the full gradient (wasteful), it does only the reduce-scatter (keeping 1/N of the gradient) for the gradient computation, and only the all-gather when the full parameters are needed for forward/backward.
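The identity is easy to verify numerically (a toy sketch with numpy arrays standing in for per-GPU gradient tensors):

```python
import numpy as np

# Verify: all-reduce == reduce-scatter followed by all-gather.
rng = np.random.default_rng(0)
n = 4                                           # number of "GPUs"
grads = [rng.normal(size=8) for _ in range(n)]  # one gradient tensor per GPU

all_reduced = sum(grads)  # what DDP's all-reduce leaves on every GPU

# Reduce-scatter: GPU i ends up with only the reduced i-th shard (1/N of the tensor)
shards = [sum(g.reshape(n, -1)[i] for g in grads) for i in range(n)]

# All-gather: concatenating every GPU's shard reproduces the all-reduce result
gathered = np.concatenate(shards)

assert np.allclose(gathered, all_reduced)
```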
When Each Parallelism Wins: A Decision Tree
Here is a practical decision tree for choosing parallelism strategies:
Does the model fit on a single GPU (including optimizer states for training)?
- Yes: Use DDP for throughput scaling. This is the simplest and most efficient option.
- No: Continue below.
Does the model fit on a single GPU with ZeRO-1 or ZeRO-2 (sharded optimizer)?
- Yes: Use ZeRO-1/2 with DDP. You get memory savings with minimal communication overhead.
- No: Continue below.
Do you have NVLink-connected GPUs (DGX-style node)?
- Yes: Start with TP=8 within the NVLink domain. If the model still does not fit, add PP across nodes. Fill remaining GPUs with DP/FSDP.
- No (PCIe only): Use FSDP/ZeRO-3 for sharding. Avoid TP beyond TP=2. Consider PP if you have multiple nodes.
Is the model MoE?
- Yes: Add EP to distribute experts. Combine with TP for attention layers and PP for depth.
- No: Stick with TP + PP + DP.
Is the sequence length over 32K tokens?
- Yes: Add CP (Ring Attention or Ulysses) to split the sequence dimension.
- No: Standard TP + PP + DP is sufficient.
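The decision tree above can be encoded as a small function. The inputs and the returned strategy strings are illustrative labels, not a real library API:

```python
def choose_parallelism(fits_one_gpu, fits_with_zero2, has_nvlink,
                       is_moe=False, seq_len=4096):
    """Sketch of the decision tree: returns a list of strategy labels."""
    if fits_one_gpu:
        strategy = ["DDP"]
    elif fits_with_zero2:
        strategy = ["DDP + ZeRO-1/2"]
    elif has_nvlink:
        strategy = ["TP=8 (NVLink)", "PP across nodes", "DP/FSDP"]
    else:
        strategy = ["FSDP/ZeRO-3", "TP<=2", "PP if multi-node"]
    # Orthogonal add-ons for MoE models and long sequences
    if is_moe:
        strategy.append("EP for experts")
    if seq_len > 32_768:
        strategy.append("CP (Ring Attention or Ulysses)")
    return strategy

# A 70B dense model on NVLink nodes with a 128K context:
print(choose_parallelism(False, False, True, seq_len=131_072))
```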
Parallelism Strategy Quick Reference
| Strategy | When to Use | Communication Pattern | Bandwidth Requirement | Latency Impact |
|---|---|---|---|---|
| DDP | Model fits on 1 GPU, scale throughput | All-reduce (1x per step) | Low | None |
| FSDP / ZeRO-3 | Model does not fit, need memory savings | All-gather + reduce-scatter | Moderate | Slight increase |
| Tensor Parallel | Layer too large for 1 GPU, have NVLink | All-reduce (2x per layer) | Very High (NVLink) | Low |
| Pipeline Parallel | Model too deep, scaling beyond 1 node | Point-to-point at boundaries | Moderate | Bubbles add latency |
| Expert Parallel | MoE model, distribute experts | All-to-all dispatch/combine | Moderate-High | All-to-all latency |
| Context Parallel | Very long sequences (over 32K) | Ring or all-to-all | Moderate | Depends on overlap |
Practical Considerations
Activation Checkpointing
Regardless of parallelism strategy, activation checkpointing (gradient checkpointing) is almost always used for large model training. Instead of storing all activations from the forward pass for use in the backward pass, you recompute them. This trades ~33% extra compute for a dramatic reduction in activation memory.
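The ~33% figure follows from the standard FLOP accounting for transformers: per token, the forward pass costs roughly 2P FLOPs and the backward pass roughly 4P. Full checkpointing replays the forward once during the backward, adding another 2P — a sketch of the arithmetic:

```python
P = 70e9                    # parameters (70B model)
fwd = 2 * P                 # forward FLOPs per token (approximate)
bwd = 4 * P                 # backward FLOPs per token (approximate)
recompute = fwd             # full checkpointing replays the forward pass

overhead = recompute / (fwd + bwd)
print(f"{overhead:.0%}")    # 33% extra compute
```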
With FSDP + activation checkpointing, the memory per GPU is approximately:

Memory per GPU ≈ (16 bytes × num_parameters) / num_GPUs + activation memory

where the activation memory depends on batch size, sequence length, and how aggressively you checkpoint.
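Under full FSDP sharding (ZeRO-3), the 16 bytes/parameter of training state from earlier in this guide is split evenly across all GPUs; a sketch of the estimate, with activation memory passed in as a rough figure:

```python
def fsdp_memory_per_gpu_gb(num_params, num_gpus, activation_gb=15.0):
    """Approximate per-GPU memory under full FSDP sharding.

    16 bytes/param = FP16 weight + FP16 grad + FP32 master + Adam moments,
    all sharded across num_gpus. activation_gb is a rough estimate that
    depends on micro-batch size, sequence length, and checkpointing.
    """
    sharded_state_gb = 16 * num_params / num_gpus / 1e9
    return sharded_state_gb + activation_gb

# 70B model on 64 GPUs: 16 * 70e9 / 64 = 17.5 GB of sharded state
print(round(fsdp_memory_per_gpu_gb(70e9, 64, activation_gb=15.0), 1))  # 32.5
```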
Mixed Precision
Modern training uses BF16 or FP16 for forward/backward computation and FP32 for optimizer states and master weights. This halves the communication volume for TP all-reduces (since activations are in FP16) and halves the parameter memory. The optimizer states remain in FP32 for numerical stability, which is why ZeRO’s optimizer state sharding is so impactful.
Overlapping Communication and Computation
The key to high GPU utilization is overlapping communication with computation:
- DDP: Gradient all-reduce is overlapped with backward computation (bucketed all-reduce starts as soon as each gradient bucket is ready)
- FSDP: All-gather for the next layer can be prefetched during the current layer’s computation
- TP: All-reduce can partially overlap with subsequent layer norm computation
- PP: Activation transfer to the next stage overlaps with the current stage’s remaining computation
- EP: Token dispatch can overlap with attention computation on the dense layers
When communication is fully overlapped with computation, it effectively “disappears” from the critical path. This is why real-world scaling efficiency is often better than what communication volume alone would suggest.
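The effect of overlap on step time can be sketched with simple arithmetic (illustrative numbers, not measurements): with no overlap the step costs compute + communication, while with perfect overlap it costs only the larger of the two.

```python
def step_time_ms(compute_ms, comm_ms, overlap_fraction):
    """overlap_fraction: share of communication hidden behind compute."""
    hidden = min(comm_ms * overlap_fraction, compute_ms)
    exposed = comm_ms - hidden
    return compute_ms + exposed

compute, comm = 100.0, 40.0
print(step_time_ms(compute, comm, 0.0))  # 140.0 -> comm fully exposed
print(step_time_ms(compute, comm, 1.0))  # 100.0 -> comm fully hidden
```

With these illustrative numbers, perfect overlap turns a 40% communication overhead into zero exposed cost — the communication has left the critical path.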
Fault Tolerance at Scale
With 2048+ GPUs, failures are not rare events — they are expected. A single GPU failure every few hours is normal at this scale. Production training systems need:
- Checkpointing: Periodic saves of model state. With FSDP, each GPU saves its local shard (fast), and shards are assembled during recovery.
- Elastic training: The ability to continue training with fewer GPUs (adjusting DP degree) when a node fails, then scale back up when it recovers.
- Communication timeout handling: Detecting and recovering from NCCL communication hangs caused by GPU or network failures.
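A back-of-envelope calculation shows why failures become routine at scale: if a single GPU fails on average every `mtbf_hours`, a cluster of N GPUs sees a failure roughly every `mtbf_hours / N`. The 10,000-hour per-GPU figure below is an assumed example, not a vendor specification:

```python
def cluster_mtbf_hours(gpu_mtbf_hours, num_gpus):
    """Mean time between failures for the whole cluster (independent failures)."""
    return gpu_mtbf_hours / num_gpus

# Assumed ~10,000-hour per-GPU MTBF on a 2048-GPU cluster
hours = cluster_mtbf_hours(10_000, 2048)
print(f"~{hours:.1f} hours between failures")  # ~4.9 hours
```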
Putting It All Together: A Worked Example
Suppose you want to train a 70B dense model on 512 H100 GPUs (64 nodes of 8 GPUs each). Here is how you would configure parallelism:
Model analysis:
- 70B parameters, 80 transformer layers
- FP16 weights: 140 GB
- With Adam optimizer (FP32): 16 bytes/param = 1.12 TB total
- Target: global batch size of 4M tokens (1024 sequences of 4096 tokens)
Configuration:
- TP=8: Within each 8-GPU node (NVLink). Each GPU holds 1/8 of each layer's weights.
- PP=4: Across 4 nodes. 20 layers per stage. With a standard 1F1B schedule the bubble fraction is (p−1)/(m+p−1), so keeping the bubble below 5% takes ~64 micro-batches — conveniently, each DP replica's 64 sequences can run as 64 micro-batches of one sequence each.
- DP=16: 16 independent TP+PP groups, each processing 1/16 of the global batch (64 sequences).
Verification: 8 × 4 × 16 = 512 GPUs. Each GPU's memory:
- Parameters: 140 GB / (8 × 4) ≈ 4.4 GB of FP16 weights, plus another ~4.4 GB for FP16 gradients
- Optimizer states (with ZeRO-1 across DP): 840 GB / (8 × 4 × 16) ≈ 1.6 GB
- Activations (with checkpointing): ~10-20 GB (depends on micro-batch size and sequence length)
- Total: ~20-30 GB per GPU — well within H100’s 80 GB
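The memory arithmetic for this configuration can be checked in a few lines (a sketch assuming ZeRO-1 accounting, i.e. FP32 master weights and Adam moments sharded across DP as well as TP × PP):

```python
P = 70e9                    # parameters
TP, PP, DP = 8, 4, 16
assert TP * PP * DP == 512  # total GPU count

params_gb = 2 * P / (TP * PP) / 1e9       # FP16 weights, split by TP * PP
grads_gb  = 2 * P / (TP * PP) / 1e9       # FP16 gradients, same split
optim_gb  = 12 * P / (TP * PP * DP) / 1e9 # FP32 master + Adam moments,
                                          # ZeRO-1 shards across DP too

print(round(params_gb, 2), round(grads_gb, 2), round(optim_gb, 2))
# ~4.38 GB weights, ~4.38 GB gradients, ~1.64 GB optimizer state
```

Adding 10-20 GB of checkpointed activations lands in the ~20-30 GB range per GPU.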
Communication:
- TP: 2 all-reduces per layer, 20 layers per stage = 40 all-reduces per micro-batch, all over NVLink (~900 GB/s) — each completes in well under a millisecond
- PP: 1 activation transfer per micro-batch at each stage boundary on InfiniBand (50 GB/s) — a few milliseconds, overlapped with compute
- DP: 1 gradient all-reduce per step across 16 replicas on InfiniBand — overlapped with backward computation
This configuration achieves approximately 40-50% Model FLOPs Utilization (MFU), meaning 40-50% of the theoretical peak FLOPS are used for useful computation. The rest is overhead from communication, memory operations, and pipeline bubbles.
MFU measures what fraction of the GPU’s theoretical peak compute is used for model computation (excluding communication, memory copies, etc.). State-of-the-art training runs achieve 40-55% MFU. DeepSeek V3 reportedly achieved ~55% MFU on their 2048-GPU cluster. Higher MFU means less money wasted on idle silicon.
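MFU follows from the standard estimate that training a transformer costs roughly 6 × P FLOPs per token (forward + backward). A sketch of the calculation — the ~989 TFLOPS peak assumes H100 dense BF16 throughput, and the tokens/sec figure is illustrative, not a measured result:

```python
def mfu(num_params, tokens_per_sec, num_gpus, peak_flops=989e12):
    """Model FLOPs Utilization: useful model FLOPs / theoretical peak FLOPs."""
    model_flops_per_sec = 6 * num_params * tokens_per_sec
    return model_flops_per_sec / (num_gpus * peak_flops)

# 70B model on 512 H100s at an assumed 500K tokens/sec
print(f"{mfu(70e9, 5.0e5, 512):.1%}")  # 41.5% under these assumptions
```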
Summary
Complete Parallelism Strategy Summary
| Dimension | What It Splits | Communication | Where to Use | Scales To |
|---|---|---|---|---|
| Data (DDP) | Training data across replicas | All-reduce gradients | Everywhere | 1000s of GPUs |
| Data (FSDP/ZeRO) | Data + params + optimizer + grads | All-gather + reduce-scatter | Everywhere | 1000s of GPUs |
| Tensor (TP) | Weight matrices within each layer | All-reduce per layer | Intra-node (NVLink) | 8 GPUs per group |
| Pipeline (PP) | Layers across pipeline stages | Point-to-point activations | Inter-node (InfiniBand) | 16-64 stages |
| Expert (EP) | MoE experts across GPUs | All-to-all dispatch/combine | Across cluster | Num experts |
| Context (CP) | Sequence dimension | Ring or all-to-all | Across nodes | Sequence length / chunk |
The art of distributed training is choosing the right combination of these strategies for your specific model, hardware, and requirements. The general recipe:
- TP=8 intra-node to split layers across NVLink-connected GPUs
- PP across nodes to handle model depth beyond what TP covers
- DP/FSDP across the cluster to scale throughput
- EP for MoE models to distribute experts
- CP for long sequences when the sequence dimension is the memory bottleneck
Each parallelism dimension addresses a different bottleneck (layer size, model depth, throughput, expert count, sequence length) and maps to a different level of the hardware interconnect hierarchy. Getting this mapping right is the difference between 50% MFU and 20% MFU — the difference between a training run that finishes in weeks and one that takes months.