Llama 3 405B requires 810 GB just for weights in BF16. An H100 has 80 GB. The model physically cannot fit on one GPU, so you must distribute it across multiple devices. Tensor parallelism splits every operation across GPUs with an all-reduce per layer — fast within a node over NVLink, catastrophic across nodes over InfiniBand because 160 all-reduces per forward pass drowns the network. Pipeline parallelism splits the model by layers: Device 0 gets layers 0-9, Device 1 gets layers 10-19, and only activation tensors cross device boundaries. Communication volume drops by 20x, but you introduce pipeline bubbles where GPUs sit idle waiting for the next micro-batch. The best schedules — PipeDream 1F1B, Megatron interleaved, DeepSeek DualPipe — minimize bubbles to under 10%. This post covers the algorithms, the bubble math, and when PP beats TP.
There are three fundamental strategies for distributing a model across devices:
- Data parallelism (DP) — replicate the full model on every device, split the data. This does not help when the model itself exceeds one device’s memory.
- Tensor parallelism (TP) — split individual operations (matrix multiplications, attention heads) across devices. Every operation requires an all-reduce across participating GPUs, so it demands extremely high-bandwidth interconnect. Within a single node using NVLink (900 GB/s bidirectional on H100), TP works well. Across nodes over InfiniBand (50-100 GB/s per link), the latency and bandwidth penalty is severe.
- Pipeline parallelism (PP) — split the model by layers. Device 0 holds layers 0-9, device 1 holds layers 10-19, and so on. Only activation tensors cross device boundaries, and they cross in one direction at a time. The communication volume is dramatically smaller than TP’s all-reduce pattern.
Tensor parallelism requires two all-reduce operations per transformer layer (one for the attention projection, one for the MLP). For a model with hidden dimension h, sequence length s, and micro-batch size b, each all-reduce communicates roughly b * s * h * 2 bytes in BF16. Over InfiniBand at 50 GB/s effective bandwidth, this adds 100-200 microseconds of latency per layer. With 80+ layers, cross-node TP can spend more time communicating than computing. Pipeline parallelism replaces this with a single point-to-point transfer of activations between adjacent stages, which is orders of magnitude less communication.
Pipeline parallelism is the natural choice for inter-node distribution. You assign consecutive layers to each node, and the only data that crosses the (slow) inter-node network is the activation tensor at layer boundaries. For a transformer with hidden size h, micro-batch size b, and sequence length s, the activation tensor at each boundary is b * s * h elements. Compare this to TP, which would require an all-reduce of the same-size tensor twice per layer across all nodes.
Communication Volume Per Training Step (80-layer Transformer, h=8192)
| Strategy | Nodes | Per-Layer Comm | Total Comm | Interconnect Need |
|---|---|---|---|---|
| TP across 8 nodes | 8 | 2 all-reduce of b*s*h | 160 all-reduces | NVLink-class |
| PP across 8 nodes | 8 | 1 point-to-point of b*s*h | 7 point-to-point | InfiniBand OK |
| TP intra-node + PP inter-node | 8 | 2 all-reduce (local) + 1 p2p | Hybrid | Optimal |
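The gap in the table can be sanity-checked with a quick back-of-envelope script. This is a sketch, not a measurement: it assumes BF16 activations, approximates a ring all-reduce as moving twice the tensor size, and counts one micro-batch per step.

```python
# Back-of-envelope: bytes moved per micro-batch by TP vs PP for an
# 80-layer transformer with h=8192, s=4096, b=1, BF16 (2 bytes/element).
layers, h, s, b = 80, 8192, 4096, 1
act_bytes = b * s * h * 2            # one activation tensor (~64 MiB)

# TP: 2 all-reduces per layer; a ring all-reduce moves ~2x the tensor size.
tp_events = 2 * layers               # 160 all-reduces per forward pass
tp_bytes = tp_events * 2 * act_bytes

# PP across 8 stages: 7 boundary crossings, one tensor each.
pp_events = 8 - 1
pp_bytes = pp_events * act_bytes

ratio = tp_bytes / pp_bytes
print(f"TP moves {tp_bytes / 1e9:.1f} GB, PP moves {pp_bytes / 1e9:.2f} GB "
      f"({ratio:.0f}x more for TP)")
```

The exact ratio depends on the all-reduce algorithm and micro-batch count, but the order-of-magnitude gap is robust.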
But pipeline parallelism has a fundamental problem: the pipeline bubble.
The Bubble Problem
Consider a naive pipeline with P = 4 stages processing a single mini-batch. Stage 0 computes the forward pass for layers 0-19, then sends the activation to stage 1. Stage 1 computes layers 20-39, sends to stage 2, and so on. During this entire forward cascade, stages 1, 2, and 3 are idle while waiting for their input (and stage 0 goes idle once it has sent its output). Then during the backward pass, stages cascade in reverse. The result is that each stage is active for only a small fraction of the total time.
Naive Pipeline (P=4, single batch):
Time --> | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
Stage 0: | F0 | | | | | | | B0 |
Stage 1: | | F0 | | | | | B0 | |
Stage 2: | | | F0 | | | B0 | | |
Stage 3: | | | | F0 | B0 | | | |
F0 = forward micro-batch 0, B0 = backward micro-batch 0
Empty cells = GPU sitting idle (the "bubble")
The idle fraction in this naive schedule is:

bubble_naive = (P - 1) / P

For P = 8 pipeline stages, this means 87.5% of GPU time is wasted. You are paying for 8 GPUs but getting the throughput of slightly more than 1. This is unacceptable.
[Figure: Naive Pipeline Bubble Fraction (%) by Number of Stages]

The entire history of pipeline parallelism research is, in essence, the quest to shrink this bubble.
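As a quick check of the naive-schedule numbers, a minimal sketch:

```python
def naive_bubble(P: int) -> float:
    """Idle fraction of a naive single-batch pipeline with P stages."""
    return (P - 1) / P

fractions = {P: naive_bubble(P) for P in (2, 4, 8, 16)}
# P=4 matches the diagram above: each stage is busy 2 of 8 slots (75% idle).
print(fractions)
```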
GPipe: Micro-Batching to Shrink the Bubble
GPipe (Huang et al., 2019) introduced the foundational insight: split the mini-batch into micro-batches and pipeline them through the stages. Instead of waiting for the entire forward pass to complete before starting the backward pass, you feed micro-batches into the pipeline one after another. While stage 0 is processing micro-batch 1, stage 1 is processing micro-batch 0 from the previous time step.
GPipe Schedule (P=4, M=8 micro-batches):
Time --> | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | ... | 19 |
Stage 0: | F0 | F1 | F2 | F3 | F4 | F5 | F6 | F7 | | | | | B0 |
Stage 1: | | F0 | F1 | F2 | F3 | F4 | F5 | F6 | F7 | | | | |
Stage 2: | | | F0 | F1 | F2 | F3 | F4 | F5 | F6 | F7 | | | |
Stage 3: | | | | F0 | F1 | F2 | F3 | F4 | F5 | F6 | F7 | B7 | |
All forward passes complete, then all backward passes run.
Bubble is only the fill and drain phases: (P-1) time slots at each end.
The key schedule is: all micro-batches execute their forward passes in pipeline fashion, then all micro-batches execute their backward passes. Gradients are accumulated across micro-batches and the optimizer step happens once per mini-batch.
The bubble fraction becomes:

bubble_gpipe = (P - 1) / (P - 1 + M)

This is a dramatic improvement. With P = 8 stages and M = 32 micro-batches:

bubble = 7 / 39 ≈ 17.9%

Compare this to 87.5% with the naive approach.
GPipe Bubble Fraction: Effect of Micro-batch Count M
| Micro-batches (M) | P=4 | P=8 | P=16 | P=32 |
|---|---|---|---|---|
| M=4 | 42.9% | 63.6% | 78.9% | 88.6% |
| M=8 | 27.3% | 46.7% | 65.2% | 79.5% |
| M=16 | 15.8% | 30.4% | 48.4% | 66.0% |
| M=32 | 8.6% | 17.9% | 32.6% | 49.2% |
| M=64 | 4.5% | 9.9% | 19.0% | 32.6% |
| M=128 | 2.3% | 5.2% | 10.5% | 19.5% |
```python
class GPipeSchedule:
    """
    GPipe micro-batch pipeline schedule.
    All forwards run first (fill phase), then all backwards (drain phase).
    Gradient accumulation happens across micro-batches.
    """
    def __init__(self, stages: int, num_microbatches: int):
        self.P = stages
        self.M = num_microbatches

    def bubble_fraction(self) -> float:
        return (self.P - 1) / (self.P - 1 + self.M)

    def execute(self, mini_batch):
        # Schematic: `split`, `zeros_like`, `accumulate`, `self.forward`,
        # `self.backward`, etc. are framework-provided.
        micro_batches = split(mini_batch, self.M)

        # Phase 1: All forward passes (pipeline fill + steady state).
        # Each micro-batch flows through all P stages.
        activations = {}  # (stage, mb_idx) -> activation tensor
        for mb_idx in range(self.M):
            for stage in range(self.P):
                input_act = micro_batches[mb_idx] if stage == 0 \
                    else activations[(stage - 1, mb_idx)]
                activations[(stage, mb_idx)] = self.forward(stage, input_act)

        # Phase 2: All backward passes (pipeline drain).
        # Must store ALL activations from the forward phase -- memory intensive.
        accumulated_grads = [zeros_like(p) for p in self.parameters()]
        for mb_idx in reversed(range(self.M)):
            for stage in reversed(range(self.P)):
                grads = self.backward(stage, activations[(stage, mb_idx)])
                accumulate(accumulated_grads, grads)

        # Phase 3: Single optimizer step per mini-batch.
        self.optimizer_step(accumulated_grads)
```
GPipe must store activations for ALL micro-batches across ALL stages simultaneously during the forward phase, because backward passes do not begin until all forward passes complete. For hidden size h and sequence length s in BF16, each activation tensor is b * s * h * 2 bytes. Even with micro-batch size b = 1 (at s = 4096, h = 8192), that is 64 MB per activation. Storing 32 of these per stage means 2 GB of activation memory per stage, and this is just one layer boundary. The actual activation memory across all layers in a stage is much larger.
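The arithmetic above can be reproduced in a few lines (a sketch using the b = 1, s = 4096, h = 8192 values from the text):

```python
# One boundary activation in BF16, then the total for M in-flight micro-batches.
b, s, h = 1, 4096, 8192
act_mib = b * s * h * 2 / 2**20   # 2 bytes per BF16 element -> MiB

for M in (8, 16, 32):
    print(f"M={M:2d}: {M} x {act_mib:.0f} MiB = {M * act_mib / 1024:.1f} GiB per boundary")
```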
GPipe’s Gradient Accumulation Semantics
A subtle but important property of GPipe: because all forward passes complete before any backward pass begins, the gradient computation is mathematically identical to processing the full mini-batch at once (up to floating-point ordering differences). Each micro-batch’s forward pass uses the same model weights. The accumulated gradients across micro-batches equal the gradient of the full mini-batch loss.
This is a clean semantic guarantee. The training dynamics are identical to single-device training with the same effective batch size. No weight staleness, no approximation.
PipeDream 1F1B: Interleaving Forward and Backward
PipeDream (Narayanan et al., 2019) and its successor PipeDream-Flush (also called the “1F1B schedule”) address GPipe’s memory problem by interleaving forward and backward passes. Instead of completing all forwards before starting any backward, each stage alternates: one forward, one backward, one forward, one backward.
1F1B Schedule (P=4, M=8 micro-batches):
Time --> | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | ...
Stage 0: | F0 | F1 | F2 | F3 | B0 | F4 | B1 | F5 | B2 | F6 | B3 | F7 | ...
Stage 1: | | F0 | F1 | F2 | F3 | B0 | F4 | B1 | F5 | B2 | F6 | B3 | ...
Stage 2: | | | F0 | F1 | F2 | F3 | B0 | F4 | B1 | F5 | B2 | F6 | ...
Stage 3: | | | | F0 | F1 | F2 | F3 | B0 | F4 | B1 | F5 | B2 | ...
Warmup: each stage does (P - stage_id - 1) extra forwards before entering steady state.
Steady state: strictly alternating 1 forward + 1 backward.
Cooldown: remaining backwards drain the pipeline.
The schedule has three phases:
- Warmup phase: Stage i performs P - i - 1 forward passes to fill the pipeline. Stage 0 does P - 1 forwards, stage 1 does P - 2, and so on. The last stage does 0 warmup forwards.
- Steady-state phase: Each stage alternates exactly one forward pass and one backward pass (hence “1F1B”). This is the bulk of the computation.
- Cooldown phase: Remaining backward passes drain the pipeline.
The bubble fraction is the same as GPipe: (P - 1) / (P - 1 + M). The critical difference is in memory.
The Memory Advantage of 1F1B
In GPipe, all forward passes complete before any backward pass starts. This means every stage must store activations for all micro-batches simultaneously. The peak activation memory per stage scales as O(M).
In 1F1B, once a stage enters steady state, it performs a backward pass immediately after each forward pass. The backward pass for micro-batch i frees the activations for micro-batch i. At any point during steady state, a stage has at most P micro-batches' activations in flight: one for each warmup forward it ran, plus the current one.
Since typically M >> P (you want many micro-batches to reduce the bubble), this is a major memory saving.
[Figure: Activation Memory, GPipe vs 1F1B (P=4 stages): peak number of micro-batch activations stored simultaneously per stage]
```python
class OneFOneBSchedule:
    """
    1F1B (One Forward One Backward) pipeline schedule.
    Key insight: interleaving forward and backward passes limits
    the number of in-flight micro-batches to P (not M).
    """
    def __init__(self, stages: int, num_microbatches: int):
        self.P = stages
        self.M = num_microbatches

    def peak_activations(self) -> int:
        """Max micro-batch activations stored at any stage."""
        return self.P  # Not M! This is the key advantage.

    def schedule_for_stage(self, stage_id: int):
        """Generate the action sequence for a single stage."""
        warmup_forwards = self.P - 1 - stage_id
        steady_state_pairs = self.M - warmup_forwards
        cooldown_backwards = warmup_forwards

        actions = []
        # Phase 1: Warmup -- only forwards
        for i in range(warmup_forwards):
            actions.append(('forward', i))
        # Phase 2: Steady state -- alternating 1F1B
        fwd_idx = warmup_forwards
        bwd_idx = 0
        for _ in range(steady_state_pairs):
            actions.append(('forward', fwd_idx))
            actions.append(('backward', bwd_idx))
            fwd_idx += 1
            bwd_idx += 1
        # Phase 3: Cooldown -- only backwards
        for i in range(cooldown_backwards):
            actions.append(('backward', bwd_idx + i))
        return actions
```
Memory Comparison: GPipe vs 1F1B
| Config | GPipe Activations | 1F1B Activations | Memory Reduction |
|---|---|---|---|
| P=4, M=16 | 16 per stage | 4 per stage | 4x |
| P=4, M=32 | 32 per stage | 4 per stage | 8x |
| P=8, M=32 | 32 per stage | 8 per stage | 4x |
| P=8, M=64 | 64 per stage | 8 per stage | 8x |
| P=16, M=128 | 128 per stage | 16 per stage | 8x |
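The P-not-M bound can be checked by simulating one stage's schedule directly. This standalone sketch tracks how many activations a stage holds at once: each forward stores one, each backward frees one.

```python
def peak_in_flight(P: int, M: int, stage: int) -> int:
    """Simulate 1F1B for one stage and return peak stored activations."""
    warmup = P - 1 - stage
    stored = peak = 0
    # Warmup: only forwards, each storing one activation
    for _ in range(warmup):
        stored += 1
        peak = max(peak, stored)
    # Steady state: one forward (store), then one backward (free)
    for _ in range(M - warmup):
        stored += 1
        peak = max(peak, stored)
        stored -= 1
    # Cooldown backwards only free memory, so the peak is already set
    return peak

# Stage 0 is deepest in the warmup and stores the most: P activations.
peaks = [peak_in_flight(P=8, M=64, stage=st) for st in range(8)]
print(peaks)
```

Peak storage is P - stage_id per stage, so "P per stage" in the table is the worst case (stage 0); it never grows with M.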
The original PipeDream paper proposed asynchronous weight updates — each stage would apply its gradient immediately without waiting for other stages. This introduces weight staleness: forward and backward passes for the same micro-batch may use different weight versions. PipeDream-Flush (and Megatron-LM’s 1F1B implementation) eliminated this by using synchronous gradient accumulation, exactly like GPipe. The 1F1B schedule as used in practice today does NOT have weight staleness.
Interleaved Pipeline Stages (Megatron-LM)
Narayanan et al. (2021) in the Megatron-LM paper introduced interleaved pipeline stages, also known as “virtual pipeline stages.” The idea: instead of assigning each GPU a single contiguous block of layers, assign each GPU non-consecutive blocks (“virtual stages”).
For example, with P = 4 GPUs, 32 layers, and V = 2 virtual stages per GPU:
Standard pipeline (V=1):
GPU 0: layers 0- 7 (stage 0)
GPU 1: layers 8-15 (stage 1)
GPU 2: layers 16-23 (stage 2)
GPU 3: layers 24-31 (stage 3)
Interleaved pipeline (V=2):
GPU 0: layers 0- 3 (virtual stage 0) + layers 16-19 (virtual stage 4)
GPU 1: layers 4- 7 (virtual stage 1) + layers 20-23 (virtual stage 5)
GPU 2: layers 8-11 (virtual stage 2) + layers 24-27 (virtual stage 6)
GPU 3: layers 12-15 (virtual stage 3) + layers 28-31 (virtual stage 7)
The total number of virtual pipeline stages is P * V. The pipeline sees P * V stages, but each “stage” is smaller (fewer layers). A micro-batch passes through all P * V virtual stages, bouncing between GPUs multiple times. The critical insight: each virtual stage takes 1/V the time to compute, so the pipeline fill and drain time is divided by V.
The bubble fraction becomes:

bubble_interleaved = (P - 1) / (M * V)

Note that V appears in the denominator multiplicatively. With P = 8, M = 32, V = 4:

bubble = 7 / 128 ≈ 5.5%

Compare this to GPipe’s 17.9% with the same M.
[Figure: Bubble Fraction (%), Standard vs Interleaved Pipeline (P=8)]

The Communication Cost of Interleaving
There is no free lunch. With V virtual stages per GPU, a micro-batch must transfer activations between GPUs P * V - 1 times instead of P - 1 times. Each transfer is the same size as before (the activation tensor depends on the hidden dimension, not on how many layers a stage holds), so it is the number of transfers, not their size, that grows.
More precisely, with a standard pipeline there are P - 1 inter-device activation transfers per micro-batch per direction. With an interleaved pipeline (V virtual stages), there are P * V - 1 transfers. The volume per transfer is identical, so total communication increases by roughly a factor of V.
This trade-off works well when:
- The network bandwidth between GPUs is high relative to compute time per stage
- The bubble reduction from interleaving outweighs the communication overhead
- You use NVLink within a node (where bandwidth is abundant)
It works poorly when:
- Stages are already communication-bound
- Inter-node bandwidth is limited (the extra transfers saturate the link)
```python
class InterleavedPipelineSchedule:
    """
    Interleaved (virtual) pipeline stages as in Megatron-LM.
    Each GPU holds V non-consecutive chunks of layers.
    Bubble fraction: (P-1) / (M*V) instead of (P-1) / (P-1+M).
    Communication increases by roughly a factor of V.
    """
    def __init__(self, num_gpus: int, virtual_stages: int, num_microbatches: int):
        self.P = num_gpus
        self.V = virtual_stages
        self.M = num_microbatches
        self.total_virtual_stages = num_gpus * virtual_stages

    def bubble_fraction(self) -> float:
        return (self.P - 1) / (self.M * self.V)

    def communication_overhead_ratio(self) -> float:
        """Ratio of communication vs standard pipeline."""
        standard_transfers = self.P - 1
        interleaved_transfers = self.P * self.V - 1
        return interleaved_transfers / standard_transfers

    def layer_assignment(self, gpu_id: int, total_layers: int) -> list:
        """Which layers does this GPU hold?"""
        layers_per_virtual_stage = total_layers // self.total_virtual_stages
        assigned = []
        for v in range(self.V):
            virtual_stage_id = gpu_id + v * self.P
            start = virtual_stage_id * layers_per_virtual_stage
            end = start + layers_per_virtual_stage
            assigned.extend(range(start, end))
        return assigned
```
Interleaved Pipeline: Bubble vs Communication Trade-off (P=8, M=32)
| Virtual Stages (V) | Bubble % | Comm Overhead vs V=1 | Net Throughput Gain |
|---|---|---|---|
| V=1 (standard) | 17.9% | 1.0x | Baseline |
| V=2 | 10.9% | 2.1x | +5-8% |
| V=4 | 5.5% | 4.4x | +8-12% |
| V=8 | 2.7% | 9.0x | +5-10% |
| V=16 | 1.4% | 18.1x | Comm-bound |
Megatron-LM experiments show V = 2 to V = 4 is usually optimal for inter-node pipeline parallelism. Beyond V = 4, the communication overhead from extra activation transfers typically outweighs the bubble reduction. For intra-node pipelines over NVLink, higher V can be beneficial because NVLink bandwidth is abundant.
DualPipe: Bidirectional Pipeline (DeepSeek-V3)
DeepSeek-V3 (Dai et al., 2024) introduced DualPipe, a bidirectional pipeline schedule that achieves near-zero bubble overhead. The core idea: send micro-batches from both ends of the pipeline simultaneously. Half the micro-batches flow forward (stage 0 to stage P-1) and the other half flow backward (stage P-1 to stage 0).
The insight is that when a GPU is executing the forward pass for a micro-batch flowing left-to-right, it can simultaneously execute the backward pass for a micro-batch flowing right-to-left. The forward computation of one micro-batch overlaps with the backward computation of another — they use different parts of the compute resources (forward is mostly matrix multiplications and activations; backward includes gradient computation and weight gradient accumulation).
DualPipe Schedule (P=4, simplified):
Left-to-right micro-batches: L0, L1, L2, L3
Right-to-left micro-batches: R0, R1, R2, R3
Time --> | 1 | 2 | 3 | 4 | 5 | 6 |
Stage 0: | F(L0) | F(L1) |F(L2)+B(R0)|F(L3)+B(R1)| B(R2) | B(R3) |
Stage 1: |       | F(L0) |F(L1)+B(R0)|F(L2)+B(R1)|  ...  |       |
Stage 2: |       | F(R0) |F(R1)+B(L0)|F(R2)+B(L1)|  ...  |       |
Stage 3: | F(R0) | F(R1) |F(R2)+B(L0)|F(R3)+B(L1)| B(L2) | B(L3) |
In steady state, each GPU runs forward + backward simultaneously.
The bubble is only the initial fill and final drain.
The key innovation: in steady state, every GPU is executing both a forward pass and a backward pass at every time step. The forward pass for one direction’s micro-batch overlaps with the backward pass for the other direction’s micro-batch. This means the bubble is only in the pipeline fill and drain, which is minimal.
DualPipe achieves an effective bubble fraction that approaches:

bubble_dualpipe = (P - 1) / (2 * M)

The factor of 2 improvement over standard 1F1B comes from the bidirectional overlap. For P = 8, M = 32:

bubble = 7 / 64 ≈ 10.9%

But in practice, the overlap between forward and backward computation can hide even more of the bubble. DeepSeek reports near-zero effective bubble with sufficient micro-batches.
[Figure: Bubble Fraction (%) Comparison Across Pipeline Schedules (P=8, M=32)]

```python
class DualPipeSchedule:
    """
    DualPipe bidirectional pipeline schedule (DeepSeek-V3).
    Micro-batches flow from both ends simultaneously.
    Forward of left-to-right overlaps with backward of right-to-left.
    """
    def __init__(self, stages: int, num_microbatches: int):
        self.P = stages
        self.M = num_microbatches
        # Split micro-batches into two streams
        self.M_left = num_microbatches // 2   # flow left-to-right
        self.M_right = num_microbatches // 2  # flow right-to-left

    def bubble_fraction(self) -> float:
        """Theoretical bubble before overlap hiding."""
        return (self.P - 1) / (2 * self.M)

    def effective_bubble(self, overlap_efficiency: float = 0.8) -> float:
        """
        Effective bubble after forward/backward overlap.
        overlap_efficiency: fraction of bubble hidden by overlapping
        forward and backward of different micro-batches.
        """
        theoretical = self.bubble_fraction()
        return theoretical * (1 - overlap_efficiency)

    def schedule_for_stage(self, stage_id: int):
        """
        Each stage handles two streams (schematic, not cycle-accurate):
          - Left stream: micro-batches 0..M_left-1 flowing stage 0 -> P-1
          - Right stream: micro-batches 0..M_right-1 flowing stage P-1 -> 0
        In steady state, forward(left) overlaps with backward(right)
        and forward(right) overlaps with backward(left).
        """
        actions = []
        # Warmup: fill pipeline from both ends
        for t in range(stage_id):
            actions.append(('idle', t))
        # Steady state: overlapped forward + backward
        for mb in range(self.M_left):
            actions.append(('forward_left + backward_right', mb))
        # Cooldown: drain from both ends
        for t in range(self.P - 1 - stage_id):
            actions.append(('drain', t))
        return actions
```
DualPipe requires that forward and backward computation can overlap efficiently on the same GPU. This works well for transformer models where forward is dominated by matrix multiplications and backward by gradient matrix multiplications — they can overlap on different CUDA streams if the GPU has sufficient compute units and memory bandwidth. Models with heavy sequential dependencies or irregular computation patterns may not overlap as well.
Memory Analysis: The Activation Storage Problem
Every pipeline schedule must store activations (intermediate results from the forward pass) until the corresponding backward pass consumes them. The number of stored activations determines peak memory usage, which directly limits model size and micro-batch size.
Activation Memory Per Micro-Batch
For a transformer layer with hidden dimension h, sequence length s, micro-batch size b, and a attention heads, the stored activations include:
- Input to each layer: b * s * h elements
- Attention intermediate states: roughly b * a * s^2 elements for the attention score matrices (for full attention)
- MLP intermediate activations: 4 * b * s * h elements (for a standard FFN with 4h inner dimension)
- Layer norm inputs: 2 * b * s * h elements (one per sub-layer)
In BF16 (2 bytes per element), a single transformer layer’s activations are approximately:

2 * (7 * b * s * h + b * a * s^2) bytes

For typical values (b = 1, s = 4096, h = 8192, a = 64), that is roughly 2.6 GB per layer.
If a pipeline stage has 10 layers, the activation memory per micro-batch per stage is roughly 26 GB. Now multiply by the number of in-flight micro-batches.
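These estimates can be reproduced with a short script. A sketch under the stated assumptions (the accounting of which intermediates are stored is approximate, and a = 64 heads, giving head dimension 128, is an assumed value):

```python
def layer_activation_bytes(b: int, s: int, h: int, a: int) -> int:
    """Approximate BF16 activation bytes stored per transformer layer:
    layer input (bsh) + attention scores (b*a*s^2) + MLP (4bsh) + 2 LN inputs (2bsh)."""
    elements = 7 * b * s * h + b * a * s * s
    return 2 * elements  # BF16: 2 bytes per element

per_layer = layer_activation_bytes(b=1, s=4096, h=8192, a=64)
per_stage = 10 * per_layer  # 10 layers per stage
print(f"{per_layer / 1e9:.1f} GB/layer, {per_stage / 1e9:.0f} GB per stage per micro-batch")
```

Note the attention-score term b*a*s^2 dominates at long sequence lengths, which is exactly why selective checkpointing (below) targets it first.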
[Figure: GPU Memory Budget Per Pipeline Stage (H100 80GB). Memory allocation for a single pipeline stage in a 405B model across 8 GPUs; segments of roughly ~12.5 GB, ~25 GB, ~12.5 GB, ~20-30 GB, and ~0-10 GB.]

Activation Memory Per Stage: GPipe vs 1F1B (10 layers/stage, b=1, s=4096, h=8192)
| Schedule | In-Flight MBs | Activation Memory | Feasible on 80GB H100? |
|---|---|---|---|
| GPipe M=4 | 4 | ~11 GB | Yes (tight) |
| GPipe M=16 | 16 | ~45 GB | No -- exceeds budget |
| GPipe M=32 | 32 | ~90 GB | No -- far exceeds |
| 1F1B P=4 | 4 | ~11 GB | Yes |
| 1F1B P=8 | 8 | ~22 GB | Yes |
| 1F1B P=16 | 16 | ~45 GB | Marginal |
Activation Checkpointing (Recomputation)
Activation checkpointing (also called gradient checkpointing or activation recomputation) is the standard solution to the activation memory problem. Instead of storing all intermediate activations, you store only the input activation at each layer boundary and recompute the intermediate activations during the backward pass.
The trade-off is straightforward: you trade compute for memory. With full activation checkpointing, you store only 1 activation tensor per layer instead of ~4-5, reducing activation memory by roughly 4-5x. The cost is recomputing the forward pass during backward, adding approximately 33% more total compute (the forward pass is about 1/3 of the total forward + backward compute).
```python
import torch

class ActivationCheckpointing:
    """
    Activation checkpointing reduces memory by recomputing
    activations during the backward pass instead of storing them.
    """
    def __init__(self, checkpoint_every_n_layers: int = 1):
        self.checkpoint_interval = checkpoint_every_n_layers

    def memory_reduction(self, layers_per_stage: int) -> float:
        """Approximate factor of memory reduction vs no checkpointing."""
        # Without checkpointing: ~5 stored tensors per layer (input + intermediates).
        # With checkpointing: one stored input per checkpoint, plus the
        # intermediates of one segment live during recomputation.
        checkpoints = layers_per_stage // self.checkpoint_interval
        without = 5 * layers_per_stage
        with_ckpt = checkpoints + 5 * self.checkpoint_interval
        return without / with_ckpt

    def compute_overhead(self) -> float:
        """Additional compute as a fraction of total.
        Recomputing the forward during backward adds ~33%, since the
        forward pass is about 1/3 of total (forward + backward) compute."""
        return 0.33

    def forward_with_checkpointing(self, layers, input_tensor):
        """
        Only store activations at checkpoint boundaries;
        recompute intermediates during backward.
        """
        checkpointed_activations = [input_tensor]
        x = input_tensor
        for i, layer in enumerate(layers):
            # torch.utils.checkpoint stores the input to `layer`
            # but not its intermediate activations.
            x = torch.utils.checkpoint.checkpoint(layer, x)
            if (i + 1) % self.checkpoint_interval == 0:
                checkpointed_activations.append(x)
        return x, checkpointed_activations
```
Selective activation checkpointing is a refinement: only recompute the cheap-to-recompute activations (like attention scores, which are memory-heavy but compute-cheap relative to matrix multiplications). This gives most of the memory savings with less compute overhead.
Activation Checkpointing Trade-offs
| Strategy | Memory Savings | Compute Overhead | Typical Use Case |
|---|---|---|---|
| No checkpointing | 0% | 0% | Small models, abundant memory |
| Full checkpointing | ~80% | ~33% | Very large models, memory-constrained |
| Selective (attn only) | ~50-60% | ~10-15% | Medium models, balanced trade-off |
| Every-2-layers | ~50% | ~16% | Moderate memory pressure |
Combining Pipeline, Tensor, and Data Parallelism
In practice, large-scale training uses all three parallelism strategies simultaneously. The typical recipe for a cluster with N total GPUs, where each node has G GPUs connected by NVLink:
- Tensor parallelism (TP) within a node: TP = G (e.g., 8 GPUs per node). All-reduce operations stay on NVLink.
- Pipeline parallelism (PP) across nodes: PP = the number of nodes in a pipeline. Point-to-point activation transfers go over InfiniBand.
- Data parallelism (DP) across pipeline replicas: DP = N / (TP * PP). Gradient all-reduce across DP replicas.
Total GPUs: N = TP * PP * DP.
Example: Training Llama 405B on 1024 H100s (128 nodes, 8 GPUs/node)
TP = 8 (within each node, NVLink)
PP = 8 (across 8 nodes per pipeline, InfiniBand)
DP = 16 (16 pipeline replicas, InfiniBand)
Total = 8 * 8 * 16 = 1024 GPUs
Each pipeline replica:
- 8 nodes, 64 GPUs
- Each node holds one pipeline stage
- Within each node, 8 GPUs share one stage via TP
- Model: 126 layers / 8 stages = ~16 layers per stage
- Each TP group of 8 GPUs holds 16 layers with weights sharded 8-way
Data parallelism:
- 16 replicas of this 64-GPU pipeline
- Global batch split 16 ways
- Gradient all-reduce across 16 replicas after each mini-batch
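The arithmetic of the layout above can be verified in a few lines (a sketch of the example configuration):

```python
# Sanity-check the 1024-GPU layout described above.
tp, pp, dp = 8, 8, 16
gpus_per_node = 8

total_gpus = tp * pp * dp                              # 8 * 8 * 16
gpus_per_replica = tp * pp                             # one pipeline replica
nodes_per_replica = gpus_per_replica // gpus_per_node  # nodes per replica
layers_per_stage = 126 / pp                            # ~16 layers per stage

print(total_gpus, gpus_per_replica, nodes_per_replica, round(layers_per_stage))
```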
[Figure: 3D Parallelism Layout, 1024 GPUs for 405B Model: TP=8 (intra-node NVLink), PP=8 (inter-node IB), DP=16 (across pods)]
Communication Patterns in 3D Parallelism
Each parallelism dimension has a distinct communication pattern:
Tensor parallelism (most frequent, highest bandwidth):
- 2 all-reduce operations per transformer layer
- Volume: b * s * h * 2 bytes (BF16) per all-reduce
- Must complete within the compute time of one layer
- Requires NVLink-class bandwidth (hundreds of GB/s)
Pipeline parallelism (moderate frequency, moderate bandwidth):
- 1 point-to-point transfer per pipeline stage boundary per micro-batch
- Volume: b * s * h * 2 bytes (BF16) per transfer
- Can overlap with compute at adjacent stages
- InfiniBand sufficient (tens of GB/s)
Data parallelism (least frequent, can overlap):
- 1 all-reduce of all gradients per mini-batch
- Volume: the full gradient set (model parameters * 2 bytes in BF16)
- Can overlap with backward pass computation (ring all-reduce)
- InfiniBand sufficient, can use network efficiently
```python
class ThreeDParallelConfig:
    """
    Configuration for 3D parallelism: TP + PP + DP.
    """
    def __init__(self, tp: int, pp: int, dp: int,
                 gpus_per_node: int = 8):
        self.tp = tp
        self.pp = pp
        self.dp = dp
        self.total_gpus = tp * pp * dp
        self.gpus_per_node = gpus_per_node
        self.nodes = self.total_gpus // gpus_per_node
        # Validate: TP should fit within a node
        assert tp <= gpus_per_node, \
            f"TP={tp} exceeds GPUs per node={gpus_per_node}"

    def communication_volume_per_step(self, model_params: int,
                                      hidden: int, seq_len: int,
                                      micro_batch: int,
                                      layers_per_stage: int) -> dict:
        """Estimate communication volumes per training step."""
        activation_size = micro_batch * seq_len * hidden * 2  # BF16
        # TP: 2 all-reduces per layer, within node
        tp_volume = 2 * layers_per_stage * activation_size * (self.tp - 1) / self.tp
        tp_link = "NVLink"
        # PP: 1 point-to-point per stage boundary per micro-batch
        # (number of micro-batches depends on schedule)
        pp_volume = activation_size  # per micro-batch per boundary
        pp_link = "InfiniBand" if self.tp == self.gpus_per_node else "NVLink"
        # DP: all-reduce of all gradients, once per mini-batch
        dp_volume = model_params * 2  # BF16 gradients
        dp_link = "InfiniBand"
        return {
            'tp': {'volume_bytes': tp_volume, 'link': tp_link,
                   'frequency': 'per layer per micro-batch'},
            'pp': {'volume_bytes': pp_volume, 'link': pp_link,
                   'frequency': 'per micro-batch per boundary'},
            'dp': {'volume_bytes': dp_volume, 'link': dp_link,
                   'frequency': 'once per mini-batch (overlapped)'},
        }
```
Communication Volume by Parallelism Dimension (405B, h=13824, s=4096, b=1)
| Dimension | Volume Per Event | Frequency | Link Type | Latency Sensitivity |
|---|---|---|---|---|
| TP (all-reduce) | ~214 MB | 2x per layer per MB | NVLink | Very high |
| PP (point-to-point) | ~107 MB | 1x per stage boundary | InfiniBand | Moderate |
| DP (all-reduce) | ~810 GB | 1x per mini-batch | InfiniBand | Low (overlapped) |
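The per-event volumes in the table can be approximated with a short script. This is a sketch: an all-reduce is modeled as moving twice the tensor size, so the printed values differ slightly from the table’s rounded figures.

```python
# Approximate per-event volumes for h=13824, s=4096, b=1, 405B params (BF16).
b, s, h = 1, 4096, 13824
params = 405e9

pp_transfer = b * s * h * 2      # one boundary activation tensor
tp_allreduce = 2 * pp_transfer   # ~2 passes over the data per all-reduce
dp_allreduce = params * 2        # full BF16 gradient set

print(f"TP ~{tp_allreduce / 2**20:.0f} MB, PP ~{pp_transfer / 2**20:.0f} MB, "
      f"DP ~{dp_allreduce / 1e9:.0f} GB")
```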
Expert Parallelism and MoE
For Mixture-of-Experts models like DeepSeek-V3, there is a fourth dimension: expert parallelism (EP). Experts are distributed across GPUs, and an all-to-all communication pattern routes tokens to the correct expert. This adds another layer of complexity to the parallelism strategy but is orthogonal to pipeline parallelism.
Practical Concerns: Load Balancing Pipeline Stages
For pipeline parallelism to be efficient, every stage must take approximately the same time to compute. If one stage takes twice as long, every other stage waits for it, creating a bottleneck worse than the inherent bubble.
Transformer models are relatively easy to balance because every layer has the same architecture and roughly the same compute cost. You simply assign an equal number of layers per stage: L / P layers each, for L layers and P stages. The first and last stages may have slight additional work (embedding lookup, output projection), but this is minor.
Non-uniform architectures (models with varying layer sizes, skip connections across stages, or different computation at different depths) are harder to balance. Profiling each layer’s compute time and partitioning to equalize stage times is necessary.
```python
def balance_pipeline_stages(layer_compute_times: list, num_stages: int) -> list:
    """
    Partition layers into contiguous stages with balanced compute times.
    Uses dynamic programming to minimize the maximum stage time.
    Args:
        layer_compute_times: list of per-layer compute times in ms
        num_stages: number of pipeline stages P
    Returns:
        list of (start_layer, end_layer) tuples per stage, end exclusive
    """
    n = len(layer_compute_times)
    prefix_sum = [0] * (n + 1)
    for i in range(n):
        prefix_sum[i + 1] = prefix_sum[i] + layer_compute_times[i]

    # dp[s][j] = min possible max-stage-time using s stages for first j layers
    INF = float('inf')
    dp = [[INF] * (n + 1) for _ in range(num_stages + 1)]
    choice = [[0] * (n + 1) for _ in range(num_stages + 1)]
    dp[0][0] = 0
    for s in range(1, num_stages + 1):
        for j in range(s, n + 1):
            for k in range(s - 1, j):
                stage_time = prefix_sum[j] - prefix_sum[k]
                cand = max(dp[s - 1][k], stage_time)
                if cand < dp[s][j]:
                    dp[s][j] = cand
                    choice[s][j] = k

    # Backtrack through the recorded split points to recover the partition
    stages = []
    j = n
    for s in range(num_stages, 0, -1):
        k = choice[s][j]
        stages.append((k, j))
        j = k
    stages.reverse()
    return stages
```
The first pipeline stage typically includes the token embedding layer, and the last stage includes the output projection (language model head). For large-vocabulary models (e.g., 128K tokens), the embedding and output projection can be significant: a 128K vocabulary with hidden size 8192 in BF16 is about 2 GB. This asymmetry means the first and last stages have more memory pressure and slightly different compute profiles. Some implementations split the embedding across stages or tie the embedding and output projection weights to mitigate this.
When Pipeline Parallelism Is the Wrong Choice
Pipeline parallelism is not always the answer. There are scenarios where it introduces more overhead than it solves:
Models That Fit on One Node
If a model fits within a single node’s aggregate GPU memory (e.g., a 70B model on 8 H100s with 640 GB total), tensor parallelism alone is usually better. TP within a node over NVLink has:
- Lower latency than PP’s pipeline bubble
- No idle time from fill/drain phases
- Simpler implementation and debugging
Pipeline parallelism only becomes necessary when you need to span multiple nodes, where TP’s all-reduce latency over InfiniBand becomes prohibitive.
Very Small Micro-Batch Sizes
When the global batch size is small or cannot be divided into enough micro-batches, the bubble dominates. If M is close to P, the bubble fraction (P − 1) / (P − 1 + M) approaches 50% even with 1F1B.
At small M, you are better off using fewer pipeline stages (accepting less parallelism) or using other strategies entirely.
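The 1F1B/GPipe bubble fraction is (P − 1) / (P − 1 + M); evaluating it at a few micro-batch counts makes the near-50% figure concrete:

```python
def bubble_fraction(P: int, M: int) -> float:
    """GPipe/1F1B bubble fraction: (P - 1) / (P - 1 + M)."""
    return (P - 1) / (P - 1 + M)

print(bubble_fraction(P=8, M=8))    # M == P: ~0.467, nearly half the schedule idle
print(bubble_fraction(P=8, M=32))   # M == 4P: ~0.179
print(bubble_fraction(P=8, M=128))  # M >> P: ~0.052
```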
Latency-Sensitive Inference
Pipeline parallelism is designed for throughput, not latency. For a single request:
- The request must pass through all stages sequentially
- Latency = sum of all P stage compute times + (P − 1) inter-stage activation transfers
- No opportunity for micro-batching (only one sequence)
For inference, tensor parallelism is preferred because it reduces per-request latency by parallelizing each layer’s computation. Pipeline parallelism can help inference throughput when processing many concurrent requests (similar to micro-batching), but for latency-critical single-request scenarios, it adds overhead.
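A toy latency model illustrates the difference (function names and all numbers here are illustrative assumptions, not measurements):

```python
def pp_request_latency(stage_ms: list, hop_ms: float) -> float:
    # A single request visits every stage in sequence: all the compute
    # plus one activation transfer per stage boundary.
    return sum(stage_ms) + (len(stage_ms) - 1) * hop_ms

def tp_request_latency(total_compute_ms: float, tp: int,
                       num_layers: int, allreduce_ms: float) -> float:
    # TP divides each layer's compute by the TP degree but pays
    # two all-reduces per layer (attention and MLP projections).
    return total_compute_ms / tp + 2 * num_layers * allreduce_ms

# Toy numbers: 4 stages of 10 ms each vs. TP=8 over NVLink, 80 layers
print(pp_request_latency([10.0, 10.0, 10.0, 10.0], hop_ms=1.0))          # 43.0 ms
print(tp_request_latency(40.0, tp=8, num_layers=80, allreduce_ms=0.02))  # 8.2 ms
```

PP pays the full compute time serially; TP shrinks it by the parallel degree, which is why TP wins for single-request latency.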
Heterogeneous Hardware
Pipeline parallelism assumes all stages take equal time. If your cluster has mixed GPU generations (e.g., some A100s and some H100s), the slowest GPU becomes the bottleneck for every stage. DP is more forgiving of heterogeneous hardware because slower workers simply process smaller batch portions.
When to Use Each Parallelism Strategy
| Scenario | Best Strategy | Rationale | Recommendation |
|---|---|---|---|
| 70B on 8xH100 (1 node) | TP=8 | No bubble, NVLink fast enough | TP only |
| 405B on 64 GPUs (8 nodes) | TP=8, PP=8 | Need PP for inter-node | TP + PP |
| 405B on 1024 GPUs | TP=8, PP=8, DP=16 | Full 3D parallelism | TP + PP + DP |
| 7B model, many GPUs | DP only | Model fits on 1 GPU, PP bubble wasted | DP |
| Single-request inference | TP | PP adds latency, no micro-batching | TP |
| High-throughput inference | TP + PP | PP helps throughput with batched requests | TP + PP |
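The rows above obey a simple constraint: the product of the parallel degrees must equal the GPU count. A minimal sanity check (the helper name is illustrative):

```python
def check_parallel_layout(tp: int, pp: int, dp: int, world_size: int) -> bool:
    # TP x PP x DP must exactly tile the cluster.
    return tp * pp * dp == world_size

assert check_parallel_layout(tp=8, pp=1, dp=1, world_size=8)      # 70B, one node
assert check_parallel_layout(tp=8, pp=8, dp=1, world_size=64)     # 405B, 8 nodes
assert check_parallel_layout(tp=8, pp=8, dp=16, world_size=1024)  # full 3D
```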
Summary of Pipeline Schedules
The evolution of pipeline parallelism schedules represents steady progress toward eliminating the bubble:
Pipeline Schedule Evolution
| Schedule | Year | Bubble Formula | Peak Activations | Key Innovation |
|---|---|---|---|---|
| Naive | N/A | (P-1)/P | 1 | No micro-batching |
| GPipe | 2019 | (P-1)/(P-1+M) | M | Micro-batch pipelining |
| 1F1B (PipeDream-Flush) | 2019 | (P-1)/(P-1+M) | P | Interleaved F/B, less memory |
| Interleaved (Megatron) | 2021 | (P-1)/(M*V) | P | Virtual stages, smaller bubble |
| DualPipe (DeepSeek-V3) | 2024 | ~(P-1)/(2M) | ~2P | Bidirectional, overlap F+B |
```python
def compare_schedules(P: int, M: int, V: int = 1) -> dict:
    """
    Compare bubble fractions and memory requirements
    across pipeline parallelism schedules.

    Args:
        P: number of pipeline stages
        M: number of micro-batches
        V: number of virtual stages (for interleaved schedule)
    """
    return {
        'naive': {
            'bubble': (P - 1) / P,
            'peak_activations': 1,
            'description': 'Single batch, no pipelining'
        },
        'gpipe': {
            'bubble': (P - 1) / (P - 1 + M),
            'peak_activations': M,
            'description': 'All forwards then all backwards'
        },
        '1f1b': {
            'bubble': (P - 1) / (P - 1 + M),
            'peak_activations': P,
            'description': 'Alternating forward and backward'
        },
        'interleaved': {
            'bubble': (P - 1) / (M * V),
            'peak_activations': P,
            'communication_factor': V,
            'description': f'Virtual stages V={V}, more comm but less bubble'
        },
        'dualpipe': {
            'bubble': (P - 1) / (2 * M),
            'peak_activations': 2 * P,  # approximate
            'description': 'Bidirectional, forward+backward overlap'
        }
    }

# Example: P=8 stages, M=32 micro-batches, V=4 virtual stages
results = compare_schedules(P=8, M=32, V=4)
# naive: 87.5% bubble
# gpipe: 17.9% bubble, 32 activations
# 1f1b: 17.9% bubble, 8 activations
# interleaved: 5.5% bubble, 8 activations, 4x comm
# dualpipe: 10.9% bubble (before overlap hiding), ~16 activations
```
[Chart: effective throughput relative to ideal (% of ideal) for each schedule at P=8, M=32]
Implementation Reference: Frameworks and Their Schedules
Major distributed training frameworks implement different subsets of these schedules:
Megatron-LM (NVIDIA): Supports 1F1B and interleaved schedules. The de facto standard for large-scale LLM training. Tight integration with TP and DP.
DeepSpeed (Microsoft): Supports GPipe-style and 1F1B schedules via its pipeline engine. Part of the ZeRO ecosystem for memory optimization.
FairScale / PyTorch: FairScale provides a GPipe-style Pipe module; PyTorch's native torch.distributed.pipelining supports GPipe and 1F1B schedules. Both compose with FSDP, which handles the data-parallel dimension.
DeepSeek’s training infrastructure: Custom DualPipe implementation. Not yet available as a general-purpose library (as of early 2025).
```python
# Megatron-LM example configuration for 3D parallelism
megatron_config = {
    'tensor_model_parallel_size': 8,            # TP = 8 (intra-node)
    'pipeline_model_parallel_size': 8,          # PP = 8 (inter-node)
    'data_parallel_size': 16,                   # DP = 16
    'virtual_pipeline_model_parallel_size': 4,  # V = 4 (interleaved)
    'num_micro_batches': 32,                    # M = 32
    'global_batch_size': 2048,                  # split across DP replicas
    'micro_batch_size': 4,                      # per micro-batch
    'sequence_length': 4096,
    'num_layers': 126,                          # Llama 3 405B
    'hidden_size': 16384,                       # Llama 3 405B hidden dimension
    'num_attention_heads': 128,
    'fp16': False,
    'bf16': True,
    'recompute_activations': True,              # activation checkpointing
    'recompute_granularity': 'selective',       # only recompute attention
}
```
Conclusion
Pipeline parallelism exists to solve a specific problem: training models that exceed single-node memory while avoiding the bandwidth penalty of cross-node tensor parallelism. The core challenge is the pipeline bubble — idle GPU time during pipeline fill and drain.
The evolution from naive pipelining through GPipe, 1F1B, interleaved stages, and DualPipe represents a progression from 87.5% idle time to under 5% effective idle time for typical configurations. Each advance either shrinks the bubble (more micro-batches, virtual stages, bidirectional flow) or reduces the memory cost of shrinking it (1F1B’s bounded activation storage, activation checkpointing).
The practical recipe for large-scale training has converged: tensor parallelism within a node (where NVLink provides the bandwidth for frequent all-reduce), pipeline parallelism across nodes (where only activation tensors cross the slower InfiniBand links), and data parallelism across pipeline replicas (where gradient all-reduce can overlap with backward computation). Understanding the trade-offs between bubble fraction, activation memory, and communication volume is essential for configuring these systems efficiently.
The bubble is the tax you pay for splitting a model across devices. The history of pipeline parallelism is the history of minimizing that tax.