Part of Series: Inference Optimization Timeline (43 of 60)

Pipeline Parallelism: From GPipe to DualPipe -- Eliminating the Bubble

Llama 3 405B requires 810 GB just for weights in BF16. An H100 has 80 GB. The model physically cannot fit on one GPU, so you must distribute it across multiple devices. Tensor parallelism splits every operation across GPUs with an all-reduce per layer — fast within a node over NVLink, catastrophic across nodes over InfiniBand because 160 all-reduces per forward pass drown the network. Pipeline parallelism splits the model by layers: Device 0 gets layers 0-9, Device 1 gets layers 10-19, and only activation tensors cross device boundaries. Communication volume drops by 20x, but you introduce pipeline bubbles where GPUs sit idle waiting for the next micro-batch. The best schedules — PipeDream 1F1B, Megatron interleaved, DeepSeek DualPipe — minimize bubbles to under 10%. This post covers the algorithms, the bubble math, and when PP beats TP.

There are three fundamental strategies for distributing a model across devices:

  1. Data parallelism (DP) — replicate the full model on every device, split the data. This does not help when the model itself exceeds one device’s memory.
  2. Tensor parallelism (TP) — split individual operations (matrix multiplications, attention heads) across devices. Every operation requires an all-reduce across participating GPUs, so it demands extremely high-bandwidth interconnect. Within a single node using NVLink (900 GB/s bidirectional on H100), TP works well. Across nodes over InfiniBand (50-100 GB/s per link), the latency and bandwidth penalty is severe.
  3. Pipeline parallelism (PP) — split the model by layers. Device 0 holds layers 0-9, device 1 holds layers 10-19, and so on. Only activation tensors cross device boundaries, and they cross in one direction at a time. The communication volume is dramatically smaller than TP’s all-reduce pattern.
ℹ️ Why Not Just Use Tensor Parallelism Everywhere?

Tensor parallelism requires two all-reduce operations per transformer layer (one for the attention projection, one for the MLP). For a model with hidden dimension h = 8192 and sequence length s = 4096, each all-reduce communicates O(s × h) bytes. Over InfiniBand at 50 GB/s effective bandwidth, this adds 100-200 microseconds of latency per layer. With 80+ layers, cross-node TP can spend more time communicating than computing. Pipeline parallelism replaces this with a single point-to-point transfer of activations between adjacent stages — orders of magnitude less communication.

Pipeline parallelism is the natural choice for inter-node distribution. You assign consecutive layers to each node, and the only data that crosses the (slow) inter-node network is the activation tensor at layer boundaries. For a transformer with hidden size h and micro-batch size b with sequence length s, the activation tensor at each boundary is b × s × h elements. Compare this to TP, which would require an all-reduce of the same size tensor twice per layer across all nodes.
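To put numbers on this, a quick back-of-envelope sketch (hypothetical helper names; it counts each all-reduce as one tensor's worth of traffic, ignoring the roughly 2x factor of ring all-reduce):

```python
def tp_traffic_bytes(layers: int, b: int, s: int, h: int, elem_bytes: int = 2) -> int:
    """TP: 2 all-reduces per layer, each on a b*s*h activation tensor (BF16)."""
    return layers * 2 * b * s * h * elem_bytes

def pp_traffic_bytes(stages: int, b: int, s: int, h: int, elem_bytes: int = 2) -> int:
    """PP: one point-to-point boundary transfer per adjacent stage pair."""
    return (stages - 1) * b * s * h * elem_bytes

tp = tp_traffic_bytes(layers=80, b=1, s=4096, h=8192)
pp = pp_traffic_bytes(stages=8, b=1, s=4096, h=8192)
print(tp / pp)   # 160 transfers vs 7: roughly 23x less traffic for PP
```

The ratio is simply 160 all-reduces against 7 point-to-point sends; the per-tensor size cancels out.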

📊 Communication Volume Per Training Step (80-layer Transformer, h=8192)

| Strategy | Nodes | Per-Layer Comm | Total Comm | Interconnect Need |
|---|---|---|---|---|
| TP across 8 nodes | 8 | 2 all-reduces of b*s*h | 160 all-reduces | NVLink-class |
| PP across 8 nodes | 8 | 1 point-to-point of b*s*h | 7 point-to-point | InfiniBand OK |
| TP intra-node + PP inter-node | 8 | 2 all-reduces (local) + 1 p2p | Hybrid | Optimal |

But pipeline parallelism has a fundamental problem: the pipeline bubble.

The Bubble Problem

Consider a naive pipeline with P=4P = 4 stages processing a single mini-batch. Stage 0 computes the forward pass for layers 0-19, then sends the activation to stage 1. Stage 1 computes layers 20-39, sends to stage 2, and so on. During this entire forward cascade, stages 1, 2, and 3 are idle while waiting for their input. Then during the backward pass, stages cascade in reverse. The result is that each stage is active for only a small fraction of the total time.

Naive Pipeline (P=4, single batch):

Time -->  |  1  |  2  |  3  |  4  |  5  |  6  |  7  |  8  |
Stage 0:  | F0  |     |     |     |     |     |     | B0  |
Stage 1:  |     | F0  |     |     |     |     | B0  |     |
Stage 2:  |     |     | F0  |     |     | B0  |     |     |
Stage 3:  |     |     |     | F0  | B0  |     |     |     |

F0 = forward micro-batch 0, B0 = backward micro-batch 0
Empty cells = GPU sitting idle (the "bubble")

The idle fraction in this naive schedule is:

Bubble fraction = (P - 1) / P

For P = 8 pipeline stages, this means 7/8 = 87.5% of GPU time is wasted. You are paying for 8 GPUs but getting the throughput of slightly more than 1. This is unacceptable.
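The formula falls out of counting slots: the timeline is 2P slots (a P-slot forward cascade plus a P-slot backward cascade), and each stage is busy for exactly 2 of them. A tiny sketch (hypothetical helper name) confirms:

```python
def naive_bubble_fraction(P: int) -> float:
    """Idle fraction of the naive schedule: each of the P stages is busy
    for 2 slots (one forward, one backward) out of a 2P-slot timeline."""
    total_slots = P * 2 * P        # P stages x 2P time slots
    busy_slots = P * 2             # each stage: 1 forward + 1 backward
    return (total_slots - busy_slots) / total_slots

# Matches (P - 1) / P for every stage count:
for P in (2, 4, 8, 16, 32):
    assert abs(naive_bubble_fraction(P) - (P - 1) / P) < 1e-12
```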

Naive Pipeline Bubble Fraction by Number of Stages

| Stages (P) | Idle fraction |
|---|---|
| P=2 | 50% |
| P=4 | 75% |
| P=8 | 87.5% |
| P=16 | 93.75% |
| P=32 | 96.88% |

The entire history of pipeline parallelism research is, in essence, the quest to shrink this bubble.

GPipe: Micro-Batching to Shrink the Bubble

GPipe (Huang et al., 2019) introduced the foundational insight: split the mini-batch into M micro-batches and pipeline them through the stages. Instead of waiting for the entire forward pass to complete before starting the backward pass, you feed micro-batches into the pipeline one after another. While stage 0 is processing micro-batch 1, stage 1 is processing micro-batch 0 from the previous time step.

GPipe Schedule (P=4, M=8 micro-batches):

Time -->  | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | ... | 19 |
Stage 0:  | F0 | F1 | F2 | F3 | F4 | F5 | F6 | F7 |    |    |    |     | B0 |
Stage 1:  |    | F0 | F1 | F2 | F3 | F4 | F5 | F6 | F7 |    |    |     |    |
Stage 2:  |    |    | F0 | F1 | F2 | F3 | F4 | F5 | F6 | F7 |    |     |    |
Stage 3:  |    |    |    | F0 | F1 | F2 | F3 | F4 | F5 | F6 | F7 | B7  |    |

All forward passes complete, then all backward passes run.
Bubble is only the fill and drain phases: (P-1) time slots at each end.

The key schedule is: all M micro-batches execute their forward passes in pipeline fashion, then all M micro-batches execute their backward passes. Gradients are accumulated across micro-batches and the optimizer step happens once per mini-batch.

The bubble fraction becomes:

Bubble fraction (GPipe) = (P - 1) / (P - 1 + M)

This is a dramatic improvement. With P = 8 stages and M = 32 micro-batches:

Bubble = 7 / (7 + 32) = 7/39 ≈ 17.9%

Compare this to 87.5% with the naive approach.
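The same slot-counting argument gives the GPipe formula: the forward phase takes M + (P - 1) slots (M micro-batches plus the pipeline fill), the backward phase takes another M + (P - 1), and each stage does useful work in exactly 2M of those slots. A small sketch under the idealized equal-cost-per-slot assumption (hypothetical helper name):

```python
def gpipe_bubble_fraction(P: int, M: int) -> float:
    """(P-1)/(P-1+M), derived by counting busy vs total slots per stage."""
    total_slots = 2 * (M + P - 1)   # forward phase + backward phase
    busy_slots = 2 * M              # M forwards + M backwards per stage
    return (total_slots - busy_slots) / total_slots

assert abs(gpipe_bubble_fraction(8, 32) - 7 / 39) < 1e-12
print(round(gpipe_bubble_fraction(8, 32) * 100, 1))  # 17.9
```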

📊 GPipe Bubble Fraction: Effect of Micro-batch Count M

| Micro-batches (M) | P=4 | P=8 | P=16 | P=32 |
|---|---|---|---|---|
| M=4 | 42.9% | 63.6% | 78.9% | 88.6% |
| M=8 | 27.3% | 46.7% | 65.2% | 79.5% |
| M=16 | 15.8% | 30.4% | 48.4% | 66.0% |
| M=32 | 8.6% | 17.9% | 32.6% | 49.2% |
| M=64 | 4.5% | 9.9% | 19.0% | 32.6% |
| M=128 | 2.3% | 5.2% | 10.5% | 19.5% |
class GPipeSchedule:
    """
    GPipe micro-batch pipeline schedule.

    All forwards run first (fill phase), then all backwards (drain phase).
    Gradient accumulation happens across micro-batches.
    """
    def __init__(self, stages: int, num_microbatches: int):
        self.P = stages
        self.M = num_microbatches

    def bubble_fraction(self) -> float:
        return (self.P - 1) / (self.P - 1 + self.M)

    def execute(self, mini_batch):
        micro_batches = split(mini_batch, self.M)

        # Phase 1: All forward passes (pipeline fill + steady state)
        # Each micro-batch flows through all P stages
        activations = {}  # (stage, mb_idx) -> activation tensor
        for mb_idx in range(self.M):
            for stage in range(self.P):
                input_act = micro_batches[mb_idx] if stage == 0 \
                           else activations[(stage - 1, mb_idx)]
                activations[(stage, mb_idx)] = self.forward(stage, input_act)

        # Phase 2: All backward passes (pipeline drain)
        # Must store ALL activations from forward phase -- memory intensive
        accumulated_grads = [zeros_like(p) for p in self.parameters()]
        for mb_idx in reversed(range(self.M)):
            for stage in reversed(range(self.P)):
                grads = self.backward(stage, activations[(stage, mb_idx)])
                accumulate(accumulated_grads, grads)

        # Phase 3: Single optimizer step
        self.optimizer_step(accumulated_grads)
⚠️ GPipe Memory Problem

GPipe must store activations for ALL M micro-batches across ALL P stages simultaneously during the forward phase, because backward passes do not begin until all forward passes complete. For M = 32 micro-batches with hidden size h = 8192 and sequence length s = 4096 in BF16, each activation tensor is b × s × h × 2 bytes. Even with micro-batch size b = 1, that is 64 MB per activation. Storing 32 of these per stage means 2 GB of activation memory per stage — and this is just one layer boundary. The actual activation memory across all layers in a stage is much larger.
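The arithmetic in the callout checks out in two lines (BF16 boundary activations only):

```python
# b * s * h * 2 bytes for one BF16 boundary activation (b=1, s=4096, h=8192)
activation_bytes = 1 * 4096 * 8192 * 2
print(activation_bytes / 2**20)        # 64.0 MiB per boundary activation
print(32 * activation_bytes / 2**30)   # 2.0 GiB for M=32 per stage
```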

GPipe’s Gradient Accumulation Semantics

A subtle but important property of GPipe: because all forward passes complete before any backward pass begins, the gradient computation is mathematically identical to processing the full mini-batch at once (up to floating-point ordering differences). Each micro-batch’s forward pass uses the same model weights. The accumulated gradients across micro-batches equal the gradient of the full mini-batch loss.

This is a clean semantic guarantee. The training dynamics are identical to single-device training with the same effective batch size. No weight staleness, no approximation.
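The equivalence is easy to verify numerically. A toy example (a linear model with a summed squared-error loss; all names hypothetical): the sum of per-micro-batch gradients equals the full-batch gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 16))     # mini-batch of 32 examples
y = rng.normal(size=32)
w = rng.normal(size=16)

def grad(Xb, yb, w):
    """Gradient of the summed squared error ||Xb @ w - yb||^2 wrt w."""
    return 2 * Xb.T @ (Xb @ w - yb)

full_batch = grad(X, y, w)
# Four "micro-batches" of 8, all using the same weights w:
accumulated = sum(grad(X[i:i + 8], y[i:i + 8], w) for i in range(0, 32, 8))
assert np.allclose(full_batch, accumulated)
```

With a mean-reduced loss you would average (or weight) the micro-batch gradients instead of summing, but the identity is the same.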

PipeDream 1F1B: Interleaving Forward and Backward

PipeDream (Narayanan et al., 2019) and its successor PipeDream-Flush (also called the “1F1B schedule”) address GPipe’s memory problem by interleaving forward and backward passes. Instead of completing all forwards before starting any backward, each stage alternates: one forward, one backward, one forward, one backward.

1F1B Schedule (P=4, M=8 micro-batches):

Time -->  | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 | ...
Stage 0:  | F0 | F1 | F2 | F3 |    |    |    | B0 | F4 | B1 | F5 | B2 | ...
Stage 1:  |    | F0 | F1 | F2 |    |    | B0 | F3 | B1 | F4 | B2 | F5 | ...
Stage 2:  |    |    | F0 | F1 |    | B0 | F2 | B1 | F3 | B2 | F4 | B3 | ...
Stage 3:  |    |    |    | F0 | B0 | F1 | B1 | F2 | B2 | F3 | B3 | F4 | ...

Backward passes start at the last stage and propagate back toward stage 0.
Warmup: each stage does (P - stage_id - 1) extra forwards before entering steady state.
Steady state: strictly alternating 1 forward + 1 backward.
Cooldown: remaining backwards drain the pipeline.

The schedule has three phases:

  1. Warmup phase: Stage i performs (P - 1 - i) forward passes to fill the pipeline. Stage 0 does P - 1 forwards, stage 1 does P - 2, and so on. The last stage does 0 warmup forwards.
  2. Steady-state phase: Each stage alternates exactly one forward pass and one backward pass (hence “1F1B”). This is the bulk of the computation.
  3. Cooldown phase: Remaining backward passes drain the pipeline.

The bubble fraction is the same as GPipe: (P - 1) / (P - 1 + M). The critical difference is in memory.

The Memory Advantage of 1F1B

In GPipe, all M forward passes complete before any backward pass starts. This means every stage must store activations for all M micro-batches simultaneously. The peak activation memory per stage scales as O(M).

In 1F1B, once a stage enters steady state, it performs a backward pass immediately after each forward pass. The backward pass for micro-batch k frees the activations for micro-batch k. At any point during steady state, a stage has at most P micro-batches’ activations in flight — one from each of the stages ahead of it in the warmup, plus the current one.

Peak activations (GPipe) = M micro-batches
Peak activations (1F1B) = P micro-batches

Since typically M ≫ P (you want many micro-batches to reduce the bubble), this is a major memory saving.

Activation Memory: GPipe vs 1F1B (P=4 stages)

Peak number of micro-batch activations stored simultaneously per stage:

  • GPipe: M=32 micro-batch activations stored (32 x activation_size per stage); all forwards complete before any backward begins
  • 1F1B: P=4 micro-batch activations stored (4 x activation_size per stage); backward frees memory as soon as its forward finishes
  • Memory savings with 1F1B: 8x reduction (32/4) in activation memory, enabling larger micro-batches or larger models
class OneFOneBSchedule:
    """
    1F1B (One Forward One Backward) pipeline schedule.

    Key insight: interleaving forward and backward passes limits
    the number of in-flight micro-batches to P (not M).
    """
    def __init__(self, stages: int, num_microbatches: int):
        self.P = stages
        self.M = num_microbatches

    def peak_activations(self) -> int:
        """Max micro-batch activations stored at any stage."""
        return self.P  # Not M! This is the key advantage.

    def schedule_for_stage(self, stage_id: int):
        """Generate the action sequence for a single stage."""
        warmup_forwards = self.P - 1 - stage_id
        steady_state_pairs = self.M - warmup_forwards
        cooldown_backwards = warmup_forwards

        actions = []

        # Phase 1: Warmup -- only forwards
        for i in range(warmup_forwards):
            actions.append(('forward', i))

        # Phase 2: Steady state -- alternating 1F1B
        fwd_idx = warmup_forwards
        bwd_idx = 0
        for _ in range(steady_state_pairs):
            actions.append(('forward', fwd_idx))
            actions.append(('backward', bwd_idx))
            fwd_idx += 1
            bwd_idx += 1

        # Phase 3: Cooldown -- only backwards
        for i in range(cooldown_backwards):
            actions.append(('backward', bwd_idx + i))

        return actions
📊 Memory Comparison: GPipe vs 1F1B

| Config | GPipe Activations | 1F1B Activations | Memory Reduction |
|---|---|---|---|
| P=4, M=16 | 16 per stage | 4 per stage | 4x |
| P=4, M=32 | 32 per stage | 4 per stage | 8x |
| P=8, M=32 | 32 per stage | 8 per stage | 4x |
| P=8, M=64 | 64 per stage | 8 per stage | 8x |
| P=16, M=128 | 128 per stage | 16 per stage | 8x |
💡 Weight Staleness in Original PipeDream

The original PipeDream paper proposed asynchronous weight updates — each stage would apply its gradient immediately without waiting for other stages. This introduces weight staleness: forward and backward passes for the same micro-batch may use different weight versions. PipeDream-Flush (and Megatron-LM’s 1F1B implementation) eliminated this by using synchronous gradient accumulation, exactly like GPipe. The 1F1B schedule as used in practice today does NOT have weight staleness.

Interleaved Pipeline Stages (Megatron-LM)

Narayanan et al. (2021) in the Megatron-LM paper introduced interleaved pipeline stages, also known as “virtual pipeline stages.” The idea: instead of assigning each GPU a single contiguous block of layers, assign each GPU V non-consecutive blocks (“virtual stages”).

For example, with P = 4 GPUs, L = 32 layers, and V = 2 virtual stages per GPU:

Standard pipeline (V=1):
  GPU 0: layers  0- 7   (stage 0)
  GPU 1: layers  8-15   (stage 1)
  GPU 2: layers 16-23   (stage 2)
  GPU 3: layers 24-31   (stage 3)

Interleaved pipeline (V=2):
  GPU 0: layers  0- 3   (virtual stage 0) + layers 16-19 (virtual stage 4)
  GPU 1: layers  4- 7   (virtual stage 1) + layers 20-23 (virtual stage 5)
  GPU 2: layers  8-11   (virtual stage 2) + layers 24-27 (virtual stage 6)
  GPU 3: layers 12-15   (virtual stage 3) + layers 28-31 (virtual stage 7)

The total number of virtual pipeline stages is P × V. The pipeline sees P × V stages, but each “stage” is smaller (fewer layers). A micro-batch passes through all P × V virtual stages, bouncing between GPUs multiple times. The critical insight: each virtual stage takes 1/V the time to compute, so the pipeline fill and drain time is divided by V.

The bubble fraction becomes:

Bubble fraction (interleaved) = (P - 1) / (M × V)

Note that V appears in the denominator multiplicatively. With P = 8, M = 32, V = 4:

Bubble = 7 / (32 × 4) = 7/128 ≈ 5.5%

Compare this to GPipe’s 7/39 ≈ 17.9% with the same M.

Bubble Fraction: Standard vs Interleaved Pipeline (P=8)

| Schedule | Formula | Bubble |
|---|---|---|
| GPipe, M=32 | (P-1)/(P-1+M) | 17.9% |
| Interleaved V=2, M=32 | (P-1)/(M*V) | 10.9% |
| Interleaved V=4, M=32 | (P-1)/(M*V) | 5.5% |
| Interleaved V=8, M=32 | (P-1)/(M*V) | 2.7% |
| Interleaved V=4, M=64 | (P-1)/(M*V) | 2.7% |

The Communication Cost of Interleaving

There is no free lunch. With V virtual stages per GPU, a micro-batch must transfer activations between GPUs P × V - 1 times instead of P - 1 times. Each transfer is the same size as before (the activation tensor is determined by the hidden dimension, which is unchanged), but there are V times as many of them.

More precisely, a standard pipeline performs P - 1 inter-device activation transfers per micro-batch per direction; an interleaved pipeline with V virtual stages performs P × V - 1. Since the volume per transfer is identical, total communication increases by roughly a factor of V.

This trade-off works well when:

  • The network bandwidth between GPUs is high relative to compute time per stage
  • The bubble reduction from interleaving outweighs the communication overhead
  • You use NVLink within a node (where bandwidth is abundant)

It works poorly when:

  • Stages are already communication-bound
  • Inter-node bandwidth is limited (the extra transfers saturate the link)
class InterleavedPipelineSchedule:
    """
    Interleaved (virtual) pipeline stages as in Megatron-LM.

    Each GPU holds V non-consecutive chunks of layers.
    Bubble fraction: (P-1) / (M*V) instead of (P-1) / (P-1+M).
    Communication increases by factor V.
    """
    def __init__(self, num_gpus: int, virtual_stages: int, num_microbatches: int):
        self.P = num_gpus
        self.V = virtual_stages
        self.M = num_microbatches
        self.total_virtual_stages = num_gpus * virtual_stages

    def bubble_fraction(self) -> float:
        return (self.P - 1) / (self.M * self.V)

    def communication_overhead_ratio(self) -> float:
        """Ratio of communication vs standard pipeline."""
        standard_transfers = self.P - 1
        interleaved_transfers = self.P * self.V - 1
        return interleaved_transfers / standard_transfers

    def layer_assignment(self, gpu_id: int, total_layers: int) -> list:
        """Which layers does this GPU hold?"""
        layers_per_virtual_stage = total_layers // self.total_virtual_stages
        assigned = []
        for v in range(self.V):
            virtual_stage_id = gpu_id + v * self.P
            start = virtual_stage_id * layers_per_virtual_stage
            end = start + layers_per_virtual_stage
            assigned.extend(range(start, end))
        return assigned
📊 Interleaved Pipeline: Bubble vs Communication Trade-off (P=8, M=32)

| Virtual Stages (V) | Bubble % | Comm Overhead vs V=1 | Net Throughput Gain |
|---|---|---|---|
| V=1 (standard) | 17.9% | 1.0x | Baseline |
| V=2 | 10.9% | 2.1x | +5-8% |
| V=4 | 5.5% | 4.4x | +8-12% |
| V=8 | 2.7% | 9.0x | +5-10% |
| V=16 | 1.4% | 18.1x | Comm-bound |
Optimal V in Practice

Megatron-LM experiments show V = 2 to V = 4 is usually optimal for inter-node pipeline parallelism. Beyond V = 4, the communication overhead from extra activation transfers typically outweighs the bubble reduction. For intra-node pipelines over NVLink, higher V can be beneficial because NVLink bandwidth is abundant.

DualPipe: Bidirectional Pipeline (DeepSeek-V3)

DeepSeek-V3 (DeepSeek-AI, 2024) introduced DualPipe, a bidirectional pipeline schedule that achieves near-zero bubble overhead. The core idea: send micro-batches from both ends of the pipeline simultaneously. Half the micro-batches flow forward (stage 0 to stage P-1) and the other half flow backward (stage P-1 to stage 0).

The insight is that when a GPU is executing the forward pass for a micro-batch flowing left-to-right, it can simultaneously execute the backward pass for a micro-batch flowing right-to-left. The forward computation of one micro-batch overlaps with the backward computation of another — they use different parts of the compute resources (forward is mostly matrix multiplications and activations; backward includes gradient computation and weight gradient accumulation).

DualPipe Schedule (P=4, simplified):

Left-to-right micro-batches: L0, L1, L2, L3
Right-to-left micro-batches: R0, R1, R2, R3

Time -->   |   1   |   2   |   3   |   4   |   5   |   6   |
Stage 0:   | F(L0) | F(L1) |F(L2)+B(R0)|F(L3)+B(R1)| B(R2) | B(R3) |
Stage 1:   |       | F(L0) |F(L1)+B(R0)|F(L2)+B(R1)| ...   |       |
Stage 2:   |       | F(R0) |F(R1)+B(L0)|F(R2)+B(L1)| ...   |       |
Stage 3:   | F(R0) | F(R1) |F(R2)+B(L0)|F(R3)+B(L1)| B(L2) | B(L3) |

In steady state, each GPU runs forward + backward simultaneously.
The bubble is only the initial fill and final drain.

The key innovation: in steady state, every GPU is executing both a forward pass and a backward pass at every time step. The forward pass for one direction’s micro-batch overlaps with the backward pass for the other direction’s micro-batch. This means the bubble is only in the pipeline fill and drain, which is minimal.

DualPipe achieves an effective bubble fraction that approaches:

Bubble fraction (DualPipe) ≈ (P - 1) / (2M)

The factor of 2 improvement over standard 1F1B comes from the bidirectional overlap. For P = 8, M = 32:

Bubble ≈ 7/64 ≈ 10.9%

But in practice, the overlap between forward and backward computation can hide even more of the bubble. DeepSeek reports near-zero effective bubble with sufficient micro-batches.

Bubble Comparison Across Pipeline Schedules (P=8, M=32)

| Schedule | Formula | Bubble |
|---|---|---|
| Naive (no micro-batching) | (P-1)/P | 87.5% |
| GPipe | (P-1)/(P-1+M) | 17.9% |
| 1F1B | same bubble as GPipe, less memory | 17.9% |
| Interleaved V=4 | (P-1)/(M*V) | 5.5% |
| DualPipe | (P-1)/(2M), further hidden by overlap | 10.9% |
class DualPipeSchedule:
    """
    DualPipe bidirectional pipeline schedule (DeepSeek-V3).

    Micro-batches flow from both ends simultaneously.
    Forward of left-to-right overlaps with backward of right-to-left.
    """
    def __init__(self, stages: int, num_microbatches: int):
        self.P = stages
        self.M = num_microbatches
        # Split micro-batches into two streams
        self.M_left = num_microbatches // 2   # flow left-to-right
        self.M_right = num_microbatches // 2  # flow right-to-left

    def bubble_fraction(self) -> float:
        """Theoretical bubble before overlap hiding."""
        return (self.P - 1) / (2 * self.M)

    def effective_bubble(self, overlap_efficiency: float = 0.8) -> float:
        """
        Effective bubble after forward/backward overlap.
        overlap_efficiency: fraction of bubble hidden by overlapping
        forward and backward of different micro-batches.
        """
        theoretical = self.bubble_fraction()
        return theoretical * (1 - overlap_efficiency)

    def schedule_for_stage(self, stage_id: int):
        """
        Each stage handles two streams:
        - Left stream: micro-batches 0..M_left-1 flowing stage 0 -> P-1
        - Right stream: micro-batches 0..M_right-1 flowing stage P-1 -> 0

        In steady state, forward(left) overlaps with backward(right)
        and forward(right) overlaps with backward(left).
        """
        actions = []

        # Warmup: fill pipeline from both ends
        for t in range(stage_id):
            actions.append(('idle', t))

        # Steady state: overlapped forward + backward
        for mb in range(self.M_left):
            actions.append(('forward_left + backward_right', mb))

        # Cooldown: drain from both ends
        for t in range(self.P - 1 - stage_id):
            actions.append(('drain', t))

        return actions
ℹ️ DualPipe Requirements

DualPipe requires that forward and backward computation can overlap efficiently on the same GPU. This works well for transformer models where forward is dominated by matrix multiplications and backward by gradient matrix multiplications — they can overlap on different CUDA streams if the GPU has sufficient compute units and memory bandwidth. Models with heavy sequential dependencies or irregular computation patterns may not overlap as well.

Memory Analysis: The Activation Storage Problem

Every pipeline schedule must store activations (intermediate results from the forward pass) until the corresponding backward pass consumes them. The number of stored activations determines peak memory usage, which directly limits model size and micro-batch size.

Activation Memory Per Micro-Batch

For a transformer layer with hidden dimension h, sequence length s, and micro-batch size b, the stored activations include:

  • Input to each layer: b × s × h elements
  • Attention intermediate states: b × n_heads × s × s elements (for full attention)
  • MLP intermediate activations: b × s × 4h elements (for standard FFN)
  • Layer norm inputs: b × s × h elements

In BF16 (2 bytes per element), a single transformer layer’s activations are approximately:

Act per layer ≈ b × s × (10h + n_heads × s) × 2 bytes

For typical values (b = 1, s = 4096, h = 8192, n_heads = 64):

Act per layer ≈ 1 × 4096 × (81920 + 262144) × 2 ≈ 2.8 GB

If a pipeline stage has 10 layers, the activation memory per micro-batch per stage is roughly 28 GB. Now multiply by the number of in-flight micro-batches.
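The per-layer estimate above, written out as a hypothetical helper:

```python
def activation_bytes_per_layer(b: int, s: int, h: int, n_heads: int,
                               bytes_per_el: int = 2) -> int:
    """~10h bytes-worth of elements per token (layer inputs, MLP
    intermediates, norms) plus the b x n_heads x s x s attention
    score matrices, stored in BF16 (2 bytes per element)."""
    return b * s * (10 * h + n_heads * s) * bytes_per_el

per_layer = activation_bytes_per_layer(1, 4096, 8192, 64)
print(per_layer / 1e9)        # ~2.8 GB per layer per micro-batch
print(10 * per_layer / 1e9)   # ~28 GB for a 10-layer stage, per micro-batch
```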

GPU Memory Budget Per Pipeline Stage (H100 80GB)

Memory allocation for a single pipeline stage in a 405B model split across 8 GPUs

| Component | Size | Notes |
|---|---|---|
| Model Parameters (BF16) | ~12.5 GB | 405B / 8 PP stages ≈ 50B params per stage, split further by TP within the node to ~12.5 GB of BF16 weights per GPU |
| Optimizer States (Adam) | ~25 GB | FP32 master weights + momentum + variance |
| Gradients (BF16) | ~12.5 GB | Same size as parameters |
| Activations (in-flight) | ~20-30 GB | P micro-batch activations for 1F1B, or M for GPipe |
| Available / Buffers | ~0-10 GB | Communication buffers, temporary tensors, CUDA overhead |
📊 Activation Memory Per Stage: GPipe vs 1F1B (10 layers/stage, b=1, s=4096, h=8192)

| Schedule | In-Flight MBs | Activation Memory | Feasible on 80GB H100? |
|---|---|---|---|
| GPipe M=4 | 4 | ~11 GB | Yes (tight) |
| GPipe M=16 | 16 | ~45 GB | No -- exceeds budget |
| GPipe M=32 | 32 | ~90 GB | No -- far exceeds |
| 1F1B P=4 | 4 | ~11 GB | Yes |
| 1F1B P=8 | 8 | ~22 GB | Yes |
| 1F1B P=16 | 16 | ~45 GB | Marginal |

Activation Checkpointing (Recomputation)

Activation checkpointing (also called gradient checkpointing or activation recomputation) is the standard solution to the activation memory problem. Instead of storing all intermediate activations, you store only the input activation at each layer boundary and recompute the intermediate activations during the backward pass.

The trade-off is straightforward: you trade compute for memory. With full activation checkpointing, you store only 1 activation tensor per layer instead of ~4-5, reducing activation memory by roughly 4-5x. The cost is recomputing the forward pass during backward, adding approximately 33% more total compute (the forward pass is about 1/3 of the total forward + backward compute).

import torch
import torch.utils.checkpoint

class ActivationCheckpointing:
    """
    Activation checkpointing reduces memory by recomputing
    activations during the backward pass instead of storing them.
    """
    def __init__(self, checkpoint_every_n_layers: int = 1):
        self.checkpoint_interval = checkpoint_every_n_layers

    def memory_reduction(self, layers_per_stage: int) -> float:
        """Factor of memory reduction vs. no checkpointing."""
        # Without checkpointing: ~5 activation tensors per layer
        # (the input plus intermediates).
        # With checkpointing: one stored input per segment, plus the
        # live segment being recomputed during the backward pass.
        checkpoints = layers_per_stage // self.checkpoint_interval
        live_segment = self.checkpoint_interval * 5
        return layers_per_stage * 5 / (checkpoints + live_segment)

    def compute_overhead(self) -> float:
        """Additional compute as a fraction of total."""
        # The forward pass is re-run during backward.
        # Forward is ~1/3 of total (forward + backward) compute,
        # so recomputation adds ~33%.
        return 0.33

    def forward_with_checkpointing(self, layers, input_tensor):
        """
        Store activations only at segment boundaries; intermediates
        inside each segment are recomputed during backward.
        """
        def make_segment(start, end):
            def segment(x):
                for layer in layers[start:end]:
                    x = layer(x)
                return x
            return segment

        x = input_tensor
        for start in range(0, len(layers), self.checkpoint_interval):
            end = min(start + self.checkpoint_interval, len(layers))
            # checkpoint() stores the segment input, not intermediates
            x = torch.utils.checkpoint.checkpoint(
                make_segment(start, end), x, use_reentrant=False)
        return x

Selective activation checkpointing is a refinement: only recompute the cheap-to-recompute activations (like attention scores, which are memory-heavy but compute-cheap relative to matrix multiplications). This gives most of the memory savings with less compute overhead.

Activation Checkpointing Trade-offs

| Strategy | Memory Savings | Compute Overhead | Typical Use Case |
|---|---|---|---|
| No checkpointing | 0% | 0% | Small models, abundant memory |
| Full checkpointing | ~80% | ~33% | Very large models, memory-constrained |
| Selective (attn only) | ~50-60% | ~10-15% | Medium models, balanced trade-off |
| Every-2-layers | ~50% | ~16% | Moderate memory pressure |

Combining Pipeline, Tensor, and Data Parallelism

In practice, large-scale training uses all three parallelism strategies simultaneously. The typical recipe for a cluster with N total GPUs, where each node has G GPUs connected by NVLink:

  • Tensor parallelism (TP) within a node: TP = G (e.g., 8 GPUs per node). All-reduce operations stay on NVLink.
  • Pipeline parallelism (PP) across nodes: PP = number of nodes in a pipeline. Point-to-point activation transfers go over InfiniBand.
  • Data parallelism (DP) across pipeline replicas: DP = N / (TP × PP). Gradient all-reduce across DP replicas.

Total GPUs: N = TP × PP × DP.

Example: Training Llama 405B on 1024 H100s (128 nodes, 8 GPUs/node)

TP = 8  (within each node, NVLink)
PP = 8  (across 8 nodes per pipeline, InfiniBand)
DP = 16 (16 pipeline replicas, InfiniBand)

Total = 8 * 8 * 16 = 1024 GPUs

Each pipeline replica:
  - 8 nodes, 64 GPUs
  - Each node holds one pipeline stage
  - Within each node, 8 GPUs share one stage via TP
  - Model: 126 layers / 8 stages = ~16 layers per stage
  - Each TP group of 8 GPUs holds 16 layers with weights sharded 8-way

Data parallelism:
  - 16 replicas of this 64-GPU pipeline
  - Global batch split 16 ways
  - Gradient all-reduce across 16 replicas after each mini-batch
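The arithmetic in this layout can be checked mechanically (a standalone sketch of the decomposition above):

```python
tp, pp, dp = 8, 8, 16
gpus_per_node = 8
num_layers = 126  # Llama 405B

total_gpus = tp * pp * dp                          # GPUs in the whole job
replica_gpus = tp * pp                             # one pipeline replica
nodes_per_replica = replica_gpus // gpus_per_node  # nodes per replica
layers_per_stage = num_layers / pp                 # fractional -> rounded in practice

print(total_gpus)         # 1024
print(replica_gpus)       # 64
print(nodes_per_replica)  # 8
print(layers_per_stage)   # 15.75 -> ~16 layers per stage
```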

3D Parallelism Layout: 1024 GPUs for 405B Model

TP=8 (intra-node NVLink), PP=8 (inter-node IB), DP=16 (across pods)

  • Tensor Parallel Group (TP=8): 8 GPUs within one node, connected by NVLink (900 GB/s). All-reduce per layer; highest communication volume.
  • Pipeline Parallel Group (PP=8): 8 nodes (64 GPUs), connected by InfiniBand (400 Gb/s). Point-to-point activation transfers at stage boundaries.
  • Data Parallel Group (DP=16): 16 pipeline replicas (1024 GPUs total). All-reduce gradients once per mini-batch, overlapped with compute.

Communication Patterns in 3D Parallelism

Each parallelism dimension has a distinct communication pattern:

Tensor parallelism (most frequent, highest bandwidth):

  • 2 all-reduce operations per transformer layer
  • Volume: O(b × s × h) per all-reduce
  • Must complete within the compute time of one layer
  • Requires NVLink-class bandwidth (hundreds of GB/s)

Pipeline parallelism (moderate frequency, moderate bandwidth):

  • 1 point-to-point transfer per pipeline stage boundary per micro-batch
  • Volume: O(b × s × h) per transfer
  • Can overlap with compute at adjacent stages
  • InfiniBand sufficient (tens of GB/s)

Data parallelism (least frequent, can overlap):

  • 1 all-reduce of all gradients per mini-batch
  • Volume: O(total parameters)
  • Can overlap with backward pass computation (ring all-reduce)
  • InfiniBand sufficient, can use network efficiently

class ThreeDParallelConfig:
    """
    Configuration for 3D parallelism: TP + PP + DP.
    """
    def __init__(self, tp: int, pp: int, dp: int,
                 gpus_per_node: int = 8):
        self.tp = tp
        self.pp = pp
        self.dp = dp
        self.total_gpus = tp * pp * dp
        self.gpus_per_node = gpus_per_node
        self.nodes = self.total_gpus // gpus_per_node

        # Validate: TP should fit within a node
        assert tp <= gpus_per_node, \
            f"TP={tp} exceeds GPUs per node={gpus_per_node}"

    def communication_volume_per_step(self, model_params: int,
                                       hidden: int, seq_len: int,
                                       micro_batch: int,
                                       layers_per_stage: int) -> dict:
        """Estimate communication volumes per training step."""
        activation_size = micro_batch * seq_len * hidden * 2  # BF16

        # TP: 2 all-reduces per layer, within node
        tp_volume = 2 * layers_per_stage * activation_size * (self.tp - 1) / self.tp
        tp_link = "NVLink"

        # PP: 1 point-to-point per stage boundary per micro-batch
        # Number of micro-batches depends on schedule
        pp_volume = activation_size  # per micro-batch per boundary
        pp_link = "InfiniBand" if self.tp == self.gpus_per_node else "NVLink"

        # DP: all-reduce of all gradients, once per mini-batch
        dp_volume = model_params * 2  # BF16 gradients
        dp_link = "InfiniBand"

        return {
            'tp': {'volume_bytes': tp_volume, 'link': tp_link,
                   'frequency': 'per layer per micro-batch'},
            'pp': {'volume_bytes': pp_volume, 'link': pp_link,
                   'frequency': 'per micro-batch per boundary'},
            'dp': {'volume_bytes': dp_volume, 'link': dp_link,
                   'frequency': 'once per mini-batch (overlapped)'},
        }
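Plugging in the 405B shapes (h = 13824, s = 4096, b = 1) gives a feel for the magnitudes. A standalone sketch, counting each layer's two all-reduces as roughly 2x the activation tensor size (exact volumes depend on the all-reduce algorithm):

```python
b, s, h = 1, 4096, 13824      # micro-batch, sequence, hidden
params = 405e9                # total parameter count

act = b * s * h * 2           # one BF16 activation tensor, in bytes

pp_per_transfer = act         # point-to-point at each stage boundary
tp_per_layer = 2 * act        # two all-reduces per transformer layer
dp_per_step = params * 2      # BF16 gradients, once per mini-batch

print(f"{pp_per_transfer / 2**20:.0f} MiB")   # ~108 MiB
print(f"{tp_per_layer / 2**20:.0f} MiB")      # ~216 MiB
print(f"{dp_per_step / 1e9:.0f} GB")          # 810 GB
```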
Communication Volume by Parallelism Dimension (405B, h=13824, s=4096, b=1)

| Dimension | Volume Per Event | Frequency | Link Type | Latency Sensitivity |
|---|---|---|---|---|
| TP (all-reduce) | ~214 MB | 2x per layer per micro-batch | NVLink | Very high |
| PP (point-to-point) | ~107 MB | 1x per stage boundary | InfiniBand | Moderate |
| DP (all-reduce) | ~810 GB | 1x per mini-batch | InfiniBand | Low (overlapped) |

Expert Parallelism and MoE

For Mixture-of-Experts models like DeepSeek-V3, there is a fourth dimension: expert parallelism (EP). Experts are distributed across GPUs, and an all-to-all communication pattern routes tokens to the correct expert. This adds another layer of complexity to the parallelism strategy but is orthogonal to pipeline parallelism.

Practical Concerns: Load Balancing Pipeline Stages

For pipeline parallelism to be efficient, every stage must take approximately the same time to compute. If one stage takes twice as long, every other stage waits for it, creating a bottleneck worse than the inherent bubble.

Transformer models are relatively easy to balance because every layer has the same architecture and roughly the same compute cost. You simply assign equal numbers of layers per stage: L / P layers each. The first and last stages may have slight additional work (embedding lookup, output projection), but this is minor.

Non-uniform architectures (models with varying layer sizes, skip connections across stages, or different computation at different depths) are harder to balance. Profiling each layer’s compute time and partitioning to equalize stage times is necessary.

def balance_pipeline_stages(layer_compute_times: list, num_stages: int) -> list:
    """
    Partition layers into contiguous stages with balanced compute times.
    Uses dynamic programming to minimize the maximum stage time.

    Args:
        layer_compute_times: list of per-layer compute times in ms
        num_stages: number of pipeline stages P

    Returns:
        list of (start_layer, end_layer) tuples per stage (inclusive)
    """
    n = len(layer_compute_times)
    prefix_sum = [0.0] * (n + 1)
    for i in range(n):
        prefix_sum[i + 1] = prefix_sum[i] + layer_compute_times[i]

    # dp[s][j] = min possible max-stage-time using s stages for first j layers
    # choice[s][j] = split point k that achieves dp[s][j]
    INF = float('inf')
    dp = [[INF] * (n + 1) for _ in range(num_stages + 1)]
    choice = [[0] * (n + 1) for _ in range(num_stages + 1)]
    dp[0][0] = 0.0

    for s in range(1, num_stages + 1):
        for j in range(s, n + 1):
            for k in range(s - 1, j):
                stage_time = prefix_sum[j] - prefix_sum[k]
                cand = max(dp[s - 1][k], stage_time)
                if cand < dp[s][j]:
                    dp[s][j] = cand
                    choice[s][j] = k

    # Backtrack through the recorded split points to recover the partition
    stages, j = [], n
    for s in range(num_stages, 0, -1):
        k = choice[s][j]
        stages.append((k, j - 1))
        j = k
    return stages[::-1]
⚠️ Embedding and Output Layers

The first pipeline stage typically includes the token embedding layer, and the last stage includes the output projection (language model head). For large-vocabulary models (e.g., 128K tokens), the embedding and output projection can be significant: a 128K vocabulary with hidden size 8192 in BF16 is about 2 GB. This asymmetry means the first and last stages have more memory pressure and slightly different compute profiles. Some implementations split the embedding across stages or tie the embedding and output projection weights to mitigate this.
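The 2 GB figure follows directly from the shapes (a sketch, taking "128K" as 128,000 tokens):

```python
vocab_size, hidden = 128_000, 8192
embedding_bytes = vocab_size * hidden * 2  # BF16, 2 bytes per element
print(embedding_bytes / 1e9)               # ~2.1 GB
```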

When Pipeline Parallelism Is the Wrong Choice

Pipeline parallelism is not always the answer. There are scenarios where it introduces more overhead than it solves:

Models That Fit on One Node

If a model fits within a single node’s aggregate GPU memory (e.g., a 70B model on 8 H100s with 640 GB total), tensor parallelism alone is usually better. TP within a node over NVLink has:

  • Lower latency than PP’s pipeline bubble
  • No idle time from fill/drain phases
  • Simpler implementation and debugging

Pipeline parallelism only becomes necessary when you need to span multiple nodes, where TP’s all-reduce latency over InfiniBand becomes prohibitive.

Very Small Micro-Batch Sizes

When the global batch size is small or cannot be divided into enough micro-batches, the bubble dominates. If M is close to P, the bubble fraction approaches 50% even with 1F1B:

Bubble fraction = (P − 1) / (P − 1 + M); at M = P this becomes (P − 1) / (2P − 1) ≈ 50%

At small M, you are better off using fewer pipeline stages (accepting less parallelism) or using other strategies entirely.
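The effect is easy to see numerically (a sketch of the 1F1B bubble formula):

```python
def bubble_fraction(P: int, M: int) -> float:
    # GPipe / 1F1B bubble: (P - 1) / (P - 1 + M)
    return (P - 1) / (P - 1 + M)

print(bubble_fraction(8, 8))   # ~0.467 -- nearly half the schedule is idle
print(bubble_fraction(8, 64))  # ~0.099 -- bubble shrinks with more MBs
```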

Latency-Sensitive Inference

Pipeline parallelism is designed for throughput, not latency. For a single request:

  • The request must pass through all PP stages sequentially
  • Latency = P × (per-stage latency) + (P − 1) × (inter-stage communication)
  • No opportunity for micro-batching (only one sequence)

For inference, tensor parallelism is preferred because it reduces per-request latency by parallelizing each layer’s computation. Pipeline parallelism can help inference throughput when processing many concurrent requests (similar to micro-batching), but for latency-critical single-request scenarios, it adds overhead.
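The latency formula can be made concrete (a sketch with hypothetical per-stage and per-hop numbers):

```python
def pp_single_request_latency_ms(P: int, per_stage_ms: float,
                                 comm_ms: float) -> float:
    # A single request traverses all P stages sequentially
    return P * per_stage_ms + (P - 1) * comm_ms

# Hypothetical: 5 ms of compute per stage, 0.5 ms per inter-stage hop
print(pp_single_request_latency_ms(8, 5.0, 0.5))  # 43.5 ms
```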

Heterogeneous Hardware

Pipeline parallelism assumes all stages take equal time. If your cluster has mixed GPU generations (e.g., some A100s and some H100s), the slowest GPU becomes the bottleneck for every stage. DP is more forgiving of heterogeneous hardware because slower workers simply process smaller batch portions.

When to Use Each Parallelism Strategy

| Scenario | Best Strategy | Rationale | Recommendation |
|---|---|---|---|
| 70B on 8xH100 (1 node) | TP=8 | No bubble, NVLink fast enough | TP only |
| 405B on 64 GPUs (8 nodes) | TP=8, PP=8 | Need PP for inter-node | TP + PP |
| 405B on 1024 GPUs | TP=8, PP=8, DP=16 | Full 3D parallelism | TP + PP + DP |
| 7B model, many GPUs | DP only | Model fits on 1 GPU, PP bubble wasted | DP |
| Single-request inference | TP | PP adds latency, no micro-batching | TP |
| High-throughput inference | TP + PP | PP helps throughput with batched requests | TP + PP |

Summary of Pipeline Schedules

The evolution of pipeline parallelism schedules represents steady progress toward eliminating the bubble:

Pipeline Schedule Evolution

| Schedule | Year | Bubble Formula | Peak Activations | Key Innovation |
|---|---|---|---|---|
| Naive | N/A | (P-1)/P | 1 | No micro-batching |
| GPipe | 2019 | (P-1)/(P-1+M) | M | Micro-batch pipelining |
| 1F1B (PipeDream-Flush) | 2019 | (P-1)/(P-1+M) | P | Alternating F/B, bounded memory |
| Interleaved (Megatron) | 2021 | (P-1)/(M*V) | P | Virtual stages, smaller bubble |
| DualPipe (DeepSeek-V3) | 2024 | ~(P-1)/(2M) | ~2P | Bidirectional, overlap F+B |

def compare_schedules(P: int, M: int, V: int = 1) -> dict:
    """
    Compare bubble fractions and memory requirements
    across pipeline parallelism schedules.

    Args:
        P: number of pipeline stages
        M: number of micro-batches
        V: number of virtual stages (for interleaved schedule)
    """
    return {
        'naive': {
            'bubble': (P - 1) / P,
            'peak_activations': 1,
            'description': 'Single batch, no pipelining'
        },
        'gpipe': {
            'bubble': (P - 1) / (P - 1 + M),
            'peak_activations': M,
            'description': 'All forwards then all backwards'
        },
        '1f1b': {
            'bubble': (P - 1) / (P - 1 + M),
            'peak_activations': P,
            'description': 'Alternating forward and backward'
        },
        'interleaved': {
            'bubble': (P - 1) / (M * V),
            'peak_activations': P,
            'communication_factor': V,
            'description': f'Virtual stages V={V}, more comm but less bubble'
        },
        'dualpipe': {
            'bubble': (P - 1) / (2 * M),
            'peak_activations': 2 * P,  # approximate
            'description': 'Bidirectional, forward+backward overlap'
        }
    }

# Example: P=8 stages, M=32 micro-batches, V=4 virtual stages
results = compare_schedules(P=8, M=32, V=4)
# naive:       87.5% bubble
# gpipe:       17.9% bubble, 32 activations
# 1f1b:        17.9% bubble, 8 activations
# interleaved: 5.5% bubble, 8 activations, 4x comm
# dualpipe:    10.9% bubble (before overlap hiding), ~16 activations

Effective Throughput Relative to Ideal (P=8, M=32)

| Schedule | Bubble | % of Ideal |
|---|---|---|
| Ideal (no bubble) | -- | 100% |
| Naive | 87.5% | 12.5% |
| GPipe | 17.9% | 82.1% |
| 1F1B | 17.9% (same throughput, 4x less memory) | 82.1% |
| Interleaved V=4 | 5.5% | 94.5% |
| DualPipe | ~4% effective with overlap | ~96% |

Implementation Reference: Frameworks and Their Schedules

Major distributed training frameworks implement different subsets of these schedules:

Megatron-LM (NVIDIA): Supports 1F1B and interleaved schedules. The de facto standard for large-scale LLM training. Tight integration with TP and DP.

DeepSpeed (Microsoft): Supports GPipe-style and 1F1B schedules via its pipeline engine. Part of the ZeRO ecosystem for memory optimization.

FairScale / PyTorch FSDP: PyTorch-native pipeline parallelism with 1F1B schedule. Integrates with FSDP for data parallelism.

DeepSeek’s training infrastructure: Custom DualPipe implementation. Not yet available as a general-purpose library (as of early 2025).

# Megatron-LM example configuration for 3D parallelism
megatron_config = {
    'tensor_model_parallel_size': 8,        # TP = 8 (intra-node)
    'pipeline_model_parallel_size': 8,       # PP = 8 (inter-node)
    'data_parallel_size': 16,                # DP = 16
    'virtual_pipeline_model_parallel_size': 4,  # V = 4 (interleaved)
    'num_micro_batches': 32,                 # M = 32
    'global_batch_size': 2048,               # split across DP replicas
    'micro_batch_size': 4,                   # per micro-batch
    'sequence_length': 4096,
    'num_layers': 126,                       # Llama 405B
    'hidden_size': 13824,
    'num_attention_heads': 128,
    'fp16': False,
    'bf16': True,
    'recompute_activations': True,           # activation checkpointing
    'recompute_granularity': 'selective',     # only recompute attention
}
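One consistency check worth automating: the batch dimensions in a config like this must satisfy global_batch_size = DP × micro_batch_size × num_micro_batches. A standalone sketch:

```python
cfg = {'data_parallel_size': 16, 'micro_batch_size': 4,
       'num_micro_batches': 32, 'global_batch_size': 2048}

derived = (cfg['data_parallel_size'] * cfg['micro_batch_size']
           * cfg['num_micro_batches'])
assert derived == cfg['global_batch_size'], "inconsistent batch sizes"
print(derived)  # 2048
```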

Conclusion

Pipeline parallelism exists to solve a specific problem: training models that exceed single-node memory while avoiding the bandwidth penalty of cross-node tensor parallelism. The core challenge is the pipeline bubble — idle GPU time during pipeline fill and drain.

The evolution from naive pipelining through GPipe, 1F1B, interleaved stages, and DualPipe represents a progression from 87.5% idle time to under 5% effective idle time for typical configurations. Each advance either shrinks the bubble (more micro-batches, virtual stages, bidirectional flow) or reduces the memory cost of shrinking it (1F1B’s bounded activation storage, activation checkpointing).

The practical recipe for large-scale training has converged: tensor parallelism within a node (where NVLink provides the bandwidth for frequent all-reduce), pipeline parallelism across nodes (where only activation tensors cross the slower InfiniBand links), and data parallelism across pipeline replicas (where gradient all-reduce can overlap with backward computation). Understanding the trade-offs between bubble fraction, activation memory, and communication volume is essential for configuring these systems efficiently.

The bubble is the tax you pay for splitting a model across devices. The history of pipeline parallelism is the history of minimizing that tax.