Part of Series: Inference Optimization Timeline (43 of 60)

Pipeline Parallelism: From GPipe to DualPipe -- Eliminating the Bubble

Llama 3 405B requires 810 GB just for weights in BF16. An H100 has 80 GB. The model physically cannot fit on one GPU, so you must distribute it across multiple devices. Tensor parallelism splits every operation across GPUs with an all-reduce per layer — fast within a node over NVLink, catastrophic across nodes over InfiniBand because 160 all-reduces per forward pass drown the network. Pipeline parallelism splits the model by layers: Device 0 gets layers 0-9, Device 1 gets layers 10-19, and only activation tensors cross device boundaries. Communication volume drops by 20x, but you introduce pipeline bubbles where GPUs sit idle waiting for the next micro-batch. The best schedules — PipeDream 1F1B, Megatron interleaved, DeepSeek DualPipe — minimize bubbles to under 10%. This post covers the algorithms, the bubble math, and when PP beats TP.

There are three fundamental strategies for distributing a model across devices:

  1. Data parallelism (DP) — replicate the full model on every device, split the data. This does not help when the model itself exceeds one device’s memory.
  2. Tensor parallelism (TP) — split individual operations (matrix multiplications, attention heads) across devices. Every operation requires an all-reduce across participating GPUs, so it demands extremely high-bandwidth interconnect. Within a single node using NVLink (900 GB/s bidirectional on H100), TP works well. Across nodes over InfiniBand (50-100 GB/s per link), the latency and bandwidth penalty is severe.
  3. Pipeline parallelism (PP) — split the model by layers. Device 0 holds layers 0-9, device 1 holds layers 10-19, and so on. Only activation tensors cross device boundaries, and they cross in one direction at a time. The communication volume is dramatically smaller than TP’s all-reduce pattern.
ℹ️ Why Not Just Use Tensor Parallelism Everywhere?

Tensor parallelism requires two all-reduce operations per transformer layer (one for the attention projection, one for the MLP). For a model with hidden dimension h = 8192 and sequence length s = 4096, each all-reduce communicates O(s × h) bytes. Over InfiniBand at 50 GB/s effective bandwidth, this adds 100-200 microseconds of latency per layer. With 80+ layers, cross-node TP can spend more time communicating than computing. Pipeline parallelism replaces this with a single point-to-point transfer of activations between adjacent stages — orders of magnitude less communication.

Pipeline parallelism is the natural choice for inter-node distribution. You assign consecutive layers to each node, and the only data that crosses the (slow) inter-node network is the activation tensor at layer boundaries. For a transformer with hidden size h and micro-batch size b with sequence length s, the activation tensor at each boundary is b × s × h elements. Compare this to TP, which would require an all-reduce of the same size tensor twice per layer across all nodes.
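To put numbers on this, a quick back-of-envelope sketch (hypothetical helper names; it counts each all-reduce as one tensor's worth of traffic, ignoring the roughly 2x factor of ring all-reduce):

```python
def tp_traffic_bytes(layers: int, b: int, s: int, h: int, elem_bytes: int = 2) -> int:
    """TP: 2 all-reduces per layer, each on a b*s*h activation tensor (BF16)."""
    return layers * 2 * b * s * h * elem_bytes

def pp_traffic_bytes(stages: int, b: int, s: int, h: int, elem_bytes: int = 2) -> int:
    """PP: one point-to-point boundary transfer per adjacent stage pair."""
    return (stages - 1) * b * s * h * elem_bytes

tp = tp_traffic_bytes(layers=80, b=1, s=4096, h=8192)
pp = pp_traffic_bytes(stages=8, b=1, s=4096, h=8192)
print(tp / pp)   # 160 transfers vs 7: roughly 23x less traffic for PP
```

The ratio is simply 160 all-reduces against 7 point-to-point sends; the per-tensor size cancels out.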

📊 Communication Volume Per Training Step (80-layer Transformer, h=8192)

| Strategy | Nodes | Per-Layer Comm | Total Comm | Interconnect Need |
|---|---|---|---|---|
| TP across 8 nodes | 8 | 2 all-reduces of b*s*h | 160 all-reduces | NVLink-class |
| PP across 8 nodes | 8 | 1 point-to-point of b*s*h | 7 point-to-point | InfiniBand OK |
| TP intra-node + PP inter-node | 8 | 2 all-reduces (local) + 1 p2p | Hybrid | Optimal |

But pipeline parallelism has a fundamental problem: the pipeline bubble.

The Bubble Problem

Consider a naive pipeline with P=4P = 4 stages processing a single mini-batch. Stage 0 computes the forward pass for layers 0-19, then sends the activation to stage 1. Stage 1 computes layers 20-39, sends to stage 2, and so on. During this entire forward cascade, stages 1, 2, and 3 are idle while waiting for their input. Then during the backward pass, stages cascade in reverse. The result is that each stage is active for only a small fraction of the total time.

Naive Pipeline (P=4, single batch):

Time -->  |  1  |  2  |  3  |  4  |  5  |  6  |  7  |  8  |
Stage 0:  | F0  |     |     |     |     |     |     | B0  |
Stage 1:  |     | F0  |     |     |     |     | B0  |     |
Stage 2:  |     |     | F0  |     |     | B0  |     |     |
Stage 3:  |     |     |     | F0  | B0  |     |     |     |

F0 = forward micro-batch 0, B0 = backward micro-batch 0
Empty cells = GPU sitting idle (the "bubble")

The idle fraction in this naive schedule is:

Bubble fraction = (P - 1) / P

For P = 8 pipeline stages, this means 7/8 = 87.5% of GPU time is wasted. You are paying for 8 GPUs but getting the throughput of slightly more than 1. This is unacceptable.
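The formula falls out of counting slots: the timeline is 2P slots (a P-slot forward cascade plus a P-slot backward cascade), and each stage is busy for exactly 2 of them. A tiny sketch (hypothetical helper name) confirms:

```python
def naive_bubble_fraction(P: int) -> float:
    """Idle fraction of the naive schedule: each of the P stages is busy
    for 2 slots (one forward, one backward) out of a 2P-slot timeline."""
    total_slots = P * 2 * P        # P stages x 2P time slots
    busy_slots = P * 2             # each stage: 1 forward + 1 backward
    return (total_slots - busy_slots) / total_slots

# Matches (P - 1) / P for every stage count:
for P in (2, 4, 8, 16, 32):
    assert abs(naive_bubble_fraction(P) - (P - 1) / P) < 1e-12
```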

Naive Pipeline Bubble Fraction by Number of Stages

| Stages (P) | Idle fraction |
|---|---|
| P=2 | 50% |
| P=4 | 75% |
| P=8 | 87.5% |
| P=16 | 93.75% |
| P=32 | 96.88% |

The entire history of pipeline parallelism research is, in essence, the quest to shrink this bubble.

GPipe: Micro-Batching to Shrink the Bubble

GPipe (Huang et al., 2019) introduced the foundational insight: split the mini-batch into M micro-batches and pipeline them through the stages. Instead of waiting for the entire forward pass to complete before starting the backward pass, you feed micro-batches into the pipeline one after another. While stage 0 is processing micro-batch 1, stage 1 is processing micro-batch 0 from the previous time step.

GPipe Schedule (P=4, M=8 micro-batches):

Time -->  | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | ... | 19 |
Stage 0:  | F0 | F1 | F2 | F3 | F4 | F5 | F6 | F7 |    |    |    |     | B0 |
Stage 1:  |    | F0 | F1 | F2 | F3 | F4 | F5 | F6 | F7 |    |    |     |    |
Stage 2:  |    |    | F0 | F1 | F2 | F3 | F4 | F5 | F6 | F7 |    |     |    |
Stage 3:  |    |    |    | F0 | F1 | F2 | F3 | F4 | F5 | F6 | F7 | B7  |    |

All forward passes complete, then all backward passes run.
Bubble is only the fill and drain phases: (P-1) time slots at each end.

The key schedule is: all M micro-batches execute their forward passes in pipeline fashion, then all M micro-batches execute their backward passes. Gradients are accumulated across micro-batches and the optimizer step happens once per mini-batch.

The bubble fraction becomes:

Bubble fraction (GPipe) = (P - 1) / (P - 1 + M)

This is a dramatic improvement. With P = 8 stages and M = 32 micro-batches:

Bubble = 7 / (7 + 32) = 7/39 ≈ 17.9%

Compare this to 87.5% with the naive approach.
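The same slot-counting argument gives the GPipe formula: the forward phase takes M + (P - 1) slots (M micro-batches plus the pipeline fill), the backward phase takes another M + (P - 1), and each stage does useful work in exactly 2M of those slots. A small sketch under the idealized equal-cost-per-slot assumption (hypothetical helper name):

```python
def gpipe_bubble_fraction(P: int, M: int) -> float:
    """(P-1)/(P-1+M), derived by counting busy vs total slots per stage."""
    total_slots = 2 * (M + P - 1)   # forward phase + backward phase
    busy_slots = 2 * M              # M forwards + M backwards per stage
    return (total_slots - busy_slots) / total_slots

assert abs(gpipe_bubble_fraction(8, 32) - 7 / 39) < 1e-12
print(round(gpipe_bubble_fraction(8, 32) * 100, 1))  # 17.9
```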

📊 GPipe Bubble Fraction: Effect of Micro-batch Count M

| Micro-batches (M) | P=4 | P=8 | P=16 | P=32 |
|---|---|---|---|---|
| M=4 | 42.9% | 63.6% | 78.9% | 88.6% |
| M=8 | 27.3% | 46.7% | 65.2% | 79.5% |
| M=16 | 15.8% | 30.4% | 48.4% | 66.0% |
| M=32 | 8.6% | 17.9% | 32.6% | 49.2% |
| M=64 | 4.5% | 9.9% | 19.0% | 32.6% |
| M=128 | 2.3% | 5.2% | 10.5% | 19.5% |
class GPipeSchedule:
    """
    GPipe micro-batch pipeline schedule.

    All forwards run first (fill phase), then all backwards (drain phase).
    Gradient accumulation happens across micro-batches.
    """
    def __init__(self, stages: int, num_microbatches: int):
        self.P = stages
        self.M = num_microbatches

    def bubble_fraction(self) -> float:
        return (self.P - 1) / (self.P - 1 + self.M)

    def execute(self, mini_batch):
        micro_batches = split(mini_batch, self.M)

        # Phase 1: All forward passes (pipeline fill + steady state)
        # Each micro-batch flows through all P stages
        activations = {}  # (stage, mb_idx) -> activation tensor
        for mb_idx in range(self.M):
            for stage in range(self.P):
                input_act = micro_batches[mb_idx] if stage == 0 \
                           else activations[(stage - 1, mb_idx)]
                activations[(stage, mb_idx)] = self.forward(stage, input_act)

        # Phase 2: All backward passes (pipeline drain)
        # Must store ALL activations from forward phase -- memory intensive
        accumulated_grads = [zeros_like(p) for p in self.parameters()]
        for mb_idx in reversed(range(self.M)):
            for stage in reversed(range(self.P)):
                grads = self.backward(stage, activations[(stage, mb_idx)])
                accumulate(accumulated_grads, grads)

        # Phase 3: Single optimizer step
        self.optimizer_step(accumulated_grads)
⚠️ GPipe Memory Problem

GPipe must store activations for ALL M micro-batches across ALL P stages simultaneously during the forward phase, because backward passes do not begin until all forward passes complete. For M = 32 micro-batches with hidden size h = 8192 and sequence length s = 4096 in BF16, each activation tensor is b × s × h × 2 bytes. Even with micro-batch size b = 1, that is 64 MB per activation. Storing 32 of these per stage means 2 GB of activation memory per stage — and this is just one layer boundary. The actual activation memory across all layers in a stage is much larger.
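The arithmetic in the callout checks out in two lines (BF16 boundary activations only):

```python
# b * s * h * 2 bytes for one BF16 boundary activation (b=1, s=4096, h=8192)
activation_bytes = 1 * 4096 * 8192 * 2
print(activation_bytes / 2**20)        # 64.0 MiB per boundary activation
print(32 * activation_bytes / 2**30)   # 2.0 GiB for M=32 per stage
```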

GPipe’s Gradient Accumulation Semantics

A subtle but important property of GPipe: because all forward passes complete before any backward pass begins, the gradient computation is mathematically identical to processing the full mini-batch at once (up to floating-point ordering differences). Each micro-batch’s forward pass uses the same model weights. The accumulated gradients across micro-batches equal the gradient of the full mini-batch loss.

This is a clean semantic guarantee. The training dynamics are identical to single-device training with the same effective batch size. No weight staleness, no approximation.
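The equivalence is easy to verify numerically. A toy example (a linear model with a summed squared-error loss; all names hypothetical): the sum of per-micro-batch gradients equals the full-batch gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 16))     # mini-batch of 32 examples
y = rng.normal(size=32)
w = rng.normal(size=16)

def grad(Xb, yb, w):
    """Gradient of the summed squared error ||Xb @ w - yb||^2 wrt w."""
    return 2 * Xb.T @ (Xb @ w - yb)

full_batch = grad(X, y, w)
# Four "micro-batches" of 8, all using the same weights w:
accumulated = sum(grad(X[i:i + 8], y[i:i + 8], w) for i in range(0, 32, 8))
assert np.allclose(full_batch, accumulated)
```

With a mean-reduced loss you would average (or weight) the micro-batch gradients instead of summing, but the identity is the same.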

PipeDream 1F1B: Interleaving Forward and Backward

PipeDream (Narayanan et al., 2019) and its successor PipeDream-Flush (also called the “1F1B schedule”) address GPipe’s memory problem by interleaving forward and backward passes. Instead of completing all forwards before starting any backward, each stage alternates: one forward, one backward, one forward, one backward.

1F1B Schedule (P=4, M=8 micro-batches):

Time -->  | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 | ...
Stage 0:  | F0 | F1 | F2 | F3 |    |    |    | B0 | F4 | B1 | F5 | B2 | ...
Stage 1:  |    | F0 | F1 | F2 |    |    | B0 | F3 | B1 | F4 | B2 | F5 | ...
Stage 2:  |    |    | F0 | F1 |    | B0 | F2 | B1 | F3 | B2 | F4 | B3 | ...
Stage 3:  |    |    |    | F0 | B0 | F1 | B1 | F2 | B2 | F3 | B3 | F4 | ...

Backward passes start at the last stage and propagate back toward stage 0.
Warmup: each stage does (P - stage_id - 1) extra forwards before entering steady state.
Steady state: strictly alternating 1 forward + 1 backward.
Cooldown: remaining backwards drain the pipeline.

The schedule has three phases:

  1. Warmup phase: Stage i performs (P - 1 - i) forward passes to fill the pipeline. Stage 0 does P - 1 forwards, stage 1 does P - 2, and so on. The last stage does 0 warmup forwards.
  2. Steady-state phase: Each stage alternates exactly one forward pass and one backward pass (hence “1F1B”). This is the bulk of the computation.
  3. Cooldown phase: Remaining backward passes drain the pipeline.

The bubble fraction is the same as GPipe: (P - 1) / (P - 1 + M). The critical difference is in memory.

The Memory Advantage of 1F1B

In GPipe, all M forward passes complete before any backward pass starts. This means every stage must store activations for all M micro-batches simultaneously. The peak activation memory per stage scales as O(M).

In 1F1B, once a stage enters steady state, it performs a backward pass immediately after each forward pass. The backward pass for micro-batch k frees the activations for micro-batch k. At any point during steady state, a stage has at most P micro-batches’ activations in flight — one from each of the stages ahead of it in the warmup, plus the current one.

Peak activations (GPipe) = M micro-batches
Peak activations (1F1B) = P micro-batches

Since typically M ≫ P (you want many micro-batches to reduce the bubble), this is a major memory saving.

Activation Memory: GPipe vs 1F1B (P=4 stages)

Peak number of micro-batch activations stored simultaneously per stage:

  • GPipe: M=32 micro-batch activations stored (32 x activation_size per stage); all forwards complete before any backward begins
  • 1F1B: P=4 micro-batch activations stored (4 x activation_size per stage); backward frees memory as soon as its forward finishes
  • Memory savings with 1F1B: 8x reduction (32/4) in activation memory, enabling larger micro-batches or larger models
class OneFOneBSchedule:
    """
    1F1B (One Forward One Backward) pipeline schedule.

    Key insight: interleaving forward and backward passes limits
    the number of in-flight micro-batches to P (not M).
    """
    def __init__(self, stages: int, num_microbatches: int):
        self.P = stages
        self.M = num_microbatches

    def peak_activations(self) -> int:
        """Max micro-batch activations stored at any stage."""
        return self.P  # Not M! This is the key advantage.

    def schedule_for_stage(self, stage_id: int):
        """Generate the action sequence for a single stage."""
        warmup_forwards = self.P - 1 - stage_id
        steady_state_pairs = self.M - warmup_forwards
        cooldown_backwards = warmup_forwards

        actions = []

        # Phase 1: Warmup -- only forwards
        for i in range(warmup_forwards):
            actions.append(('forward', i))

        # Phase 2: Steady state -- alternating 1F1B
        fwd_idx = warmup_forwards
        bwd_idx = 0
        for _ in range(steady_state_pairs):
            actions.append(('forward', fwd_idx))
            actions.append(('backward', bwd_idx))
            fwd_idx += 1
            bwd_idx += 1

        # Phase 3: Cooldown -- only backwards
        for i in range(cooldown_backwards):
            actions.append(('backward', bwd_idx + i))

        return actions
📊 Memory Comparison: GPipe vs 1F1B

| Config | GPipe Activations | 1F1B Activations | Memory Reduction |
|---|---|---|---|
| P=4, M=16 | 16 per stage | 4 per stage | 4x |
| P=4, M=32 | 32 per stage | 4 per stage | 8x |
| P=8, M=32 | 32 per stage | 8 per stage | 4x |
| P=8, M=64 | 64 per stage | 8 per stage | 8x |
| P=16, M=128 | 128 per stage | 16 per stage | 8x |
💡 Weight Staleness in Original PipeDream

The original PipeDream paper proposed asynchronous weight updates — each stage would apply its gradient immediately without waiting for other stages. This introduces weight staleness: forward and backward passes for the same micro-batch may use different weight versions. PipeDream-Flush (and Megatron-LM’s 1F1B implementation) eliminated this by using synchronous gradient accumulation, exactly like GPipe. The 1F1B schedule as used in practice today does NOT have weight staleness.

Interleaved Pipeline Stages (Megatron-LM)

Narayanan et al. (2021) in the Megatron-LM paper introduced interleaved pipeline stages, also known as “virtual pipeline stages.” The idea: instead of assigning each GPU a single contiguous block of layers, assign each GPU V non-consecutive blocks (“virtual stages”).

For example, with P = 4 GPUs, L = 32 layers, and V = 2 virtual stages per GPU:

Standard pipeline (V=1):
  GPU 0: layers  0- 7   (stage 0)
  GPU 1: layers  8-15   (stage 1)
  GPU 2: layers 16-23   (stage 2)
  GPU 3: layers 24-31   (stage 3)

Interleaved pipeline (V=2):
  GPU 0: layers  0- 3   (virtual stage 0) + layers 16-19 (virtual stage 4)
  GPU 1: layers  4- 7   (virtual stage 1) + layers 20-23 (virtual stage 5)
  GPU 2: layers  8-11   (virtual stage 2) + layers 24-27 (virtual stage 6)
  GPU 3: layers 12-15   (virtual stage 3) + layers 28-31 (virtual stage 7)

The total number of virtual pipeline stages is P × V. The pipeline sees P × V stages, but each “stage” is smaller (fewer layers). A micro-batch passes through all P × V virtual stages, bouncing between GPUs multiple times. The critical insight: each virtual stage takes 1/V the time to compute, so the pipeline fill and drain time is divided by V.

The bubble fraction becomes:

Bubble fraction (interleaved) = (P - 1) / (M × V)

Note that V appears in the denominator multiplicatively. With P = 8, M = 32, V = 4:

Bubble = 7 / (32 × 4) = 7/128 ≈ 5.5%

Compare this to GPipe’s 7/39 ≈ 17.9% with the same M.

Bubble Fraction: Standard vs Interleaved Pipeline (P=8)

| Schedule | Formula | Bubble |
|---|---|---|
| GPipe, M=32 | (P-1)/(P-1+M) | 17.9% |
| Interleaved V=2, M=32 | (P-1)/(M*V) | 10.9% |
| Interleaved V=4, M=32 | (P-1)/(M*V) | 5.5% |
| Interleaved V=8, M=32 | (P-1)/(M*V) | 2.7% |
| Interleaved V=4, M=64 | (P-1)/(M*V) | 2.7% |

The Communication Cost of Interleaving

There is no free lunch. With V virtual stages per GPU, a micro-batch must transfer activations between GPUs P × V - 1 times instead of P - 1 times. Each transfer is the same size as before (the activation tensor is determined by the hidden dimension, which is unchanged), but there are V times as many of them.

More precisely, a standard pipeline performs P - 1 inter-device activation transfers per micro-batch per direction; an interleaved pipeline with V virtual stages performs P × V - 1. Since the volume per transfer is identical, total communication increases by roughly a factor of V.

This trade-off works well when:

  • The network bandwidth between GPUs is high relative to compute time per stage
  • The bubble reduction from interleaving outweighs the communication overhead
  • You use NVLink within a node (where bandwidth is abundant)

It works poorly when:

  • Stages are already communication-bound
  • Inter-node bandwidth is limited (the extra transfers saturate the link)
class InterleavedPipelineSchedule:
    """
    Interleaved (virtual) pipeline stages as in Megatron-LM.

    Each GPU holds V non-consecutive chunks of layers.
    Bubble fraction: (P-1) / (M*V) instead of (P-1) / (P-1+M).
    Communication increases by factor V.
    """
    def __init__(self, num_gpus: int, virtual_stages: int, num_microbatches: int):
        self.P = num_gpus
        self.V = virtual_stages
        self.M = num_microbatches
        self.total_virtual_stages = num_gpus * virtual_stages

    def bubble_fraction(self) -> float:
        return (self.P - 1) / (self.M * self.V)

    def communication_overhead_ratio(self) -> float:
        """Ratio of communication vs standard pipeline."""
        standard_transfers = self.P - 1
        interleaved_transfers = self.P * self.V - 1
        return interleaved_transfers / standard_transfers

    def layer_assignment(self, gpu_id: int, total_layers: int) -> list:
        """Which layers does this GPU hold?"""
        layers_per_virtual_stage = total_layers // self.total_virtual_stages
        assigned = []
        for v in range(self.V):
            virtual_stage_id = gpu_id + v * self.P
            start = virtual_stage_id * layers_per_virtual_stage
            end = start + layers_per_virtual_stage
            assigned.extend(range(start, end))
        return assigned
📊 Interleaved Pipeline: Bubble vs Communication Trade-off (P=8, M=32)

| Virtual Stages (V) | Bubble % | Comm Overhead vs V=1 | Net Throughput Gain |
|---|---|---|---|
| V=1 (standard) | 17.9% | 1.0x | Baseline |
| V=2 | 10.9% | 2.1x | +5-8% |
| V=4 | 5.5% | 4.4x | +8-12% |
| V=8 | 2.7% | 9.0x | +5-10% |
| V=16 | 1.4% | 18.1x | Comm-bound |
Optimal V in Practice

Megatron-LM experiments show V = 2 to V = 4 is usually optimal for inter-node pipeline parallelism. Beyond V = 4, the communication overhead from extra activation transfers typically outweighs the bubble reduction. For intra-node pipelines over NVLink, higher V can be beneficial because NVLink bandwidth is abundant.

DualPipe: Bidirectional Pipeline (DeepSeek-V3)

DeepSeek-V3 (DeepSeek-AI, 2024) introduced DualPipe, a bidirectional pipeline schedule that achieves near-zero bubble overhead. The core idea: send micro-batches from both ends of the pipeline simultaneously. Half the micro-batches flow forward (stage 0 to stage P-1) and the other half flow backward (stage P-1 to stage 0).

The insight is that when a GPU is executing the forward pass for a micro-batch flowing left-to-right, it can simultaneously execute the backward pass for a micro-batch flowing right-to-left. The forward computation of one micro-batch overlaps with the backward computation of another — they use different parts of the compute resources (forward is mostly matrix multiplications and activations; backward includes gradient computation and weight gradient accumulation).

DualPipe Schedule (P=4, simplified):

Left-to-right micro-batches: L0, L1, L2, L3
Right-to-left micro-batches: R0, R1, R2, R3

Time -->   |   1   |   2   |   3   |   4   |   5   |   6   |
Stage 0:   | F(L0) | F(L1) |F(L2)+B(R0)|F(L3)+B(R1)| B(R2) | B(R3) |
Stage 1:   |       | F(L0) |F(L1)+B(R0)|F(L2)+B(R1)| ...   |       |
Stage 2:   |       | F(R0) |F(R1)+B(L0)|F(R2)+B(L1)| ...   |       |
Stage 3:   | F(R0) | F(R1) |F(R2)+B(L0)|F(R3)+B(L1)| B(L2) | B(L3) |

In steady state, each GPU runs forward + backward simultaneously.
The bubble is only the initial fill and final drain.

The key innovation: in steady state, every GPU is executing both a forward pass and a backward pass at every time step. The forward pass for one direction’s micro-batch overlaps with the backward pass for the other direction’s micro-batch. This means the bubble is only in the pipeline fill and drain, which is minimal.

DualPipe achieves an effective bubble fraction that approaches:

Bubble fraction (DualPipe) ≈ (P - 1) / (2M)

The factor of 2 improvement over standard 1F1B comes from the bidirectional overlap. For P = 8, M = 32:

Bubble ≈ 7/64 ≈ 10.9%

But in practice, the overlap between forward and backward computation can hide even more of the bubble. DeepSeek reports near-zero effective bubble with sufficient micro-batches.

Bubble Comparison Across Pipeline Schedules (P=8, M=32)

| Schedule | Formula | Bubble |
|---|---|---|
| Naive (no micro-batching) | (P-1)/P | 87.5% |
| GPipe | (P-1)/(P-1+M) | 17.9% |
| 1F1B | same bubble as GPipe, less memory | 17.9% |
| Interleaved V=4 | (P-1)/(M*V) | 5.5% |
| DualPipe | (P-1)/(2M), further hidden by overlap | 10.9% |
class DualPipeSchedule:
    """
    DualPipe bidirectional pipeline schedule (DeepSeek-V3).

    Micro-batches flow from both ends simultaneously.
    Forward of left-to-right overlaps with backward of right-to-left.
    """
    def __init__(self, stages: int, num_microbatches: int):
        self.P = stages
        self.M = num_microbatches
        # Split micro-batches into two streams
        self.M_left = num_microbatches // 2   # flow left-to-right
        self.M_right = num_microbatches // 2  # flow right-to-left

    def bubble_fraction(self) -> float:
        """Theoretical bubble before overlap hiding."""
        return (self.P - 1) / (2 * self.M)

    def effective_bubble(self, overlap_efficiency: float = 0.8) -> float:
        """
        Effective bubble after forward/backward overlap.
        overlap_efficiency: fraction of bubble hidden by overlapping
        forward and backward of different micro-batches.
        """
        theoretical = self.bubble_fraction()
        return theoretical * (1 - overlap_efficiency)

    def schedule_for_stage(self, stage_id: int):
        """
        Each stage handles two streams:
        - Left stream: micro-batches 0..M_left-1 flowing stage 0 -> P-1
        - Right stream: micro-batches 0..M_right-1 flowing stage P-1 -> 0

        In steady state, forward(left) overlaps with backward(right)
        and forward(right) overlaps with backward(left).
        """
        actions = []

        # Warmup: fill pipeline from both ends
        for t in range(stage_id):
            actions.append(('idle', t))

        # Steady state: overlapped forward + backward
        for mb in range(self.M_left):
            actions.append(('forward_left + backward_right', mb))

        # Cooldown: drain from both ends
        for t in range(self.P - 1 - stage_id):
            actions.append(('drain', t))

        return actions
ℹ️ DualPipe Requirements

DualPipe requires that forward and backward computation can overlap efficiently on the same GPU. This works well for transformer models where forward is dominated by matrix multiplications and backward by gradient matrix multiplications — they can overlap on different CUDA streams if the GPU has sufficient compute units and memory bandwidth. Models with heavy sequential dependencies or irregular computation patterns may not overlap as well.

Memory Analysis: The Activation Storage Problem

Every pipeline schedule must store activations (intermediate results from the forward pass) until the corresponding backward pass consumes them. The number of stored activations determines peak memory usage, which directly limits model size and micro-batch size.

Activation Memory Per Micro-Batch

For a transformer layer with hidden dimension h, sequence length s, and micro-batch size b, the stored activations include:

  • Input to each layer: b × s × h elements
  • Attention intermediate states: b × n_heads × s × s elements (for full attention)
  • MLP intermediate activations: b × s × 4h elements (for standard FFN)
  • Layer norm inputs: b × s × h elements

In BF16 (2 bytes per element), a single transformer layer’s activations are approximately:

Act per layer ≈ b × s × (10h + n_heads × s) × 2 bytes

For typical values (b = 1, s = 4096, h = 8192, n_heads = 64):

Act per layer ≈ 1 × 4096 × (81920 + 262144) × 2 ≈ 2.8 GB

If a pipeline stage has 10 layers, the activation memory per micro-batch per stage is roughly 28 GB. Now multiply by the number of in-flight micro-batches.
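The per-layer estimate above, written out as a hypothetical helper:

```python
def activation_bytes_per_layer(b: int, s: int, h: int, n_heads: int,
                               bytes_per_el: int = 2) -> int:
    """~10h bytes-worth of elements per token (layer inputs, MLP
    intermediates, norms) plus the b x n_heads x s x s attention
    score matrices, stored in BF16 (2 bytes per element)."""
    return b * s * (10 * h + n_heads * s) * bytes_per_el

per_layer = activation_bytes_per_layer(1, 4096, 8192, 64)
print(per_layer / 1e9)        # ~2.8 GB per layer per micro-batch
print(10 * per_layer / 1e9)   # ~28 GB for a 10-layer stage, per micro-batch
```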

GPU Memory Budget Per Pipeline Stage (H100 80GB)

Memory allocation for a single pipeline stage in a 405B model split across 8 GPUs

| Component | Size | Notes |
|---|---|---|
| Model Parameters (BF16) | ~12.5 GB | 405B / 8 PP stages ≈ 50B params per stage, split further by TP within the node to ~12.5 GB of BF16 weights per GPU |
| Optimizer States (Adam) | ~25 GB | FP32 master weights + momentum + variance |
| Gradients (BF16) | ~12.5 GB | Same size as parameters |
| Activations (in-flight) | ~20-30 GB | P micro-batch activations for 1F1B, or M for GPipe |
| Available / Buffers | ~0-10 GB | Communication buffers, temporary tensors, CUDA overhead |
📊 Activation Memory Per Stage: GPipe vs 1F1B (10 layers/stage, b=1, s=4096, h=8192)

| Schedule | In-Flight MBs | Activation Memory | Feasible on 80GB H100? |
|---|---|---|---|
| GPipe M=4 | 4 | ~11 GB | Yes (tight) |
| GPipe M=16 | 16 | ~45 GB | No -- exceeds budget |
| GPipe M=32 | 32 | ~90 GB | No -- far exceeds |
| 1F1B P=4 | 4 | ~11 GB | Yes |
| 1F1B P=8 | 8 | ~22 GB | Yes |
| 1F1B P=16 | 16 | ~45 GB | Marginal |

Activation Checkpointing (Recomputation)

Activation checkpointing (also called gradient checkpointing or activation recomputation) is the standard solution to the activation memory problem. Instead of storing all intermediate activations, you store only the input activation at each layer boundary and recompute the intermediate activations during the backward pass.

The trade-off is straightforward: you trade compute for memory. With full activation checkpointing, you store only 1 activation tensor per layer instead of ~4-5, reducing activation memory by roughly 4-5x. The cost is recomputing the forward pass during backward, adding approximately 33% more total compute (the forward pass is about 1/3 of the total forward + backward compute).

import torch
import torch.utils.checkpoint

class ActivationCheckpointing:
    """
    Activation checkpointing reduces memory by recomputing
    activations during the backward pass instead of storing them.
    """
    def __init__(self, checkpoint_every_n_layers: int = 1):
        self.checkpoint_interval = checkpoint_every_n_layers

    def memory_reduction(self, layers_per_stage: int) -> float:
        """Factor of memory reduction vs. no checkpointing."""
        # Without checkpointing: ~5 activation tensors per layer
        # (the input plus intermediates).
        # With checkpointing: one stored input per segment, plus the
        # live segment being recomputed during the backward pass.
        checkpoints = layers_per_stage // self.checkpoint_interval
        live_segment = self.checkpoint_interval * 5
        return layers_per_stage * 5 / (checkpoints + live_segment)

    def compute_overhead(self) -> float:
        """Additional compute as a fraction of total."""
        # The forward pass is re-run during backward.
        # Forward is ~1/3 of total (forward + backward) compute,
        # so recomputation adds ~33%.
        return 0.33

    def forward_with_checkpointing(self, layers, input_tensor):
        """
        Store activations only at segment boundaries; intermediates
        inside each segment are recomputed during backward.
        """
        def make_segment(start, end):
            def segment(x):
                for layer in layers[start:end]:
                    x = layer(x)
                return x
            return segment

        x = input_tensor
        for start in range(0, len(layers), self.checkpoint_interval):
            end = min(start + self.checkpoint_interval, len(layers))
            # checkpoint() stores the segment input, not intermediates
            x = torch.utils.checkpoint.checkpoint(
                make_segment(start, end), x, use_reentrant=False)
        return x

Selective activation checkpointing is a refinement: only recompute the cheap-to-recompute activations (like attention scores, which are memory-heavy but compute-cheap relative to matrix multiplications). This gives most of the memory savings with less compute overhead.

Activation Checkpointing Trade-offs

| Strategy | Memory Savings | Compute Overhead | Typical Use Case |
|---|---|---|---|
| No checkpointing | 0% | 0% | Small models, abundant memory |
| Full checkpointing | ~80% | ~33% | Very large models, memory-constrained |
| Selective (attn only) | ~50-60% | ~10-15% | Medium models, balanced trade-off |
| Every-2-layers | ~50% | ~16% | Moderate memory pressure |

Combining Pipeline, Tensor, and Data Parallelism

In practice, large-scale training uses all three parallelism strategies simultaneously. The typical recipe for a cluster with N total GPUs, where each node has G GPUs connected by NVLink:

  • Tensor parallelism (TP) within a node: TP = G (e.g., 8 GPUs per node). All-reduce operations stay on NVLink.
  • Pipeline parallelism (PP) across nodes: PP = number of nodes in a pipeline. Point-to-point activation transfers go over InfiniBand.
  • Data parallelism (DP) across pipeline replicas: DP = N / (TP × PP). Gradient all-reduce across DP replicas.

Total GPUs: N = TP × PP × DP.

Example: Training Llama 405B on 1024 H100s (128 nodes, 8 GPUs/node)

TP = 8  (within each node, NVLink)
PP = 8  (across 8 nodes per pipeline, InfiniBand)
DP = 16 (16 pipeline replicas, InfiniBand)

Total = 8 * 8 * 16 = 1024 GPUs

Each pipeline replica:
  - 8 nodes, 64 GPUs
  - Each node holds one pipeline stage
  - Within each node, 8 GPUs share one stage via TP
  - Model: 126 layers / 8 stages = ~16 layers per stage
  - Each TP group of 8 GPUs holds 16 layers with weights sharded 8-way

Data parallelism:
  - 16 replicas of this 64-GPU pipeline
  - Global batch split 16 ways
  - Gradient all-reduce across 16 replicas after each mini-batch
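The arithmetic in this layout can be checked mechanically (a standalone sketch of the decomposition above):

```python
tp, pp, dp = 8, 8, 16
gpus_per_node = 8
num_layers = 126  # Llama 405B

total_gpus = tp * pp * dp                          # GPUs in the whole job
replica_gpus = tp * pp                             # one pipeline replica
nodes_per_replica = replica_gpus // gpus_per_node  # nodes per replica
layers_per_stage = num_layers / pp                 # fractional -> rounded in practice

print(total_gpus)         # 1024
print(replica_gpus)       # 64
print(nodes_per_replica)  # 8
print(layers_per_stage)   # 15.75 -> ~16 layers per stage
```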

3D Parallelism Layout: 1024 GPUs for 405B Model

TP=8 (intra-node NVLink), PP=8 (inter-node IB), DP=16 (across pods)

  • Tensor Parallel Group (TP=8): 8 GPUs within one node, connected by NVLink (900 GB/s). All-reduce per layer; highest communication volume.
  • Pipeline Parallel Group (PP=8): 8 nodes (64 GPUs), connected by InfiniBand (400 Gb/s). Point-to-point activation transfers at stage boundaries.
  • Data Parallel Group (DP=16): 16 pipeline replicas (1024 GPUs total). All-reduce gradients once per mini-batch, overlapped with compute.

Communication Patterns in 3D Parallelism

Each parallelism dimension has a distinct communication pattern:

Tensor parallelism (most frequent, highest bandwidth):

  • 2 all-reduce operations per transformer layer
  • Volume: O(b × s × h) per all-reduce
  • Must complete within the compute time of one layer
  • Requires NVLink-class bandwidth (hundreds of GB/s)

Pipeline parallelism (moderate frequency, moderate bandwidth):

  • 1 point-to-point transfer per pipeline stage boundary per micro-batch
  • Volume: O(b × s × h) per transfer
  • Can overlap with compute at adjacent stages
  • InfiniBand sufficient (tens of GB/s)

Data parallelism (least frequent, can overlap):

  • 1 all-reduce of all gradients per mini-batch
  • Volume: O(total parameters)
  • Can overlap with backward pass computation (ring all-reduce)
  • InfiniBand sufficient, can use network efficiently

class ThreeDParallelConfig:
    """
    Configuration for 3D parallelism: TP + PP + DP.
    """
    def __init__(self, tp: int, pp: int, dp: int,
                 gpus_per_node: int = 8):
        self.tp = tp
        self.pp = pp
        self.dp = dp
        self.total_gpus = tp * pp * dp
        self.gpus_per_node = gpus_per_node
        self.nodes = self.total_gpus // gpus_per_node

        # Validate: TP should fit within a node
        assert tp <= gpus_per_node, \
            f"TP={tp} exceeds GPUs per node={gpus_per_node}"

    def communication_volume_per_step(self, model_params: int,
                                       hidden: int, seq_len: int,
                                       micro_batch: int,
                                       layers_per_stage: int) -> dict:
        """Estimate communication volumes per training step."""
        activation_size = micro_batch * seq_len * hidden * 2  # BF16

        # TP: 2 all-reduces per layer, within node
        tp_volume = 2 * layers_per_stage * activation_size * (self.tp - 1) / self.tp
        tp_link = "NVLink"

        # PP: 1 point-to-point per stage boundary per micro-batch
        # Number of micro-batches depends on schedule
        pp_volume = activation_size  # per micro-batch per boundary
        pp_link = "InfiniBand" if self.tp == self.gpus_per_node else "NVLink"

        # DP: all-reduce of all gradients, once per mini-batch
        dp_volume = model_params * 2  # BF16 gradients
        dp_link = "InfiniBand"

        return {
            'tp': {'volume_bytes': tp_volume, 'link': tp_link,
                   'frequency': 'per layer per micro-batch'},
            'pp': {'volume_bytes': pp_volume, 'link': pp_link,
                   'frequency': 'per micro-batch per boundary'},
            'dp': {'volume_bytes': dp_volume, 'link': dp_link,
                   'frequency': 'once per mini-batch (overlapped)'},
        }
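Plugging in the 405B shapes (h = 13824, s = 4096, b = 1) gives a feel for the magnitudes. A standalone sketch, counting each layer's two all-reduces as roughly 2x the activation tensor size (exact volumes depend on the all-reduce algorithm):

```python
b, s, h = 1, 4096, 13824      # micro-batch, sequence, hidden
params = 405e9                # total parameter count

act = b * s * h * 2           # one BF16 activation tensor, in bytes

pp_per_transfer = act         # point-to-point at each stage boundary
tp_per_layer = 2 * act        # two all-reduces per transformer layer
dp_per_step = params * 2      # BF16 gradients, once per mini-batch

print(f"{pp_per_transfer / 2**20:.0f} MiB")   # ~108 MiB
print(f"{tp_per_layer / 2**20:.0f} MiB")      # ~216 MiB
print(f"{dp_per_step / 1e9:.0f} GB")          # 810 GB
```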
Communication Volume by Parallelism Dimension (405B, h=13824, s=4096, b=1)

| Dimension | Volume Per Event | Frequency | Link Type | Latency Sensitivity |
|---|---|---|---|---|
| TP (all-reduce) | ~214 MB | 2x per layer per micro-batch | NVLink | Very high |
| PP (point-to-point) | ~107 MB | 1x per stage boundary | InfiniBand | Moderate |
| DP (all-reduce) | ~810 GB | 1x per mini-batch | InfiniBand | Low (overlapped) |

Expert Parallelism and MoE

For Mixture-of-Experts models like DeepSeek-V3, there is a fourth dimension: expert parallelism (EP). Experts are distributed across GPUs, and an all-to-all communication pattern routes tokens to the correct expert. This adds another layer of complexity to the parallelism strategy but is orthogonal to pipeline parallelism.

Practical Concerns: Load Balancing Pipeline Stages

For pipeline parallelism to be efficient, every stage must take approximately the same time to compute. If one stage takes twice as long, every other stage waits for it, creating a bottleneck worse than the inherent bubble.

Transformer models are relatively easy to balance because every layer has the same architecture and roughly the same compute cost. You simply assign equal numbers of layers per stage: L / P layers each. The first and last stages may have slight additional work (embedding lookup, output projection), but this is minor.

Non-uniform architectures (models with varying layer sizes, skip connections across stages, or different computation at different depths) are harder to balance. Profiling each layer’s compute time and partitioning to equalize stage times is necessary.

def balance_pipeline_stages(layer_compute_times: list, num_stages: int) -> list:
    """
    Partition layers into contiguous stages with balanced compute times.
    Uses dynamic programming to minimize the maximum stage time.

    Args:
        layer_compute_times: list of per-layer compute times in ms
        num_stages: number of pipeline stages P

    Returns:
        list of (start_layer, end_layer) tuples per stage (inclusive)
    """
    n = len(layer_compute_times)
    prefix_sum = [0.0] * (n + 1)
    for i in range(n):
        prefix_sum[i + 1] = prefix_sum[i] + layer_compute_times[i]

    # dp[s][j] = min possible max-stage-time using s stages for first j layers
    # choice[s][j] = split point k that achieves dp[s][j]
    INF = float('inf')
    dp = [[INF] * (n + 1) for _ in range(num_stages + 1)]
    choice = [[0] * (n + 1) for _ in range(num_stages + 1)]
    dp[0][0] = 0.0

    for s in range(1, num_stages + 1):
        for j in range(s, n + 1):
            for k in range(s - 1, j):
                stage_time = prefix_sum[j] - prefix_sum[k]
                cand = max(dp[s - 1][k], stage_time)
                if cand < dp[s][j]:
                    dp[s][j] = cand
                    choice[s][j] = k

    # Backtrack through the recorded split points to recover the partition
    stages, j = [], n
    for s in range(num_stages, 0, -1):
        k = choice[s][j]
        stages.append((k, j - 1))
        j = k
    return stages[::-1]
⚠️ Embedding and Output Layers

The first pipeline stage typically includes the token embedding layer, and the last stage includes the output projection (language model head). For large-vocabulary models (e.g., 128K tokens), the embedding and output projection can be significant: a 128K vocabulary with hidden size 8192 in BF16 is about 2 GB. This asymmetry means the first and last stages have more memory pressure and slightly different compute profiles. Some implementations split the embedding across stages or tie the embedding and output projection weights to mitigate this.
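The 2 GB figure follows directly from the shapes (a sketch, taking "128K" as 128,000 tokens):

```python
vocab_size, hidden = 128_000, 8192
embedding_bytes = vocab_size * hidden * 2  # BF16, 2 bytes per element
print(embedding_bytes / 1e9)               # ~2.1 GB
```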

When Pipeline Parallelism Is the Wrong Choice

Pipeline parallelism is not always the answer. There are scenarios where it introduces more overhead than it solves:

Models That Fit on One Node

If a model fits within a single node’s aggregate GPU memory (e.g., a 70B model on 8 H100s with 640 GB total), tensor parallelism alone is usually better. TP within a node over NVLink has:

  • Lower latency than PP’s pipeline bubble
  • No idle time from fill/drain phases
  • Simpler implementation and debugging

Pipeline parallelism only becomes necessary when you need to span multiple nodes, where TP’s all-reduce latency over InfiniBand becomes prohibitive.

Very Small Micro-Batch Sizes

When the global batch size is small or cannot be divided into enough micro-batches, the bubble dominates. If M is close to P, the bubble fraction approaches 50% even with 1F1B:

Bubble fraction = (P − 1) / (P − 1 + M); at M = P this becomes (P − 1) / (2P − 1) ≈ 50%

At small M, you are better off using fewer pipeline stages (accepting less parallelism) or using other strategies entirely.
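The effect is easy to see numerically (a sketch of the 1F1B bubble formula):

```python
def bubble_fraction(P: int, M: int) -> float:
    # GPipe / 1F1B bubble: (P - 1) / (P - 1 + M)
    return (P - 1) / (P - 1 + M)

print(bubble_fraction(8, 8))   # ~0.467 -- nearly half the schedule is idle
print(bubble_fraction(8, 64))  # ~0.099 -- bubble shrinks with more MBs
```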

Latency-Sensitive Inference

Pipeline parallelism is designed for throughput, not latency. For a single request:

  • The request must pass through all PP stages sequentially
  • Latency = P × (per-stage latency) + (P − 1) × (inter-stage communication)
  • No opportunity for micro-batching (only one sequence)

For inference, tensor parallelism is preferred because it reduces per-request latency by parallelizing each layer’s computation. Pipeline parallelism can help inference throughput when processing many concurrent requests (similar to micro-batching), but for latency-critical single-request scenarios, it adds overhead.
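The latency formula can be made concrete (a sketch with hypothetical per-stage and per-hop numbers):

```python
def pp_single_request_latency_ms(P: int, per_stage_ms: float,
                                 comm_ms: float) -> float:
    # A single request traverses all P stages sequentially
    return P * per_stage_ms + (P - 1) * comm_ms

# Hypothetical: 5 ms of compute per stage, 0.5 ms per inter-stage hop
print(pp_single_request_latency_ms(8, 5.0, 0.5))  # 43.5 ms
```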

Heterogeneous Hardware

Pipeline parallelism assumes all stages take equal time. If your cluster has mixed GPU generations (e.g., some A100s and some H100s), the slowest GPU becomes the bottleneck for every stage. DP is more forgiving of heterogeneous hardware because slower workers simply process smaller batch portions.

When to Use Each Parallelism Strategy

| Scenario | Best Strategy | Rationale | Recommendation |
|---|---|---|---|
| 70B on 8xH100 (1 node) | TP=8 | No bubble, NVLink fast enough | TP only |
| 405B on 64 GPUs (8 nodes) | TP=8, PP=8 | Need PP for inter-node | TP + PP |
| 405B on 1024 GPUs | TP=8, PP=8, DP=16 | Full 3D parallelism | TP + PP + DP |
| 7B model, many GPUs | DP only | Model fits on 1 GPU, PP bubble wasted | DP |
| Single-request inference | TP | PP adds latency, no micro-batching | TP |
| High-throughput inference | TP + PP | PP helps throughput with batched requests | TP + PP |

Summary of Pipeline Schedules

The evolution of pipeline parallelism schedules represents steady progress toward eliminating the bubble:

Pipeline Schedule Evolution

| Schedule | Year | Bubble Formula | Peak Activations | Key Innovation |
|---|---|---|---|---|
| Naive | N/A | (P-1)/P | 1 | No micro-batching |
| GPipe | 2019 | (P-1)/(P-1+M) | M | Micro-batch pipelining |
| 1F1B (PipeDream-Flush) | 2019 | (P-1)/(P-1+M) | P | Alternating F/B, bounded memory |
| Interleaved (Megatron) | 2021 | (P-1)/(M*V) | P | Virtual stages, smaller bubble |
| DualPipe (DeepSeek-V3) | 2024 | ~(P-1)/(2M) | ~2P | Bidirectional, overlap F+B |

def compare_schedules(P: int, M: int, V: int = 1) -> dict:
    """
    Compare bubble fractions and memory requirements
    across pipeline parallelism schedules.

    Args:
        P: number of pipeline stages
        M: number of micro-batches
        V: number of virtual stages (for interleaved schedule)
    """
    return {
        'naive': {
            'bubble': (P - 1) / P,
            'peak_activations': 1,
            'description': 'Single batch, no pipelining'
        },
        'gpipe': {
            'bubble': (P - 1) / (P - 1 + M),
            'peak_activations': M,
            'description': 'All forwards then all backwards'
        },
        '1f1b': {
            'bubble': (P - 1) / (P - 1 + M),
            'peak_activations': P,
            'description': 'Alternating forward and backward'
        },
        'interleaved': {
            'bubble': (P - 1) / (M * V),
            'peak_activations': P,
            'communication_factor': V,
            'description': f'Virtual stages V={V}, more comm but less bubble'
        },
        'dualpipe': {
            'bubble': (P - 1) / (2 * M),
            'peak_activations': 2 * P,  # approximate
            'description': 'Bidirectional, forward+backward overlap'
        }
    }

# Example: P=8 stages, M=32 micro-batches, V=4 virtual stages
results = compare_schedules(P=8, M=32, V=4)
# naive:       87.5% bubble
# gpipe:       17.9% bubble, 32 activations
# 1f1b:        17.9% bubble, 8 activations
# interleaved: 5.5% bubble, 8 activations, 4x comm
# dualpipe:    10.9% bubble (before overlap hiding), ~16 activations

Effective Throughput Relative to Ideal (P=8, M=32)

| Schedule | Bubble | % of Ideal |
|---|---|---|
| Ideal (no bubble) | -- | 100% |
| Naive | 87.5% | 12.5% |
| GPipe | 17.9% | 82.1% |
| 1F1B | 17.9% (same throughput, 4x less memory) | 82.1% |
| Interleaved V=4 | 5.5% | 94.5% |
| DualPipe | ~4% effective with overlap | ~96% |

Implementation Reference: Frameworks and Their Schedules

Major distributed training frameworks implement different subsets of these schedules:

Megatron-LM (NVIDIA): Supports 1F1B and interleaved schedules. The de facto standard for large-scale LLM training. Tight integration with TP and DP.

DeepSpeed (Microsoft): Supports GPipe-style and 1F1B schedules via its pipeline engine. Part of the ZeRO ecosystem for memory optimization.

FairScale / PyTorch FSDP: PyTorch-native pipeline parallelism with 1F1B schedule. Integrates with FSDP for data parallelism.

DeepSeek’s training infrastructure: Custom DualPipe implementation. Not yet available as a general-purpose library (as of early 2025).

# Megatron-LM example configuration for 3D parallelism
megatron_config = {
    'tensor_model_parallel_size': 8,        # TP = 8 (intra-node)
    'pipeline_model_parallel_size': 8,       # PP = 8 (inter-node)
    'data_parallel_size': 16,                # DP = 16
    'virtual_pipeline_model_parallel_size': 4,  # V = 4 (interleaved)
    'num_micro_batches': 32,                 # M = 32
    'global_batch_size': 2048,               # split across DP replicas
    'micro_batch_size': 4,                   # per micro-batch
    'sequence_length': 4096,
    'num_layers': 126,                       # Llama 405B
    'hidden_size': 13824,
    'num_attention_heads': 128,
    'fp16': False,
    'bf16': True,
    'recompute_activations': True,           # activation checkpointing
    'recompute_granularity': 'selective',     # only recompute attention
}
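One consistency check worth automating: the batch dimensions in a config like this must satisfy global_batch_size = DP × micro_batch_size × num_micro_batches. A standalone sketch:

```python
cfg = {'data_parallel_size': 16, 'micro_batch_size': 4,
       'num_micro_batches': 32, 'global_batch_size': 2048}

derived = (cfg['data_parallel_size'] * cfg['micro_batch_size']
           * cfg['num_micro_batches'])
assert derived == cfg['global_batch_size'], "inconsistent batch sizes"
print(derived)  # 2048
```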

Conclusion

Pipeline parallelism exists to solve a specific problem: training models that exceed single-node memory while avoiding the bandwidth penalty of cross-node tensor parallelism. The core challenge is the pipeline bubble — idle GPU time during pipeline fill and drain.

The evolution from naive pipelining through GPipe, 1F1B, interleaved stages, and DualPipe represents a progression from 87.5% idle time to under 5% effective idle time for typical configurations. Each advance either shrinks the bubble (more micro-batches, virtual stages, bidirectional flow) or reduces the memory cost of shrinking it (1F1B’s bounded activation storage, activation checkpointing).

The practical recipe for large-scale training has converged: tensor parallelism within a node (where NVLink provides the bandwidth for frequent all-reduce), pipeline parallelism across nodes (where only activation tensors cross the slower InfiniBand links), and data parallelism across pipeline replicas (where gradient all-reduce can overlap with backward computation). Understanding the trade-offs between bubble fraction, activation memory, and communication volume is essential for configuring these systems efficiently.

The bubble is the tax you pay for splitting a model across devices. The history of pipeline parallelism is the history of minimizing that tax.