Part of series: Inference Optimization Timeline (42 of 60)

Standard data-parallel training on 64 GPUs stores 64 identical copies of your Adam optimizer state. That’s 63 copies you’re paying for that do nothing except consume memory. A 70B parameter model trained with mixed-precision Adam needs 16 bytes per parameter: 2 for FP16 weights, 2 for FP16 gradients, 4 for FP32 master weights, 4 for momentum, 4 for variance. Multiply by 70 billion parameters and you get 1.12 TB per GPU. An H100 has 80 GB of HBM. The model doesn’t fit, period. ZeRO eliminates this redundancy by partitioning optimizer state across the data-parallel group so each GPU stores only 1/64th of it. This unlocks training runs that were physically impossible before, but the tradeoff is communication overhead every optimizer step. This post covers the memory math, the three ZeRO stages, communication costs, and when ZeRO beats alternatives like tensor parallelism.

Why ZeRO Exists: The Memory Arithmetic of Large Model Training

To understand ZeRO, you first need to understand exactly where memory goes during training. Consider a model with Φ parameters trained with mixed-precision Adam (the most common setup for LLMs).

Per-Parameter Memory Breakdown

In mixed-precision training with Adam, each parameter consumes:

  • FP16 parameters: 2 bytes per parameter (the weights used in forward/backward)
  • FP16 gradients: 2 bytes per parameter (computed during backpropagation)
  • FP32 master copy of parameters: 4 bytes per parameter (Adam updates happen in FP32 for numerical stability)
  • FP32 first moment (momentum): 4 bytes per parameter
  • FP32 second moment (variance): 4 bytes per parameter

Total per parameter: 2 + 2 + 4 + 4 + 4 = 16 bytes.

The optimizer states alone (FP32 copy + momentum + variance) account for 12 bytes per parameter — 75% of the total memory. The parameters and gradients that the model actually uses for computation are only 4 bytes combined.

ℹ️ The 16-Byte Rule

For mixed-precision Adam, multiply parameter count by 16 to get total training memory in bytes. A 1B parameter model needs 16 GB. A 70B parameter model needs 1,120 GB (1.12 TB). No single GPU comes close.
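The rule is trivial to encode. A minimal sketch (the function name is ours, not a DeepSpeed API):

```python
def training_memory_gb(num_params: float) -> float:
    """Total training-state memory in GB for mixed-precision Adam:
    2 B fp16 weights + 2 B fp16 grads + 4 B fp32 master
    + 4 B momentum + 4 B variance = 16 B per parameter."""
    bytes_per_param = 2 + 2 + 4 + 4 + 4  # = 16
    return num_params * bytes_per_param / 1e9

print(training_memory_gb(1e9))   # 16.0 GB
print(training_memory_gb(70e9))  # 1120.0 GB
```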

Concrete Numbers for a 70B Model

Let Φ = 70 × 10⁹ parameters.

Component        | Bytes per Param | Total for 70B
FP16 Parameters  | 2               | 140 GB
FP16 Gradients   | 2               | 140 GB
FP32 Master Copy | 4               | 280 GB
FP32 Momentum    | 4               | 280 GB
FP32 Variance    | 4               | 280 GB
Total            | 16              | 1,120 GB

An NVIDIA A100 has 80 GB of HBM. An H100 has 80 GB. Even the upcoming B200 tops out at 192 GB. A single GPU cannot hold 1.12 TB.

Memory Layout: 70B Model with Mixed-Precision Adam (Single GPU, No Sharding)

Total: 1,120 GB -- does not fit on any single GPU

  • FP16 Parameters (140 GB): weights used in the forward/backward pass
  • FP16 Gradients (140 GB): computed during backpropagation
  • FP32 Master Copy (280 GB): full-precision copy for Adam updates
  • FP32 Momentum m (280 GB): first moment estimate
  • FP32 Variance v (280 GB): second moment estimate

The Redundancy Problem in Data Parallelism

Standard data-parallel (DP) training replicates the entire model on every GPU. With 8 GPUs, you store 8 copies of everything: 8 × 1,120 GB = 8,960 GB of total memory, but you only have 8 × 80 = 640 GB available. The model does not fit.

Even if you had enough memory, the redundancy is wasteful. Every GPU holds the same optimizer states, does the same Adam update on different gradients, and ends up with the same updated parameters. The only unique work each GPU does is computing gradients on its own mini-batch. ZeRO exploits this observation.

ZeRO Stages: Eliminating Redundancy Progressively

ZeRO comes in three stages, each sharding an additional component across the data-parallel group. Let N_d be the data-parallel degree (number of GPUs).

Stage 1: Partition Optimizer States (P_os)

ZeRO Stage 1 partitions only the optimizer states across GPUs. Each GPU stores 1/N_d of the optimizer states but keeps a full copy of both parameters and gradients.

Memory per GPU with ZeRO-1:

M_ZeRO-1 = 2Φ + 2Φ + 12Φ/N_d = 4Φ + 12Φ/N_d

For our 70B model on 8 GPUs:

M_ZeRO-1 = 4(70×10⁹) + 12(70×10⁹)/8 bytes = 280 GB + 105 GB = 385 GB per GPU

Still does not fit on an 80 GB GPU, but this is a dramatic improvement over 1,120 GB. With 64 GPUs, it drops to:

M_ZeRO-1 = 280 + 840/64 = 280 + 13.1 = 293.1 GB per GPU

Still too large because of the full parameter and gradient copies. This is why Stage 1 alone is insufficient for very large models.

How it works at runtime:

  1. Forward pass: all GPUs use their local copy of parameters (normal).
  2. Backward pass: all GPUs compute gradients (normal), then all-reduce gradients across GPUs (same as standard DP).
  3. Optimizer step: each GPU updates only its 1/N_d partition of parameters using the all-reduced gradients and its local optimizer state partition. After the update, an all-gather broadcasts the updated parameter partitions to all GPUs.
# ZeRO-1: Conceptual optimizer step
import torch
import torch.distributed as dist

def zero1_step(params, grads, opt_states, rank, world_size,
               lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # 1. All-reduce gradients (same as standard DP)
    for g in grads:
        dist.all_reduce(g, op=dist.ReduceOp.AVG)

    # 2. Each GPU updates only its partition of parameters
    chunk_size = len(params) // world_size
    start = rank * chunk_size
    end = start + chunk_size

    for i in range(start, end):
        # Adam update using local optimizer state
        opt_states[i]['m'] = beta1 * opt_states[i]['m'] + (1 - beta1) * grads[i]
        opt_states[i]['v'] = beta2 * opt_states[i]['v'] + (1 - beta2) * grads[i] ** 2
        params[i] -= lr * opt_states[i]['m'] / (opt_states[i]['v'].sqrt() + eps)

    # 3. Each owner broadcasts its updated partition so every GPU ends up
    #    with the full model (the "all-gather" from the prose, spelled out)
    for owner in range(world_size):
        for i in range(owner * chunk_size, (owner + 1) * chunk_size):
            dist.broadcast(params[i], src=owner)

Stage 2: Partition Optimizer States + Gradients (P_os+g)

ZeRO Stage 2 additionally partitions gradients. Each GPU only needs to store the gradients for the parameters whose optimizer states it owns. The key mechanism is replacing the all-reduce of gradients with a reduce-scatter: each GPU ends up with the reduced (averaged) gradient for only its parameter partition.

Memory per GPU with ZeRO-2:

M_ZeRO-2 = 2Φ + 2Φ/N_d + 12Φ/N_d = 2Φ + 14Φ/N_d

For 70B on 8 GPUs:

M_ZeRO-2 = 140 GB + 14(70×10⁹)/8 bytes = 140 GB + 122.5 GB = 262.5 GB per GPU

For 70B on 64 GPUs:

M_ZeRO-2 = 140 + 980/64 = 140 + 15.3 = 155.3 GB per GPU

Still not fitting on 80 GB GPUs because we still hold the full FP16 parameter copy on each GPU (140 GB for a 70B model).

How it works at runtime:

  1. Forward pass: same as Stage 1.
  2. Backward pass: as gradients are computed, a reduce-scatter operation immediately distributes and reduces them. Each GPU retains only the gradient shard it owns. Gradients not owned by this GPU are discarded after reduction.
  3. Optimizer step: each GPU runs Adam on its partition using its local gradient shard and optimizer state shard. Then an all-gather synchronizes updated parameters.
# ZeRO-2: Replace all-reduce with reduce-scatter for gradients
import torch
import torch.distributed as dist

def zero2_grad_hook(grad, world_size):
    # Assumes grad.numel() is divisible by world_size for clarity
    shard_size = grad.numel() // world_size

    # Reduce-scatter: rank r receives the averaged gradient for shard r
    reduced_shard = torch.empty(shard_size, device=grad.device)
    dist.reduce_scatter(reduced_shard,
                        list(grad.flatten().chunk(world_size)),
                        op=dist.ReduceOp.AVG)

    # The full gradient buffer can now be freed: each GPU keeps only
    # the shard whose optimizer states it owns
    return reduced_shard

Stage 3: Partition Everything (P_os+g+p)

ZeRO Stage 3 partitions parameters as well. No GPU holds a full copy of the model. Each GPU stores only 1/N_d of the parameters, gradients, and optimizer states.

Memory per GPU with ZeRO-3:

M_ZeRO-3 = 2Φ/N_d + 2Φ/N_d + 12Φ/N_d = 16Φ/N_d

For 70B on 8 GPUs:

M_ZeRO-3 = 16 × 70×10⁹ / 8 bytes = 1,120/8 = 140 GB per GPU

Still too much for 80 GB. On 16 GPUs:

M_ZeRO-3 = 1,120/16 = 70 GB per GPU

This fits on an 80 GB A100/H100 with 10 GB headroom for activations.

On 64 GPUs:

M_ZeRO-3 = 1,120/64 = 17.5 GB per GPU

Plenty of room for large batch sizes and activation memory.
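Inverting the 16Φ/N_d formula gives the smallest data-parallel degree that fits a given GPU. A back-of-envelope helper (our naming, with an assumed activation headroom):

```python
import math

def min_gpus_zero3(num_params: float, gpu_mem_gb: float = 80.0,
                   headroom_gb: float = 10.0) -> int:
    """Smallest N_d such that 16 * Phi / N_d fits in GPU memory,
    reserving `headroom_gb` for activations and buffers."""
    total_state_gb = 16 * num_params / 1e9
    return math.ceil(total_state_gb / (gpu_mem_gb - headroom_gb))

print(min_gpus_zero3(70e9))  # 16 GPUs for a 70B model
print(min_gpus_zero3(13e9))  # 3 GPUs for a 13B model
```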

The ZeRO-3 Trade-off

ZeRO-3 achieves perfect memory scaling (1/N_d reduction) at the cost of additional communication. Before every forward layer computation, the full parameters for that layer must be all-gathered from all GPUs. After the backward pass through each layer, the parameters are discarded. This means parameters are materialized on-demand, layer by layer.

How it works at runtime:

  1. Forward pass: before computing each layer, all-gather that layer’s parameters from all GPUs. Compute the layer. Discard the gathered parameters (keep only the local shard).
  2. Backward pass: all-gather parameters again for each layer (needed for gradient computation). Compute gradients. Reduce-scatter gradients so each GPU keeps only its shard. Discard gathered parameters.
  3. Optimizer step: each GPU updates its 1/N_d shard of parameters using its local gradient and optimizer state shards. No all-gather needed after the step because the next forward pass will gather parameters on demand.
# ZeRO-3: Parameter gathering for forward pass.
# all_gather_parameter / reduce_scatter are placeholder helpers wrapping
# torch.distributed collectives; they are not part of any library API.
class ZeRO3Linear(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input, weight_shard, world_size):
        # All-gather the full weight from all shards
        full_weight = all_gather_parameter(weight_shard, world_size)
        output = input @ full_weight.T

        ctx.save_for_backward(input, weight_shard)
        ctx.world_size = world_size

        # Discard full_weight -- only keep our shard
        del full_weight
        return output

    @staticmethod
    def backward(ctx, grad_output):
        input, weight_shard = ctx.saved_tensors

        # All-gather weight again for gradient computation
        full_weight = all_gather_parameter(weight_shard, ctx.world_size)
        grad_input = grad_output @ full_weight

        # Compute gradient w.r.t. weight, then reduce-scatter so each
        # rank keeps only the gradient shard it owns
        grad_weight_full = grad_output.T @ input
        grad_weight_shard = reduce_scatter(grad_weight_full, ctx.world_size)

        del full_weight
        return grad_input, grad_weight_shard, None

Memory Summary Across All Stages

Memory per GPU: 70B Model with Mixed-Precision Adam

Configuration     | Formula      | 8 GPUs   | 16 GPUs  | 64 GPUs
No ZeRO (DP)      | 16Φ          | 1,120 GB | 1,120 GB | 1,120 GB
ZeRO-1 (P_os)     | 4Φ + 12Φ/N_d | 385 GB   | 332.5 GB | 293.1 GB
ZeRO-2 (P_os+g)   | 2Φ + 14Φ/N_d | 262.5 GB | 201.3 GB | 155.3 GB
ZeRO-3 (P_os+g+p) | 16Φ/N_d      | 140 GB   | 70 GB    | 17.5 GB

Note: Φ = 70B parameters. Memory shown excludes activations and temporary buffers. Only ZeRO-3 at 16 or more GPUs fits an 80 GB GPU.
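The formulas above condense into one helper (stage numbering as in the ZeRO paper; naming is ours):

```python
def zero_memory_per_gpu_gb(num_params: float, n_d: int, stage: int) -> float:
    """Per-GPU training-state memory in GB for each ZeRO stage
    (mixed-precision Adam; activations excluded)."""
    p = num_params / 1e9  # GB contributed by 1 byte per parameter
    if stage == 0:        # plain data parallelism: full replication
        return 16 * p
    if stage == 1:        # optimizer states sharded
        return 4 * p + 12 * p / n_d
    if stage == 2:        # + gradients sharded
        return 2 * p + 14 * p / n_d
    if stage == 3:        # + parameters sharded
        return 16 * p / n_d
    raise ValueError(f"unknown ZeRO stage: {stage}")

print(zero_memory_per_gpu_gb(70e9, 64, 3))  # 17.5 GB
```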

Memory per GPU vs. Number of GPUs (70B Model)

  • No ZeRO, 8 GPUs: 1,120 GB
  • ZeRO-1, 8 GPUs: 385 GB
  • ZeRO-2, 8 GPUs: 262.5 GB
  • ZeRO-3, 8 GPUs: 140 GB
  • ZeRO-3, 16 GPUs: 70 GB -- fits an 80 GB GPU
  • ZeRO-3, 64 GPUs: 17.5 GB

Communication Analysis

ZeRO’s memory savings come at a communication cost. Understanding the exact communication volume is critical for predicting training throughput.

Baseline: Standard Data Parallelism

Standard DP performs one all-reduce of the gradients per training step. An all-reduce of a tensor of size S bytes over N_d GPUs using a ring-based algorithm communicates 2S · (N_d − 1)/N_d ≈ 2S bytes per GPU. For our 70B model with FP16 gradients (S = 2Φ = 140 GB):

V_DP = 2 × 140 = 280 GB per GPU per step

ZeRO-1: No Extra Communication

ZeRO-1 replaces the all-reduce with a reduce-scatter (to distribute reduced gradients to the right owners) followed by an all-gather (to broadcast updated parameters). The reduce-scatter communicates S · (N_d − 1)/N_d ≈ S bytes, and the all-gather communicates the same. Total:

V_ZeRO-1 = S + S = 2S = 280 GB per GPU per step

This is the same volume as standard DP all-reduce. ZeRO-1 adds no extra communication — it just restructures it.

ZeRO-2: Same Communication, Better Overlap

ZeRO-2 also communicates the same total volume as standard DP. The reduce-scatter happens during the backward pass (as gradients become available), and the all-gather happens after the optimizer step. The volume is identical:

V_ZeRO-2 = 2S = 280 GB per GPU per step

The advantage of ZeRO-2 over ZeRO-1 is purely in memory: gradients are discarded after reduce-scatter rather than stored in full. Communication volume is unchanged.

ZeRO-3: Additional All-Gather Overhead

ZeRO-3 adds parameter all-gathers during both the forward and backward passes. For each layer, the full parameters must be gathered before computation. Since there are forward and backward passes, parameters are gathered twice per step (some implementations cache parameters from the forward pass for the backward pass, but this trades memory for communication).

The parameter all-gather volume per pass: (N_d − 1)/N_d × 2Φ ≈ 2Φ bytes per GPU. Two passes (forward + backward) give 2 × 2Φ = 4Φ additional bytes.

Total ZeRO-3 communication per step:

V_ZeRO-3 = 2Φ (gradient reduce-scatter) + 2Φ (forward param all-gather) + 2Φ (backward param all-gather) = 6Φ bytes ≈ 1.5 × V_DP

For 70B: V_ZeRO-3 = 3 × 140 = 420 GB per GPU per step, compared to 280 GB for standard DP. That is a 1.5× communication overhead.

Communication Volume per GPU per Training Step

Method      | Volume Formula | 70B Model (GB) | Overhead vs DP
Standard DP | 2 × 2Φ         | 280            | 1.0x (baseline)
ZeRO-1      | 2 × 2Φ         | 280            | 1.0x
ZeRO-2      | 2 × 2Φ         | 280            | 1.0x
ZeRO-3      | 3 × 2Φ         | 420            | 1.5x

Note: Volume calculated using ring-based collective algorithms. Φ = 70B. FP16 parameters and gradients (2 bytes each).
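The same volumes as a sketch, using the large-N_d approximation (N_d − 1)/N_d ≈ 1 (function name is ours):

```python
def comm_volume_gb(num_params: float, stage: int,
                   bytes_per_elem: int = 2) -> float:
    """Per-GPU communication volume per training step in GB,
    assuming ring collectives and (N_d - 1)/N_d ~ 1."""
    s = num_params * bytes_per_elem / 1e9  # fp16 gradient/param size in GB
    if stage in (0, 1, 2):   # all-reduce, or reduce-scatter + all-gather
        return 2 * s
    if stage == 3:           # grad reduce-scatter + fwd/bwd param all-gathers
        return 3 * s
    raise ValueError(f"unknown ZeRO stage: {stage}")

print(comm_volume_gb(70e9, 0))  # 280.0 GB (standard DP)
print(comm_volume_gb(70e9, 3))  # 420.0 GB (ZeRO-3)
```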

Can ZeRO-3 Communication Be Hidden?

In practice, ZeRO-3’s communication overhead is partially masked by overlapping communication with computation. The parameter all-gather for layer L+1 can be issued while layer L is computing. On modern hardware with dedicated communication channels (NVLink, InfiniBand), this overlap can hide 50-80% of the communication latency for large enough layers.

The key variables that determine how effectively communication is hidden:

  • Layer size: larger layers have more compute to overlap with.
  • Interconnect bandwidth: NVLink (900 GB/s on H100 NVSwitch) versus InfiniBand HDR (200 Gb/s = 25 GB/s) versus InfiniBand NDR (400 Gb/s = 50 GB/s).
  • Batch size: larger micro-batch sizes increase the compute-to-communication ratio.
# DeepSpeed config for communication overlap in ZeRO-3
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,           # Overlap communication with compute
        "contiguous_gradients": True,    # Reduce memory fragmentation
        "prefetch_bucket_size": 5e7,     # Prefetch 50M params at a time
        "param_persistence_threshold": 1e5,  # Keep small params on all GPUs
        "reduce_bucket_size": 5e7,       # Bucket size for gradient reduction
        "stage3_prefetch_bucket_size": 5e7,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9
    }
}
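A crude feasibility model for this overlap: compare a layer's compute time against its parameter transfer time. This is our simplification, ignoring kernel-launch overhead and network contention:

```python
def allgather_hidden_fraction(layer_gflops: float, gpu_tflops: float,
                              layer_param_gb: float, link_gb_s: float) -> float:
    """Fraction of a layer's parameter all-gather that can be hidden
    behind the previous layer's compute (perfect-prefetch model)."""
    t_compute = layer_gflops / (gpu_tflops * 1e3)  # seconds of matmul work
    t_comm = layer_param_gb / link_gb_s            # seconds on the wire
    return min(1.0, t_compute / t_comm)

# 1 ms of compute vs 2 ms of communication -> only half is hidden
print(allgather_hidden_fraction(700, 700, 0.1, 50))  # 0.5
```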

PyTorch FSDP: The Native ZeRO-3

PyTorch’s Fully Sharded Data Parallel (FSDP) is essentially ZeRO-3 implemented natively in the PyTorch framework. It was introduced in PyTorch 1.11 and significantly improved in PyTorch 2.0+ with FSDP2. Understanding the differences matters when choosing between DeepSpeed ZeRO and FSDP.

Conceptual Equivalence

Concept                  | DeepSpeed ZeRO-3                | PyTorch FSDP
Parameter sharding       | Flat parameter partitioning     | Per-module wrapping
Gradient reduction       | Reduce-scatter                  | Reduce-scatter
Parameter gathering      | All-gather per forward/backward | All-gather per FSDP unit
Optimizer state sharding | Automatic                       | Automatic
Mixed precision          | Via DeepSpeed config            | Via MixedPrecision policy

Both systems achieve the same memory reduction (16Φ/N_d per GPU). The differences are in API design, flexibility, and ecosystem integration.

API Comparison

# DeepSpeed ZeRO-3 setup
import deepspeed

ds_config = {
    "train_batch_size": 64,
    "train_micro_batch_size_per_gpu": 2,
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"}
    },
    "fp16": {"enabled": True},
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 1e-4, "weight_decay": 0.01}
    }
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    config=ds_config
)

# Training loop
for batch in dataloader:
    loss = model_engine(batch)
    model_engine.backward(loss)
    model_engine.step()
# PyTorch FSDP setup (PyTorch 2.0+)
import functools

import torch
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
    CPUOffload
)
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

mp_policy = MixedPrecision(
    param_dtype=torch.float16,
    reduce_dtype=torch.float16,
    buffer_dtype=torch.float16
)

model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # ZeRO-3
    mixed_precision=mp_policy,
    cpu_offload=CPUOffload(offload_params=True),
    auto_wrap_policy=functools.partial(
        size_based_auto_wrap_policy, min_num_params=1_000_000
    ),
    device_id=torch.cuda.current_device()
)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Training loop -- standard PyTorch
for batch in dataloader:
    loss = model(batch)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

Key Differences

DeepSpeed ZeRO-3 vs PyTorch FSDP

Aspect                   | DeepSpeed ZeRO-3              | PyTorch FSDP
Ecosystem                | Standalone library            | Native PyTorch
Offloading               | CPU + NVMe (ZeRO-Infinity)    | CPU only
Sharding granularity     | Flat buffer across all params | Per-module wrapping
Activation checkpointing | Built-in support              | Composable with torch.utils.checkpoint
Pipeline parallelism     | Built-in (PipelineModule)     | Separate (PiPPy / torch.distributed.pipelining)
Custom kernels           | Fused Adam, comm overlap      | Relies on PyTorch compiler
torch.compile support    | Limited                       | Full support in FSDP2
Maintenance burden       | External dependency           | Ships with PyTorch

When to Use Each

Use DeepSpeed ZeRO when:

  • You need NVMe offloading (ZeRO-Infinity) for models that exceed GPU + CPU memory.
  • You want integrated pipeline parallelism.
  • You are using the Hugging Face Trainer with DeepSpeed integration (well-tested path).
  • You need ZeRO-1 or ZeRO-2 specifically (FSDP only offers FULL_SHARD or SHARD_GRAD_OP).

Use PyTorch FSDP when:

  • You want to stay within the PyTorch-native ecosystem.
  • You need torch.compile compatibility (critical for performance on H100s).
  • You are building a custom training loop and prefer standard PyTorch APIs.
  • You are on PyTorch 2.1+ and want FSDP2’s improved composability.
💡 FSDP Sharding Strategies Map to ZeRO Stages

ShardingStrategy.FULL_SHARD is equivalent to ZeRO-3 (partition everything). ShardingStrategy.SHARD_GRAD_OP is approximately ZeRO-2 (partition gradients and optimizer states but keep full parameters). ShardingStrategy.NO_SHARD is equivalent to standard DDP (ZeRO-0). There is no direct FSDP equivalent to ZeRO-1 alone.
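That mapping can be pinned down in code; strings stand in for the torch.distributed.fsdp.ShardingStrategy members so the sketch runs without a distributed setup:

```python
def fsdp_strategy_for_zero(stage: int) -> str:
    """Name of the FSDP sharding strategy matching a ZeRO stage."""
    mapping = {0: "NO_SHARD", 2: "SHARD_GRAD_OP", 3: "FULL_SHARD"}
    if stage == 1:
        raise ValueError("FSDP has no direct ZeRO-1 equivalent")
    return mapping[stage]

print(fsdp_strategy_for_zero(3))  # FULL_SHARD
```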

ZeRO-Offload and ZeRO-Infinity

ZeRO-Offload: CPU as Extended Memory

When GPU memory is insufficient even with ZeRO-3, ZeRO-Offload moves optimizer states (and optionally parameters) to CPU RAM. Modern servers have 512 GB to 2 TB of CPU memory, which is 6-25x larger than GPU HBM.

The trade-off is latency. CPU memory bandwidth is approximately 50-100 GB/s (DDR5), compared to 2-3 TB/s for GPU HBM. The optimizer step, which requires reading and writing optimizer states, becomes significantly slower.

How ZeRO-Offload works:

  1. Forward pass: parameters are on GPU (or gathered to GPU via all-gather if using ZeRO-3).
  2. Backward pass: gradients are computed on GPU, then offloaded to CPU.
  3. Optimizer step: performed on CPU using offloaded gradients and CPU-resident optimizer states. Updated parameters are then copied back to GPU.
# DeepSpeed config for ZeRO-Offload
ds_config = {
    "zero_optimization": {
        "stage": 2,  # or 3
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": True,       # Use pinned memory for faster transfers
            "buffer_count": 4,        # Overlap CPU compute with GPU-CPU transfer
            "fast_init": False
        },
        "offload_param": {
            "device": "cpu",          # Also offload parameters (ZeRO-3 only)
            "pin_memory": True
        }
    }
}

Latency impact of CPU offloading:

The optimizer step on CPU is typically 2-5x slower than on GPU. However, for very large models where the alternative is “cannot train at all,” this is an acceptable trade-off. DeepSpeed mitigates the cost by:

  • Using pinned memory for faster PCIe transfers.
  • Overlapping CPU optimizer computation with GPU forward/backward computation of the next micro-batch.
  • Using a fused CPU Adam implementation in C++ (avoiding Python overhead).
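To see why PCIe matters, a rough per-step transfer estimate (our model; 25 GB/s is an assumed effective PCIe rate, and with ZeRO sharding each GPU actually moves only its 1/N_d slice of these sizes):

```python
def offload_transfer_seconds(grad_gb: float, param_gb: float,
                             pcie_gb_s: float = 25.0) -> float:
    """Per-step PCIe time for CPU offload: gradients copied down to
    CPU, updated parameters copied back up to GPU."""
    return (grad_gb + param_gb) / pcie_gb_s

# Unsharded worst case for a 70B model (140 GB each way):
print(offload_transfer_seconds(140.0, 140.0))  # 11.2 seconds
```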
ZeRO-Offload Throughput Impact

Configuration                             | 70B on 8x A100 | Throughput (tokens/sec) | Memory/GPU
ZeRO-3 (GPU only)                         | Doesn't fit    | N/A                     | 140 GB needed
ZeRO-3 + CPU Offload (optimizer)          | Fits           | ~1,800                  | ~45 GB
ZeRO-3 + CPU Offload (optimizer + params) | Fits           | ~900                    | ~15 GB
ZeRO-3 on 16x A100 (no offload)           | Fits           | ~3,200                  | ~70 GB

Note: Throughput estimates for 2K sequence length, micro-batch size 1. More GPUs without offloading is almost always faster.

ZeRO-Infinity: NVMe as the Final Tier

ZeRO-Infinity extends offloading to NVMe SSDs for cases where even CPU memory is insufficient. A single NVMe SSD provides 3-7 GB/s of sequential read bandwidth, and with 4-8 SSDs in RAID 0, you can reach 20-50 GB/s.

ZeRO-Infinity creates a three-tier memory hierarchy:

  1. GPU HBM (~80 GB, ~3 TB/s): Active computation tensors
  2. CPU RAM (~1 TB, ~100 GB/s): Optimizer states, parameter staging
  3. NVMe SSD (~4-16 TB, ~20-50 GB/s with RAID): Cold optimizer states, parameter overflow
# DeepSpeed config for ZeRO-Infinity (NVMe offloading)
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/local_nvme",
            "pin_memory": True,
            "buffer_count": 5,
            "fast_init": False
        },
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/local_nvme",
            "pin_memory": True,
            "buffer_count": 5,
            "max_in_cpu": 1e9    # Keep up to 1B params in CPU as cache
        },
        "aio": {
            "block_size": 1048576,   # 1MB blocks for async I/O
            "queue_depth": 8,
            "thread_count": 1,
            "single_submit": False,
            "overlap_events": True
        }
    }
}
⚠️ NVMe Offloading Is a Last Resort

NVMe offloading introduces significant latency. Sequential read at 7 GB/s means reading 140 GB of parameters from NVMe takes 20 seconds. Even with prefetching and overlap, NVMe-offloaded training is typically 5-10x slower than GPU-only ZeRO-3 with enough GPUs. Use it only when you have no other option — when the model is so large that neither GPU memory nor CPU memory suffices.
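The 20-second figure is straight bandwidth arithmetic:

```python
def nvme_read_seconds(size_gb: float, bandwidth_gb_s: float = 7.0) -> float:
    """Sequential read time from NVMe at a given bandwidth."""
    return size_gb / bandwidth_gb_s

print(nvme_read_seconds(140.0))        # 20.0 s on one 7 GB/s SSD
print(nvme_read_seconds(140.0, 30.0))  # under 5 s on a 4-SSD RAID 0
```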

ZeRO-Infinity Memory Hierarchy

Three-tier offloading: GPU HBM, CPU DRAM, NVMe SSD

  • GPU HBM (80 GB, ~3 TB/s): active layer params, activations, gradient computation
  • CPU DRAM (512 GB - 2 TB, ~100 GB/s DRAM bandwidth, reached over PCIe Gen5 at ~64 GB/s): optimizer states, parameter staging area
  • NVMe SSD (4-16 TB, ~7 GB/s per SSD, ~30 GB/s in RAID 0): cold optimizer states, parameter overflow

Real Training Throughput Analysis

Theory is useful, but practitioners care about tokens per second per GPU. The following numbers are representative of real-world training workloads, gathered from published benchmarks and community reports.

Throughput Across ZeRO Stages

Training Throughput: 13B Model on 8x A100 80GB

Configuration        | Notes            | Tokens/sec/GPU
DDP (no ZeRO)        | OOM at batch > 2 | 4,200
ZeRO-1               | 1.0x comm        | 4,100
ZeRO-2               | 1.0x comm        | 3,900
ZeRO-3               | 1.5x comm        | 3,100
ZeRO-3 + CPU Offload | PCIe bottleneck  | 1,600

Training Throughput: 70B Model on 64x A100 80GB

Configuration            | Notes                 | Tokens/sec/GPU
ZeRO-1                   | OOM                   | 0
ZeRO-2                   | OOM                   | 0
ZeRO-3                   | Fits at 17.5 GB/GPU   | 2,400
ZeRO-3 + GAS=4           | Gradient accumulation | 2,800 (+16.7%)
3D Parallel (TP=8, PP=8) | TP within node        | 3,200 (+33.3%)

Communication Overhead at Each Stage

The overhead of ZeRO depends heavily on the interconnect. The following table shows the communication time as a percentage of total step time for different network configurations.

Communication Overhead as % of Step Time (13B Model, 8 GPUs)

ZeRO Stage            | NVLink (900 GB/s) | IB NDR (50 GB/s) | IB HDR (25 GB/s)
ZeRO-1                | 2%                | 8%               | 15%
ZeRO-2                | 2%                | 8%               | 15%
ZeRO-3                | 5%                | 22%              | 38%
ZeRO-3 (with overlap) | 3%                | 12%              | 22%

Note: Within a single node, NVLink dominates. Across nodes, InfiniBand bandwidth becomes the bottleneck. Overlap assumes prefetching the next layer's parameters during the current layer's compute.

Throughput Scaling with GPU Count

Scaling Efficiency: 70B Model Training (tokens/sec total)

GPUs | ZeRO-3  | ZeRO-3 + Offload | 3D Parallel (TP+PP+DP)
16   | 28,800  | 12,000           | 38,400
32   | 60,000  | 22,000           | 72,000
64   | 112,000 | 38,000           | 140,000
128  | 200,000 | N/A              | 260,000
256  | 360,000 | N/A              | 500,000

Note: 3D parallel combines tensor parallelism (within node), pipeline parallelism (across node pairs), and data parallelism (across groups). It achieves higher throughput but requires more complex configuration.
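A quick way to read such tables is scaling efficiency: per-GPU throughput relative to the smallest configuration (helper name is ours):

```python
def scaling_efficiency(tokens_per_sec: float, gpus: int,
                       base_tokens_per_sec: float, base_gpus: int) -> float:
    """Per-GPU throughput relative to a baseline run; 1.0 = linear scaling."""
    return (tokens_per_sec / gpus) / (base_tokens_per_sec / base_gpus)

# ZeRO-3 from the table: going from 16 to 256 GPUs retains ~78% efficiency
print(scaling_efficiency(360_000, 256, 28_800, 16))  # 0.78125
```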

DeepSpeed Configuration Reference

The following configuration demonstrates a production-ready ZeRO-3 setup for training a large model.

{
  "train_batch_size": 512,
  "train_micro_batch_size_per_gpu": 2,
  "gradient_accumulation_steps": 32,
  "gradient_clipping": 1.0,

  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },

  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 50000000,
    "stage3_prefetch_bucket_size": 50000000,
    "stage3_param_persistence_threshold": 100000,
    "stage3_max_live_parameters": 1000000000,
    "stage3_max_reuse_distance": 1000000000,
    "stage3_gather_16bit_weights_on_model_save": true,

    "offload_optimizer": {
      "device": "none"
    },
    "offload_param": {
      "device": "none"
    }
  },

  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 3e-4,
      "betas": [0.9, 0.95],
      "eps": 1e-8,
      "weight_decay": 0.1
    }
  },

  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 3e-4,
      "warmup_num_steps": 2000,
      "total_num_steps": 100000
    }
  },

  "activation_checkpointing": {
    "partition_activations": true,
    "cpu_checkpointing": false,
    "contiguous_memory_optimization": true,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false
  }
}

Key parameters explained:

  • stage3_prefetch_bucket_size: how many parameters to prefetch before they are needed. Larger values improve overlap but increase peak memory.
  • stage3_param_persistence_threshold: parameters smaller than this threshold are kept on all GPUs (not sharded). Small parameters like layer norms and biases cost negligible memory but would otherwise require an all-gather for just a few kilobytes.
  • stage3_max_live_parameters: limits the maximum number of parameters materialized on GPU simultaneously. Controls peak memory during forward/backward.
  • stage3_gather_16bit_weights_on_model_save: gathers all parameter shards to rank 0 for saving a standard checkpoint. Without this, you get sharded checkpoints that require the same number of GPUs to load.
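
One consistency rule worth checking before launch: DeepSpeed requires `train_batch_size` to equal `train_micro_batch_size_per_gpu` x `gradient_accumulation_steps` x the data-parallel world size, and fails at startup if it does not. A minimal standalone check (the `check_batch_config` helper is illustrative, and the world size of 8 is an assumption):

```python
import json

# The batch-size fields from the config above.
config = json.loads("""{
  "train_batch_size": 512,
  "train_micro_batch_size_per_gpu": 2,
  "gradient_accumulation_steps": 32
}""")

def check_batch_config(cfg: dict, dp_world_size: int) -> bool:
    """True if train_batch_size == micro_batch * grad_accum * world_size."""
    expected = (cfg["train_micro_batch_size_per_gpu"]
                * cfg["gradient_accumulation_steps"]
                * dp_world_size)
    return cfg["train_batch_size"] == expected

print(check_batch_config(config, dp_world_size=8))   # True: 2 * 32 * 8 == 512
print(check_batch_config(config, dp_world_size=16))  # False
```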

Activation Memory: The Other Memory Consumer

ZeRO addresses model state memory (parameters, gradients, optimizer states) but not activation memory. Activations — the intermediate tensors saved during the forward pass for use in the backward pass — can consume enormous amounts of memory for large batch sizes and long sequences.

For a transformer with $L$ layers, hidden dimension $h$, sequence length $s$, and batch size $b$:

$$M_{\text{activations}} \approx L \cdot s \cdot b \cdot h \cdot 10 \text{ bytes}$$

For a 70B model (80 layers, $h = 8192$, $s = 2048$, $b = 2$):

$$M_{\text{activations}} \approx 80 \times 2048 \times 2 \times 8192 \times 10 \text{ bytes} \approx 26.8 \text{ GB}$$

This must be added to the ZeRO model state memory. Activation checkpointing (recomputing activations during the backward pass instead of storing them) reduces this to approximately $\sqrt{L}$ checkpoints' worth of activation memory at the cost of ~33% more compute.
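
The estimate above is easy to script. A small calculator (the ~10 bytes/element constant and the square-root approximation for checkpointing follow the text; both are rough heuristics, and real footprints vary with the attention implementation):

```python
def activation_memory_gb(layers: int, seq_len: int, batch: int, hidden: int,
                         bytes_per_elem: float = 10.0,
                         checkpointing: bool = False) -> float:
    """Approximate activation memory: L * s * b * h * ~10 bytes.

    With activation checkpointing, only ~sqrt(L) layers' worth of
    activations is live at once (recompute happens during backward).
    """
    total = layers * seq_len * batch * hidden * bytes_per_elem
    if checkpointing:
        total *= (layers ** 0.5) / layers  # keep ~sqrt(L) of L layers
    return total / 1e9

# 70B model: 80 layers, h=8192, s=2048, b=2
print(f"{activation_memory_gb(80, 2048, 2, 8192):.1f} GB")  # 26.8 GB
print(f"{activation_memory_gb(80, 2048, 2, 8192, checkpointing=True):.1f} GB")  # 3.0 GB
```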

ℹ️ ZeRO + Activation Checkpointing

In practice, large model training always combines ZeRO with activation checkpointing. ZeRO handles model state memory; activation checkpointing handles activation memory. Together, they make it possible to train models that would otherwise require orders of magnitude more GPU memory.

When NOT to Use ZeRO-3

ZeRO-3 is not always the optimal choice. Here are scenarios where alternatives are better.

Small Models That Fit With DDP

If your model fits comfortably in GPU memory with standard DDP (parameters + gradients + optimizer states are all less than, say, 60% of GPU memory), ZeRO-3 adds unnecessary communication overhead and implementation complexity. A 7B model with Adam needs $7 \times 16 = 112$ GB, which does not fit on a single 80 GB GPU, but ZeRO-2 suffices: parameters stay replicated while gradients and optimizer states are sharded, so there is no need for ZeRO-3's parameter sharding.

For models under ~2B parameters, standard DDP with gradient accumulation is often the fastest and simplest option.

When Tensor Parallelism Is Available

Within a single node with NVLink/NVSwitch (for example, 8x H100 with 900 GB/s bisection bandwidth), tensor parallelism (TP) splits the actual matrix computations across GPUs with much lower communication overhead than ZeRO-3.

TP communication per layer: two all-reduce operations over the activation tensor of size $b \times s \times h$ (activations, not the full model). For a 70B model with sequence length 2048, batch size 2, hidden size 8192:

$$V_{\text{TP per layer}} = 2 \times 2 \times 2048 \times 8192 \times 2 \text{ bytes} = 134 \text{ MB}$$

Over 80 layers: $80 \times 134 \text{ MB} = 10.7$ GB per step.

ZeRO-3 communication per step: 840 GB (as computed earlier).

TP communicates ~80x less data than ZeRO-3 for the same model. The trade-off is that TP requires the model architecture to be explicitly partitioned (column/row parallel linear layers), and it does not scale beyond a single node efficiently (NVLink bandwidth is needed).
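
Both volumes are easy to reproduce. The sketch below matches the numbers above under a simplified cost model (each all-reduce counted as moving its payload once for TP; ZeRO-3 taken as 3x the 2-psi baseline used earlier in the article), not a measurement:

```python
def tp_comm_per_step_gb(layers: int, batch: int, seq: int, hidden: int,
                        bytes_per_elem: int = 2) -> float:
    """TP traffic: two all-reduces of the (b x s x h) activation per layer."""
    return layers * 2 * batch * seq * hidden * bytes_per_elem / 1e9

def zero3_comm_per_step_gb(params: float, bytes_per_elem: int = 2) -> float:
    """ZeRO-3 traffic: ~3x the standard 2*psi all-reduce volume."""
    return 3 * 2 * params * bytes_per_elem / 1e9

tp = tp_comm_per_step_gb(80, 2, 2048, 8192)  # ~10.7 GB per step
z3 = zero3_comm_per_step_gb(70e9)            # 840 GB per step
print(f"TP: {tp:.1f} GB, ZeRO-3: {z3:.0f} GB, ratio: {z3 / tp:.0f}x")
```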

Best practice: Use TP within a node (TP degree = 8 for 8-GPU nodes) and ZeRO/DP across nodes. This is the standard “3D parallelism” approach used by Megatron-LM, Megatron-DeepSpeed, and similar frameworks.

💡 The 3D Parallelism Rule of Thumb

Tensor Parallelism within a node (8 GPUs connected by NVLink). Pipeline Parallelism across 2-8 nodes. Data Parallelism (with ZeRO-1) across the remaining GPU groups. This minimizes communication across slow inter-node links while maximizing memory efficiency.

Latency-Sensitive Training

ZeRO-3 introduces latency per layer due to the parameter all-gather. Even with prefetching, the first layer of the forward pass cannot be prefetched (there is no preceding computation to overlap with). For latency-sensitive workloads like reinforcement learning from human feedback (RLHF), where you alternate between generation (auto-regressive, many sequential steps) and training, ZeRO-3’s per-layer latency compounds.

In RLHF generation, each token requires a full forward pass through the model. With ZeRO-3, each forward pass requires all-gathering parameters layer by layer. For a model with 80 layers on 64 GPUs over InfiniBand NDR:

$$t_{\text{gather per layer}} = \frac{2 \times (70\text{B} / 80 \text{ layers}) \times 2 \text{ bytes}}{50 \text{ GB/s}} \approx 70 \text{ ms}$$

Over 80 layers per token: $80 \times 70 \text{ ms} = 5.6$ seconds per token. This makes auto-regressive generation essentially impossible with ZeRO-3 across nodes. Solutions include using TP for the generation phase or keeping a full model replica for inference.
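
The same estimate, parameterized (a bandwidth-only cost model using the 2x-payload all-gather factor from the formula above; real collectives add per-message latency on top):

```python
def zero3_token_latency_s(params: float, layers: int,
                          bandwidth_gb_s: float,
                          bytes_per_param: int = 2) -> float:
    """Seconds per generated token spent all-gathering parameters,
    layer by layer, during a single forward pass."""
    per_layer = 2 * (params / layers) * bytes_per_param / (bandwidth_gb_s * 1e9)
    return layers * per_layer

# 70B model, 80 layers, IB NDR at ~50 GB/s effective
print(f"{zero3_token_latency_s(70e9, 80, 50):.1f} s/token")   # 5.6 s/token
# Same gathers over 900 GB/s NVLink-class bandwidth:
print(f"{zero3_token_latency_s(70e9, 80, 900):.2f} s/token")  # 0.31 s/token
```

Even at NVLink bandwidth the per-token cost remains far too high for interactive generation, which is why the text recommends TP or a dedicated inference replica for the generation phase.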

Summary: When to Use What

📊

Parallelism Strategy Decision Matrix

| Scenario | Recommended Strategy | Why |
|---|---|---|
| Model fits on 1 GPU | DDP | Simplest, no overhead |
| Model fits with ZeRO-1/2 | ZeRO-1 or ZeRO-2 | No extra comm, good memory savings |
| Single node, 8 GPUs, large model | TP=8 | NVLink bandwidth, lower comm than ZeRO-3 |
| Multi-node, very large model | TP + PP + ZeRO-1 DP | 3D parallelism minimizes cross-node comm |
| Limited GPUs, huge model | ZeRO-3 + CPU Offload | Only option when GPUs are scarce |
| Extreme: model exceeds CPU RAM | ZeRO-Infinity (NVMe) | Last resort, very slow |

Practical Tuning Guide

Choosing the Right ZeRO Stage

def recommend_zero_stage(
    param_count_billions: float,
    gpu_memory_gb: float = 80,
    num_gpus: int = 8,
    has_nvlink: bool = True
) -> str:
    """Recommend ZeRO stage based on hardware constraints."""
    bytes_per_param = 16  # Mixed-precision Adam
    model_memory_gb = param_count_billions * bytes_per_param

    # Can DDP handle it?
    if model_memory_gb < gpu_memory_gb * 0.6:
        return "DDP (no ZeRO needed)"

    # Can ZeRO-1 handle it?
    zero1_mem = param_count_billions * 4 + (param_count_billions * 12) / num_gpus
    if zero1_mem < gpu_memory_gb * 0.7:
        return "ZeRO-1"

    # Can ZeRO-2 handle it?
    zero2_mem = param_count_billions * 2 + (param_count_billions * 14) / num_gpus
    if zero2_mem < gpu_memory_gb * 0.7:
        return "ZeRO-2"

    # ZeRO-3 needed
    zero3_mem = (param_count_billions * 16) / num_gpus
    if zero3_mem < gpu_memory_gb * 0.7:
        if has_nvlink and num_gpus <= 8:
            return "Consider TP=8 instead of ZeRO-3 (lower comm overhead)"
        return "ZeRO-3"

    # Even ZeRO-3 doesn't fit -- need offloading
    return "ZeRO-3 + CPU Offload (or add more GPUs)"

# Examples
print(recommend_zero_stage(7, 80, 8))    # "ZeRO-1"
print(recommend_zero_stage(13, 80, 8))   # "ZeRO-2"
print(recommend_zero_stage(70, 80, 8))   # "ZeRO-3 + CPU Offload (or add more GPUs)"
print(recommend_zero_stage(70, 80, 64))  # "ZeRO-3"
# 70 GB/GPU on 16 GPUs fits an A100 but exceeds the 70% headroom cutoff:
print(recommend_zero_stage(70, 80, 16))  # "ZeRO-3 + CPU Offload (or add more GPUs)"

Common Pitfalls

1. Not using gradient accumulation with ZeRO-3. The per-layer all-gather cost is fixed regardless of micro-batch size. By using gradient accumulation (multiple micro-batches per optimizer step), you amortize the communication cost over more compute.

2. Setting prefetch bucket size too small. If stage3_prefetch_bucket_size is smaller than the largest layer in your model, ZeRO must issue multiple all-gathers per layer, increasing latency. Set it to at least the size of your largest linear layer.

3. Forgetting stage3_gather_16bit_weights_on_model_save. Without this flag, model.save_pretrained() will save sharded weights that require the same number of GPUs to load. This makes checkpoints non-portable.

4. Using ZeRO-3 within a node when TP would be faster. If all GPUs are connected by NVLink, TP almost always outperforms ZeRO-3 due to the dramatically lower communication volume.

5. Mixing ZeRO-3 with pipeline parallelism incorrectly. ZeRO-3 shards across the data-parallel group. If you also use pipeline parallelism, the ZeRO-3 sharding only applies within each pipeline-parallel stage’s data-parallel group, not across pipeline stages.
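
The bucket-size pitfall (#2) can be checked mechanically. A sketch that estimates the largest weight matrix from the model's dimensions (the 4h FFN width is an assumption for illustration; many modern architectures use different multipliers, so substitute your model's real shapes):

```python
def min_prefetch_bucket(hidden: int, ffn_mult: int = 4) -> int:
    """Parameter count of the assumed largest linear layer (h x ffn_mult*h).

    stage3_prefetch_bucket_size is expressed in parameters (elements),
    so it should be at least this large to avoid split all-gathers.
    """
    return hidden * ffn_mult * hidden

DEFAULT_BUCKET = 50_000_000  # the 5e7 value from the config above

largest = min_prefetch_bucket(8192)  # 268,435,456 params for h=8192
if largest > DEFAULT_BUCKET:
    print(f"raise stage3_prefetch_bucket_size to >= {largest}")
```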

Historical Context and Evolution

ZeRO was introduced in the paper “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models” (Rajbhandari et al., October 2019). At the time, GPT-2 (1.5B parameters) was considered large. The paper demonstrated training models up to 100B parameters — unprecedented at the time.

The evolution since then:

| Date | Development | Impact |
|---|---|---|
| Oct 2019 | ZeRO paper (Stages 1-3) | Enabled 100B+ parameter training |
| Jan 2021 | ZeRO-Offload | CPU offloading for memory-constrained setups |
| Jun 2021 | ZeRO-Infinity | NVMe offloading, theoretically unlimited model size |
| Mar 2022 | PyTorch FSDP (1.11) | Native ZeRO-3 in PyTorch |
| Sep 2022 | ZeRO++ | Reduced communication via quantized weights |
| 2023-2024 | FSDP2, DeepSpeed improvements | Better torch.compile support, improved overlap |

ZeRO’s core insight — that data-parallel training needlessly replicates state that could be partitioned — has become foundational. Every major training framework now implements some form of it. The question is no longer whether to shard optimizer states, but how aggressively to shard and what to combine it with.

Conclusion

ZeRO is a memory optimization that trades communication for capacity. The three stages offer a spectrum:

  • ZeRO-1 partitions optimizer states, saving 75% of optimizer memory with zero additional communication. It is a free lunch for any data-parallel training with Adam.
  • ZeRO-2 additionally partitions gradients, also with no extra communication volume (just restructured as reduce-scatter). Another free lunch, with the only cost being slightly more complex gradient handling.
  • ZeRO-3 partitions everything, achieving perfect $1/N_d$ memory scaling but at $3\times$ the communication volume of standard DP. It is necessary for models that cannot fit even with ZeRO-2 but should be replaced by tensor parallelism when NVLink is available within a node.

The practical formula for a 70B model with mixed-precision Adam: $16 \times 70\text{B} = 1.12$ TB total. With ZeRO-3 on 16 GPUs, each GPU holds 70 GB, just fitting on an 80 GB A100. With 64 GPUs, each GPU holds 17.5 GB, leaving ample room for activations and large batch sizes.
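
The same arithmetic as a helper (mixed-precision Adam's 16 bytes of model state per parameter, sharded evenly across the data-parallel group):

```python
def zero3_state_per_gpu_gb(params_billions: float, num_gpus: int,
                           bytes_per_param: int = 16) -> float:
    """Per-GPU model-state memory under ZeRO-3 (params + grads + optimizer)."""
    return params_billions * bytes_per_param / num_gpus

print(zero3_state_per_gpu_gb(70, 16))  # 70.0 GB per GPU
print(zero3_state_per_gpu_gb(70, 64))  # 17.5 GB per GPU
```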

For production training at scale, the dominant pattern is 3D parallelism: tensor parallelism within a node, pipeline parallelism across a small number of nodes, and data parallelism with ZeRO-1 across the rest. ZeRO-3 serves as the fallback when you have fewer GPUs than ideal, and ZeRO-Offload/Infinity serves as the last resort when even GPU memory across your entire cluster is insufficient.

The key takeaway: ZeRO-1 and ZeRO-2 are almost always worth enabling because they save memory without additional communication. ZeRO-3 requires careful analysis of your interconnect bandwidth before committing to it. And if your model fits with tensor parallelism on a single node, that is almost always the better choice.