Part of series: Inference Optimization Timeline (38 of 60)

Llama 3.1 405B in FP16 requires 810 GB of weight storage. A single H100 has 80 GB of HBM. Even with FP8 quantization (405 GB), a minimum of 6 GPUs is needed just for weights, and you still need HBM for KV cache and activations. Distributed inference is mandatory for large models, and the choice of parallelism strategy directly determines latency, throughput, and hardware utilization.
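The weights-only GPU floor is simple ceiling arithmetic; a back-of-envelope sketch (function name is illustrative, not from any serving framework):

```python
import math

def min_gpus_for_weights(params_b, bytes_per_param, hbm_gb, overhead_frac=0.0):
    """Minimum GPU count whose combined HBM holds the weights alone.

    overhead_frac reserves a fraction of HBM for KV cache and activations;
    0.0 reproduces the weights-only lower bound discussed in the text.
    """
    weight_gb = params_b * bytes_per_param  # 1B params x 1 byte = 1 GB
    usable_gb = hbm_gb * (1.0 - overhead_frac)
    return math.ceil(weight_gb / usable_gb)

# Llama 3.1 405B on 80 GB H100s:
print(min_gpus_for_weights(405, 2, 80))  # FP16: 810 GB of weights -> 11 GPUs
print(min_gpus_for_weights(405, 1, 80))  # FP8: 405 GB -> 6 GPUs
```

Reserving even 30% of HBM for KV cache (`overhead_frac=0.3`) pushes the FP8 count from 6 to 8 GPUs, which is why real deployments size above the weights-only minimum.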

Tensor Parallelism for Inference

Tensor parallelism (TP) splits each layer’s weight matrices across GPUs. For a linear layer $Y = XW$ where $W \in \mathbb{R}^{d \times d}$, TP splits $W$ column-wise across $N$ GPUs:

$$W = [W_1 \mid W_2 \mid \cdots \mid W_N], \quad W_i \in \mathbb{R}^{d \times d/N}$$

Each GPU computes $Y_i = X W_i$ using its local shard. For a column split, the complete output is the concatenation $[Y_1 \mid \cdots \mid Y_N]$ (an all-gather); when $W$ is instead split row-wise (with $X$ split to match), an all-reduce (sum) combines the partial results.
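Both sharding identities can be sanity-checked on CPU with NumPy, simulating each rank with an array slice:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 4, 8
X = rng.standard_normal((2, d))
W = rng.standard_normal((d, d))

# Column parallel: each "rank" holds d/N output columns; concatenation
# (the all-gather) recovers XW exactly.
col_shards = np.split(W, N, axis=1)
Y_col = np.concatenate([X @ Wi for Wi in col_shards], axis=1)

# Row parallel: each rank holds d/N input rows plus the matching slice of X;
# summing the partial products (the all-reduce) recovers XW.
row_shards = np.split(W, N, axis=0)
X_shards = np.split(X, N, axis=1)
Y_row = sum(Xi @ Wi for Xi, Wi in zip(X_shards, row_shards))

assert np.allclose(Y_col, X @ W)
assert np.allclose(Y_row, X @ W)
```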

Column-Parallel and Row-Parallel Linear Layers

import torch
import torch.distributed as dist

class ColumnParallelLinear:
    """Split weight along output dimension (columns).
    W: [input_dim, output_dim] -> each GPU has [input_dim, output_dim/N]
    Each GPU receives the FULL input X and produces output_dim/N outputs."""

    def __init__(self, weight_shard, tp_rank, tp_size):
        self.weight = weight_shard  # [input_dim, output_dim // tp_size]
        self.tp_rank = tp_rank
        self.tp_size = tp_size

    def forward(self, x):
        """x: [batch, seq_len, input_dim] (same on all ranks)
        returns: [batch, seq_len, output_dim // tp_size] (different on each rank)
        """
        # Local GEMM: full input x local weight shard
        # No communication needed
        return torch.matmul(x, self.weight)

class RowParallelLinear:
    """Split weight along input dimension (rows).
    W: [input_dim, output_dim] -> each GPU has [input_dim/N, output_dim]
    Each GPU receives input_dim/N of the input and produces full output_dim."""

    def __init__(self, weight_shard, tp_rank, tp_size, tp_group):
        self.weight = weight_shard  # [input_dim // tp_size, output_dim]
        self.tp_rank = tp_rank
        self.tp_size = tp_size
        self.tp_group = tp_group

    def forward(self, x):
        """x: [batch, seq_len, input_dim // tp_size] (different on each rank)
        returns: [batch, seq_len, output_dim] (same on all ranks after all-reduce)
        """
        # Local GEMM: partial input x local weight shard
        local_output = torch.matmul(x, self.weight)
        # All-reduce to get full output (sum across ranks)
        dist.all_reduce(local_output, group=self.tp_group)
        return local_output

Complete TP Transformer Layer

In Megatron-style TP, each transformer layer uses column-parallel for the first linear (QKV projection, gate/up projection) and row-parallel for the second (O projection, down projection). This requires exactly 2 all-reduces per layer.

class TPTransformerLayer:
    """Tensor-parallel transformer layer with 2 all-reduces."""

    def __init__(self, layer_weights, tp_rank, tp_size, tp_group):
        d = layer_weights["hidden_size"]
        kv_heads = layer_weights["num_kv_heads"]
        q_heads = layer_weights["num_attention_heads"]
        head_dim = d // q_heads
        ffn_dim = layer_weights["intermediate_size"]

        # Attention: column-parallel QKV, row-parallel O
        # Split Q heads across TP ranks
        q_heads_per_rank = q_heads // tp_size
        kv_heads_per_rank = max(1, kv_heads // tp_size)

        self.qkv_proj = ColumnParallelLinear(
            weight_shard=layer_weights["qkv"][:, tp_rank],
            tp_rank=tp_rank, tp_size=tp_size
        )
        self.o_proj = RowParallelLinear(
            weight_shard=layer_weights["o"][tp_rank, :],
            tp_rank=tp_rank, tp_size=tp_size, tp_group=tp_group
        )
        # All-reduce #1: after O projection

        # FFN: column-parallel gate+up, row-parallel down
        self.gate_up_proj = ColumnParallelLinear(
            weight_shard=layer_weights["gate_up"][:, tp_rank],
            tp_rank=tp_rank, tp_size=tp_size
        )
        self.down_proj = RowParallelLinear(
            weight_shard=layer_weights["down"][tp_rank, :],
            tp_rank=tp_rank, tp_size=tp_size, tp_group=tp_group
        )
        # All-reduce #2: after down projection

    def forward(self, hidden_states, kv_cache):
        """Forward pass with 2 all-reduces."""
        residual = hidden_states

        # Attention block
        normed = rms_norm(hidden_states)
        qkv = self.qkv_proj.forward(normed)  # Column-parallel, no comm
        q, k, v = split_qkv(qkv)
        attn_output = attention(q, k, v, kv_cache)
        hidden_states = self.o_proj.forward(attn_output)  # Row-parallel + ALL-REDUCE
        hidden_states = hidden_states + residual

        # FFN block
        residual = hidden_states
        normed = rms_norm(hidden_states)
        gate_up = self.gate_up_proj.forward(normed)  # Column-parallel, no comm
        gate, up = gate_up.chunk(2, dim=-1)
        ffn_out = torch.nn.functional.silu(gate) * up
        hidden_states = self.down_proj.forward(ffn_out)  # Row-parallel + ALL-REDUCE
        hidden_states = hidden_states + residual

        return hidden_states

TP Communication Cost

Each all-reduce on $N$ GPUs with message size $M$ bytes uses the ring all-reduce algorithm:

$$T_{\text{all-reduce}} = \frac{2(N-1)}{N} \cdot \frac{M}{\text{bandwidth}}$$

For Llama 70B with TP=8 and hidden dim 8192:

def tp_communication_cost(hidden_dim, batch_tokens, tp_size,
                           bw_gb_per_s, num_layers):
    """Calculate TP communication overhead per forward pass.
    bw_gb_per_s is interconnect bandwidth in gigabytes per second."""
    # Message size per all-reduce: batch_tokens * hidden_dim * 2 bytes (FP16)
    msg_bytes = batch_tokens * hidden_dim * 2

    # Ring all-reduce: 2 * (N-1)/N * msg_bytes / bandwidth
    factor = 2 * (tp_size - 1) / tp_size
    ar_time = factor * msg_bytes / (bw_gb_per_s * 1e9)

    # 2 all-reduces per layer
    ar_per_layer = 2 * ar_time
    total_ar = num_layers * ar_per_layer

    return {
        "msg_bytes_kb": msg_bytes / 1024,
        "ar_time_per_layer_us": ar_per_layer * 1e6,
        "total_ar_ms": total_ar * 1000,
        "ar_fraction_prefill": None,  # Depends on compute time
    }

# NVLink (900 GB/s bidirectional on H100):
# Decode batch=1: msg = 1 * 8192 * 2 = 16 KB
# AR time = 2 * 7/8 * 16KB / 900GB/s = 0.031 us per AR
# 2 AR/layer * 80 layers = 160 AR * 0.031 us = 5.0 us total

# InfiniBand NDR (400 Gbps = 50 GB/s):
# Decode batch=1: same msg = 16 KB
# AR time = 2 * 7/8 * 16KB / 50GB/s = 0.56 us per AR
# 160 AR * 0.56 us = 89.6 us total

TP Communication Overhead: NVLink vs InfiniBand

| Interconnect | Bandwidth | Per all-reduce (decode, B=1) | Total, 80 layers | % of decode time |
|---|---|---|---|---|
| NVLink (H100) | 900 GB/s | 0.031 us | 5.0 us | 0.01% |
| PCIe 5.0 | 64 GB/s | 0.44 us | 70 us | 0.2% |
| InfiniBand NDR | 50 GB/s | 0.56 us | 90 us | 0.3% |
| Ethernet 100G | 12.5 GB/s | 2.24 us | 358 us | 1.1% |
Performance note: At decode batch=1, TP communication overhead is negligible even over InfiniBand (0.3% of decode time) because the message is tiny (16 KB per all-reduce). TP communication becomes significant during prefill with long prompts: at S=32K the message is $32768 \times 8192 \times 2 = 512$ MB per all-reduce, taking 1.14 ms over NVLink, which matters when prefill compute per layer takes roughly 1 ms.
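The 16 KB vs 512 MB gap falls directly out of the per-all-reduce message size, which scales linearly with the number of tokens in flight (helper name is illustrative):

```python
def tp_allreduce_bytes(seq_tokens, hidden_dim, dtype_bytes=2):
    """Per-all-reduce message size for TP: one activation row per token."""
    return seq_tokens * hidden_dim * dtype_bytes

# Decode (1 token) vs. a 32K-token prefill, hidden_dim=8192, FP16:
decode_msg = tp_allreduce_bytes(1, 8192)       # 16 KiB
prefill_msg = tp_allreduce_bytes(32768, 8192)  # 512 MiB
print(decode_msg, prefill_msg)
```

The same two all-reduces per layer fire in both phases; only the token count changes, which is why TP is communication-cheap at decode and communication-heavy at long-prompt prefill.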

Pipeline Parallelism for Inference

Pipeline parallelism (PP) assigns different layers to different GPUs. For PP=4 with an 80-layer model: GPU 0 runs layers 0-19, GPU 1 runs layers 20-39, GPU 2 runs layers 40-59, GPU 3 runs layers 60-79.

Basic PP Implementation

class PPInferenceEngine:
    """Pipeline-parallel inference engine."""

    def __init__(self, model_layers, pp_rank, pp_size, pp_group):
        self.pp_rank = pp_rank
        self.pp_size = pp_size
        self.pp_group = pp_group

        # This rank's layers
        layers_per_rank = len(model_layers) // pp_size
        start = pp_rank * layers_per_rank
        end = start + layers_per_rank
        self.local_layers = model_layers[start:end]

        # First rank has embedding, last rank has LM head
        self.has_embedding = (pp_rank == 0)
        self.has_lm_head = (pp_rank == pp_size - 1)

    def forward(self, input_data):
        """Forward pass for this PP stage.
        Receives activations from previous stage, sends to next."""

        if self.has_embedding:
            # First stage: embed input tokens
            hidden_states = self.embedding(input_data)
        else:
            # Receive activations from previous stage
            hidden_states = self._recv_activation()

        # Process local layers
        for layer in self.local_layers:
            hidden_states = layer(hidden_states)

        if self.has_lm_head:
            # Last stage: compute logits
            logits = self.lm_head(hidden_states)
            return logits
        else:
            # Send activations to next stage
            self._send_activation(hidden_states)
            return None

    def _send_activation(self, tensor):
        """Send activation to next PP rank."""
        dist.send(tensor, dst=self.pp_rank + 1, group=self.pp_group)

    def _recv_activation(self):
        """Receive activation from previous PP rank."""
        # Must know the shape in advance
        buffer = torch.empty(self.activation_shape, device="cuda",
                              dtype=torch.float16)
        dist.recv(buffer, src=self.pp_rank - 1, group=self.pp_group)
        return buffer

The PP Bubble Problem for Inference

With a single request, PP has a critical inefficiency: only one GPU is active at a time. While GPU 0 processes layers 0-19, GPUs 1-3 sit idle, so the pipeline takes the sum of all PP stage times to process one request.

Single request, PP=4:

Time -->
GPU 0: [Stage 0] ........  ........  ........
GPU 1: ........  [Stage 1] ........  ........
GPU 2: ........  ........  [Stage 2] ........
GPU 3: ........  ........  ........  [Stage 3]

Utilization: 25% (each GPU active 1/4 of the time)
Latency: 4x that of a single GPU (if the model fit on one GPU)

Micro-Batching for PP

To improve utilization, split the batch into micro-batches and pipeline them:

class PPMicroBatchEngine:
    """Pipeline parallelism with micro-batching."""

    def __init__(self, model_layers, pp_rank, pp_size, num_micro_batches=4):
        self.pp_rank = pp_rank
        self.pp_size = pp_size
        self.num_micro = num_micro_batches

        layers_per_rank = len(model_layers) // pp_size
        start = pp_rank * layers_per_rank
        end = start + layers_per_rank
        self.local_layers = model_layers[start:end]

    def forward_pipeline(self, full_batch):
        """Process full batch as micro-batches through the pipeline."""

        # Split batch into micro-batches
        micro_batches = torch.chunk(full_batch, self.num_micro, dim=0)
        outputs = [None] * self.num_micro

        # Pipeline schedule: pipelined forwards only (inference has no
        # backward pass, so this is 1F1B reduced to its forward half)
        for step in range(self.num_micro + self.pp_size - 1):
            micro_idx = step - self.pp_rank

            if 0 <= micro_idx < self.num_micro:
                if self.pp_rank == 0:
                    hidden = self.embedding(micro_batches[micro_idx])
                else:
                    hidden = self._recv_activation()

                for layer in self.local_layers:
                    hidden = layer(hidden)

                if self.pp_rank == self.pp_size - 1:
                    outputs[micro_idx] = self.lm_head(hidden)
                else:
                    self._send_activation(hidden)

        if self.pp_rank == self.pp_size - 1:
            return torch.cat([o for o in outputs if o is not None], dim=0)
        return None
4 micro-batches, PP=4:

Time -->
GPU 0: [M0] [M1] [M2] [M3] ....  ....  ....
GPU 1: ....  [M0] [M1] [M2] [M3] ....  ....
GPU 2: ....  ....  [M0] [M1] [M2] [M3] ....
GPU 3: ....  ....  ....  [M0] [M1] [M2] [M3]

Pipeline fill: 3 stages
Pipeline drain: 3 stages
Steady state: 4 micro-batches

Utilization: 4 / (4 + 3) = 57%
With 16 micro-batches: 16 / (16 + 3) = 84%
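The fill/drain arithmetic above reduces to a one-line utilization formula, m / (m + p - 1):

```python
def pipeline_utilization(num_micro_batches, pp_size):
    """Fraction of time each stage is busy with m micro-batches on p stages.

    The pipeline runs for m + (p - 1) steps; (p - 1) of them are the
    fill/drain bubble, so each stage does useful work m out of m + p - 1 steps.
    """
    m, p = num_micro_batches, pp_size
    return m / (m + p - 1)

print(round(pipeline_utilization(4, 4), 2))   # 0.57
print(round(pipeline_utilization(16, 4), 2))  # 0.84
```

With a single micro-batch this recovers the 25% utilization from the single-request diagram, and utilization approaches 100% only as m grows much larger than p.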

When TP Wins vs When PP Wins

def compare_tp_pp(model_config, hardware_config, batch_config):
    """Compare TP and PP latency for a given configuration."""
    d = model_config["hidden_dim"]
    L = model_config["num_layers"]
    N = hardware_config["num_gpus"]

    # Weight bytes per layer
    w_per_layer = 32 * d * d  # Approximate, FP16

    # TP: each GPU has w_per_layer/N weights, 2 all-reduces per layer
    tp_weight_per_gpu = w_per_layer / N
    tp_bw_time = tp_weight_per_gpu / hardware_config["hbm_bw"]
    tp_ar_time = 2 * d * batch_config["batch_tokens"] * 2 / hardware_config["interconnect_bw"]
    tp_layer_time = tp_bw_time + tp_ar_time
    tp_total = L * tp_layer_time

    # PP: each GPU has w_per_layer * (L/N) weights, 1 send per stage
    pp_weight_per_gpu = w_per_layer  # Full layer weight
    pp_layers_per_gpu = L // N
    pp_bw_time = pp_weight_per_gpu / hardware_config["hbm_bw"]
    pp_layer_time = pp_bw_time  # No all-reduce
    pp_stage_time = pp_layers_per_gpu * pp_layer_time
    pp_send_time = d * batch_config["batch_tokens"] * 2 / hardware_config["interconnect_bw"]

    # PP with micro-batching
    pp_num_micro = batch_config.get("num_micro_batches", N)
    pp_total = pp_stage_time * pp_num_micro + (N - 1) * pp_stage_time + (N - 1) * pp_send_time

    # PP per-token (average)
    pp_per_token = pp_total / (batch_config["batch_tokens"] * pp_num_micro)

    return {
        "tp_total_ms": tp_total * 1000,
        "pp_total_ms": pp_total * 1000,
        "tp_wins": tp_total < pp_total,
        "tp_advantage": "Low latency, all GPUs work on every token",
        "pp_advantage": "Less communication, works over slow interconnect",
    }

TP vs PP: Latency Comparison (Llama 70B, 8 GPUs)

| Configuration | TP=8 decode latency | PP=8 decode latency | Winner |
|---|---|---|---|
| NVLink, batch=1 | 4.2 ms | 33.8 ms (sequential) | TP (8x faster) |
| NVLink, batch=128 | 4.5 ms/tok | 4.5 ms/tok (pipelined) | Tie |
| InfiniBand, batch=1 | 4.3 ms | 34.5 ms | TP (8x faster) |
| InfiniBand, batch=128 | 4.6 ms/tok | 4.8 ms/tok | TP (slight) |
| 100G Ethernet, batch=1 | 5.1 ms | 35.2 ms | TP (7x faster) |
| 100G Ethernet, batch=128 | 5.8 ms/tok | 4.9 ms/tok | PP (faster) |

The pattern is clear:

  • TP wins for latency because all GPUs work on every layer simultaneously. The single-request latency is $L \times (t_{\text{compute}}/N + t_{\text{all-reduce}})$.
  • PP wins for throughput over slow interconnects because it only sends activations once per stage (not 2x per layer). With micro-batching, the pipeline stays full.
  • TP requires fast interconnect (NVLink or InfiniBand) because it communicates 2L times per forward pass. PP communicates only P-1 times.

Single-Request Latency: TP vs PP vs TP+PP (Llama 405B, 16 GPUs)

| Metric | TP=16 | PP=16 | TP=8,PP=2 | TP=4,PP=4 | TP=2,PP=8 |
|---|---|---|---|---|---|
| Decode latency (ms), batch=1 | 5.2 | 132 | 8.8 | 15.4 | 35.2 |

Combined TP+PP for 405B Models

For Llama 405B on a 2-node cluster with 8 H100s per node (16 GPUs total):

  • TP=16: requires all 16 GPUs in one all-reduce group. Cross-node all-reduce over InfiniBand is slow.
  • PP=16: 16 pipeline stages, terrible latency for single requests.
  • TP=8 PP=2: TP within each node (NVLink), PP across nodes (InfiniBand). Best of both worlds.
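Under the TP-major rank layout assumed here (consecutive global ranks share a node), splitting a global rank into its TP and PP coordinates is plain modulo arithmetic:

```python
def rank_to_tp_pp(global_rank, tp_size):
    """Map a global rank to (tp_rank, pp_rank) under a TP-major layout:
    consecutive ranks form one TP group (one node); pp_rank selects the node."""
    return global_rank % tp_size, global_rank // tp_size

# 16 GPUs, TP=8 PP=2: ranks 0-7 are PP stage 0 (node 0), ranks 8-15 stage 1.
layout = [rank_to_tp_pp(r, 8) for r in range(16)]
print(layout[0], layout[7], layout[8], layout[15])
```

This layout is what keeps every TP all-reduce on NVLink: the eight ranks of each TP group are physically the eight GPUs of one node.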

class TPPPInferenceEngine:
    """Combined TP+PP inference engine for multi-node serving."""

    def __init__(self, model, tp_size=8, pp_size=2, global_rank=0):
        self.tp_size = tp_size
        self.pp_size = pp_size

        # Determine this rank's role
        self.tp_rank = global_rank % tp_size
        self.pp_rank = global_rank // tp_size

        # Communication groups
        # TP group: ranks on the same node
        self.tp_group = self._create_tp_group()
        # PP group: corresponding ranks across nodes
        self.pp_group = self._create_pp_group()

        # This rank's model shard
        total_layers = model.config.num_hidden_layers
        layers_per_pp = total_layers // pp_size
        layer_start = self.pp_rank * layers_per_pp
        layer_end = layer_start + layers_per_pp

        self.layers = []
        for i in range(layer_start, layer_end):
            # Each layer is TP-sharded across tp_size GPUs
            self.layers.append(
                TPTransformerLayer(
                    model.layers[i].get_tp_shard(self.tp_rank, tp_size),
                    tp_rank=self.tp_rank,
                    tp_size=tp_size,
                    tp_group=self.tp_group,
                )
            )

    def forward(self, input_data):
        """Forward pass through TP+PP engine.
        TP all-reduces happen within each layer (NVLink, fast).
        PP activation transfers happen between stages (InfiniBand, once per stage).
        """
        if self.pp_rank == 0:
            hidden = self.embed_tp(input_data)  # TP-parallel embedding
        else:
            hidden = self._pp_recv()  # Receive from previous PP stage

        for layer in self.layers:
            hidden = layer.forward(hidden)  # Includes TP all-reduces

        if self.pp_rank == self.pp_size - 1:
            logits = self.lm_head_tp(hidden)  # TP-parallel LM head
            return logits
        else:
            self._pp_send(hidden)  # Send to next PP stage
            return None

Communication Analysis for TP=8 PP=2

def tppp_communication_analysis():
    """Communication breakdown for Llama 405B, TP=8 PP=2."""
    d = 16384  # 405B hidden dim
    L = 126    # 405B layers
    layers_per_pp = L // 2  # 63 layers per PP stage

    # TP all-reduces (within node, NVLink 900 GB/s)
    # 2 per layer, msg = batch_tokens * d * 2 bytes
    tp_ar_count = 2 * layers_per_pp  # 126 per stage
    # Decode batch=1: msg = 1 * 16384 * 2 = 32 KB
    tp_ar_time_each = 2 * 7/8 * 32768 / (900e9)  # 0.064 us
    tp_ar_total = tp_ar_count * tp_ar_time_each  # 8.1 us

    # PP activation transfer (across nodes, InfiniBand 50 GB/s)
    # 1 transfer, msg = batch_tokens * d * 2 bytes = 32 KB
    pp_transfer_time = 32768 / (50e9)  # 0.66 us
    # But InfiniBand has ~1us latency overhead
    pp_total = 0.66e-6 + 1e-6  # 1.66 us

    return {
        "tp_ar_total_us": tp_ar_total * 1e6,
        "pp_transfer_us": pp_total * 1e6,
        "total_comm_us": (tp_ar_total + pp_total) * 1e6,
    }

Llama 405B Communication Overhead: Different TP+PP Configurations

| Config | TP comm (per fwd) | PP comm (per fwd) | Total comm | % of decode |
|---|---|---|---|---|
| TP=16 PP=1 | Cross-node AR: 0.4 ms | None | 0.4 ms | 5.8% |
| TP=8 PP=2 | Intra-node AR: 0.008 ms | 1 IB send: 0.002 ms | 0.010 ms | 0.15% |
| TP=4 PP=4 | Intra-node AR: 0.006 ms | 3 IB sends: 0.006 ms | 0.012 ms | 0.18% |
| TP=2 PP=8 | Intra-node AR: 0.004 ms | 7 IB sends: 0.012 ms | 0.016 ms | 0.24% |
Note: TP=8 PP=2 is the sweet spot for 2-node 405B serving: TP stays within the NVLink domain (900 GB/s), while PP crosses the node boundary only once (InfiniBand, 50 GB/s). Total communication overhead is under 0.2% of decode time. TP=16 PP=1 forces cross-node all-reduces, increasing communication overhead by 40x.

vLLM’s TP Implementation

# Simplified from vLLM's distributed implementation
class VLLMTPWorker:
    """vLLM tensor-parallel worker (symmetric architecture in v1)."""

    def __init__(self, tp_rank, tp_size, model_config):
        self.tp_rank = tp_rank
        self.tp_size = tp_size

        # Initialize NCCL
        self.tp_group = dist.new_group(ranks=list(range(tp_size)))

        # Load model shard for this rank
        self.model = self._load_model_shard(model_config)

    def _load_model_shard(self, config):
        """Load only this rank's weight shard from safetensors."""
        from safetensors import safe_open

        # Each safetensors file may contain all ranks' weights
        # We load only our shard based on naming convention
        state_dict = {}
        for path in config.model_paths:
            with safe_open(str(path), framework="pt",
                           device=f"cuda:{self.tp_rank}") as f:
                for name in f.keys():
                    tensor = f.get_tensor(name)
                    # Shard column-parallel weights along the output dim
                    if self._is_column_parallel(name):
                        chunk = tensor.shape[1] // self.tp_size
                        tensor = tensor[:, self.tp_rank * chunk:(self.tp_rank + 1) * chunk]
                    # Shard row-parallel weights along the input dim
                    elif self._is_row_parallel(name):
                        chunk = tensor.shape[0] // self.tp_size
                        tensor = tensor[self.tp_rank * chunk:(self.tp_rank + 1) * chunk, :]
                    state_dict[name] = tensor

        return state_dict

    def execute_model(self, batch):
        """All TP ranks execute the same batch synchronously.
        No explicit scheduling coordination needed (symmetric workers)."""
        # Every rank processes the same batch
        # All-reduces inside the model keep ranks synchronized
        logits = self.model.forward(
            batch.input_ids,
            batch.positions,
            self.kv_caches,
            batch.attn_metadata,
        )
        return logits

Choosing Your Parallelism Strategy

def recommend_parallelism(model_size_gb, num_gpus, gpu_hbm_gb,
                           intra_node_bw_gbps, inter_node_bw_gbps,
                           gpus_per_node, latency_priority=True):
    """Recommend TP/PP split."""

    # Minimum GPUs for weights
    min_gpus = max(1, int(model_size_gb / (gpu_hbm_gb * 0.6) + 0.5))

    if num_gpus <= gpus_per_node:
        # Single node: pure TP (NVLink is fast)
        return {"tp": num_gpus, "pp": 1, "reason": "Single node, NVLink available"}

    if latency_priority:
        # Multi-node, latency-sensitive: TP within node, PP across nodes
        num_nodes = num_gpus // gpus_per_node
        tp = gpus_per_node
        pp = num_nodes
        return {"tp": tp, "pp": pp,
                "reason": "Latency priority: TP within NVLink, PP across IB"}
    else:
        # Multi-node, throughput priority: can use more PP
        # PP allows more micro-batching = higher throughput
        tp = min(gpus_per_node, min_gpus)
        pp = num_gpus // tp
        return {"tp": tp, "pp": pp,
                "reason": "Throughput priority: smaller TP for less comm"}

# Examples:
# 70B on 8 H100 (1 node):    TP=8, PP=1
# 70B on 16 H100 (2 nodes):  TP=8, PP=2 (latency) or TP=4, PP=4 (throughput)
# 405B on 16 H100 (2 nodes): TP=8, PP=2
# 405B on 32 H100 (4 nodes): TP=8, PP=4

Recommended Configurations for Common Models

| Model | GPUs | Nodes | TP | PP | Decode latency (B=1) |
|---|---|---|---|---|---|
| Llama 8B | 1 | 1 | 1 | 1 | 12 ms |
| Llama 70B | 8 | 1 | 8 | 1 | 4.2 ms |
| Llama 70B | 16 | 2 | 8 | 2 | 4.5 ms |
| Llama 405B | 16 | 2 | 8 | 2 | 6.8 ms |
| Llama 405B | 32 | 4 | 8 | 4 | 7.2 ms |

Expert Parallelism for MoE Models

Mixture-of-Experts models (DeepSeek V3, Mixtral) add a third dimension: expert parallelism (EP). Each GPU holds a subset of experts, and tokens are routed to the appropriate GPU based on the gating function’s output.

class ExpertParallelLayer:
    """Expert parallelism for MoE inference."""

    def __init__(self, experts, ep_rank, ep_size, ep_group):
        self.ep_rank = ep_rank
        self.ep_size = ep_size
        self.ep_group = ep_group

        # This rank's local experts
        experts_per_rank = len(experts) // ep_size
        start = ep_rank * experts_per_rank
        self.local_experts = experts[start:start + experts_per_rank]

    def forward(self, hidden_states, router_logits):
        """MoE forward with expert parallelism.
        1. Router selects top-K experts per token
        2. All-to-all sends tokens to the GPU holding their expert
        3. Each GPU runs its local experts
        4. All-to-all sends results back
        """
        # Router: select top-2 experts per token
        routing_weights, expert_indices = self._route(router_logits)

        # Prepare dispatch: group tokens by destination GPU
        tokens_per_rank = self._compute_dispatch(
            hidden_states, expert_indices
        )

        # All-to-all: send tokens to expert-owning GPUs
        received_tokens = self._all_to_all(tokens_per_rank)

        # Execute local experts on received tokens
        local_outputs = {}
        for expert_id, tokens in received_tokens.items():
            local_expert_idx = expert_id - self.ep_rank * len(self.local_experts)
            local_outputs[expert_id] = self.local_experts[local_expert_idx](tokens)

        # All-to-all: send results back to original GPUs
        result_tokens = self._all_to_all_reverse(local_outputs)

        # Combine expert outputs with routing weights
        output = self._combine(result_tokens, routing_weights)
        return output

For a model like DeepSeek V3 (256 experts, top-8 routing), the typical configuration on a 2-node, 16-GPU cluster is:

  • TP=1 or 2: Within-layer weight splitting (for the shared attention layers)
  • EP=8 or 16: Expert distribution across GPUs
  • PP=1 or 2: Pipeline across nodes if needed

The all-to-all communication pattern in EP is different from TP’s all-reduce: it sends different data to different GPUs rather than aggregating the same data. This means EP bandwidth scales with the number of tokens routed cross-GPU, which depends on the load balance of the router.
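A toy dispatch count (assuming the contiguous expert placement used in the sketch above) makes the router-balance dependence concrete: a balanced router spreads tokens evenly, while one hot expert concentrates traffic on a single rank.

```python
from collections import Counter

def dispatch_counts(expert_ids, num_experts, ep_size):
    """Tokens each EP rank receives under contiguous expert placement.

    expert_ids: flat list of selected expert indices (one per token-expert
    pair after top-K routing). Rank r owns experts
    [r * num_experts/ep_size, (r+1) * num_experts/ep_size).
    """
    experts_per_rank = num_experts // ep_size
    return Counter(e // experts_per_rank for e in expert_ids)

# 8 experts on 4 EP ranks. Balanced routing vs. a hot expert (id 0):
print(dispatch_counts([0, 1, 2, 3, 4, 5, 6, 7], 8, 4))  # 2 tokens per rank
print(dispatch_counts([0, 0, 0, 0, 0, 1, 6, 7], 8, 4))  # rank 0 gets 6 of 8
```

In the skewed case, rank 0's expert GEMMs (and its share of the all-to-all) dominate the layer's latency while the other ranks wait, which is why MoE serving stacks invest in router load balancing.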

Parallelism Strategies: Communication Pattern Comparison

| Metric | TP all-reduce | PP activation send | EP all-to-all | TP+PP combined |
|---|---|---|---|---|
| Communication volume per layer (MB, Llama 70B, batch=128) | 2 | 2 | 8 | 2 |

The rule of thumb: TP within the NVLink domain, PP across nodes. TP gives the lowest latency (all GPUs contribute to every token) but requires high-bandwidth interconnect for its frequent all-reduces; PP needs minimal interconnect bandwidth (one activation transfer per stage) but adds pipeline latency. For production serving, TP=8 PP=2 on 2-node H100 clusters has become the de facto standard for 405B-class models, delivering sub-7 ms decode latency with less than 0.2% communication overhead. For MoE models, expert parallelism adds a third axis that interacts with TP and PP; the optimal configuration depends on the expert count, the top-K routing value, and the balance between shared (attention) and expert (FFN) computation.