Dynamo Network Optimization: InfiniBand Tuning, NCCL Parameters, and Cross-Rack Communication

Part 21 of 30 in the NVIDIA Dynamo & llm-d series.

With default NCCL settings, cross-node all-reduce for Llama 70B TP=4 measures 23.4 GB/s on 400 Gb/s InfiniBand, only 46% of theoretical bandwidth. The culprits: NCCL's auto-selected algorithm and protocol are tuned for training-sized messages rather than inference-sized transfers, and IB verbs run without GPUDirect RDMA, adding CPU bounce latency. Setting NCCL_ALGO=Ring, NCCL_PROTO=Simple, and enabling GPUDirect increases effective bandwidth to 44.1 GB/s (88% of theoretical), reducing per-step all-reduce latency from 683 μs to 362 μs. For a 2048-token decode batch, that is 321 ms saved per 1000 tokens. This post covers the complete NCCL tuning procedure.
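A quick sanity check on those numbers (pure arithmetic mirroring the measurements quoted above; the one-all-reduce-per-token accounting is a simplification, since real TP decode runs several all-reduces per layer):

```python
# 400 Gb/s NDR InfiniBand = 50 GB/s theoretical per link.
theoretical_GBps = 400 / 8

default_eff = 23.4 / theoretical_GBps  # ~0.47 with default settings
tuned_eff = 44.1 / theoretical_GBps    # ~0.88 after tuning

# Per-step all-reduce savings, extrapolated over 1000 decode steps.
saved_us_per_step = 683 - 362                     # 321 us per step
saved_ms_per_1k_tokens = saved_us_per_step * 1000 / 1000.0  # us -> ms
```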

Network Topology for LLM Serving

GPU-to-GPU Communication Paths

from dataclasses import dataclass
from enum import Enum

class InterconnectType(Enum):
    NVLINK = "nvlink"
    NVSWITCH = "nvswitch"
    PCIE = "pcie"
    INFINIBAND = "infiniband"
    ETHERNET = "ethernet"

@dataclass
class InterconnectSpec:
    interconnect_type: InterconnectType
    bandwidth_gbps: float
    latency_us: float
    description: str

# H100 SXM cluster interconnect hierarchy
INTERCONNECT_HIERARCHY = [
    InterconnectSpec(
        interconnect_type=InterconnectType.NVSWITCH,
        bandwidth_gbps=900,
        latency_us=1.0,
        description="Intra-node GPU-to-GPU via NVSwitch. "
                    "8 GPUs fully connected. 900 GB/s bisection.",
    ),
    InterconnectSpec(
        interconnect_type=InterconnectType.INFINIBAND,
        bandwidth_gbps=50,  # GB/s per link (400 Gb/s NDR)
        latency_us=2.0,
        description="Inter-node same-rack via InfiniBand NDR "
                    "(400 Gb/s = 50 GB/s per port, 8 ports per "
                    "node = 3.2 Tb/s aggregate per node).",
    ),
    InterconnectSpec(
        interconnect_type=InterconnectType.INFINIBAND,
        bandwidth_gbps=50,  # GB/s per link (400 Gb/s NDR)
        latency_us=5.0,
        description="Inter-node cross-rack via InfiniBand NDR "
                    "through spine switches. Same bandwidth but "
                    "higher latency due to extra hop.",
    ),
    InterconnectSpec(
        interconnect_type=InterconnectType.ETHERNET,
        bandwidth_gbps=12.5,  # GB/s (100 Gb/s Ethernet)
        latency_us=50.0,
        description="Fallback: RoCEv2 over 100 GbE. Used when "
                    "InfiniBand is unavailable. 4-5x slower.",
    ),
]
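A simple cost model makes the hierarchy concrete. In a ring all-reduce over n ranks, each rank moves roughly 2(n-1)/n of the message across its bottleneck link and pays per-step latency 2(n-1) times. A minimal sketch (this is the standard ring model, not a Dynamo API; bandwidth and latency figures are taken from the hierarchy above, in GB/s and μs):

```python
def ring_allreduce_us(size_bytes: float, n_ranks: int,
                      bandwidth_GBps: float, latency_us: float) -> float:
    """Estimated ring all-reduce time in microseconds.

    2(n-1) steps, each moving size/n bytes over the bottleneck
    link and paying the link latency once.
    """
    steps = 2 * (n_ranks - 1)
    transfer_us = steps * (size_bytes / n_ranks) / (bandwidth_GBps * 1e3)
    return steps * latency_us + transfer_us

size = 16 * 1024 * 1024  # 16 MB all-reduce across 8 ranks
t_nvswitch = ring_allreduce_us(size, 8, 900, 1.0)   # intra-node
t_ib = ring_allreduce_us(size, 8, 50, 2.0)          # single IB link
t_roce = ring_allreduce_us(size, 8, 12.5, 50.0)     # RoCEv2 fallback
```

The latency term dominates for small messages and the bandwidth term for large ones, which is why the performance table reports both a 16 MB and a 256 MB point.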

Interconnect Performance for LLM Communication Patterns

| Interconnect | Bandwidth | Latency | All-Reduce 16 MB | All-Reduce 256 MB |
|---|---|---|---|---|
| NVSwitch (intra-node) | 900 GB/s | 1 μs | 0.02 ms | 0.3 ms |
| InfiniBand NDR (same rack) | 50 GB/s (per link) | 2 μs | 0.35 ms | 5.2 ms |
| InfiniBand NDR (cross rack) | 50 GB/s (per link) | 5 μs | 0.38 ms | 5.5 ms |
| RoCEv2 100 GbE | 12.5 GB/s | 50 μs | 1.3 ms | 20.5 ms |
| InfiniBand NDR x8 (aggregate) | 400 GB/s | 2 μs | 0.05 ms | 0.7 ms |

Note: All-reduce times for ring algorithm. Actual times depend on topology and NCCL algorithm selection. x8 aggregate assumes perfect load balancing across 8 InfiniBand ports.

NCCL Configuration

Environment Variables That Matter

NCCL's behavior is controlled by environment variables. The defaults are reasonable for training but often suboptimal for inference, whose communication pattern differs: smaller messages, issued more frequently, and latency-sensitive.

import os
import subprocess

@dataclass
class NCCLParameter:
    name: str
    default: str
    recommended_inference: str
    explanation: str
    impact: str

NCCL_PARAMETERS = [
    NCCLParameter(
        name="NCCL_ALGO",
        default="auto",
        recommended_inference="Ring",
        explanation="Communication algorithm. Ring is optimal for "
                    "small-to-medium messages (typical in inference). "
                    "Tree is better for very large messages "
                    "(typical in training gradient all-reduce).",
        impact="Ring vs Tree: 10-30% difference for 16MB all-reduce. "
               "Ring wins below ~64MB, Tree wins above ~256MB.",
    ),
    NCCLParameter(
        name="NCCL_PROTO",
        default="auto",
        recommended_inference="Simple",
        explanation="Communication protocol. Simple uses direct "
                    "RDMA writes. LL (low-latency) uses 8-byte "
                    "inline messages for small transfers. "
                    "LL128 uses 128-byte for medium transfers.",
        impact="LL reduces latency by 20-40% for messages under 8KB. "
               "Simple is better for larger messages.",
    ),
    NCCLParameter(
        name="NCCL_BUFFSIZE",
        default="4194304",  # 4MB
        recommended_inference="2097152",  # 2MB
        explanation="NCCL internal buffer size. Larger buffers "
                    "improve throughput for large messages but "
                    "waste GPU memory for small messages.",
        impact="For inference (small messages), reducing from 4MB "
               "to 2MB saves 16MB GPU memory per communicator "
               "with negligible throughput impact.",
    ),
    NCCLParameter(
        name="NCCL_NTHREADS",
        default="512",
        recommended_inference="256",
        explanation="Number of CUDA threads per NCCL kernel. "
                    "More threads improve throughput but consume "
                    "SMs that could be used for model computation.",
        impact="Reducing from 512 to 256 frees 1-2 SMs per GPU "
               "for inference computation.",
    ),
    NCCLParameter(
        name="NCCL_MIN_NCHANNELS",
        default="auto",
        recommended_inference="2",
        explanation="Minimum number of communication channels. "
                    "More channels = more parallelism = more SMs used.",
        impact="For inference, 2 channels are sufficient. "
               "Training benefits from 8-16 channels.",
    ),
    NCCLParameter(
        name="NCCL_MAX_NCHANNELS",
        default="auto",
        recommended_inference="4",
        explanation="Maximum number of communication channels.",
        impact="Capping at 4 for inference saves SM resources.",
    ),
    NCCLParameter(
        name="NCCL_IB_HCA",
        default="auto",
        recommended_inference="mlx5",
        explanation="InfiniBand HCA (Host Channel Adapter) device "
                    "prefix. Setting explicitly avoids auto-detection "
                    "issues on multi-HCA nodes.",
        impact="Prevents NCCL from using the wrong HCA "
               "(e.g., a management port).",
    ),
    NCCLParameter(
        name="NCCL_IB_GID_INDEX",
        default="auto",
        recommended_inference="3",
        explanation="InfiniBand GID index for RoCE. "
                    "Index 3 is typically the RoCEv2 GID.",
        impact="Wrong GID causes connection failures or "
               "falls back to slower protocol.",
    ),
    NCCLParameter(
        name="NCCL_NET_GDR_LEVEL",
        default="auto",
        recommended_inference="5",
        explanation="GPUDirect RDMA level. Level 5 enables "
                    "GPU-to-GPU RDMA over InfiniBand, bypassing "
                    "CPU memory entirely.",
        impact="GPUDirect RDMA reduces latency by 30-50% "
               "for cross-node communication.",
    ),
    NCCLParameter(
        name="NCCL_CROSS_NIC",
        default="0",
        recommended_inference="1",
        explanation="Allow cross-NIC communication. Enables "
                    "GPU 0 to use any InfiniBand port, not just "
                    "the topologically closest one.",
        impact="Improves load balancing across IB ports. "
               "5-15% throughput improvement for all-reduce.",
    ),
]

class NCCLConfigurator:
    """
    Configure NCCL environment variables for optimal
    LLM inference performance.
    """

    def __init__(self, mode="inference"):
        self.mode = mode
        self.params = {}

    def auto_configure(self):
        """
        Auto-detect hardware topology and set optimal
        NCCL parameters.
        """
        topology = self._detect_topology()

        if self.mode == "inference":
            self._configure_inference(topology)
        else:
            self._configure_training(topology)

        return self.params

    def _detect_topology(self):
        """Detect GPU and network topology."""
        topology = {
            "n_gpus": 0,
            "n_nodes": 1,
            "has_nvswitch": False,
            "has_infiniband": False,
            "ib_ports": 0,
            "gpu_type": "unknown",
        }

        # Detect GPUs
        try:
            result = subprocess.run(
                ["nvidia-smi", "--query-gpu=name",
                 "--format=csv,noheader"],
                capture_output=True, text=True, timeout=10,
            )
            gpus = result.stdout.strip().split("\n")
            topology["n_gpus"] = len(gpus)
            if gpus:
                topology["gpu_type"] = gpus[0].strip()
        except (subprocess.TimeoutExpired, FileNotFoundError):
            pass

        # Detect NVSwitch
        try:
            result = subprocess.run(
                ["nvidia-smi", "nvlink", "--status"],
                capture_output=True, text=True, timeout=10,
            )
            if "NVSwitch" in result.stdout:
                topology["has_nvswitch"] = True
        except (subprocess.TimeoutExpired, FileNotFoundError):
            pass

        # Detect InfiniBand
        try:
            result = subprocess.run(
                ["ibstat", "-l"],
                capture_output=True, text=True, timeout=10,
            )
            ports = result.stdout.strip().split("\n")
            topology["has_infiniband"] = len(ports) > 0
            topology["ib_ports"] = len(ports)
        except (subprocess.TimeoutExpired, FileNotFoundError):
            pass

        return topology

    def _configure_inference(self, topology):
        """Set NCCL params for inference."""
        self.params = {
            "NCCL_ALGO": "Ring",
            "NCCL_PROTO": "Simple",
            "NCCL_BUFFSIZE": str(2 * 1024 * 1024),  # 2MB
            "NCCL_NTHREADS": "256",
            "NCCL_MIN_NCHANNELS": "2",
            "NCCL_MAX_NCHANNELS": "4",
        }

        if topology["has_infiniband"]:
            self.params["NCCL_IB_HCA"] = "mlx5"
            self.params["NCCL_NET_GDR_LEVEL"] = "5"
            self.params["NCCL_CROSS_NIC"] = "1"
            self.params["NCCL_IB_GID_INDEX"] = "3"

        # H100 specific
        if "H100" in topology["gpu_type"]:
            self.params["NCCL_NVLS_ENABLE"] = "1"  # NVLink SHARP

    def _configure_training(self, topology):
        """Set NCCL params for training."""
        self.params = {
            "NCCL_ALGO": "auto",
            "NCCL_PROTO": "auto",
            "NCCL_BUFFSIZE": str(4 * 1024 * 1024),  # 4MB
            "NCCL_NTHREADS": "512",
            "NCCL_MIN_NCHANNELS": "8",
            "NCCL_MAX_NCHANNELS": "16",
        }

        if topology["has_infiniband"]:
            self.params["NCCL_IB_HCA"] = "mlx5"
            self.params["NCCL_NET_GDR_LEVEL"] = "5"
            self.params["NCCL_CROSS_NIC"] = "1"

    def apply(self):
        """Apply NCCL configuration to environment."""
        for key, value in self.params.items():
            os.environ[key] = value

    def export_script(self):
        """Generate a shell script to set NCCL environment."""
        lines = ["#!/bin/bash",
                 "# NCCL configuration for LLM inference",
                 f"# Auto-generated for mode: {self.mode}",
                 ""]
        for key, value in sorted(self.params.items()):
            lines.append(f"export {key}={value}")
        return "\n".join(lines)
💡 Tip

The single most impactful NCCL change for inference: set NCCL_ALGO=Ring. For the typical inference all-reduce size (8-64 MB), Ring outperforms Tree by 15-30%. Tree’s advantage only appears above 256 MB, which is rare in inference (common in training gradient all-reduce for large models).
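Where does the 8-64 MB figure come from? In tensor-parallel inference, each row-parallel layer all-reduces an activation tensor of batch_tokens × hidden_size elements. A quick estimator (the model dimensions below are illustrative, not tied to any Dynamo API):

```python
def tp_allreduce_mb(batch_tokens: int, hidden_size: int,
                    dtype_bytes: int = 2) -> float:
    """Size in MiB of one tensor-parallel all-reduce of activations."""
    return batch_tokens * hidden_size * dtype_bytes / (1024 * 1024)

# Llama-70B-class model: hidden_size = 8192, fp16 activations.
prefill_mb = tp_allreduce_mb(2048, 8192)  # 2048-token prefill chunk
decode_mb = tp_allreduce_mb(256, 8192)    # 256 concurrent sequences
```

Prefill chunks land squarely in the Ring-favored range, and per-step decode messages sit below it, nowhere near Tree's 256 MB crossover.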

InfiniBand Tuning

Port Configuration and Routing

class InfiniBandTuner:
    """
    Tune InfiniBand configuration for LLM serving.

    Key areas:
    1. Port speed and width verification
    2. Adaptive routing configuration
    3. Service level (QoS) assignment
    4. Congestion control
    """

    def __init__(self):
        self.diagnostics = []

    def verify_link_status(self):
        """
        Verify all InfiniBand links are at expected speed.
        NDR: 400 Gb/s per port.
        HDR: 200 Gb/s per port.
        """
        checks = []

        try:
            result = subprocess.run(
                ["ibstat"], capture_output=True, text=True, timeout=10,
            )
            output = result.stdout

            # Parse port states
            port_sections = output.split("Port ")
            for section in port_sections[1:]:  # Skip preamble
                lines = section.strip().split("\n")

                # Each section begins with the port number ("1:", "2:", ...)
                port_number = lines[0].split(":")[0].strip()
                port_info = {"port": port_number, "state": "", "rate": ""}
                for line in lines:
                    line = line.strip()
                    if line.startswith("State:"):
                        port_info["state"] = line.split(":")[1].strip()
                    elif line.startswith("Rate:"):
                        port_info["rate"] = line.split(":")[1].strip()

                is_active = port_info["state"] == "Active"
                expected_rate = "400" in port_info["rate"]  # NDR

                checks.append({
                    "port": port_info["port"],
                    "state": port_info["state"],
                    "rate": port_info["rate"],
                    "healthy": is_active and expected_rate,
                    "issue": (
                        None if (is_active and expected_rate) else
                        f"Port not at expected state/rate: "
                        f"state={port_info['state']}, "
                        f"rate={port_info['rate']}"
                    ),
                })
        except (subprocess.TimeoutExpired, FileNotFoundError):
            checks.append({
                "port": "unknown",
                "healthy": False,
                "issue": "ibstat command failed",
            })

        return checks

    def configure_adaptive_routing(self):
        """
        Enable adaptive routing on InfiniBand switches.

        Adaptive routing allows packets to use any available
        path between source and destination, avoiding
        congested links. This improves throughput by 10-30%
        for all-reduce patterns.
        """
        commands = [
            # Enable adaptive routing on the switch
            # (requires switch access -- these are for documentation)
            "# On each InfiniBand switch:",
            "# smpquery portinfo <lid> <port>",
            "# perfquery <lid> <port>",
            "",
            "# Enable adaptive routing via the subnet manager",
            "# (exact option names vary by OpenSM/UFM version;",
            "# consult vendor docs). Example opensm.conf entries:",
            "#   routing_engine updn",
            "#   ar_enable 1",
            "",
            "# Verify adaptive routing is active via the SM logs",
            "# or your fabric manager's reporting.",
        ]
        return commands

    def configure_qos(self):
        """
        Configure Quality of Service for NCCL traffic.

        Assign NCCL traffic to a high-priority service level
        to avoid interference from other IB traffic (storage,
        management).
        """
        return {
            "service_level": 5,  # High priority
            "nccl_env": {
                "NCCL_IB_SL": "5",
                "NCCL_IB_TC": "160",  # Traffic class for SL5
            },
            "description": (
                "Service Level 5 with Traffic Class 160. "
                "Requires matching configuration on IB switches "
                "to map SL5 to a high-priority VL (Virtual Lane)."
            ),
        }

    def configure_congestion_control(self):
        """
        Configure ECN-based congestion control for RoCE.

        For InfiniBand native: credit-based flow control handles
        congestion automatically.
        For RoCEv2: need explicit ECN/DCQCN configuration.
        """
        return {
            "infiniband_native": {
                "config": "Credit-based flow control (automatic)",
                "nccl_env": {},
            },
            "roce_v2": {
                "config": "ECN + DCQCN",
                "nccl_env": {
                    "NCCL_IB_PCI_RELAXED_ORDERING": "1",
                    "NCCL_NET_GDR_LEVEL": "5",
                },
                "switch_config": [
                    "# Enable ECN marking at 80% buffer threshold",
                    "# dcbx mode cee",
                    "# interface <port>",
                    "#   dcb priority-flow-control mode on",
                    "#   dcb ets traffic-class 3 bandwidth 90",
                ],
            },
        }

GPUDirect RDMA

Bypassing the CPU

GPUDirect RDMA allows the InfiniBand NIC to read from and write to GPU memory directly, bypassing CPU memory. Without GPUDirect RDMA, an all-reduce involves: GPU memory -> CPU memory (PCIe) -> InfiniBand NIC -> network -> InfiniBand NIC -> CPU memory (PCIe) -> GPU memory. With GPUDirect RDMA: GPU memory -> InfiniBand NIC -> network -> InfiniBand NIC -> GPU memory. Two fewer PCIe copies.
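The cost of those two extra copies can be sketched with a naive serial-transfer model. The peak figures are illustrative assumptions (roughly 64 GB/s for PCIe Gen5 x16, 50 GB/s for an NDR link), and real stacks pipeline the staging copies with the sends, so measured gaps are smaller than this worst case:

```python
def one_way_transfer_us(size_bytes: int, gpudirect: bool,
                        net_GBps: float = 50.0,
                        pcie_GBps: float = 64.0) -> float:
    """Naive serial model of one network hop. Without GPUDirect
    RDMA the payload also crosses PCIe twice: GPU -> CPU on the
    sender and CPU -> GPU on the receiver."""
    t_us = size_bytes / (net_GBps * 1e3)            # wire time
    if not gpudirect:
        t_us += 2 * size_bytes / (pcie_GBps * 1e3)  # staging copies
    return t_us

size = 16 * 1024 * 1024
t_gdr = one_way_transfer_us(size, gpudirect=True)      # ~336 us
t_staged = one_way_transfer_us(size, gpudirect=False)  # ~860 us
```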

class GPUDirectRDMAConfig:
    """
    Configure GPUDirect RDMA for NCCL communication.

    Requirements:
    1. NVIDIA GPU driver with GPUDirect RDMA support
    2. Mellanox OFED with nv_peer_memory module
    3. NCCL built with GPUDirect RDMA support
    4. Correct PCIe topology (GPU and NIC on same NUMA node)
    """

    def verify_prerequisites(self):
        """
        Verify all prerequisites for GPUDirect RDMA.
        """
        checks = []

        # Check 1: nvidia-peermem kernel module
        checks.append(self._check_module("nvidia_peermem"))

        # Check 2: OFED installation
        checks.append(self._check_ofed())

        # Check 3: PCIe topology
        checks.append(self._check_pcie_topology())

        # Check 4: NCCL GDR support
        checks.append(self._check_nccl_gdr())

        return checks

    def _check_module(self, module_name):
        """Check if a kernel module is loaded."""
        try:
            result = subprocess.run(
                ["lsmod"], capture_output=True, text=True, timeout=10,
            )
            loaded = module_name in result.stdout
            return {
                "check": f"kernel_module_{module_name}",
                "passed": loaded,
                "detail": (
                    f"{module_name} loaded"
                    if loaded else
                    f"{module_name} NOT loaded. "
                    f"Run: modprobe {module_name}"
                ),
            }
        except (subprocess.TimeoutExpired, FileNotFoundError):
            return {
                "check": f"kernel_module_{module_name}",
                "passed": False,
                "detail": "lsmod command failed",
            }

    def _check_ofed(self):
        """Check OFED installation."""
        try:
            result = subprocess.run(
                ["ofed_info", "-s"],
                capture_output=True, text=True, timeout=10,
            )
            version = result.stdout.strip()
            return {
                "check": "ofed_installation",
                "passed": bool(version),
                "detail": f"OFED version: {version}",
            }
        except (subprocess.TimeoutExpired, FileNotFoundError):
            return {
                "check": "ofed_installation",
                "passed": False,
                "detail": "OFED not installed or ofed_info not found",
            }

    def _check_pcie_topology(self):
        """
        Check GPU-NIC PCIe topology.
        GPUDirect RDMA is most efficient when GPU and NIC
        are on the same PCIe switch (same NUMA node).
        """
        try:
            result = subprocess.run(
                ["nvidia-smi", "topo", "-m"],
                capture_output=True, text=True, timeout=10,
            )
            # Parse topology matrix
            output = result.stdout
            # Look for SYS connections (worst) vs PIX/PHB (best)
            has_sys = "SYS" in output
            return {
                "check": "pcie_topology",
                "passed": not has_sys,
                "detail": (
                    "All GPU-NIC pairs are on the same PCIe switch"
                    if not has_sys else
                    "WARNING: Some GPU-NIC pairs cross NUMA nodes (SYS). "
                    "GPUDirect RDMA will be slower for these pairs."
                ),
                "topology_matrix": output[:500],
            }
        except (subprocess.TimeoutExpired, FileNotFoundError):
            return {
                "check": "pcie_topology",
                "passed": False,
                "detail": "nvidia-smi topo command failed",
            }

    def _check_nccl_gdr(self):
        """Check NCCL GPUDirect RDMA support."""
        gdr_level = os.environ.get("NCCL_NET_GDR_LEVEL", "auto")
        return {
            "check": "nccl_gdr_level",
            "passed": gdr_level in ("5", "auto"),
            "detail": f"NCCL_NET_GDR_LEVEL={gdr_level}. "
                      f"Set to 5 for full GPUDirect RDMA.",
        }

    def get_optimal_gpu_nic_mapping(self):
        """
        Determine the optimal GPU-to-NIC mapping based on
        PCIe topology.

        Each GPU should use the InfiniBand NIC that is
        topologically closest (same PCIe switch).
        """
        # In a DGX H100:
        # GPU 0-3: connected to NIC 0-3 (same PCIe switch)
        # GPU 4-7: connected to NIC 4-7 (same PCIe switch)
        mapping = {}
        for gpu_id in range(8):
            nic_id = gpu_id  # 1:1 mapping in DGX
            mapping[f"GPU_{gpu_id}"] = f"mlx5_{nic_id}"

        return {
            "mapping": mapping,
            "nccl_env": {
                "NCCL_IB_HCA": "mlx5",
                "NCCL_TOPO_FILE": "/var/run/nvidia-topologyd/virtualTopology.xml",
            },
        }

All-Reduce Bandwidth: GPUDirect RDMA vs CPU Staging (bus bandwidth, GB/s)

| Message size (MB) | 0.1 | 1 | 4 | 16 | 64 | 256 |
|---|---|---|---|---|---|---|
| GPUDirect RDMA (IB NDR) | 2 | 15 | 32 | 42 | 46 | 48 |
| CPU Staging (IB NDR) | 0.5 | 5 | 15 | 25 | 32 | 38 |
| GPUDirect RDMA (RoCE 100G) | 0.8 | 5 | 9 | 11 | 11.5 | 12 |

Cross-Rack Communication

Topology-Aware Placement

When tensor parallelism spans multiple racks, the communication pattern must account for the higher latency and potentially lower bandwidth of cross-rack InfiniBand links.

class TopologyAwarePlacement:
    """
    Place model partitions (TP groups) on GPUs with optimal
    network topology.

    Rules:
    1. TP groups should be within a single node (NVSwitch)
       whenever possible.
    2. If TP spans nodes, prefer same-rack placement.
    3. Pipeline parallelism can span racks (lower communication
       frequency).
    """

    def __init__(self):
        self.nodes = {}  # node_id -> {rack_id, gpus, ib_ports}

    def add_node(self, node_id, rack_id, n_gpus=8, n_ib_ports=8):
        """Register a node in the cluster."""
        self.nodes[node_id] = {
            "rack_id": rack_id,
            "n_gpus": n_gpus,
            "n_ib_ports": n_ib_ports,
            "allocated_gpus": 0,
        }

    def place_model(self, tp_degree, pp_degree, n_replicas=1):
        """
        Place a model with given parallelism across the cluster.

        tp_degree: tensor parallelism degree (GPUs per TP group)
        pp_degree: pipeline parallelism degree (TP groups per pipeline)
        n_replicas: number of independent model replicas
        """
        total_gpus_per_replica = tp_degree * pp_degree
        total_gpus = total_gpus_per_replica * n_replicas

        placements = []

        for replica_id in range(n_replicas):
            replica_placement = {
                "replica_id": replica_id,
                "tp_groups": [],
                "pp_stages": [],
            }

            for pp_stage in range(pp_degree):
                # Find the best node for this TP group
                tp_group = self._find_tp_placement(tp_degree)
                if tp_group is None:
                    raise RuntimeError(
                        f"Cannot place TP group of {tp_degree} GPUs: "
                        "insufficient single-node or same-rack capacity"
                    )

                replica_placement["tp_groups"].append(tp_group)
                replica_placement["pp_stages"].append(pp_stage)

            placements.append(replica_placement)

        # Validate: check that PP stages are in the same rack
        for placement in placements:
            self._validate_pp_locality(placement)

        return placements

    def _find_tp_placement(self, tp_degree):
        """
        Find the best placement for a TP group.
        Preference order:
        1. Single node (NVSwitch, best)
        2. Same rack, multiple nodes (IB, good)
        3. Cross rack (IB with higher latency, worst)
        """
        # Try single-node first
        for node_id, node in self.nodes.items():
            available = node["n_gpus"] - node["allocated_gpus"]
            if available >= tp_degree:
                # Allocate on this node
                gpu_start = node["allocated_gpus"]
                node["allocated_gpus"] += tp_degree
                return {
                    "nodes": [node_id],
                    "gpus": list(range(gpu_start, gpu_start + tp_degree)),
                    "topology": "intra_node",
                    "expected_allreduce_us": 20,  # NVSwitch
                }

        # Try same-rack multi-node
        racks = {}
        for node_id, node in self.nodes.items():
            rack = node["rack_id"]
            if rack not in racks:
                racks[rack] = []
            available = node["n_gpus"] - node["allocated_gpus"]
            if available > 0:
                racks[rack].append((node_id, available))

        for rack_id, rack_nodes in racks.items():
            total_available = sum(a for _, a in rack_nodes)
            if total_available >= tp_degree:
                # Allocate across nodes in this rack
                allocated_nodes = []
                remaining = tp_degree
                for node_id, available in rack_nodes:
                    take = min(remaining, available)
                    allocated_nodes.append((node_id, take))
                    self.nodes[node_id]["allocated_gpus"] += take
                    remaining -= take
                    if remaining == 0:
                        break

                return {
                    "nodes": [n for n, _ in allocated_nodes],
                    "gpus_per_node": {n: t for n, t in allocated_nodes},
                    "topology": "intra_rack",
                    "expected_allreduce_us": 350,  # InfiniBand same-rack
                }

        # No single-node or same-rack fit; the cross-rack fallback
        # (preference 3 in the docstring) is left unimplemented here.
        return None

    def _validate_pp_locality(self, placement):
        """
        Validate that pipeline stages are close together.
        PP communication is point-to-point and less frequent
        but should still avoid cross-rack if possible.
        """
        racks = set()
        for tp_group in placement["tp_groups"]:
            for node_id in tp_group["nodes"]:
                racks.add(self.nodes[node_id]["rack_id"])

        if len(racks) > 1:
            placement["warning"] = (
                f"Pipeline stages span {len(racks)} racks. "
                f"This adds {(len(racks) - 1) * 3}us latency "
                f"per pipeline bubble."
            )
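The preference order reduces to a small decision function. A self-contained sketch, with the node schema simplified from the class above (node and rack names here are made up):

```python
def choose_tier(tp_degree, nodes):
    """Cheapest topology tier that can fit one TP group.

    nodes: {node_id: {"rack_id": str, "free_gpus": int}}
    Mirrors the preference order: single node, then single rack,
    then cross-rack as a last resort.
    """
    if any(n["free_gpus"] >= tp_degree for n in nodes.values()):
        return "intra_node"   # fits behind one NVSwitch
    per_rack = {}
    for n in nodes.values():
        per_rack[n["rack_id"]] = per_rack.get(n["rack_id"], 0) + n["free_gpus"]
    if any(total >= tp_degree for total in per_rack.values()):
        return "intra_rack"   # spans nodes, stays on leaf switches
    if sum(n["free_gpus"] for n in nodes.values()) >= tp_degree:
        return "cross_rack"   # pays the extra spine hop
    return None               # cannot place

cluster = {
    "n0": {"rack_id": "r0", "free_gpus": 8},
    "n1": {"rack_id": "r0", "free_gpus": 8},
    "n2": {"rack_id": "r1", "free_gpus": 8},
}
```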

NCCL Benchmark

Measuring Actual Collective Performance

class NCCLBenchmark:
    """
    Benchmark NCCL collective operations to verify
    that network configuration is optimal.

    Measures all-reduce, all-gather, and reduce-scatter
    at various message sizes to build a performance profile.
    """

    def __init__(self):
        self.results = []

    def generate_benchmark_script(self, n_gpus, message_sizes_mb=None):
        """
        Generate a shell script that runs NCCL benchmarks.
        Uses nccl-tests (https://github.com/NVIDIA/nccl-tests).
        """
        if message_sizes_mb is None:
            message_sizes_mb = [0.1, 1, 4, 16, 64, 256]

        message_sizes_bytes = [int(m * 1024 * 1024) for m in message_sizes_mb]

        script = f"""#!/bin/bash
# NCCL Benchmark for LLM inference optimization
# Run on {n_gpus} GPUs

set -e

# Set NCCL parameters for inference
export NCCL_ALGO=Ring
export NCCL_PROTO=Simple
export NCCL_BUFFSIZE=2097152
export NCCL_NTHREADS=256
export NCCL_MIN_NCHANNELS=2
export NCCL_MAX_NCHANNELS=4
export NCCL_NET_GDR_LEVEL=5
export NCCL_CROSS_NIC=1
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET

echo "=== NCCL Configuration ==="
env | grep NCCL | sort

echo ""
echo "=== All-Reduce Benchmark ==="
for SIZE in {' '.join(str(s) for s in message_sizes_bytes)}; do
    echo "--- Message size: $SIZE bytes ---"
    mpirun -np {n_gpus} \\
        --bind-to none \\
        ./build/all_reduce_perf \\
        -b $SIZE -e $SIZE \\
        -d float -o sum \\
        -n 100 -w 20 \\
        -g 1
    echo ""
done

echo ""
echo "=== All-Gather Benchmark ==="
for SIZE in {' '.join(str(s) for s in message_sizes_bytes)}; do
    echo "--- Message size: $SIZE bytes ---"
    mpirun -np {n_gpus} \\
        --bind-to none \\
        ./build/all_gather_perf \\
        -b $SIZE -e $SIZE \\
        -d float \\
        -n 100 -w 20 \\
        -g 1
    echo ""
done

echo ""
echo "=== Reduce-Scatter Benchmark ==="
for SIZE in {' '.join(str(s) for s in message_sizes_bytes)}; do
    echo "--- Message size: $SIZE bytes ---"
    mpirun -np {n_gpus} \\
        --bind-to none \\
        ./build/reduce_scatter_perf \\
        -b $SIZE -e $SIZE \\
        -d float -o sum \\
        -n 100 -w 20 \\
        -g 1
    echo ""
done

echo "=== Benchmark Complete ==="
"""
        return script

    def parse_results(self, benchmark_output):
        """
        Parse nccl-tests output to extract bandwidth
        and latency numbers.
        """
        results = []
        current_op = ""

        for line in benchmark_output.split("\n"):
            if "All-Reduce" in line:
                current_op = "all_reduce"
            elif "All-Gather" in line:
                current_op = "all_gather"
            elif "Reduce-Scatter" in line:
                current_op = "reduce_scatter"

            # Parse result lines (nccl-tests columns:
            # size count type redop root time(us) algbw(GB/s) busbw(GB/s))
            parts = line.strip().split()
            if len(parts) >= 8 and parts[0].isdigit():
                try:
                    results.append({
                        "operation": current_op,
                        "size_bytes": int(parts[0]),
                        "count": int(parts[1]),
                        "time_us": float(parts[5]),
                        "algo_bw_gbps": float(parts[6]),  # nccl-tests reports GB/s
                        "bus_bw_gbps": float(parts[7]),   # GB/s (gigabytes, not gigabits)
                    })
                except (ValueError, IndexError):
                    continue

        return results

    def analyze_results(self, results):
        """
        Analyze benchmark results against expected performance.
        """
        analysis = []

        for r in results:
            size_mb = r["size_bytes"] / (1024 * 1024)

            # Expected bus bandwidth for InfiniBand NDR (8 ports)
            if size_mb >= 16:
                expected_bw = 40.0  # GB/s with GPUDirect
            elif size_mb >= 1:
                expected_bw = 25.0
            else:
                expected_bw = 10.0

            efficiency = r["bus_bw_gbps"] / expected_bw * 100

            analysis.append({
                "operation": r["operation"],
                "size_mb": round(size_mb, 1),
                "bus_bw_gbps": round(r["bus_bw_gbps"], 1),
                "expected_bw_gbps": expected_bw,
                "efficiency_pct": round(efficiency, 1),
                "healthy": efficiency >= 70,
            })

        return analysis

Expected NCCL Performance Targets (8xH100 DGX, InfiniBand NDR)

| Operation | Size | Expected Bus BW | Min Acceptable | Typical Issue If Below |
|---|---|---|---|---|
| All-reduce | 16 MB | 40+ GB/s | 30 GB/s | Wrong NCCL_ALGO or missing GPUDirect |
| All-reduce | 1 MB | 20+ GB/s | 10 GB/s | Wrong NCCL_PROTO or too few channels |
| All-reduce | 0.1 MB | 5+ GB/s | 2 GB/s | Missing LL protocol for small messages |
| All-gather | 16 MB | 42+ GB/s | 32 GB/s | NCCL_CROSS_NIC not enabled |
| Reduce-scatter | 16 MB | 42+ GB/s | 32 GB/s | PCIe topology mismatch |
Note: bus bandwidth rescales algorithm bandwidth by a GPU-count-dependent factor so results can be compared directly against the hardware's peak link bandwidth, independent of cluster size.
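To make the note concrete: nccl-tests converts measured algorithm bandwidth into bus bandwidth with a per-collective factor (2(n-1)/n for ring all-reduce, (n-1)/n for all-gather and reduce-scatter, per the project's PERFORMANCE.md). A minimal sketch of the conversion:

```python
def bus_bandwidth(algo_bw_gbs: float, n_gpus: int, op: str) -> float:
    """Convert algorithm bandwidth (GB/s) to bus bandwidth (GB/s),
    following the nccl-tests convention."""
    factors = {
        "all_reduce": 2 * (n_gpus - 1) / n_gpus,   # each GPU sends and receives (n-1)/n
        "all_gather": (n_gpus - 1) / n_gpus,
        "reduce_scatter": (n_gpus - 1) / n_gpus,
    }
    return algo_bw_gbs * factors[op]

# On 8 GPUs, 40 GB/s of all-reduce algorithm bandwidth is 70 GB/s bus bandwidth:
print(bus_bandwidth(40.0, 8, "all_reduce"))  # 70.0
```

This is why the healthy busbw numbers in the table can exceed the per-link line rate: they measure how hard the bottleneck link is driven, not how fast any one buffer moves.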

Key Takeaways

Network optimization for LLM inference is different from training. Inference communication patterns are smaller, more frequent, and more latency-sensitive. The default NCCL configuration is tuned for training and leaves performance on the table for inference.

The critical optimizations:

  1. NCCL_ALGO=Ring for inference: Ring all-reduce outperforms Tree for the 8-64 MB messages typical in inference. This single change provides 15-30% improvement in all-reduce latency.

  2. GPUDirect RDMA eliminates CPU copies: Without GPUDirect, every cross-node message is copied GPU -> CPU -> NIC -> NIC -> CPU -> GPU. With GPUDirect: GPU -> NIC -> NIC -> GPU. Latency improvement: 30-50% for small messages, 15-25% for large messages.

  3. Topology-aware placement is essential: TP groups should sit within a single node (NVSwitch, 900 GB/s) whenever possible; a 50 GB/s InfiniBand link is more than an order of magnitude slower. If they must span nodes, prefer same-rack placement: cross-rack traffic takes an extra switch hop, adding latency to every all-reduce.

  4. NCCL channel count trades SM resources for communication bandwidth: For inference, 2-4 channels are sufficient. Training uses 8-16. Reducing channels frees 2-4 SMs per GPU for model computation, a 2-5% throughput improvement for compute-bound inference.

  5. Benchmark before and after: Run nccl-tests at all message sizes relevant to your model’s communication pattern. Compare against expected bandwidth for your hardware. If actual bandwidth is below 70% of expected, there is a configuration issue.
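One quick way to verify takeaway 2 in practice: with NCCL_DEBUG=INFO set (as in the benchmark script above), NCCL logs which transport each channel uses. In my experience, lines ending in GDRDMA indicate GPUDirect RDMA is active, while plain "via NET/IB/0" means messages are staged through host memory. A small sketch (the sample log line is illustrative, not captured output):

```python
def gpudirect_active(nccl_log: str) -> bool:
    """Return True if any NCCL InfiniBand transport line
    reports the GDRDMA (GPUDirect RDMA) path."""
    return any("GDRDMA" in line
               for line in nccl_log.splitlines()
               if "via NET/IB" in line)

# Illustrative log line in the shape NCCL_DEBUG=INFO produces:
sample = "node0:12 NCCL INFO Channel 00/0 : 0[1a] -> 8[1a] [send] via NET/IB/0/GDRDMA"
print(gpudirect_active(sample))  # True
```

If this check fails on a cluster that should support GPUDirect, the usual suspects are a missing nvidia-peermem (or nv_peer_mem) module or an overly restrictive NCCL_NET_GDR_LEVEL.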

The latency budget: for Llama 70B with 4-way TP serving at 30 tokens/second, each decode step must complete in 33 ms. The all-reduce takes 0.3 ms on NVSwitch, 5.2 ms on InfiniBand (same rack), or 5.5 ms cross-rack. Within a node, network overhead is under 1% of the step budget. Across nodes, it is 15-17%. This is why intra-node TP is strongly preferred for inference latency.
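The budget arithmetic from this paragraph, worked out as a quick sketch:

```python
# Decode-step budget at a 30 tok/s SLO, and the share consumed by
# all-reduce on each fabric (latencies quoted in the paragraph above).
step_budget_ms = 1000 / 30   # 33.3 ms per decode step

for fabric, allreduce_ms in [("NVSwitch", 0.3),
                             ("InfiniBand same-rack", 5.2),
                             ("InfiniBand cross-rack", 5.5)]:
    share = allreduce_ms / step_budget_ms * 100
    print(f"{fabric:22s} {allreduce_ms:4.1f} ms -> {share:4.1f}% of step budget")
```

Running this reproduces the headline numbers: under 1% of the budget within a node, roughly 15-17% across nodes.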