Introduction
Every distributed training job — whether it is data-parallel gradient synchronization, tensor-parallel all-reduce within a layer, or pipeline-parallel point-to-point transfers — depends on NVIDIA’s Collective Communications Library (NCCL) for its GPU-to-GPU communication. NCCL is the invisible plumbing that determines whether your 256-GPU training run scales at 85% efficiency or 55% efficiency. The difference is often not your model or your framework but whether NCCL is using the right algorithm, the right protocol, and the right network paths for your specific hardware topology.
This post covers NCCL’s internals at the level needed to actually diagnose and fix performance problems: how it selects algorithms, how it detects topology, what the tuning environment variables do (and when to override NCCL’s defaults), how SHARP in-network computing changes the picture, and how to troubleshoot the dreaded NCCL hang.
NCCL Algorithm Selection: Ring vs Tree vs CollNet
NCCL implements several algorithms for collective operations. The choice of algorithm dramatically affects both latency and bandwidth utilization, and NCCL’s automatic selection does not always get it right.
Ring All-Reduce
The ring algorithm organizes k GPUs into a logical ring. For an all-reduce of a buffer of N bytes:
- The buffer is split into k chunks of size N/k.
- In the reduce-scatter phase, each GPU sends one chunk to its ring neighbor and receives one chunk from the other neighbor, accumulating (reducing) as it goes. After k-1 steps, each GPU holds the fully reduced version of one chunk.
- In the all-gather phase, each GPU sends its fully reduced chunk around the ring. After another k-1 steps, every GPU has the complete reduced result.
The total data transferred per GPU is 2N(k-1)/k, which approaches 2N for large k. This is optimal — no algorithm can do better in terms of total data volume. The effective (bus) bandwidth achieved by a ring on a message of size N is:
BW = 2N(k-1) / (k * T)
where T is the time to complete the all-reduce.
Strengths: Optimal bandwidth utilization for large messages. Simple, predictable behavior. Works well on any topology.
Weaknesses: Latency scales linearly with the GPU count k, because data must traverse all GPUs. For small messages, the per-step latency overhead dominates.
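The two phases can be sketched in plain Python. This is an illustrative toy (chunk count equals GPU count, synchronous steps); real NCCL pipelines smaller chunks across multiple channels, but the data movement pattern is the same:

```python
def ring_all_reduce(buffers):
    """All-reduce (elementwise sum) over a list of equal-length buffers."""
    k = len(buffers)
    n = len(buffers[0])
    assert n % k == 0, "buffer must split evenly into k chunks"
    chunk = n // k
    bufs = [list(b) for b in buffers]

    def sl(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. k-1 steps; at step t, rank r sends chunk
    # (r - t) mod k to rank r+1, which accumulates it. Payloads are
    # snapshotted first to model simultaneous sends.
    for step in range(k - 1):
        payload = [bufs[r][sl((r - step) % k)] for r in range(k)]
        for r in range(k):
            dst, c = (r + 1) % k, (r - step) % k
            s = sl(c)
            bufs[dst][s] = [a + b for a, b in zip(bufs[dst][s], payload[r])]

    # After phase 1, rank r owns the fully reduced chunk (r + 1) mod k.
    # Phase 2: all-gather. Another k-1 steps circulate the reduced chunks.
    for step in range(k - 1):
        payload = [bufs[r][sl((r + 1 - step) % k)] for r in range(k)]
        for r in range(k):
            dst, c = (r + 1) % k, (r + 1 - step) % k
            bufs[dst][sl(c)] = payload[r]
    return bufs

# 4 "GPUs", 8-element buffers: every rank ends with the elementwise sum.
inputs = [[r * 10 + i for i in range(8)] for r in range(4)]
result = ring_all_reduce(inputs)
```

Counting the sends makes the cost model concrete: each rank transmits 2(k-1) chunks of N/k bytes, i.e. 2N(k-1)/k bytes total.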
Tree All-Reduce
The tree algorithm organizes GPUs into a binary (or higher-arity) tree:
- In the reduce phase, leaf nodes send their data to their parent, which reduces it with its own data and propagates upward. After log2(k) steps, the root has the fully reduced result.
- In the broadcast phase, the root sends the reduced result back down the tree.
The total data transferred per GPU is approximately 2N (same as ring), but the latency is O(log k) instead of O(k).
Strengths: Logarithmic latency — critical for large GPU counts and smaller messages. Naturally maps to hierarchical topologies (intra-node NVLink + inter-node IB).
Weaknesses: Not all links are used simultaneously in every step, so actual bandwidth utilization can be lower than ring for medium-sized messages on uniform topologies. Tree construction must match the physical topology for best results.
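The ring/tree trade-off can be captured with a simple alpha-beta cost model. The parameters here (per-step latency alpha, per-link bandwidth bw) are assumptions for illustration, not NCCL's internal tuning tables:

```python
import math

def ring_time(n_bytes, k, alpha, bw):
    # 2*(k-1) steps, each moving a chunk of n/k bytes over one link
    steps = 2 * (k - 1)
    return steps * alpha + steps * (n_bytes / k) / bw

def tree_time(n_bytes, k, alpha, bw):
    # Reduce up plus broadcast down a binary tree: ~2*log2(k) latency
    # terms, with roughly the full buffer crossing each GPU's link twice.
    steps = 2 * math.ceil(math.log2(k))
    return steps * alpha + 2 * n_bytes / bw

def ring_beats_tree_at(k, alpha, bw, max_bytes=1 << 34):
    """Smallest power-of-two message size where ring becomes faster."""
    n = 1024
    while n < max_bytes and ring_time(n, k, alpha, bw) >= tree_time(n, k, alpha, bw):
        n *= 2
    return n
```

The model reproduces the qualitative behavior in the benchmark table below it: tree wins at small sizes (fewer latency terms), ring wins at large sizes (slightly less data per link). The exact crossover point depends heavily on the assumed alpha.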
CollNet (Collective Network / SHARP)
CollNet is NCCL’s interface to in-network computing hardware, specifically Mellanox/NVIDIA SHARP (Scalable Hierarchical Aggregation and Reduction Protocol). With SHARP, the network switches themselves perform the reduction:
- Each GPU sends its data to the top-of-rack (ToR) switch via InfiniBand.
- The switches perform the reduction in their ASICs.
- The reduced result is sent back to all GPUs.
This halves the network traffic (no need for a separate all-gather phase — the switches broadcast the result directly) and achieves sub-microsecond reduction latency within the switch fabric.
Strengths: Highest bandwidth for large clusters with SHARP-capable switches. Offloads compute from GPUs.
Weaknesses: Requires specific Mellanox ConnectX-6 or later NICs and Quantum switches with SHARP enabled. Not available on most cloud providers. Configuration complexity is high.
All-Reduce Algorithm Comparison (8x A100 NVLink, NCCL 2.21)
| Message Size | Ring BW (GB/s) | Tree BW (GB/s) | Ring Latency (us) | Tree Latency (us) | Better |
|---|---|---|---|---|---|
| 8 KB | 0.9 | 1.8 | 9.1 | 4.5 | Tree |
| 64 KB | 5.2 | 8.1 | 12.3 | 7.9 | Tree |
| 512 KB | 28.4 | 31.2 | 18.0 | 16.4 | Tree |
| 4 MB | 95.3 | 88.7 | 42.0 | 45.1 | Ring |
| 32 MB | 142.1 | 138.5 | 225 | 231 | Ring |
| 256 MB | 155.2 | 152.8 | 1,649 | 1,675 | Ring |
On typical 8-GPU NVLink nodes, tree outperforms ring for messages smaller than approximately 1-4 MB, and ring matches or slightly beats tree for larger messages. NCCL’s default threshold (NCCL_TREE_THRESHOLD) is usually set reasonably, but for workloads dominated by small all-reduces (e.g., tensor parallelism with TP=8), explicitly forcing tree with NCCL_ALGO=TREE can provide 10-20% improvement.
Topology Detection
NCCL’s performance depends critically on understanding the physical connectivity between GPUs. At initialization, NCCL builds a topology graph by probing the system.
What NCCL Discovers
NCCL constructs a graph of:
- GPU-to-GPU NVLink connections: Which GPUs are directly connected, how many NVLink lanes, and the bandwidth per lane. On DGX A100, each GPU has 12 NVLink 3.0 lanes at 25 GB/s each for 600 GB/s total per GPU (bidirectional).
- PCIe topology: Which GPUs share a PCIe switch, which share a CPU socket, and how many PCIe hops separate each GPU pair.
- NIC affinity: Which InfiniBand or Ethernet NICs are closest (fewest PCIe hops) to each GPU. This determines which NIC a GPU uses for inter-node communication.
- NUMA topology: Which CPU NUMA node each GPU and NIC is attached to. Cross-NUMA memory access adds latency.
You can inspect this topology yourself:
# Show GPU interconnect matrix
nvidia-smi topo -m
# Example output on DGX A100:
# GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 mlx0 mlx1
# GPU0 X NV12 NV12 NV12 NV12 NV12 NV12 NV12 SYS NODE
# GPU1 NV12 X NV12 NV12 NV12 NV12 NV12 NV12 NODE SYS
# ...
#
# NV12 = 12 NVLink connections
# NODE = same NUMA node, no NVLink
# SYS = cross NUMA node, no NVLink
# Show NVLink status and throughput counters
nvidia-smi nvlink --status
nvidia-smi nvlink --capabilities
How Topology Affects Algorithm Selection
NCCL uses the topology graph to:
- Construct optimal rings: For ring all-reduce, NCCL finds rings that maximize the minimum bandwidth across all links in the ring. On NVLink-connected GPUs, this means routing the ring through NVLink paths and avoiding PCIe. On a DGX A100, NCCL typically creates multiple rings (channels) that use different NVLink lanes to achieve aggregate bandwidth.
- Build tree structures: For tree all-reduce, NCCL maps the tree to the physical hierarchy: intra-node NVLink forms the bottom of the tree, inter-node InfiniBand forms the top. This two-level structure is natural for multi-node training.
- Select NIC-GPU affinity: For inter-node communication, NCCL assigns each GPU to communicate through the NIC that is physically closest (fewest PCIe hops). On DGX A100 with 8 NICs and 8 GPUs, this is a 1:1 mapping. On systems with fewer NICs, multiple GPUs share a NIC, creating a potential bottleneck.
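"Maximize the minimum bandwidth across all links" can be illustrated with a brute-force search over ring orderings on a toy 4-GPU topology. The bandwidth matrix below is entirely made up (one NVLink assumed degraded to 25 GB/s); NCCL's actual search is a graph heuristic, not an exhaustive scan:

```python
from itertools import permutations

# Toy bandwidth matrix (GB/s), assumed for illustration: links 0-1, 1-2,
# 2-3 at full 50 GB/s NVLink, link 0-3 degraded to 25 GB/s, all other
# pairs PCIe-only at 16 GB/s.
BW = [
    [0, 50, 16, 25],
    [50, 0, 50, 16],
    [16, 50, 0, 50],
    [25, 16, 50, 0],
]

def ring_bottleneck(order):
    """Minimum link bandwidth along the closed ring visiting GPUs in order."""
    k = len(order)
    return min(BW[order[i]][order[(i + 1) % k]] for i in range(k))

def best_ring(n):
    # Fix GPU 0 in first position so rotations of the same ring are not recounted.
    return max(((0,) + p for p in permutations(range(1, n))), key=ring_bottleneck)
```

On this toy matrix the best ring routes through the degraded 25 GB/s link exactly once and avoids the 16 GB/s PCIe paths entirely, which is why a single degraded NVLink lane caps the whole ring's throughput.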
NCCL’s topology detection can fail or produce suboptimal results when: (1) the system has non-standard PCIe topology that confuses the probing, (2) NVLink is present but a link is degraded (running at reduced lane count), or (3) GPU-NIC affinity is wrong because the BIOS routes PCIe lanes differently than expected. Always verify with NCCL_DEBUG=INFO and check for warnings about NVLink detection or NIC affinity.
NVLink vs InfiniBand vs RoCE
The interconnect technology determines the achievable bandwidth and latency floor for NCCL operations.
NVLink (Intra-Node)
NVLink is NVIDIA’s proprietary GPU-to-GPU interconnect, used within a single node:
| Generation | Per-Lane BW | Lanes per GPU | Aggregate BW/GPU | Systems |
|---|---|---|---|---|
| NVLink 2.0 | 25 GB/s | 6 | 300 GB/s | V100 DGX-1 |
| NVLink 3.0 | 25 GB/s | 12 | 600 GB/s | A100 DGX A100 |
| NVLink 4.0 | 25 GB/s | 18 | 900 GB/s | H100 DGX H100 |
| NVSwitch 3.0 | - | - | 900 GB/s | H100 fully connected |
NVLink provides the highest bandwidth and lowest latency for intra-node communication. On NVSwitch-connected systems (DGX H100), all 8 GPUs are fully connected — every GPU has a direct NVLink path to every other GPU. On older systems without NVSwitch, GPUs are connected in a partial mesh, and some pairs must communicate through intermediate GPUs.
InfiniBand (Inter-Node, RDMA)
InfiniBand is the standard for high-performance inter-node communication in GPU clusters:
| Speed | Link BW | Common Config | Effective BW |
|---|---|---|---|
| HDR | 200 Gb/s (25 GB/s) | 1 NIC/GPU | ~23 GB/s |
| HDR | 200 Gb/s | 2 NIC/GPU (bonded) | ~45 GB/s |
| NDR | 400 Gb/s (50 GB/s) | 1 NIC/GPU | ~47 GB/s |
| NDR | 400 Gb/s | 2 NIC/GPU (bonded) | ~90 GB/s |
NCCL uses InfiniBand through the verbs API with GPU Direct RDMA (GDR), which allows the NIC to read/write GPU memory directly without staging through CPU memory. GDR is critical for performance — without it, every inter-node transfer requires a GPU-to-CPU copy, a network transfer, and a CPU-to-GPU copy on the receiving side.
# Verify GPU Direct RDMA is working
# This should show GDR capability for each NIC-GPU pair
nvidia-smi topo -m | grep -E "GPU|mlx"
# Check InfiniBand link status
ibstat
# Verify port is Active, rate matches expected (e.g., 200 Gb/s for HDR)
# Bandwidth test
ib_write_bw -d mlx5_0 -F --report_gbits -D 5
# Should achieve greater than 180 Gb/s on HDR
RoCE (RDMA over Converged Ethernet)
RoCE provides RDMA semantics over Ethernet, used in cloud environments where InfiniBand is not available (e.g., some AWS, GCP, Azure configurations):
RoCEv2 runs over UDP/IP, enabling routing across L3 networks. Performance:
- Typical bandwidth: 80-90% of InfiniBand at the same link speed
- Latency: 2-5 us higher than InfiniBand
- Congestion sensitivity: RoCE performance degrades sharply under network congestion because it relies on lossless Ethernet (PFC/ECN), which is harder to maintain in shared networks
NCCL automatically detects RoCE NICs and uses them, but the congestion sensitivity means that tuning is more important on RoCE than on InfiniBand.
Inter-Node All-Reduce Bandwidth (2 nodes, 16 GPUs total, 256 MB message)
| Interconnect | Bus BW (GB/s) | Algo BW (GB/s) | Latency (ms) | Efficiency vs Theory |
|---|---|---|---|---|
| NVLink (intra-node baseline) | 155 | 138 | 1.65 | 86% |
| IB HDR (1 NIC/GPU) | 21.5 | 19.1 | 13.4 | 83% |
| IB NDR (1 NIC/GPU) | 43.2 | 38.4 | 6.7 | 85% |
| RoCE 100 GbE | 10.8 | 9.6 | 26.7 | 77% |
| RoCE 200 GbE | 19.5 | 17.3 | 14.8 | 78% |
| IB NDR + SHARP | 58.1 | 51.7 | 4.4 | 90% |
Tuning Environment Variables
NCCL exposes dozens of environment variables for tuning. Most should be left at their defaults, but several are critical for specific scenarios.
Algorithm and Protocol Selection
# Force a specific algorithm
export NCCL_ALGO=RING # Options: RING, TREE, COLLNET
# Default: NCCL auto-selects based on message size and topology
# Force a specific protocol
export NCCL_PROTO=SIMPLE # Options: LL, LL128, SIMPLE
# LL = Low Latency: 8-byte inline data, best for tiny messages (less than 8 KB)
# LL128 = Low Latency 128: 128-byte chunks, good for small-medium (8 KB - 512 KB)
# SIMPLE = Bulk transfer: highest bandwidth for large messages (greater than 512 KB)
# Default: NCCL auto-selects based on message size
When to override: If your workload is dominated by a specific message size range and NCCL’s automatic selection is suboptimal. For example, tensor parallelism with TP=8 on a 70B model produces all-reduce messages of ~8 MB per layer. This is in the range where tree and ring are competitive; profiling with each forced algorithm will tell you which is faster on your specific hardware.
Channel Configuration
# Minimum number of channels (parallel communication streams)
export NCCL_MIN_NCHANNELS=4
# Default: 1 for small messages, scales up automatically
# Maximum number of channels
export NCCL_MAX_NCHANNELS=32
# Default: varies by GPU generation (typically 8-16)
# More channels = more parallelism = higher bandwidth for large messages
# But also more memory usage for channel buffers
# NCCL threads per channel
export NCCL_NTHREADS=512
# Default: 256-512 depending on GPU
# More threads can help saturate high-bandwidth links
When to override: If nccl-tests shows bandwidth well below theoretical, try increasing NCCL_MIN_NCHANNELS. On H100 with NVSwitch, setting NCCL_MIN_NCHANNELS=8 or higher can improve small-message performance by ensuring enough channels are active from the start.
Network Configuration
# GPU Direct RDMA level
export NCCL_NET_GDR_LEVEL=5
# 0 = disabled
# 1 = within same PCI root
# 2 = within same CPU socket
# 3 = across sockets (only on PIX/PHB)
# 5 = always use GDR regardless of topology
# Default: auto-detected
# Force to 5 if GDR is available but NCCL incorrectly falls back to CPU staging
# Network interface selection
export NCCL_SOCKET_IFNAME=^lo,docker0
# Excludes loopback and docker interfaces
# Use specific interface: NCCL_SOCKET_IFNAME=eth0
# InfiniBand HCA selection
export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3
# Explicitly list which HCAs to use
# Useful when some HCAs are reserved for storage traffic
# Disable specific transports
export NCCL_IB_DISABLE=0 # 1 to disable IB (use TCP/sockets instead)
export NCCL_P2P_DISABLE=0 # 1 to disable NVLink/PCIe peer-to-peer
export NCCL_SHM_DISABLE=0 # 1 to disable shared memory (intra-node)
Buffer and Timing
# Communication buffer size per channel
export NCCL_BUFFSIZE=8388608 # 8 MB (default is 4 MB in recent versions)
# Larger buffers can improve bandwidth for large messages
# But consume more GPU memory (buffers * channels * 2 for send/recv)
# Cross-NIC bandwidth (used for CollNet/SHARP planning)
export NCCL_CROSS_NIC=1 # Allow cross-NIC communication patterns
# Timeout for NCCL operations (in seconds)
export NCCL_TIMEOUT=1800 # 30 minutes (default varies)
# Increase for very large all-reduces that may take longer than expected
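The buffer-memory comment above can be turned into a quick calculation. This uses the simple "one send and one receive buffer per channel" accounting from the comment; NCCL's real allocation varies by protocol and version, so treat it as a rough lower bound:

```python
MB = 1 << 20

def nccl_buffer_bytes(buffsize_bytes, nchannels):
    # One send and one receive buffer per channel (simplified accounting).
    return buffsize_bytes * nchannels * 2

# 8 MB buffers with 32 channels -> 512 MB of GPU memory per communicator.
footprint = nccl_buffer_bytes(8 * MB, 32)
```

This is why raising NCCL_BUFFSIZE and NCCL_MAX_NCHANNELS together is not free: on a job that also creates several communicators (e.g., separate TP and DP groups), the buffer memory multiplies again per communicator.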
Impact of Tuning Variables (8x H100 NVLink, 256 MB all-reduce)
| Configuration | Bus BW (GB/s) | Change vs Default | Notes |
|---|---|---|---|
| Default (NCCL auto) | 165 | baseline | NCCL 2.21 |
| NCCL_ALGO=TREE | 161 | -2.4% | Tree slightly worse at 256 MB |
| NCCL_ALGO=RING | 167 | +1.2% | Ring marginally better here |
| NCCL_MIN_NCHANNELS=16 | 172 | +4.2% | More parallelism helps |
| NCCL_BUFFSIZE=16M | 174 | +5.5% | Larger buffers, more BW |
| MIN_NCHANNELS=16 + BUFFSIZE=16M | 178 | +7.9% | Combined effect |
| + NCCL_NTHREADS=512 | 179 | +8.5% | Diminishing returns |
SHARP In-Network Computing
SHARP offloads collective operations to the network switches, fundamentally changing the communication pattern.
How SHARP Works
In a traditional all-reduce, each GPU sends data over the network and receives the reduced result — the network carries roughly 2N(k-1)/k ≈ 2N bytes per GPU for a message of N bytes. With SHARP:
- Each GPU sends its data to the local switch (N bytes upstream).
- The switch reduces the data from all connected ports in its ASIC.
- The switch sends the reduced result back to all GPUs (N bytes downstream).
Total network traffic is halved to N bytes sent (and N received) per GPU, and the reduction happens at switch line rate with sub-microsecond latency.
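The traffic counts above, expressed as code (following the same per-GPU accounting as the text):

```python
def ring_bytes_sent(n, k):
    # Reduce-scatter + all-gather phases of a ring all-reduce.
    return 2 * n * (k - 1) // k

def sharp_bytes_sent(n, k):
    # One upstream copy; the switch broadcasts the reduced result.
    return n

n, k = 256 * (1 << 20), 32
ratio = ring_bytes_sent(n, k) / sharp_bytes_sent(n, k)  # approaches 2 for large k
```

For 32 GPUs the ratio is 2 * 31/32 ≈ 1.94, which is where the "halves the network traffic" claim comes from.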
SHARP Requirements
- Switches: NVIDIA Quantum or Quantum-2 (InfiniBand) with SHARP support
- NICs: ConnectX-6 or later
- Software: SHARP daemon running on switches, HCOLL library on nodes
- NCCL: Must be built with SHARP/CollNet support
# Enable SHARP in NCCL
export NCCL_COLLNET_ENABLE=1
export SHARP_COLL_LOG_LEVEL=3 # For debugging
# Verify SHARP is active (look in NCCL_DEBUG=INFO output)
# NCCL INFO Trees [0] -1/-1/-1->0->1 [1] -1/-1/-1->0->-1 ... CollNet
# The "CollNet" tag indicates SHARP is being used
SHARP Performance Impact
All-Reduce Bandwidth: IB NDR vs IB NDR + SHARP (32 GPUs, 4 nodes)
(chart: algorithm bandwidth in GB/s)

The fact that SHARP can achieve higher bandwidth than the theoretical single-link maximum seems paradoxical. The explanation is that SHARP eliminates the all-gather phase: instead of each GPU receiving the reduced result from its ring/tree neighbor (which requires data to traverse multiple hops), the switch broadcasts the result directly to all ports simultaneously. This effectively doubles the usable bandwidth compared to a two-phase algorithm.
Real Benchmarks: All-Reduce at Different Message Sizes
The following benchmarks use nccl-tests (all_reduce_perf) across a range of message sizes to show how NCCL performance varies. Understanding this curve is essential for predicting how your training workload will perform.
Single Node (8x H100 SXM, NVLink 4.0 + NVSwitch)
Intra-Node All-Reduce (8x H100 NVSwitch)
| Message Size | Bus BW (GB/s) | Algo BW (GB/s) | Latency (us) | Protocol Used |
|---|---|---|---|---|
| 256 B | 0.03 | 0.03 | 8.2 | LL |
| 4 KB | 0.45 | 0.39 | 8.9 | LL |
| 64 KB | 5.8 | 5.1 | 11.0 | LL128 |
| 256 KB | 24.1 | 21.1 | 10.6 | LL128 |
| 1 MB | 78.5 | 68.7 | 12.8 | Simple |
| 4 MB | 148 | 130 | 27.0 | Simple |
| 16 MB | 198 | 173 | 81.0 | Simple |
| 64 MB | 225 | 197 | 283 | Simple |
| 256 MB | 238 | 208 | 1,076 | Simple |
| 1 GB | 241 | 211 | 4,254 | Simple |
Multi-Node (4 nodes, 32x H100, IB NDR 400 Gb/s)
Inter-Node All-Reduce (32x H100, 4 nodes, IB NDR)
| Message Size | Bus BW (GB/s) | Algo BW (GB/s) | Latency (us) | Algorithm |
|---|---|---|---|---|
| 256 B | 0.01 | 0.01 | 28 | Tree |
| 4 KB | 0.12 | 0.11 | 34 | Tree |
| 64 KB | 1.8 | 1.7 | 36 | Tree |
| 256 KB | 6.5 | 6.3 | 40 | Tree |
| 1 MB | 18.2 | 17.6 | 56 | Tree |
| 4 MB | 32.5 | 31.4 | 125 | Ring |
| 16 MB | 40.1 | 38.8 | 405 | Ring |
| 64 MB | 44.8 | 43.4 | 1,451 | Ring |
| 256 MB | 46.2 | 44.7 | 5,628 | Ring |
| 1 GB | 46.8 | 45.3 | 21,985 | Ring |
Key Observations
The benchmarks reveal several important patterns:
- Latency floor: Even for tiny messages, NCCL cannot achieve sub-8-us latency intra-node or sub-28-us inter-node. This is the fixed overhead of kernel launch, synchronization, and network setup. For all-reduce messages smaller than ~64 KB, you are latency-bound regardless of bandwidth.
- Bandwidth ramp: Bandwidth increases with message size because the fixed latency overhead is amortized over more data. The bandwidth saturates around 256 MB - 1 GB for both intra-node and inter-node.
- The intra/inter gap: Intra-node bandwidth (241 GB/s) is ~5x inter-node bandwidth (47 GB/s) with NDR InfiniBand. This gap dictates the optimal parallelism strategy: keep the highest-bandwidth communication (tensor parallelism all-reduce) intra-node, and push only the lower-bandwidth communication (data parallelism gradient sync) across nodes.
- Algorithm crossover: For inter-node communication, NCCL switches from tree to ring around 1-4 MB. The tree's logarithmic latency advantage matters for small messages; the ring's optimal bandwidth utilization matters for large messages.
Troubleshooting NCCL Hangs
NCCL hangs are the most painful distributed training failure mode. The job appears to be running (GPUs are allocated, processes are alive) but makes no progress. There is no error message, no crash, no stack trace — just silence.
Common Causes
Rank desynchronization: One rank enters a collective operation while another rank enters a different collective, or enters the same collective with different parameters. NCCL has no timeout by default (though recent versions added NCCL_TIMEOUT), so mismatched collectives hang forever.
# Enable NCCL debug logging to see which collective each rank is executing
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,COLL,NET
# Look for ranks entering different operations:
# rank 0: ncclAllReduce(sendbuff=0x..., count=1048576, ...)
# rank 1: ncclAllReduce(sendbuff=0x..., count=524288, ...) # DIFFERENT COUNT!
Network failures: A network link goes down mid-operation. InfiniBand link flaps, RoCE PFC storms, or switch port errors can cause one rank to stall waiting for data that will never arrive.
# Check IB error counters
perfquery -x # Extended port counters
# Look for:
# PortRcvErrors, PortRcvRemotePhysicalErrors
# PortXmitDiscards, ExcessiveBufferOverrunErrors
# LinkErrorRecoveryCounter, LinkDownedCounter
# Check for link flaps
ibdiagnet # Comprehensive IB fabric diagnostics
GPU memory OOM during NCCL init: NCCL allocates communication buffers during initialization. If GPU memory is nearly full, the allocation can fail silently (in older NCCL versions) or cause a hang during the first collective.
Firewall or security group blocking: For socket-based communication (the bootstrap connection that coordinates NCCL initialization), firewalls can block the required TCP ports.
Diagnostic Procedure
- Set NCCL_DEBUG=INFO and redirect output to per-rank log files:
export NCCL_DEBUG=INFO
export NCCL_DEBUG_FILE=/tmp/nccl_debug.%h.%p
# %h is replaced by the hostname, %p by the process ID (one file per process)
- Set NCCL_TIMEOUT to get an error instead of a hang:
export NCCL_TIMEOUT=300 # 5-minute timeout
# After timeout, NCCL will abort with an error message
# indicating which operation timed out and on which rank
- Use py-spy or gdb to inspect hanging processes:
# Attach to the hanging Python process
py-spy dump --pid <PID>
# Look for the stack trace -- is it stuck in:
# ncclAllReduce? -> communication hang
# cudaStreamSynchronize? -> GPU kernel hang
# at init_process_group? -> bootstrap connection issue
# For C++ level debugging
gdb -p <PID>
# (gdb) thread apply all bt
# Look for threads stuck in poll(), ibv_poll_cq(), or ncclProxyService
- Check for asymmetric failures by examining logs from all ranks:
# Compare NCCL init across ranks
grep "NCCL INFO" /tmp/nccl_debug_rank_*.log | sort
# All ranks should report the same:
# - Number of channels
# - Algorithm (Ring/Tree)
# - Number of NVLink/IB connections
# Discrepancies indicate topology detection issues
By default, NCCL operations are asynchronous — they are enqueued on a CUDA stream and may not execute immediately. If an NCCL error occurs during an async operation, it is not reported until the next synchronization point (e.g., cudaStreamSynchronize or another NCCL call). This means the actual error may have occurred hundreds of milliseconds before the hang is observed. Setting NCCL_DEBUG=WARN at minimum ensures errors are logged as soon as NCCL detects them.
Common Fixes
NCCL Hang Causes and Solutions
| Cause | Symptom | Diagnosis | Fix |
|---|---|---|---|
| Rank mismatch | All ranks hang at collective | NCCL_DEBUG shows different ops | Fix application logic |
| IB link down | Some ranks hang, others timeout | ibstat shows port down | Replace cable, reseat NIC |
| PFC storm (RoCE) | Intermittent hangs under load | Switch ECN/PFC counters spiking | Tune ECN thresholds, isolate traffic |
| GPU OOM | Hang during first collective | nvidia-smi shows near 100% memory | Reduce model/batch size; leave headroom for NCCL buffers |
| Firewall | Hang during NCCL init | Timeout on bootstrap TCP connect | Open ports 29400-29500 |
| CUDA error on one rank | Asymmetric hang | NCCL_DEBUG shows CUDA error on one rank | Fix underlying CUDA issue |
| NVLink degraded | Intra-node hang or slowdown | nvidia-smi nvlink shows errors | Reseat GPU, update driver |
Communication-Computation Overlap
Achieving high training efficiency requires hiding communication behind computation. In data-parallel training with DDP, this means starting the gradient all-reduce for early layers while the backward pass is still computing gradients for later layers.
How PyTorch DDP Implements Overlap
PyTorch DDP groups parameters into buckets (default 25 MB each) in reverse parameter order. When the gradient for the last parameter in a bucket is computed, DDP immediately launches the all-reduce for that bucket while the backward pass continues computing gradients for the next bucket.
# Tuning bucket size for overlap
from torch.nn.parallel import DistributedDataParallel as DDP

model = DDP(
model,
device_ids=[local_rank],
bucket_cap_mb=25, # Default 25 MB
# Smaller buckets = more overlap opportunities but more kernel launches
# Larger buckets = fewer launches but less overlap
gradient_as_bucket_view=True, # Avoid gradient copy into bucket
static_graph=True, # Enable optimizations for static computation graphs
)
The optimal bucket size depends on the ratio of computation time to communication time per layer:
- If computation per layer is long relative to all-reduce time, larger buckets are fine because there is ample time to overlap.
- If computation per layer is short (e.g., small models, large batch sizes where compute is fast), smaller buckets enable finer-grained overlap.
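A quick way to reason about this trade-off is to count how many all-reduce launches a given bucket size implies. The model size below is an assumption for illustration:

```python
import math

def ddp_buckets(n_params, dtype_bytes=4, bucket_cap_mb=25):
    """Number of gradient buckets (= all-reduce launches per backward pass)."""
    grad_bytes = n_params * dtype_bytes
    bucket_bytes = bucket_cap_mb * 2**20
    return math.ceil(grad_bytes / bucket_bytes)

# A hypothetical 1.3B-parameter model with FP32 gradients and the default
# 25 MB buckets: 199 all-reduce launches per backward pass.
n_buckets = ddp_buckets(1_300_000_000)
```

Halving bucket_cap_mb doubles the launch count but gives the scheduler twice as many opportunities to start communication early; whether that wins depends on how long each layer's backward computation takes relative to a bucket's all-reduce.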
Measuring Overlap Efficiency
# Profile with Nsight Systems to visualize overlap
nsys profile \
--trace=cuda,nvtx,nccl \
--output=training_profile \
python train.py
# In the timeline view, look for:
# NCCL kernels (labeled ncclKernel_AllReduce_*)
# running concurrently with backward pass GEMM kernels
# Gaps where the GPU is idle waiting for NCCL -- these are overlap failures
The ideal profile shows NCCL kernels running on a separate CUDA stream, fully overlapped with backward pass computation. The worst case shows a sequential pattern: backward pass completes, then all-reduce runs, then the next forward pass begins.
Training Iteration Time Breakdown (GPT-3 175B, 64x H100)
(chart: per-phase time in ms)

Advanced: Hierarchical All-Reduce for Multi-Node
For large clusters, NCCL automatically implements a two-level hierarchical all-reduce:
- Intra-node reduce-scatter: Each GPU reduces with its NVLink-connected peers using the high-bandwidth NVLink fabric. After this step, each GPU holds a partial reduction over 1/8th of the data (for 8 GPUs per node).
- Inter-node all-reduce: Each GPU all-reduces its partial data with the corresponding GPU on other nodes via InfiniBand. The message size per GPU for this step is N/8, which is 8x smaller than the original message.
- Intra-node all-gather: Each GPU broadcasts its fully reduced partial data to its NVLink peers.
This hierarchical decomposition keeps the bulk of the data transfer on the fast NVLink fabric and minimizes the data that must cross the slower inter-node network. For an 8-GPU-per-node configuration, the inter-node traffic is reduced by 8x compared to a flat algorithm.
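The traffic reduction is easy to compute directly (matching the 256 MB / 8-GPUs-per-node example in the table below):

```python
MB = 1 << 20

def hierarchical_inter_bytes(message_bytes, gpus_per_node):
    # After the intra-node reduce-scatter, each GPU carries only its
    # 1/gpus_per_node share of the message across the node boundary.
    return message_bytes // gpus_per_node

# 256 MB message, 8 GPUs per node -> 32 MB per GPU inter-node,
# vs the full 256 MB per GPU for a flat ring spanning all GPUs.
inter = hierarchical_inter_bytes(256 * MB, 8)
```

Because inter-node bandwidth is the scarce resource, this 8x reduction in inter-node bytes is where essentially all of the hierarchical speedup comes from.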
Flat vs Hierarchical All-Reduce (64 GPUs, 8 nodes, 256 MB message)
| Approach | Inter-Node Traffic/GPU | Total Time (ms) | Bus BW (GB/s) |
|---|---|---|---|
| Flat ring (all 64 GPUs) | 256 MB | 12.8 | 40.2 |
| Hierarchical (8 intra + 8 inter) | 32 MB | 5.1 | 100.8 |
| Hierarchical + SHARP inter | 32 MB | 3.8 | 135.2 |
Practical Tuning Workflow
Here is the systematic process for tuning NCCL for a new cluster or workload:
Step 1: Baseline with nccl-tests
# Build nccl-tests
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1 NCCL_HOME=/usr/lib/x86_64-linux-gnu CUDA_HOME=/usr/local/cuda
# Run comprehensive benchmark
mpirun -np 8 --bind-to numa \
./build/all_reduce_perf \
-b 8 -e 1G -f 2 -g 1 \
-w 5 -n 100
# -b 8: start at 8 bytes
# -e 1G: end at 1 GB
# -f 2: multiply by 2 each step
# -g 1: 1 GPU per process
# -w 5: 5 warmup iterations
# -n 100: 100 measurement iterations
Step 2: Compare Against Theoretical Bandwidth
Calculate your expected bandwidth:
- Intra-node: NVLink bandwidth per GPU (e.g., 900 GB/s for H100 NVSwitch), multiplied by the algorithm efficiency factor (typically 0.85-0.90 for large messages).
- Inter-node: IB/RoCE bandwidth per GPU (e.g., 50 GB/s for NDR), multiplied by efficiency (typically 0.80-0.90).
If measured bandwidth is significantly below expected, investigate topology, NIC affinity, and GDR status.
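A minimal sanity-check helper for Step 2, using the efficiency ranges above. The 0.85 efficiency and the 80%-of-target alert threshold are judgment calls, not NCCL constants:

```python
def expected_busbw(link_bw_gbs, efficiency=0.85):
    """Theoretical target bus bandwidth (GB/s) for nccl-tests."""
    return link_bw_gbs * efficiency

def verdict(measured_gbs, link_bw_gbs, efficiency=0.85, slack=0.8):
    # Flag results that fall below 80% of the already-derated target.
    target = expected_busbw(link_bw_gbs, efficiency)
    return "ok" if measured_gbs >= slack * target else "investigate"

# NDR example: 50 GB/s per GPU, 85% efficiency -> ~42.5 GB/s target.
print(verdict(44.7, 50))   # measured busbw from nccl-tests
```

If the verdict is "investigate", the usual suspects in order are: wrong NIC-GPU affinity, GDR silently disabled (check NCCL_DEBUG=INFO for "GDRDMA"), and a degraded NVLink or IB link.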
Step 3: Tune for Your Workload’s Message Size
Determine the actual all-reduce message sizes in your training workload:
# For DDP: message size = bucket size (default 25 MB)
# For FSDP: message size = shard size = total_params / world_size * dtype_bytes
# For tensor parallelism: message size = hidden_dim * batch * seq_len * dtype_bytes
# Example: Llama 70B with TP=8, batch=1, seq=4096, hidden=8192, FP16
tp_allreduce_size = 8192 * 1 * 4096 * 2  # ~64 MB per all-reduce
# (Not divided by TP because the all-reduce operates on the full hidden-dim output)
Then run nccl-tests at that specific message size and tune accordingly.
Step 4: Validate with Actual Training
# Run training with NCCL profiling
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=COLL # Only log collective operations
# Parse logs for timing
grep "AllReduce" nccl_debug.log | awk '{print $NF}'
# Compare against nccl-tests baseline to verify there is no
# application-level overhead degrading NCCL performance
Conclusion
NCCL performance tuning is not glamorous work, but it directly impacts the dollar cost of every training run. The key takeaways:
- Understand your topology. Run nvidia-smi topo -m, verify NVLink connections, check NIC-GPU affinity. Topology misunderstandings are the root cause of most performance issues.
- Know your message sizes. The optimal algorithm, protocol, and buffer configuration all depend on the message size distribution of your workload. Measure it, do not guess.
- Benchmark before and after. Use nccl-tests to establish a baseline, tune variables, and measure the impact. Do not tune blind.
- The hierarchy matters. For multi-node training, the intra-node/inter-node bandwidth gap means that parallelism strategy (what communication goes where) matters more than NCCL tuning variables.
- SHARP is transformative when available. If you have SHARP-capable hardware, enabling CollNet can provide 30-50% bandwidth improvement for inter-node all-reduce. Investigate whether your cluster supports it.
- Hangs are diagnosable. Set NCCL_DEBUG=INFO, set NCCL_TIMEOUT, and examine per-rank logs. Most hangs come from rank desynchronization, network failures, or memory issues — all of which leave traces in the debug output.