Introduction
Every distributed training job — whether it is data-parallel gradient synchronization, tensor-parallel all-reduce within a layer, or pipeline-parallel point-to-point transfers — depends on NVIDIA’s Collective Communications Library (NCCL) for its GPU-to-GPU communication. NCCL is the invisible plumbing that determines whether your 256-GPU training run scales at 85% efficiency or 55% efficiency. The difference is often not your model or your framework but whether NCCL is using the right algorithm, the right protocol, and the right network paths for your specific hardware topology.
This post covers NCCL’s internals at the level needed to actually diagnose and fix performance problems: how it selects algorithms, how it detects topology, what the tuning environment variables do (and when to override NCCL’s defaults), how SHARP in-network computing changes the picture, and how to troubleshoot the dreaded NCCL hang.
NCCL Algorithm Selection: Ring vs Tree vs CollNet
NCCL implements several algorithms for collective operations. The choice of algorithm dramatically affects both latency and bandwidth utilization, and NCCL’s automatic selection does not always get it right.
Ring All-Reduce
The ring algorithm organizes k GPUs into a logical ring. For an all-reduce of a buffer of N bytes:
- The buffer is split into k chunks of size N/k.
- In the reduce-scatter phase, each GPU sends one chunk to its ring neighbor and receives one chunk from the other neighbor, accumulating (reducing) as it goes. After k-1 steps, each GPU holds the fully reduced version of one chunk.
- In the all-gather phase, each GPU sends its fully reduced chunk around the ring. After another k-1 steps, every GPU has the complete reduced result.
The total data transferred per GPU is 2N(k-1)/k, which approaches 2N for large k. This is optimal — no algorithm can do better in terms of total data volume. The effective (bus) bandwidth achieved by a ring on a message of size N is:
BW = 2N(k-1) / (k * T)
where T is the time to complete the all-reduce.
Strengths: Optimal bandwidth utilization for large messages. Simple, predictable behavior. Works well on any topology.
Weaknesses: Latency scales linearly with the GPU count k, because data must traverse all GPUs. For small messages, the per-step latency overhead dominates.
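The two phases can be sketched in plain Python. This is an illustrative toy (chunk count equals GPU count, synchronous steps); real NCCL pipelines smaller chunks across multiple channels, but the data movement pattern is the same:

```python
def ring_all_reduce(buffers):
    """All-reduce (elementwise sum) over a list of equal-length buffers."""
    k = len(buffers)
    n = len(buffers[0])
    assert n % k == 0, "buffer must split evenly into k chunks"
    chunk = n // k
    bufs = [list(b) for b in buffers]

    def sl(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. k-1 steps; at step t, rank r sends chunk
    # (r - t) mod k to rank r+1, which accumulates it. Payloads are
    # snapshotted first to model simultaneous sends.
    for step in range(k - 1):
        payload = [bufs[r][sl((r - step) % k)] for r in range(k)]
        for r in range(k):
            dst, c = (r + 1) % k, (r - step) % k
            s = sl(c)
            bufs[dst][s] = [a + b for a, b in zip(bufs[dst][s], payload[r])]

    # After phase 1, rank r owns the fully reduced chunk (r + 1) mod k.
    # Phase 2: all-gather. Another k-1 steps circulate the reduced chunks.
    for step in range(k - 1):
        payload = [bufs[r][sl((r + 1 - step) % k)] for r in range(k)]
        for r in range(k):
            dst, c = (r + 1) % k, (r + 1 - step) % k
            bufs[dst][sl(c)] = payload[r]
    return bufs

# 4 "GPUs", 8-element buffers: every rank ends with the elementwise sum.
inputs = [[r * 10 + i for i in range(8)] for r in range(4)]
result = ring_all_reduce(inputs)
```

Counting the sends makes the cost model concrete: each rank transmits 2(k-1) chunks of N/k bytes, i.e. 2N(k-1)/k bytes total.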
Tree All-Reduce
The tree algorithm organizes GPUs into a binary (or higher-arity) tree:
- In the reduce phase, leaf nodes send their data to their parent, which reduces it with its own data and propagates upward. After log2(k) steps, the root has the fully reduced result.
- In the broadcast phase, the root sends the reduced result back down the tree.
The total data transferred per GPU is approximately 2N (same as ring), but the latency is O(log k) instead of O(k).
Strengths: Logarithmic latency — critical for large GPU counts and smaller messages. Naturally maps to hierarchical topologies (intra-node NVLink + inter-node IB).
Weaknesses: Not all links are used simultaneously in every step, so actual bandwidth utilization can be lower than ring for medium-sized messages on uniform topologies. Tree construction must match the physical topology for best results.
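The ring/tree trade-off can be captured with a simple alpha-beta cost model. The parameters here (per-step latency alpha, per-link bandwidth bw) are assumptions for illustration, not NCCL's internal tuning tables:

```python
import math

def ring_time(n_bytes, k, alpha, bw):
    # 2*(k-1) steps, each moving a chunk of n/k bytes over one link
    steps = 2 * (k - 1)
    return steps * alpha + steps * (n_bytes / k) / bw

def tree_time(n_bytes, k, alpha, bw):
    # Reduce up plus broadcast down a binary tree: ~2*log2(k) latency
    # terms, with roughly the full buffer crossing each GPU's link twice.
    steps = 2 * math.ceil(math.log2(k))
    return steps * alpha + 2 * n_bytes / bw

def ring_beats_tree_at(k, alpha, bw, max_bytes=1 << 34):
    """Smallest power-of-two message size where ring becomes faster."""
    n = 1024
    while n < max_bytes and ring_time(n, k, alpha, bw) >= tree_time(n, k, alpha, bw):
        n *= 2
    return n
```

The model reproduces the qualitative behavior in the benchmark table below it: tree wins at small sizes (fewer latency terms), ring wins at large sizes (slightly less data per link). The exact crossover point depends heavily on the assumed alpha.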
CollNet (Collective Network / SHARP)
CollNet is NCCL’s interface to in-network computing hardware, specifically Mellanox/NVIDIA SHARP (Scalable Hierarchical Aggregation and Reduction Protocol). With SHARP, the network switches themselves perform the reduction:
- Each GPU sends its data to the top-of-rack (ToR) switch via InfiniBand.
- The switches perform the reduction in their ASICs.
- The reduced result is sent back to all GPUs.
This halves the network traffic (no need for a separate all-gather phase — the switches broadcast the result directly) and achieves sub-microsecond reduction latency within the switch fabric.
Strengths: Highest bandwidth for large clusters with SHARP-capable switches. Offloads compute from GPUs.
Weaknesses: Requires specific Mellanox ConnectX-6 or later NICs and Quantum switches with SHARP enabled. Not available on most cloud providers. Configuration complexity is high.
All-Reduce Algorithm Comparison (8x A100 NVLink, NCCL 2.21)
| Message Size | Ring BW (GB/s) | Tree BW (GB/s) | Ring Latency (us) | Tree Latency (us) | Better |
|---|---|---|---|---|---|
| 8 KB | 0.9 | 1.8 | 9.1 | 4.5 | Tree |
| 64 KB | 5.2 | 8.1 | 12.3 | 7.9 | Tree |
| 512 KB | 28.4 | 31.2 | 18.0 | 16.4 | Tree |
| 4 MB | 95.3 | 88.7 | 42.0 | 45.1 | Ring |
| 32 MB | 142.1 | 138.5 | 225 | 231 | Ring |
| 256 MB | 155.2 | 152.8 | 1,649 | 1,675 | Ring |
On typical 8-GPU NVLink nodes, tree outperforms ring for messages smaller than approximately 1-4 MB, and ring matches or slightly beats tree for larger messages. NCCL’s default threshold (NCCL_TREE_THRESHOLD) is usually set reasonably, but for workloads dominated by small all-reduces (e.g., tensor parallelism with TP=8), explicitly forcing tree with NCCL_ALGO=TREE can provide 10-20% improvement.
Topology Detection
NCCL’s performance depends critically on understanding the physical connectivity between GPUs. At initialization, NCCL builds a topology graph by probing the system.
What NCCL Discovers
NCCL constructs a graph of:
- GPU-to-GPU NVLink connections: Which GPUs are directly connected, how many NVLink lanes, and the bandwidth per lane. On DGX A100, each GPU has 12 NVLink 3.0 lanes at 25 GB/s each for 600 GB/s total per GPU (bidirectional).
- PCIe topology: Which GPUs share a PCIe switch, which share a CPU socket, and how many PCIe hops separate each GPU pair.
- NIC affinity: Which InfiniBand or Ethernet NICs are closest (fewest PCIe hops) to each GPU. This determines which NIC a GPU uses for inter-node communication.
- NUMA topology: Which CPU NUMA node each GPU and NIC is attached to. Cross-NUMA memory access adds latency.
You can inspect this topology yourself:
# Show GPU interconnect matrix
nvidia-smi topo -m
# Example output on DGX A100:
# GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 mlx0 mlx1
# GPU0 X NV12 NV12 NV12 NV12 NV12 NV12 NV12 SYS NODE
# GPU1 NV12 X NV12 NV12 NV12 NV12 NV12 NV12 NODE SYS
# ...
#
# NV12 = 12 NVLink connections
# NODE = same NUMA node, no NVLink
# SYS = cross NUMA node, no NVLink
# Show NVLink status and throughput counters
nvidia-smi nvlink --status
nvidia-smi nvlink --capabilities
How Topology Affects Algorithm Selection
NCCL uses the topology graph to:
- Construct optimal rings: For ring all-reduce, NCCL finds rings that maximize the minimum bandwidth across all links in the ring. On NVLink-connected GPUs, this means routing the ring through NVLink paths and avoiding PCIe. On a DGX A100, NCCL typically creates multiple rings (channels) that use different NVLink lanes to achieve aggregate bandwidth.
- Build tree structures: For tree all-reduce, NCCL maps the tree to the physical hierarchy: intra-node NVLink forms the bottom of the tree, inter-node InfiniBand forms the top. This two-level structure is natural for multi-node training.
- Select NIC-GPU affinity: For inter-node communication, NCCL assigns each GPU to communicate through the NIC that is physically closest (fewest PCIe hops). On DGX A100 with 8 NICs and 8 GPUs, this is a 1:1 mapping. On systems with fewer NICs, multiple GPUs share a NIC, creating a potential bottleneck.
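"Maximize the minimum bandwidth across all links" can be illustrated with a brute-force search over ring orderings on a toy 4-GPU topology. The bandwidth matrix below is entirely made up (one NVLink assumed degraded to 25 GB/s); NCCL's actual search is a graph heuristic, not an exhaustive scan:

```python
from itertools import permutations

# Toy bandwidth matrix (GB/s), assumed for illustration: links 0-1, 1-2,
# 2-3 at full 50 GB/s NVLink, link 0-3 degraded to 25 GB/s, all other
# pairs PCIe-only at 16 GB/s.
BW = [
    [0, 50, 16, 25],
    [50, 0, 50, 16],
    [16, 50, 0, 50],
    [25, 16, 50, 0],
]

def ring_bottleneck(order):
    """Minimum link bandwidth along the closed ring visiting GPUs in order."""
    k = len(order)
    return min(BW[order[i]][order[(i + 1) % k]] for i in range(k))

def best_ring(n):
    # Fix GPU 0 in first position so rotations of the same ring are not recounted.
    return max(((0,) + p for p in permutations(range(1, n))), key=ring_bottleneck)
```

On this toy matrix the best ring routes through the degraded 25 GB/s link exactly once and avoids the 16 GB/s PCIe paths entirely, which is why a single degraded NVLink lane caps the whole ring's throughput.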
NCCL’s topology detection can fail or produce suboptimal results when: (1) the system has non-standard PCIe topology that confuses the probing, (2) NVLink is present but a link is degraded (running at reduced lane count), or (3) GPU-NIC affinity is wrong because the BIOS routes PCIe lanes differently than expected. Always verify with NCCL_DEBUG=INFO and check for warnings about NVLink detection or NIC affinity.
NVLink vs InfiniBand vs RoCE
The interconnect technology determines the achievable bandwidth and latency floor for NCCL operations.
NVLink (Intra-Node)
NVLink is NVIDIA’s proprietary GPU-to-GPU interconnect, used within a single node:
| Generation | Per-Lane BW | Lanes per GPU | Aggregate BW/GPU | Systems |
|---|---|---|---|---|
| NVLink 2.0 | 25 GB/s | 6 | 300 GB/s | V100 DGX-1 |
| NVLink 3.0 | 25 GB/s | 12 | 600 GB/s | A100 DGX A100 |
| NVLink 4.0 | 25 GB/s | 18 | 900 GB/s | H100 DGX H100 |
| NVSwitch 3.0 | - | - | 900 GB/s | H100 fully connected |
NVLink provides the highest bandwidth and lowest latency for intra-node communication. On NVSwitch-connected systems (DGX H100), all 8 GPUs are fully connected — every GPU has a direct NVLink path to every other GPU. On older systems without NVSwitch, GPUs are connected in a partial mesh, and some pairs must communicate through intermediate GPUs.
InfiniBand (Inter-Node, RDMA)
InfiniBand is the standard for high-performance inter-node communication in GPU clusters:
| Speed | Link BW | Common Config | Effective BW |
|---|---|---|---|
| HDR | 200 Gb/s (25 GB/s) | 1 NIC/GPU | ~23 GB/s |
| HDR | 200 Gb/s | 2 NIC/GPU (bonded) | ~45 GB/s |
| NDR | 400 Gb/s (50 GB/s) | 1 NIC/GPU | ~47 GB/s |
| NDR | 400 Gb/s | 2 NIC/GPU (bonded) | ~90 GB/s |
NCCL uses InfiniBand through the verbs API with GPU Direct RDMA (GDR), which allows the NIC to read/write GPU memory directly without staging through CPU memory. GDR is critical for performance — without it, every inter-node transfer requires a GPU-to-CPU copy, a network transfer, and a CPU-to-GPU copy on the receiving side.
# Verify GPU Direct RDMA is working
# This should show GDR capability for each NIC-GPU pair
nvidia-smi topo -m | grep -E "GPU|mlx"
# Check InfiniBand link status
ibstat
# Verify port is Active, rate matches expected (e.g., 200 Gb/s for HDR)
# Bandwidth test
ib_write_bw -d mlx5_0 -F --report_gbits -D 5
# Should achieve greater than 180 Gb/s on HDR
RoCE (RDMA over Converged Ethernet)
RoCE provides RDMA semantics over Ethernet, used in cloud environments where InfiniBand is not available (e.g., some AWS, GCP, Azure configurations):
RoCEv2 runs over UDP/IP, enabling routing across L3 networks. Performance:
- Typical bandwidth: 80-90% of InfiniBand at the same link speed
- Latency: 2-5 us higher than InfiniBand
- Congestion sensitivity: RoCE performance degrades sharply under network congestion because it relies on lossless Ethernet (PFC/ECN), which is harder to maintain in shared networks
NCCL automatically detects RoCE NICs and uses them, but the congestion sensitivity means that tuning is more important on RoCE than on InfiniBand.
Inter-Node All-Reduce Bandwidth (2 nodes, 16 GPUs total, 256 MB message)
| Interconnect | Bus BW (GB/s) | Algo BW (GB/s) | Latency (ms) | Efficiency vs Theory |
|---|---|---|---|---|
| NVLink (intra-node baseline) | 155 | 138 | 1.65 | 86% |
| IB HDR (1 NIC/GPU) | 21.5 | 19.1 | 13.4 | 83% |
| IB NDR (1 NIC/GPU) | 43.2 | 38.4 | 6.7 | 85% |
| RoCE 100 GbE | 10.8 | 9.6 | 26.7 | 77% |
| RoCE 200 GbE | 19.5 | 17.3 | 14.8 | 78% |
| IB NDR + SHARP | 58.1 | 51.7 | 4.4 | 90% |
Tuning Environment Variables
NCCL exposes dozens of environment variables for tuning. Most should be left at their defaults, but several are critical for specific scenarios.
Algorithm and Protocol Selection
# Force a specific algorithm
export NCCL_ALGO=RING # Options: RING, TREE, COLLNET
# Default: NCCL auto-selects based on message size and topology
# Force a specific protocol
export NCCL_PROTO=SIMPLE # Options: LL, LL128, SIMPLE
# LL = Low Latency: 8-byte inline data, best for tiny messages (less than 8 KB)
# LL128 = Low Latency 128: 128-byte chunks, good for small-medium (8 KB - 512 KB)
# SIMPLE = Bulk transfer: highest bandwidth for large messages (greater than 512 KB)
# Default: NCCL auto-selects based on message size
When to override: If your workload is dominated by a specific message size range and NCCL’s automatic selection is suboptimal. For example, tensor parallelism with TP=8 on a 70B model produces all-reduce messages of ~8 MB per layer. This is in the range where tree and ring are competitive; profiling with each forced algorithm will tell you which is faster on your specific hardware.
Channel Configuration
# Minimum number of channels (parallel communication streams)
export NCCL_MIN_NCHANNELS=4
# Default: 1 for small messages, scales up automatically
# Maximum number of channels
export NCCL_MAX_NCHANNELS=32
# Default: varies by GPU generation (typically 8-16)
# More channels = more parallelism = higher bandwidth for large messages
# But also more memory usage for channel buffers
# NCCL threads per channel
export NCCL_NTHREADS=512
# Default: 256-512 depending on GPU
# More threads can help saturate high-bandwidth links
When to override: If nccl-tests shows bandwidth well below theoretical, try increasing NCCL_MIN_NCHANNELS. On H100 with NVSwitch, setting NCCL_MIN_NCHANNELS=8 or higher can improve small-message performance by ensuring enough channels are active from the start.
Network Configuration
# GPU Direct RDMA level
export NCCL_NET_GDR_LEVEL=5
# 0 = disabled
# 1 = within same PCI root
# 2 = within same CPU socket
# 3 = across sockets (only on PIX/PHB)
# 5 = always use GDR regardless of topology
# Default: auto-detected
# Force to 5 if GDR is available but NCCL incorrectly falls back to CPU staging
# Network interface selection
export NCCL_SOCKET_IFNAME=^lo,docker0
# Excludes loopback and docker interfaces
# Use specific interface: NCCL_SOCKET_IFNAME=eth0
# InfiniBand HCA selection
export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3
# Explicitly list which HCAs to use
# Useful when some HCAs are reserved for storage traffic
# Disable specific transports
export NCCL_IB_DISABLE=0 # 1 to disable IB (use TCP/sockets instead)
export NCCL_P2P_DISABLE=0 # 1 to disable NVLink/PCIe peer-to-peer
export NCCL_SHM_DISABLE=0 # 1 to disable shared memory (intra-node)
Buffer and Timing
# Communication buffer size per channel
export NCCL_BUFFSIZE=8388608 # 8 MB (default is 4 MB in recent versions)
# Larger buffers can improve bandwidth for large messages
# But consume more GPU memory (buffers * channels * 2 for send/recv)
# Cross-NIC bandwidth (used for CollNet/SHARP planning)
export NCCL_CROSS_NIC=1 # Allow cross-NIC communication patterns
# Timeout for NCCL operations (in seconds)
export NCCL_TIMEOUT=1800 # 30 minutes (default varies)
# Increase for very large all-reduces that may take longer than expected
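The buffer-memory comment above can be turned into a quick calculation. This uses the simple "one send and one receive buffer per channel" accounting from the comment; NCCL's real allocation varies by protocol and version, so treat it as a rough lower bound:

```python
MB = 1 << 20

def nccl_buffer_bytes(buffsize_bytes, nchannels):
    # One send and one receive buffer per channel (simplified accounting).
    return buffsize_bytes * nchannels * 2

# 8 MB buffers with 32 channels -> 512 MB of GPU memory per communicator.
footprint = nccl_buffer_bytes(8 * MB, 32)
```

This is why raising NCCL_BUFFSIZE and NCCL_MAX_NCHANNELS together is not free: on a job that also creates several communicators (e.g., separate TP and DP groups), the buffer memory multiplies again per communicator.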
Impact of Tuning Variables (8x H100 NVLink, 256 MB all-reduce)
| Configuration | Bus BW (GB/s) | Change vs Default | Notes |
|---|---|---|---|
| Default (NCCL auto) | 165 | baseline | NCCL 2.21 |
| NCCL_ALGO=TREE | 161 | -2.4% | Tree slightly worse at 256 MB |
| NCCL_ALGO=RING | 167 | +1.2% | Ring marginally better here |
| NCCL_MIN_NCHANNELS=16 | 172 | +4.2% | More parallelism helps |
| NCCL_BUFFSIZE=16M | 174 | +5.5% | Larger buffers, more BW |
| MIN_NCHANNELS=16 + BUFFSIZE=16M | 178 | +7.9% | Combined effect |
| + NCCL_NTHREADS=512 | 179 | +8.5% | Diminishing returns |
SHARP In-Network Computing
SHARP offloads collective operations to the network switches, fundamentally changing the communication pattern.
How SHARP Works
In a traditional all-reduce, each GPU sends data over the network and receives the reduced result — the network carries roughly 2N(k-1)/k ≈ 2N bytes per GPU for a message of N bytes. With SHARP:
- Each GPU sends its data to the local switch (N bytes upstream).
- The switch reduces the data from all connected ports in its ASIC.
- The switch sends the reduced result back to all GPUs (N bytes downstream).
Total network traffic is halved to N bytes sent (and N received) per GPU, and the reduction happens at switch line rate with sub-microsecond latency.
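The traffic counts above, expressed as code (following the same per-GPU accounting as the text):

```python
def ring_bytes_sent(n, k):
    # Reduce-scatter + all-gather phases of a ring all-reduce.
    return 2 * n * (k - 1) // k

def sharp_bytes_sent(n, k):
    # One upstream copy; the switch broadcasts the reduced result.
    return n

n, k = 256 * (1 << 20), 32
ratio = ring_bytes_sent(n, k) / sharp_bytes_sent(n, k)  # approaches 2 for large k
```

For 32 GPUs the ratio is 2 * 31/32 ≈ 1.94, which is where the "halves the network traffic" claim comes from.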
SHARP Requirements
- Switches: NVIDIA Quantum or Quantum-2 (InfiniBand) with SHARP support
- NICs: ConnectX-6 or later
- Software: SHARP daemon running on switches, HCOLL library on nodes
- NCCL: Must be built with SHARP/CollNet support
# Enable SHARP in NCCL
export NCCL_COLLNET_ENABLE=1
export SHARP_COLL_LOG_LEVEL=3 # For debugging
# Verify SHARP is active (look in NCCL_DEBUG=INFO output)
# NCCL INFO Trees [0] -1/-1/-1->0->1 [1] -1/-1/-1->0->-1 ... CollNet
# The "CollNet" tag indicates SHARP is being used
SHARP Performance Impact
All-Reduce Bandwidth: IB NDR vs IB NDR + SHARP (32 GPUs, 4 nodes)
(chart: algorithm bandwidth in GB/s)

The fact that SHARP can achieve higher bandwidth than the theoretical single-link maximum seems paradoxical. The explanation is that SHARP eliminates the all-gather phase: instead of each GPU receiving the reduced result from its ring/tree neighbor (which requires data to traverse multiple hops), the switch broadcasts the result directly to all ports simultaneously. This effectively doubles the usable bandwidth compared to a two-phase algorithm.
Real Benchmarks: All-Reduce at Different Message Sizes
The following benchmarks use nccl-tests (all_reduce_perf) across a range of message sizes to show how NCCL performance varies. Understanding this curve is essential for predicting how your training workload will perform.
Single Node (8x H100 SXM, NVLink 4.0 + NVSwitch)
Intra-Node All-Reduce (8x H100 NVSwitch)
| Message Size | Bus BW (GB/s) | Algo BW (GB/s) | Latency (us) | Protocol Used |
|---|---|---|---|---|
| 256 B | 0.03 | 0.03 | 8.2 | LL |
| 4 KB | 0.45 | 0.39 | 8.9 | LL |
| 64 KB | 5.8 | 5.1 | 11.0 | LL128 |
| 256 KB | 24.1 | 21.1 | 10.6 | LL128 |
| 1 MB | 78.5 | 68.7 | 12.8 | Simple |
| 4 MB | 148 | 130 | 27.0 | Simple |
| 16 MB | 198 | 173 | 81.0 | Simple |
| 64 MB | 225 | 197 | 283 | Simple |
| 256 MB | 238 | 208 | 1,076 | Simple |
| 1 GB | 241 | 211 | 4,254 | Simple |
Multi-Node (4 nodes, 32x H100, IB NDR 400 Gb/s)
Inter-Node All-Reduce (32x H100, 4 nodes, IB NDR)
| Message Size | Bus BW (GB/s) | Algo BW (GB/s) | Latency (us) | Algorithm |
|---|---|---|---|---|
| 256 B | 0.01 | 0.01 | 28 | Tree |
| 4 KB | 0.12 | 0.11 | 34 | Tree |
| 64 KB | 1.8 | 1.7 | 36 | Tree |
| 256 KB | 6.5 | 6.3 | 40 | Tree |
| 1 MB | 18.2 | 17.6 | 56 | Tree |
| 4 MB | 32.5 | 31.4 | 125 | Ring |
| 16 MB | 40.1 | 38.8 | 405 | Ring |
| 64 MB | 44.8 | 43.4 | 1,451 | Ring |
| 256 MB | 46.2 | 44.7 | 5,628 | Ring |
| 1 GB | 46.8 | 45.3 | 21,985 | Ring |
Key Observations
The benchmarks reveal several important patterns:
- Latency floor: Even for tiny messages, NCCL cannot achieve sub-8-us latency intra-node or sub-28-us inter-node. This is the fixed overhead of kernel launch, synchronization, and network setup. For all-reduce messages smaller than ~64 KB, you are latency-bound regardless of bandwidth.
- Bandwidth ramp: Bandwidth increases with message size because the fixed latency overhead is amortized over more data. The bandwidth saturates around 256 MB - 1 GB for both intra-node and inter-node.
- The intra/inter gap: Intra-node bandwidth (241 GB/s) is ~5x inter-node bandwidth (47 GB/s) with NDR InfiniBand. This gap dictates the optimal parallelism strategy: keep the highest-bandwidth communication (tensor parallelism all-reduce) intra-node, and push only the lower-bandwidth communication (data parallelism gradient sync) across nodes.
- Algorithm crossover: For inter-node communication, NCCL switches from tree to ring around 1-4 MB. The tree's logarithmic latency advantage matters for small messages; the ring's optimal bandwidth utilization matters for large messages.
Troubleshooting NCCL Hangs
NCCL hangs are the most painful distributed training failure mode. The job appears to be running (GPUs are allocated, processes are alive) but makes no progress. There is no error message, no crash, no stack trace — just silence.
Common Causes
Rank desynchronization: One rank enters a collective operation while another rank enters a different collective, or enters the same collective with different parameters. NCCL has no timeout by default (though recent versions added NCCL_TIMEOUT), so mismatched collectives hang forever.
# Enable NCCL debug logging to see which collective each rank is executing
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,COLL,NET
# Look for ranks entering different operations:
# rank 0: ncclAllReduce(sendbuff=0x..., count=1048576, ...)
# rank 1: ncclAllReduce(sendbuff=0x..., count=524288, ...) # DIFFERENT COUNT!
Network failures: A network link goes down mid-operation. InfiniBand link flaps, RoCE PFC storms, or switch port errors can cause one rank to stall waiting for data that will never arrive.
# Check IB error counters
perfquery -x # Extended port counters
# Look for:
# PortRcvErrors, PortRcvRemotePhysicalErrors
# PortXmitDiscards, ExcessiveBufferOverrunErrors
# LinkErrorRecoveryCounter, LinkDownedCounter
# Check for link flaps
ibdiagnet # Comprehensive IB fabric diagnostics
GPU memory OOM during NCCL init: NCCL allocates communication buffers during initialization. If GPU memory is nearly full, the allocation can fail silently (in older NCCL versions) or cause a hang during the first collective.
Firewall or security group blocking: For socket-based communication (the bootstrap connection that coordinates NCCL initialization), firewalls can block the required TCP ports.
Diagnostic Procedure
- Set NCCL_DEBUG=INFO and redirect output to per-rank log files:
export NCCL_DEBUG=INFO
export NCCL_DEBUG_FILE=/tmp/nccl_debug.%h.%p
# %h is replaced by the hostname, %p by the process ID (one file per process)
- Set NCCL_TIMEOUT to get an error instead of a hang:
export NCCL_TIMEOUT=300 # 5-minute timeout
# After timeout, NCCL will abort with an error message
# indicating which operation timed out and on which rank
- Use py-spy or gdb to inspect hanging processes:
# Attach to the hanging Python process
py-spy dump --pid <PID>
# Look for the stack trace -- is it stuck in:
# ncclAllReduce? -> communication hang
# cudaStreamSynchronize? -> GPU kernel hang
# at init_process_group? -> bootstrap connection issue
# For C++ level debugging
gdb -p <PID>
# (gdb) thread apply all bt
# Look for threads stuck in poll(), ibv_poll_cq(), or ncclProxyService
- Check for asymmetric failures by examining logs from all ranks:
# Compare NCCL init across ranks
grep "NCCL INFO" /tmp/nccl_debug_rank_*.log | sort
# All ranks should report the same:
# - Number of channels
# - Algorithm (Ring/Tree)
# - Number of NVLink/IB connections
# Discrepancies indicate topology detection issues
By default, NCCL operations are asynchronous — they are enqueued on a CUDA stream and may not execute immediately. If an NCCL error occurs during an async operation, it is not reported until the next synchronization point (e.g., cudaStreamSynchronize or another NCCL call). This means the actual error may have occurred hundreds of milliseconds before the hang is observed. Setting NCCL_DEBUG=WARN at minimum ensures errors are logged as soon as NCCL detects them.
Common Fixes
NCCL Hang Causes and Solutions
| Cause | Symptom | Diagnosis | Fix |
|---|---|---|---|
| Rank mismatch | All ranks hang at collective | NCCL_DEBUG shows different ops | Fix application logic |
| IB link down | Some ranks hang, others timeout | ibstat shows port down | Replace cable, reseat NIC |
| PFC storm (RoCE) | Intermittent hangs under load | Switch ECN/PFC counters spiking | Tune ECN thresholds, isolate traffic |
| GPU OOM | Hang during first collective | nvidia-smi shows near 100% memory | Reduce model/batch size; leave headroom for NCCL buffers |
| Firewall | Hang during NCCL init | Timeout on bootstrap TCP connect | Open ports 29400-29500 |
| CUDA error on one rank | Asymmetric hang | NCCL_DEBUG shows CUDA error on one rank | Fix underlying CUDA issue |
| NVLink degraded | Intra-node hang or slowdown | nvidia-smi nvlink shows errors | Reseat GPU, update driver |
Communication-Computation Overlap
Achieving high training efficiency requires hiding communication behind computation. In data-parallel training with DDP, this means starting the gradient all-reduce for early layers while the backward pass is still computing gradients for later layers.
How PyTorch DDP Implements Overlap
PyTorch DDP groups parameters into buckets (default 25 MB each) in reverse parameter order. When the gradient for the last parameter in a bucket is computed, DDP immediately launches the all-reduce for that bucket while the backward pass continues computing gradients for the next bucket.
# Tuning bucket size for overlap
from torch.nn.parallel import DistributedDataParallel as DDP

model = DDP(
model,
device_ids=[local_rank],
bucket_cap_mb=25, # Default 25 MB
# Smaller buckets = more overlap opportunities but more kernel launches
# Larger buckets = fewer launches but less overlap
gradient_as_bucket_view=True, # Avoid gradient copy into bucket
static_graph=True, # Enable optimizations for static computation graphs
)
The optimal bucket size depends on the ratio of computation time to communication time per layer:
- If computation per layer is long relative to all-reduce time, larger buckets are fine because there is ample time to overlap.
- If computation per layer is short (e.g., small models, large batch sizes where compute is fast), smaller buckets enable finer-grained overlap.
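A quick way to reason about this trade-off is to count how many all-reduce launches a given bucket size implies. The model size below is an assumption for illustration:

```python
import math

def ddp_buckets(n_params, dtype_bytes=4, bucket_cap_mb=25):
    """Number of gradient buckets (= all-reduce launches per backward pass)."""
    grad_bytes = n_params * dtype_bytes
    bucket_bytes = bucket_cap_mb * 2**20
    return math.ceil(grad_bytes / bucket_bytes)

# A hypothetical 1.3B-parameter model with FP32 gradients and the default
# 25 MB buckets: 199 all-reduce launches per backward pass.
n_buckets = ddp_buckets(1_300_000_000)
```

Halving bucket_cap_mb doubles the launch count but gives the scheduler twice as many opportunities to start communication early; whether that wins depends on how long each layer's backward computation takes relative to a bucket's all-reduce.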
Measuring Overlap Efficiency
# Profile with Nsight Systems to visualize overlap
nsys profile \
--trace=cuda,nvtx,nccl \
--output=training_profile \
python train.py
# In the timeline view, look for:
# NCCL kernels (labeled ncclKernel_AllReduce_*)
# running concurrently with backward pass GEMM kernels
# Gaps where the GPU is idle waiting for NCCL -- these are overlap failures
The ideal profile shows NCCL kernels running on a separate CUDA stream, fully overlapped with backward pass computation. The worst case shows a sequential pattern: backward pass completes, then all-reduce runs, then the next forward pass begins.
Training Iteration Time Breakdown (GPT-3 175B, 64x H100)
(chart: per-phase time in ms)

Advanced: Hierarchical All-Reduce for Multi-Node
For large clusters, NCCL automatically implements a two-level hierarchical all-reduce:
- Intra-node reduce-scatter: Each GPU reduces with its NVLink-connected peers using the high-bandwidth NVLink fabric. After this step, each GPU holds a partial reduction over 1/8th of the data (for 8 GPUs per node).
- Inter-node all-reduce: Each GPU all-reduces its partial data with the corresponding GPU on other nodes via InfiniBand. The message size per GPU for this step is N/8, which is 8x smaller than the original message.
- Intra-node all-gather: Each GPU broadcasts its fully reduced partial data to its NVLink peers.
This hierarchical decomposition keeps the bulk of the data transfer on the fast NVLink fabric and minimizes the data that must cross the slower inter-node network. For an 8-GPU-per-node configuration, the inter-node traffic is reduced by 8x compared to a flat algorithm.
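The traffic reduction is easy to compute directly (matching the 256 MB / 8-GPUs-per-node example in the table below):

```python
MB = 1 << 20

def hierarchical_inter_bytes(message_bytes, gpus_per_node):
    # After the intra-node reduce-scatter, each GPU carries only its
    # 1/gpus_per_node share of the message across the node boundary.
    return message_bytes // gpus_per_node

# 256 MB message, 8 GPUs per node -> 32 MB per GPU inter-node,
# vs the full 256 MB per GPU for a flat ring spanning all GPUs.
inter = hierarchical_inter_bytes(256 * MB, 8)
```

Because inter-node bandwidth is the scarce resource, this 8x reduction in inter-node bytes is where essentially all of the hierarchical speedup comes from.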
Flat vs Hierarchical All-Reduce (64 GPUs, 8 nodes, 256 MB message)
| Approach | Inter-Node Traffic/GPU | Total Time (ms) | Bus BW (GB/s) |
|---|---|---|---|
| Flat ring (all 64 GPUs) | 256 MB | 12.8 | 40.2 |
| Hierarchical (8 intra + 8 inter) | 32 MB | 5.1 | 100.8 |
| Hierarchical + SHARP inter | 32 MB | 3.8 | 135.2 |
Practical Tuning Workflow
Here is the systematic process for tuning NCCL for a new cluster or workload:
Step 1: Baseline with nccl-tests
# Build nccl-tests
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1 NCCL_HOME=/usr/lib/x86_64-linux-gnu CUDA_HOME=/usr/local/cuda
# Run comprehensive benchmark
mpirun -np 8 --bind-to numa \
./build/all_reduce_perf \
-b 8 -e 1G -f 2 -g 1 \
-w 5 -n 100
# -b 8: start at 8 bytes
# -e 1G: end at 1 GB
# -f 2: multiply by 2 each step
# -g 1: 1 GPU per process
# -w 5: 5 warmup iterations
# -n 100: 100 measurement iterations
Step 2: Compare Against Theoretical Bandwidth
Calculate your expected bandwidth:
- Intra-node: NVLink bandwidth per GPU (e.g., 900 GB/s for H100 NVSwitch), multiplied by the algorithm efficiency factor (typically 0.85-0.90 for large messages).
- Inter-node: IB/RoCE bandwidth per GPU (e.g., 50 GB/s for NDR), multiplied by efficiency (typically 0.80-0.90).
If measured bandwidth is significantly below expected, investigate topology, NIC affinity, and GDR status.
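A minimal sanity-check helper for Step 2, using the efficiency ranges above. The 0.85 efficiency and the 80%-of-target alert threshold are judgment calls, not NCCL constants:

```python
def expected_busbw(link_bw_gbs, efficiency=0.85):
    """Theoretical target bus bandwidth (GB/s) for nccl-tests."""
    return link_bw_gbs * efficiency

def verdict(measured_gbs, link_bw_gbs, efficiency=0.85, slack=0.8):
    # Flag results that fall below 80% of the already-derated target.
    target = expected_busbw(link_bw_gbs, efficiency)
    return "ok" if measured_gbs >= slack * target else "investigate"

# NDR example: 50 GB/s per GPU, 85% efficiency -> ~42.5 GB/s target.
print(verdict(44.7, 50))   # measured busbw from nccl-tests
```

If the verdict is "investigate", the usual suspects in order are: wrong NIC-GPU affinity, GDR silently disabled (check NCCL_DEBUG=INFO for "GDRDMA"), and a degraded NVLink or IB link.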
Step 3: Tune for Your Workload’s Message Size
Determine the actual all-reduce message sizes in your training workload:
# For DDP: message size = bucket size (default 25 MB)
# For FSDP: message size = shard size = total_params / world_size * dtype_bytes
# For tensor parallelism: message size = hidden_dim * batch * seq_len * dtype_bytes
# Example: Llama 70B with TP=8, batch=1, seq=4096, hidden=8192, FP16
tp_allreduce_size = 8192 * 1 * 4096 * 2  # ~64 MB per all-reduce
# (Not divided by TP because the all-reduce operates on the full hidden-dim output)
Then run nccl-tests at that specific message size and tune accordingly.
Step 4: Validate with Actual Training
# Run training with NCCL profiling
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=COLL # Only log collective operations
# Parse logs for timing
grep "AllReduce" nccl_debug.log | awk '{print $NF}'
# Compare against nccl-tests baseline to verify there is no
# application-level overhead degrading NCCL performance
Conclusion
NCCL performance tuning is not glamorous work, but it directly impacts the dollar cost of every training run. The key takeaways:
- Understand your topology. Run nvidia-smi topo -m, verify NVLink connections, check NIC-GPU affinity. Topology misunderstandings are the root cause of most performance issues.
- Know your message sizes. The optimal algorithm, protocol, and buffer configuration all depend on the message size distribution of your workload. Measure it, do not guess.
- Benchmark before and after. Use nccl-tests to establish a baseline, tune variables, and measure the impact. Do not tune blind.
- The hierarchy matters. For multi-node training, the intra-node/inter-node bandwidth gap means that parallelism strategy (what communication goes where) matters more than NCCL tuning variables.
- SHARP is transformative when available. If you have SHARP-capable hardware, enabling CollNet can provide 30-50% bandwidth improvement for inter-node all-reduce. Investigate whether your cluster supports it.
- Hangs are diagnosable. Set NCCL_DEBUG=INFO, set NCCL_TIMEOUT, and examine per-rank logs. Most hangs come from rank desynchronization, network failures, or memory issues — all of which leave traces in the debug output.