Part of Series: GPU Hardware & AI Accelerators (3 of 30)
NVLink, NVSwitch, and GPU Interconnect: From Peer-to-Peer to NVL72 Rack-Scale Fabric

Tensor parallelism splits every matrix multiply across multiple GPUs. After each split GEMM, the partial results must be combined via an all-reduce before the next layer can begin. For an 8-way tensor-parallel Llama 70B on H100s, each layer requires an all-reduce of approximately 17 MB. With 80 layers and 48 tokens per second, that is:

80 × 17 MB × 48 = 65,280 MB/s ≈ 65.3 GB/s of sustained all-reduce throughput

Over PCIe Gen5 x16 at 64 GB/s per direction, this alone would consume the entire bus — and PCIe all-reduce is far less efficient than NVLink point-to-point, so achievable throughput would be perhaps 30-40 GB/s. Over NVLink 4.0 at 900 GB/s bidirectional, it is about 7% of available bandwidth. This is why NVLink exists: tensor parallelism is impractical over PCIe for any model that requires it.

This post covers the complete NVIDIA interconnect stack: NVLink generations 1.0 through 5.0, the NVSwitch topology that enables all-to-all communication, the NVL72 rack-scale fabric, PCIe as the fallback, and InfiniBand for inter-node communication. The focus is on what these interconnects mean for LLM serving throughput.

The Interconnect Hierarchy

GPU interconnect operates at several levels, each a few times to an order of magnitude slower than the one above it:


GPU Interconnect Bandwidth Hierarchy (H100 DGX System)

| Level | Technology | Bandwidth (bidirectional) | Latency | Scope |
|---|---|---|---|---|
| On-chip (SM to SM) | Crossbar / L2 | ~12 TB/s (L2 BW) | ~200 cycles | Single GPU die |
| HBM | HBM3 | 3,350 GB/s | ~400-600 cycles | GPU to local memory |
| NVLink (intra-node) | NVLink 4.0 | 900 GB/s per GPU | ~1-2 us | 8 GPUs in one node |
| PCIe (intra-node) | PCIe Gen5 x16 | 128 GB/s | ~2-5 us | GPU to CPU, GPU to NIC |
| InfiniBand (inter-node) | NDR 400G | 400 Gbps = 50 GB/s per direction | ~1-5 us | Across nodes in cluster |
| Ethernet (inter-node) | 400GbE RoCE | 400 Gbps = 50 GB/s per direction | ~2-10 us | Across nodes in cluster |
Note: Bidirectional bandwidth. NVLink bandwidth is per-GPU; total node bisection bandwidth is higher. InfiniBand latency depends on switch hops.

Each level down represents a communication boundary that affects parallelism strategy. Tensor parallelism (splitting a single layer) works within NVLink domains. Pipeline parallelism (splitting across layers) can tolerate InfiniBand latencies. Expert parallelism (MoE routing) requires high bisection bandwidth within a node.

NVLink 1.0 (Pascal)

The first NVLink provided 4 links per GPU at 40 GB/s each (bidirectional), for a total of 160 GB/s per GPU — 5x the PCIe Gen3 x16 bandwidth of 32 GB/s. The P100 could connect to up to 4 other GPUs, but the topology was limited: with no switch chip yet, a fully connected 4-GPU setup required a point-to-point mesh.

NVLink 2.0 (Volta)

NVLink 2.0 increased the link count to 6 per GPU at 50 GB/s each, for 300 GB/s per GPU. The V100 DGX-1 connected 8 GPUs in a hybrid cube-mesh topology — not fully connected, as 6 links could not reach all 7 other GPUs, so some GPU pairs communicated via intermediate hops. The DGX-2 later introduced the first NVSwitch to connect 16 V100s in a single all-to-all domain.

NVLink 3.0 (Ampere)

NVLink 3.0 provided 12 links per GPU at 50 GB/s each, for 600 GB/s per GPU. With 12 links and only 7 other GPUs in the node, the DGX A100 achieved a fully connected topology — every GPU has a direct NVLink connection to every other GPU. This eliminated the routing overhead present in Volta's hybrid topology.

The A100 generation also brought NVSwitch 2.0 — a dedicated switch chip that terminates all NVLink ports. Instead of point-to-point NVLink wiring between GPUs, all links run to NVSwitch ASICs, which can route any GPU's traffic to any other GPU. The DGX A100 uses 6 NVSwitch chips, each connecting to all 8 GPUs.

NVLink 4.0 (Hopper)

NVLink 4.0 kept per-link bandwidth at 50 GB/s but expanded to 18 links per GPU, for 900 GB/s per GPU. The DGX H100 uses NVSwitch 3.0 to provide full bisection bandwidth across all 8 GPUs. NVSwitch 3.0 also introduced in-network compute: the switch itself can perform reductions (sum, max) on data flowing through it, reducing all-reduce latency.

NVSwitch in-network reduction: During an all-reduce, instead of each GPU sending data to all others and then summing locally, the NVSwitch performs the partial sum as data passes through. This reduces the all-reduce from 2(N-1)/N bandwidth steps to approximately 1 bandwidth step for large messages.

NVLink 5.0 (Blackwell)

NVLink 5.0 doubles per-link bandwidth to 100 GB/s, with 18 links per GPU for 1,800 GB/s per GPU. The Blackwell-generation NVSwitch 4.0 scales the NVLink domain from 8 to 72 GPUs, enabling the NVL72 — a single rack with 72 B200 GPUs connected as one coherent domain.

NVLink Bandwidth per GPU (Bidirectional)

  • NVLink 1.0 (P100): 4 links x 40 GB/s = 160 GB/s
  • NVLink 2.0 (V100): 6 links x 50 GB/s = 300 GB/s
  • NVLink 3.0 (A100): 12 links x 50 GB/s = 600 GB/s
  • NVLink 4.0 (H100): 18 links x 50 GB/s = 900 GB/s
  • NVLink 5.0 (B200): 18 links x 100 GB/s = 1,800 GB/s

NVLink Generation Comparison

| Spec | NVLink 1.0 | NVLink 2.0 | NVLink 3.0 | NVLink 4.0 | NVLink 5.0 |
|---|---|---|---|---|---|
| GPU | P100 | V100 | A100 | H100 | B200 |
| Links per GPU | 4 | 6 | 12 | 18 | 18 |
| BW per link (bidir.) | 40 GB/s | 50 GB/s | 50 GB/s | 50 GB/s | 100 GB/s |
| Total BW per GPU | 160 GB/s | 300 GB/s | 600 GB/s | 900 GB/s | 1,800 GB/s |
| Signaling | NRZ 20 Gbps | NRZ 25 Gbps | NRZ 50 Gbps | PAM4 100 Gbps | PAM4 200 Gbps |
| Max GPUs (NVSwitch) | N/A | 16 (DGX-2) | 8 | 8 | 72 |
| In-network compute | No | No | No | Yes | Yes |

NVSwitch: From Node-Scale to Rack-Scale

The Topology Problem

Without NVSwitch, GPUs connect point-to-point via NVLink cables. With N GPUs and L links per GPU, each GPU can directly connect to at most L peers. For 8 GPUs with 12 links each (A100), you can fully connect all 8 (each pair uses 1-2 links). But for 16, 32, or 72 GPUs, point-to-point is impossible — you would need hundreds of links per GPU.

NVSwitch solves this by acting as a crossbar switch: every GPU's links terminate at switch chips that can route any port to any other port. NVSwitch 3.0 (Hopper) provides 64 NVLink 4.0 ports per chip, and multiple NVSwitch chips can be combined to build a multi-stage fat-tree topology.

NVSwitch 3.0 (Hopper DGX)

The DGX H100 uses 4 NVSwitch 3.0 chips, each with 64 NVLink 4.0 ports. The 8 GPUs spread their 18 links across all 4 NVSwitch chips, and each NVSwitch can route between any pair of connected GPUs.

Full bisection bandwidth for the 8-GPU domain is set by the GPUs' own NVLink bandwidth, not the switch capacity:

8 GPUs × 900 GB/s / 2 = 3,600 GB/s bisection

The four switches together have more aggregate port bandwidth than the GPUs can drive, so any subset of GPUs can communicate with any other subset at the full 900 GB/s per GPU without contention. This is critical for all-reduce operations where all GPUs send and receive simultaneously.

NVSwitch 4.0 and NVL72 (Blackwell)

The NVL72 configuration connects 72 B200 GPUs (36 Grace-Blackwell superchips, where each superchip pairs 2 B200s with 1 Grace CPU) into a single NVSwitch domain. This uses 9 NVSwitch 4.0 trays, each containing multiple NVSwitch chips.

The aggregate NVLink bandwidth is approximately 130 TB/s (72 × 1.8 TB/s). Every GPU can communicate with every other GPU at full NVLink 5.0 speed (1,800 GB/s per GPU) without routing through any CPU or PCIe bus.

For a 1.8 trillion parameter model (like GPT-4 class), the NVL72 provides:

72 × 192 GB = 13,824 GB of aggregate HBM3e

At FP4 (0.5 bytes per param), a 1.8T model occupies 900 GB — fitting across 5 GPUs. But with tensor parallelism across all 72 GPUs, each GPU holds only 12.5 GB of weights, and the aggregate bandwidth is:

72 × 8 TB/s = 576 TB/s of aggregate HBM bandwidth

ℹ️ NVL72 vs Traditional Multi-Node

A traditional 9-node cluster with 8 H100 GPUs each has 72 GPUs total. But inter-node communication uses InfiniBand at 400 Gbps (50 GB/s) — 18x slower than NVLink. Tensor parallelism across nodes is impractical. The NVL72 eliminates this bottleneck by making all 72 GPUs appear as one NVLink domain. This enables 72-way tensor parallelism, which is meaningless for small models but transformative for trillion-parameter inference.

PCIe: The Fallback Path

Every NVIDIA data center GPU has a PCIe interface in addition to NVLink. PCIe serves as the connection between GPU and CPU, GPU and network interface cards (NICs), and GPU and NVMe storage.


PCIe Generation Comparison

| PCIe Gen | Per-lane BW | x16 BW (bidir.) | GPU | Encoding |
|---|---|---|---|---|
| Gen3 | 1 GB/s | 32 GB/s | V100 | 128b/130b |
| Gen4 | 2 GB/s | 64 GB/s | A100 | 128b/130b |
| Gen5 | 4 GB/s | 128 GB/s | H100, B200 | 128b/130b |
| Gen6 | 8 GB/s | 256 GB/s | Future | PAM4, FEC |
Note: Bidirectional bandwidth. Effective bandwidth is approximately 98.5% of raw due to 128b/130b encoding; Gen5 x16 delivers about 126 GB/s effective.

PCIe is roughly 7-9x slower than NVLink of the same GPU generation (128 vs 900 GB/s on H100; 64 vs 600 GB/s on A100):

NVLink vs PCIe Bandwidth (per GPU, bidirectional)

  • PCIe Gen3 x16 (V100, CPU-GPU bus): 32 GB/s
  • NVLink 2.0 (V100, GPU-GPU): 300 GB/s
  • PCIe Gen4 x16 (A100, CPU-GPU bus): 64 GB/s
  • NVLink 3.0 (A100, GPU-GPU): 600 GB/s
  • PCIe Gen5 x16 (H100, CPU-GPU bus): 128 GB/s
  • NVLink 4.0 (H100, GPU-GPU): 900 GB/s

When PCIe Matters

PCIe bandwidth matters in three scenarios:

  1. CPU offloading: When KV cache or model layers are partially offloaded to CPU memory (DeepSpeed ZeRO-Offload), data must traverse PCIe. At 128 GB/s (Gen5 x16), moving 1 GB of KV cache takes 7.8 ms — adding significant latency per token.

  2. Network I/O: Data received from remote nodes via InfiniBand flows through the NIC, across PCIe, to GPU memory. GPUDirect RDMA (GDR) allows InfiniBand NICs to write directly to GPU memory over PCIe, bypassing CPU memory entirely.

  3. Multi-GPU without NVLink: Some cloud instances provide multiple GPUs connected only via PCIe (no NVLink). Tensor parallelism over PCIe is 7-14x slower than over NVLink, making it practical only for pipeline parallelism or data parallelism.

InfiniBand: Inter-Node Communication

Within a node, NVLink handles GPU-to-GPU communication. Between nodes, InfiniBand (IB) or Ethernet handles the traffic. For LLM training and multi-node inference, inter-node bandwidth determines pipeline parallelism and data parallelism efficiency.


InfiniBand Generations for AI Clusters

| IB Generation | Per-port BW | Links per GPU | Effective GPU BW | Deployment |
|---|---|---|---|---|
| EDR (2014) | 100 Gbps = 12.5 GB/s | 1 | 12.5 GB/s | DGX-1 (V100) |
| HDR (2019) | 200 Gbps = 25 GB/s | 1-2 | 25-50 GB/s | DGX A100 |
| NDR (2022) | 400 Gbps = 50 GB/s | 1 | 50 GB/s | DGX H100 |
| XDR (2024) | 800 Gbps = 100 GB/s | 1 | 100 GB/s | DGX B200 |
Note: Effective bandwidth per GPU assumes GPUDirect RDMA. Without GDR, CPU-mediated transfers reduce effective bandwidth by 30-50%.

GPUDirect RDMA

Without GPUDirect RDMA, inter-node GPU communication follows this path:

GPU0 (Node A) -> PCIe -> CPU memory (Node A) -> NIC -> Network ->
NIC -> CPU memory (Node B) -> PCIe -> GPU0 (Node B)

Two PCIe traversals and two CPU memory copies. GPUDirect RDMA eliminates the CPU memory copies:

GPU0 (Node A) -> PCIe -> NIC -> Network ->
NIC -> PCIe -> GPU0 (Node B)

The NIC reads directly from GPU memory and writes directly to GPU memory. This reduces latency by approximately 2-5 microseconds and doubles effective bandwidth (no CPU memory bottleneck).

On DGX H100, each GPU is paired with a ConnectX-7 InfiniBand NIC on the same PCIe Gen5 switch, so GPUDirect RDMA traffic flows between NIC and GPU through that switch without crossing the CPU root complex:

GPU0 (Node A) -> PCIe switch -> NIC -> Network ->
NIC -> PCIe switch -> GPU0 (Node B)

The network itself remains the bottleneck at 50-100 GB/s, well below PCIe Gen5's 128 GB/s. The benefit of the per-GPU NIC topology is lower latency and freedom from contention: each GPU's network traffic has a dedicated path instead of sharing one bus with the other seven GPUs.

Why Interconnect Determines TP Efficiency

All-Reduce Cost Analysis

In tensor parallelism with N GPUs, each transformer layer requires two all-reduce operations — one after the attention block and one after the MLP block, in each case following the row-parallel GEMM. Each all-reduce transfers approximately M bytes (the size of the activation tensor) per GPU.

Using the ring all-reduce algorithm, the time for one all-reduce is:

T_all-reduce = 2 × (N-1)/N × M / BW

where BW is the per-GPU interconnect bandwidth and the factor of 2 accounts for the reduce-scatter and all-gather phases.

For a 70B model with hidden dimension 8192, sequence length 1, and FP16 activations:

M = 8192 × 2 = 16,384 bytes per token

Per layer (2 all-reduces):

T_per-layer = 2 × 2 × (7/8) × 16,384 / BW


All-Reduce Time per Layer (70B, 8-way TP, batch=1)

| Interconnect | BW per GPU | Time per All-Reduce | Time per Layer (2x) | 80 Layers Total |
|---|---|---|---|---|
| PCIe Gen4 | 64 GB/s | 0.45 us | 0.89 us | 71.2 us |
| PCIe Gen5 | 128 GB/s | 0.22 us | 0.45 us | 35.6 us |
| NVLink 3.0 (A100) | 600 GB/s | 0.048 us | 0.095 us | 7.6 us |
| NVLink 4.0 (H100) | 900 GB/s | 0.032 us | 0.063 us | 5.1 us |
| NVLink 5.0 (B200) | 1,800 GB/s | 0.016 us | 0.032 us | 2.5 us |
Note: For batch=1, messages are small (16 KB) and latency dominates over bandwidth. At larger batch sizes, bandwidth becomes the limiting factor.
⚠️ Latency vs Bandwidth

For small messages (batch=1 decode), the all-reduce time is dominated by latency, not bandwidth. NVLink latency is approximately 1-2 microseconds per all-reduce, which makes the bandwidth calculation above optimistic. At batch=1, the real all-reduce cost is closer to 1-2 us per operation regardless of NVLink generation. At batch=256 (16 KB x 256 = 4 MB messages), bandwidth dominates and the NVLink generation matters significantly.

TP Overhead as Percentage of Compute

The critical metric is what fraction of total layer time is spent on communication:

TP overhead = T_comm / (T_compute + T_comm)

For a 70B model decode at batch=32, FP8 on H100:

  • Compute time per layer: approximately 12 microseconds (70B / 80 layers ≈ 875M params per layer; split 8 ways that is ~109M params per GPU, or ~7 GFLOP for a batch of 32 tokens, which at a realistic ~30% of the H100's 1,979 FP8 TFLOPS for small-batch decode takes about 12 us)
  • All-reduce time per layer over NVLink 4.0: approximately 4 microseconds (2 x 32 x 16 KB / 900 GB/s + 2 x 1 us latency)

TP overhead = 4 / (12 + 4) = 25%

Over PCIe Gen5:

  • All-reduce time: approximately 10 microseconds (2 x 32 x 16 KB / 128 GB/s ≈ 8 us, plus 2 x 1 us latency)

TP overhead = 10 / (12 + 10) = 45%

This is why NVIDIA insists on NVLink for tensor parallelism. At 25% overhead, 8-way TP delivers approximately 6x throughput of a single GPU. At 45% overhead, it delivers approximately 4.4x — a 27% efficiency loss.

Peer-to-Peer Bandwidth Test

#include <cuda_runtime.h>
#include <cstdio>

void measure_p2p_bandwidth(int src_gpu, int dst_gpu) {
    size_t size = 256 * 1024 * 1024;  // 256 MB
    void *d_src, *d_dst;

    // Check P2P access
    int can_access;
    cudaDeviceCanAccessPeer(&can_access, src_gpu, dst_gpu);
    if (!can_access) {
        printf("GPU %d cannot access GPU %d peer-to-peer\n", src_gpu, dst_gpu);
        return;
    }

    // Enable P2P (a repeat call returns cudaErrorPeerAccessAlreadyEnabled, which is harmless here)
    cudaSetDevice(src_gpu);
    cudaDeviceEnablePeerAccess(dst_gpu, 0);
    cudaMalloc(&d_src, size);

    cudaSetDevice(dst_gpu);
    cudaDeviceEnablePeerAccess(src_gpu, 0);
    cudaMalloc(&d_dst, size);

    // Measure
    cudaSetDevice(src_gpu);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm up
    cudaMemcpyPeer(d_dst, dst_gpu, d_src, src_gpu, size);
    cudaDeviceSynchronize();

    int iterations = 50;
    cudaEventRecord(start);
    for (int i = 0; i < iterations; i++) {
        cudaMemcpyPeer(d_dst, dst_gpu, d_src, src_gpu, size);
    }
    cudaEventRecord(stop);
    cudaDeviceSynchronize();

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    double bw = (double)size * iterations / (ms / 1000.0) / 1e9;

    printf("GPU %d -> GPU %d: %.1f GB/s (unidirectional)\n", src_gpu, dst_gpu, bw);

    cudaFree(d_src);
    cudaFree(d_dst);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}

void measure_bidirectional(int gpu_a, int gpu_b) {
    size_t size = 256 * 1024 * 1024;
    void *d_a_src, *d_a_dst, *d_b_src, *d_b_dst;

    cudaSetDevice(gpu_a);
    cudaDeviceEnablePeerAccess(gpu_b, 0);
    cudaMalloc(&d_a_src, size);
    cudaMalloc(&d_a_dst, size);

    cudaSetDevice(gpu_b);
    cudaDeviceEnablePeerAccess(gpu_a, 0);
    cudaMalloc(&d_b_src, size);
    cudaMalloc(&d_b_dst, size);

    // Create streams on each GPU
    cudaStream_t stream_a, stream_b;
    cudaSetDevice(gpu_a);
    cudaStreamCreate(&stream_a);
    cudaSetDevice(gpu_b);
    cudaStreamCreate(&stream_b);

    // Warm up
    cudaMemcpyPeerAsync(d_b_dst, gpu_b, d_a_src, gpu_a, size, stream_a);
    cudaMemcpyPeerAsync(d_a_dst, gpu_a, d_b_src, gpu_b, size, stream_b);
    cudaSetDevice(gpu_a); cudaDeviceSynchronize();
    cudaSetDevice(gpu_b); cudaDeviceSynchronize();

    cudaSetDevice(gpu_a);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int iterations = 50;
    cudaEventRecord(start, stream_a);
    for (int i = 0; i < iterations; i++) {
        cudaMemcpyPeerAsync(d_b_dst, gpu_b, d_a_src, gpu_a, size, stream_a);
        cudaMemcpyPeerAsync(d_a_dst, gpu_a, d_b_src, gpu_b, size, stream_b);
    }
    cudaEventRecord(stop, stream_a);  // times stream_a; assumes stream_b finishes at ~the same time
    cudaSetDevice(gpu_a); cudaStreamSynchronize(stream_a);
    cudaSetDevice(gpu_b); cudaStreamSynchronize(stream_b);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    double bw = 2.0 * (double)size * iterations / (ms / 1000.0) / 1e9;

    printf("GPU %d <-> GPU %d: %.1f GB/s (bidirectional)\n", gpu_a, gpu_b, bw);

    cudaFree(d_a_src); cudaFree(d_a_dst);
    cudaFree(d_b_src); cudaFree(d_b_dst);
    cudaStreamDestroy(stream_a); cudaStreamDestroy(stream_b);
    cudaEventDestroy(start); cudaEventDestroy(stop);
}

int main() {
    int device_count;
    cudaGetDeviceCount(&device_count);
    printf("Found %d GPUs\n\n", device_count);

    printf("=== Unidirectional P2P Bandwidth ===\n");
    for (int i = 0; i < device_count; i++) {
        for (int j = 0; j < device_count; j++) {
            if (i != j) measure_p2p_bandwidth(i, j);
        }
    }

    printf("\n=== Bidirectional P2P Bandwidth ===\n");
    for (int i = 0; i < device_count; i++) {
        for (int j = i + 1; j < device_count; j++) {
            measure_bidirectional(i, j);
        }
    }

    return 0;
}

Expected Results


Expected P2P Bandwidth Measurements

| Connection | Unidirectional | Bidirectional | Notes |
|---|---|---|---|
| NVLink 3.0 (A100 pair) | ~270-290 GB/s | ~530-570 GB/s | Per-pair bandwidth (not the full 600 GB/s per GPU) |
| NVLink 4.0 (H100 pair) | ~400-430 GB/s | ~780-850 GB/s | NVSwitch routing adds minor overhead |
| PCIe Gen4 (no NVLink) | ~22-25 GB/s | ~40-48 GB/s | Through CPU root complex |
| PCIe Gen5 (no NVLink) | ~48-55 GB/s | ~90-110 GB/s | Through CPU root complex |
Note: Per-pair NVLink bandwidth is a fraction of per-GPU total because each GPU distributes its links across all peers.

NCCL All-Reduce Benchmark

For practical all-reduce performance (which is what tensor parallelism actually uses), benchmark with NCCL:

# Build nccl-tests
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1 CUDA_HOME=/usr/local/cuda NCCL_HOME=/usr/lib/x86_64-linux-gnu

# Run all-reduce benchmark across all GPUs in the node
mpirun -np 8 ./build/all_reduce_perf \
  -b 1K -e 1G -f 2 \
  -g 1 -t 1

# Key output columns:
#   size(B)  count     type  redop  time(us)  algbw(GB/s)  busbw(GB/s)
# algbw = algorithm bandwidth (size / time)
# busbw = bus bandwidth (algbw * (2*(n-1)/n) for ring all-reduce)
# busbw should approach per-GPU NVLink bandwidth for large messages

Typical NCCL all-reduce bus bandwidth on 8x H100 with NVLink 4.0:

  • 1 KB: approximately 0.5 GB/s (latency-dominated)
  • 64 KB: approximately 50 GB/s (transitional)
  • 1 MB: approximately 350 GB/s (approaching bandwidth limit)
  • 64 MB: approximately 420-440 GB/s (near peak per-GPU unidirectional)
  • 1 GB: approximately 430-450 GB/s (peak sustained)
Bus Bandwidth vs Algorithm Bandwidth

NCCL reports both algorithm bandwidth (data size / time) and bus bandwidth (algorithm bandwidth multiplied by 2(N-1)/N). Bus bandwidth represents the actual wire utilization and should be compared against the per-GPU NVLink bandwidth. Algorithm bandwidth is what the application sees. For 8-GPU ring all-reduce, bus bandwidth is 2 × 7/8 = 1.75x the algorithm bandwidth.

# Show NVLink topology
nvidia-smi topo -m

# Example output for DGX H100:
#         GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7
# GPU0     X    NV18  NV18  NV18  NV18  NV18  NV18  NV18
# GPU1    NV18   X    NV18  NV18  NV18  NV18  NV18  NV18
# ...
# NV18 = connected via 18 NVLink connections (full NVSwitch)

# For PCIe-only systems:
#         GPU0  GPU1  GPU2  GPU3
# GPU0     X    PIX   PHB   SYS
# GPU1    PIX    X    SYS   PHB
# PIX = same PCIe switch
# PHB = same CPU socket, different PCIe switch
# SYS = different CPU sockets (crosses QPI/UPI)

Interconnect Strategy for LLM Serving

The interconnect determines the parallelism strategy:


Parallelism Strategy by Interconnect

| Parallelism Type | Communication Pattern | Min Bandwidth Needed | Suitable Interconnect |
|---|---|---|---|
| Tensor Parallelism | All-reduce per layer (2x) | High (hundreds of GB/s) | NVLink only |
| Pipeline Parallelism | Point-to-point, one activation per microbatch | Moderate (tens of GB/s) | NVLink or InfiniBand |
| Expert Parallelism (MoE) | All-to-all dispatch per layer | Very high (scales with experts) | NVLink, ideally NVSwitch |
| Data Parallelism | All-reduce of gradients (training only) | Moderate (can overlap with compute) | InfiniBand sufficient |

For a concrete example: serving Llama 3 405B on 8x H100 vs 4x B200:

8x H100 (NVLink 4.0, 900 GB/s per GPU):

  • 8-way tensor parallelism
  • Each GPU holds 50 GB of weights (FP8)
  • 2 all-reduces per layer, 80 layers
  • All-reduce overhead at batch=32: approximately 25% of layer time
  • Effective throughput: approximately 75% of ideal

4x B200 (NVLink 5.0, 1,800 GB/s per GPU):

  • 4-way tensor parallelism
  • Each GPU holds 101 GB of weights (FP8)
  • 2 all-reduces per layer, 126 layers, but 4-way TP means fewer ring steps per all-reduce, each running over 2x-faster links
  • All-reduce overhead at batch=32: approximately 8% of layer time
  • Effective throughput: approximately 92% of ideal

Fewer GPUs with more memory and faster interconnect yields both higher absolute throughput and higher efficiency. This is the fundamental value proposition of NVLink bandwidth growth and HBM capacity growth — they reduce the number of parallelism-induced communication events.

The Full Picture: Data Center Interconnect

A production LLM serving cluster has four interconnect layers:

  1. Intra-GPU: HBM to SM (3,350 GB/s on H100)
  2. Intra-node GPU-GPU: NVLink via NVSwitch (900 GB/s per GPU on H100)
  3. Inter-node GPU-GPU: InfiniBand via GPUDirect RDMA (50-100 GB/s per GPU)
  4. Client-facing: Ethernet to load balancers and API servers (25-100 Gbps)

Each layer is 7-20x slower than the one above it. The art of distributed inference is keeping all heavy communication within the fastest available domain. Tensor parallelism stays within NVLink. Pipeline parallelism uses InfiniBand. KV cache transfer for disaggregated serving flows over RDMA. The NVL72 collapses levels 2 and 3 into a single NVLink domain for 72 GPUs, which is why NVIDIA positions it as the inference platform for trillion-parameter models.