Tensor parallelism splits every matrix multiply across multiple GPUs. After each split GEMM, the partial results must be combined via an all-reduce before the next layer can begin. For an 8-way tensor-parallel Llama 70B on H100s, each layer requires an all-reduce of approximately 17 MB. With 80 layers and 48 tokens per second, that is 17 MB x 80 layers x 48 tokens/s, or approximately 65 GB/s of all-reduce traffic per GPU.
Over PCIe Gen5 at 64 GB/s per direction (128 GB/s bidirectional), this traffic alone would consume the entire bus. PCIe all-reduce is also far less efficient than point-to-point NVLink, so actual achievable throughput would be perhaps 30-40 GB/s. Over NVLink 4.0 at 900 GB/s bidirectional, the same traffic is about 7% of available bandwidth. This is why NVLink exists: tensor parallelism is impractical over PCIe for any model large enough to require it.
This post covers the complete NVIDIA interconnect stack: NVLink generations 1.0 through 5.0, the NVSwitch topology that enables all-to-all communication, the NVL72 rack-scale fabric, PCIe as the fallback, and InfiniBand for inter-node communication. The focus is on what these interconnects mean for LLM serving throughput.
The Interconnect Hierarchy
GPU interconnect operates at several levels, each roughly an order of magnitude slower than the one above:
GPU Interconnect Bandwidth Hierarchy (H100 DGX System)
| Level | Technology | Bandwidth (bidirectional) | Latency | Scope |
|---|---|---|---|---|
| On-chip (SM to SM) | Crossbar / L2 | ~12 TB/s (L2 BW) | ~200 cycles | Single GPU die |
| HBM | HBM3 | 3,350 GB/s | ~400-600 cycles | GPU to local memory |
| NVLink (intra-node) | NVLink 4.0 | 900 GB/s per GPU | ~1-2 us | 8 GPUs in one node |
| PCIe (intra-node) | PCIe Gen5 x16 | 128 GB/s | ~2-5 us | GPU to CPU, GPU to NIC |
| InfiniBand (inter-node) | NDR 400G | 400 Gbps = 50 GB/s | ~1-5 us | Across nodes in cluster |
| Ethernet (inter-node) | 400GbE RoCE | 400 Gbps = 50 GB/s | ~2-10 us | Across nodes in cluster |
Each level down represents a communication boundary that affects parallelism strategy. Tensor parallelism (splitting a single layer) works within NVLink domains. Pipeline parallelism (splitting across layers) can tolerate InfiniBand latencies. Expert parallelism (MoE routing) requires high bisection bandwidth within a node.
NVLink Generations
NVLink 1.0 (Pascal P100, 2016)
The first NVLink provided 4 links per GPU at 40 GB/s each (bidirectional), for a total of 160 GB/s per GPU. This was 5x the PCIe Gen3 x16 bandwidth of 32 GB/s. The P100 could connect to up to 4 other GPUs, but the topology was limited — a fully-connected 4-GPU setup required a dedicated NVLink switch or a mesh.
NVLink 2.0 (Volta V100, 2017)
NVLink 2.0 doubled the link count to 6 per GPU at 50 GB/s each, for 300 GB/s per GPU. The V100 DGX-1 connected 8 GPUs in a hybrid cube-mesh topology — not fully connected, as 6 links could not connect to all 7 other GPUs. Some GPU pairs communicated via intermediate hops.
NVLink 3.0 (Ampere A100, 2020)
NVLink 3.0 provided 12 links per GPU at 50 GB/s each, for 600 GB/s per GPU. With 12 links and only 7 other GPUs in the node, the DGX A100 achieved a fully-connected topology — every GPU has a direct NVLink connection to every other GPU. This eliminated the routing overhead present in Volta’s hybrid topology.
The A100 also introduced NVSwitch 2.0 — a dedicated switch chip that connects all NVLink ports. Instead of point-to-point NVLink cables between GPUs, all links terminate at NVSwitch ASICs, which can route any GPU’s traffic to any other GPU. The DGX A100 uses 6 NVSwitch chips, each connecting to all 8 GPUs.
NVLink 4.0 (Hopper H100, 2022)
NVLink 4.0 increased per-link bandwidth to 50 GB/s (same as NVLink 3.0) but expanded to 18 links per GPU, for 900 GB/s per GPU. The DGX H100 uses NVSwitch 3.0 to provide full bisection bandwidth across all 8 GPUs. NVSwitch 3.0 also introduces in-network compute — the switch itself can perform reductions (sum, max) on data flowing through it, reducing all-reduce latency.
NVSwitch in-network reduction: During an all-reduce, instead of each GPU sending data to all others and then summing locally, the NVSwitch performs the partial sum as data passes through. This reduces the all-reduce from bandwidth steps to approximately 1 bandwidth step for large messages.
NVLink 5.0 (Blackwell B200, 2024)
NVLink 5.0 doubles per-link bandwidth to 100 GB/s, with 18 links per GPU for 1,800 GB/s per GPU. The Blackwell NVSwitch 4.0 supports 72 GPU connections (up from 8), enabling the NVL72 — a single rack with 72 B200 GPUs connected as one coherent domain.
NVLink Generation Comparison
| Spec | NVLink 1.0 | NVLink 2.0 | NVLink 3.0 | NVLink 4.0 | NVLink 5.0 |
|---|---|---|---|---|---|
| GPU | P100 | V100 | A100 | H100 | B200 |
| Links per GPU | 4 | 6 | 12 | 18 | 18 |
| BW per link (bidir.) | 40 GB/s | 50 GB/s | 50 GB/s | 50 GB/s | 100 GB/s |
| Total BW per GPU | 160 GB/s | 300 GB/s | 600 GB/s | 900 GB/s | 1,800 GB/s |
| Signaling | NRZ 20 Gbps | NRZ 25 Gbps | NRZ 25 Gbps | PAM4 50 Gbps | PAM4 100 Gbps |
| Max GPUs (NVSwitch) | N/A | 16 (DGX-2) | 8 | 8 | 72 |
| In-network compute | No | No | No | Yes | Yes |
NVSwitch: From Node-Scale to Rack-Scale
The Topology Problem
Without NVSwitch, GPUs connect point-to-point via NVLink cables. With N GPUs and L links per GPU, each GPU can directly connect to at most L peers. For 8 GPUs with 12 links each (A100), you can fully connect all 8 (each pair uses 1-2 links). But for 16, 32, or 72 GPUs, point-to-point is impossible — you would need hundreds of links per GPU.
NVSwitch solves this by acting as a crossbar switch. Each NVSwitch chip has 64 NVLink ports on Blackwell (NVSwitch 4.0). Multiple NVSwitch chips can be combined to build a multi-stage fat-tree topology.
NVSwitch 3.0 (Hopper DGX)
The DGX H100 uses 4 NVSwitch 3.0 chips, each with 64 NVLink 4.0 ports. Every GPU spreads its 18 links across the 4 NVSwitch chips, and each NVSwitch can route between any pair of connected GPUs.
Full bisection bandwidth for the 8-GPU domain: any 4 GPUs can communicate with the other 4 at full rate, for 4 x 900 GB/s = 3.6 TB/s across the bisection.
This means any subset of GPUs can communicate with any other subset at 900 GB/s per GPU without contention. This is critical for all-reduce operations where all GPUs send and receive simultaneously.
NVSwitch 4.0 and NVL72 (Blackwell)
The NVL72 configuration connects 72 B200 GPUs (36 Grace-Blackwell superchips, where each superchip pairs 2 B200s with 1 Grace CPU) into a single NVSwitch domain. This uses 9 NVSwitch 4.0 trays, each containing multiple NVSwitch chips.
The aggregate NVLink bandwidth is approximately 130 TB/s (72 x 1.8 TB/s). Every GPU can communicate with every other GPU at full NVLink 5.0 speed (1,800 GB/s per GPU) without routing through any CPU or PCIe bus.
For a 1.8 trillion parameter model (GPT-4 class), the NVL72 arithmetic works out as follows. At FP4 (0.5 bytes per param), a 1.8T model occupies 900 GB, which would fit in the HBM of as few as 5 GPUs. But with tensor parallelism across all 72 GPUs, each GPU holds only 12.5 GB of weights, and the aggregate HBM bandwidth reading those weights is 72 x 8 TB/s = 576 TB/s.
A traditional 9-node cluster with 8 H100 GPUs each has 72 GPUs total. But inter-node communication uses InfiniBand at 400 Gbps (50 GB/s) — 18x slower than NVLink. Tensor parallelism across nodes is impractical. The NVL72 eliminates this bottleneck by making all 72 GPUs appear as one NVLink domain. This enables 72-way tensor parallelism, which is meaningless for small models but transformative for trillion-parameter inference.
PCIe: The Fallback Path
Every NVIDIA data center GPU has a PCIe interface in addition to NVLink. PCIe serves as the connection between GPU and CPU, GPU and network interface cards (NICs), and GPU and NVMe storage.
PCIe Generation Comparison
| PCIe Gen | Per-lane BW | x16 BW (bidir.) | GPU | Encoding |
|---|---|---|---|---|
| Gen3 | 1 GB/s | 32 GB/s | V100 | 128b/130b |
| Gen4 | 2 GB/s | 64 GB/s | A100 | 128b/130b |
| Gen5 | 4 GB/s | 128 GB/s | H100, B200 | 128b/130b |
| Gen6 | 8 GB/s | 256 GB/s | Future | PAM4, FEC |
PCIe is roughly 7-14x slower than NVLink of the same generation: 128 GB/s PCIe Gen5 vs 900 GB/s NVLink 4.0 on the H100 (7x), and vs 1,800 GB/s NVLink 5.0 on the B200 (14x).
When PCIe Matters
PCIe bandwidth matters in three scenarios:
- CPU offloading: When KV cache or model layers are partially offloaded to CPU memory (DeepSpeed ZeRO-Offload), data must traverse PCIe. At 128 GB/s (Gen5 x16), moving 1 GB of KV cache takes 7.8 ms — adding significant latency per token.
- Network I/O: Data received from remote nodes via InfiniBand flows through the NIC, across PCIe, to GPU memory. GPUDirect RDMA (GDR) allows InfiniBand NICs to write directly to GPU memory over PCIe, bypassing CPU memory entirely.
- Multi-GPU without NVLink: Some cloud instances provide multiple GPUs connected only via PCIe (no NVLink). Tensor parallelism over PCIe is 7-14x slower than over NVLink, making it practical only for pipeline parallelism or data parallelism.
InfiniBand: Inter-Node Communication
Within a node, NVLink handles GPU-to-GPU communication. Between nodes, InfiniBand (IB) or Ethernet handles the traffic. For LLM training and multi-node inference, inter-node bandwidth determines pipeline parallelism and data parallelism efficiency.
InfiniBand Generations for AI Clusters
| IB Generation | Per-port BW | Links per GPU | Effective GPU BW | Deployment |
|---|---|---|---|---|
| EDR (2014) | 12.5 GB/s (100G) | 1 | 12.5 GB/s | DGX-1 (V100) |
| HDR (2019) | 25 GB/s (per port) | 1-2 | 25-50 GB/s | DGX A100 |
| NDR (2022) | 50 GB/s (400G) | 1 | 50 GB/s | DGX H100 |
| XDR (800G, 2024) | 100 GB/s | 1 | 100 GB/s | DGX B200 |
GPUDirect RDMA
Without GPUDirect RDMA, inter-node GPU communication follows this path:
GPU0 (Node A) -> PCIe -> CPU memory (Node A) -> NIC -> Network ->
NIC -> CPU memory (Node B) -> PCIe -> GPU0 (Node B)
Two PCIe traversals and two CPU memory copies. GPUDirect RDMA eliminates the CPU memory copies:
GPU0 (Node A) -> PCIe -> NIC -> Network ->
NIC -> PCIe -> GPU0 (Node B)
The NIC reads directly from GPU memory and writes directly to GPU memory. This reduces latency by approximately 2-5 microseconds and doubles effective bandwidth (no CPU memory bottleneck).
GPUDirect RDMA over NVLink (GDR-NVL)
On DGX H100 and newer, the ConnectX-7 InfiniBand NIC connects to the NVSwitch fabric rather than PCIe. This means inter-node transfers flow through NVLink instead of PCIe:
GPU0 (Node A) -> NVLink -> NVSwitch -> NIC -> Network ->
NIC -> NVSwitch -> NVLink -> GPU0 (Node B)
The NVLink path provides 900 GB/s (Hopper) vs 128 GB/s (PCIe Gen5), but the network itself is the bottleneck at 50-100 GB/s. The benefit of NVLink routing is lower latency and the ability to aggregate traffic from multiple GPUs through the same NIC without PCIe contention.
Why Interconnect Determines TP Efficiency
All-Reduce Cost Analysis
In tensor parallelism across N GPUs, each layer requires two all-reduce operations: one after the column-parallel GEMM and one after the row-parallel GEMM. Each all-reduce transfers approximately M = b x s x h x 2 bytes per GPU (the FP16 activation tensor: batch x sequence length x hidden dimension x 2 bytes).
Using the ring all-reduce algorithm, the time for one all-reduce is:
T_allreduce = 2 x (N - 1)/N x M / BW
where BW is the per-GPU interconnect bandwidth and the factor of 2 accounts for the reduce-scatter and all-gather phases.
For a 70B model with hidden dimension 8192, sequence length 1 (decode), and FP16 activations, the message size is M = 1 x 1 x 8192 x 2 bytes = 16 KB per token.
Per layer (2 all-reduces), the bandwidth-only cost is:
All-Reduce Time per Layer (70B, 8-way TP, batch=1)
| Interconnect | BW per GPU | Time per All-Reduce | Time per Layer (2x) | 80 Layers Total |
|---|---|---|---|---|
| PCIe Gen4 | 64 GB/s | 0.45 us | 0.89 us | 71.2 us |
| PCIe Gen5 | 128 GB/s | 0.22 us | 0.45 us | 35.6 us |
| NVLink 3.0 (A100) | 600 GB/s | 0.048 us | 0.095 us | 7.6 us |
| NVLink 4.0 (H100) | 900 GB/s | 0.032 us | 0.063 us | 5.1 us |
| NVLink 5.0 (B200) | 1,800 GB/s | 0.016 us | 0.032 us | 2.5 us |
For small messages (batch=1 decode), the all-reduce time is dominated by latency, not bandwidth. NVLink latency is approximately 1-2 microseconds per all-reduce, which makes the bandwidth calculation above optimistic. At batch=1, the real all-reduce cost is closer to 1-2 us per operation regardless of NVLink generation. At batch=256 (16 KB x 256 = 4 MB messages), bandwidth dominates and the NVLink generation matters significantly.
TP Overhead as Percentage of Compute
The critical metric is the fraction of total layer time spent on communication: overhead = T_comm / (T_comm + T_compute).
For a 70B model decode at batch=32, FP8 on H100:
- Compute time per layer: approximately 12 microseconds (70B / 80 layers = 875M params per layer; with 8-way TP each GPU computes 875M/8 x 2 FLOP x 32 batch, approximately 7 GFLOP, which at a realistic fraction of the 1,979 TFLOPS FP8 peak takes roughly 12 us)
- All-reduce time per layer over NVLink 4.0: approximately 4 microseconds (2 x 32 x 16 KB / 900 GB/s + 2 x 1 us latency)
Over PCIe Gen5:
- All-reduce time: approximately 10 microseconds (2 x 32 x 16 KB / 128 GB/s + 2 x 3 us latency)
This is why NVIDIA insists on NVLink for tensor parallelism. At 25% overhead (4 of 16 us per layer), 8-way TP delivers approximately 6x the throughput of a single GPU. At 45% overhead (10 of 22 us), it delivers approximately 4.4x — a 27% efficiency loss.
Implementation: Measuring NVLink vs PCIe Bandwidth
Peer-to-Peer Bandwidth Test
#include <cuda_runtime.h>
#include <cstdio>
void measure_p2p_bandwidth(int src_gpu, int dst_gpu) {
size_t size = 256 * 1024 * 1024; // 256 MB
void *d_src, *d_dst;
// Check P2P access
int can_access;
cudaDeviceCanAccessPeer(&can_access, src_gpu, dst_gpu);
if (!can_access) {
printf("GPU %d cannot access GPU %d peer-to-peer\n", src_gpu, dst_gpu);
return;
}
// Enable P2P
cudaSetDevice(src_gpu);
cudaDeviceEnablePeerAccess(dst_gpu, 0);
cudaMalloc(&d_src, size);
cudaSetDevice(dst_gpu);
cudaDeviceEnablePeerAccess(src_gpu, 0);
cudaMalloc(&d_dst, size);
// Measure
cudaSetDevice(src_gpu);
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
// Warm up
cudaMemcpyPeer(d_dst, dst_gpu, d_src, src_gpu, size);
cudaDeviceSynchronize();
int iterations = 50;
cudaEventRecord(start);
for (int i = 0; i < iterations; i++) {
cudaMemcpyPeer(d_dst, dst_gpu, d_src, src_gpu, size);
}
cudaEventRecord(stop);
cudaDeviceSynchronize();
float ms;
cudaEventElapsedTime(&ms, start, stop);
double bw = (double)size * iterations / (ms / 1000.0) / 1e9;
printf("GPU %d -> GPU %d: %.1f GB/s (unidirectional)\n", src_gpu, dst_gpu, bw);
cudaFree(d_src);
cudaFree(d_dst);
cudaEventDestroy(start);
cudaEventDestroy(stop);
}
void measure_bidirectional(int gpu_a, int gpu_b) {
size_t size = 256 * 1024 * 1024;
void *d_a_src, *d_a_dst, *d_b_src, *d_b_dst;
cudaSetDevice(gpu_a);
cudaDeviceEnablePeerAccess(gpu_b, 0);
cudaMalloc(&d_a_src, size);
cudaMalloc(&d_a_dst, size);
cudaSetDevice(gpu_b);
cudaDeviceEnablePeerAccess(gpu_a, 0);
cudaMalloc(&d_b_src, size);
cudaMalloc(&d_b_dst, size);
// Create streams on each GPU
cudaStream_t stream_a, stream_b;
cudaSetDevice(gpu_a);
cudaStreamCreate(&stream_a);
cudaSetDevice(gpu_b);
cudaStreamCreate(&stream_b);
// Warm up
cudaMemcpyPeerAsync(d_b_dst, gpu_b, d_a_src, gpu_a, size, stream_a);
cudaMemcpyPeerAsync(d_a_dst, gpu_a, d_b_src, gpu_b, size, stream_b);
cudaSetDevice(gpu_a); cudaDeviceSynchronize();
cudaSetDevice(gpu_b); cudaDeviceSynchronize();
cudaSetDevice(gpu_a);
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
int iterations = 50;
// Cross-stream event: make stream_a wait for stream_b's copies before
// recording `stop`, so the timing covers traffic in both directions
// (otherwise stop could fire before stream_b finishes, inflating bandwidth).
cudaEvent_t b_done;
cudaSetDevice(gpu_b);
cudaEventCreate(&b_done);
cudaSetDevice(gpu_a);
cudaEventRecord(start, stream_a);
for (int i = 0; i < iterations; i++) {
cudaMemcpyPeerAsync(d_b_dst, gpu_b, d_a_src, gpu_a, size, stream_a);
cudaMemcpyPeerAsync(d_a_dst, gpu_a, d_b_src, gpu_b, size, stream_b);
}
cudaSetDevice(gpu_b);
cudaEventRecord(b_done, stream_b);
cudaSetDevice(gpu_a);
cudaStreamWaitEvent(stream_a, b_done, 0);
cudaEventRecord(stop, stream_a);
cudaStreamSynchronize(stream_a);
cudaEventDestroy(b_done);
float ms;
cudaEventElapsedTime(&ms, start, stop);
double bw = 2.0 * (double)size * iterations / (ms / 1000.0) / 1e9;
printf("GPU %d <-> GPU %d: %.1f GB/s (bidirectional)\n", gpu_a, gpu_b, bw);
cudaFree(d_a_src); cudaFree(d_a_dst);
cudaFree(d_b_src); cudaFree(d_b_dst);
cudaStreamDestroy(stream_a); cudaStreamDestroy(stream_b);
cudaEventDestroy(start); cudaEventDestroy(stop);
}
int main() {
int device_count;
cudaGetDeviceCount(&device_count);
printf("Found %d GPUs\n\n", device_count);
printf("=== Unidirectional P2P Bandwidth ===\n");
for (int i = 0; i < device_count; i++) {
for (int j = 0; j < device_count; j++) {
if (i != j) measure_p2p_bandwidth(i, j);
}
}
printf("\n=== Bidirectional P2P Bandwidth ===\n");
for (int i = 0; i < device_count; i++) {
for (int j = i + 1; j < device_count; j++) {
measure_bidirectional(i, j);
}
}
return 0;
}
Expected Results
Expected P2P Bandwidth Measurements
| Connection | Unidirectional | Bidirectional | Notes |
|---|---|---|---|
| NVLink 3.0 (A100 pair) | ~270-290 GB/s | ~530-570 GB/s | Per-pair bandwidth (not full 600 GB/s per GPU) |
| NVLink 4.0 (H100 pair) | ~400-430 GB/s | ~780-850 GB/s | NVSwitch routing adds minor overhead |
| PCIe Gen4 (no NVLink) | ~22-25 GB/s | ~40-48 GB/s | Through CPU root complex |
| PCIe Gen5 (no NVLink) | ~48-55 GB/s | ~90-110 GB/s | Through CPU root complex |
NCCL All-Reduce Benchmark
For practical all-reduce performance (which is what tensor parallelism actually uses), benchmark with NCCL:
# Build nccl-tests
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1 CUDA_HOME=/usr/local/cuda NCCL_HOME=/usr/lib/x86_64-linux-gnu
# Run all-reduce benchmark across all GPUs in the node
mpirun -np 8 ./build/all_reduce_perf \
-b 1K -e 1G -f 2 \
-g 1 -t 1
# Key output columns:
# size(B) count type redop time(us) algbw(GB/s) busbw(GB/s)
# algbw = algorithm bandwidth (size / time)
# busbw = bus bandwidth (algbw * (2*(n-1)/n) for ring all-reduce)
# busbw should approach per-GPU NVLink bandwidth for large messages
Typical NCCL all-reduce bus bandwidth on 8x H100 with NVLink 4.0:
- 1 KB: approximately 0.5 GB/s (latency-dominated)
- 64 KB: approximately 50 GB/s (transitional)
- 1 MB: approximately 350 GB/s (approaching bandwidth limit)
- 64 MB: approximately 420-440 GB/s (near peak per-GPU unidirectional)
- 1 GB: approximately 430-450 GB/s (peak sustained)
NCCL reports both algorithm bandwidth (data size / time) and bus bandwidth (algorithm bandwidth multiplied by 2(n-1)/n). Bus bandwidth represents the actual wire utilization and should be compared against the per-GPU NVLink bandwidth. Algorithm bandwidth is what the application sees. For 8-GPU ring all-reduce, bus bandwidth is 1.75x the algorithm bandwidth.
Detecting NVLink Topology
# Show NVLink topology
nvidia-smi topo -m
# Example output for DGX H100:
# GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
# GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18
# GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18
# ...
# NV18 = connected via 18 NVLink connections (full NVSwitch)
# For PCIe-only systems:
# GPU0 GPU1 GPU2 GPU3
# GPU0 X PIX PHB SYS
# GPU1 PIX X SYS PHB
# PIX = same PCIe switch
# PHB = same CPU socket, different PCIe switch
# SYS = different CPU sockets (crosses QPI/UPI)
Interconnect Strategy for LLM Serving
The interconnect determines the parallelism strategy:
Parallelism Strategy by Interconnect
| Parallelism Type | Communication Pattern | Min Bandwidth Needed | Suitable Interconnect |
|---|---|---|---|
| Tensor Parallelism | All-reduce per layer (2x) | High (hundreds of GB/s) | NVLink only |
| Pipeline Parallelism | Point-to-point, one activation per microbatch | Moderate (tens of GB/s) | NVLink or InfiniBand |
| Expert Parallelism (MoE) | All-to-all dispatch per layer | Very high (scales with experts) | NVLink, ideally NVSwitch |
| Data Parallelism | All-reduce of gradients (training only) | Moderate (can overlap with compute) | InfiniBand sufficient |
For a concrete example: serving Llama 3 405B on 8x H100 vs 4x B200:
8x H100 (NVLink 4.0, 900 GB/s per GPU):
- 8-way tensor parallelism
- Each GPU holds 50 GB of weights (FP8)
- 2 all-reduces per layer, 80 layers
- All-reduce overhead at batch=32: approximately 25% of layer time
- Effective throughput: approximately 75% of ideal
4x B200 (NVLink 5.0, 1,800 GB/s per GPU):
- 4-way tensor parallelism
- Each GPU holds 101 GB of weights (FP8)
- 2 all-reduces per layer, 80 layers, but each GPU does twice the compute per layer while communication time shrinks (double the bandwidth, fewer ring steps)
- All-reduce overhead at batch=32: approximately 8% of layer time
- Effective throughput: approximately 92% of ideal
Fewer GPUs with more memory and faster interconnect yields both higher absolute throughput and higher efficiency. This is the fundamental value proposition of NVLink bandwidth growth and HBM capacity growth — they reduce the number of parallelism-induced communication events.
The Full Picture: Data Center Interconnect
A production LLM serving cluster has four interconnect layers:
- Intra-GPU: HBM to SM (3,350 GB/s on H100)
- Intra-node GPU-GPU: NVLink via NVSwitch (900 GB/s per GPU on H100)
- Inter-node GPU-GPU: InfiniBand via GPUDirect RDMA (50-100 GB/s per GPU)
- Client-facing: Ethernet to load balancers and API servers (25-100 Gbps)
Each layer is 7-20x slower than the one above it. The art of distributed inference is keeping all heavy communication within the fastest available domain. Tensor parallelism stays within NVLink. Pipeline parallelism uses InfiniBand. KV cache transfer for disaggregated serving flows over RDMA. The NVL72 collapses levels 2 and 3 into a single NVLink domain for 72 GPUs, which is why NVIDIA positions it as the inference platform for trillion-parameter models.