The NVL72 system connects 72 Blackwell GPUs through a fabric of NVSwitch 4.0 ASICs that provide 1.8 TB/s per GPU — totaling 130 TB/s of aggregate NVLink bandwidth. Every GPU can access every other GPU’s memory at near-local speeds. A single allreduce across all 72 GPUs completes in roughly the time it previously took to reduce across 8. The entire 72-GPU system presents a single unified memory address space: a pointer on GPU 0 can reference memory on GPU 71 without explicit copies. This is the NVLink network — not a bus, not a point-to-point mesh, but a single-hop, multi-rail, switched fabric with in-network computation.
This post traces the evolution of NVSwitch from the original DGX-1 through NVL72, covering the switch ASIC architecture, topology design, bandwidth calculations, the SHARP in-network reduction engine, and the unified memory model that makes 72 GPUs behave like one device.
NVSwitch Evolution
NVSwitch Generations
| Generation | Product | GPUs Connected | BW per GPU (bidirectional) | Aggregate NVLink BW | NVLink Gen |
|---|---|---|---|---|---|
| NVSwitch 1.0 | DGX-2 / HGX V100 | 16 V100 | 300 GB/s | 4.8 TB/s | NVLink 2.0 |
| NVSwitch 2.0 | DGX A100 / HGX A100 | 8 A100 | 600 GB/s | 4.8 TB/s | NVLink 3.0 |
| NVSwitch 3.0 | DGX H100 / HGX H100 | 8 H100 | 900 GB/s | 7.2 TB/s | NVLink 4.0 |
| NVSwitch 4.0 | NVL72 (GB200) | 72 B200 | 1,800 GB/s | 130 TB/s | NVLink 5.0 |
NVSwitch ASIC Architecture
What Is Inside an NVSwitch
Each NVSwitch is a dedicated switching ASIC — not a GPU, not a CPU, but a purpose-built crossbar switch with NVLink ports:
// NVSwitch 3.0 (Hopper generation):
// - 64 NVLink 4.0 ports
// - Each port: 25 GB/s per direction (50 GB/s bidirectional)
// - Total switching capacity: 64 * 50 = 3,200 GB/s
// - Non-blocking crossbar: any port can talk to any other port simultaneously
// - SHARP engine: in-network allreduce and broadcast
// - Die size: ~294 mm² on TSMC 4N
// - TDP: ~120W
// NVSwitch 4.0 (Blackwell generation):
// - 128 NVLink 5.0 ports (doubled from 3.0)
// - Each port: 50 GB/s per direction (100 GB/s bidirectional)
// - Total switching capacity: 128 * 100 = 12,800 GB/s
// - Enhanced SHARP 3.0 for FP8/FP4 reduction
// - Multicast support for broadcast operations
The Crossbar
The core of NVSwitch is a non-blocking crossbar. In a non-blocking crossbar, every input port can simultaneously connect to any output port without contention, as long as no two inputs target the same output:
// NVSwitch 3.0 crossbar (simplified):
// 64 input ports × 64 output ports
// Each port: 50 GB/s
// Any permutation routing is supported at full bandwidth
//
// Example: 8 GPUs, each with 18 NVLink links connected to switches
// GPU 0 → Switch 0 (5 links)
// GPU 0 → Switch 1 (4 links)
// GPU 0 → Switch 2 (4 links)
// GPU 0 → Switch 3 (5 links)
// (18 links total, distributed across 4 switches)
// Total per-GPU BW: 18 links × 50 GB/s = 900 GB/s
Topology: DGX H100 (8 GPUs)
The 4-Switch All-to-All Topology
DGX H100 uses 4 NVSwitch 3.0 ASICs to connect 8 H100 GPUs in a fully connected topology:
// DGX H100 Topology:
// 8 H100 GPUs, each with 18 NVLink 4.0 links
// 4 NVSwitch 3.0 ASICs (each with 64 ports)
//
// Connection pattern:
// Each GPU connects 18/4 ≈ 4-5 links to each switch
// (exact distribution varies, total = 18 links per GPU)
//
// GPU-to-GPU bandwidth:
// GPU 0 → GPU 1: traffic can traverse any of the 4 switches
// Each switch provides ~225 GB/s for this pair
// Total GPU-to-GPU: ~900 GB/s (same as single-GPU NVLink BW)
// This is because the topology provides full bisection bandwidth
// Bisection bandwidth:
// Split 8 GPUs into two groups of 4
// Group A → Group B: 4 GPUs × 450 GB/s per direction = 1,800 GB/s
// Bidirectional: 3,600 GB/s across the cut
// (450 GB/s = each GPU's per-direction share of 900 GB/s bidirectional)
Why Full Bisection Matters
Full bisection bandwidth means that for any partition of GPUs into two equal groups, the bandwidth between the groups equals the aggregate bandwidth of one group. This ensures that collective operations (allreduce, all-to-all) run at the same speed regardless of the communication pattern:
// With full bisection bandwidth:
// 8-GPU ring allreduce: each GPU exchanges chunks with its ring neighbors
// Reduce-scatter + all-gather moves 2 × (N-1)/N of the data per GPU
// Time = 2 × (7/8) × data_size / 450 GB/s ≈ data_size / 257 GB/s
// (450 GB/s = per-direction share of the 900 GB/s bidirectional link BW)
// Without full bisection (e.g., PCIe tree):
// Cross-socket traffic bottlenecked by root complex
// 8-GPU allreduce limited by weakest link
Topology: NVL72 (72 GPUs)
The Multi-Rack Fabric
NVL72 connects 72 Blackwell B200 GPUs across a rack-scale system using NVSwitch 4.0:
// NVL72 Physical Layout:
// 36 Grace-Blackwell Superchips (each has 1 Grace CPU + 2 B200 GPUs)
// 72 B200 GPUs total
// 18 NVSwitch 4.0 ASICs
//
// Each B200 GPU has 18 NVLink 5.0 links (1,800 GB/s total)
// Each NVSwitch 4.0 has 128 NVLink 5.0 ports
//
// Connection: each GPU connects 1 link to each of the 18 switches
// 18 switches × 1 link per GPU × 100 GB/s per link = 1,800 GB/s per GPU
//
// Total fabric bandwidth:
// 72 GPUs × 1,800 GB/s = 129,600 GB/s ≈ 130 TB/s aggregate NVLink BW
The topology is a single-stage switch fabric — every GPU connects directly to every switch, so there is no multi-stage routing. GPU-to-GPU traffic traverses exactly two links (GPU-to-switch, switch-to-GPU) and one switch, regardless of which pair of GPUs is communicating. This single-hop fabric provides uniform latency and bandwidth between any pair.
Bandwidth Calculations
// GPU-to-GPU bandwidth in NVL72:
// GPU 0 → GPU 71: traffic can traverse any of the 18 switches
// Each switch provides 100 GB/s (bidirectional) for this pair
// Total: 18 × 100 GB/s = 1,800 GB/s bidirectional
//
// But wait — the GPU's total NVLink bandwidth is also 1,800 GB/s
// So a single GPU-to-GPU transfer saturates the source GPU's bandwidth
// This means NVL72 provides full non-blocking bandwidth:
// ANY pair can communicate at 1,800 GB/s simultaneously
// (as long as no GPU is involved in multiple transfers)
// Bisection bandwidth:
// Split 72 GPUs into two groups of 36
// 18 switches, each has 36 links to each group
// Per-switch inter-group BW: 36 × 50 GB/s = 1,800 GB/s per direction
// Total inter-group BW: 18 × 1,800 = 32,400 GB/s per direction
// Bidirectional: 64,800 GB/s ≈ 65 TB/s
// (The headline 130 TB/s is the aggregate: 72 GPUs × 1,800 GB/s)
Allreduce Performance on NVL72
// Ring allreduce on 72 GPUs:
// Each GPU's chunk: data_size / 72
// Stages: 71 reduce-scatter + 71 all-gather = 2 × (N-1)
// Per-stage transfer: data_size / 72 at 900 GB/s per direction
// Total time: 142 × (data_size / 72) / 900 GB/s
// ≈ data_size / 456 GB/s
//
// For a 1 GB allreduce:
// Time ≈ 1 GB / 456 GB/s ≈ 2.2 ms
//
// Compare to DGX H100 (8 GPUs):
// Time ≈ 2 × (7/8) × 1 GB / 450 GB/s ≈ 3.9 ms
// NVL72 is faster per-GPU AND has 9x more GPUs
Allreduce Latency for 1 GB Payload (ms, lower is better)
SHARP: In-Network Reduction
The Bottleneck SHARP Solves
In a standard allreduce, every GPU sends its data to every other GPU, and each GPU performs a local reduction. The total data movement is substantial — each GPU must read all other GPUs’ contributions:
// Standard ring allreduce on 72 GPUs:
// Each GPU reads 71 chunks and reduces locally
// Total data moved across fabric: 2 × (N-1)/N × data_size × N
// = 2 × 71/72 × 1 GB × 72 = ~142 GB total fabric traffic
// With SHARP in-network reduction:
// NVSwitch performs the reduction inside the switch
// Each GPU sends its chunk exactly once (1 GB × 72 = 72 GB upload)
// Switch reduces all chunks internally
// Switch multicasts the result to each GPU (1 GB × 72 = 72 GB download)
// Per-GPU link traffic: 1 GB each way, vs ~2 GB each way for the ring
// AND the reduction runs at line rate in the switch — no GPU compute wasted
// GPU is free to do useful work during the allreduce
How SHARP Works
Each NVSwitch 4.0 contains SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) engines that can perform arithmetic operations on data as it passes through the switch:
// SHARP operation flow:
// 1. GPU 0 sends its gradient chunk to NVSwitch port 0
// 2. GPU 1 sends its gradient chunk to NVSwitch port 1
// ...
// 72. GPU 71 sends its gradient chunk to NVSwitch port 71
// 73. SHARP engine receives all 72 chunks
// 74. SHARP computes: result = chunk_0 + chunk_1 + ... + chunk_71
// (FP16, BF16, FP32, or FP8 addition at line rate)
// 75. SHARP sends result to all 72 GPUs (multicast)
//
// Steps 1-72 happen simultaneously (non-blocking switch)
// Step 74 happens at wire speed — no store-and-forward
// Total time ≈ data_size / 900 GB/s (limited by each GPU's 900 GB/s per-direction link BW)
Without SHARP, allreduce requires two passes over the data: reduce-scatter (aggregate partial results) and all-gather (distribute the result). With SHARP, a single pass suffices: the switch reduces and multicasts in one operation. For bandwidth-limited allreduce, this provides up to a 2x speedup. The practical improvement on NVL72 is closer to 1.5-1.7x due to protocol overhead.
SHARP Supported Operations
// SHARP 3.0 (NVSwitch 4.0) supports:
// - Sum reduction: FP16, BF16, FP32, FP8 (E4M3, E5M2)
// - Min/Max reduction: integer types
// - Barrier: zero-data synchronization
// - Multicast: one-to-all broadcast
// NOT supported by SHARP:
// - Custom reduction functions
// - Non-commutative operations
// - Operations on complex data types
// For those, fall back to GPU-side reduction
Unified Memory Address Space
How 72 GPUs Share One Address Space
NVL72 implements a hardware-coherent shared memory model. Every GPU can access any other GPU’s HBM using standard load/store instructions:
// On GPU 0, accessing GPU 71's memory:
// This works WITHOUT explicit cudaMemcpy:
__global__ void cross_gpu_kernel(float* remote_data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float val = remote_data[idx]; // Loads from GPU 71's HBM
    // Hardware handles the NVLink transaction transparently
}
// Setup: enable peer access
for (int i = 0; i < 72; i++) {
    for (int j = 0; j < 72; j++) {
        if (i != j) {
            cudaSetDevice(i);
            cudaDeviceEnablePeerAccess(j, 0);
        }
    }
}
// Allocate on GPU 71
cudaSetDevice(71);
float* remote_ptr;
cudaMalloc(&remote_ptr, size);
// Launch kernel on GPU 0 that accesses GPU 71's memory
cudaSetDevice(0);
cross_gpu_kernel<<<grid, block>>>(remote_ptr);
Address Translation
Each GPU has an address translation table that maps virtual addresses to physical locations (local HBM or remote GPU + remote HBM address):
// Address translation for remote access:
// 1. GPU 0 issues load to virtual address 0x7F_0000_1000
// 2. TLB lookup: this address maps to GPU 71, physical 0x0000_1000
// 3. Request sent via NVLink: {dest_gpu: 71, addr: 0x0000_1000, size: 128B}
// 4. NVSwitch routes request to GPU 71 (1-hop)
// 5. GPU 71's memory controller reads from HBM
// 6. Data returns via NVSwitch to GPU 0
// 7. GPU 0 caches the data in its local L2
// Latency: ~2-4 microseconds (round trip through NVSwitch)
// Bandwidth: up to 1,800 GB/s (full NVLink BW per GPU)
Memory Consistency Model
NVL72 uses a relaxed consistency model for remote memory access:
// Writes to remote memory are NOT immediately visible to all GPUs
// Use memory fences for ordering (the flag is volatile so the spin re-reads it):
__device__ void producer(volatile float* remote_flag, float* remote_data) {
    remote_data[0] = 42.0f;   // Write data
    __threadfence_system();   // Ensure the write is visible to all GPUs
    remote_flag[0] = 1.0f;    // Signal that data is ready
}

__device__ void consumer(volatile float* remote_flag, float* remote_data) {
    while (remote_flag[0] != 1.0f) {} // Spin until the flag is set
    __threadfence_system();           // Ensure we see the data write
    float val = remote_data[0];       // Read data (guaranteed to see 42.0f)
}
While the unified address space makes remote memory accessible, it does not make it fast. A local HBM access takes ~500 cycles. A remote access through NVSwitch takes ~5,000-10,000 cycles. Kernels that perform fine-grained random access to remote memory will perform very poorly. The unified address space is most useful for coarse-grained data movement (bulk transfers, collective operations) and for simplifying programming models (no explicit copies).
NVLink Network Protocol
Packet Structure
NVLink 5.0 uses a packet-based protocol:
// NVLink 5.0 packet format (simplified):
// Header: 8 bytes (source, destination, type, length)
// Payload: 0-256 bytes (data)
// CRC: 4 bytes (error detection)
// Total: 12-268 bytes per packet
// Packet types:
// - Read request: header only (8+4 = 12 bytes)
// - Read response: header + data (up to 268 bytes)
// - Write: header + data (up to 268 bytes)
// - Atomic: header + data + old_value (read-modify-write)
// - Sync/fence: header only
Flow Control
NVLink uses credit-based flow control to prevent buffer overflow at the receiver:
// Credit-based flow control:
// 1. Receiver advertises N credits (buffer slots) to sender
// 2. Sender can send N packets before waiting
// 3. As receiver processes packets, it returns credits
// 4. If sender runs out of credits: backpressure (stall)
//
// On NVLink 5.0:
// Credit granularity: 256 bytes (one maximum payload)
// Credits per link: 256 (typical)
// Maximum in-flight data: 256 × 256 = 64 KB per link
// With 18 links: 18 × 64 KB = 1,152 KB ≈ 1.1 MB in-flight per GPU
Error Handling
// NVLink error handling:
// 1. CRC check on every packet
// 2. If CRC fails: packet is retransmitted (automatic)
// 3. Retry latency: ~100 ns (minimal impact on throughput)
// 4. If retry fails (e.g., link degradation): link is marked degraded
// 5. Traffic is rerouted through remaining links
// 6. GPU continues operating at reduced bandwidth
//
// NVLink link failure in NVL72:
// Each GPU has 18 links. Losing 1 link reduces BW by 1/18 ≈ 5.6%
// Losing a link to one specific switch: that switch pair loses 1 rail
// Other 17 switches still provide 94.4% of bandwidth
Practical Programming for NVSwitch Fabrics
NCCL on NVL72
NCCL (NVIDIA Collective Communication Library) is the primary API for multi-GPU collective operations:
#include <nccl.h>
// Initialize NCCL for 72 GPUs (single process, one communicator per device)
ncclComm_t comms[72];
int devices[72];
for (int i = 0; i < 72; i++) devices[i] = i;
ncclCommInitAll(comms, 72, devices);
// Allreduce across all 72 GPUs
// (group the calls — required when one thread drives multiple devices)
ncclGroupStart();
for (int i = 0; i < 72; i++) {
    cudaSetDevice(i);
    ncclAllReduce(sendbuff[i], recvbuff[i], count,
                  ncclFloat, ncclSum, comms[i], streams[i]);
}
ncclGroupEnd();
// NCCL automatically:
// 1. Detects the NVSwitch topology
// 2. Selects the optimal algorithm (tree, ring, or SHARP)
// 3. Splits data across multiple NVLink rails
// 4. Pipelines sends and receives for maximum bandwidth
Topology-Aware Placement
# Query NVLink topology
nvidia-smi topo -m
# GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
# GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18
# GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18
# ...
# NV18 = Connected via 18 NVLinks (through NVSwitch)
# All pairs show NV18 → full mesh topology via NVSwitch
# On NVL72, the output shows NV18 for all 72×72 pairs
# (excluding diagonal)
NVSwitch vs InfiniBand: When to Use Which
NVSwitch vs InfiniBand for GPU Communication
| Metric | NVSwitch 4.0 (NVL72) | InfiniBand NDR 400G | Ratio |
|---|---|---|---|
| Per-GPU bandwidth | 1,800 GB/s | 50 GB/s (single port) | 36x |
| Latency (GPU-to-GPU) | ~2-4 us | ~2-5 us | ~1x |
| Scale | 72 GPUs (single fabric) | 1000s of GPUs | IB wins at scale |
| In-network compute | SHARP (sum, min/max) | SHARP (sum, min/max) | Comparable |
| Memory model | Unified address space | RDMA (explicit) | NVSwitch simpler |
| Cost per GPU | Included in NVL72 system | ~$5K per HCA + switch | NVSwitch bundled |
| Power per GPU | ~30W (switch share) | ~15W (HCA) | ~2x |
The answer is: both. NVL72 provides the intra-rack fabric (72 GPUs, 1,800 GB/s per GPU). InfiniBand (or RoCE) connects multiple NVL72 racks into a cluster (thousands of GPUs, 50-400 GB/s per GPU). The application uses NCCL for both layers — NCCL selects NVSwitch for intra-rack communication and InfiniBand for inter-rack communication, transparently.
Summary
NVSwitch transforms the GPU interconnect from a bottleneck into an enabler. The progression from 8-GPU DGX nodes to 72-GPU NVL72 racks represents a shift from “multiple GPUs connected by a bus” to “a single distributed compute engine with unified memory.” The key metrics: 1,800 GB/s per GPU (2x Hopper’s 900 GB/s), 130 TB/s of aggregate NVLink bandwidth, single-hop switching through NVSwitch 4.0, and SHARP in-network reduction that offloads collective operations from GPU compute.
For model training, NVL72 means that tensor parallelism — which requires allreduce at every layer and is bandwidth-sensitive — can now scale to 72 GPUs without significant communication overhead. A model that previously required 8-way tensor parallelism on H100 (one DGX node) can now use 72-way tensor parallelism on NVL72, keeping each GPU’s per-layer computation small enough to remain compute-bound. This is the architectural enabler for trillion-parameter dense models: not just more GPUs, but more GPUs connected at sufficient bandwidth to act as one.