Part of series: GPU Hardware & AI Accelerators (15 of 30)

The NVL72 system connects 72 Blackwell GPUs through a fabric of NVSwitch 4.0 ASICs that provides 1.8 TB/s per GPU — 130 TB/s of aggregate NVLink bandwidth across the rack. Every GPU can access every other GPU's memory at near-local speeds. A single allreduce across all 72 GPUs completes in less time than it previously took to reduce across 8. The entire 72-GPU system presents a single unified memory address space: a pointer on GPU 0 can reference memory on GPU 71 without explicit copies. This is the NVLink network — not a shared bus, but a multi-rail, switched fabric with in-network computation.

This post traces the evolution of NVSwitch from the original DGX-1 through NVL72, covering the switch ASIC architecture, topology design, bandwidth calculations, the SHARP in-network reduction engine, and the unified memory model that makes 72 GPUs behave like one device.

NVSwitch Evolution


NVSwitch Generations

Generation   | Product             | GPUs Connected | BW per GPU (bidirectional) | Total fabric BW (bidirectional) | NVLink Gen
NVSwitch 1.0 | DGX-2 / HGX V100    | 16 V100        | 300 GB/s                   | 4.8 TB/s                        | NVLink 2.0
NVSwitch 2.0 | DGX A100 / HGX A100 | 8 A100         | 600 GB/s                   | 4.8 TB/s                        | NVLink 3.0
NVSwitch 3.0 | DGX H100 / HGX H100 | 8 H100         | 900 GB/s                   | 7.2 TB/s                        | NVLink 4.0
NVSwitch 4.0 | NVL72 (GB200)       | 72 B200        | 1,800 GB/s                 | 130 TB/s                        | NVLink 5.0
Note: Each generation roughly doubles per-GPU bandwidth. NVL72's 130 TB/s is a 27x increase in total fabric bandwidth over DGX A100.

NVSwitch ASIC Architecture

What Is Inside an NVSwitch

Each NVSwitch is a dedicated switching ASIC — not a GPU, not a CPU, but a purpose-built crossbar switch with NVLink ports:

// NVSwitch 3.0 (Hopper generation):
// - 64 NVLink 4.0 ports
// - Each port: 25 GB/s per direction (50 GB/s bidirectional)
// - Total switching capacity: 64 * 50 = 3,200 GB/s
// - Non-blocking crossbar: any port can talk to any other port simultaneously
// - SHARP engine: in-network allreduce and broadcast
// - Die size: ~294 mm² on TSMC 4N
// - TDP: ~120W

// NVSwitch 4.0 (Blackwell generation):
// - 128 NVLink 5.0 ports (doubled from 3.0)
// - Each port: 50 GB/s per direction (100 GB/s bidirectional)
// - Total switching capacity: 128 * 100 = 12,800 GB/s
// - Enhanced SHARP 3.0 for FP8/FP4 reduction
// - Multicast support for broadcast operations

The Crossbar

The core of NVSwitch is a non-blocking crossbar. In a non-blocking crossbar, every input port can simultaneously connect to any output port without contention, as long as no two inputs target the same output:

// NVSwitch 3.0 crossbar (simplified):
// 64 input ports × 64 output ports
// Each port: 50 GB/s
// Any permutation routing is supported at full bandwidth
//
// Example: DGX H100 — 8 GPUs, each with 18 NVLink links spread across 4 switches
// GPU 0 → Switch 0: 5 links
// GPU 0 → Switch 1: 4 links
// GPU 0 → Switch 2: 4 links
// GPU 0 → Switch 3: 5 links
// Total per-GPU BW: 18 links × 50 GB/s = 900 GB/s

Topology: DGX H100 (8 GPUs)

The 4-Switch All-to-All Topology

DGX H100 uses 4 NVSwitch 3.0 ASICs to connect 8 H100 GPUs in a fully connected topology:

// DGX H100 Topology:
// 8 H100 GPUs, each with 18 NVLink 4.0 links
// 4 NVSwitch 3.0 ASICs (each with 64 ports)
//
// Connection pattern:
// Each GPU connects 18/4 ≈ 4-5 links to each switch
// (exact distribution varies, total = 18 links per GPU)
//
// GPU-to-GPU bandwidth:
// GPU 0 → GPU 1: traffic can traverse any of the 4 switches
// Each switch provides ~225 GB/s for this pair
// Total GPU-to-GPU: ~900 GB/s (same as single-GPU NVLink BW)
// This is because the topology provides full bisection bandwidth

// Bisection bandwidth:
// Split 8 GPUs into two groups of 4
// Each GPU injects 450 GB/s per direction (900 GB/s bidirectional)
// Group A → Group B: 4 GPUs × 450 GB/s = 1,800 GB/s per direction
// Bidirectional bisection: 3,600 GB/s; total fabric BW: 8 × 900 = 7,200 GB/s

Why Full Bisection Matters

Full bisection bandwidth means that for any partition of GPUs into two equal groups, the bandwidth between the groups equals the aggregate bandwidth of one group. This ensures that collective operations (allreduce, all-to-all) run at the same speed regardless of the communication pattern:

// With full bisection bandwidth:
// 8-GPU allreduce: every GPU sends to every other GPU
// Each GPU sends 7 chunks, receives 7 chunks
// Time = data_size / (900 GB/s * 7/8) = data_size / 787.5 GB/s
// (ring allreduce achieves (N-1)/N of per-GPU bandwidth)

// Without full bisection (e.g., PCIe tree):
// Cross-socket traffic bottlenecked by root complex
// 8-GPU allreduce limited by weakest link

Topology: NVL72 (72 GPUs)

The Multi-Rack Fabric

NVL72 connects 72 Blackwell B200 GPUs across a rack-scale system using NVSwitch 4.0:

// NVL72 Physical Layout:
// 36 Grace-Blackwell Superchips (each has 1 Grace CPU + 2 B200 GPUs)
// 72 B200 GPUs total
// 18 NVSwitch 4.0 ASICs
//
// Each B200 GPU has 18 NVLink 5.0 links (1,800 GB/s total)
// Each NVSwitch 4.0 has 128 NVLink 5.0 ports
//
// Connection: each GPU connects 1 link to each of the 18 switches
// 18 switches × 1 link per GPU × 100 GB/s per link = 1,800 GB/s per GPU
//
// Total fabric bandwidth:
// 72 GPUs × 1,800 GB/s = 129,600 GB/s ≈ 130 TB/s aggregate NVLink BW
ℹ️ NVL72 Is a Single-Stage Fabric

The topology is a single-stage switch fabric — every GPU connects directly to every switch, so there is no multi-hop routing. GPU-to-GPU traffic traverses exactly two links (GPU-to-switch, switch-to-GPU) and a single switch, regardless of which pair of GPUs is communicating. This single-hop fabric provides uniform latency and bandwidth between any pair.

Bandwidth Calculations

// GPU-to-GPU bandwidth in NVL72:
// GPU 0 → GPU 71: traffic can traverse any of the 18 switches
// Each switch provides 100 GB/s for this pair
// Total: 18 × 100 GB/s = 1,800 GB/s
//
// But wait — the GPU's total NVLink bandwidth is also 1,800 GB/s
// So a single GPU-to-GPU transfer saturates the source GPU's bandwidth
// This means NVL72 provides full non-blocking bandwidth:
// ANY pair can communicate at 1,800 GB/s simultaneously
// (as long as no GPU is involved in multiple transfers)

// Bisection bandwidth:
// Split 72 GPUs into two groups of 36
// Each switch has 36 links to each group; each link moves 50 GB/s per direction
// Per-switch inter-group BW: 36 × 50 GB/s = 1,800 GB/s per direction
// Total inter-group BW: 18 × 1,800 = 32,400 GB/s per direction
// Bidirectional bisection: 64,800 GB/s ≈ 65 TB/s
// The headline 130 TB/s figure is the aggregate NVLink bandwidth
// (72 × 1,800 GB/s) — exactly twice the bisection, as expected for a
// full-bandwidth fabric

Allreduce Performance on NVL72

// Ring allreduce on 72 GPUs (same simplified model as the 8-GPU case):
// A ring sustains (N-1)/N of the per-GPU bandwidth
// Effective bandwidth: 1,800 GB/s × 71/72 ≈ 1,775 GB/s
// Total time ≈ data_size / 1,775 GB/s
//
// For a 1 GB allreduce:
// Time ≈ 1 GB / 1,775 GB/s ≈ 0.56 ms
//
// Compare to DGX H100 (8 GPUs):
// Time ≈ 1 GB / (900 × 7/8) GB/s ≈ 1.27 ms
// NVL72 completes the collective in under half the time AND spans 9x more GPUs

Allreduce Latency for 1 GB Payload (ms, lower is better)

System                      | Per-GPU NVLink BW | Latency
DGX A100 (8 GPUs, NVLink 3) | 600 GB/s          | 1.9 ms
DGX H100 (8 GPUs, NVLink 4) | 900 GB/s          | 1.27 ms
NVL72 (72 GPUs, NVLink 5)   | 1,800 GB/s        | 0.56 ms

SHARP: In-Network Reduction

The Bottleneck SHARP Solves

In a standard allreduce, every GPU's contribution must reach every other GPU, and the reduction arithmetic runs on the GPUs themselves. Each GPU moves O(N) chunks across the fabric — it must see every other GPU's contribution:

// Standard ring allreduce on 72 GPUs:
// Each GPU reads 71 chunks and reduces locally
// Total data moved across fabric: 2 × (N-1)/N × data_size × N
// = 2 × 71/72 × 1 GB × 72 = ~142 GB total fabric traffic

// With SHARP in-network reduction:
// NVSwitch performs the reduction inside the switch
// Each GPU sends its chunk to the switch (1 GB × 72 = 72 GB upload)
// Switch reduces all chunks internally
// Switch sends result to each GPU (1 GB × 72 = 72 GB download)
// Total fabric traffic: 144 GB (similar) BUT:
// The switch performs reduction at line rate — no GPU compute wasted
// GPU is free to do useful work during the allreduce

How SHARP Works

Each NVSwitch 4.0 contains SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) engines that can perform arithmetic operations on data as it passes through the switch:

// SHARP operation flow:
// 1. Every GPU i (i = 0..71) sends its gradient chunk to its NVSwitch port
//    — all 72 sends happen simultaneously (non-blocking switch)
// 2. SHARP engines compute: result = chunk_0 + chunk_1 + ... + chunk_71
//    (FP16, BF16, FP32, or FP8 addition at line rate — no store-and-forward)
// 3. SHARP sends the result to all 72 GPUs (multicast)
//
// Total time ≈ data_size / 1,800 GB/s (limited by per-GPU link BW)
SHARP Halves Allreduce Time

Without SHARP, allreduce requires two passes over the data: reduce-scatter (aggregate partial results) and all-gather (distribute the result). With SHARP, a single pass suffices: the switch reduces and multicasts in one operation. For bandwidth-limited allreduce, this provides up to a 2x speedup. The practical improvement on NVL72 is closer to 1.5-1.7x due to protocol overhead.

SHARP Supported Operations

// SHARP 3.0 (NVSwitch 4.0) supports:
// - Sum reduction: FP16, BF16, FP32, FP8 (E4M3, E5M2)
// - Min/Max reduction: integer types
// - Barrier: zero-data synchronization
// - Multicast: one-to-all broadcast

// NOT supported by SHARP:
// - Custom reduction functions
// - Non-commutative operations
// - Operations on complex data types
// For those, fall back to GPU-side reduction

Unified Memory Address Space

How 72 GPUs Share One Address Space

NVL72 implements a hardware-coherent shared memory model. Every GPU can access any other GPU’s HBM using standard load/store instructions:

// On GPU 0, accessing GPU 71's memory:
// This works WITHOUT explicit cudaMemcpy:
__global__ void cross_gpu_kernel(float* remote_data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float val = remote_data[idx];  // Loads from GPU 71's HBM
    // Hardware handles the NVLink transaction transparently
}

// Setup: enable peer access
for (int i = 0; i < 72; i++) {
    for (int j = 0; j < 72; j++) {
        if (i != j) {
            cudaSetDevice(i);
            cudaDeviceEnablePeerAccess(j, 0);
        }
    }
}

// Allocate on GPU 71
cudaSetDevice(71);
float* remote_ptr;
cudaMalloc(&remote_ptr, size);

// Launch kernel on GPU 0 that accesses GPU 71's memory
cudaSetDevice(0);
cross_gpu_kernel<<<grid, block>>>(remote_ptr);

Address Translation

Each GPU has an address translation table that maps virtual addresses to physical locations (local HBM or remote GPU + remote HBM address):

// Address translation for remote access:
// 1. GPU 0 issues load to virtual address 0x7F_0000_1000
// 2. TLB lookup: this address maps to GPU 71, physical 0x0000_1000
// 3. Request sent via NVLink: {dest_gpu: 71, addr: 0x0000_1000, size: 128B}
// 4. NVSwitch routes request to GPU 71 (1-hop)
// 5. GPU 71's memory controller reads from HBM
// 6. Data returns via NVSwitch to GPU 0
// 7. GPU 0 caches the data in its local L2

// Latency: ~2-4 microseconds (round trip through NVSwitch)
// Bandwidth: up to 1,800 GB/s (full NVLink BW per GPU)

Memory Consistency Model

NVL72 uses a relaxed consistency model for remote memory access:

// Writes to remote memory are NOT immediately visible to all GPUs
// Use memory fences for ordering:
__device__ void producer(float* remote_flag, float* remote_data) {
    remote_data[0] = 42.0f;           // Write data
    __threadfence_system();           // Make the write visible to all GPUs
    remote_flag[0] = 1.0f;            // Signal that data is ready
}

__device__ void consumer(float* remote_flag, float* remote_data) {
    // Volatile read so the spin loop re-reads memory instead of a cached value
    while (((volatile float*)remote_flag)[0] != 1.0f) {}
    __threadfence_system();           // Order the flag read before the data read
    float val = remote_data[0];       // Guaranteed to see 42.0f
}
⚠️ Remote Access Latency Is 10-20x Local

While the unified address space makes remote memory accessible, it does not make it fast. A local HBM access takes ~500 cycles. A remote access through NVSwitch takes ~5,000-10,000 cycles. Kernels that perform fine-grained random access to remote memory will perform very poorly. The unified address space is most useful for coarse-grained data movement (bulk transfers, collective operations) and for simplifying programming models (no explicit copies).

Packet Structure

NVLink 5.0 uses a packet-based protocol:

// NVLink 5.0 packet format (simplified):
// Header:  8 bytes (source, destination, type, length)
// Payload: 0-256 bytes (data)
// CRC:     4 bytes (error detection)
// Total:   12-268 bytes per packet

// Packet types:
// - Read request:  header only (8+4 = 12 bytes)
// - Read response: header + data (up to 268 bytes)
// - Write:         header + data (up to 268 bytes)
// - Atomic:        header + data + old_value (read-modify-write)
// - Sync/fence:    header only

Flow Control

NVLink uses credit-based flow control to prevent buffer overflow at the receiver:

// Credit-based flow control:
// 1. Receiver advertises N credits (buffer slots) to sender
// 2. Sender can send N packets before waiting
// 3. As receiver processes packets, it returns credits
// 4. If sender runs out of credits: backpressure (stall)
//
// On NVLink 5.0:
// Credit granularity: 256 bytes (one maximum payload)
// Credits per link: 256 (typical)
// Maximum in-flight data: 256 × 256 = 64 KB per link
// With 18 links: 64 KB × 18 = 1.152 MB in-flight per GPU

Error Handling

// NVLink error handling:
// 1. CRC check on every packet
// 2. If CRC fails: packet is retransmitted (automatic)
// 3. Retry latency: ~100 ns (minimal impact on throughput)
// 4. If retry fails (e.g., link degradation): link is marked degraded
// 5. Traffic is rerouted through remaining links
// 6. GPU continues operating at reduced bandwidth
//
// NVLink link failure in NVL72:
// Each GPU has 18 links. Losing 1 link reduces BW by 1/18 ≈ 5.6%
// Losing a link to one specific switch: that switch pair loses 1 rail
// Other 17 switches still provide 94.4% of bandwidth

Practical Programming for NVSwitch Fabrics

NCCL on NVL72

NCCL (NVIDIA Collective Communication Library) is the primary API for multi-GPU collective operations:

#include <nccl.h>

// Initialize NCCL for 72 GPUs
ncclComm_t comms[72];
int devices[72];
for (int i = 0; i < 72; i++) devices[i] = i;
ncclCommInitAll(comms, 72, devices);

// Allreduce across all 72 GPUs
for (int i = 0; i < 72; i++) {
    cudaSetDevice(i);
    ncclAllReduce(sendbuff[i], recvbuff[i], count,
                  ncclFloat, ncclSum, comms[i], streams[i]);
}

// NCCL automatically:
// 1. Detects the NVSwitch topology
// 2. Selects the optimal algorithm (tree, ring, or SHARP)
// 3. Splits data across multiple NVLink rails
// 4. Pipelines sends and receives for maximum bandwidth

Topology-Aware Placement

# Query NVLink topology
nvidia-smi topo -m
# GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7
# GPU0   X    NV18  NV18  NV18  NV18  NV18  NV18  NV18
# GPU1  NV18   X    NV18  NV18  NV18  NV18  NV18  NV18
# ...
# NV18 = Connected via 18 NVLinks (through NVSwitch)
# All pairs show NV18 → full mesh topology via NVSwitch

# On NVL72, the output shows NV18 for all 72×72 pairs
# (excluding diagonal)

NVSwitch vs InfiniBand: When to Use Which


NVSwitch vs InfiniBand for GPU Communication

Metric               | NVSwitch 4.0 (NVL72)     | InfiniBand NDR 400G   | Ratio
Per-GPU bandwidth    | 1,800 GB/s               | 50 GB/s (single port) | 36x
Latency (GPU-to-GPU) | ~2-4 us                  | ~2-5 us               | ~1x
Scale                | 72 GPUs (single fabric)  | 1000s of GPUs         | IB wins at scale
In-network compute   | SHARP (sum, min/max)     | SHARP (sum, min/max)  | Comparable
Memory model         | Unified address space    | RDMA (explicit)       | NVSwitch simpler
Cost per GPU         | Included in NVL72 system | ~$5K per HCA + switch | NVSwitch bundled
Power per GPU        | ~30W (switch share)      | ~15W (HCA)            | ~2x
Note: NVSwitch provides 36x more bandwidth than InfiniBand within a node/rack. InfiniBand connects multiple NVL72 racks into a cluster. Both are used together in large-scale systems.

The answer is: both. NVL72 provides the intra-rack fabric (72 GPUs, 1,800 GB/s per GPU). InfiniBand (or RoCE) connects multiple NVL72 racks into a cluster (thousands of GPUs, 50-400 GB/s per GPU). The application uses NCCL for both layers — NCCL selects NVSwitch for intra-rack communication and InfiniBand for inter-rack communication, transparently.

Summary

NVSwitch transforms the GPU interconnect from a bottleneck into an enabler. The progression from 8-GPU DGX nodes to 72-GPU NVL72 racks represents a shift from “multiple GPUs connected by a bus” to “a single distributed compute engine with unified memory.” The key metrics: 1,800 GB/s per GPU (double Hopper’s 900 GB/s), 130 TB/s of aggregate NVLink bandwidth, single-switch-hop routing through NVSwitch 4.0, and SHARP in-network reduction that offloads collective operations from GPU compute.

For model training, NVL72 means that tensor parallelism — which requires allreduce at every layer and is bandwidth-sensitive — can now scale to 72 GPUs without significant communication overhead. A model that previously required 8-way tensor parallelism on H100 (one DGX node) can now use 72-way tensor parallelism on NVL72, keeping each GPU’s per-layer computation small enough to remain compute-bound. This is the architectural enabler for trillion-parameter dense models: not just more GPUs, but more GPUs connected at sufficient bandwidth to act as one.