Part of series: GPU Hardware & AI Accelerators (16 of 30)

The Grace Hopper Superchip (GH200) packages an NVIDIA Grace ARM CPU and an H100 GPU on a single module, connected by NVLink Chip-to-Chip (NVLink-C2C) at 900 GB/s total bidirectional bandwidth (450 GB/s per direction). For comparison, PCIe Gen5 x16 provides about 64 GB/s per direction — roughly a 7x deficit. In a DGX H100 system, where discrete H100 GPUs connect to Intel Xeon CPUs over PCIe, every CPU-GPU transfer is squeezed through that narrower link.

The bandwidth change is significant, but the real innovation is the memory model. Grace Hopper provides a hardware-coherent unified memory system in which the CPU’s 480 GB of LPDDR5X and the GPU’s 96 GB of HBM3 appear as a single address space. The GPU can access CPU memory at NVLink-C2C bandwidth without explicit copies, and the CPU can access GPU memory the same way. Page migration between CPU and GPU memory happens in hardware, transparently, based on access patterns. For LLM inference with models too large for HBM alone (70B+ parameters at FP16), the GPU can page through weights stored in CPU memory directly — slower than HBM, but several times faster than it would be over PCIe.

This post covers the Grace CPU architecture, NVLink-C2C, the unified memory model, the memory map, programming model, and performance analysis for workloads that benefit from the expanded memory pool.

Grace CPU Architecture

ARM Neoverse V2 Cores

The Grace CPU is NVIDIA’s first data center CPU. It uses 72 ARM Neoverse V2 cores:


Grace CPU Specifications

Specification      | Grace CPU                | AMD EPYC 9654 (Genoa) | Intel Xeon w9-3495X
Architecture       | ARM Neoverse V2          | Zen 4                 | Sapphire Rapids
Cores              | 72                       | 96                    | 56
Frequency (boost)  | 3.4 GHz                  | 3.7 GHz               | 4.8 GHz
Memory             | LPDDR5X, 480 GB          | DDR5, up to 1.5 TB    | DDR5, up to 4 TB
Memory bandwidth   | 546 GB/s                 | 461 GB/s              | 307 GB/s
L3 cache           | 117 MB                   | 384 MB                | 105 MB
TDP                | 250 W                    | 360 W                 | 350 W
PCIe lanes         | PCIe 5.0 for I/O only (GPU attaches via NVLink-C2C) | 128 PCIe 5.0 | 112 PCIe 5.0
Note: Grace uses LPDDR5X (soldered, low power) instead of standard DDR5 DIMMs. Higher bandwidth per watt, but fixed capacity at manufacturing.

LPDDR5X Memory

Grace uses LPDDR5X (Low Power DDR5X) — the same memory technology used in smartphones and laptops, but with server-grade capacity:

// Grace LPDDR5X configuration:
// 480 GB total (fixed at manufacturing — not expandable)
// 546 GB/s bandwidth (higher than DDR5 at similar power)
// 16 memory channels
// Power: ~40W for memory subsystem (vs ~60W for equivalent DDR5)
//
// Comparison to DDR5:
// DDR5-5600 8-channel: 358 GB/s, expandable to TBs
// LPDDR5X 16-channel: 546 GB/s, fixed at 480 GB
// LPDDR5X wins on bandwidth/watt, DDR5 wins on capacity flexibility

NVLink-C2C: The Physical Interface

NVLink-C2C is a high-bandwidth die-to-die interconnect. Unlike standard NVLink (which connects GPUs through cables or PCB traces), NVLink-C2C connects two dies on the same module substrate:

// NVLink-C2C specifications:
// Bandwidth: 900 GB/s total bidirectional (450 GB/s per direction)
// Latency: ~100 ns (much lower than PCIe's ~1-2 us)
// Physical: short traces on the module substrate (~10 mm)
// Signaling: PAM4, similar to NVLink 4.0
// Power: ~5 pJ/bit (very efficient due to the short reach)
//
// For context:
// PCIe Gen5 x16: ~64 GB/s per direction (~128 GB/s total)
// NVLink-C2C: 900 GB/s total (450 GB/s per direction)
// Ratio: ~7x per direction
//
// PCIe latency: ~1-2 microseconds (through the root complex)
// NVLink-C2C latency: ~100 nanoseconds (direct die-to-die)
// Ratio: 10-20x lower latency

CPU-GPU Interconnect Bandwidth Comparison (GB/s)

PCIe Gen4 x16: 32 GB/s per direction — A100 DGX
PCIe Gen5 x16: 64 GB/s per direction — H100 DGX
CXL 2.0 x16: 64 GB/s per direction — same PHY as PCIe 5
NVLink-C2C (Grace Hopper): 900 GB/s total — ~7x PCIe Gen5
H100 HBM3: 3,350 GB/s — for reference

The roughly 7x bandwidth improvement over PCIe fundamentally changes which workloads are feasible:

// Scenario: LLM inference, model weights in CPU memory
//
// With PCIe Gen5 (DGX H100):
// Loading 70B FP16 model weights: 140 GB
// Time to stream once: 140 GB / 64 GB/s = 2.19 seconds
// This read happens EVERY decode step if weights don't fit in HBM
// Throughput: limited by the 64 GB/s PCIe link
//
// With NVLink-C2C (Grace Hopper):
// Same 140 GB read at 450 GB/s (one direction): 0.31 seconds
// ~7x faster weight access from CPU memory
// (in practice capped at ~450-500 GB/s by the LPDDR5X side, see below)

Unified Memory Architecture

Hardware Coherence

Grace Hopper implements full hardware cache coherence between CPU and GPU. This means:

  1. Both CPU and GPU see the same data at the same address — no stale copies.
  2. When the GPU modifies data, the CPU’s cache is invalidated automatically (and vice versa).
  3. No explicit cache flush or invalidate operations needed in software.

// On Grace Hopper, unified memory works WITHOUT explicit copies:

// Allocate managed memory (accessible by both CPU and GPU)
float* data;
cudaMallocManaged(&data, size);

// CPU writes data
for (int i = 0; i < n; i++) {
    data[i] = compute_on_cpu(i);  // Data in LPDDR5X
}

// GPU reads data — NO cudaMemcpy needed
// Hardware coherence ensures GPU sees CPU's writes
gpu_kernel<<<grid, block>>>(data, n);

// GPU writes results — CPU sees them immediately after sync
cudaDeviceSynchronize();
float result = data[0];  // CPU reads GPU's output, coherent

ℹ️ Coherence Does Not Mean Equal Performance

While the CPU and GPU share a coherent view of memory, access performance depends on where data physically resides (the figures below are link-level ceilings; sustained rates are lower). GPU accessing data in HBM: 3,350 GB/s. GPU accessing data in LPDDR5X (via NVLink-C2C): up to 900 GB/s. CPU accessing data in LPDDR5X: 546 GB/s. CPU accessing data in HBM (via NVLink-C2C): up to 900 GB/s, but with higher latency. The coherence protocol ensures correctness, not performance parity.

Memory Map

Grace Hopper presents a unified physical address space:

// GH200 Memory Map (simplified):
// 0x0000_0000_0000 - 0x0077_FFFF_FFFF : LPDDR5X (480 GB, CPU-attached)
// 0x0078_0000_0000 - 0x008F_FFFF_FFFF : HBM3 (96 GB, GPU-attached)
//
// CUDA allocations:
// cudaMalloc()        → Allocates in HBM3 (GPU-local, fastest for GPU)
// cudaMallocManaged() → Allocates in either, migrates on access
// cudaMallocHost()    → Allocates in LPDDR5X (pinned, accessible by both)
//
// The GPU can load/store to any address in the unified space
// The CPU can load/store to any address in the unified space
// NVLink-C2C handles cross-chip accesses transparently

Page Migration

Grace Hopper supports hardware-assisted page migration. When the GPU frequently accesses a page in LPDDR5X, the hardware can migrate it to HBM for faster access:

// Control page migration policy
cudaMemAdvise(data, size, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
// Pages will reside in LPDDR5X unless GPU access triggers migration

cudaMemAdvise(data, size, cudaMemAdviseSetPreferredLocation, 0);
// Pages will reside in HBM (GPU device 0)

cudaMemAdvise(data, size, cudaMemAdviseSetAccessedBy, 0);
// Hint: GPU device 0 will access this data
// Driver may create page table mappings proactively

// Prefetch to GPU memory
cudaMemPrefetchAsync(data, size, 0, stream);
// Explicitly migrate pages to GPU HBM before kernel launch
// This avoids page fault overhead during kernel execution

⚠️ Page Faults on First Access Are Expensive

If a GPU kernel accesses a page in LPDDR5X for the first time without prefetching, it triggers a page fault. The fault handler migrates the page to HBM (if preferred location is GPU) or creates a mapping to LPDDR5X. Either way, the first access to an unmapped page takes ~10-50 microseconds. Always prefetch data before kernel launch for latency-sensitive workloads.

Programming Model Changes

The Oversubscription Pattern

Grace Hopper’s key programming pattern: allocate more memory than HBM can hold, let the system manage data placement:

// 70B parameter LLM inference on Grace Hopper
// Model weights: 140 GB (FP16)
// KV cache: 32 GB (for batch of 64, 8K context)
// Activations: 4 GB
// Total: 176 GB — exceeds 96 GB HBM

// Strategy 1: Manual placement
// Pin weights in LPDDR5X, keep KV cache and activations in HBM
float* weights;
float* kv_cache;
float* activations;

cudaMallocHost(&weights, 140ULL * 1024 * 1024 * 1024);
cudaMalloc(&kv_cache, 32ULL * 1024 * 1024 * 1024);
cudaMalloc(&activations, 4ULL * 1024 * 1024 * 1024);

// GPU accesses weights via NVLink-C2C at 900 GB/s
// GPU accesses KV cache at HBM speed (3,350 GB/s)
// This is 14x faster than PCIe, enabling viable throughput

// Strategy 2: Managed memory with hints
cudaMallocManaged(&weights, 140ULL * 1024 * 1024 * 1024);
cudaMemAdvise(weights, 140ULL * 1024 * 1024 * 1024,
              cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
cudaMemAdvise(weights, 140ULL * 1024 * 1024 * 1024,
              cudaMemAdviseSetAccessedBy, 0);
// Weights stay in LPDDR5X, GPU access via NVLink-C2C
// No page migration (weights are read-only, no benefit to moving)

Prefill vs Decode Phase Optimization

// LLM inference on Grace Hopper: different strategies per phase

// PREFILL phase: compute-bound, processes the full prompt
// Weight access pattern: each weight tensor read once per layer
// Weight bandwidth needed: model_size / prefill_time
// For 70B, prefill 4096 tokens: ~200ms → 140 GB / 0.2s = 700 GB/s
// This exceeds the ~450-500 GB/s practical LPDDR5X read rate, but
// prefill overlaps weight reads with heavy compute, so serving weights
// from LPDDR5X costs only a modest slowdown (~85% of HBM-only, see below)

// DECODE phase: memory-bound, generates one token at a time
// Weight bandwidth needed: model_size / decode_time_per_token
// For 70B, batch 1: ~30ms → 140 GB / 0.03s = 4,667 GB/s
// 4,667 GB/s >> 900 GB/s link ceiling → weights from LPDDR5X are TOO SLOW
// Batching amortizes the weight read across requests: batch 8 at ~0.3s
// per step → 140 GB / 0.3s ≈ 467 GB/s ✓ (inside the practical limit)

// Solution: pin as many layers' weights in HBM as fit, stream the rest
// First 10 layers: resident in HBM
// Remaining 70 layers: LPDDR5X, prefetched one layer ahead
// (load layer N+1's weights while computing layer N)

70B LLM Inference: Grace Hopper vs DGX H100 (Single GPU)

Metric              | Grace Hopper (GH200)      | DGX H100 (PCIe)        | Advantage
Total memory        | 576 GB (480 + 96)         | 80 GB HBM only         | 7.2x
Weight placement    | LPDDR5X (900 GB/s link)   | Must shard across GPUs | Single GPU viable
KV cache location   | HBM (3,350 GB/s)          | HBM (3,350 GB/s)       | Equal
Prefill throughput  | ~85% of HBM-only          | N/A (can't fit model)  | GH200 feasible
Decode (batch 1)    | ~27% of HBM-only          | N/A                    | Slow, but functional
Decode (batch 32)   | ~78% of HBM-only          | N/A                    | Viable for serving
GPU count needed    | 1                         | 2-4 (tensor parallel)  | 1/2 to 1/4 the GPUs
Note: Grace Hopper trades per-GPU throughput for fewer GPUs. For many inference workloads, using 1 GH200 instead of 2 H100s reduces total cost and eliminates tensor parallel communication overhead.

Performance Analysis

Bandwidth Hierarchy

// Grace Hopper bandwidth hierarchy:
// 1. GPU HBM3: 3,350 GB/s (fastest, 96 GB)
// 2. NVLink-C2C: 900 GB/s (cross-chip, any address)
// 3. CPU LPDDR5X: 546 GB/s (CPU-local bandwidth)
// 4. NVLink inter-GPU: 900 GB/s (to other GH200 modules)
//
// Key insight: the GPU's path to LPDDR5X crosses two segments:
// NVLink-C2C (900 GB/s) and the Grace memory controller (546 GB/s)
// A multi-hop path runs at its slowest segment, so the bottleneck is
// min(900 GB/s NVLink-C2C, 546 GB/s LPDDR5X) = 546 GB/s
// In practice, sustained GPU reads of LPDDR5X land closer to 450-500 GB/s

Latency Comparison

// GPU kernel accessing different memory locations:
// HBM3 (local):     ~500 cycles = ~275 ns at 1.83 GHz
// LPDDR5X via C2C:  ~2000-3000 cycles = ~1.1-1.6 us
// Remote HBM via NVLink: ~5000-10000 cycles = ~2.7-5.5 us
//
// CPU accessing:
// LPDDR5X (local):  ~80-100 ns
// HBM via C2C:      ~200-400 ns

Microbenchmark: Memory Bandwidth Probe

__global__ void bandwidth_probe(const float4* __restrict__ data,
                                float* __restrict__ output,
                                size_t elements, int iterations) {
    size_t tid = blockIdx.x * blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    float4 sum = make_float4(0, 0, 0, 0);

    for (int iter = 0; iter < iterations; iter++) {
        for (size_t i = tid; i < elements; i += stride) {
            float4 val = data[i];
            sum.x += val.x;
            sum.y += val.y;
            sum.z += val.z;
            sum.w += val.w;
        }
    }
    // Fold the sum into a value the compiler cannot prove unused.
    // Without this, dead-code elimination can strip the loads for every
    // thread except tid 0, invalidating the measurement.
    float total = sum.x + sum.y + sum.z + sum.w;
    if (total == -1.0f) output[tid] = total;  // never true in practice
    if (tid == 0) output[0] = total;
}

// Test 1: data in HBM (cudaMalloc)
// Expected: ~3,200 GB/s

// Test 2: data in LPDDR5X (cudaMallocHost, accessed from GPU)
// Expected: ~450-500 GB/s (NVLink-C2C bottleneck)

// Test 3: data managed, preferred on CPU (cudaMallocManaged + hint)
// Expected: ~400-450 GB/s (managed overhead)

GPU Read Bandwidth by Memory Location (GH200)

HBM3 (cudaMalloc): 3,200 GB/s — peak GPU memory
LPDDR5X via C2C (cudaMallocHost): 480 GB/s — NVLink-C2C path
LPDDR5X managed (cudaMallocManaged): 440 GB/s — managed overhead
LPDDR5X first-touch (no prefetch): 50 GB/s — page-fault dominated

Grace Hopper in Multi-GPU Systems

GH200 NVL32

NVIDIA offers a multi-node configuration called GH200 NVL32 that connects 32 GH200 Superchips using NVLink:

// GH200 NVL32 configuration:
// 32 Grace Hopper Superchips
// 32 H100 GPUs + 32 Grace CPUs
// GPU-to-GPU NVLink: 900 GB/s per GPU (via NVSwitch 3.0)
// Total GPU memory: 32 × 96 GB HBM + 32 × 480 GB LPDDR5X
//                 = 3,072 GB HBM + 15,360 GB LPDDR5X = 18.4 TB
//
// Each GPU can access:
// - Its own 96 GB HBM at 3,350 GB/s
// - Its own CPU's 480 GB LPDDR5X at ~480 GB/s (via C2C)
// - Other GPUs' HBM at 900 GB/s (via NVLink)
// - Other CPUs' LPDDR5X at complex path (C2C + NVLink)

Programming Considerations

// Multi-GH200 programming:
// 1. Use NCCL for GPU-to-GPU collective operations (same as DGX)
// 2. Use NVLink-C2C for CPU-GPU data movement (no PCIe bottleneck)
// 3. Place model weights in LPDDR5X, KV cache in HBM

// Example: 180B model on 4x GH200
// Total HBM: 384 GB (4 × 96 GB)
// Total LPDDR5X: 1.92 TB (4 × 480 GB)
// Model weights (FP16): 360 GB → fits in LPDDR5X of 1 node
//   But distributed across 4 nodes for parallel access
//   90 GB weights per GPU (in LPDDR5X)
// KV cache: fits in HBM (varies with batch/context)
// Activations: in HBM

// Tensor parallelism across 4 GPUs via NVLink
// Weight loading from LPDDR5X via NVLink-C2C (per-GPU independent)
// No PCIe-class CPU-GPU bottleneck (weights stream at NVLink-C2C rates)

When Grace Hopper Wins and When It Does Not

Grace Hopper Wins

  1. Models that exceed single-GPU HBM but fit in HBM + LPDDR5X: 70B-180B at FP16.
  2. Throughput-oriented inference with large batches: the higher batch size amortizes the lower weight-read bandwidth.
  3. CPU preprocessing + GPU inference pipelines: tokenization, sampling, and post-processing on Grace CPU with zero-copy handoff to H100 GPU.
  4. Power-constrained deployments: Grace’s ARM cores and LPDDR5X consume less power than equivalent x86 + DDR5.

Grace Hopper Does Not Win

  1. Latency-sensitive single-request inference: At batch size 1, weight-read bandwidth from LPDDR5X (480 GB/s) is 7x slower than HBM (3,350 GB/s). An H100 with the model fitting in HBM is faster.
  2. Training: Training requires reading and writing weights, gradients, and optimizer states repeatedly. The lower LPDDR5X bandwidth becomes a bottleneck.
  3. Models that fit entirely in HBM: If the model fits in 80 GB, Grace Hopper’s LPDDR5X provides no benefit — and the system costs more than a standalone H100.

The Batch Size Breakeven Point

For a 70B FP16 model on Grace Hopper, the decode phase needs sufficient batch size to amortize weight reads from LPDDR5X. The breakeven point where GH200 matches H100 throughput is approximately batch size 8-16. Below that, H100 wins on per-token latency. Above that, GH200 wins on total cost because it eliminates the need for multi-GPU tensor parallelism.

Summary

Grace Hopper fundamentally changes the memory architecture of GPU computing by replacing the PCIe bottleneck with NVLink-C2C. The roughly 7x increase in bandwidth, along with an order-of-magnitude drop in latency, means that CPU memory is no longer an afterthought — it is a viable extension of the GPU’s working set. For LLM inference, this enables single-GPU deployment of models that previously required 2-4 GPUs, reducing cost and eliminating tensor parallel communication overhead.

The unified memory model with hardware coherence simplifies programming: cudaMallocManaged works correctly without explicit copies, and data placement hints control where pages physically reside. The key decision for each data structure is where it should live: HBM for latency-sensitive, frequently-accessed data (KV cache, activations), LPDDR5X for large, read-mostly data (model weights), and managed memory for data with unpredictable access patterns.

Grace Hopper is not universally better than discrete GPUs. It trades per-GPU bandwidth for expanded capacity and lower system cost. The right workloads — large-model inference with sufficient batching — benefit significantly. The wrong workloads — latency-sensitive single-request inference, training, or models that fit in HBM — see no benefit or perform worse.