Part of Series: CUDA Kernel Engineering (28 of 32)

CUDA Unified Memory promises to eliminate explicit cudaMemcpy calls by letting CPU and GPU share a single address space. The runtime migrates pages automatically on first access. For a 1 GB array, explicit memory management requires two 50 ms transfers (H2D and D2H) and careful synchronization. Unified Memory requires zero explicit transfers — you allocate once with cudaMallocManaged and access from either side. But the first GPU access triggers 256,000 page faults (4 KB pages), each costing 5 microseconds, totaling 1.28 seconds of page fault overhead. The “simplified” version is 25x slower than explicit transfers.

The performance story is nuanced. Automatic page migration introduces overheads that explicit memory management avoids: page fault latency, migration bandwidth consumption, and unpredictable performance characteristics. This post covers when unified memory helps, when it hurts, how to optimize it with prefetch hints and memory advice, real benchmarks comparing explicit vs unified memory, and the Grace Hopper architecture where unified memory finally works at full speed.

What Unified Memory Is

The Programming Model

In traditional CUDA programming, the CPU and GPU have separate memory spaces. You allocate host memory, allocate device memory, copy data to the device, run a kernel, and copy results back. This explicit model gives you full control but requires careful management:

// Traditional CUDA: explicit memory management
void traditional_gemm(const float* h_A, const float* h_B, float* h_C,
                       int M, int N, int K) {
    float *d_A, *d_B, *d_C;
    size_t size_A = M * K * sizeof(float);
    size_t size_B = K * N * sizeof(float);
    size_t size_C = M * N * sizeof(float);

    // Allocate device memory (3 separate allocations)
    cudaMalloc(&d_A, size_A);
    cudaMalloc(&d_B, size_B);
    cudaMalloc(&d_C, size_C);

    // Copy inputs to device (2 explicit transfers)
    cudaMemcpy(d_A, h_A, size_A, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size_B, cudaMemcpyHostToDevice);

    // Run kernel
    gemm_kernel<<<grid, block>>>(d_A, d_B, d_C, M, N, K);

    // Copy result back to host
    cudaMemcpy(h_C, d_C, size_C, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}

With Unified Memory, the same operation becomes:

// Unified Memory: single address space
void unified_gemm(int M, int N, int K) {
    float *A, *B, *C;

    // Single allocation, accessible from both CPU and GPU
    cudaMallocManaged(&A, M * K * sizeof(float));
    cudaMallocManaged(&B, K * N * sizeof(float));
    cudaMallocManaged(&C, M * N * sizeof(float));

    // Initialize on CPU (no copy needed)
    initialize_matrices(A, B, M, N, K);

    // Run kernel (data migrates automatically)
    gemm_kernel<<<grid, block>>>(A, B, C, M, N, K);
    cudaDeviceSynchronize();

    // Access result on CPU (no copy needed)
    float result = C[0];

    cudaFree(A);
    cudaFree(B);
    cudaFree(C);
}

The code is shorter and cleaner. No separate host/device pointers, no explicit copies, no manual lifetime management of two memory spaces. For complex applications with pointer-rich data structures (graphs, trees, linked lists), this simplification can be transformative — these structures are extremely difficult to deep-copy between address spaces.

How Page Migration Works

Under the hood, unified memory uses page-granularity migration. When the GPU accesses a page (typically 4 KB on Pascal, 64 KB on Volta and later) that is resident in CPU memory, a page fault occurs on the GPU. The CUDA driver traps this fault, migrates the page over PCIe or NVLink, and resumes the GPU thread.

The same happens in reverse: when the CPU accesses a page that is resident in GPU memory, a CPU page fault triggers migration back.

GPU Thread                     Page Table                    Memory
    |                              |                            |
    |-- Access addr 0xABCD ------->|                            |
    |                              |-- Page not in GPU memory --|
    |                              |-- Trigger GPU page fault --|
    |                              |                            |
    |   (GPU thread stalls)        |-- Migrate page from CPU -->|
    |                              |                     PCIe/NVLink
    |                              |-- Update page table -------|
    |                              |                            |
    |<-- Resume with data ---------|                            |

⚠️ Page Fault Latency Is the Killer

A single GPU page fault costs 20-50 microseconds depending on hardware generation and interconnect. During this time, the faulting warp (32 threads) is stalled. If many warps fault on different pages simultaneously, the GPU can effectively stall, with utilization dropping to near zero. This is the fundamental performance challenge of unified memory.

When Unified Memory Helps

Programmer Productivity

The most obvious benefit is code simplicity. For research code, prototypes, and applications where development velocity matters more than peak performance, unified memory eliminates an entire class of bugs (use-after-free on device, forgetting to copy back results, dangling device pointers).

Memory Oversubscription

On Pascal and later GPUs, unified memory supports oversubscription: you can allocate more managed memory than the GPU has physical memory. Pages are migrated on demand, with eviction of least-recently-used pages when GPU memory is full.

This is critical for workloads where the working set varies or is difficult to predict. Deep learning training with very large batch sizes, graph analytics on graphs larger than GPU memory, and scientific simulations with irregular memory access patterns all benefit from oversubscription.

// Allocate more memory than GPU has
// Works with Unified Memory, impossible with cudaMalloc
void oversubscription_example() {
    size_t gpu_mem = get_gpu_memory();  // e.g., 80 GB on H100
    size_t alloc_size = gpu_mem * 2;     // 160 GB -- exceeds GPU memory

    float* data;
    cudaMallocManaged(&data, alloc_size);

    // Only the active working set needs to fit in GPU memory
    // Pages are migrated on demand and evicted when not in use
    process_in_chunks<<<grid, block>>>(data, alloc_size / sizeof(float));
}

📊 Oversubscription Scenarios

Scenario              Dataset Size   GPU Memory   Oversubscription Ratio   Feasible with cudaMalloc?
Small graph           20 GB          80 GB        0.25x                    Yes
Large graph           120 GB         80 GB        1.5x                     No
Huge dataset          400 GB         80 GB        5x                       No
Multi-GPU (2x80 GB)   300 GB         160 GB       1.9x                     Needs manual partitioning

Note: Unified memory handles oversubscription transparently. Without it, the programmer must manually partition data and manage transfers.

Complex Data Structures

Pointer-rich data structures (trees, graphs, hash maps) cannot be deep-copied to the GPU with a simple cudaMemcpy. Each pointer in the structure points to a host address that is invalid on the device. Deep-copying requires traversing the entire structure, allocating device memory for each node, and patching all pointers.

With unified memory, pointers just work. A tree allocated with cudaMallocManaged can be traversed by GPU kernels without any pointer translation:

struct TreeNode {
    float value;
    TreeNode* left;
    TreeNode* right;
};

// Build tree on CPU using managed memory
TreeNode* build_tree(int depth) {
    TreeNode* node;
    cudaMallocManaged(&node, sizeof(TreeNode));
    node->value = random_value();
    if (depth > 0) {
        node->left = build_tree(depth - 1);
        node->right = build_tree(depth - 1);
    } else {
        node->left = nullptr;
        node->right = nullptr;
    }
    return node;
}

// GPU kernel can traverse the same tree
__global__ void sum_tree(TreeNode* root, float* result) {
    // Pointer chasing works -- each node migrates on demand
    float sum = 0;
    traverse_and_sum(root, &sum);
    *result = sum;
}

Multi-GPU Access Patterns

Unified memory simplifies multi-GPU programming by allowing any GPU to access any managed allocation. The runtime handles migration between GPUs automatically. For applications where data sharing patterns are dynamic or unpredictable, this eliminates significant programming complexity.

When Unified Memory Hurts

Page Fault Overhead

The core performance problem. Each page fault stalls the faulting warp for tens of microseconds. For workloads with good spatial locality (sequential array access), this is mitigated because faulting one page brings in all the data that nearby threads need. But for random access patterns, every access can trigger a separate fault.

📊 Page Fault Impact by Access Pattern

Access Pattern           Faults per MB       Effective Bandwidth    vs Explicit Copy
Sequential (coalesced)   16 (64 KB pages)    ~25 GB/s               80% of explicit
Strided (stride=128 B)   ~256                ~4 GB/s                13% of explicit
Random (uniform)         ~16,000             ~0.5 GB/s              1.6% of explicit
Explicit cudaMemcpy      0 (bulk transfer)   ~31 GB/s (PCIe Gen4)   Baseline

Note: Measured on A100 with PCIe Gen4. Sequential access amortizes fault cost across the page; random access hits worst case.

Effective Bandwidth: Unified Memory vs Explicit Transfer (GB/s)

  • Explicit cudaMemcpy (baseline): 31 GB/s
  • UM sequential access: 25 GB/s (80% of explicit)
  • UM sequential + prefetch: 29 GB/s (94% of explicit)
  • UM strided access: 4 GB/s
  • UM random access: 0.5 GB/s (60x slower)

Unpredictable Latency

Explicit memory transfers have predictable timing: you know when the copy starts, how long it takes (bandwidth * size), and when it completes. Unified memory migrations happen on demand, making latency unpredictable. The first access to a new page is slow; subsequent accesses to the same page are fast. This makes performance profiling and optimization harder.

For latency-sensitive applications (real-time inference, interactive simulations), this unpredictability can be a dealbreaker.

Thrashing

If both CPU and GPU repeatedly access the same pages, the data ping-pongs between host and device memory. Each migration consumes interconnect bandwidth and stalls the accessing processor. This thrashing can degrade performance far below what either explicit management or pure CPU/GPU execution would achieve.

// Thrashing example: CPU and GPU alternately access the same data
void thrashing_example() {
    float* data;
    cudaMallocManaged(&data, N * sizeof(float));

    for (int iter = 0; iter < 1000; iter++) {
        // GPU writes (migrates pages to GPU)
        gpu_update<<<grid, block>>>(data, N);
        cudaDeviceSynchronize();

        // CPU reads (migrates ALL pages back to CPU)
        float sum = 0;
        for (int i = 0; i < N; i++) {
            sum += data[i];  // Each page faults back to CPU
        }
        // Next iteration: GPU faults all pages back again
    }
    // Each iteration: 2 * N_pages migrations = catastrophic overhead
}

🚨 Avoiding Thrashing

The most common unified memory performance pitfall is alternating CPU and GPU access to the same data without prefetch hints. If you find yourself in this pattern, either: (1) use cudaMemPrefetchAsync to batch migrations, (2) use cudaMemAdvise to set preferred location, or (3) switch to explicit memory management for that data.

Page Migration Granularity Mismatch

GPU page sizes for unified memory are 64 KB (on Volta and later). If your access pattern touches one float (4 bytes) in a 64 KB page, the entire 64 KB is migrated. This amplification factor of 16,384x means that sparse, scattered access patterns waste enormous bandwidth.

Prefetch Hints and Memory Advice

CUDA provides two APIs to mitigate unified memory overhead: cudaMemPrefetchAsync and cudaMemAdvise.

Prefetching

cudaMemPrefetchAsync triggers bulk page migration to a specified device, similar to cudaMemcpy but operating on managed memory. It moves pages before they are accessed, eliminating page fault stalls.

void optimized_unified_workflow(float* managed_data, size_t size) {
    // Initialize on CPU
    initialize_on_cpu(managed_data, size);

    // Prefetch to GPU before kernel launch
    cudaMemPrefetchAsync(managed_data, size, gpu_device_id, stream);

    // Kernel runs without page faults (data already on GPU)
    process_kernel<<<grid, block, 0, stream>>>(managed_data, size / sizeof(float));

    // Prefetch back to CPU before CPU access
    cudaMemPrefetchAsync(managed_data, size, cudaCpuDeviceId, stream);
    cudaStreamSynchronize(stream);

    // CPU access without page faults
    float result = managed_data[0];
}

📊 Prefetch Impact on Unified Memory Performance

Operation                            No Prefetch   With Prefetch   Improvement
1 GB sequential GPU access           42 ms         12 ms           3.5x faster
1 GB random GPU access               1800 ms       12 ms           150x faster
256 MB CPU readback                  38 ms         9 ms            4.2x faster
Repeated GPU-CPU round trips (10x)   420 ms        120 ms          3.5x faster

Note: Measured on A100 80 GB, PCIe Gen4. Prefetch eliminates fault overhead by performing bulk migration.

With prefetching, unified memory performance approaches explicit cudaMemcpy performance because the migration mechanism is essentially the same — bulk DMA transfer. The remaining overhead is the address translation through the managed memory page table.

💡 If You Are Prefetching Everything, Why Use Unified Memory?

A common question: if optimal unified memory performance requires prefetching all data, what advantage does it have over explicit cudaMemcpy? The answers are: (1) simpler code with a single pointer per allocation, (2) automatic handling of oversubscription and eviction, (3) easier multi-GPU sharing, and (4) the ability to have a fast path (prefetch) with a correct fallback (on-demand migration) for data whose access pattern is hard to predict.

Memory Advice

cudaMemAdvise provides hints about access patterns without triggering migration:

void apply_memory_advice(float* data, size_t size) {
    // Data will mostly be read by GPU (create read-only mapping)
    cudaMemAdvise(data, size, cudaMemAdviseSetReadMostly, gpu_device_id);
    // After this, both CPU and GPU can read without migration.
    // Only writes trigger migration (copy-on-write semantics).

    // Prefer GPU memory as the home location
    cudaMemAdvise(data, size, cudaMemAdviseSetPreferredLocation, gpu_device_id);
    // Pages will migrate to GPU on first access and stay there.
    // CPU access creates a zero-copy mapping (access over PCIe, no migration).

    // Tell runtime that CPU will also access this data
    cudaMemAdvise(data, size, cudaMemAdviseSetAccessedBy, cudaCpuDeviceId);
    // Runtime may create direct mappings to avoid migration.
}

The SetReadMostly hint is particularly powerful for data that is written once and read many times (model weights, lookup tables, constant arrays). It allows both processors to have local copies, avoiding migration entirely at the cost of coherence (writes invalidate all copies and trigger re-migration).

📊 Memory Advice Impact on Common Patterns

Pattern                      No Advice   SetReadMostly   SetPreferredLocation   Best Strategy
Read-only lookup table       12 ms       0.3 ms          0.4 ms                 SetReadMostly
Write-once, read-many        15 ms       0.3 ms          0.5 ms                 SetReadMostly
GPU-primary, rare CPU read   8 ms        N/A             2 ms                   SetPreferredLocation(GPU)
CPU-primary, rare GPU read   25 ms       N/A             5 ms                   SetPreferredLocation(CPU)

Note: Times for 256 MB data on A100, PCIe Gen4. SetReadMostly avoids migration for read-only data.

Heterogeneous Memory Management (HMM)

Starting with CUDA 12.2 and Linux kernel 6.1+, NVIDIA introduced HMM (Heterogeneous Memory Management) support for CUDA. HMM integrates GPU page fault handling with the Linux kernel’s virtual memory subsystem, allowing regular malloc-allocated memory to be accessed by GPU kernels.

What HMM Enables

// Before HMM: only cudaMallocManaged memory is accessible from GPU
float* data = (float*)malloc(N * sizeof(float));
kernel<<<grid, block>>>(data);  // SEGFAULT or incorrect results

// With HMM: regular allocations work on GPU
float* data = (float*)malloc(N * sizeof(float));
kernel<<<grid, block>>>(data);  // Works! Pages fault-migrate as needed

HMM extends the unified memory model to all system memory, not just memory allocated with cudaMallocManaged. This is a significant step toward true unified addressing where the programmer never needs to think about which memory a pointer came from.

HMM Performance Characteristics

HMM uses the same page migration mechanism as cudaMallocManaged, so the performance characteristics are similar: page faults on first access, bulk migration with prefetch, and the same advice APIs apply. The main difference is that HMM-managed pages use the OS page table rather than the CUDA-specific page table, which can add slight overhead for page table walks.

📊 HMM vs cudaMallocManaged Performance

Operation                        cudaMallocManaged   HMM (malloc)   Overhead
First-touch migration (1 GB)     32 ms               35 ms          +9%
Prefetched access (1 GB)         11 ms               12 ms          +9%
Kernel execution (data on GPU)   Identical           Identical      0%
Page fault latency               ~30 us              ~35 us         +17%

Note: HMM has slightly higher overhead due to OS page table integration. Once data is on GPU, performance is identical.

ℹ️ HMM Adoption

HMM is still maturing. As of 2025, most production CUDA applications continue to use cudaMallocManaged for unified memory rather than HMM. The primary use case for HMM is legacy applications with large malloc-based codebases that need GPU acceleration without a full memory management rewrite.

Real Benchmarks: Explicit vs Unified Memory

Let us look at concrete benchmarks across several common workloads to understand where unified memory stands relative to explicit management.

GEMM (Matrix Multiplication)

GEMM is the best case for explicit memory: large, contiguous transfers followed by compute-intensive processing. Unified memory can approach this with prefetching.

📊 GEMM Performance: Explicit vs Unified Memory (A100 80 GB)

Matrix Size   Explicit (TFLOPS)   UM No Prefetch (TFLOPS)   UM + Prefetch (TFLOPS)   UM+Prefetch vs Explicit
1024x1024     12.8                8.2                       12.1                     95%
4096x4096     156                 142                       153                      98%
8192x8192     278                 265                       275                      99%
16384x16384   298                 290                       296                      99%

Note: Large GEMMs are compute-bound, so the migration overhead is a negligible fraction of total time. Small GEMMs expose migration latency.

For large GEMMs, unified memory with prefetching reaches 98-99% of explicit memory performance. The small remaining overhead comes from the managed-memory page table lookup. Without prefetching, performance drops by 5-36% depending on matrix size, because page faults serialize at kernel start.

Stencil Computation (Memory-Bound)

Stencil operations (convolutions, finite differences) are memory-bandwidth-bound and access data sequentially. This is a moderate case for unified memory.

📊 3D Stencil Performance (512^3 grid, A100)

Approach                                       Bandwidth (GB/s)   Time (ms)   vs Explicit
Explicit cudaMemcpy                            1420               5.8         Baseline
UM, no hints                                   890                9.2         63%
UM + prefetch                                  1350               6.1         95%
UM + prefetch + SetReadMostly (coefficients)   1390               5.9         98%

Sparse Graph Traversal (Irregular Access)

Graph algorithms with random access patterns represent the worst case for unified memory without hints but the best case for its oversubscription capabilities.

📊 BFS on Large Graph (100M vertices, 1B edges)

Approach                          Graph in GPU Memory?    Time (s)   Notes
Explicit (graph fits)             Yes                     2.1        Baseline
UM, no hints (graph fits)         Yes (after migration)   3.8        Page faults on first BFS
UM + prefetch (graph fits)        Yes                     2.3        Near-explicit performance
UM, oversubscribed (2x GPU mem)   Partially               8.5        Pages evict and reload
Explicit (graph too large)        No                      N/A        Cannot run

Note: Oversubscription enables running workloads that would be impossible with explicit memory, at a performance cost.

Unified Memory Performance Relative to Explicit (A100, % of explicit performance)

  • Large GEMM + prefetch: 99% (near-identical)
  • Stencil + prefetch: 95%
  • Large GEMM, no hints: 85%
  • Stencil, no hints: 63%
  • Graph BFS, no hints: 55%
  • Random access, no hints: 8% (unacceptable)

Summary of When to Use Which

💡 Decision Framework

Use explicit memory when: you know the access pattern at compile time, data fits in GPU memory, and you need maximum performance (production inference, HPC simulation).

Use unified memory with prefetch when: you want cleaner code with near-explicit performance, or you need multi-GPU data sharing.

Use unified memory without hints when: you are prototyping, the access pattern is unpredictable, or the data exceeds GPU memory (oversubscription).

Avoid unified memory when: you have random access patterns with no temporal locality, or the workload alternates CPU/GPU access rapidly (thrashing risk).

The Grace Hopper Unified Memory Architecture

NVIDIA’s Grace Hopper Superchip (GH200), released in 2023, represents a fundamental architectural shift for unified memory. Instead of connecting CPU and GPU over PCIe (31 GB/s) or even NVLink (600 GB/s between GPUs), Grace Hopper places an ARM-based CPU (Grace) and a Hopper GPU (H100) on the same module with a coherent NVLink-C2C interconnect at 900 GB/s.

What Changes with Grace Hopper

Coherent shared memory: Grace Hopper provides hardware cache coherence between CPU and GPU memory. This means the CPU can access GPU memory (and vice versa) without page migration — the hardware maintains coherence at cache-line granularity, similar to how multi-socket CPUs maintain coherence.

Combined memory pool: The Grace CPU has up to 480 GB of LPDDR5X memory, and the Hopper GPU has 96 GB of HBM3. With NVLink-C2C, these form a single 576 GB coherent memory pool. The GPU can access CPU memory at 450+ GB/s (compared to ~31 GB/s over PCIe).

No page faults for cross-access: Because the interconnect supports coherent access, the GPU can directly load from CPU memory without triggering a page fault and migration. The access is slower than local HBM3 (450 GB/s vs 3.35 TB/s) but far faster than PCIe-based migration.

📊 Memory Access Performance: PCIe vs NVLink-C2C (Grace Hopper)

Access Type                          PCIe Gen4   NVLink-C2C                Improvement
CPU to GPU memory bandwidth          31 GB/s     450 GB/s                  14.5x
GPU to CPU memory bandwidth          31 GB/s     450 GB/s                  14.5x
GPU local HBM3 bandwidth             3.35 TB/s   3.35 TB/s                 Same
Page fault latency                   ~30 us      ~2 us (coherent access)   15x lower
Effective unified memory bandwidth   ~25 GB/s    ~400 GB/s                 16x

Note: NVLink-C2C provides coherent, low-latency cross-access that eliminates most of the unified memory performance penalty.

Impact on Unified Memory Programming

Grace Hopper makes unified memory a first-class programming model rather than a convenience abstraction with performance caveats. The programming model is the same (cudaMallocManaged, prefetch, advice), but the performance gap between explicit and unified memory shrinks dramatically because:

  1. Page migration is faster (900 GB/s interconnect)
  2. Coherent access eliminates migration for many patterns (no fault, just slower access)
  3. The combined memory pool means oversubscription is rare (576 GB total)

Unified Memory Performance: PCIe System vs Grace Hopper (% of local-memory performance)

  • PCIe: Large GEMM + prefetch: 95%
  • GH200: Large GEMM + prefetch: 99%
  • PCIe: Random access, no hints: 8%
  • GH200: Random access, no hints: 45% (5.6x better than PCIe)
  • PCIe: Oversubscribed graph BFS: 25%
  • GH200: Large-pool graph BFS: 85% (fits in the 576 GB pool)
Grace Hopper Changes the Calculus

On PCIe systems, the recommendation is “use explicit memory unless you have a good reason not to.” On Grace Hopper, the recommendation flips: “use unified memory unless you need the last 5% of performance.” The coherent interconnect eliminates the performance cliff that made unified memory risky on PCIe systems.

Unified Memory for AI Workloads

Training

During training, the memory access pattern is predictable: forward pass reads weights and activations, backward pass reads and writes gradients. Explicit memory management is standard and well-optimized by frameworks like PyTorch and JAX.

Unified memory can help in specific training scenarios:

  • Large model training with CPU offloading: ZeRO-Offload and similar techniques move optimizer states to CPU memory. Unified memory simplifies the data movement code, and on Grace Hopper, the high-bandwidth interconnect makes CPU-GPU transfers fast enough to hide behind computation.
  • Dynamic batch sizes: When batch size varies based on sequence length (common in NLP), memory requirements are unpredictable. Unified memory’s oversubscription handles this gracefully.

Inference

For inference, the primary concern is KV cache management. Long-context models at 128K tokens can have KV caches exceeding 20 GB. Unified memory enables:

  • KV cache overflow to CPU memory: When GPU memory is exhausted, unified memory can transparently spill KV cache pages to CPU memory. Frequently-accessed pages stay on GPU; cold pages reside on CPU.
  • Speculative decoding: Speculative decoding runs a smaller draft model and a larger verification model. With unified memory, both models can share the KV cache without explicit copies.

📊 Unified Memory for LLM Inference

Scenario                           Explicit Memory                  Unified Memory                     Benefit
Standard inference (model fits)    Best throughput                  ~97% of explicit (with prefetch)   Simpler code
KV cache overflow (128K context)   Must implement paging manually   Automatic overflow to CPU          Major simplification
Multi-model serving                Complex memory partitioning      Shared memory pool, on-demand      Dynamic allocation
Grace Hopper inference             Marginal benefit                 Near-native performance            Best of both worlds

Performance Optimization Checklist

For practitioners using unified memory, here is a prioritized optimization checklist:

  1. Always prefetch before compute-intensive kernels. This single step closes most of the performance gap with explicit memory.

  2. Use SetReadMostly for immutable data (model weights, lookup tables, configuration data). This eliminates migration for read-only access patterns.

  3. Use SetPreferredLocation to keep data near its primary consumer. If data is mostly used by GPU, set preferred location to GPU. CPU reads will then use zero-copy access (slower but no migration).

  4. Avoid CPU-GPU thrashing. If you must alternate access, batch the accesses: let GPU process a large chunk, prefetch results to CPU, let CPU process, prefetch back to GPU.

  5. Profile with Nsight Systems (nvprof is deprecated on recent CUDA versions). Look for Unified Memory events in the timeline. High page fault counts indicate missing prefetch hints.

  6. Consider explicit management for hot paths. Unified memory and explicit memory can coexist. Use unified memory for complex data structures and explicit memory for performance-critical bulk data.

// Mixed approach: explicit for hot path, unified for complex structures
void mixed_approach() {
    // Hot path: explicit management for maximum bandwidth
    float *d_weights;
    cudaMalloc(&d_weights, weight_size);
    cudaMemcpy(d_weights, h_weights, weight_size, cudaMemcpyHostToDevice);

    // Complex structure: unified memory for convenience
    GraphNode* graph;
    cudaMallocManaged(&graph, graph_size);
    build_graph_on_cpu(graph);
    cudaMemPrefetchAsync(graph, graph_size, gpu_device_id);

    // Kernel uses both
    process<<<grid, block>>>(d_weights, graph);
}

Conclusion

CUDA Unified Memory is not a binary choice between "use it" and "do not use it." It is a tool whose performance varies dramatically based on access patterns, hardware generation, and whether you apply optimization hints.

On PCIe-connected systems, unified memory with prefetching reaches 95-99% of explicit memory performance for regular access patterns. Without hints, it can drop to single-digit percentages of explicit performance for random access. The decision framework is clear: use unified memory for programmer productivity and oversubscription, but always add prefetch hints for performance-critical paths.

On Grace Hopper, the calculus shifts fundamentally. The 900 GB/s coherent interconnect reduces the cross-access penalty to a fraction of what it is on PCIe, making unified memory the default choice for most workloads. This architectural direction — tighter CPU-GPU integration with coherent memory — will likely continue in future hardware generations, gradually making the explicit memory management model a legacy approach reserved for extreme performance requirements.

The practical advice is straightforward: start with unified memory for development velocity, profile to identify bottlenecks, add prefetch hints where they matter, and fall back to explicit management only for the specific allocations where the last few percent of performance justifies the code complexity.