CPU vs GPU Memory: Why GPUs Need Different Optimization

If you have spent time optimizing code for CPUs — tuning cache access patterns, aligning data to cache lines, profiling branch mispredictions — you might expect GPU optimization to be a straightforward extension of the same principles. It is not. GPUs and CPUs solve the memory latency problem in fundamentally different ways, and the optimization strategies that dominate on one architecture are often irrelevant or counterproductive on the other.

This post is the bridge between two worlds. If you come from embedded systems or high-performance CPU programming, it will show you why GPU memory works the way it does and what changes in your mental model. If you come from GPU programming, it will show you the CPU heritage that shaped the problems GPUs were designed to solve. Either way, the goal is the same: build an architectural intuition deep enough that you can predict which optimizations matter before you reach for a profiler.

Part I: The CPU Memory Hierarchy

The Memory Wall

The central problem of CPU design since the 1990s has been the memory wall. Processor clock speeds and pipeline depths increased exponentially through the 1980s and 1990s, but DRAM access latency improved only modestly. By 2000, a single cache miss to main memory cost roughly 200 processor cycles. By 2025, that ratio has grown to 300-400 cycles on high-frequency desktop parts. The CPU can execute hundreds of useful instructions in the time it takes to fetch one cache line from DRAM.

The CPU’s answer to this problem is a hierarchy of progressively larger and slower caches, combined with sophisticated prediction and prefetching hardware that tries to ensure data is already in a fast cache before the program needs it.

📊

CPU Cache Hierarchy (Intel Xeon Sapphire Rapids, 2023)

| Level | Size (per core) | Latency | Bandwidth (per core) | Miss Penalty vs Previous |
|---|---|---|---|---|
| L1 Data | 48 KB | ~4 cycles (~1.2 ns) | ~300 GB/s | -- |
| L2 | 2 MB | ~12 cycles (~4 ns) | ~100 GB/s | ~8 cycles |
| L3 (shared) | 1.875 MB/core (up to 105 MB total) | ~40 cycles (~13 ns) | ~50 GB/s | ~28 cycles |
| DRAM (DDR5) | GBs | ~200+ cycles (~60-80 ns) | ~25 GB/s | ~160 cycles |

Note: Bandwidth is per-core sustained for sequential access. Actual numbers vary by SKU, memory configuration, and access pattern.

Each level trades capacity for speed. L1 is tiny (48 KB) but delivers data in 4 cycles. DRAM is enormous but 50x slower. The system works because of a statistical property of real programs: most memory accesses cluster around a small working set (temporal locality), and programs that access one address tend to access nearby addresses shortly afterward (spatial locality). When these properties hold, the cache hierarchy is nearly invisible — the program runs at close to L1 speed despite having gigabytes of data.

When they do not hold — when access patterns are random, or the working set exceeds cache capacity — performance collapses.

CPU Memory Throughput by Access Pattern

| Access Pattern | Throughput (per core) |
|---|---|
| Sequential (in L1) | 300 GB/s |
| Sequential (in L2) | 100 GB/s |
| Sequential (in L3) | 50 GB/s |
| Sequential (DRAM) | 25 GB/s |
| Random (DRAM) | 1.5 GB/s (17x slower than sequential DRAM) |

Random DRAM access achieves roughly 1.5 GB/s per core — 200x slower than L1 sequential throughput. This single measurement explains an enormous number of performance phenomena: why hash tables with large, cold backing stores are slow, why linked lists are hostile to modern hardware, why column-oriented databases outperform row-oriented ones for analytics queries, and why sorting data before processing it can be faster even when the sort itself is expensive.

Cache Lines: The Unit of Transfer

CPU caches do not operate on individual bytes. They operate on cache lines, which are 64 bytes on all modern x86 and ARM processors. When the CPU needs a byte that is not in cache, it fetches the entire 64-byte line containing that byte. This has two major consequences.

Spatial prefetch benefit. If you access byte 0 of a struct, bytes 1 through 63 are already cached. Sequential iteration through an array is fast because each cache line fetch provides 64 bytes of useful data, and the hardware prefetcher detects the sequential pattern and begins fetching the next lines before you need them.

False sharing penalty. If two threads on different cores modify variables that happen to reside on the same 64-byte cache line, the hardware cache coherency protocol (MESI/MESIF/MOESI) must bounce that line between cores on every write. Each bounce costs 40-100 cycles. What looks like independent, parallel work becomes serialized.

📊

False Sharing Impact (4 Cores, Independent Counters)

| Layout | Throughput (M ops/s) | Relative Performance |
|---|---|---|
| Separate cache lines (64-byte aligned) | 400 | 1x (baseline) |
| Same cache line (adjacent uint64_t) | 12 | 0.03x (33x slower) |
| Padded to separate lines | 380 | 0.95x (recovered) |

Note: Each of 4 threads increments its own private counter in a tight loop. False sharing causes coherency traffic that serializes updates.

The fix is simple: pad per-thread data to cache line boundaries using alignas(64) or manual padding. But you have to know the problem exists, and you have to know the cache line size. This is the kind of architectural detail that CPU optimization demands and GPU optimization largely does not — because GPUs solve the concurrency problem differently.
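As a minimal sketch of the padding fix in C++ (the hard-coded 64 could be replaced by C++17's std::hardware_destructive_interference_size):

```cpp
#include <cstdint>

// One counter per worker thread. Without alignas(64), four adjacent
// uint64_t counters would sit on a single 64-byte cache line, and every
// increment would bounce that line between cores.
struct alignas(64) PaddedCounter {
    uint64_t value = 0;
    // alignas pads sizeof(PaddedCounter) up to 64, so each element of an
    // array of these lands on its own cache line.
};

static_assert(sizeof(PaddedCounter) == 64, "one counter per cache line");
static_assert(alignof(PaddedCounter) == 64, "cache-line aligned");

PaddedCounter counters[4]; // one per thread; no false sharing
```

The static_asserts document the assumption in code: if a future platform changes the alignment, the build fails instead of silently reintroducing false sharing.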

Prefetching: Predicting the Future

Modern CPUs contain sophisticated hardware prefetchers that detect memory access patterns and issue speculative loads ahead of the program’s actual requests. The L2 stride prefetcher, for example, can detect strided access patterns (every N-th cache line) and begin loading future lines into L2 before the program asks for them. The L1 prefetcher handles simpler sequential (next-line) patterns.

When prefetching works, it hides DRAM latency almost completely. A sequential scan of a large array, once the prefetcher locks onto the pattern, runs at close to the sustained DRAM bandwidth of the system (25-50 GB/s per core on modern hardware). The program never stalls because the next cache line is always ready.

When prefetching fails — because the access pattern is irregular, or because the pattern changes too quickly for the prefetcher to adapt — every L3 miss becomes a full DRAM latency penalty. This is why pointer-chasing data structures (linked lists, trees, graphs) are so expensive: the address of the next node depends on the data loaded from the current node, so the prefetcher cannot predict the access pattern.

Why Sorting Helps

A common CPU optimization trick: sort your data by the key you will use to look up related data. If you have an array of objects and you iterate through them accessing a field that points to scattered memory locations, sorting by that field clusters the target addresses. This converts a random access pattern into a roughly sequential one, enabling the prefetcher and reducing L3 miss rates by 10x or more. On GPUs, this idea has a direct analogue in coalescing, which we will discuss later.

Software prefetching (__builtin_prefetch on GCC/Clang, _mm_prefetch on MSVC) gives the programmer explicit control. You can issue a prefetch 100-200 iterations ahead in a loop, so the data is in cache by the time you need it. This is effective for strided or semi-predictable access patterns that the hardware prefetcher cannot detect. But it must be tuned carefully: prefetch too early and the data may be evicted before use; prefetch too late and you still stall. On GPUs, software prefetching is not a concept — the hardware uses a completely different mechanism to hide latency.
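A sketch of the technique with GCC/Clang's __builtin_prefetch (the lookahead distance of 16 is an illustrative guess, not a universal constant — it must be tuned per workload):

```cpp
#include <cstddef>
#include <vector>

// Gather values through an index array, prefetching the target of a lookup
// PREFETCH_DIST iterations ahead. Too small a distance and the load still
// stalls; too large and the line may be evicted before it is used.
constexpr std::size_t PREFETCH_DIST = 16;

double gather_sum(const std::vector<double>& data,
                  const std::vector<std::size_t>& idx) {
    double sum = 0.0;
    for (std::size_t i = 0; i < idx.size(); ++i) {
        if (i + PREFETCH_DIST < idx.size()) {
            // Arguments: address, rw hint (0 = read), locality hint (1 = low).
            __builtin_prefetch(&data[idx[i + PREFETCH_DIST]], 0, 1);
        }
        sum += data[idx[i]];
    }
    return sum;
}
```

The prefetch works here precisely because the *index* array is read sequentially — the future address idx[i + PREFETCH_DIST] is computable now, which is what the hardware prefetcher cannot do on its own.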

Branch Prediction: Speculating on Control Flow

CPUs execute instructions through deep pipelines — 15 to 25 stages on modern designs. When the pipeline encounters a conditional branch (if, loop exit, switch), it does not know which path to take until the condition is evaluated, which may depend on data still being loaded from memory. Stalling the pipeline for every branch would waste 15-25 cycles per branch.

Instead, CPUs use branch predictors: hardware structures that guess which way each branch will go, based on the history of that branch and correlated branches. Modern branch predictors (TAGE-based predictors on Intel, perceptron-based predictors on AMD) achieve prediction accuracies above 97% for typical workloads. When the prediction is correct, the pipeline continues at full speed. When the prediction is wrong (a misprediction), the pipeline must be flushed and restarted from the correct path, costing 15-25 cycles.

📊

Branch Prediction Impact

| Scenario | Branch Pattern | Misprediction Rate | Throughput Impact |
|---|---|---|---|
| Sorted array filter | Long runs of taken/not-taken | <1% | Baseline (fast) |
| Unsorted array filter | Random taken/not-taken | ~50% | 2-5x slower |
| Branchless (CMOV) | No branches | N/A | ~1.2x baseline |

Note: Filtering an array: if (arr[i] > threshold) sum += arr[i]. Sorted arrays create predictable branches. Unsorted arrays create unpredictable ones.

This is the famous “sorting makes branches predictable” effect. When you iterate over a sorted array and test each element against a threshold, the branch is taken for a long run of elements, then not-taken for a long run. The branch predictor learns this pattern instantly. On an unsorted array, the branch is effectively random, and the predictor has a roughly 50% miss rate.

The CPU devotes enormous transistor area and design complexity to branch prediction. It is one of the most power-hungry structures on the chip. And it exists for one reason: to keep a single thread’s pipeline full, minimizing latency for sequential code. GPUs take an entirely different approach.

Working Set Size: The Cache Cliff Effect

Performance does not degrade gradually as your data grows. It drops sharply at each cache boundary:

CPU Throughput vs Working Set Size (Sequential Read)

| Working set | 4 KB | 16 KB | 64 KB | 256 KB | 1 MB | 4 MB | 16 MB | 64 MB | 256 MB |
|---|---|---|---|---|---|---|---|---|---|
| Read throughput (GB/s) | 280 | 270 | 250 | 95 | 90 | 48 | 45 | 24 | 24 |

Three cliffs are visible. The L1-to-L2 cliff around 48 KB, the L2-to-L3 cliff around 2 MB, and the L3-to-DRAM cliff around 30-100 MB (depending on core count and L3 allocation). Each cliff represents a 2-5x throughput drop. Knowing where your working set sits relative to these boundaries is the single most important performance insight for CPU-bound code.

💡 Measuring Cache Performance on Linux

Use perf stat -e L1-dcache-load-misses,l2_rqsts.miss,LLC-load-misses to measure cache miss rates at each level. An L1 miss rate above 5% or an LLC miss rate above 1% usually indicates optimization opportunity. perf c2c is invaluable for detecting false sharing between cores.

CPU Optimization Summary

The CPU memory hierarchy is designed around a single organizing principle: minimize the latency experienced by each individual thread. Every major feature — deep caches, hardware prefetching, branch prediction, out-of-order execution, speculative loads — exists to keep one thread running as fast as possible. The transistor budget is enormous: a modern CPU core dedicates roughly 80% of its die area to structures that predict, prefetch, reorder, and speculate. Only about 20% is actual execution units.

This design is optimal for workloads with complex, branchy control flow, pointer-heavy data structures, and moderate parallelism. Operating systems, databases, compilers, game engines, web servers — these all benefit enormously from the CPU’s latency-optimizing architecture. But it is a poor match for massively parallel, regular workloads. That is where GPUs diverge.

Part II: The GPU Memory Hierarchy

A Different Design Philosophy

GPUs were born from a different constraint. Graphics rendering requires applying the same operation (a pixel shader, a vertex transform) to millions of independent data elements. There is no data dependency between pixels. The workload is embarrassingly parallel, but each individual operation may need to access texture memory (slow, like DRAM). If a GPU tried to optimize each pixel’s latency the way a CPU optimizes each thread’s latency, it would need millions of branch predictors, millions of prefetch units, and millions of reorder buffers. That is not feasible.

Instead, GPUs make a radical tradeoff: they accept high per-thread latency and compensate with massive parallelism. Rather than predicting and prefetching to avoid stalls, the GPU simply switches to a different group of threads whenever the current group is waiting for memory. If you have enough threads in flight, there is always one ready to execute, and the execution units never go idle.

This single architectural decision — hide latency through parallelism rather than prediction — explains almost every difference between CPU and GPU memory optimization.

⚠️ The Fundamental Tradeoff

CPUs optimize for latency: make each thread finish as fast as possible. GPUs optimize for throughput: maximize the total work completed per second across all threads, even if each individual thread is slow. These are not just different points on the same spectrum. They lead to fundamentally different hardware designs and fundamentally different optimization strategies.

GPU Memory Hierarchy Overview

A modern GPU (NVIDIA H100) has its own memory hierarchy, but the sizes, latencies, and design intent are different from a CPU:

📊

GPU Memory Hierarchy (NVIDIA H100 SXM)

| Level | Size (per SM) | Latency (cycles) | Bandwidth | Scope |
|---|---|---|---|---|
| Registers | 256 KB | 1 cycle | ~100+ TB/s (aggregate) | Per thread |
| Shared Memory | Up to 228 KB | ~23 cycles | ~3+ TB/s (aggregate) | Per thread block |
| L1 Cache | 256 KB (shared pool with SMEM) | ~30 cycles | ~1.5-2 TB/s | Per SM |
| L2 Cache | 50 MB (total) | ~200 cycles | ~12 TB/s | Entire GPU |
| HBM3 (Global Memory) | 80 GB | ~400-600 cycles | 3,350 GB/s | Entire GPU + Host |

Note: Register bandwidth is effectively unlimited since each thread accesses its own registers every cycle. Shared memory and L1 share the same physical SRAM.

At first glance, this looks similar to a CPU: registers are fast, caches are medium, main memory is slow. But the design intent is completely different.

CPU L1 cache: 48 KB, automatic, managed by hardware. The programmer never explicitly moves data into L1. The hardware prefetcher and cache replacement policy handle everything. The programmer’s job is to arrange access patterns so the automatic system works well.

GPU shared memory: up to 228 KB, explicit, managed by the programmer. The programmer explicitly loads data from global memory into shared memory, explicitly synchronizes threads, and explicitly reads from shared memory. It is a software-managed scratchpad, not a transparent cache. This gives the programmer precise control over what data is resident and when — at the cost of significantly more complex code.

CPU L3 cache: tens of MB, shared across cores, critical for multi-threaded workloads. The L3 acts as a victim cache and coherency domain, reducing the cost of cross-core data sharing. CPU optimization often focuses on keeping shared data structures within L3.

GPU L2 cache: 50 MB, shared across all SMs, but not a primary optimization target. The GPU L2 helps with redundant global memory reads but is not something programmers typically optimize for directly. The primary memory optimization on GPUs is coalescing (reducing global memory transactions) and shared memory tiling (keeping data in SRAM).

No Branch Prediction: Warps Instead

Here is where the architectures diverge most dramatically. CPUs spend enormous die area on branch prediction to keep a single thread’s pipeline full. GPUs have no branch predictor at all (or at most trivial ones for control flow within a warp). Instead, GPUs use a mechanism called warp-level execution (NVIDIA terminology; AMD calls them wavefronts).

A warp is a group of 32 threads that execute the same instruction at the same time (SIMT — Single Instruction, Multiple Threads). When a warp encounters a branch:

  1. If all 32 threads take the same path, execution continues normally — no penalty.
  2. If some threads take one path and others take the other, the warp executes both paths serially, masking off the threads that should not execute each path. This is called warp divergence.

Warp Divergence Impact on Effective Throughput

| Scenario | Effective Throughput (% of peak) |
|---|---|
| No divergence (all 32 threads take the same path) | 100% |
| 50% divergence (16/16 split; both paths execute, half the threads masked) | 50% |
| Worst case (32-way divergence; each thread takes a unique path) | 3.1% |

Warp divergence does not cause a pipeline flush or a misprediction penalty. It causes reduced SIMT utilization — fewer threads are doing useful work each cycle. In the worst case, if every thread in a warp takes a different path through a 32-way switch statement, the warp executes all 32 paths serially, achieving only 1/32 of peak throughput.
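The serialization cost can be modeled in a few lines of C++ (a simplification: it assumes every divergent path has the same instruction count and ignores reconvergence details):

```cpp
#include <array>
#include <set>

// Model: the warp executes each distinct control-flow path serially, with
// non-matching lanes masked off. If all paths have equal length, effective
// throughput is 1 / (number of distinct paths taken by the 32 lanes).
double warp_efficiency(const std::array<int, 32>& lane_path) {
    std::set<int> distinct_paths(lane_path.begin(), lane_path.end());
    return 1.0 / static_cast<double>(distinct_paths.size());
}
```

Feeding it the three scenarios above reproduces the numbers: one path gives 1.0, a 16/16 split gives 0.5, and 32 unique paths give 1/32 ≈ 3.1%.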

The optimization implications are the opposite of CPU optimization:

  • CPU: Make branches predictable (sort data, use consistent patterns). The branch predictor will handle the rest.
  • GPU: Eliminate branches entirely, or ensure all threads in a warp take the same path. Replace if/else with arithmetic (branchless code). Reorganize data so that threads with similar control flow are in the same warp.
Branchless GPU Programming

On CPUs, branchless code (using conditional moves, arithmetic masks) is an optimization for unpredictable branches. On GPUs, branchless code is the default programming style. Instead of if (condition) x = a; else x = b;, GPU programmers write x = condition * a + (1 - condition) * b; or use ternary operators that the compiler can lower to predicated instructions. The cost of a branch on a GPU is not a misprediction penalty — it is the throughput lost to divergence: halved, or worse.
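A minimal C++ sketch of the two equivalent forms (the function names are illustrative; on a GPU these would be per-thread device functions):

```cpp
// Branchless select via arithmetic, as in the formula above. There is no
// control flow, so all 32 lanes of a warp execute identically.
float select_branchless(bool cond, float a, float b) {
    float c = static_cast<float>(cond);  // 1.0f or 0.0f
    return c * a + (1.0f - c) * b;       // x = cond*a + (1 - cond)*b
}

// Ternary form: compilers can typically lower this to a predicated
// instruction (CMOV on x86, SEL/predication on GPUs) rather than a branch.
float select_ternary(bool cond, float a, float b) {
    return cond ? a : b;
}
```

Both return the same value for every input; the difference is purely in what instruction sequence the compiler is allowed to emit.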

No Deep Prefetching: Massive Parallelism Instead

CPU prefetchers detect access patterns and speculatively load data before the program needs it. GPUs have rudimentary prefetching at best (some texture cache prefetching for 2D spatial patterns). Instead, GPUs hide memory latency through occupancy — having many more warps resident on each SM than can execute simultaneously.

When one warp issues a global memory load and must wait 400-600 cycles for the result, the SM’s warp scheduler immediately switches to another warp that has its data ready. If there are enough warps in flight (typically 8-16 active warps per SM at minimum), the memory latency is completely hidden behind useful compute from other warps.
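The arithmetic behind "enough warps" is essentially Little's law. A back-of-envelope sketch (not a hardware model — real schedulers, dual issue, and instruction mixes complicate this):

```cpp
// If a load takes `latency_cycles` to return and each warp can issue
// `independent_work_cycles` of useful instructions before stalling on that
// load, the scheduler needs roughly latency / work ready warps to keep the
// SM's execution units busy the whole time (rounded up).
int warps_to_hide_latency(int latency_cycles, int independent_work_cycles) {
    return (latency_cycles + independent_work_cycles - 1) / independent_work_cycles;
}
```

With a 500-cycle load and 25 cycles of independent work per warp, this estimates 20 resident warps — consistent with the rule of thumb that memory-bound kernels want high occupancy.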

GPU Kernel Performance vs Occupancy

| Occupancy | 12.5% | 25% | 37.5% | 50% | 62.5% | 75% | 87.5% | 100% |
|---|---|---|---|---|---|---|---|---|
| Memory-bound kernel (% of peak bandwidth) | 15 | 35 | 55 | 72 | 82 | 90 | 95 | 98 |
| Compute-bound kernel (% of peak FLOPS) | 45 | 70 | 82 | 88 | 91 | 93 | 94 | 95 |

For memory-bound kernels, the relationship between occupancy and performance is steep. At 12.5% occupancy (only a few warps per SM), the SM is idle most of the time, waiting for memory. At 50% or higher, enough warps are available to keep the execution units busy during memory stalls. Compute-bound kernels are less sensitive because they spend less time waiting for memory.

This replaces the CPU’s prefetch mechanism with a completely different contract between hardware and programmer:

  • CPU contract: “Arrange your access patterns so the hardware prefetcher can predict them, and I will hide DRAM latency for you.”
  • GPU contract: “Give me thousands of threads, and I will hide DRAM latency by switching between them. Your per-thread access pattern does not matter for latency hiding — only the total number of threads.”

The GPU approach has a major advantage: it works regardless of access pattern complexity. Pointer chasing, indirect indexing, random scatter/gather — these patterns destroy CPU prefetcher performance but are handled just fine by GPU occupancy-based latency hiding (though they may still suffer from low bandwidth due to poor coalescing, discussed next).

Coalescing vs Cache Lines

On CPUs, the cache line is the fundamental unit of memory transfer, and the primary optimization is ensuring spatial locality: access memory sequentially so that each cache line fetch is fully utilized.

On GPUs, the analogous concept is memory coalescing. When a warp of 32 threads issues memory accesses, the hardware memory controller examines all 32 addresses and combines (coalesces) them into the minimum number of memory transactions. If all 32 threads access consecutive 4-byte elements (addresses 0, 4, 8, …, 124), the hardware issues a single 128-byte transaction. If the 32 threads access scattered addresses, the hardware must issue up to 32 separate transactions.

📊

Coalescing Impact on Global Memory Bandwidth

| Access Pattern | Transactions per Warp | Effective Bandwidth | Utilization |
|---|---|---|---|
| Fully coalesced (consecutive addresses) | 1 x 128B | ~3,000 GB/s | ~90% of peak |
| Strided by 2 (every other element) | 2 x 128B | ~1,500 GB/s | ~45% of peak |
| Strided by 32 (one element per cache line) | 32 x 32B | ~100 GB/s | ~3% of peak |
| Random (scattered addresses) | Up to 32 x 32B | ~80-150 GB/s | ~3-5% of peak |

Note: H100 SXM, peak HBM bandwidth ~3,350 GB/s. Strided-by-32 is the worst regular pattern because each thread hits a different cache line.

The parallel to CPU cache lines is clear. In both cases, the hardware fetches data in fixed-size chunks (64 bytes for CPU cache lines, 32-byte sectors or 128-byte transactions for GPU). In both cases, the optimization is to ensure your access pattern uses the full chunk. But the mechanism is different:

  • CPU: One thread accesses memory sequentially. The cache line is utilized because successive accesses from the same thread hit the same line.
  • GPU: 32 threads access memory simultaneously. The transaction is utilized because the 32 threads’ addresses are adjacent, so they pack into the same transaction.
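The transaction count can be approximated with a first-order model in C++: count how many 32-byte sectors a warp's 32 addresses touch. Real memory controllers apply additional rules, so treat this as illustrative:

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Each distinct 32-byte sector touched by the warp's addresses costs (at
// least) one memory transaction. Coalesced access packs many lanes into
// few sectors; scattered access touches one sector per lane.
int sectors_touched(const std::vector<uint64_t>& addrs) {
    std::set<uint64_t> sectors;
    for (uint64_t a : addrs) sectors.insert(a / 32);
    return static_cast<int>(sectors.size());
}
```

Thirty-two consecutive 4-byte accesses (addresses 0, 4, ..., 124) touch 4 sectors (one 128-byte transaction's worth); a 128-byte stride touches 32 sectors, matching the worst row of the table above.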

This distinction matters enormously for data layout. On a CPU, an Array of Structures (AoS) layout often works well for single-threaded code because fields of the same struct are adjacent and are loaded together in one cache line. On a GPU, Structure of Arrays (SoA) is almost always better because it ensures that when 32 threads access the same field, their addresses are consecutive and coalesce perfectly.

// AoS: Good for CPU single-thread, bad for GPU coalescing
struct Particle { float x, y, z, mass; };
Particle particles[N];
// Thread i accesses particles[i].x -- stride of 16 bytes between threads

// SoA: Bad for CPU single-thread (usually), great for GPU coalescing
struct Particles {
    float x[N], y[N], z[N], mass[N];
};
Particles particles;
// Thread i accesses particles.x[i] -- stride of 4 bytes between threads (coalesced!)
💡 AoS vs SoA: The Universal Tradeoff

AoS groups all fields of one entity together (good for spatial locality when processing one entity at a time). SoA groups all values of one field together (good for SIMD/SIMT processing of one field across many entities). CPUs benefit from SoA too when using SIMD intrinsics. The rule is simple: match your data layout to your access pattern. On GPUs, the access pattern is always “32 threads processing 32 entities in lockstep,” so SoA almost always wins.

Part III: Why the Architectures Diverge

The Latency-Throughput Tradeoff

The fundamental reason CPUs and GPUs have different memory hierarchies is that they optimize for different objectives.

A CPU core must minimize the latency of a single instruction stream. If a program loads a value from memory and immediately uses it in a comparison and branch, the CPU must get that value as fast as possible — ideally from L1 cache in 4 cycles. If the value is not cached, the CPU uses branch prediction to speculate on the branch outcome, out-of-order execution to find independent work to do while waiting, and prefetching to try to load the value before it is needed. All of these mechanisms are complex, power-hungry, and effective for sequential programs.

A GPU SM must maximize the throughput of N independent instruction streams. If one warp needs to wait 500 cycles for a memory load, that is fine — the SM runs 15 other warps in the meantime. No prediction, no speculation, no reordering needed. The hardware is simple: a warp scheduler that round-robins through ready warps. The complexity is offloaded to the programmer, who must provide enough parallel work (high occupancy) and arrange data for efficient access (coalescing).

Transistor Budget Allocation (Approximate)

| Architecture | Component | Share of Die Area |
|---|---|---|
| CPU | Execution units | ~20% |
| CPU | Caches (L1/L2/L3) | ~45% |
| CPU | Branch prediction + OoO + prefetch | ~35% |
| GPU | Execution units (ALUs, tensor cores) | ~50% (2.5x the CPU share) |
| GPU | Register file + shared memory | ~30% |
| GPU | Caches + schedulers (simple schedulers, minimal prediction) | ~20% |

The GPU devotes roughly 50% of its transistors to execution units, compared to about 20% for a CPU. The CPU uses the remaining 80% for caches, branch prediction, out-of-order machinery, and prefetchers — all in service of making one thread fast. The GPU uses its remaining 50% for register files and shared memory — in service of keeping many threads’ data close to the execution units.

This is not a value judgment. CPUs are not “wasting” transistors on cache and prediction. Those transistors are essential for the single-threaded, branchy, pointer-heavy workloads that CPUs excel at. GPUs are not “wasting” transistors on massive register files. Those registers are essential for maintaining the occupancy that hides memory latency.

Memory Bandwidth: Quality vs Quantity

CPUs and GPUs also differ in the type and amount of memory bandwidth available.

A modern server CPU (dual-socket Sapphire Rapids) might have 8 channels of DDR5-4800 per socket, providing roughly 300-380 GB/s of sustained memory bandwidth across both sockets. A single CPU core can sustain perhaps 25-50 GB/s.

A modern GPU (H100 SXM) has 80 GB of HBM3 memory providing 3,350 GB/s — roughly 10x the bandwidth of a dual-socket server. This bandwidth is shared across 128 SMs, but because the workload is parallel, most of those SMs can sustain memory accesses simultaneously.

📊

CPU vs GPU Memory Bandwidth Comparison

| Metric | Server CPU (2x Xeon 8480+) | GPU (H100 SXM) |
|---|---|---|
| Memory Technology | DDR5-4800 (8 channels/socket) | HBM3 (5 stacks, 10 channels) |
| Total Bandwidth | ~380 GB/s | 3,350 GB/s |
| Memory Capacity | Up to 4 TB | 80 GB |
| Per-Core/SM Bandwidth (sustained) | ~25-50 GB/s | ~26 GB/s |
| Bandwidth / FLOP (FP32) | ~0.05 bytes/FLOP | ~0.05 bytes/FLOP |

Note: Per-SM GPU bandwidth assumes 128 SMs sharing 3,350 GB/s. Both architectures achieve similar bytes-per-FLOP ratios, confirming that memory bandwidth scales with compute.

Interestingly, the bandwidth-per-FLOP ratio is similar between CPUs and GPUs. Both architectures are roughly balanced: they can deliver about 0.05 bytes per FLOP from main memory. The difference is scale — the GPU has 10x more total bandwidth and 10x more total compute. This is why GPUs win on parallel workloads: they have more of everything, provided you can actually use it all.

The choice of HBM over DDR is itself a consequence of the GPU design philosophy. HBM uses a wide interface (1024-bit bus per stack) with moderate clock speeds, achieving high bandwidth at lower power per bit. DDR uses a narrow interface (64-bit per channel) with high clock speeds, optimizing for latency and cost per gigabyte. HBM costs more per gigabyte but delivers more bandwidth per watt — exactly the tradeoff a throughput-oriented architecture wants.

Register Files: The Hidden Divergence

One of the most underappreciated differences between CPU and GPU architecture is the register file.

A CPU core has a small register file: 16 general-purpose registers on x86-64 (plus 16-32 SIMD registers). The compiler must carefully allocate these registers, and when they run out, values spill to the stack (L1 cache). The small register file works because the CPU has fast caches to catch spills, and because one thread does not need many live values simultaneously.

A GPU SM has an enormous register file: 256 KB per SM on the H100, shared across all threads running on that SM. Each thread can use up to 255 registers (1,020 bytes). If a kernel uses 64 registers per thread, each SM can support (256 × 1024) / (64 × 4) = 1024 threads, or 32 warps. If a kernel uses 128 registers per thread, only 512 threads (16 warps) fit — reducing occupancy.

This creates a tension unique to GPU programming: register pressure trades off against occupancy. Using more registers per thread means fewer threads per SM, which means less latency hiding. Using fewer registers per thread means more spills to local memory (backed by L1/L2/HBM), which is slow. The optimal balance depends on the kernel’s arithmetic intensity and memory access patterns.
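The register-file arithmetic from above fits in one function. A sketch using the H100-like numbers quoted in the text (256 KB register file, 4-byte registers, 2048-thread SM limit); real occupancy also depends on shared memory usage and block-size limits:

```cpp
// Threads that fit on one SM given the kernel's register usage: the
// register file caps threads at regfile_bytes / (regs * 4), and the SM
// itself caps resident threads at max_threads regardless of registers.
int max_threads_per_sm(int regs_per_thread,
                       int regfile_bytes = 256 * 1024,
                       int max_threads  = 2048) {
    int fit = regfile_bytes / (regs_per_thread * 4);
    return fit < max_threads ? fit : max_threads;
}
```

At 64 registers per thread this gives 1024 threads (32 warps); doubling to 128 registers halves that to 512 — the occupancy cliff the paragraph describes.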

CPUs do not have this tradeoff. A CPU thread always has 16 GPRs; there is no choice to make. Register spills go to L1 cache (fast). The number of threads is determined by the operating system, not by register allocation. This is another way in which GPU optimization is a fundamentally different discipline.

Part IV: Practical Implications

When CPU Optimization Intuitions Transfer to GPUs

Some CPU optimization insights carry over to GPUs, even if the mechanism is different:

Data layout matters. On CPUs, you arrange data for spatial locality (sequential access within one thread). On GPUs, you arrange data for coalescing (adjacent access across threads in a warp). Both boil down to: ensure memory transactions carry maximal useful data.

Working set size matters. On CPUs, you want your hot data in L1 (48 KB). On GPUs, you want your hot data in shared memory (up to 228 KB per SM) or registers (up to 256 KB per SM). Both boil down to: keep data close to the execution units.

Arithmetic intensity determines the optimization target. On both architectures, if your operation does many FLOPs per byte loaded from memory (high arithmetic intensity), it is compute-bound and you should optimize execution. If it does few FLOPs per byte (low arithmetic intensity), it is memory-bound and you should optimize memory access. The roofline model applies equally to CPUs and GPUs.
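The roofline model itself is a one-line formula. The sketch below uses assumed, roughly H100-SXM-class peak figures (FP32 ~67 TFLOPS, HBM3 ~3.35 TB/s); substitute your own hardware's numbers.

```python
# Minimal roofline model: attainable FLOP/s as a function of
# arithmetic intensity (FLOPs per byte moved from memory).
PEAK_FLOPS = 67e12        # FLOP/s (assumed FP32 peak)
PEAK_BANDWIDTH = 3.35e12  # bytes/s (assumed HBM peak)

def attainable_flops(arithmetic_intensity: float) -> float:
    """Roofline: min(compute roof, memory roof * intensity)."""
    return min(PEAK_FLOPS, PEAK_BANDWIDTH * arithmetic_intensity)

# The "ridge point" divides memory-bound from compute-bound kernels.
ridge = PEAK_FLOPS / PEAK_BANDWIDTH  # ~20 FLOP/byte on these numbers
for ai in (1.0, 10.0, ridge, 100.0):
    bound = "memory-bound" if ai < ridge else "compute-bound"
    print(f"AI={ai:6.1f} FLOP/byte -> {attainable_flops(ai)/1e12:6.1f} TFLOPS ({bound})")
```

Any kernel whose intensity falls left of the ridge point can only be sped up by moving fewer bytes; right of the ridge, only by issuing FLOPs faster.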

Avoid unnecessary memory traffic. Kernel fusion (combining multiple operations into one pass over the data) reduces memory traffic on both architectures. On CPUs, this means fewer cache misses. On GPUs, this means fewer global memory transactions. FlashAttention is a spectacular example of this principle applied to GPU kernels.
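The traffic savings from fusion are easy to estimate on the back of an envelope. This sketch assumes each unfused elementwise op reads and writes every element once; the array size, dtype, and op count are illustrative.

```python
# Back-of-envelope memory-traffic comparison: fused vs. unfused
# elementwise operations over one large array.
N = 1 << 28   # elements (illustrative)
BYTES = 4     # float32
NUM_OPS = 3   # e.g. scale, add-bias, activation

# Unfused: each op performs one full read + one full write of the array.
unfused = NUM_OPS * 2 * N * BYTES
# Fused: one read of the input, one write of the final result.
fused = 2 * N * BYTES

print(f"unfused: {unfused / 1e9:.1f} GB, fused: {fused / 1e9:.1f} GB "
      f"({unfused // fused}x less traffic when fused)")
```

For a memory-bound kernel chain, that traffic ratio translates almost directly into speedup, which is why fusion is the first optimization most compilers and kernel authors reach for.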

When CPU Optimization Intuitions Do NOT Transfer

“Sort your data to help the branch predictor.” This CPU-centric optimization has no GPU analogue. GPUs do not have branch predictors. Instead, you reorganize data to minimize warp divergence — which may or may not involve sorting.

“Use software prefetching for irregular access patterns.” GPUs lack CPU-style software prefetch (recent architectures add asynchronous copies and cache hints, but those serve staged data movement, not speculative fetching). Instead, you increase occupancy (more threads) to hide latency through parallelism. If your access pattern is irregular, the GPU handles it by switching to other warps while waiting, not by predicting and prefetching.

“Align data to cache line boundaries to avoid false sharing.” GPUs do not have the false sharing problem in the CPU sense, because GPU threads within a warp access memory cooperatively (coalescing), and different warps are not expected to access the same cache line simultaneously. The GPU equivalent concern is bank conflicts in shared memory, which occur when multiple threads in a warp access different addresses that map to the same shared memory bank.
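Bank conflicts follow directly from address arithmetic, so they can be checked with a toy model. This assumes the usual NVIDIA layout of 32 banks, each 4 bytes wide; the tile width and padding are illustrative.

```python
# Toy shared-memory bank-conflict checker (assumed 32 x 4-byte banks).
from collections import Counter

NUM_BANKS = 32

def max_conflict_degree(word_indices: list[int]) -> int:
    """Worst-case number of threads hitting the same bank in one access."""
    banks = Counter(idx % NUM_BANKS for idx in word_indices)
    return max(banks.values())

WIDTH = 32  # tile width in 4-byte words

# Column access of a 32-wide tile: thread t reads word t * WIDTH.
unpadded = [t * WIDTH for t in range(32)]
# Same access with one word of padding per row (row stride 33).
padded = [t * (WIDTH + 1) for t in range(32)]

print(max_conflict_degree(unpadded))  # 32: all threads hit bank 0, serialized
print(max_conflict_degree(padded))    # 1: conflict-free
```

The one-word row padding is the shared-memory analogue of the CPU's “pad to a cache line” trick: same cure (padding), different disease (bank serialization rather than coherence traffic).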

“Out-of-order execution will find independent work while waiting for memory.” GPU threads execute in order. There is no out-of-order machinery. Latency is hidden by switching warps, not by reordering instructions within a thread. This means GPU code must be written with instruction-level parallelism (ILP) in mind — independent instructions within a thread that the compiler can schedule to fill pipeline slots — but the hardware does not reorder dynamically.


CPU vs GPU Optimization Strategies

| Problem | CPU Strategy | GPU Strategy |
| --- | --- | --- |
| Memory latency | Prefetching + caches + OoO execution | Warp switching (occupancy) |
| Branch overhead | Branch prediction (97%+ accuracy) | Eliminate branches; minimize divergence |
| Data layout | AoS for single-thread locality; SoA for SIMD | SoA almost always (coalescing) |
| Parallel data conflicts | Avoid false sharing (pad to 64B) | Avoid bank conflicts (pad shared memory) |
| Memory bandwidth | Fit in cache; prefetch; sequential access | Coalesce; use shared memory; tile |
| Register pressure | Not a concern (fixed 16 GPRs, spill to L1) | Critical tradeoff with occupancy |
| Kernel fusion | Reduces cache misses | Reduces global memory transactions |
Note: Both architectures reward the same general principle (keep data close, reduce memory traffic) but through different mechanisms.

A Concrete Example: Matrix Multiplication

Matrix multiplication illustrates the divergence beautifully.

CPU approach: Block (tile) the matrices so that each tile fits in L1 cache. Iterate within the tile using sequential access, which the prefetcher handles. Use SIMD (AVX-512) to process 16 floats per instruction. The branch predictor handles loop control. Typical optimized GEMM on a modern CPU achieves 80-90% of peak FLOPS.

GPU approach: Each thread block loads a tile of both matrices into shared memory (explicit software-managed cache). Threads within the block cooperate to load the tile using coalesced global memory reads. Then each thread computes its portion of the output using data from shared memory (23 cycles) instead of global memory (500 cycles). Register tiling further reduces shared memory reads. No prefetching, no branch prediction involved. Typical optimized GEMM on a modern GPU achieves 70-80% of peak FLOPS.

Both approaches are tiling to keep data close to execution units. But the CPU relies on hardware caches and automatic prefetching, while the GPU relies on programmer-managed shared memory and explicit data movement. The CPU uses SIMD for data-level parallelism within one thread; the GPU uses SIMT for data-level parallelism across 32 threads. Same mathematical optimization (tiling), completely different hardware mechanisms.
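The shared mathematical core of both approaches can be shown in plain Python. This is a sketch of the tiling idea only: compute C = A × B one TILE × TILE block at a time, so each block of A and B is reused many times while it is “close” (L1 cache on a CPU, shared memory on a GPU). The tile and matrix sizes are illustrative; real GEMMs derive them from hardware limits.

```python
# Blocked (tiled) matrix multiply: the optimization both CPU and GPU
# GEMMs share, stripped of SIMD/SIMT machinery.
TILE = 4

def tiled_matmul(A, B, n):
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, TILE):
        for j0 in range(0, n, TILE):
            for k0 in range(0, n, TILE):
                # On a GPU, this is where a thread block would stage the
                # current A and B tiles into shared memory; on a CPU, the
                # tiles simply stay resident in L1 across the inner loops.
                for i in range(i0, i0 + TILE):
                    for j in range(j0, j0 + TILE):
                        acc = C[i][j]
                        for k in range(k0, k0 + TILE):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C

n = 8
A = [[float(i + j) for j in range(n)] for i in range(n)]
B = [[float(i - j) for j in range(n)] for i in range(n)]
ref = [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
       for i in range(n)]
assert tiled_matmul(A, B, n) == ref
print("tiled result matches naive result")
```

Everything architecture-specific — coalesced tile loads, shared-memory staging, register tiling, SIMD lanes — is an implementation detail layered on top of this same loop structure.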

The Convergence Trend

Modern hardware is slowly converging. Recent CPUs have added more SIMD width (AVX-512 processes 16 floats at once, approaching GPU warp-like parallelism within a single core). Intel’s AMX (Advanced Matrix Extensions) adds explicit 2D tiling instructions that look remarkably like GPU tensor core operations.

Recent GPUs have added more cache-like behavior. The NVIDIA H100’s L2 cache is 50 MB — larger than many CPU L3 caches. The L1/shared memory partition is configurable, allowing the programmer to trade explicit management for automatic caching. Texture caches have always provided automatic 2D spatial locality, similar to CPU L1 behavior.

But the fundamental distinction remains. CPUs still optimize for single-thread latency with complex prediction and speculation. GPUs still optimize for aggregate throughput with massive parallelism and simple scheduling. The convergence is at the margins; the core philosophies are still divergent as of 2025.

Choosing the Right Architecture

If your workload has complex control flow, pointer-heavy data structures, or limited parallelism (fewer than ~1000 independent operations), the CPU will almost certainly be faster. Its branch predictor, out-of-order engine, and deep cache hierarchy are exactly what you need.

If your workload applies the same operation to thousands or millions of data elements with minimal branching, the GPU will likely be 10-100x faster. Its massive parallelism, high memory bandwidth, and SIMT execution model are designed for exactly this case.

Many real workloads have both phases. The art of heterogeneous computing is knowing which parts belong on which architecture.

Measuring and Profiling: Different Tools, Same Mindset

The profiling approach is similar across architectures — identify the bottleneck (compute or memory), then optimize the bottleneck — but the tools and metrics differ.

CPU profiling: perf stat shows cache miss rates, branch misprediction rates, and IPC (instructions per cycle). An IPC below 1.0 usually indicates memory stalls; above 2.0 indicates good utilization. perf c2c detects false sharing. Intel VTune provides detailed cache hierarchy analysis.

GPU profiling: NVIDIA Nsight Compute shows achieved bandwidth as a percentage of peak, warp divergence rates, occupancy, shared memory bank conflicts, and register usage. The primary question is always: “Is this kernel compute-bound or memory-bound?” The roofline model (ncu --set roofline) gives the answer directly.


Key Profiling Metrics by Architecture

| What to Measure | CPU Metric | GPU Metric |
| --- | --- | --- |
| Memory bottleneck | L1/L2/L3 miss rates, IPC | Achieved bandwidth vs peak, memory throughput % |
| Branch/divergence overhead | Branch misprediction rate | Warp divergence efficiency |
| Parallelism utilization | IPC, SIMD utilization | Occupancy, warp scheduling efficiency |
| Data layout problems | Cache miss rates by access pattern | Global memory load/store efficiency (coalescing) |
| Shared data conflicts | perf c2c (false sharing) | Shared memory bank conflicts |
Note: Both architectures benefit from the same profiling mindset: find the bottleneck, understand why it is a bottleneck, and fix the data access pattern or algorithm.

Conclusion

CPU and GPU memory hierarchies solve the same fundamental problem — the processor is faster than memory — through opposite strategies. The CPU hides latency with prediction and caching, optimizing each thread’s experience at enormous transistor cost. The GPU hides latency with parallelism and scheduling, optimizing aggregate throughput with simple hardware.

Understanding both architectures, and understanding why they differ, is essential for anyone working at the intersection of systems and accelerated computing. The CPU optimization skills you have built — thinking in cache lines, respecting the memory hierarchy, minimizing data movement — all transfer to GPUs. But the mechanisms change. Cache lines become coalescing transactions. Branch prediction becomes warp-uniform control flow. Prefetching becomes occupancy. False sharing becomes bank conflicts. Hardware-managed caches become software-managed shared memory.

The unifying principle across both architectures is simple: know where your data lives, and minimize the cost of getting it to the execution units. The hardware details differ, but that principle never changes.