If you write CUDA kernels, memory is almost certainly your bottleneck. Not compute. Not synchronization. Memory. On a modern GPU like the H100, peak compute throughput is approximately 1,979 TFLOPS for FP16 tensor operations, but peak memory bandwidth is 3,350 GB/s. That ratio — roughly 590 FLOP per byte — means that any operation with arithmetic intensity below 590 needs data faster than HBM can deliver it. Matrix multiplications can sometimes exceed this threshold. Almost everything else — elementwise ops, reductions, normalization, attention — is memory-bound.
This post is the complete treatment of GPU memory. We will walk through every level of the hierarchy, from registers to HBM, with real measurements, code examples, and the optimization strategies that matter in practice. The goal is to give you a mental model precise enough to predict kernel performance before you write a single line of profiling code.
The 1000x Bandwidth Gap
The GPU memory hierarchy spans roughly three orders of magnitude in bandwidth and four in latency:
GPU Memory Hierarchy Comparison (H100 SXM)
| Memory Type | Size per SM | Total Size | Latency (cycles) | Bandwidth | Scope |
|---|---|---|---|---|---|
| Registers | 256 KB | ~33 MB (132 SMs) | 1 | ~100+ TB/s (aggregate) | Thread |
| Shared Memory (SRAM) | Up to 228 KB | ~29 MB (132 SMs) | ~20-30 | ~3+ TB/s (aggregate) | Block |
| L1 Cache | 256 KB (shared pool) | -- | ~30-40 | ~1.5-2 TB/s | SM |
| L2 Cache | -- | 50 MB | ~200 | ~12 TB/s | GPU |
| Global Memory (HBM3) | -- | 80 GB | ~400-600 | 3,350 GB/s | GPU + Host |
The critical insight: register access is free (1 cycle, zero overhead), shared memory is cheap (~25 cycles), and global memory is expensive (~500 cycles). A kernel that can keep its working data in registers and shared memory will be compute-bound. A kernel that constantly reads from HBM will be memory-bound. The vast majority of kernels are memory-bound.
On an A100, the compute-to-bandwidth ratio for FP16 tensor ops is roughly 153 FLOP/byte (312 TFLOPS / 2,039 GB/s). On H100, it rises to roughly 590 FLOP/byte (1,979 TFLOPS / 3,350 GB/s). Any kernel with arithmetic intensity below this threshold is memory-bound. For reference, elementwise operations have an intensity of ~1 FLOP/byte. Reductions: ~1 FLOP/byte. Layer normalization: ~5 FLOP/byte. Even matrix multiplication at small sizes can be memory-bound. Only large GEMMs consistently exceed the threshold.
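The ridge-point arithmetic is simple enough to encode directly. A host-side C++ sketch (the peak numbers plugged in below are the published specs quoted above):

```cpp
#include <cassert>

// Ridge point of the roofline: the FLOP/byte ratio at which a kernel
// transitions from memory-bound to compute-bound. Dividing peak TFLOPS
// by peak GB/s yields FLOP per byte.
double ridge_point(double peak_tflops, double peak_gbs) {
    return peak_tflops * 1e12 / (peak_gbs * 1e9);
}

// A kernel below the ridge point cannot keep the compute units fed.
bool is_memory_bound(double flops_per_byte, double ridge) {
    return flops_per_byte < ridge;
}
```

Plug in your kernel's estimated FLOP/byte to predict which side of the roofline it lands on before profiling.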
Architecture Comparison: V100 vs A100 vs H100
Before diving into each memory level, here is how the memory subsystem has evolved across three generations:
Memory Subsystem Evolution: V100, A100, H100
| Spec | V100 (Volta) | A100 (Ampere) | H100 (Hopper) |
|---|---|---|---|
| SMs | 80 | 108 | 132 |
| Registers per SM | 256 KB | 256 KB | 256 KB |
| Max shared memory per SM | 96 KB | 164 KB | 228 KB |
| L1 cache per SM | 128 KB (shared pool) | 192 KB (shared pool) | 256 KB (shared pool) |
| L2 cache total | 6 MB | 40 MB | 50 MB |
| HBM type | HBM2 | HBM2e | HBM3 |
| HBM capacity | 16-32 GB | 40-80 GB | 80 GB |
| HBM bandwidth | 900 GB/s | 2,039 GB/s | 3,350 GB/s |
| Memory bus width | 4096-bit | 5120-bit | 5120-bit |
| Async copy (cp.async) | No | Yes | Yes (enhanced TMA) |
| L2 residency control | No | Partial | Full |
Key evolutionary trends: shared memory capacity has more than doubled (96 KB to 228 KB), L2 cache has grown 8x (6 MB to 50 MB), and HBM bandwidth has nearly quadrupled (900 GB/s to 3,350 GB/s). Each generation makes the memory hierarchy deeper and wider, but the fundamental optimization principles remain the same. The biggest architectural addition is the Tensor Memory Accelerator (TMA) on Hopper, which offloads complex addressing patterns from the SM entirely.
Level 1: Registers — The Fastest Memory You Cannot See
The Basics
Every CUDA thread has access to a private set of registers. On all recent NVIDIA architectures (Volta, Ampere, Hopper), each SM has a 256 KB register file divided into 65,536 32-bit registers. These registers are allocated to threads at launch time and persist for the thread’s entire lifetime.
Registers are the fastest memory on the GPU:
- Latency: 1 cycle — the same as an arithmetic instruction
- Bandwidth: Effectively unlimited, since register reads/writes are part of the instruction pipeline
- Scope: Private to each thread (no sharing, no synchronization needed)
Any local variable in your CUDA kernel that the compiler can keep in a register is free to access. The cost is zero. This is fundamentally different from CPU programming where register allocation is mostly invisible — on GPUs, register usage directly impacts how many threads can run simultaneously.
Register Pressure and the Occupancy Tradeoff
Here is the core tension: each thread that uses more registers reduces the number of threads that can coexist on the same SM. The 65,536 registers per SM are shared among all active threads. If your kernel uses 64 registers per thread and you launch blocks of 256 threads, that is 64 × 256 = 16,384 registers per block. The SM can fit at most 65,536 / 16,384 = 4 such blocks, for 4 × 256 = 1,024 active threads. The maximum is 2,048 threads per SM on Ampere/Hopper, so you are at 50% occupancy.
Occupancy vs Registers per Thread (256 threads/block, Ampere)
| Registers per Thread | 16 | 32 | 48 | 64 | 80 | 96 | 128 | 256 |
|---|---|---|---|---|---|---|---|---|
| Occupancy (%) | 100 | 100 | 62.5 | 50 | 37.5 | 25 | 25 | 12.5 |

(Theoretical occupancy from the register limit alone.)
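These occupancy numbers follow from integer arithmetic on the SM's register budget. A host-side sketch (theoretical occupancy from the register limit alone; real occupancy also depends on shared memory usage and the per-SM block limit):

```cpp
#include <algorithm>
#include <cassert>

// Ampere-class SM limits: 65,536 registers, 2,048 resident threads max.
int active_threads(int regs_per_thread, int threads_per_block) {
    const int regs_per_sm = 65536, max_threads = 2048;
    // How many whole blocks fit in the register file?
    int blocks = regs_per_sm / (regs_per_thread * threads_per_block);
    // Thread count is capped by the hardware resident-thread limit.
    return std::min(blocks * threads_per_block, max_threads);
}

double occupancy_pct(int regs_per_thread, int threads_per_block) {
    return 100.0 * active_threads(regs_per_thread, threads_per_block) / 2048;
}
```

The 64-register case from the paragraph above comes out to exactly 50%.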
Lower occupancy means fewer warps available for the warp scheduler to choose from, which reduces the GPU’s ability to hide memory latency. When a warp stalls waiting for a global memory load (400-600 cycles), the scheduler switches to another ready warp. With fewer warps, there may not be enough ready warps to keep the execution units busy during these stalls.
However, occupancy is not the only factor. A kernel with 50% occupancy that keeps all data in registers can outperform a kernel with 100% occupancy that repeatedly loads from global memory. This is the register pressure vs occupancy tradeoff, and there is no universal answer — it depends on your kernel’s arithmetic intensity and memory access patterns.
Aim for at least 50% occupancy (or about 32+ active warps per SM). Below that, latency hiding becomes difficult. Above 50%, further occupancy gains typically yield diminishing returns. If you can keep critical data in registers at 50% occupancy instead of loading from shared or global memory at 100% occupancy, the register-heavy approach often wins.
Register Spilling: The Silent Catastrophe
When the compiler cannot fit all local variables into the available registers, it spills them to local memory. Despite the misleading name, local memory is not fast on-chip storage — it is physically located in global memory (HBM), backed by the L1 and L2 caches. A register spill turns a 1-cycle access into a potential 400-600 cycle access if it misses both caches.
You can detect register spilling in the compilation output:
# Compile with verbose output to see register usage and spills
nvcc -Xptxas -v my_kernel.cu
# Output includes lines like:
# ptxas info: Used 64 registers, 48 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
# If "spill stores" or "spill loads" are non-zero, you have a problem.
Even a few spilled registers can devastate performance. Each spill generates a store to local memory and a later load from local memory. If the spilled value is in a hot loop, this happens on every iteration. We have measured 3-5x slowdowns from register spilling in production kernels. Always compile with -Xptxas -v and check for spills.
Controlling Register Allocation with __launch_bounds__
You can tell the compiler the maximum number of threads per block and optionally the minimum number of blocks per SM. The compiler uses this information to optimize register allocation:
// Tell the compiler: this kernel will be launched with at most 256 threads/block
// and we want at least 2 blocks per SM
__global__ void __launch_bounds__(256, 2)
my_kernel(float *input, float *output, int n) {
// With 256 threads and min 2 blocks, the compiler knows it has at most
// 65536 / (256 * 2) = 128 registers per thread to work with.
// It will aggressively try to stay under this limit to avoid spilling.
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
float a = input[idx];
float b = a * a + 3.14159f;
float c = sqrtf(b) * a;
output[idx] = c;
}
}
Without __launch_bounds__, the compiler assumes worst-case scenarios and may use more registers than necessary (to avoid spills) or spill more than necessary (to maintain higher occupancy). Providing explicit bounds lets the compiler make better decisions.
// Another approach: directly limit registers per thread (CUDA 12+).
// This overrides the compiler's register allocation decisions;
// the per-file nvcc flag -maxrregcount=64 has a similar effect.
__global__ void __launch_bounds__(256)
__maxnreg__(64) // Hard limit: 64 registers per thread
my_kernel_constrained(float *data, int n) {
// If the kernel needs more than 64 registers, the compiler WILL spill.
// Use this only when you know the register budget precisely.
}
Impact of __launch_bounds__ on a Real Kernel (Softmax, A100)
| Configuration | Registers/Thread | Occupancy | Spills | Throughput |
|---|---|---|---|---|
| No launch bounds | 96 | 33% | 0 bytes | 1,200 GB/s |
| __launch_bounds__(256, 2) | 64 | 50% | 0 bytes | 1,550 GB/s |
| __launch_bounds__(256, 4) | 32 | 100% | 128 bytes | 1,100 GB/s |
The sweet spot is clear: __launch_bounds__(256, 2) at 50% occupancy outperformed both the unconstrained version (too few warps for latency hiding) and the over-constrained version (spilling negated the occupancy gain).
Level 2: Shared Memory (SRAM) — The Programmer’s Scratchpad
Architecture and Sizing
Shared memory is fast, on-chip SRAM that sits physically within the SM. It is shared by all threads in a thread block, making it ideal for cooperative data access patterns. Unlike registers (private, implicit) or caches (automatic, hardware-managed), shared memory is explicitly managed by the programmer — you decide what goes in, when it goes in, and how it is laid out.
The capacity has grown significantly across generations:
- V100 (Volta): Up to 96 KB per SM
- A100 (Ampere): Up to 164 KB per SM (configurable, shared with L1)
- H100 (Hopper): Up to 228 KB per SM (configurable, shared with L1)
On Ampere and Hopper, the on-chip memory is a unified pool that can be split between shared memory and L1 cache. You can configure the split per kernel:
// Request maximum shared memory for this kernel
cudaFuncSetAttribute(
my_kernel,
cudaFuncAttributeMaxDynamicSharedMemorySize,
163 * 1024 // 163 KB max per thread block on A100 (1 KB is reserved)
);
// Or set the preferred carveout ratio
cudaFuncSetAttribute(
my_kernel,
cudaFuncAttributePreferredSharedMemoryCarveout,
cudaSharedmemCarveoutMaxShared
);
Shared Memory / L1 Cache Configurations (A100)
| Config | Shared Memory | L1 Cache | Best For |
|---|---|---|---|
| Default | 48 KB | 144 KB | Most kernels -- streaming access benefits from large L1 |
| Max SMEM | 164 KB | 28 KB | Large tiles: GEMM, FlashAttention, convolution |
| Max L1 | 28 KB | 164 KB | Streaming kernels with poor data reuse |
Why Shared Memory Matters: Data Reuse
The core value proposition is data reuse without HBM round-trips. Load data from global memory into shared memory once, then access it many times on-chip: each reuse costs ~25 cycles instead of a 400-600 cycle HBM round-trip, and consumes no HBM bandwidth.
Kernel Performance: With vs Without Shared Memory Tiling (V100)
(chart: effective bandwidth in GB/s, with vs without tiling)

The GEMM result is the most striking: tiling with shared memory gives a 15x speedup because it converts the 2N global memory accesses per output element into 2N/T (for T×T tiles), with the rest served from shared memory. Each element loaded from HBM is reused T times, amortizing the cost of the slow global load.
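The reuse arithmetic is worth making concrete. A small host-side sketch of global loads per output element for an N×N GEMM, naive vs T×T tiling (a simplification that ignores boundary tiles):

```cpp
#include <cassert>

// Naive GEMM: each output element walks a full row of A and column of B,
// so 2*N global loads per output element.
long long naive_loads_per_output(long long N) { return 2 * N; }

// Tiled GEMM: a block of T*T threads loads 2*T*T elements per tile step,
// for N/T steps, and those loads are shared by T*T output elements,
// leaving 2*N/T global loads per output element.
long long tiled_loads_per_output(long long N, long long T) { return 2 * N / T; }
```

For N = 4096 and T = 32, tiling cuts global loads per output element from 8,192 to 256, a 32x reduction in HBM traffic.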
Bank Conflicts: The Hidden Performance Killer
Shared memory is organized into 32 banks, each 4 bytes wide. In a single clock cycle, the hardware can service one request per bank. When 32 threads in a warp each access a different bank, all accesses proceed in parallel — full throughput. But when multiple threads access different addresses that map to the same bank, those accesses are serialized.
The bank for a given byte address is: bank = (byte_address / 4) mod 32.
Two threads conflict when they access different 4-byte words in the same bank. One important exception: when multiple threads read the exact same address, the hardware broadcasts the value to all requesting threads at no extra cost.
Bank Conflict Impact on Shared Memory Bandwidth (V100)
| Conflict Degree | Access Pattern | Effective Bandwidth | Penalty |
|---|---|---|---|
| None (1-way) | Stride 1 -- consecutive | 2.8 TB/s | 1x (optimal) |
| 2-way | Stride 2 | 1.4 TB/s | 2x slower |
| 4-way | Stride 4 | 0.7 TB/s | 4x slower |
| 8-way | Stride 8 | 0.35 TB/s | 8x slower |
| 32-way | Stride 32 (all same bank) | 0.09 TB/s | 32x slower -- full serialization |
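The strides in the table map directly onto the bank formula. A host-side sketch that computes the conflict degree for a warp of 32 threads accessing data[tid * stride] (4-byte elements assumed; distinct words per bank are counted, so same-word broadcasts are conflict-free):

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <set>

// bank = (byte_address / 4) % 32: 32 banks, each 4 bytes wide.
int bank_of(unsigned byte_address) { return (byte_address / 4) % 32; }

// Worst-case serialization for 32 threads reading data[tid * stride].
int conflict_degree(int stride) {
    std::map<int, std::set<unsigned>> words_per_bank;
    for (int tid = 0; tid < 32; ++tid) {
        unsigned addr = tid * stride * 4u;
        words_per_bank[bank_of(addr)].insert(addr / 4); // distinct words only
    }
    int worst = 0;
    for (const auto& kv : words_per_bank)
        worst = std::max<int>(worst, (int)kv.second.size());
    return worst;
}
```

Stride 0 (all threads reading one word) comes out conflict-free, which is exactly the broadcast exception.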
Let us look at concrete examples:
__shared__ float data[1024];
// NO bank conflicts: stride 1 (consecutive access)
// Thread 0 -> data[0] (bank 0), Thread 1 -> data[1] (bank 1), ...
// Thread 31 -> data[31] (bank 31). All 32 banks used, zero conflicts.
float val = data[threadIdx.x];
// 2-WAY bank conflicts: stride 2
// Thread 0 -> data[0] (bank 0), Thread 1 -> data[2] (bank 2), ...
// Thread 16 -> data[32] (bank 0 again!). Banks 0,2,4,...30 each hit twice.
float val = data[threadIdx.x * 2];
// 32-WAY bank conflicts: stride 32 (catastrophic!)
// Thread 0 -> data[0] (bank 0), Thread 1 -> data[32] (bank 0), ...
// ALL threads hit bank 0. Fully serialized.
float val = data[threadIdx.x * 32];
The Padding Trick
The most common bank conflict occurs when accessing columns of a 2D shared memory array. If you have a 32x32 tile and threads access it column-wise, every thread hits the same bank:
// PROBLEM: Column access on a 32-wide array causes 32-way bank conflicts
__shared__ float tile[32][32];
// Thread k reads tile[k][col] for some fixed col.
// Address of tile[k][col] = (k * 32 + col) * 4 bytes
// Bank = (k * 32 + col) % 32 = col (for all k!)
// All 32 threads hit bank 'col'. Full serialization.
float val = tile[threadIdx.x][some_col];
// FIX: Pad each row by 1 element
__shared__ float tile[32][33]; // 33 instead of 32!
// Address of tile[k][col] = (k * 33 + col) * 4 bytes
// Bank = (k * 33 + col) % 32 = (k + col) % 32
// Different k values -> different banks. Zero conflicts!
float val = tile[threadIdx.x][some_col];
The cost of padding is 32 extra floats (128 bytes) per tile — completely negligible. The benefit can be 30%+ speedup.
For any shared memory array that will be accessed by column, declare it as float tile[N][N+1]. This eliminates all column-access bank conflicts at the cost of 4N bytes of wasted padding (one float per row). Works for any power-of-two N.
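The padding claim is easy to verify by enumerating banks. A host-side sketch counting how many distinct banks a column read touches, for row widths of 32 (unpadded) and 33 (padded):

```cpp
#include <cassert>
#include <set>

// Banks touched when 32 threads read column 'col' of a [32][W] float tile:
// thread k reads element (k, col), word index k*W + col, bank (k*W+col) % 32.
// W = 32: every thread hits the same bank. W = 33: all 32 banks are touched.
int banks_touched_column_read(int W, int col) {
    std::set<int> banks;
    for (int k = 0; k < 32; ++k)
        banks.insert((k * W + col) % 32);
    return (int)banks.size();
}
```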
Detecting Bank Conflicts
Nsight Compute reports bank conflicts directly:
ncu --metrics l1tex__data_bank_conflicts_pipe_lsu_mem_shared \
./my_kernel
If l1tex__data_bank_conflicts_pipe_lsu_mem_shared is high (over 10% of shared memory accesses), you have a problem worth fixing.
Async Copy: cp.async (Ampere+)
Traditionally, loading data from global to shared memory requires two steps: load from global memory into a register, then store the register into shared memory. This occupies registers and serializes the load-store sequence.
Ampere introduced cp.async, which copies data directly from global memory to shared memory without going through registers. This has two benefits: it frees up registers, and it allows overlap of the copy with computation on previously loaded data.
// Traditional: Global -> Register -> Shared (2 steps, uses registers)
__shared__ float smem[BLOCK_SIZE];
smem[threadIdx.x] = global[idx]; // Implicit: load to register, then store to smem
__syncthreads();
// Async (Ampere+): Global -> Shared directly (bypasses registers)
#include <cuda/pipeline>
__shared__ float smem[BLOCK_SIZE];
auto pipe = cuda::make_pipeline();
pipe.producer_acquire();
cuda::memcpy_async(&smem[threadIdx.x], &global[idx], sizeof(float), pipe);
pipe.producer_commit();
// ... do other useful work while the copy is in flight ...
pipe.consumer_wait(); // Block until copy completes
__syncthreads();
// Now smem is ready to use; call pipe.consumer_release() when finished with it
On Hopper, this is taken further with the Tensor Memory Accelerator (TMA), which can load multi-dimensional tiles from global memory to shared memory in a single instruction, handling all the address calculation and boundary checking in hardware.
Async Copy Impact on Tiled Kernels (A100)
| Kernel | Without cp.async | With cp.async | Speedup |
|---|---|---|---|
| GEMM (CUTLASS) | 280 TFLOPS | 295 TFLOPS | 1.05x |
| FlashAttention-2 | 145 TFLOPS | 156 TFLOPS | 1.07x |
| Custom reduction | 1,600 GB/s | 1,750 GB/s | 1.09x |
Common Shared Memory Patterns
Tiling (general): Load a tile of data from global memory into shared memory, process it, then move to the next tile. The key is that each element loaded from global memory is accessed multiple times from shared memory. GEMM is the canonical example: each element of the A and B tiles is used T times, where T is the tile width.
Reduction: Load N elements into shared memory, reduce in parallel in log2(N) steps. Each step halves the active threads. Use sequential addressing (not interleaved) to minimize bank conflicts in later steps and avoid warp divergence.
Stencil / convolution: Load a tile plus a halo region into shared memory. Each thread computes its output from the cached neighborhood. This avoids redundant global memory loads for overlapping receptive fields.
Histogram: Use shared memory for per-block partial histograms, then reduce across blocks in a second kernel. This avoids expensive atomic contention on global memory.
Level 3: L1 and L2 Caches
L1 Cache
The L1 cache is a per-SM hardware-managed cache. On Ampere and Hopper, it shares a physical SRAM pool with shared memory (the configurable split we discussed above). Key characteristics:
- Size: Up to 128-256 KB per SM (depending on shared memory configuration)
- Line size: 128 bytes (32 floats)
- Latency: ~30-40 cycles
- Scope: Per-SM — only threads on the same SM benefit from cached data
The L1 cache is primarily useful for:
- Register spills: Local memory (spilled registers) is cached in L1
- Stack frames: Function call frames and local arrays go through L1
- Texture loads: The texture pipeline uses L1 for spatial locality caching
- Global loads (when enabled): On Volta+, global loads can optionally go through L1
On Ampere, the L1 cache also serves as the staging area for cp.async operations.
L2 Cache
The L2 cache is a GPU-wide shared cache. All global memory traffic passes through L2 — there is no way to bypass it. This makes it the most important cache on the GPU.
- Size: 6 MB (V100), 40 MB (A100), 50 MB (H100)
- Line size: 128 bytes
- Latency: ~200 cycles
- Bandwidth: Much higher than HBM — up to ~12 TB/s on H100
- Scope: Entire GPU — all SMs share the same L2
The L2 cache matters enormously for kernels that have temporal locality — when multiple SMs (or multiple kernel launches) access the same data. Examples include:
- KV cache in LLM inference: The same cached keys/values are read by each attention head
- Embedding tables: Popular embeddings are accessed frequently
- Multi-pass algorithms: Kernel fusion cannot always eliminate intermediate buffers
L2 Cache Impact by Working Set Size (A100, streaming read kernel)
| Working Set | Fits in L2? | L2 Hit Rate | Effective Bandwidth | Relative Perf |
|---|---|---|---|---|
| 4 MB | Yes | ~98% | ~5,000 GB/s | 2.8x |
| 20 MB | Yes | ~90% | ~4,000 GB/s | 2.2x |
| 40 MB | Barely | ~70% | ~2,500 GB/s | 1.4x |
| 200 MB | No | ~5% | ~1,800 GB/s | 1.0x (baseline) |
| 2 GB (random) | No | ~1% | ~300 GB/s | 0.17x |
L2 Residency Control (Ampere and Hopper)
Starting with Ampere, NVIDIA provides APIs to hint which data should persist in L2 and which should stream through:
// Tell the hardware: keep kv_cache in L2 as long as possible.
// The access policy window is attached to a stream; kernels launched
// on that stream inherit the persistence hint.
cudaStreamAttrValue attr = {};
attr.accessPolicyWindow.base_ptr = kv_cache_ptr;
attr.accessPolicyWindow.num_bytes = kv_cache_size; // Must fit the persisting L2 carveout
attr.accessPolicyWindow.hitRatio = 1.0f; // Try to keep 100% of this data in L2
attr.accessPolicyWindow.hitProp = cudaAccessPropertyPersisting; // Persist in L2
attr.accessPolicyWindow.missProp = cudaAccessPropertyStreaming; // Misses stream through
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
On Hopper, this is enhanced with more granular control. You can partition the L2 into regions dedicated to specific data, ensuring that high-value data (like KV caches) is never evicted by streaming loads from other kernels.
The L1 and L2 caches both use 128-byte cache lines. When a warp issues a global memory load, the hardware fetches entire 128-byte lines. If 32 threads each load a consecutive 4-byte float, that is exactly 32 × 4 = 128 bytes = 1 cache line = 1 memory transaction. If threads load non-consecutive addresses, each unique cache line touched requires a separate transaction. This is why coalescing matters so much — it is fundamentally about minimizing the number of cache lines fetched.
Level 4: Global Memory (HBM) — The Bandwidth Bottleneck
HBM Basics
Global memory is the large off-chip DRAM attached to the GPU via High Bandwidth Memory (HBM) stacks. It is the only memory visible to the host (CPU) and the primary storage for all kernel data.
- Capacity: 16-80 GB depending on GPU model
- Technology: HBM2 (V100), HBM2e (A100), HBM3 (H100)
- Peak bandwidth: 900 GB/s (V100), 2,039 GB/s (A100), 3,350 GB/s (H100)
- Latency: 400-600 cycles — the GPU hides this with massive thread parallelism
- Bus width: 4096-5120 bits — wide bus, but still the bottleneck for most kernels
The latency of 400-600 cycles sounds devastating, but GPUs are designed to tolerate it. With thousands of active threads per SM, the warp scheduler can switch to a ready warp whenever the current warp stalls on a memory access. This is why occupancy matters — enough active warps must exist to cover the latency gap.
The math: if a memory access takes 500 cycles and the SM can issue 1 instruction per cycle, you need at least 500 instructions (from other warps) in flight to fully hide the latency. With 32 threads per warp and, say, 8 instructions per warp between memory accesses, you need roughly 500 / 8 ≈ 64 warps to keep the SM fully utilized. That is why a single SM supports up to 64 warps (2,048 threads).
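That latency-hiding estimate can be written down directly (a sketch; "instructions per warp between memory accesses" is workload-dependent):

```cpp
#include <cassert>

// Warps needed to cover a memory stall: with 'latency_cycles' to hide and
// 'instrs_per_warp' independent instructions per warp between memory
// accesses, the scheduler needs roughly latency / instrs other warps
// ready to issue (rounded up).
int warps_to_hide_latency(int latency_cycles, int instrs_per_warp) {
    return (latency_cycles + instrs_per_warp - 1) / instrs_per_warp;
}
```

At 500 cycles and 8 instructions per warp this lands right at the hardware's 64-warp-per-SM limit, which is not a coincidence.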
Memory Coalescing: THE Most Important Optimization
Memory coalescing is the single most impactful optimization for GPU kernels. Getting it right can mean 10-20x performance differences. Getting it wrong means your kernel runs at a fraction of peak bandwidth.
How coalescing works: When threads in a warp execute a load or store instruction, the hardware combines (coalesces) their individual memory requests into the minimum number of 128-byte cache line transactions. The best case: all 32 threads access consecutive 4-byte addresses aligned to a 128-byte boundary. This produces exactly 1 transaction. The worst case: all 32 threads access addresses in different cache lines, producing up to 32 transactions.
// PERFECT COALESCING: 1 transaction per warp
// Thread 0 -> data[0], Thread 1 -> data[1], ..., Thread 31 -> data[31]
// All addresses in one 128-byte line: data[0..31] = 128 bytes
__global__ void coalesced(float *data, float *out, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
out[idx] = data[idx] * 2.0f; // Consecutive threads, consecutive addresses
}
// STRIDED ACCESS: up to min(stride, 32) transactions per warp
// Thread 0 -> data[0], Thread 1 -> data[stride], Thread 2 -> data[2*stride], ...
__global__ void strided(float *data, float *out, int n, int stride) {
int idx = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
out[idx] = data[idx] * 2.0f; // Stride-N access: each thread hits a different cache line
}
// RANDOM ACCESS: up to 32 transactions per warp
// Thread 0 -> data[random[0]], Thread 1 -> data[random[1]], ...
__global__ void random_access(float *data, float *out, int *indices, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
out[idx] = data[indices[idx]] * 2.0f; // Random: worst case 32 transactions
}
Global Memory Bandwidth: Coalescing Impact (A100)
| Access Pattern | Transactions per Warp | Achieved Bandwidth | % of Peak |
|---|---|---|---|
| Coalesced (stride 1) | 1 | ~1,800 GB/s | 88% |
| Stride 2 | 2 | ~900 GB/s | 44% |
| Stride 4 | 4 | ~450 GB/s | 22% |
| Stride 8 | 8 | ~230 GB/s | 11% |
| Stride 32 | 32 | ~60 GB/s | 3% |
| Random (scattered) | ~28 avg | ~70 GB/s | 3.5% |
The numbers are stark. Coalesced access achieves 88% of peak bandwidth. Random access achieves 3.5%. That is a 25x difference from a single optimization.
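The transaction counts in the table follow from cache-line geometry alone. A host-side sketch counting how many 128-byte lines one warp-wide float load touches at a given stride:

```cpp
#include <cassert>
#include <set>

// Number of 128-byte cache lines (memory transactions) touched when a
// warp of 32 threads each loads a 4-byte float at data[tid * stride].
int transactions_per_warp(int stride) {
    std::set<unsigned> lines;
    for (int tid = 0; tid < 32; ++tid)
        lines.insert((tid * stride * 4u) / 128); // which 128-byte line?
    return (int)lines.size();
}
```

Stride 1 gives 1 transaction; stride 32 gives 32, matching the 32x bandwidth collapse in the table.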
AoS vs SoA: The Data Layout That Determines Performance
The most common coalescing failure in real codebases comes from Array-of-Structures (AoS) vs Structure-of-Arrays (SoA) layout decisions.
AoS vs SoA Memory Layout for Particles
// AoS: Array of Structures -- natural for CPU, terrible for GPU
struct Particle {
float x, y, z, mass;
};
Particle *particles; // particles[i].x, particles[i].y, ...
__global__ void scale_positions_aos(Particle *p, float damp, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Thread 0 reads p[0].x at address 0
        // Thread 1 reads p[1].x at address 16 (sizeof(Particle) = 16)
        // Stride of 4 floats! Only 25% of fetched bytes are useful.
        p[i].x *= damp;  // same pattern applies to y and z
    }
}
// SoA: Structure of Arrays -- optimal for GPU
struct ParticlesSoA {
    float *x, *y, *z, *mass;
};
__global__ void scale_positions_soa(ParticlesSoA p, float damp, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Thread 0 reads x[0] at address 0
        // Thread 1 reads x[1] at address 4
        // Perfect stride-1 access! 100% of fetched bytes are useful.
        p.x[i] *= damp;  // same pattern applies to y and z
    }
}
AoS vs SoA Performance (Particle update, 10M particles, A100)
| Layout | Achieved Bandwidth | Useful Bandwidth | Speedup |
|---|---|---|---|
| AoS (stride 4) | ~1,600 GB/s total | ~400 GB/s useful | 1x |
| SoA (stride 1) | ~1,700 GB/s total | ~1,700 GB/s useful | 4x |
Array-of-Structures layout is the default in most C/C++ codebases because it is natural and cache-friendly on CPUs. On GPUs, it causes strided access patterns that waste 50-75% or more of memory bandwidth. When porting CPU code to CUDA, converting AoS to SoA is often the single highest-impact change you can make. Always default to SoA for GPU data structures.
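One practical migration step is converting buffers on the host before upload, so kernels see stride-1 arrays. A hypothetical helper (std::vector-based, field names mirroring the particle example above) sketching the AoS-to-SoA conversion:

```cpp
#include <cassert>
#include <vector>

// Host-side mirrors of the device layouts from the example above.
struct Particle { float x, y, z, mass; };
struct ParticlesSoAHost { std::vector<float> x, y, z, mass; };

// One-time O(n) reshuffle on the CPU; every subsequent GPU pass then
// gets coalesced stride-1 access instead of stride-4.
ParticlesSoAHost to_soa(const std::vector<Particle>& aos) {
    ParticlesSoAHost s;
    s.x.reserve(aos.size()); s.y.reserve(aos.size());
    s.z.reserve(aos.size()); s.mass.reserve(aos.size());
    for (const Particle& p : aos) {
        s.x.push_back(p.x); s.y.push_back(p.y);
        s.z.push_back(p.z); s.mass.push_back(p.mass);
    }
    return s;
}
```

The conversion cost is paid once; the coalescing benefit is paid back on every kernel launch that touches the data.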
Vectorized Loads: float4
Loading 4 floats at once (128 bits) instead of 1 float (32 bits) reduces instruction count and improves memory throughput:
// Scalar: each thread loads 1 float (4 bytes)
__global__ void kernel_scalar(float *in, float *out, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
out[idx] = in[idx] * 2.0f;
}
// Vectorized: each thread loads 4 floats (16 bytes)
__global__ void kernel_vec4(float4 *in, float4 *out, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
float4 val = in[idx];
val.x *= 2.0f; val.y *= 2.0f; val.z *= 2.0f; val.w *= 2.0f;
out[idx] = val;
}
Bandwidth by Load Width (A100, elementwise kernel)
(chart: achieved GB/s for 4-, 8-, and 16-byte loads)

Use float4 loads whenever your data is 16-byte aligned and you process 4 contiguous elements per thread. This matches the GPU’s natural memory transaction width and reduces the number of load/store instructions by 4x. Most elementwise kernels (activations, normalization, residual connections) benefit from vectorization. The caveat: your array length must be divisible by 4, or you need a scalar tail loop.
The Canonical Example: Matrix Transpose
Matrix transpose is the textbook example of memory optimization because the naive version has fundamentally non-coalesced writes, and shared memory tiling fixes this elegantly.
The problem: Reading row-major input by rows is coalesced (consecutive threads read consecutive columns within the same row). But writing the transposed output by columns is non-coalesced (consecutive threads write to addresses that are N elements apart, where N is the matrix width).
// Naive transpose: coalesced reads, NON-coalesced writes
__global__ void transpose_naive(float *out, const float *in, int N) {
int x = blockIdx.x * 32 + threadIdx.x; // column
int y = blockIdx.y * 32 + threadIdx.y; // row
if (x < N && y < N) {
// Read: in[y * N + x] -- consecutive x for consecutive threads -> coalesced
// Write: out[x * N + y] -- consecutive x means stride-N writes -> NOT coalesced!
out[x * N + y] = in[y * N + x];
}
}
The fix: Use shared memory as a transposition buffer. Load a tile with coalesced reads, transpose it in shared memory (fast, on-chip), then write the transposed tile with coalesced writes.
// Tiled transpose: both reads AND writes are coalesced
__global__ void transpose_tiled(float *out, const float *in, int N) {
__shared__ float tile[32][33]; // Padded to avoid bank conflicts!
// Phase 1: Coalesced read from input into shared memory tile
int x_in = blockIdx.x * 32 + threadIdx.x;
int y_in = blockIdx.y * 32 + threadIdx.y;
if (x_in < N && y_in < N) {
tile[threadIdx.y][threadIdx.x] = in[y_in * N + x_in];
}
__syncthreads();
// Phase 2: Coalesced write from shared memory tile to output
// Note the swapped block indices!
int x_out = blockIdx.y * 32 + threadIdx.x;
int y_out = blockIdx.x * 32 + threadIdx.y;
if (x_out < N && y_out < N) {
// Read tile[threadIdx.x][threadIdx.y] -- column read, but padded so no bank conflict
// Write out[y_out * N + x_out] -- consecutive threadIdx.x -> consecutive x_out -> coalesced!
out[y_out * N + x_out] = tile[threadIdx.x][threadIdx.y];
}
}
The trick is subtle but powerful: by swapping blockIdx.x and blockIdx.y for the output coordinates, consecutive threads now write to consecutive output addresses. The transposition happens in shared memory, where the access pattern does not matter for coalescing (shared memory has no coalescing requirement — only bank conflicts matter, which we handle with padding).
Matrix Transpose Performance (4096x4096, V100)
| Implementation | Bandwidth (GB/s) | Efficiency | Key Detail |
|---|---|---|---|
| Naive (non-coalesced write) | 120 | 13% | Stride-N writes destroy coalescing |
| SMEM tiled (no padding) | 580 | 64% | Coalesced R/W but 32-way bank conflicts |
| SMEM tiled + padding [32][33] | 750 | 83% | Full coalescing, zero bank conflicts |
| V100 theoretical peak | 900 | 100% | Hardware limit |
The Optimization Hierarchy: What to Fix First
Not all memory optimizations are equal. Here is the priority order, with approximate impact ranges:
Cumulative Impact of Memory Optimizations (starting from naive kernel)
(chart: achieved bandwidth in GB/s, A100)

The message is clear:
- Fix coalescing first (~7-10x impact). Convert AoS to SoA. Ensure consecutive threads access consecutive memory addresses. This alone gets you from 10% to 75%+ of peak bandwidth.
- Add shared memory tiling (~2-15x for kernels with data reuse). If each element loaded from global memory is used more than once, tiling through shared memory eliminates redundant global loads. The benefit scales with the reuse factor.
- Vectorize loads (~1.1-1.25x). Use float4 where possible. Easy to implement, modest but consistent gain.
- Fix bank conflicts (~1.1-1.5x in shared memory-heavy kernels). Add padding to 2D shared memory arrays.
- Tune register usage (~1.0-1.3x). Use __launch_bounds__ to control occupancy. Eliminate spills. This is a fine-tuning step.
We have seen engineers spend days optimizing bank conflicts in shared memory while their kernel has a fundamental coalescing problem in global memory. Bank conflict optimization is worthless if the kernel is bottlenecked on non-coalesced global loads. Always profile first, identify the actual bottleneck, then optimize in priority order.
Practical Profiling with Nsight Compute
Nsight Compute (ncu) is the definitive tool for understanding GPU memory behavior. Here are the key metrics and what they tell you.
Step 1: Is Your Kernel Memory-Bound or Compute-Bound?
ncu --metrics \
sm__throughput.avg.pct_of_peak_sustained_active,\
gpu__dram_throughput.avg.pct_of_peak_sustained_active \
./my_kernel
- If gpu__dram_throughput is > 60% and sm__throughput is < 40%: memory-bound. Optimize memory access patterns.
- If sm__throughput is > 60% and gpu__dram_throughput is < 40%: compute-bound. Optimize arithmetic.
- If both are low: you have a latency problem (insufficient occupancy, synchronization overhead, or kernel launch overhead).
Step 2: Check Coalescing
ncu --metrics \
l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum,\
l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum \
./my_kernel
The ratio sectors / requests tells you the average number of 32-byte sectors fetched per load request. For perfectly coalesced float loads, each warp request should generate exactly 4 sectors (128 bytes / 32 bytes per sector = 4). If the ratio is much higher (e.g., 16 or 32), you have a coalescing problem.
- Ideal: sectors/requests ≈ 4 for 4-byte types, ≈ 8 for 8-byte types
- Bad: sectors/requests ≈ 32 means essentially every thread is fetching its own 32-byte sector — the warp's accesses are fully scattered
Step 3: Check Memory Throughput
ncu --metrics \
dram__bytes_read.sum,\
dram__bytes_write.sum,\
gpu__time_duration.sum \
./my_kernel
Calculate achieved bandwidth:

achieved BW = (dram__bytes_read.sum + dram__bytes_write.sum) / gpu__time_duration.sum

Compare this to peak HBM bandwidth. Targets:
Bandwidth Efficiency Assessment
| Achieved % of Peak | Assessment | Action |
|---|---|---|
| over 80% | Excellent -- near hardware limit | Move on to compute optimization or algorithmic improvements |
| 60-80% | Good -- minor improvements possible | Check vectorized loads, alignment, L2 utilization |
| 30-60% | Mediocre -- significant waste | Likely coalescing issues or L2 thrashing |
| under 30% | Poor -- major access pattern problem | Fix coalescing first. Check for AoS layout or strided access. |
Step 4: Check Shared Memory Efficiency
ncu --metrics \
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum,\
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum,\
l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum \
./my_kernel
If bank conflicts are > 10% of wavefronts, apply the padding trick to your shared memory arrays.
Step 5: Check for Register Spilling
# At compile time:
nvcc -Xptxas -v my_kernel.cu 2>&1 | grep "spill"
# At runtime with ncu:
ncu --metrics \
l1tex__t_sectors_pipe_lsu_mem_local_op_ld.sum,\
l1tex__t_sectors_pipe_lsu_mem_local_op_st.sum \
./my_kernel
If local memory load/store sectors are non-zero, the kernel is spilling registers. Add __launch_bounds__ or reduce per-thread register pressure.
Putting It All Together: A Real Optimization Walkthrough
Let us walk through optimizing a softmax kernel as a concrete example. Softmax computes softmax(x_i) = exp(x_i − max(x)) / Σ_j exp(x_j − max(x)) for each row of an input matrix. It requires three passes over each row: find the max, compute the exponentials and sum, normalize.
Version 1: Naive
__global__ void softmax_naive(float *input, float *output, int rows, int cols) {
int row = blockIdx.x;
if (row >= rows) return;
float *in_row = input + row * cols;
float *out_row = output + row * cols;
// Pass 1: find max (thread 0 only -- terrible!)
float max_val = -INFINITY;
if (threadIdx.x == 0) {
for (int j = 0; j < cols; j++) {
max_val = fmaxf(max_val, in_row[j]);
}
}
__shared__ float s_max;
if (threadIdx.x == 0) s_max = max_val;
__syncthreads();
// Pass 2: exp and sum (each thread handles one element -- poor reuse)
float sum = 0.0f;
for (int j = threadIdx.x; j < cols; j += blockDim.x) {
float val = expf(in_row[j] - s_max);
out_row[j] = val;
sum += val;
}
// Reduce sum across threads (naive: serial atomic)
__shared__ float s_sum;
if (threadIdx.x == 0) s_sum = 0.0f;
__syncthreads();
atomicAdd(&s_sum, sum);
__syncthreads();
// Pass 3: normalize
for (int j = threadIdx.x; j < cols; j += blockDim.x) {
out_row[j] /= s_sum;
}
}
Problems: Single-thread max reduction, 3 passes over global memory, atomic reduction, no vectorization.
Version 2: Optimized with Shared Memory, Warp Reductions, float4
__global__ void __launch_bounds__(256, 2)
softmax_optimized(float *input, float *output, int rows, int cols) {
int row = blockIdx.x;
if (row >= rows) return;
const int tid = threadIdx.x;
const int BLOCK = blockDim.x; // 256
    const int vec_cols = cols / 4; // assumes cols % 4 == 0 and 16-byte aligned rows
float4 *in_row = reinterpret_cast<float4*>(input + row * cols);
float4 *out_row = reinterpret_cast<float4*>(output + row * cols);
// Pass 1: find row max (all threads participate, vectorized loads)
float local_max = -INFINITY;
for (int j = tid; j < vec_cols; j += BLOCK) {
float4 v = in_row[j]; // Coalesced float4 load
local_max = fmaxf(local_max, fmaxf(fmaxf(v.x, v.y), fmaxf(v.z, v.w)));
}
// Warp-level reduction for max (no shared memory needed!)
for (int offset = 16; offset > 0; offset >>= 1) {
local_max = fmaxf(local_max, __shfl_down_sync(0xffffffff, local_max, offset));
}
// Block-level reduction via shared memory
__shared__ float s_warp_max[8]; // 256 threads / 32 = 8 warps
int warp_id = tid / 32;
int lane = tid % 32;
if (lane == 0) s_warp_max[warp_id] = local_max;
__syncthreads();
float row_max;
if (tid < 8) {
row_max = s_warp_max[tid];
for (int offset = 4; offset > 0; offset >>= 1) {
row_max = fmaxf(row_max, __shfl_down_sync(0xff, row_max, offset));
}
if (tid == 0) s_warp_max[0] = row_max;
}
__syncthreads();
row_max = s_warp_max[0];
// Pass 2: compute exp(x - max) and partial sums (vectorized)
float local_sum = 0.0f;
for (int j = tid; j < vec_cols; j += BLOCK) {
float4 v = in_row[j];
v.x = expf(v.x - row_max);
v.y = expf(v.y - row_max);
v.z = expf(v.z - row_max);
v.w = expf(v.w - row_max);
out_row[j] = v; // Coalesced float4 store
local_sum += v.x + v.y + v.z + v.w;
}
// Reduce sum (same warp + shared memory pattern as above)
for (int offset = 16; offset > 0; offset >>= 1) {
local_sum += __shfl_down_sync(0xffffffff, local_sum, offset);
}
__shared__ float s_warp_sum[8];
if (lane == 0) s_warp_sum[warp_id] = local_sum;
__syncthreads();
float row_sum;
if (tid < 8) {
row_sum = s_warp_sum[tid];
for (int offset = 4; offset > 0; offset >>= 1) {
row_sum += __shfl_down_sync(0xff, row_sum, offset);
}
if (tid == 0) s_warp_sum[0] = row_sum;
}
__syncthreads();
float inv_sum = 1.0f / s_warp_sum[0];
// Pass 3: normalize (vectorized)
for (int j = tid; j < vec_cols; j += BLOCK) {
float4 v = out_row[j];
v.x *= inv_sum;
v.y *= inv_sum;
v.z *= inv_sum;
v.w *= inv_sum;
out_row[j] = v;
}
}
Softmax Optimization Results (1024x4096, A100)
| Version | Time (us) | Bandwidth (GB/s) | Technique |
|---|---|---|---|
| Naive | 89 | 380 | Serial max, atomics, scalar loads |
| + Parallel max | 42 | 800 | All threads participate in reduction |
| + Warp shuffles | 35 | 960 | Eliminate shared memory for intra-warp reduction |
| + float4 loads | 28 | 1,200 | 4x fewer load instructions |
| + __launch_bounds__ | 25 | 1,340 | Better register allocation, 50% occupancy |
| PyTorch built-in | 27 | 1,240 | Reference (CUB-based) |
The optimized kernel achieves 1,340 GB/s on an A100 with 2,039 GB/s peak — 66% of peak bandwidth. This is close to the practical limit for a 3-pass algorithm: the input is read twice (passes 1 and 2) and the output is written twice (passes 2 and 3), so the effective data movement is ~4x the matrix size. The remaining gap is from instruction overhead and imperfect occupancy.
Texture and Constant Memory
These are specialized memory types worth mentioning briefly.
Constant Memory
Constant memory is a 64 KB read-only region cached in a dedicated constant cache. Its key property: when all threads in a warp read the same address, the value is broadcast to all threads in a single cycle. When threads read different addresses, accesses are serialized.
__constant__ float filter_weights[256];
__global__ void apply_filter(float *data, float *output, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// All threads read the same weight -- broadcast, very fast
float w = filter_weights[0];
output[idx] = data[idx] * w;
}
Use case: Kernel parameters, lookup tables, filter coefficients — any small, read-only data that all threads access uniformly. If threads access different addresses, constant memory is slower than global memory due to serialization.
Texture Memory
Texture memory provides cached read-only access optimized for 2D spatial locality. The texture cache is separate from L1/L2 and uses a space-filling curve layout to keep 2D-neighboring elements close in cache.
// Modern CUDA uses texture objects (not the deprecated texture references)
cudaTextureObject_t tex;
// ... setup code ...
__global__ void texture_kernel(cudaTextureObject_t tex, float *output, int w, int h) {
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
// Hardware handles boundary conditions and interpolation
float val = tex2D<float>(tex, x + 0.5f, y + 0.5f);
output[y * w + x] = val;
}
Use case: Image processing, volumetric rendering, any workload with 2D or 3D spatial access patterns. For linear algebra and ML workloads, texture memory is rarely useful — global memory with proper coalescing is faster.
Common Anti-Patterns and Fixes
Common Memory Anti-Patterns and Their Fixes
| Anti-Pattern | Achieved BW | Fix | Fixed BW | Speedup |
|---|---|---|---|---|
| AoS layout | ~400 GB/s | Convert to SoA | ~1,600 GB/s | 4x |
| Column-major access of row-major data | ~200 GB/s | Transpose or tile through SMEM | ~1,500 GB/s | 7.5x |
| Scalar loads (float) | ~1,400 GB/s | Vectorize to float4 | ~1,750 GB/s | 1.25x |
| Misaligned base address | ~1,400 GB/s | Align allocations to 128 bytes | ~1,750 GB/s | 1.25x |
| Small irregular gathers | ~100 GB/s | Sort indices + batch into coalesced runs | ~800 GB/s | 8x |
| 32-way SMEM bank conflict | ~90 GB/s | Pad array: [N][N+1] | ~2,800 GB/s | 31x |
| Register spilling | Varies | __launch_bounds__, reduce local vars | Varies | 1.5-5x |
Summary
GPU memory optimization is not a dark art — it follows a clear, measurable hierarchy of priorities:
1. Understand the hierarchy. Registers (free) > shared memory (cheap) > L1/L2 cache (moderate) > global memory (expensive). The goal is to minimize traffic to the expensive levels.
2. Fix coalescing first. This is the 10x optimization. Ensure consecutive threads access consecutive addresses. Convert AoS to SoA. This single change often transforms a kernel from 10% to 80% of peak bandwidth.
3. Add shared memory tiling where there is data reuse. If each global memory element is used more than once, load it into shared memory once and reuse it. Pad arrays as `[N][N+1]` to eliminate bank conflicts.
4. Vectorize loads. Use `float4` for aligned, contiguous data. This reduces instruction count and improves memory throughput.
5. Tune register allocation. Use `__launch_bounds__` to control occupancy. Check for and eliminate register spilling. This is a fine-tuning step, not a first step.
6. Profile with Nsight Compute. Do not guess. Measure sectors/requests for coalescing, check DRAM throughput percentage, look at bank conflicts, verify absence of register spills. The profiler tells you exactly where the bottleneck is.
7. Think about cache. Size your working set to fit in L2 when possible. Use L2 persistence hints for hot data on Ampere+. Be aware that the 128-byte cache line size means every misaligned or scattered access wastes cache capacity.
The memory hierarchy has not changed in principle across three generations of GPUs. HBM gets faster, caches get larger, shared memory grows — but the relative gaps remain. Registers are still ~1000x faster than HBM. Coalescing still determines whether you use 3% or 90% of peak bandwidth. The engineer who understands memory will always write faster kernels than the engineer who does not.