The GPU memory hierarchy determines kernel performance more than any other factor. A kernel whose working set fits in L1 cache (up to 256 KB per SM on H100) runs at roughly 19 TB/s aggregate bandwidth across all SMs. The same kernel accessing data in L2 (50 MB on H100) runs at approximately 12 TB/s. Hit HBM (80 GB on H100), and you are at 3.35 TB/s. The 5.7x bandwidth gap between L1 and HBM means that understanding cache behavior (what hits, what misses, and why) directly determines whether a kernel achieves 20% or 90% of peak performance.
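How much those hit rates matter can be sketched with a simple mixing model; the bandwidth figures are the H100 numbers used throughout this post, and the hit fractions below are illustrative, not measured:

```python
def effective_bandwidth_tb_s(l1_hit, l2_hit, bw_l1=19.0, bw_l2=12.0, bw_hbm=3.35):
    """Hit-rate-weighted bandwidth: each byte is served by exactly one
    level, so time per byte is the weighted sum of each level's 1/BW."""
    hbm_frac = 1.0 - l1_hit - l2_hit
    time_per_byte = l1_hit / bw_l1 + l2_hit / bw_l2 + hbm_frac / bw_hbm
    return 1.0 / time_per_byte

print(f"all HBM:       {effective_bandwidth_tb_s(0.0, 0.0):.2f} TB/s")
print(f"90% L1, 8% L2: {effective_bandwidth_tb_s(0.9, 0.08):.2f} TB/s")
```

Even a few percent of HBM traffic pulls the effective bandwidth far below the L1 peak, which is why cache-friendly access patterns dominate kernel tuning.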
This post maps the complete GPU memory hierarchy, documents cache line sizes and eviction policies, analyzes how working set size affects effective bandwidth, and implements experiments to measure cache hit rates under different access patterns.
The Memory Hierarchy
Overview
GPU Memory Hierarchy (H100 SXM)
| Level | Size | Bandwidth | Latency (cycles) | Scope |
|---|---|---|---|---|
| Registers | 256 KB per SM | ~19 TB/s (aggregate) | 0 (same cycle) | Per-thread |
| Shared Memory | Up to 228 KB per SM | ~19 TB/s (aggregate) | ~20-30 | Per-block |
| L1 Cache | Up to 256 KB per SM (shared pool) | ~19 TB/s (aggregate) | ~30 | Per-SM |
| L2 Cache | 50 MB | ~12 TB/s | ~200-400 | GPU-wide |
| HBM3 | 80 GB | 3.35 TB/s | ~400-800 | GPU-wide |
| PCIe/NVLink | 64/900 GB/s | N/A | ~10,000+ | System |
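The bandwidth and latency columns are connected by Little's law: to sustain bandwidth B at latency L, the GPU must keep B x L bytes of traffic in flight, which is why GPUs need massive thread-level parallelism. A quick estimate (assuming a ~1.6 GHz SM clock for the cycle-to-time conversion; actual clocks vary):

```python
def bytes_in_flight(bw_gb_s, latency_cycles, clock_ghz=1.6):
    """Little's law: concurrency (bytes) = bandwidth * latency."""
    latency_s = latency_cycles / (clock_ghz * 1e9)
    return bw_gb_s * 1e9 * latency_s

# HBM at 3350 GB/s and ~600 cycles: over a megabyte of loads
# must be outstanding GPU-wide at all times
print(f"{bytes_in_flight(3350, 600) / 1024:.0f} KB in flight to saturate HBM")
```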
L1 Cache and Shared Memory: The Configurable Pool
On Ampere and Hopper, the L1 cache and shared memory share a physical SRAM pool per SM. The programmer configures the split:
// Configure shared memory size for a kernel
// This implicitly sets the L1 cache size = pool_total - shared_memory
cudaFuncSetAttribute(
    my_kernel,
    cudaFuncAttributeMaxDynamicSharedMemorySize,
    164 * 1024  // Request 164 KB shared memory
);
// On H100: pool is ~256 KB per SM
// If shared = 164 KB, then L1 = 256 - 164 = 92 KB
// If shared = 0 KB, then L1 = 256 KB (maximum L1)
// If shared = 228 KB (max), then L1 = 28 KB (minimum L1)
def l1_shared_tradeoff(pool_size_kb=256, shared_kb=0):
    """Analyze L1 vs shared memory tradeoff."""
    l1_kb = pool_size_kb - shared_kb
    print(f"Pool: {pool_size_kb} KB")
    print(f"Shared memory: {shared_kb} KB")
    print(f"L1 cache: {l1_kb} KB")
    print()
    configs = [
        (0, "Maximum L1 (elementwise kernels, no shared mem needed)"),
        (48, "Default (backward compat, small tile GEMM)"),
        (100, "Medium shared (FlashAttention Q/K/V tiles)"),
        (164, "Large shared (large GEMM tiles, Triton)"),
        (228, "Maximum shared (CUTLASS large tiles, WGMMA)"),
    ]
    print(f"{'Shared (KB)':<15} {'L1 (KB)':<10} {'Use Case'}")
    for sm_kb, use_case in configs:
        l1 = pool_size_kb - sm_kb
        print(f"{sm_kb:<15} {l1:<10} {use_case}")

l1_shared_tradeoff()
L2 Cache: The Universal Intermediary
All traffic between SMs and HBM passes through the L2 cache. There is no way to bypass it (unlike CPU architectures where non-temporal stores can bypass L2).
def l2_cache_analysis(l2_size_mb=50, num_sms=132):
    """Analyze L2 cache characteristics on H100."""
    l2_bytes = l2_size_mb * 1024 * 1024
    # L2 is partitioned into slices, one per memory controller
    # H100: 5 HBM3 stacks, each with 2 channels = 10 memory controllers
    num_mc = 10
    l2_per_mc = l2_bytes / num_mc
    print(f"L2 total: {l2_size_mb} MB")
    print(f"Memory controllers: {num_mc}")
    print(f"L2 per memory controller: {l2_per_mc / 1024:.0f} KB")
    print()
    # Cache line: 128 bytes
    cache_line = 128
    num_lines = l2_bytes // cache_line
    print(f"Cache line size: {cache_line} bytes")
    print(f"Total cache lines: {num_lines:,}")
    print("Associativity: ~16-32 way (architecture-dependent)")
    print()
    # What fits in L2?
    model_sizes = {
        "Llama-2 7B weights (FP16)": 14e9,
        "Llama-2 7B weights (INT4)": 3.5e9,
        "KV cache (1K tokens, 7B)": 0.5e9,
        "KV cache (4K tokens, 7B)": 2e9,
        "Activation buffer (bs=1, 4096 hidden)": 4096 * 2,
        "Attention scores (32 heads, 4K tokens)": 32 * 4096 * 4096 * 2,
    }
    print("What fits in 50 MB L2?")
    for name, size_bytes in model_sizes.items():
        fits = "Yes" if size_bytes <= l2_bytes else "No"
        print(f"  {name}: {size_bytes/1e6:.1f} MB -> {fits}")

l2_cache_analysis()
Cache Line Behavior
128-Byte Cache Lines
Both L1 and L2 use 128-byte cache lines, filled in 32-byte sectors. The line is the unit of tag tracking and eviction; the sector is the minimum unit of transfer between memory levels.
def cache_line_analysis():
    """Analyze cache line access patterns."""
    cache_line_bytes = 128
    # FP16: 2 bytes per element
    # One cache line holds 64 FP16 elements
    fp16_per_line = cache_line_bytes // 2
    print(f"FP16 elements per cache line: {fp16_per_line}")
    # FP32: 4 bytes per element
    fp32_per_line = cache_line_bytes // 4
    print(f"FP32 elements per cache line: {fp32_per_line}")
    # INT8: 1 byte per element
    int8_per_line = cache_line_bytes // 1
    print(f"INT8 elements per cache line: {int8_per_line}")
    print()
    # Simplified whole-line model:
    # A warp of 32 threads reading consecutive FP16 values
    # requests 32 * 2 = 64 bytes = 0.5 cache lines; if the hardware
    # fetched a full 128-byte line, utilization would be 64/128 = 50%.
    # (In practice L1/L2 fill in 32-byte sectors, so a consecutive
    # 64-byte request wastes nothing; 128 bytes is the tag
    # granularity, not the minimum fill.)
    #
    # A warp of 32 threads reading consecutive FP32 values
    # reads 32 * 4 = 128 bytes = exactly 1 cache line
    # Utilization: 128/128 = 100%
    warp_size = 32
    for dtype, elem_size in [("FP16", 2), ("FP32", 4), ("INT8", 1)]:
        bytes_per_warp = warp_size * elem_size
        lines_needed = (bytes_per_warp + cache_line_bytes - 1) // cache_line_bytes
        utilization = bytes_per_warp / (lines_needed * cache_line_bytes)
        print(f"Warp reading consecutive {dtype}: "
              f"{bytes_per_warp} bytes, "
              f"{lines_needed} cache line(s), "
              f"{utilization*100:.0f}% utilization")

cache_line_analysis()
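Alignment matters for the same reason: a contiguous request that does not start on a line boundary straddles an extra line. A small helper (illustrative) counts the lines touched:

```python
def lines_touched(start_byte, bytes_requested, line=128):
    """Number of 128-byte cache lines a contiguous request touches."""
    first = start_byte // line
    last = (start_byte + bytes_requested - 1) // line
    return last - first + 1

print(lines_touched(0, 128))   # aligned 128-byte warp load: 1 line
print(lines_touched(64, 128))  # misaligned: straddles 2 lines
```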
Cache Eviction: LRU-Like Policy
def cache_eviction_behavior():
    """Explain GPU cache eviction policies."""
    policies = {
        "L1 (data cache)": {
            "policy": "LRU-like with sector tracking",
            "details": "L1 tracks 128-byte lines but loads in "
                       "32-byte sectors. A line with no valid sectors "
                       "is evicted first.",
            "write_policy": "Write-through to L2 (L1 is not write-back "
                            "for global memory)",
        },
        "L2": {
            "policy": "LRU-like with multiple replacement policies",
            "details": "L2 supports different eviction hints: "
                       "normal (LRU), streaming (evict-first), "
                       "persistent (evict-last).",
            "write_policy": "Write-back to HBM (dirty lines are "
                            "written back on eviction)",
        },
    }
    for level, info in policies.items():
        print(f"\n{level}:")
        for key, val in info.items():
            print(f"  {key}: {val}")

cache_eviction_behavior()
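The effect of the streaming (evict-first) hint can be illustrated with a toy simulator: a small fully-associative LRU cache, a hot region re-read repeatedly, and bursts of one-time streaming data in between. This is a sketch of the policy's effect, not a model of the real replacement hardware:

```python
from collections import OrderedDict

def hot_hit_rate(cache_size, trace, streaming=frozenset()):
    """Toy fully-associative LRU cache over line IDs. Lines in
    `streaming` are inserted at the LRU end (the evict-first hint);
    everything else follows plain LRU. Returns hit rate on hot lines."""
    cache = OrderedDict()  # keys ordered LRU-first
    hot_hits = hot_refs = 0
    for line, is_hot in trace:
        if is_hot:
            hot_refs += 1
            hot_hits += line in cache
        if line in cache:
            cache.move_to_end(line)        # touch: move to MRU end
        else:
            if len(cache) >= cache_size:
                cache.popitem(last=False)  # evict LRU line
            cache[line] = True
            if line in streaming:
                cache.move_to_end(line, last=False)  # insert evict-first

    return hot_hits / hot_refs

# 64-line cache: 32 hot lines re-read between bursts of 64 streaming lines
hot = [(f"h{i}", True) for i in range(32)]
trace = []
for burst in range(8):
    trace += hot
    trace += [(f"s{burst}_{i}", False) for i in range(64)]
stream_ids = {line for line, is_hot in trace if not is_hot}

print(f"plain LRU hot hit rate:      {hot_hit_rate(64, trace):.2f}")
print(f"streaming-hint hot hit rate: {hot_hit_rate(64, trace, stream_ids):.2f}")
```

Under plain LRU each streaming burst flushes the hot region; with the evict-first hint the streaming lines evict each other and the hot lines survive.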
L2 Cache Residency Control (Ampere and Later)
Persistent L2 Cache Lines
Ampere introduced, and Hopper carries forward, the ability to mark a region of L2 as "persistent": these cache lines are evicted last, effectively reserving a portion of L2 for frequently accessed data.
// Set L2 persistence for a memory region
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
// Maximum L2 that can be set aside for persistence
size_t l2_persist_max = prop.persistingL2CacheMaxSize;
// On H100: typically 40 MB of the 50 MB L2
// Set the persistent window size
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize,
                   40 * 1024 * 1024);  // 40 MB
// Mark a specific allocation as persistent
cudaStreamAttrValue stream_attr;
stream_attr.accessPolicyWindow.base_ptr = kv_cache_ptr;
stream_attr.accessPolicyWindow.num_bytes = kv_cache_size;
stream_attr.accessPolicyWindow.hitRatio = 1.0f;  // Fraction of accesses given hitProp
stream_attr.accessPolicyWindow.hitProp =
    cudaAccessPropertyPersisting;  // Persist in L2
stream_attr.accessPolicyWindow.missProp =
    cudaAccessPropertyStreaming;   // Evict quickly
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow,
                       &stream_attr);
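When the access-policy window is larger than the persisting carveout, the usual guidance is to choose hitRatio so that roughly hitRatio * num_bytes fits in the carveout; otherwise the persisting lines thrash each other inside the set-aside region. A sizing helper (hypothetical convenience function, not a CUDA API), assuming the 40 MB carveout above:

```python
def suggest_hit_ratio(window_bytes, carveout_bytes=40 * 1024 * 1024):
    """Pin only the fraction of the window that fits in the persisting
    carveout; the hardware applies hitProp to a random subset of lines."""
    return min(1.0, carveout_bytes / window_bytes)

for mb in (16, 40, 128):
    r = suggest_hit_ratio(mb * 1024 * 1024)
    print(f"{mb:>4} MB window -> hitRatio {r:.2f}")
```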
Use Case: KV Cache in L2
def kv_cache_l2_analysis():
    """Analyze KV cache residency in L2."""
    # Per layer: 2 (K+V) * batch * kv_heads * seq_len * head_dim * dtype_bytes
    configs = [
        # (model, layers, kv_heads, head_dim, batch, seq_len)
        ("Llama-2 7B", 32, 32, 128, 1, 512),
        ("Llama-2 7B", 32, 32, 128, 1, 2048),
        ("Llama-2 7B", 32, 32, 128, 1, 8192),
        ("Llama-2 7B", 32, 32, 128, 32, 2048),
        ("Llama-2 70B", 80, 8, 128, 1, 2048),  # GQA: 8 KV heads
    ]
    l2_size = 50 * 1024 * 1024  # 50 MB
    print(f"{'Config':<40} {'Per-layer KV':<14} {'Total KV':<12} {'Layer fits L2?'}")
    for model, layers, kv_heads, head_dim, batch, seq in configs:
        per_layer = 2 * batch * kv_heads * seq * head_dim * 2  # FP16
        total = per_layer * layers
        fits = "Yes" if per_layer <= l2_size else "No"
        config_str = f"{model}, bs={batch}, seq={seq}"
        print(f"{config_str:<40} {per_layer/1e6:>9.1f} MB {total/1e6:>8.1f} MB  {fits}")

kv_cache_l2_analysis()
Per-Layer KV Cache Size vs L2 Cache Capacity (FP16)
| Model | Batch | Seq Len | Per-Layer KV | Layer Fits H100 L2 (50 MB)? |
|---|---|---|---|---|
| Llama-2 7B (MHA, 32 KV heads) | 1 | 512 | 8.4 MB | Yes |
| Llama-2 7B (MHA, 32 KV heads) | 1 | 2048 | 33.6 MB | Yes |
| Llama-2 7B (MHA, 32 KV heads) | 1 | 8192 | 134.2 MB | No |
| Llama-2 7B (MHA, 32 KV heads) | 32 | 2048 | 1073.7 MB | No |
| Llama-2 70B (GQA, 8 KV heads) | 1 | 2048 | 8.4 MB | Yes |
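Inverting the per-layer formula gives the longest context whose per-layer K/V tensors still fit in the 50 MB L2 (head_dim 128, FP16, batch 1 assumed):

```python
def max_seq_in_l2(kv_heads, head_dim=128, batch=1, dtype_bytes=2,
                  l2_bytes=50 * 1024 * 1024):
    """Largest context length whose per-layer K+V tensors fit in L2."""
    bytes_per_token = 2 * kv_heads * head_dim * dtype_bytes * batch
    return l2_bytes // bytes_per_token

print(max_seq_in_l2(kv_heads=32))  # Llama-2 7B (MHA)
print(max_seq_in_l2(kv_heads=8))   # Llama-2 70B (GQA)
```

GQA's 4x reduction in KV heads translates directly into a 4x longer L2-resident context.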
Measuring Cache Behavior
Working Set Size Experiment
// Measure effective bandwidth at different working set sizes
// to determine L1, L2, and HBM bandwidth
__global__ void bandwidth_test(
    const float* __restrict__ data,
    float* __restrict__ output,
    int working_set_elements,
    int total_accesses
) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    float sum = 0.0f;
    for (int i = tid; i < total_accesses; i += stride) {
        // Access within the working set (wrap around)
        int idx = i % working_set_elements;
        sum += data[idx];
    }
    // Prevent dead code elimination
    if (tid == 0) {
        output[0] = sum;
    }
}
import torch

def measure_bandwidth_vs_working_set():
    """Measure effective bandwidth at different working set sizes.

    Small working set (fits in L1): high bandwidth
    Medium working set (fits in L2): moderate bandwidth
    Large working set (HBM): lowest bandwidth

    Note: this PyTorch version makes one pass over the working set
    per iteration via data.sum(). It cannot resolve the L1 plateau
    (each launch starts cold, and launch overhead dominates tiny
    tensors); for the full curve, use the bandwidth_test CUDA kernel
    above, which wraps many accesses around a resident working set.
    """
    device = 'cuda'
    # Working set sizes from 4 KB to 1 GB
    working_set_sizes_kb = [
        4, 8, 16, 32, 64, 128, 256, 512,
        1024, 2048, 4096, 8192, 16384, 32768,
        65536, 131072, 262144, 524288, 1048576,
    ]
    results = []
    for ws_kb in working_set_sizes_kb:
        ws_elements = ws_kb * 1024 // 4  # FP32, 4 bytes per element
        data = torch.randn(ws_elements, device=device, dtype=torch.float32)
        # Warmup
        for _ in range(3):
            _ = data.sum()
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(10):
            _ = data.sum()  # One full pass over the working set
        end.record()
        torch.cuda.synchronize()
        elapsed_ms = start.elapsed_time(end) / 10
        bytes_read = ws_elements * 4  # One pass per timed iteration
        bandwidth_gb_s = (bytes_read / 1e9) / (elapsed_ms / 1000)
        results.append((ws_kb, bandwidth_gb_s))
        if ws_kb in [16, 128, 1024, 16384, 262144, 1048576]:
            print(f"WS={ws_kb:>8d} KB: {bandwidth_gb_s:>8.1f} GB/s")
    return results
# With the bandwidth_test kernel, the output shows three plateaus:
# - WS < ~200 KB: L1 bandwidth (~19 TB/s aggregate)
# - WS 200 KB - 50 MB: L2 bandwidth (~12 TB/s)
# - WS > 50 MB: HBM bandwidth (~3.35 TB/s)
Detecting Cache Boundaries
def identify_cache_boundaries(results):
    """Identify L1/L2/HBM boundaries from bandwidth measurements.

    A drop of more than 30% between consecutive working-set sizes
    marks a cache boundary: the first drop is L1 -> L2, the second
    is L2 -> HBM.
    """
    labels = ["L1 -> L2", "L2 -> HBM"]
    found = 0
    prev_ws, prev_bw = results[0]
    for ws_kb, bw in results[1:]:
        if bw < 0.7 * prev_bw and found < len(labels):
            print(f"{labels[found]} boundary: ~{prev_ws} KB "
                  f"({prev_ws/1024:.2f} MB)")
            found += 1
        prev_ws, prev_bw = ws_kb, bw
[Figure: Effective Bandwidth vs Working Set Size (H100), GB/s per SM]

Access Pattern Effects on Cache Performance
Sequential vs Strided Access
// Sequential access: maximum spatial locality
__global__ void sequential_access(
    const float* data, float* output, int n
) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    float sum = 0.0f;
    for (int i = tid; i < n; i += stride) {
        sum += data[i];  // Sequential: data[0], data[1], ...
    }
    output[tid] = sum;
}

// Strided access: cache line waste
__global__ void strided_access(
    const float* data, float* output, int n, int stride_factor
) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int grid_stride = gridDim.x * blockDim.x;
    float sum = 0.0f;
    for (int i = tid; i < n / stride_factor; i += grid_stride) {
        // Each access skips stride_factor elements
        // Only 1/stride_factor of each cache line is useful
        sum += data[i * stride_factor];
    }
    output[tid] = sum;
}
def analyze_strided_access():
    """Quantify cache line waste from strided access."""
    cache_line = 128  # bytes
    element_size = 4  # FP32
    elements_per_line = cache_line // element_size  # 32
    strides = [1, 2, 4, 8, 16, 32, 64, 128]
    print(f"{'Stride':<10} {'Useful bytes/line':<20} "
          f"{'Utilization':<15} {'Effective BW'}")
    for stride in strides:
        # If stride >= elements_per_line, every access fetches a new line
        # with only element_size useful bytes
        useful = min(element_size * (elements_per_line // stride),
                     cache_line)
        useful = max(useful, element_size)
        utilization = useful / cache_line
        # Assume 3350 GB/s HBM
        effective_bw = 3350 * utilization
        print(f"{stride:<10} {useful:<20} {utilization*100:>6.1f}%"
              f" {effective_bw:>7.0f} GB/s")

analyze_strided_access()
Access Stride vs Effective Bandwidth (H100, FP32, from HBM)
| Stride (elements) | Cache Line Utilization | Effective BW (GB/s) | Relative |
|---|---|---|---|
| 1 (sequential) | 100% | 3350 | 1.00x |
| 2 | 50% | 1675 | 0.50x |
| 4 | 25% | 838 | 0.25x |
| 8 | 12.5% | 419 | 0.125x |
| 16 | 6.25% | 209 | 0.063x |
| 32 (one per line) | 3.1% | 105 | 0.031x |
Random Access Pattern
def random_access_analysis(l2_size_mb=50, cache_line=128):
    """Analyze random access pattern cache behavior."""
    l2_bytes = l2_size_mb * 1024 * 1024
    l2_lines = l2_bytes // cache_line
    # Random access to a working set:
    # If working_set_lines < l2_lines, most accesses hit L2
    # If working_set_lines > l2_lines, most accesses miss L2
    working_sets_mb = [1, 5, 10, 25, 50, 100, 500, 1000]
    print(f"{'WS (MB)':<10} {'WS Lines':<12} {'L2 Lines':<10} "
          f"{'Est Hit Rate'}")
    for ws_mb in working_sets_mb:
        ws_bytes = ws_mb * 1024 * 1024
        ws_lines = ws_bytes // cache_line
        # Simplified model: if WS fits in L2, ~100% hit rate
        # If WS is 2x L2, ~50% hit rate (LRU model)
        if ws_lines <= l2_lines:
            hit_rate = 1.0
        else:
            hit_rate = l2_lines / ws_lines
        print(f"{ws_mb:<10} {ws_lines:<12,} {l2_lines:<10,} "
              f"{hit_rate*100:>6.1f}%")

random_access_analysis()
During attention decode, each new token attends to all previous positions in the KV cache. The access pattern is effectively random within the KV cache region. If the KV cache fits in L2 (short context, small batch), attention runs at L2 bandwidth. If it exceeds L2, attention drops to HBM bandwidth. This is why longer contexts are disproportionately slower: it is not just the O(N) compute, but the L2 miss rate.
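Combining this with the simplified hit-rate model above gives a rough per-token KV read-time curve. The bandwidth and L2 figures are the H100 numbers used throughout; the model is deliberately coarse (it ignores weights competing for L2, among other things):

```python
def attn_read_us(kv_bytes, l2_bytes=50 * 1024 * 1024,
                 bw_l2=12e12, bw_hbm=3.35e12):
    """Per-token KV read time under the simplified hit-rate model:
    hit rate ~ l2_size / working_set once the KV cache exceeds L2."""
    hit = min(1.0, l2_bytes / kv_bytes)
    return kv_bytes * (hit / bw_l2 + (1.0 - hit) / bw_hbm) * 1e6

for mb in (16, 64, 256, 1024):
    kv = mb * 1024 * 1024
    print(f"KV = {mb:>5} MB -> {attn_read_us(kv):7.2f} us/token")
```

Past the L2 boundary, read time grows faster than the KV cache itself, because a shrinking fraction of each pass is served at L2 bandwidth.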
Practical Optimization Strategies
Strategy 1: Minimize Working Set
def working_set_optimization():
    """Strategies to keep the working set in faster cache."""
    strategies = {
        "Quantization (INT4/FP8)": {
            "effect": "4x/2x smaller weights -> 4x/2x more fits in L2",
            "example": "70B model INT4 weights: 35 GB. Per-layer: "
                       "~440 MB -> still exceeds L2, but sub-layer "
                       "tiles may fit",
        },
        "Tiling": {
            "effect": "Process data in tiles that fit in L1/shared memory",
            "example": "FlashAttention processes attention in "
                       "256-token blocks that fit in shared memory",
        },
        "GQA (Grouped Query Attention)": {
            "effect": "Fewer KV heads -> smaller KV cache -> "
                      "more likely to fit in L2",
            "example": "Llama-2 70B: 8 KV heads vs 64 Q heads -> "
                       "8x smaller KV cache",
        },
        "Operator fusion": {
            "effect": "Keep intermediate results in registers/shared mem "
                      "instead of writing to HBM",
            "example": "Fused bias+GELU: intermediate tensor never "
                       "touches HBM",
        },
    }
    for name, info in strategies.items():
        print(f"\n{name}:")
        print(f"  Effect: {info['effect']}")
        print(f"  Example: {info['example']}")

working_set_optimization()
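The operator-fusion entry can be quantified with a simple traffic count for the bias+GELU example (FP16; the element count is illustrative, sized like one FFN activation):

```python
def bias_gelu_traffic_mb(n_elements, dtype_bytes=2):
    """HBM bytes moved by bias+GELU over n elements (reads + writes).
    Unfused: bias-add reads x and writes tmp, then GELU reads tmp and
    writes y -> 4 passes. Fused: read x, write y -> 2 passes."""
    unfused = 4 * n_elements * dtype_bytes / 1e6
    fused = 2 * n_elements * dtype_bytes / 1e6
    return unfused, fused

u, f = bias_gelu_traffic_mb(4096 * 11008)
print(f"unfused: {u:.0f} MB, fused: {f:.0f} MB, saved: {1 - f/u:.0%}")
```

For a bandwidth-bound elementwise pair, halving the HBM traffic roughly halves the runtime.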
Strategy 2: Software Prefetching
// Use cp.async to prefetch data before it is needed
// This hides HBM latency by overlapping load with compute
__global__ void prefetched_kernel(
    const float* __restrict__ data,
    float* __restrict__ output,
    int n
) {
    extern __shared__ float smem[];
    int tid = threadIdx.x;
    int block_offset = blockIdx.x * blockDim.x;
    // Prefetch the tile for this block
    if (block_offset + tid < n) {
        // cp.async: asynchronous copy from global to shared.
        // The shared operand must be a shared-state-space address,
        // hence __cvta_generic_to_shared.
        unsigned smem_addr =
            (unsigned)__cvta_generic_to_shared(&smem[tid]);
        asm volatile("cp.async.ca.shared.global [%0], [%1], 4;\n"
                     :: "r"(smem_addr),
                        "l"(&data[block_offset + tid]));
    }
    asm volatile("cp.async.commit_group;\n");
    asm volatile("cp.async.wait_group 0;\n");
    __syncthreads();
    // Process the tile from shared memory; in a multi-tile loop the
    // next tile's cp.async would be issued before this compute so the
    // load overlaps work on the current tile
    if (block_offset + tid < n) {
        float result = smem[tid] * 2.0f;  // Example computation
        output[block_offset + tid] = result;
    }
}
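The single prefetch above generalizes to multi-stage pipelines: with k buffered tiles, compute on k-1 tiles must cover one load latency. A back-of-envelope stage count (600-cycle HBM latency assumed; real kernels tune this against shared-memory capacity):

```python
import math

def pipeline_stages(mem_latency_cycles, compute_cycles_per_tile):
    """With k buffered tiles, compute on k-1 tiles overlaps each load:
    latency is hidden when (k - 1) * compute >= latency. Minimum of 2
    stages (classic double buffering)."""
    return max(2, math.ceil(mem_latency_cycles / compute_cycles_per_tile) + 1)

for compute in (100, 300, 600):
    print(f"{compute:>4} compute cycles/tile -> "
          f"{pipeline_stages(600, compute)} stages")
```

Compute-light tiles need deep pipelines, which is one reason large shared-memory carveouts (see the L1/shared split above) matter for GEMM-style kernels.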
Summary
The GPU memory hierarchy spans five levels, from registers through shared memory/L1, L2, and HBM, with a 5.7x bandwidth gap between the fastest cache (L1 at ~19 TB/s aggregate) and the slowest memory (HBM at 3.35 TB/s). Cache line size is 128 bytes at both L1 and L2, filled in 32-byte sectors. Access patterns that waste cache lines (strided or random) can reduce effective bandwidth by 8-32x. Working set size determines which cache level serves the data: up to ~200 KB per SM can stay in L1, under 50 MB (H100) in L2, and everything else comes from HBM.
For LLM inference, the key cache behavior is the KV cache during attention. Short contexts with small batches keep the per-layer KV tensors within L2 reach (high bandwidth). Long contexts exceed L2 and fall to HBM bandwidth; this is the primary reason per-token latency increases with context length beyond a threshold. Optimization strategies focus on reducing the effective working set: quantization (smaller data), tiling (process in cache-sized chunks), and GQA (fewer KV heads).