The L2 cache is the last level of on-chip cache before HBM. On H100, it is 50 MB — large enough to hold a 12.5-million-element float array, but small enough that a single large GEMM can blow through it in microseconds. Every byte that misses L2 must be fetched from HBM at 3,350 GB/s — a 3-4x bandwidth reduction compared to L2 hit bandwidth of approximately 12 TB/s. For LLM inference, where the KV cache can span hundreds of megabytes, the fraction of KV cache that fits in L2 directly determines attention kernel throughput.
This post covers the L2 cache architecture, the eviction policy, working set analysis, Hopper’s L2 residency control mechanism, multi-kernel cache interference, and how to design kernels that maximize L2 hit rates.
L2 Cache Architecture
Size and Organization
L2 Cache Specifications Across GPU Generations
| Specification | V100 (Volta) | A100 (Ampere) | H100 (Hopper) | B200 (Blackwell) |
|---|---|---|---|---|
| L2 cache size | 6 MB | 40 MB | 50 MB | ~96 MB (est.) |
| Cache line size | 128 bytes | 128 bytes | 128 bytes | 128 bytes |
| Associativity | 16-way | 16-way (estimated) | 16-way (estimated) | TBD |
| Total cache lines | 49,152 | 327,680 | 409,600 | ~786,432 |
| L2 bandwidth (peak) | ~4 TB/s | ~5 TB/s | ~12 TB/s | ~18 TB/s (est.) |
| L2 latency | ~200 cycles | ~200 cycles | ~200 cycles | ~200 cycles |
| L2 sectors per line | 4 x 32B | 4 x 32B | 4 x 32B | 4 x 32B |
| Memory controllers | 8 (4 HBM2 stacks) | 10 (5 HBM2e stacks) | 12 (6 HBM3 stacks) | 16 (8 HBM3e stacks) |
Cache Line and Sector Structure
The L2 cache operates in 128-byte cache lines, divided into 4 sectors of 32 bytes each. A memory request from an SM targets specific sectors within a line. If the requested sectors are present in L2, only those sectors are returned — the full 128-byte line is not necessarily populated.
// L2 Cache Line Structure (128 bytes)
// [Sector 0: 32B] [Sector 1: 32B] [Sector 2: 32B] [Sector 3: 32B]
// One tag per 128-byte line, with a valid bit per sector
// A coalesced 128-byte load from a warp (32 threads x 4 bytes each):
// Touches all 4 sectors → full line fetch on miss
// A 32-byte partial load (8 threads x 4 bytes, others masked):
// Touches 1 sector → only that sector fetched on miss
// But the full tag is allocated, and other sectors can be fetched later
Address Hashing and L2 Slices
The L2 cache is physically distributed across memory controller slices. The GPU uses address hashing to map memory addresses to L2 slices, distributing traffic evenly:
// Simplified address mapping (actual hash is proprietary):
// Address bits [10:7] → L2 slice index (for 12 slices on H100)
// This means addresses 0, 4096, 8192, ... map to the same slice
// Strided access patterns with stride = N * 4096 can cause slice conflicts
// In practice, NVIDIA's hash function is designed to avoid power-of-2 conflicts
// but extreme stride patterns can still create hotspots
Eviction Policy
LRU-Like with Adaptive Behavior
NVIDIA L2 caches use an eviction policy that is approximately LRU (Least Recently Used) with adaptive modifications. The exact policy is not documented, but empirical measurements reveal:
- Recently accessed lines are preserved. Lines accessed within the last N accesses (where N is proportional to the associativity) are very likely to remain resident.
- Streaming data gets lower priority. The cache detects sequential access patterns and assigns lower priority to streaming data, preventing it from evicting high-reuse data.
- Sector-level tracking. Eviction decisions consider sector-level validity — a line with all 4 sectors valid may be preserved over a line with only 1 sector valid.
// Measuring L2 eviction behavior
// Access an array repeatedly, measuring time per access as size increases
__global__ void l2_probe(float* data, float* output, int array_size, int iters) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
float sum = 0.0f;
for (int iter = 0; iter < iters; iter++) {
for (int i = tid; i < array_size; i += blockDim.x * gridDim.x) {
sum += data[i];
}
}
output[tid] = sum;
}
// Sweep array_size from 1 MB to 200 MB
// Measure effective bandwidth at each size
// When array_size > L2 size → sharp bandwidth drop (L2 miss)
(Figure: effective bandwidth in GB/s vs working set size on H100)
The Working Set Cliff
The transition from L2-resident to HBM-resident is not a sharp cliff but a gradual degradation. As the working set exceeds L2 capacity, the hit rate drops proportionally:
For an H100 with a 60 MB working set (50 MB L2):
- Approximate hit rate (assuming uniform access): 50 / 60 ≈ 83%
- Effective bandwidth ≈ 1 / (0.83 / 12,000 + 0.17 / 3,350) ≈ 8,400 GB/s
For a 200 MB working set:
- Hit rate: 50 / 200 = 25%
- Effective bandwidth ≈ 1 / (0.25 / 12,000 + 0.75 / 3,350) ≈ 4,100 GB/s
In practice, access patterns are not uniform. Hot data (frequently accessed elements) stays in cache while cold data (accessed once) is evicted quickly. Kernels with temporal locality perform much better than this simple model predicts.
L2 Residency Control (Hopper)
The Problem
In multi-kernel workloads, one kernel’s data can evict another kernel’s data from L2. For LLM inference, the attention kernel benefits from having the KV cache in L2, but a preceding GEMM kernel can flush the entire L2 with weight matrix data that is only used once.
The Solution: cudaAccessPolicyWindow
Explicit L2 residency control was introduced with Ampere (CUDA 11) and carries forward on Hopper. You can specify a region of memory that should be given priority in L2, and a region that should be treated as streaming (low priority):
// Set L2 persistence policy for the KV cache. The window is a stream
// attribute; there is no standalone cudaStreamSetAccessPolicyWindow function.
cudaStreamAttrValue attr = {};
attr.accessPolicyWindow.base_ptr  = kv_cache_ptr;   // Address of KV cache
attr.accessPolicyWindow.num_bytes = kv_cache_size;  // Size of KV cache
attr.accessPolicyWindow.hitRatio  = 1.0f;           // Try to keep 100% in L2
attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;  // Persist in L2
attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;   // Overflow treated as streaming
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
// Launch attention kernel — KV cache will be prioritized in L2
attention_kernel<<<grid, block, 0, stream>>>(kv_cache_ptr, ...);
// Reset: shrink the window to zero bytes and release persisting lines
attr.accessPolicyWindow.num_bytes = 0;
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
cudaCtxResetPersistingL2Cache();
For a 7B parameter LLM with a KV cache of 32 MB (fits in L2), setting persistent access policy for the KV cache can improve attention kernel throughput by 30-50%. The GEMM kernels that access the weight matrices (14 GB, far exceeding L2) should be set to streaming policy to avoid evicting the KV cache.
Policy Properties
// Three access properties:
cudaAccessPropertyNormal // Default: standard LRU behavior
cudaAccessPropertyStreaming // Low priority: evict first when cache is full
cudaAccessPropertyPersisting // High priority: evict last, keep in L2
// hitRatio controls what fraction of accesses to the window receive hitProp
// hitRatio = 0.5: roughly half the window's lines (selected pseudo-randomly
// by hardware, not by address) get persisting treatment, the rest get missProp
// This is useful when the window is larger than the persisting L2 partition
// and you want to avoid thrashing it
Partitioning L2 Between Persistent and Non-Persistent
The total amount of L2 available for persistent data is controlled by:
// Query the maximum persistent L2 size
int persisting_l2_size;
cudaDeviceGetAttribute(&persisting_l2_size,
cudaDevAttrMaxPersistingL2CacheSize, 0);
// On H100: returns ~50 MB (entire L2 can be persistent)
// Set the amount of L2 reserved for persistent data
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, 32 * 1024 * 1024);
// Reserve 32 MB for persistent data, 18 MB for normal caching
Reserving too much L2 for persistent data starves other kernels of cache capacity. If you reserve 48 MB of the 50 MB L2 for persistent data, only 2 MB remains for all other memory accesses — effectively disabling L2 caching for non-persistent data. The optimal partition depends on the workload. For LLM inference: reserve enough for the KV cache, leave the rest for weight/activation data.
Measuring L2 Cache Performance
Nsight Compute L2 Metrics
# L2 hit rate and throughput
ncu --metrics \
lts__t_sectors_srcunit_tex_op_read.sum,\
lts__t_sectors_srcunit_tex_op_read_lookup_hit.sum,\
lts__t_sectors_srcunit_tex_op_read_lookup_miss.sum,\
lts__t_request_cycles.sum,\
lts__t_sectors.sum.per_second \
-k my_kernel ./my_app
# Hit rate = hit / (hit + miss)
# Throughput = sectors * 32 bytes / kernel_time
Building an L2 Cache Probe
// L2 cache bandwidth probe: measures effective bandwidth at different working set sizes
#include <cuda_runtime.h>
#include <stdio.h>
__global__ void bandwidth_probe(const float4* __restrict__ data,
float* __restrict__ output,
int elements, int iterations) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int stride = gridDim.x * blockDim.x;
float4 sum = make_float4(0, 0, 0, 0);
for (int iter = 0; iter < iterations; iter++) {
for (int i = tid; i < elements; i += stride) {
float4 val = data[i];
sum.x += val.x;
sum.y += val.y;
sum.z += val.z;
sum.w += val.w;
}
}
if (tid == 0) output[0] = sum.x + sum.y + sum.z + sum.w;
}
int main() {
// Sweep working set sizes from 1 MB to 256 MB
size_t sizes[] = {
1 << 20, 2 << 20, 4 << 20, 8 << 20, 16 << 20,
32 << 20, 48 << 20, 50 << 20, 64 << 20, 128 << 20, 256 << 20
};
for (int s = 0; s < sizeof(sizes)/sizeof(sizes[0]); s++) {
size_t bytes = sizes[s];
int elements = bytes / sizeof(float4);
int iterations = 100;
float4* d_data;
float* d_output;
    cudaMalloc(&d_data, bytes);
    cudaMalloc(&d_output, sizeof(float));
    cudaMemset(d_data, 0, bytes);  // initialize so the probe reads defined data
// Warmup
bandwidth_probe<<<256, 256>>>(d_data, d_output, elements, 10);
cudaDeviceSynchronize();
// Timed run
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
bandwidth_probe<<<256, 256>>>(d_data, d_output, elements, iterations);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms;
cudaEventElapsedTime(&ms, start, stop);
double total_bytes = (double)bytes * iterations;
double bandwidth_gbps = total_bytes / (ms * 1e-3) / 1e9;
printf("Working set: %6zu MB Bandwidth: %8.1f GB/s\n",
bytes >> 20, bandwidth_gbps);
cudaFree(d_data);
cudaFree(d_output);
cudaEventDestroy(start);
cudaEventDestroy(stop);
}
return 0;
}
# Expected output on H100:
# Working set: 1 MB Bandwidth: 11842.3 GB/s
# Working set: 2 MB Bandwidth: 11790.1 GB/s
# Working set: 4 MB Bandwidth: 11650.8 GB/s
# Working set: 8 MB Bandwidth: 11520.4 GB/s
# Working set: 16 MB Bandwidth: 11200.7 GB/s
# Working set: 32 MB Bandwidth: 10800.2 GB/s
# Working set: 48 MB Bandwidth: 8900.5 GB/s
# Working set: 50 MB Bandwidth: 7600.3 GB/s ← L2 boundary
# Working set: 64 MB Bandwidth: 4800.1 GB/s
# Working set: 128 MB Bandwidth: 3350.8 GB/s ← HBM limited
# Working set: 256 MB Bandwidth: 3320.5 GB/s
Cache-Aware Kernel Design Patterns
Pattern 1: Tile to L2
Structure kernels so that each tile of work operates on a data footprint that fits in L2:
// Matrix multiply with L2-aware tiling
// C[M,N] = A[M,K] * B[K,N]
// Process K in chunks that fit in L2
// L2 = 50 MB. If A tile is M x TILE_K and B tile is TILE_K x N:
// Data footprint = M * TILE_K + TILE_K * N (floats)
// For M=N=4096, TILE_K = 50MB / (2 * 4096 * 4 bytes) ≈ 1536
// But this is the global memory footprint.
// In practice, each SM only accesses a portion.
// The effective L2 working set is the union of all SMs' access patterns.
// Host-side sketch: grid, block, and gemm_kernel (which must accumulate
// into C, i.e. beta = 1 across chunks) are assumed to be defined elsewhere.
void tiled_gemm(float* C, const float* A, const float* B,
                int M, int N, int K) {
    int TILE_K = 1024; // Chosen to keep A_tile + B_tile in L2
for (int k = 0; k < K; k += TILE_K) {
int k_size = min(TILE_K, K - k);
gemm_kernel<<<grid, block>>>(C, A + k, B + k * N,
M, N, K, k_size);
}
}
Pattern 2: Persistent Kernels for L2 Reuse
A persistent kernel launches a fixed number of thread blocks (one per SM) that loop over work items. This keeps the thread blocks’ shared data resident in L2:
__global__ void persistent_attention(
const float* __restrict__ Q, // Queries
const float* __restrict__ K, // Keys (KV cache) — want this in L2
const float* __restrict__ V, // Values (KV cache)
float* __restrict__ O, // Output
int num_heads, int seq_len, int head_dim) {
// One block per SM, loops over heads
int sm_id = blockIdx.x;
int num_sms = gridDim.x;
for (int head = sm_id; head < num_heads; head += num_sms) {
// Each iteration accesses K and V for this head
// If head_dim * seq_len fits in L2, K/V stay cached
// across iterations on the same SM
const float* q_head = Q + head * head_dim;
const float* k_head = K + head * seq_len * head_dim;
const float* v_head = V + head * seq_len * head_dim;
        // Compute attention for this head
        // ... (FlashAttention-style tiled computation producing `result`)
        float result = 0.0f;  // placeholder for the computed output element
        O[head * head_dim + threadIdx.x] = result;
}
}
// Launch with exactly num_SMs blocks to maximize L2 reuse
persistent_attention<<<132, 256>>>(Q, K, V, O, num_heads, seq_len, head_dim);
Pattern 3: Data Layout for Cache Line Alignment
// Misaligned structure: 20 bytes, crosses cache line boundaries
struct MisalignedData {
float x, y, z; // 12 bytes
int flags; // 4 bytes
float weight; // 4 bytes
}; // 20 bytes — not a power of 2, causes partial cache line usage
// Aligned structure: 32 bytes, exactly one sector per element
struct __align__(32) AlignedData {
float x, y, z; // 12 bytes
int flags; // 4 bytes
float weight; // 4 bytes
float padding[3]; // 12 bytes padding
}; // 32 bytes — one sector per element, no wasted cache traffic
Pattern 4: Stream-and-Compute to Avoid L2 Pollution
For data that is accessed exactly once (streaming), bypass or minimize L2 footprint:
// Use L2 streaming hints (Ampere+)
// Load with cache bypass for streaming data
__global__ void stream_kernel(const float* __restrict__ stream_data,
const float* __restrict__ reuse_data,
float* __restrict__ output, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx >= n) return;
// Load streaming data with cache-global (L2 only, not L1)
float stream_val;
asm volatile("ld.global.cg.f32 %0, [%1];"
: "=f"(stream_val) : "l"(stream_data + idx));
// Load reuse data normally (cached in both L1 and L2)
float reuse_val = reuse_data[idx % 1024]; // Small, reused array
output[idx] = stream_val * reuse_val;
}
L2 Cache Load Modifiers
| Modifier | PTX Instruction | L1 Cache | L2 Cache | Use Case |
|---|---|---|---|---|
| Default (.ca) | ld.global.ca | Cached | Cached | General purpose, data reuse expected |
| Cache global (.cg) | ld.global.cg | Not cached | Cached | Data reused across SMs, not within SM |
| Cache streaming (.cs) | ld.global.cs | Not cached | Streaming (low priority) | Data accessed once, avoid L2 pollution |
| Last use (.lu) | ld.global.lu | Not cached | Evict after use | Data consumed exactly once |
| Cache volatile (.cv) | ld.global.cv | Not cached | Not cached | Bypass all caches (volatile data) |
The GPU hardware treats cache modifiers as hints. Under memory pressure, even “persisting” data may be evicted. The hardware prioritizes correctness over the hint. Similarly, “streaming” data may still occupy L2 lines temporarily. The benefit comes from the eviction priority: streaming data is evicted before persisting data when the cache is full.
Multi-Kernel L2 Interference
The Problem: Sequential Kernels Thrash L2
In a typical LLM inference pipeline:
- GEMM kernel (QKV projection): Reads weight matrix (hundreds of MB), writes activations.
- Attention kernel: Reads Q, K, V (KV cache may be 10-50 MB). Wants K/V in L2.
- GEMM kernel (output projection): Reads another weight matrix, evicts K/V from L2.
- FFN GEMM kernels: Two large GEMMs, further thrashing L2.
Each GEMM reads a weight matrix much larger than L2, evicting everything. The attention kernel that follows finds none of its KV cache in L2.
Solution: Operator Fusion and Scheduling
// Approach 1: Fuse attention + output projection
// Keep KV cache in L2 by not launching an intervening GEMM
// Approach 2: Use L2 residency control (Hopper)
// Mark KV cache as persistent before the GEMM
cudaStreamAttrValue kv_attr = {};
kv_attr.accessPolicyWindow.base_ptr  = kv_cache;
kv_attr.accessPolicyWindow.num_bytes = kv_cache_size;
kv_attr.accessPolicyWindow.hitRatio  = 1.0f;
kv_attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
kv_attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &kv_attr);
// GEMM kernel runs — its weight data gets streaming priority
// KV cache stays in L2 despite the GEMM traffic
gemm_kernel<<<...>>>(weights, activations, output);
// Attention kernel runs — KV cache is still in L2
attention_kernel<<<...>>>(Q, K, V, output);
Measuring L2 Interference
# Profile L2 hit rates with and without residency control
# Without:
ncu --metrics lts__t_sectors_srcunit_tex_op_read_lookup_hit.sum,\
lts__t_sectors_srcunit_tex_op_read_lookup_miss.sum \
-k attention_kernel ./inference_no_policy
# With:
ncu --metrics lts__t_sectors_srcunit_tex_op_read_lookup_hit.sum,\
lts__t_sectors_srcunit_tex_op_read_lookup_miss.sum \
-k attention_kernel ./inference_with_policy
# Expected: hit rate jumps from ~15% to ~85% with residency control
L2 Cache and Multi-GPU Considerations
NVLink and L2
On NVLink-connected GPUs (DGX H100), one GPU can access another GPU’s HBM via NVLink. These remote accesses do not go through the remote GPU’s L2 cache — they access remote HBM directly. However, the local GPU caches the remote data in its own L2:
// GPU 0 accesses GPU 1's memory:
// 1. GPU 0 L1 miss → GPU 0 L2 lookup (local)
// 2. GPU 0 L2 miss → NVLink request to GPU 1
// 3. GPU 1 HBM responds (bypasses GPU 1's L2)
// 4. Data stored in GPU 0's L2 (for future local reuse)
// Latency: ~2-3 microseconds (NVLink round trip)
MIG and L2 Partitioning
When MIG (Multi-Instance GPU) is enabled, the L2 cache is physically partitioned along with the SMs and memory controllers. Each MIG instance gets a dedicated slice of L2:
// H100 with MIG enabled:
// 7x 1g.10gb instances:
// Each gets: ~7 MB L2, ~10 GB HBM, 16-19 SMs
// 1x 7g.80gb instance (full GPU):
// Gets: 50 MB L2, 80 GB HBM, 132 SMs
// 2x 3g.40gb instances:
// Each gets: ~21 MB L2, 40 GB HBM, ~44 SMs
The L2 partition is enforced by hardware. One MIG instance cannot pollute another’s L2 cache.
Summary
The L2 cache is the bandwidth multiplier for GPU workloads. Kernels that keep their working set under L2 capacity (50 MB on H100) run at 3-4x the bandwidth of kernels that must access HBM. The practical strategies: size tiles to fit in L2, use persistent kernels to maintain L2 residency across iterations, apply cudaAccessPolicyWindow (available since Ampere) to protect hot data from eviction by streaming traffic, use cache modifiers (ld.global.cs) to mark single-use data as streaming, and measure L2 hit rates with Nsight Compute to validate that cache-aware designs actually improve performance.
For LLM inference specifically: the KV cache size relative to L2 capacity is the critical metric. If the KV cache fits in L2 (2 × num_layers × num_heads × head_dim × context_length × precision_bytes ≤ 50 MB), attention kernels run at L2 bandwidth. If it does not, attention becomes HBM-bandwidth-limited — and the only solution is either reducing the KV cache (GQA, quantization) or using hardware with larger L2 (Blackwell’s ~96 MB).