Part 13 of 30 in the series GPU Hardware & AI Accelerators

The L2 cache is the last level of on-chip cache before HBM. On H100 it is 50 MB — large enough to hold a 13-million-element float array, but small enough that a single large GEMM can blow through it in microseconds. Every byte that misses L2 must be fetched from HBM at 3,350 GB/s, a 3-4x step down from the roughly 12 TB/s of L2 hit bandwidth. For LLM inference, where the KV cache can span hundreds of megabytes, the fraction of the KV cache that fits in L2 directly determines attention kernel throughput.

This post covers the L2 cache architecture, the eviction policy, working set analysis, the L2 residency control mechanism (introduced with Ampere), multi-kernel cache interference, and how to design kernels that maximize L2 hit rates.

L2 Cache Architecture

Size and Organization

📊 L2 Cache Specifications Across GPU Generations

| Specification | V100 (Volta) | A100 (Ampere) | H100 (Hopper) | B200 (Blackwell) |
| --- | --- | --- | --- | --- |
| L2 cache size | 6 MB | 40 MB | 50 MB | ~96 MB (est.) |
| Cache line size | 128 bytes | 128 bytes | 128 bytes | 128 bytes |
| Associativity | 16-way | 16-way (estimated) | 16-way (estimated) | TBD |
| Total cache lines | 49,152 | 327,680 | 409,600 | ~786,432 |
| L2 bandwidth (peak) | ~4 TB/s | ~5 TB/s | ~12 TB/s | ~18 TB/s (est.) |
| L2 latency | ~200 cycles | ~200 cycles | ~200 cycles | ~200 cycles |
| L2 sectors per line | 4 × 32 B | 4 × 32 B | 4 × 32 B | 4 × 32 B |
| Memory controllers | 8 (4 HBM2 stacks) | 10 (5 HBM2e stacks) | 12 (6 HBM3 stacks) | 16 (8 HBM3e stacks) |

Note: L2 cache has grown 16x from V100 to B200. Each memory controller manages a slice of L2, and traffic is distributed by address hashing.

Cache Line and Sector Structure

The L2 cache operates in 128-byte cache lines, divided into 4 sectors of 32 bytes each. A memory request from an SM targets specific sectors within a line. If the requested sectors are present in L2, only those sectors are returned — the full 128-byte line is not necessarily populated.

// L2 Cache Line Structure (128 bytes)
// [Sector 0: 32B] [Sector 1: 32B] [Sector 2: 32B] [Sector 3: 32B]
//      Tag                     Valid bits per sector

// A coalesced 128-byte load from a warp (32 threads x 4 bytes each):
// Touches all 4 sectors → full line fetch on miss

// A 32-byte partial load (8 threads x 4 bytes, others masked):
// Touches 1 sector → only that sector fetched on miss
// But the full tag is allocated, and other sectors can be fetched later

Address Hashing and L2 Slices

The L2 cache is physically distributed across memory controller slices. The GPU uses address hashing to map memory addresses to L2 slices, distributing traffic evenly:

// Simplified address mapping (the actual hash is proprietary):
// Address bits [10:7] → L2 slice index, folded onto the 12 slices on H100
// Under this simplified mapping, the slice index repeats every 2048 bytes:
// addresses 0, 2048, 4096, ... all land on the same slice, and any access
// stride that is a multiple of 2048 concentrates traffic on a single slice

// In practice, NVIDIA's hash mixes higher address bits specifically to break
// up power-of-2 strides, but extreme stride patterns can still create hotspots

Eviction Policy

LRU-Like with Adaptive Behavior

NVIDIA L2 caches use an eviction policy that is approximately LRU (Least Recently Used) with adaptive modifications. The exact policy is not documented, but empirical measurements reveal:

  1. Recently accessed lines are preserved. Lines accessed within the last N accesses (where N is proportional to the associativity) are very likely to remain resident.
  2. Streaming data gets lower priority. The cache detects sequential access patterns and assigns lower priority to streaming data, preventing it from evicting high-reuse data.
  3. Sector-level tracking. Eviction decisions consider sector-level validity — a line with all 4 sectors valid may be preserved over a line with only 1 sector valid.

// Measuring L2 eviction behavior
// Access an array repeatedly, measuring time per access as size increases
__global__ void l2_probe(float* data, float* output, int array_size, int iters) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;

    for (int iter = 0; iter < iters; iter++) {
        for (int i = tid; i < array_size; i += blockDim.x * gridDim.x) {
            sum += data[i];
        }
    }
    output[tid] = sum;
}

// Sweep array_size from 1 MB to 200 MB
// Measure effective bandwidth at each size
// When array_size > L2 size → sharp bandwidth drop (L2 miss)

Effective Bandwidth vs Working Set Size (H100)

| Working set | Effective bandwidth | Regime |
| --- | --- | --- |
| 1 MB | ~11,800 GB/s | fits in L2, L2 hit |
| 10 MB | ~11,500 GB/s | fits in L2 |
| 40 MB | ~10,200 GB/s | L2 filling up |
| 50 MB | ~7,600 GB/s | L2 boundary, partial misses |
| 80 MB | ~3,300 GB/s | exceeds L2, HBM limited |
| 160 MB | ~3,200 GB/s | 2x L2, full HBM |

The Working Set Cliff

The transition from L2-resident to HBM-resident is not a sharp cliff but a gradual degradation. As the working set exceeds L2 capacity, the hit rate drops proportionally:

effective_bandwidth = hit_rate × BW_L2 + (1 − hit_rate) × BW_HBM

For an H100 with a 60 MB working set (50 MB L2):

  • Approximate hit rate: 50/60 ≈ 83% (assuming uniform access)
  • effective_BW ≈ 0.83 × 12,000 + 0.17 × 3,350 = 9,960 + 570 ≈ 10,530 GB/s

For a 200 MB working set:

  • Hit rate: 50/200 = 25%
  • effective_BW = 0.25 × 12,000 + 0.75 × 3,350 = 3,000 + 2,513 ≈ 5,513 GB/s

In practice, access patterns are not uniform. Hot data (frequently accessed elements) stays in cache while cold data (accessed once) is evicted quickly. Kernels with temporal locality perform much better than this simple model predicts.

L2 Residency Control (Ampere and Hopper)

The Problem

In multi-kernel workloads, one kernel’s data can evict another kernel’s data from L2. For LLM inference, the attention kernel benefits from having the KV cache in L2, but a preceding GEMM kernel can flush the entire L2 with weight matrix data that is only used once.

Ampere's Solution: cudaAccessPolicyWindow

Explicit L2 residency control was introduced with Ampere (CUDA 11) and carries forward to Hopper. You can specify a region of memory that should be given priority in L2, and a region that should be treated as streaming (low priority):

// Set an L2 persistence window for the KV cache.
// Windows are attached to a stream via cudaStreamSetAttribute.
cudaStreamAttrValue attr = {};
attr.accessPolicyWindow.base_ptr  = kv_cache_ptr;   // Address of KV cache
attr.accessPolicyWindow.num_bytes = kv_cache_size;  // Size of KV cache
attr.accessPolicyWindow.hitRatio  = 1.0f;           // Try to keep 100% in L2
attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;  // Persist in L2
attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;   // Overflow treated as streaming
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);

// Launch attention kernel — KV cache will be prioritized in L2
attention_kernel<<<grid, block, 0, stream>>>(kv_cache_ptr, ...);

// Reset: shrink the window to zero, then flush any persisting lines
attr.accessPolicyWindow.num_bytes = 0;
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
cudaCtxResetPersistingL2Cache();

L2 Residency Control Impact on LLM Inference

For a 7B parameter LLM with a KV cache of 32 MB (fits in L2), setting persistent access policy for the KV cache can improve attention kernel throughput by 30-50%. The GEMM kernels that access the weight matrices (14 GB, far exceeding L2) should be set to streaming policy to avoid evicting the KV cache.

Policy Properties

// Three access properties:
cudaAccessPropertyNormal      // Default: standard LRU behavior
cudaAccessPropertyStreaming    // Low priority: evict first when cache is full
cudaAccessPropertyPersisting   // High priority: evict last, keep in L2

// hitRatio controls what fraction of the range gets persistent treatment
// hitRatio = 0.5: half the data gets persistent, half gets streaming
// This is useful when only part of a buffer is hot (e.g., recent KV cache entries)

Partitioning L2 Between Persistent and Non-Persistent

The total amount of L2 available for persistent data is controlled by:

// Query the maximum persistent L2 size
int persisting_l2_size;
cudaDeviceGetAttribute(&persisting_l2_size,
                       cudaDevAttrMaxPersistingL2CacheSize, 0);
// On H100: returns ~50 MB (entire L2 can be persistent)

// Set the amount of L2 reserved for persistent data
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, 32 * 1024 * 1024);
// Reserve 32 MB for persistent data, 18 MB for normal caching

⚠️ Over-Reserving Persistent L2

Reserving too much L2 for persistent data starves other kernels of cache capacity. If you reserve 48 MB of the 50 MB L2 for persistent data, only 2 MB remains for all other memory accesses — effectively disabling L2 caching for non-persistent data. The optimal partition depends on the workload. For LLM inference: reserve enough for the KV cache, leave the rest for weight/activation data.

Measuring L2 Cache Performance

Nsight Compute L2 Metrics

# L2 hit rate and throughput
ncu --metrics \
  lts__t_sectors_srcunit_tex_op_read.sum,\
  lts__t_sectors_srcunit_tex_op_read_lookup_hit.sum,\
  lts__t_sectors_srcunit_tex_op_read_lookup_miss.sum,\
  lts__t_request_cycles.sum,\
  lts__t_sectors.sum.per_second \
  -k my_kernel ./my_app

# Hit rate = hit / (hit + miss)
# Throughput = sectors * 32 bytes / kernel_time

Building an L2 Cache Probe

// L2 cache bandwidth probe: measures effective bandwidth at different working set sizes
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void bandwidth_probe(const float4* __restrict__ data,
                                float* __restrict__ output,
                                int elements, int iterations) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    float4 sum = make_float4(0, 0, 0, 0);

    for (int iter = 0; iter < iterations; iter++) {
        for (int i = tid; i < elements; i += stride) {
            float4 val = data[i];
            sum.x += val.x;
            sum.y += val.y;
            sum.z += val.z;
            sum.w += val.w;
        }
    }
    if (tid == 0) output[0] = sum.x + sum.y + sum.z + sum.w;
}

int main() {
    // Sweep working set sizes from 1 MB to 256 MB
    size_t sizes[] = {
        1 << 20, 2 << 20, 4 << 20, 8 << 20, 16 << 20,
        32 << 20, 48 << 20, 50 << 20, 64 << 20, 128 << 20, 256 << 20
    };

    for (int s = 0; s < sizeof(sizes)/sizeof(sizes[0]); s++) {
        size_t bytes = sizes[s];
        int elements = bytes / sizeof(float4);
        int iterations = 100;

        float4* d_data;
        float* d_output;
        cudaMalloc(&d_data, bytes);
        cudaMalloc(&d_output, sizeof(float));

        // Warmup
        bandwidth_probe<<<256, 256>>>(d_data, d_output, elements, 10);
        cudaDeviceSynchronize();

        // Timed run
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        bandwidth_probe<<<256, 256>>>(d_data, d_output, elements, iterations);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        double total_bytes = (double)bytes * iterations;
        double bandwidth_gbps = total_bytes / (ms * 1e-3) / 1e9;

        printf("Working set: %6zu MB  Bandwidth: %8.1f GB/s\n",
               bytes >> 20, bandwidth_gbps);

        cudaFree(d_data);
        cudaFree(d_output);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }
    return 0;
}

# Expected output on H100:
# Working set:      1 MB  Bandwidth:  11842.3 GB/s
# Working set:      2 MB  Bandwidth:  11790.1 GB/s
# Working set:      4 MB  Bandwidth:  11650.8 GB/s
# Working set:      8 MB  Bandwidth:  11520.4 GB/s
# Working set:     16 MB  Bandwidth:  11200.7 GB/s
# Working set:     32 MB  Bandwidth:  10800.2 GB/s
# Working set:     48 MB  Bandwidth:   8900.5 GB/s
# Working set:     50 MB  Bandwidth:   7600.3 GB/s  ← L2 boundary
# Working set:     64 MB  Bandwidth:   4800.1 GB/s
# Working set:    128 MB  Bandwidth:   3350.8 GB/s  ← HBM limited
# Working set:    256 MB  Bandwidth:   3320.5 GB/s

Cache-Aware Kernel Design Patterns

Pattern 1: Tile to L2

Structure kernels so that each tile of work operates on a data footprint that fits in L2:

// Matrix multiply with L2-aware tiling
// C[M,N] = A[M,K] * B[K,N]
// Process K in chunks that fit in L2
// L2 = 50 MB. If A tile is M x TILE_K and B tile is TILE_K x N:
// Data footprint = M * TILE_K + TILE_K * N (floats)
// For M=N=4096, TILE_K = 50 MB / (2 * 4096 * 4 bytes) ≈ 1600

// But this is the global memory footprint.
// In practice, each SM only accesses a portion.
// The effective L2 working set is the union of all SMs' access patterns.

void tiled_gemm(float* C, const float* A, const float* B,
                int M, int N, int K) {
    int TILE_K = 1024;  // Chosen to keep A_tile + B_tile in L2
    for (int k = 0; k < K; k += TILE_K) {
        int k_size = min(TILE_K, K - k);
        gemm_kernel<<<grid, block>>>(C, A + k, B + k * N,
                                     M, N, K, k_size);
    }
}

Pattern 2: Persistent Kernels for L2 Reuse

A persistent kernel launches a fixed number of thread blocks (one per SM) that loop over work items. This keeps the thread blocks’ shared data resident in L2:

__global__ void persistent_attention(
    const float* __restrict__ Q,    // Queries
    const float* __restrict__ K,    // Keys (KV cache) — want this in L2
    const float* __restrict__ V,    // Values (KV cache)
    float* __restrict__ O,          // Output
    int num_heads, int seq_len, int head_dim) {

    // One block per SM, loops over heads
    int sm_id = blockIdx.x;
    int num_sms = gridDim.x;

    for (int head = sm_id; head < num_heads; head += num_sms) {
        // Each iteration accesses K and V for this head
        // If head_dim * seq_len fits in L2, K/V stay cached
        // across iterations on the same SM

        const float* q_head = Q + head * head_dim;
        const float* k_head = K + head * seq_len * head_dim;
        const float* v_head = V + head * seq_len * head_dim;

        // Compute attention for this head
        // ... (FlashAttention-style tiled computation)

        O[head * head_dim + threadIdx.x] = /* result */;
    }
}

// Launch with exactly num_SMs blocks to maximize L2 reuse
persistent_attention<<<132, 256>>>(Q, K, V, O, num_heads, seq_len, head_dim);

Pattern 3: Data Layout for Cache Line Alignment

// Misaligned structure: 20 bytes, crosses cache line boundaries
struct MisalignedData {
    float x, y, z;     // 12 bytes
    int flags;          // 4 bytes
    float weight;       // 4 bytes
};  // 20 bytes — not a power of 2, causes partial cache line usage

// Aligned structure: 32 bytes, exactly one sector per element
struct __align__(32) AlignedData {
    float x, y, z;     // 12 bytes
    int flags;          // 4 bytes
    float weight;       // 4 bytes
    float padding[3];   // 12 bytes padding
};  // 32 bytes — one sector per element, no wasted cache traffic

Pattern 4: Stream-and-Compute to Avoid L2 Pollution

For data that is accessed exactly once (streaming), bypass or minimize L2 footprint:

// Use L2 streaming hints (Ampere+)
// Load with cache bypass for streaming data
__global__ void stream_kernel(const float* __restrict__ stream_data,
                               const float* __restrict__ reuse_data,
                               float* __restrict__ output, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    // Load streaming data with cache-global (L2 only, not L1)
    float stream_val;
    asm volatile("ld.global.cg.f32 %0, [%1];"
                 : "=f"(stream_val) : "l"(stream_data + idx));

    // Load reuse data normally (cached in both L1 and L2)
    float reuse_val = reuse_data[idx % 1024];  // Small, reused array

    output[idx] = stream_val * reuse_val;
}

📊 L2 Cache Load Modifiers

| Modifier | PTX Instruction | L1 Cache | L2 Cache | Use Case |
| --- | --- | --- | --- | --- |
| Default (.ca) | ld.global.ca | Cached | Cached | General purpose, data reuse expected |
| Cache global (.cg) | ld.global.cg | Not cached | Cached | Data reused across SMs, not within SM |
| Cache streaming (.cs) | ld.global.cs | Not cached | Streaming (low priority) | Data accessed once, avoid L2 pollution |
| Last use (.lu) | ld.global.lu | Not cached | Evict after use | Data consumed exactly once |
| Cache volatile (.cv) | ld.global.cv | Not cached | Not cached | Bypass all caches (volatile data) |
Note: Cache modifiers are set at the PTX level. The CUDA compiler may not always emit the desired modifier from C++ source code. Use inline PTX for precise control.
ℹ️ Cache Modifiers Are Hints, Not Guarantees

The GPU hardware treats cache modifiers as hints. Under memory pressure, even “persisting” data may be evicted. The hardware prioritizes correctness over the hint. Similarly, “streaming” data may still occupy L2 lines temporarily. The benefit comes from the eviction priority: streaming data is evicted before persisting data when the cache is full.

Multi-Kernel L2 Interference

The Problem: Sequential Kernels Thrash L2

In a typical LLM inference pipeline:

  1. GEMM kernel (QKV projection): Reads weight matrix (hundreds of MB), writes activations.
  2. Attention kernel: Reads Q, K, V (KV cache may be 10-50 MB). Wants K/V in L2.
  3. GEMM kernel (output projection): Reads another weight matrix, evicts K/V from L2.
  4. FFN GEMM kernels: Two large GEMMs, further thrashing L2.

Each GEMM reads a weight matrix much larger than L2, evicting everything. The attention kernel that follows finds none of its KV cache in L2.

Solution: Operator Fusion and Scheduling

// Approach 1: Fuse attention + output projection
// Keep KV cache in L2 by not launching an intervening GEMM

// Approach 2: Use L2 residency control (Ampere+)
// Mark the KV cache as persistent before the GEMM
cudaStreamAttrValue kv_attr = {};
kv_attr.accessPolicyWindow.base_ptr  = kv_cache;
kv_attr.accessPolicyWindow.num_bytes = kv_cache_size;
kv_attr.accessPolicyWindow.hitRatio  = 1.0f;
kv_attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
kv_attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &kv_attr);

// GEMM kernel runs — its weight data gets streaming priority
// KV cache stays in L2 despite the GEMM traffic
gemm_kernel<<<...>>>(weights, activations, output);

// Attention kernel runs — KV cache is still in L2
attention_kernel<<<...>>>(Q, K, V, output);

Measuring L2 Interference

# Profile L2 hit rates with and without residency control
# Without:
ncu --metrics lts__t_sectors_srcunit_tex_op_read_lookup_hit.sum,\
              lts__t_sectors_srcunit_tex_op_read_lookup_miss.sum \
  -k attention_kernel ./inference_no_policy

# With:
ncu --metrics lts__t_sectors_srcunit_tex_op_read_lookup_hit.sum,\
              lts__t_sectors_srcunit_tex_op_read_lookup_miss.sum \
  -k attention_kernel ./inference_with_policy

# Expected: hit rate jumps from ~15% to ~85% with residency control

L2 Cache and Multi-GPU Considerations

On NVLink-connected GPUs (DGX H100), one GPU can access another GPU’s HBM via NVLink. These remote accesses do not go through the remote GPU’s L2 cache — they access remote HBM directly. However, the local GPU caches the remote data in its own L2:

// GPU 0 accesses GPU 1's memory:
// 1. GPU 0 L1 miss → GPU 0 L2 lookup (local)
// 2. GPU 0 L2 miss → NVLink request to GPU 1
// 3. GPU 1 HBM responds (bypasses GPU 1's L2)
// 4. Data stored in GPU 0's L2 (for future local reuse)
// Latency: ~2-3 microseconds (NVLink round trip)

MIG and L2 Partitioning

When MIG (Multi-Instance GPU) is enabled, the L2 cache is physically partitioned along with the SMs and memory controllers. Each MIG instance gets a dedicated slice of L2:

// H100 with MIG enabled:
// 7x 1g.10gb instances:
//   Each gets: ~7 MB L2, ~10 GB HBM, 16-19 SMs
// 1x 7g.80gb instance (full GPU):
//   Gets: 50 MB L2, 80 GB HBM, 132 SMs
// 2x 3g.40gb instances:
//   Each gets: ~21 MB L2, 40 GB HBM, ~44 SMs

The L2 partition is enforced by hardware. One MIG instance cannot pollute another’s L2 cache.

Summary

The L2 cache is the bandwidth multiplier for GPU workloads. Kernels that keep their working set under L2 capacity (50 MB on H100) run at 3-4x the bandwidth of kernels that must go to HBM. The practical strategies: size tiles to fit in L2; use persistent kernels to maintain L2 residency across iterations; apply cudaAccessPolicyWindow (Ampere and later) to protect hot data from eviction by streaming traffic; use cache modifiers (ld.global.cs) to mark single-use data as streaming; and measure L2 hit rates with Nsight Compute to validate that cache-aware designs actually improve performance.

For LLM inference specifically: the KV cache size relative to L2 capacity is the critical metric. If the KV cache fits in L2 (context_length × num_layers × num_kv_heads × head_dim × 2 × precision_bytes < 50 MB), attention kernels run at L2 bandwidth. If it does not, attention becomes HBM-bandwidth-limited — and the only solutions are reducing the KV cache (GQA, quantization) or hardware with a larger L2 (Blackwell's ~96 MB).