Part 8 of 30 in the series: GPU Hardware & AI Accelerators

The GPU memory hierarchy determines kernel performance more than any other factor. A kernel whose working set fits in L1 cache (up to 256 KB per SM on H100) runs at roughly 19 TB/s aggregate bandwidth across all SMs. The same kernel accessing data in L2 (50 MB on H100) runs at approximately 12 TB/s. Hit HBM (80 GB on H100), and you are at 3.35 TB/s. The 5.7x bandwidth gap between L1 and HBM means that understanding cache behavior (what hits, what misses, and why) directly determines whether a kernel achieves 20% or 90% of peak performance.

This post maps the complete GPU memory hierarchy, documents cache line sizes and eviction policies, analyzes how working set size affects effective bandwidth, and implements experiments to measure cache hit rates under different access patterns.

The Memory Hierarchy

Overview


GPU Memory Hierarchy (H100 SXM)

| Level | Size | Bandwidth | Latency (cycles) | Scope |
|---|---|---|---|---|
| Registers | 256 KB per SM | ~19 TB/s (aggregate) | 0 (same cycle) | Per-thread |
| Shared memory | Up to 228 KB per SM | ~19 TB/s (aggregate) | ~20-30 | Per-block |
| L1 cache | Up to 256 KB per SM (shared pool) | ~19 TB/s (aggregate) | ~30 | Per-SM |
| L2 cache | 50 MB | ~12 TB/s | ~200-400 | GPU-wide |
| HBM3 | 80 GB | 3.35 TB/s | ~400-800 | GPU-wide |
| PCIe / NVLink | N/A | 64 / 900 GB/s | ~10,000+ | System |
Note: Bandwidth drops ~6x from L1 to HBM. Latency increases ~20x. Every cache miss cascades down the hierarchy.
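To make the table concrete, here is a small sketch (my own helper, using the H100 figures above) converting each level's bandwidth into the time to stream a fixed payload:

```python
def stream_time_ms(bytes_moved, bandwidth_tb_s):
    """Time in ms to move bytes_moved at the given bandwidth (TB/s)."""
    return bytes_moved / (bandwidth_tb_s * 1e12) * 1e3

# 1 GiB payload streamed from each level of the hierarchy
payload = 1 << 30
for name, bw_tb_s in [("L1 (aggregate)", 19.0), ("L2", 12.0), ("HBM3", 3.35)]:
    print(f"{name:<15} {stream_time_ms(payload, bw_tb_s):7.3f} ms")

# The L1-to-HBM ratio is the 5.7x gap quoted in the introduction
print(f"L1/HBM ratio: {19.0 / 3.35:.1f}x")
```

The same payload takes roughly 5.7x longer from HBM than from L1, which is the entire motivation for cache-aware kernel design.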

L1 Cache and Shared Memory: The Configurable Pool

On Ampere and Hopper, the L1 cache and shared memory share a physical SRAM pool per SM. The programmer configures the split:

// Configure shared memory size for a kernel
// This implicitly sets the L1 cache size = pool_total - shared_memory
cudaFuncSetAttribute(
    my_kernel,
    cudaFuncAttributeMaxDynamicSharedMemorySize,
    164 * 1024  // Request 164 KB shared memory
);

// On H100: pool is ~256 KB per SM
// If shared = 164 KB, then L1 = 256 - 164 = 92 KB
// If shared = 0 KB, then L1 = 256 KB (maximum L1)
// If shared = 228 KB (max), then L1 = 28 KB (minimum L1)

def l1_shared_tradeoff(pool_size_kb=256, shared_kb=0):
    """Analyze L1 vs shared memory tradeoff."""
    l1_kb = pool_size_kb - shared_kb
    print(f"Pool: {pool_size_kb} KB")
    print(f"Shared memory: {shared_kb} KB")
    print(f"L1 cache: {l1_kb} KB")
    print()

    configs = [
        (0, "Maximum L1 (elementwise kernels, no shared mem needed)"),
        (48, "Default (backward compat, small tile GEMM)"),
        (100, "Medium shared (FlashAttention Q/K/V tiles)"),
        (164, "Large shared (large GEMM tiles, Triton)"),
        (228, "Maximum shared (CUTLASS large tiles, WGMMA)"),
    ]

    print(f"{'Shared (KB)':<15} {'L1 (KB)':<10} {'Use Case'}")
    for sm_kb, use_case in configs:
        l1 = pool_size_kb - sm_kb
        print(f"{sm_kb:<15} {l1:<10} {use_case}")

l1_shared_tradeoff()

L2 Cache: The Universal Intermediary

All traffic between SMs and HBM passes through the L2 cache. There is no way to bypass it (unlike CPU architectures where non-temporal stores can bypass L2).

def l2_cache_analysis(l2_size_mb=50, num_sms=132):
    """Analyze L2 cache characteristics on H100."""
    l2_bytes = l2_size_mb * 1024 * 1024

    # L2 is partitioned into slices, one per memory controller
    # H100: 5 HBM3 stacks, each with 2 channels = 10 memory controllers
    num_mc = 10
    l2_per_mc = l2_bytes / num_mc

    print(f"L2 total: {l2_size_mb} MB")
    print(f"Memory controllers: {num_mc}")
    print(f"L2 per memory controller: {l2_per_mc / 1024:.0f} KB")
    print()

    # Cache line: 128 bytes
    cache_line = 128
    num_lines = l2_bytes // cache_line
    print(f"Cache line size: {cache_line} bytes")
    print(f"Total cache lines: {num_lines:,}")
    print(f"Associativity: ~16-32 way (architecture-dependent)")
    print()

    # What fits in L2?
    model_sizes = {
        "Llama-2 7B weights (FP16)": 14e9,
        "Llama-2 7B weights (INT4)": 3.5e9,
        "KV cache (1K tokens, 7B)": 0.5e9,
        "KV cache (4K tokens, 7B)": 2e9,
        "Activation buffer (bs=1, 4096 hidden)": 4096 * 2,
        "Attention scores (32 heads, 4K tokens)": 32 * 4096 * 4096 * 2,
    }

    print("What fits in 50 MB L2?")
    for name, size_bytes in model_sizes.items():
        fits = "Yes" if size_bytes <= l2_bytes else "No"
        print(f"  {name}: {size_bytes/1e6:.1f} MB -> {fits}")

l2_cache_analysis()

Cache Line Behavior

128-Byte Cache Lines

Both L1 and L2 use 128-byte cache lines. A cache line is the minimum unit of transfer between memory levels.

def cache_line_analysis():
    """Analyze cache line access patterns."""
    cache_line_bytes = 128

    # FP16: 2 bytes per element
    # One cache line holds 64 FP16 elements
    fp16_per_line = cache_line_bytes // 2
    print(f"FP16 elements per cache line: {fp16_per_line}")

    # FP32: 4 bytes per element
    fp32_per_line = cache_line_bytes // 4
    print(f"FP32 elements per cache line: {fp32_per_line}")

    # INT8: 1 byte per element
    int8_per_line = cache_line_bytes // 1
    print(f"INT8 elements per cache line: {int8_per_line}")
    print()

    # A warp of 32 threads reading consecutive FP16 values
    # reads 32 * 2 = 64 bytes = 0.5 cache lines
    # The hardware fetches 1 full cache line (128 bytes)
    # Utilization: 64/128 = 50%
    #
    # A warp of 32 threads reading consecutive FP32 values
    # reads 32 * 4 = 128 bytes = exactly 1 cache line
    # Utilization: 128/128 = 100%

    warp_size = 32
    for dtype, elem_size in [("FP16", 2), ("FP32", 4), ("INT8", 1)]:
        bytes_per_warp = warp_size * elem_size
        lines_needed = (bytes_per_warp + cache_line_bytes - 1) // cache_line_bytes
        utilization = bytes_per_warp / (lines_needed * cache_line_bytes)
        print(f"Warp reading consecutive {dtype}: "
              f"{bytes_per_warp} bytes, "
              f"{lines_needed} cache line(s), "
              f"{utilization*100:.0f}% utilization")

cache_line_analysis()

Cache Eviction: LRU-Like Policy

def cache_eviction_behavior():
    """Explain GPU cache eviction policies."""
    policies = {
        "L1 (data cache)": {
            "policy": "LRU-like with sector tracking",
            "details": "L1 tracks 128-byte lines but loads in "
                      "32-byte sectors. A line with no valid sectors "
                      "is evicted first.",
            "write_policy": "Write-through to L2 (L1 is not write-back "
                          "for global memory)",
        },
        "L2": {
            "policy": "LRU-like with multiple replacement policies",
            "details": "L2 supports different eviction hints: "
                      "normal (LRU), streaming (evict-first), "
                      "persistent (evict-last).",
            "write_policy": "Write-back to HBM (dirty lines are "
                          "written back on eviction)",
        },
    }

    for level, info in policies.items():
        print(f"\n{level}:")
        for key, val in info.items():
            print(f"  {key}: {val}")

cache_eviction_behavior()

L2 Cache Residency Control (Ampere and Hopper)

Persistent L2 Cache Lines

Ampere introduced the ability to mark a region of L2 as "persistent": these cache lines are evicted last, effectively reserving a portion of L2 for frequently accessed data. Hopper carries the same mechanism forward with its larger 50 MB L2.

// Set L2 persistence for a memory region
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);

// Maximum L2 that can be set aside for persistence
size_t l2_persist_max = prop.persistingL2CacheMaxSize;
// On H100: typically 40 MB of the 50 MB L2

// Set the persistent window size
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize,
                    40 * 1024 * 1024);  // 40 MB

// Mark a specific allocation as persistent
cudaStreamAttrValue stream_attr;
stream_attr.accessPolicyWindow.base_ptr = kv_cache_ptr;
stream_attr.accessPolicyWindow.num_bytes = kv_cache_size;
stream_attr.accessPolicyWindow.hitRatio = 1.0;  // Keep all hits
stream_attr.accessPolicyWindow.hitProp =
    cudaAccessPropertyPersisting;   // Persist in L2
stream_attr.accessPolicyWindow.missProp =
    cudaAccessPropertyStreaming;    // Evict quickly

cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow,
                        &stream_attr);

Use Case: KV Cache in L2

def kv_cache_l2_analysis():
    """Analyze per-layer KV cache residency in L2.

    Attention touches one layer's K/V at a time, so the working set
    that matters for L2 residency is the per-layer KV cache:
    2 (K+V) * batch * kv_heads * seq_len * head_dim * dtype_bytes
    """
    configs = [
        # (model, layers, kv_heads, head_dim, batch, seq_len)
        ("Llama-2 7B", 32, 32, 128, 1, 512),
        ("Llama-2 7B", 32, 32, 128, 1, 2048),
        ("Llama-2 7B", 32, 32, 128, 1, 8192),
        ("Llama-2 7B", 32, 32, 128, 32, 2048),
        ("Llama-2 70B", 80, 8, 128, 1, 2048),  # GQA: 8 KV heads
    ]

    l2_size = 50 * 1024 * 1024  # 50 MB

    print(f"{'Config':<45} {'Per-layer KV':<14} {'Fits L2?'}")
    for model, layers, kv_heads, head_dim, batch, seq in configs:
        kv_bytes = 2 * batch * kv_heads * seq * head_dim * 2  # FP16
        fits = "Yes" if kv_bytes <= l2_size else "No"
        config_str = f"{model}, bs={batch}, seq={seq}"
        print(f"{config_str:<45} {kv_bytes/1e6:>10.1f} MB  {fits}")

kv_cache_l2_analysis()
KV Cache Size (per Layer) vs L2 Cache Capacity (FP16)

| Model | Batch | Seq Len | Per-Layer KV Cache | Fits H100 L2 (50 MB)? |
|---|---|---|---|---|
| Llama-2 7B (MHA, 32 KV heads) | 1 | 512 | 8.4 MB | Yes |
| Llama-2 7B (MHA, 32 KV heads) | 1 | 2048 | 33.6 MB | Yes |
| Llama-2 7B (MHA, 32 KV heads) | 1 | 8192 | 134.2 MB | No |
| Llama-2 7B (MHA, 32 KV heads) | 32 | 2048 | 1073.7 MB | No |
| Llama-2 70B (GQA, 8 KV heads) | 1 | 2048 | 8.4 MB | Yes |

Note: The per-layer KV cache fits in L2 only for short contexts and small batches. For Llama-2 70B with GQA (8 KV heads), the per-layer KV cache is 4x smaller than with 32 heads, making L2 residency feasible for moderate contexts.
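The same per-layer accounting answers a useful inverse question: given a KV-head count, how long a context fits in the 50 MB L2? A hypothetical helper (names my own; FP16 and head_dim 128 assumed):

```python
def max_seq_in_l2(kv_heads, head_dim=128, batch=1, dtype_bytes=2,
                  l2_bytes=50 * 1024 * 1024):
    """Largest context length whose per-layer K+V fit in L2."""
    bytes_per_token = 2 * batch * kv_heads * head_dim * dtype_bytes  # K+V
    return l2_bytes // bytes_per_token

print(max_seq_in_l2(kv_heads=32))  # MHA: 3200 tokens
print(max_seq_in_l2(kv_heads=8))   # GQA: 12800 tokens, 4x longer
```

Cutting KV heads 4x (GQA) extends the L2-resident context length by exactly 4x, which is why GQA interacts so well with L2 persistence.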

Measuring Cache Behavior

Working Set Size Experiment

// Measure effective bandwidth at different working set sizes
// to determine L1, L2, and HBM bandwidth

__global__ void bandwidth_test(
    const float* __restrict__ data,
    float* __restrict__ output,
    int working_set_elements,
    int total_accesses
) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    float sum = 0.0f;
    for (int i = tid; i < total_accesses; i += stride) {
        // Access within the working set (wrap around)
        int idx = i % working_set_elements;
        sum += data[idx];
    }

    // Prevent dead code elimination
    if (tid == 0) {
        output[0] = sum;
    }
}
import torch

def measure_bandwidth_vs_working_set():
    """Measure effective read bandwidth at different working set sizes.

    Small working set (fits in L1): high bandwidth
    Medium working set (fits in L2): moderate bandwidth
    Large working set (HBM): lowest bandwidth
    """
    device = 'cuda'
    iters = 10

    # Working set sizes from 4 KB to 1 GB
    working_set_sizes_kb = [
        4, 8, 16, 32, 64, 128, 256, 512,
        1024, 2048, 4096, 8192, 16384, 32768,
        65536, 131072, 262144, 524288, 1048576
    ]

    results = []
    for ws_kb in working_set_sizes_kb:
        ws_elements = ws_kb * 1024 // 4  # FP32, 4 bytes per element
        data = torch.randn(ws_elements, device=device, dtype=torch.float32)

        # Warmup: also pulls the working set into cache where it fits
        for _ in range(3):
            data.sum()

        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)

        # Approximate the CUDA kernel above by re-reading the full
        # working set in a loop with a reduction
        start.record()
        for _ in range(iters):
            data.sum()
        end.record()

        torch.cuda.synchronize()
        elapsed_ms = start.elapsed_time(end) / iters

        bytes_read = ws_elements * 4  # One full pass per iteration
        bandwidth_gb_s = (bytes_read / 1e9) / (elapsed_ms / 1000)

        results.append((ws_kb, bandwidth_gb_s))

        if ws_kb in [16, 128, 1024, 16384, 262144, 1048576]:
            print(f"WS={ws_kb:>8d} KB: {bandwidth_gb_s:>8.1f} GB/s")

    return results

# Expected shape of the curve (note: for the smallest working sets,
# kernel launch overhead dominates, so the CUDA kernel above gives
# cleaner numbers there):
# - WS < ~200 KB: L1-resident (highest measured bandwidth)
# - WS 200 KB - 50 MB: L2-resident (~12 TB/s aggregate)
# - WS > 50 MB: HBM-bound (~3.35 TB/s)

Detecting Cache Boundaries

def identify_cache_boundaries(results):
    """Identify L1/L2/HBM boundaries from bandwidth measurements."""
    prev_bw = results[0][1]

    for i in range(1, len(results)):
        ws_kb, bw = results[i]
        ratio = bw / prev_bw

        if ratio < 0.7 and prev_bw > 5000:
            print(f"L1 -> L2 boundary: ~{results[i-1][0]} KB")
        elif ratio < 0.5 and prev_bw > 2000:
            print(f"L2 -> HBM boundary: ~{results[i-1][0]} KB "
                  f"({results[i-1][0]/1024:.0f} MB)")

        prev_bw = bw

Effective Bandwidth vs Working Set Size (H100, per-SM)

| Working Set | Regime | Bandwidth (GB/s per SM) |
|---|---|---|
| 16 KB | L1 hit | 144 |
| 128 KB | L1 | 138 |
| 256 KB | L1 edge (misses begin) | 105 |
| 1 MB | L2 | 85 |
| 16 MB | L2 | 78 |
| 50 MB | L2 edge (misses begin) | 52 |
| 256 MB | HBM | 25 |
| 1 GB | HBM | 25 |
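As a sanity check, these per-SM figures roughly reconcile with the aggregate numbers used elsewhere in this post, assuming all 132 SMs stream concurrently (a simplification):

```python
def aggregate_tb_s(per_sm_gb_s, num_sms=132):
    """Chip-wide aggregate bandwidth from a per-SM figure (H100: 132 SMs)."""
    return per_sm_gb_s * num_sms / 1000

print(f"{aggregate_tb_s(144):.1f} TB/s")  # ~19 TB/s, the L1 aggregate figure
print(f"{aggregate_tb_s(25):.1f} TB/s")   # ~3.3 TB/s, close to HBM's 3.35
```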

Access Pattern Effects on Cache Performance

Sequential vs Strided Access

// Sequential access: maximum spatial locality
__global__ void sequential_access(
    const float* data, float* output, int n
) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    float sum = 0.0f;
    for (int i = tid; i < n; i += stride) {
        sum += data[i];  // Sequential: data[0], data[1], ...
    }
    output[tid] = sum;
}

// Strided access: cache line waste
__global__ void strided_access(
    const float* data, float* output, int n, int stride_factor
) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int grid_stride = gridDim.x * blockDim.x;

    float sum = 0.0f;
    for (int i = tid; i < n / stride_factor; i += grid_stride) {
        // Each access skips stride_factor elements
        // Only 1/stride_factor of each cache line is useful
        sum += data[i * stride_factor];
    }
    output[tid] = sum;
}
def analyze_strided_access():
    """Quantify cache line waste from strided access."""
    cache_line = 128  # bytes
    element_size = 4  # FP32
    elements_per_line = cache_line // element_size  # 32

    strides = [1, 2, 4, 8, 16, 32, 64, 128]

    print(f"{'Stride':<10} {'Useful bytes/line':<20} "
          f"{'Utilization':<15} {'Effective BW'}")
    for stride in strides:
        # If stride >= elements_per_line, every access fetches a new line
        # with only element_size useful bytes
        useful = min(element_size * (elements_per_line // stride),
                     cache_line)
        useful = max(useful, element_size)
        utilization = useful / cache_line
        # Assume 3350 GB/s HBM
        effective_bw = 3350 * utilization

        print(f"{stride:<10} {useful:<20} {utilization*100:>6.1f}%"
              f"        {effective_bw:>7.0f} GB/s")

analyze_strided_access()
Access Stride vs Effective Bandwidth (H100, FP32, from HBM)

| Stride (elements) | Cache Line Utilization | Effective BW (GB/s) | Relative |
|---|---|---|---|
| 1 (sequential) | 100% | 3350 | 1.00x |
| 2 | 50% | 1675 | 0.50x |
| 4 | 25% | 838 | 0.25x |
| 8 | 12.5% | 419 | 0.125x |
| 16 | 6.25% | 209 | 0.063x |
| 32 (one per line) | 3.1% | 105 | 0.031x |

Note: Stride-32 access (one FP32 per cache line) wastes 97% of HBM bandwidth. This is the performance cliff that uncoalesced accesses hit.

Random Access Pattern

def random_access_analysis(l2_size_mb=50, cache_line=128):
    """Analyze random access pattern cache behavior."""
    l2_bytes = l2_size_mb * 1024 * 1024
    l2_lines = l2_bytes // cache_line

    # Random access to a working set:
    # If working_set_lines < l2_lines, most accesses hit L2
    # If working_set_lines > l2_lines, most accesses miss L2

    working_sets_mb = [1, 5, 10, 25, 50, 100, 500, 1000]

    print(f"{'WS (MB)':<10} {'WS Lines':<12} {'L2 Lines':<10} "
          f"{'Est Hit Rate'}")
    for ws_mb in working_sets_mb:
        ws_bytes = ws_mb * 1024 * 1024
        ws_lines = ws_bytes // cache_line

        # Simplified model: if WS fits in L2, ~100% hit rate
        # If WS is 2x L2, ~50% hit rate (LRU model)
        if ws_lines <= l2_lines:
            hit_rate = min(1.0, l2_lines / max(ws_lines, 1))
        else:
            hit_rate = l2_lines / ws_lines

        print(f"{ws_mb:<10} {ws_lines:<12,} {l2_lines:<10,} "
              f"{hit_rate*100:>6.1f}%")

random_access_analysis()
โ„น๏ธ Attention Computes Random-Access the KV Cache

During attention decode, each new token attends to all previous positions in the KV cache. The access pattern is effectively random within the KV cache region. If each layer's KV cache fits in L2 (short context, small batch), attention runs at L2 bandwidth. If it exceeds L2, attention drops to HBM bandwidth. This is why longer contexts are disproportionately slower: it is not just the O(N) compute, but the L2 miss rate.
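The cliff can be sketched with a toy latency model (my own construction, using the bandwidth figures from earlier and Llama-2 7B shapes; real kernels add overheads this ignores): per decoded token, attention reads the full KV cache once, at L2 bandwidth if each layer's K/V fit in L2, else at HBM bandwidth.

```python
def decode_attn_read_us(seq_len, layers=32, kv_heads=32, head_dim=128,
                        batch=1, dtype_bytes=2,
                        l2_bytes=50e6, l2_tb_s=12.0, hbm_tb_s=3.35):
    """Per-token KV-cache read time under a bandwidth-only model."""
    per_layer = 2 * batch * kv_heads * seq_len * head_dim * dtype_bytes
    # L2 bandwidth if one layer's K/V fit in L2, else HBM bandwidth
    bw_tb_s = l2_tb_s if per_layer <= l2_bytes else hbm_tb_s
    total_bytes = per_layer * layers  # Every layer's KV is read once
    return total_bytes / (bw_tb_s * 1e12) * 1e6, bw_tb_s

for seq in (512, 2048, 8192):
    t_us, bw = decode_attn_read_us(seq)
    print(f"seq={seq:5d}: {t_us:7.1f} us @ {bw} TB/s")
```

Going from 2048 to 8192 tokens is a 4x data increase but roughly a 14x time increase in this model, because the per-layer working set also falls out of L2.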

Practical Optimization Strategies

Strategy 1: Minimize Working Set

def working_set_optimization():
    """Strategies to keep the working set in faster cache."""
    strategies = {
        "Quantization (INT4/FP8)": {
            "effect": "4x/2x smaller weights -> 4x/2x more fits in L2",
            "example": "70B model INT4 weights: 35 GB. Per-layer: "
                      "~440 MB -> still exceeds L2, but sub-layer "
                      "tiles may fit",
        },
        "Tiling": {
            "effect": "Process data in tiles that fit in L1/shared memory",
            "example": "FlashAttention processes attention in "
                      "256-token blocks that fit in shared memory",
        },
        "GQA (Grouped Query Attention)": {
            "effect": "Fewer KV heads -> smaller KV cache -> "
                     "more likely to fit in L2",
            "example": "Llama-2 70B: 8 KV heads vs 64 Q heads -> "
                      "8x smaller KV cache",
        },
        "Operator fusion": {
            "effect": "Keep intermediate results in registers/shared mem "
                     "instead of writing to HBM",
            "example": "Fused bias+GELU: intermediate tensor never "
                      "touches HBM",
        },
    }

    for name, info in strategies.items():
        print(f"\n{name}:")
        print(f"  Effect: {info['effect']}")
        print(f"  Example: {info['example']}")

working_set_optimization()

Strategy 2: Software Prefetching

// Use cp.async to prefetch data before it is needed
// This hides HBM latency by overlapping load with compute

__global__ void prefetched_kernel(
    const float* __restrict__ data,
    float* __restrict__ output,
    int n,
    int iterations
) {
    extern __shared__ float smem[];
    int tid = threadIdx.x;
    int block_offset = blockIdx.x * blockDim.x;

    // Prefetch first tile
    if (block_offset + tid < n) {
        // cp.async (Ampere+): asynchronous 4-byte copy global -> shared.
        // The shared operand must be a shared-state-space address.
        unsigned smem_addr =
            static_cast<unsigned>(__cvta_generic_to_shared(&smem[tid]));
        asm volatile("cp.async.ca.shared.global [%0], [%1], 4;\n"
            :: "r"(smem_addr),
               "l"(&data[block_offset + tid]));
    }
    asm volatile("cp.async.commit_group;\n");
    asm volatile("cp.async.wait_group 0;\n");
    __syncthreads();

    // Process the tile from shared memory; a multi-tile kernel would
    // issue the next tile's cp.async here before computing
    float result = smem[tid] * 2.0f;  // Example computation
    output[block_offset + tid] = result;
}
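How much the overlap buys can be sketched with a toy pipeline model (hypothetical numbers, not measurements): without prefetch, each tile pays load plus compute serially; with double buffering, only the first load and the last compute are exposed.

```python
def pipeline_time(load_us, compute_us, tiles):
    """Serial vs double-buffered total time for a tiled kernel."""
    serial = tiles * (load_us + compute_us)
    # Double buffering: first load and last compute are exposed;
    # every other tile runs at max(load, compute)
    overlapped = load_us + (tiles - 1) * max(load_us, compute_us) + compute_us
    return serial, overlapped

s, o = pipeline_time(load_us=10, compute_us=8, tiles=100)
print(f"serial: {s} us, overlapped: {o} us")  # 1800 vs 1008
```

When load and compute are balanced, prefetching approaches a 2x speedup; when one side dominates, the total time degenerates to the dominant side alone.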

Summary

The GPU memory hierarchy spans five levels (registers, shared memory, L1, L2, HBM) with a 5.7x bandwidth gap between the fastest (L1 at ~19 TB/s aggregate) and slowest (HBM at 3.35 TB/s). The cache line size is 128 bytes at both L1 and L2. Access patterns that waste cache lines (strided or random) can reduce effective bandwidth by 8-32x. Working set size determines which cache level serves the data: under ~200 KB hits L1, under 50 MB (H100) hits L2, everything else goes to HBM.

For LLM inference, the key cache behavior is the KV cache during attention. Short contexts with small batches keep each layer's KV in L2 (high bandwidth). Long contexts exceed L2 and fall to HBM bandwidth; this is the primary reason per-token latency increases with context length beyond a threshold. Optimization strategies focus on reducing the effective working set: quantization (smaller data), tiling (process in cache-sized chunks), and GQA (fewer KV heads).