GPU Memory Hierarchy: Why It Matters for Inference
An H100 GPU has 989 TFLOPS of FP16 compute but only 3.35 TB/s of memory bandwidth. During autoregressive decode, the GPU spends most of its time waiting for weights to arrive from HBM, not multiplying them — the arithmetic units sit idle while memory channels saturate. Understanding this memory hierarchy is how you reason about every inference optimization that follows.
GPU Memory Hierarchy (H100 SXM)
HBM (High Bandwidth Memory) is the GPU’s main memory — 80 GB on an H100. This is where model weights and KV cache live. At 3.35 TB/s, it’s fast compared to CPU RAM (50 GB/s), but slow compared to on-chip SRAM (33 TB/s).
SRAM is the on-chip scratchpad memory — only ~50 MB total but 10x faster than HBM. This is where FlashAttention does its work (tiling computation to stay in SRAM).
Tensor Cores are specialized matrix multiply units that compute small fixed-size matrix-tile products (e.g., 16x16 tiles) in hardware. An H100 delivers 989 TFLOPS of FP16 compute with tensor cores, but only if you can feed them data fast enough — which brings us to the roofline model.
The Roofline Model: Memory-Bound vs Compute-Bound
The roofline model is the single most important mental model for understanding LLM inference. It tells you whether your workload is limited by compute speed or memory bandwidth.
Arithmetic intensity = FLOPs performed / bytes moved from memory. This is measured in FLOP/byte.
For any workload:
- If arithmetic intensity is LOW (few FLOPs per byte loaded): you’re memory-bandwidth-bound. The GPU finishes computing before the next batch of data arrives from HBM.
- If arithmetic intensity is HIGH (many FLOPs per byte loaded): you’re compute-bound. Data arrives faster than the GPU can process it.
The crossover point is the ridge point: Peak TFLOPS / HBM bandwidth. For H100: 989 TFLOPS / 3.35 TB/s ≈ 295 FLOP/byte.
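The ridge-point arithmetic can be checked in a few lines. This is a sketch using the spec numbers quoted in the text; `attainable_tflops` is an illustrative helper, not a library function:

```python
# Roofline arithmetic for an H100 SXM (spec numbers from the text).
PEAK_FLOPS = 989e12  # FP16 tensor-core peak, FLOP/s
HBM_BW = 3.35e12     # HBM bandwidth, bytes/s

ridge_point = PEAK_FLOPS / HBM_BW  # FLOP/byte; ~295 for these specs

def attainable_tflops(arithmetic_intensity):
    """Roofline: attainable throughput is the minimum of the compute roof
    and the memory roof (bandwidth x arithmetic intensity)."""
    return min(PEAK_FLOPS, HBM_BW * arithmetic_intensity) / 1e12

print(f"Ridge point: {ridge_point:.0f} FLOP/byte")
print(f"At 1 FLOP/byte (decode):    {attainable_tflops(1):.2f} TFLOPS")
print(f"At 500 FLOP/byte (prefill): {attainable_tflops(500):.0f} TFLOPS")
```

At 1 FLOP/byte the attainable throughput is only ~3.35 TFLOPS out of 989, which is where the "0.3% of compute capability" figure below comes from.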
Roofline Model: H100 SXM (attainable performance as % of peak GPU utilization)
Key insight: LLM decode at batch=1 has arithmetic intensity of ~1 FLOP/byte — meaning the GPU computes 1 floating-point operation for each byte loaded from memory. The H100’s ridge point is 295 FLOP/byte. So decode at batch=1 uses ~0.3% of the GPU’s compute capability. The GPU sits idle 99.7% of the time, waiting for data.
This is why every inference optimization in this series exists: they all try to increase arithmetic intensity (batching), reduce bytes moved (quantization), or generate more tokens per memory load (speculative decoding).
What a Transformer Does
A Large Language Model (LLM) like Llama, GPT-4, or Claude is a transformer — a neural network that predicts the next token in a sequence. The term autoregressive means “generates one token at a time, feeding each generated token back as input for the next step.”
Input: a sequence of tokens (integers representing words/subwords). Output: a probability distribution over the vocabulary for the next token.
The Attention Mechanism (Simplified)
Each token produces three vectors from its representation:
- Query (Q): “What am I looking for?” — represents this token’s current needs
- Key (K): “What do I contain?” — represents this token’s identity/content
- Value (V): “What information do I carry?” — the actual content to aggregate
Attention computes: for each token’s Query, compare it against every previous token’s Key (dot product). The resulting scores determine how much of each previous token’s Value to incorporate. High score = “this previous token is relevant to me.”
The attention formula: Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V, where d_k is the per-head dimension.
This is O(n²) in sequence length — every token attends to every previous token. For a 4,096-token sequence, that’s 16 million attention score computations per head per layer.
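The formula above can be sketched in a few lines of NumPy for a single head, with a causal mask so each token only attends to itself and earlier positions. Function names and shapes here are illustrative, not from any library:

```python
import numpy as np

def attention(Q, K, V):
    """Single-head causal attention. Q, K, V: (seq_len, d_head)."""
    d_head = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_head)          # (seq, seq) similarity scores
    # Causal mask: token i may only attend to tokens 0..i.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    # Softmax over keys, numerically stabilized by subtracting the row max.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V                          # weighted sum of Values

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 128))
K = rng.standard_normal((4, 128))
V = rng.standard_normal((4, 128))
out = attention(Q, K, V)
print(out.shape)  # (4, 128)
```

Note that the first token can only attend to itself, so its output is exactly its own Value vector — a handy sanity check for the causal mask.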
Multi-Head and Grouped Query Attention
Instead of one attention computation, transformers use multiple heads (64 for Llama 70B). Each head attends to different aspects of the input (syntax, semantics, entity tracking, etc.).
Multi-Head Attention (MHA): Each head has its own Q, K, and V. 64 heads = 64 independent attention computations.
Grouped Query Attention (GQA): An optimization where multiple Query heads share the same Key and Value heads. Llama 70B uses 64 Q heads but only 8 KV heads — each KV head is shared among 8 Q heads. This reduces the KV cache by 8x with minimal quality loss.
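The head-sharing bookkeeping can be sketched as follows. The head counts match the Llama 70B figures above; everything else (names, sequence length) is illustrative:

```python
import numpy as np

n_heads, n_kv_heads, d_head, seq = 64, 8, 128, 16
group_size = n_heads // n_kv_heads  # 8 Q heads share each KV head

rng = np.random.default_rng(0)
q = rng.standard_normal((n_heads, seq, d_head))
# Only 8 distinct K and V tensors are ever stored — this is the 8x
# KV cache saving relative to full MHA (which would store 64 of each).
k = rng.standard_normal((n_kv_heads, seq, d_head))
v = rng.standard_normal((n_kv_heads, seq, d_head))

# Expand KV heads so each Q head sees its group's shared K/V:
# Q head i uses KV head i // group_size.
k_expanded = np.repeat(k, group_size, axis=0)  # (64, seq, d_head)
v_expanded = np.repeat(v, group_size, axis=0)

assert np.array_equal(k_expanded[9], k[9 // group_size])
print(k_expanded.shape)
```

In real kernels the expansion is usually done implicitly (by indexing, not copying), so memory traffic stays proportional to the 8 stored KV heads.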
The Full Layer
A transformer is a stack of identical layers (80 layers for Llama 70B). Each layer has two sublayers:
- Attention: Each token looks at all previous tokens and computes a weighted sum (as described above)
- FFN (Feed-Forward Network): Each token independently passes through two large matrix multiplies with a nonlinearity (SwiGLU). This is where factual knowledge is stored — the FFN acts as a key-value memory.
Both sublayers use residual connections: the output is added to the input, not replacing it. This is critical for gradient flow during training.
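The two-sublayer structure with residual connections can be sketched as follows. This assumes pre-norm placement (normalize, then sublayer, then add), as in Llama; the attention and FFN bodies are stubbed out since they are covered above:

```python
import numpy as np

d_model = 8
rng = np.random.default_rng(0)
x = rng.standard_normal((4, d_model))  # (tokens, d_model)

def rms_norm(h):
    # RMSNorm: scale each token vector by its root-mean-square.
    return h / np.sqrt((h ** 2).mean(axis=-1, keepdims=True) + 1e-6)

def attn(h):  # placeholder for the attention sublayer
    return 0.1 * h

def ffn(h):   # placeholder for the SwiGLU FFN sublayer
    return 0.1 * np.maximum(h, 0)

# Residual connections: each sublayer's output is ADDED to its input,
# so the layer refines the representation rather than replacing it.
x = x + attn(rms_norm(x))
x = x + ffn(rms_norm(x))
print(x.shape)  # (4, 8)
```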
The Shapes That Matter for Inference
Key Dimensions (Llama 3 70B)
| Dimension | Symbol | Value | What It Is |
|---|---|---|---|
| Model dimension | d_model | 8192 | Width of each token's representation vector |
| Attention heads | n_heads | 64 | Number of parallel attention computations per layer |
| KV heads | n_kv_heads | 8 | GQA: 8 KV heads shared among 64 Q heads (8x memory savings) |
| Head dimension | d_head | 128 | d_model / n_heads — size of each head's Q/K/V vectors |
| FFN hidden dim | d_ff | 28672 | 3.5x d_model (SwiGLU gate/up/down projections) |
| Vocab size | V | 128256 | Number of possible tokens the model can output |
| Layers | L | 80 | Number of transformer layers stacked |
| Total parameters | N | ~70B | All weights stored in GPU HBM — 140 GB at FP16 |
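The dimensions in this table are enough to recover the ~70B total from first principles. A back-of-envelope sketch, assuming Llama-style Q/K/V/O attention projections, a three-matrix SwiGLU FFN, and separate input-embedding and LM-head matrices:

```python
d_model, n_heads, n_kv_heads, d_head = 8192, 64, 8, 128
d_ff, vocab, n_layers = 28672, 128256, 80

assert d_head == d_model // n_heads

# Per-layer attention weights: Q and O are full-width, K and V are
# GQA-sized (8 KV heads instead of 64).
attn_params = d_model * (n_heads * d_head)            # Q projection
attn_params += 2 * d_model * (n_kv_heads * d_head)    # K and V projections
attn_params += (n_heads * d_head) * d_model           # output projection

ffn_params = 3 * d_model * d_ff                       # gate, up, down

embed_params = 2 * vocab * d_model                    # embedding + LM head

total = n_layers * (attn_params + ffn_params) + embed_params
print(f"~{total / 1e9:.1f}B parameters, ~{2 * total / 1e9:.0f} GB at FP16")
```

This lands within about 1% of the quoted ~70B / 140 GB figures (small norm and rotary-embedding terms are omitted).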
KV Cache: Why It Exists
During generation, the model produces tokens one at a time (autoregressive). At each step, the attention mechanism needs ALL previous tokens’ Key and Value vectors. Without caching, you’d recompute K and V for all previous tokens every step — O(n²) total work for n tokens.
KV cache: Store the K and V vectors from all previous tokens in GPU memory. Each new step only computes K, V for the new token and appends to the cache. Attention then reads the full cache but only computes Q for the new token.
Cache size per token per layer: 2 (K and V) × 8 KV heads × 128 head dim × 2 bytes (FP16) = 4 KB (Llama 70B).
Across 80 layers: 80 × 4 KB = 320 KB per token.
At batch=32 with 4096 tokens per sequence: 32 × 4096 × 320 KB ≈ 40 GB of KV cache alone — half of an H100’s 80 GB memory. This is why KV cache management is the central problem in inference optimization, and why later posts in this series cover techniques like paged memory allocation and cache quantization.
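The cache-sizing arithmetic above, in code (Llama 70B numbers from the text):

```python
n_kv_heads, d_head, n_layers = 8, 128, 80
bytes_per_elem = 2  # FP16

# K and V vectors for one token in one layer.
per_token_per_layer = 2 * n_kv_heads * d_head * bytes_per_elem  # 4 KB

# Across all 80 layers.
per_token = per_token_per_layer * n_layers                      # 320 KB

batch, seq_len = 32, 4096
total_gb = per_token * batch * seq_len / 2**30
print(f"{per_token // 1024} KB/token, {total_gb:.0f} GB total")  # 320 KB/token, 40 GB total
```

The same three multipliers explain why GQA (fewer KV heads), shorter contexts, and KV cache quantization (fewer bytes per element) all attack the same term.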
Prefill vs Decode: Two Different Bottlenecks
LLM inference has two distinct phases with different performance characteristics:
The Two Phases of LLM Inference
Prefill processes all prompt tokens simultaneously as a single large matrix multiply. With 1000+ tokens, the arithmetic intensity is high enough to saturate tensor cores. GPU utilization: 60-80%. This is compute-bound.
Decode generates one token at a time. Each forward pass loads ALL 140 GB of model weights from HBM to compute just one token’s worth of output. Arithmetic intensity: ~1 FLOP/byte. At the H100’s ridge point of 295 FLOP/byte, decode is ~295x below the compute ceiling — purely memory-bandwidth-limited.
The math: At batch=1, decode throughput ≈ HBM bandwidth / model size = 3.35 TB/s ÷ 140 GB ≈ 24 tokens/second. At batch=64, you amortize the weight loading: roughly 24 × 64 ≈ 1,500 tokens/second aggregate.
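The same estimate in code. This is an idealized model that ignores KV-cache reads (which eventually dominate at large batch and long context), so treat it as an upper bound:

```python
HBM_BW = 3.35e12     # bytes/s
MODEL_BYTES = 140e9  # Llama 70B at FP16

# batch=1: every generated token requires streaming all weights from HBM once.
tok_per_s = HBM_BW / MODEL_BYTES
print(f"batch=1:  {tok_per_s:.0f} tok/s per sequence")

# Larger batches amortize the same weight traffic across many requests.
for batch in (8, 64):
    print(f"batch={batch}: ~{tok_per_s * batch:.0f} tok/s aggregate")
```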
Decode Throughput Scaling with Batch Size (Llama 70B, H100; tokens/sec)
The Optimization Landscape
Every technique in this series attacks one of these bottlenecks:
How Inference Optimizations Map to Bottlenecks
| Technique | Covered In | What It Does | Which Bottleneck |
|---|---|---|---|
| Batching | Part 1 | Process multiple requests simultaneously to amortize weight loading | Memory bandwidth |
| KV Cache | Part 2 | Avoid recomputing attention keys/values for past tokens | Compute (eliminates redundant work) |
| Quantization | Part 3 | Reduce weight precision (FP16 to INT8/INT4) — less data to load from HBM | Memory bandwidth + capacity |
| FlashAttention | Part 4 | Tile attention computation to stay in fast SRAM instead of slow HBM | Memory bandwidth |
| Paged KV Cache | Part 5 | Allocate KV cache in fixed-size pages (like OS virtual memory) to eliminate fragmentation | Memory capacity |
| Continuous Batching | Part 6 | Add/remove requests from batch dynamically instead of waiting for all to finish | GPU utilization |
| Speculative Decoding | Part 7 | Use a small model to draft multiple tokens, verify in one pass of the big model | Memory bandwidth (multiple tokens per weight load) |
- If you read the Transformer Anatomy series: Skip this bridge, go directly to Part 1.
- If you’re a systems engineer new to LLMs: Read this bridge, then Parts 1-7 (fundamentals through speculative decoding), then pick topics relevant to your work.
- If you’re deploying vLLM/SGLang: Read this bridge, then Parts 5-6 (paged KV cache, continuous batching), then the vLLM Internals series.
Continue to Part 1: LLM Inference Fundamentals for the full treatment with detailed roofline analysis, memory math, and throughput equations.