
GPU Memory Hierarchy: Why It Matters for Inference

An H100 GPU has 989 TFLOPS of FP16 compute but only 3.35 TB/s of memory bandwidth. During autoregressive decode, the GPU spends most of its time waiting for weights to arrive from HBM, not multiplying them — the arithmetic units sit idle while memory channels saturate. Understanding this memory hierarchy is how you reason about every inference optimization that follows.

GPU Memory Hierarchy (H100 SXM)

  • Registers: ~20 MB total across all SMs — the fastest tier, inside the compute units themselves
  • SRAM (shared memory + L1): ~50 MB total, 128 KB per SM — ~33 TB/s aggregate bandwidth, 20-30 cycle latency
  • L2 cache: 50 MB — ~12 TB/s bandwidth, 200+ cycle latency
  • HBM3 (main GPU memory): 80 GB — 3.35 TB/s bandwidth, 400+ cycle latency

HBM (High Bandwidth Memory) is the GPU’s main memory — 80 GB on an H100. This is where model weights and KV cache live. At 3.35 TB/s, it’s fast compared to CPU RAM (50 GB/s), but slow compared to on-chip SRAM (33 TB/s).

SRAM is the on-chip scratchpad memory — only ~50 MB total but 10x faster than HBM. This is where FlashAttention does its work (tiling computation to stay in SRAM).

Tensor Cores are specialized matrix-multiply units that perform small matrix multiply-accumulate operations (on the order of a 16x16 tile) in just a few clock cycles. An H100 delivers 989 TFLOPS of FP16 compute with tensor cores, but only if you can feed them data fast enough — which brings us to the roofline model.

The Roofline Model: Memory-Bound vs Compute-Bound

The roofline model is the single most important mental model for understanding LLM inference. It tells you whether your workload is limited by compute speed or memory bandwidth.

Arithmetic intensity = FLOPs performed / bytes moved from memory. This is measured in FLOP/byte.

For any workload:

  • If arithmetic intensity is LOW (few FLOPs per byte loaded): you’re memory-bandwidth-bound. The GPU finishes computing before the next batch of data arrives from HBM.
  • If arithmetic intensity is HIGH (many FLOPs per byte loaded): you’re compute-bound. Data arrives faster than the GPU can process it.

The crossover point is the ridge point: peak TFLOPS / HBM bandwidth. For the H100: $989\ \text{TFLOPS} / 3.35\ \text{TB/s} \approx 295$ FLOP/byte.
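The ridge-point arithmetic is simple enough to sketch in a few lines of Python. This is a back-of-envelope model, not a profiler; the constants are the H100 figures quoted above, and the function name is mine:

```python
# Back-of-envelope roofline model for an H100 SXM (numbers from the text).
PEAK_FLOPS = 989e12        # FP16 tensor-core peak, FLOP/s
HBM_BANDWIDTH = 3.35e12    # bytes/s

# Ridge point: the arithmetic intensity where the compute ceiling and the
# bandwidth ceiling meet. Below it you are memory-bound, above it compute-bound.
ridge_point = PEAK_FLOPS / HBM_BANDWIDTH   # ~295 FLOP/byte

def attainable_tflops(arithmetic_intensity: float) -> float:
    """Attainable throughput is capped by whichever ceiling is lower."""
    return min(PEAK_FLOPS, arithmetic_intensity * HBM_BANDWIDTH) / 1e12

# Decode at batch=1 (~1 FLOP/byte) reaches only ~3.35 TFLOPS of the 989 peak.
utilization = attainable_tflops(1) / (PEAK_FLOPS / 1e12)   # ~0.003
```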

Roofline Model: H100 SXM

(% of peak GPU utilization)

  • Decode, batch=1 (1 FLOP/byte): ~0.3% of peak — 295x below the ridge point
  • Decode, batch=64 (64 FLOP/byte): ~22% of peak — still memory-bound
  • Decode, batch=256 (256 FLOP/byte): ~87% of peak — near the ridge point
  • Prefill (high arithmetic intensity): ~75% of peak — compute-bound

Key insight: LLM decode at batch=1 has arithmetic intensity of ~1 FLOP/byte — meaning the GPU computes 1 floating-point operation for each byte loaded from memory. The H100’s ridge point is 295 FLOP/byte. So decode at batch=1 uses 0.3% of the GPU’s compute capability. The GPU sits idle 99.7% of the time, waiting for data.

This is why every inference optimization in this series exists: they all try to increase arithmetic intensity (batching), reduce bytes moved (quantization), or generate more tokens per memory load (speculative decoding).

What a Transformer Does

A Large Language Model (LLM) like Llama, GPT-4, or Claude is a transformer — a neural network that predicts the next token in a sequence. The term autoregressive means “generates one token at a time, feeding each generated token back as input for the next step.”

Input: a sequence of tokens (integers representing words/subwords). Output: a probability distribution over the vocabulary for the next token.

The Attention Mechanism (Simplified)

Each token produces three vectors from its representation:

  • Query (Q): “What am I looking for?” — represents this token’s current needs
  • Key (K): “What do I contain?” — represents this token’s identity/content
  • Value (V): “What information do I carry?” — the actual content to aggregate

Attention computes: for each token’s Query, compare it against every previous token’s Key (dot product). The resulting scores determine how much of each previous token’s Value to incorporate. High score = “this previous token is relevant to me.”

The attention formula: $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

This is $O(n^2)$ in sequence length — every token attends to every previous token. For a 4,096-token sequence, that’s roughly 16 million attention score computations per head per layer.
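The formula translates almost line-for-line into code. Here is a minimal single-head version with a causal mask, so each token only attends to itself and earlier positions — a toy NumPy sketch with illustrative shapes, not any library’s API:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention with a causal mask.

    Q, K, V: (seq_len, d_k) arrays for a single head.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (seq_len, seq_len)
    # Causal mask: token i may only attend to tokens 0..i.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    # Softmax over the key dimension (numerically stabilized).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # (seq_len, d_k)

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 128)) for _ in range(3))
out = attention(Q, K, V)   # shape (8, 128)
```

Note that the first token can only attend to itself, so its output is exactly its own Value vector — a quick sanity check on the mask.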

Multi-Head and Grouped Query Attention

Instead of one attention computation, transformers use multiple heads (64 for Llama 70B). Each head attends to different aspects of the input (syntax, semantics, entity tracking, etc.).

Multi-Head Attention (MHA): Each head has its own Q, K, and V. 64 heads = 64 independent attention computations.

Grouped Query Attention (GQA): An optimization where multiple Query heads share the same Key and Value heads. Llama 70B uses 64 Q heads but only 8 KV heads — each KV head is shared among 8 Q heads. This reduces the KV cache by 8x with minimal quality loss.
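The sharing scheme can be sketched in NumPy: only the 8 KV heads are ever stored, and each is broadcast to its group of 8 query heads at attention time. Shapes follow the Llama 70B numbers above; the zero tensors are placeholders:

```python
import numpy as np

# Grouped Query Attention sketch: 64 Q heads share 8 KV heads.
n_q_heads, n_kv_heads, d_head, seq_len = 64, 8, 128, 16
group_size = n_q_heads // n_kv_heads           # 8 query heads per KV head

k = np.zeros((n_kv_heads, seq_len, d_head))    # only 8 K heads are cached
v = np.zeros((n_kv_heads, seq_len, d_head))    # only 8 V heads are cached

# At attention time, each KV head is repeated across its query-head group,
# so all 64 query heads have a K/V tensor to attend against:
k_expanded = np.repeat(k, group_size, axis=0)  # (64, seq_len, d_head)
v_expanded = np.repeat(v, group_size, axis=0)

# The KV cache stores k/v (8 heads), never the expanded tensors: an 8x saving.
cache_saving = k_expanded.nbytes // k.nbytes   # 8
```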

The Full Layer

A transformer is a stack of identical layers (80 layers for Llama 70B). Each layer has two sublayers:

  1. Attention: Each token looks at all previous tokens and computes a weighted sum (as described above)
  2. FFN (Feed-Forward Network): Each token independently passes through two large matrix multiplies with a nonlinearity (SwiGLU). This is where factual knowledge is stored — the FFN acts as a key-value memory.

Both sublayers use residual connections: the output is added to the input, not replacing it. This is critical for gradient flow during training.
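The residual wiring can be sketched as follows — shown here in a pre-norm layout, which is what Llama-family models use. The sublayer callables are placeholders standing in for real attention, SwiGLU FFN, and RMSNorm; the point is that each sublayer’s output is added to its input:

```python
import numpy as np

def transformer_layer(x, attn, ffn, norm):
    """One pre-norm transformer layer: a residual add around each sublayer.

    x: (seq_len, d_model). attn/ffn/norm are stand-in callables; the point
    here is the residual wiring, not the sublayer internals.
    """
    x = x + attn(norm(x))   # attention sublayer + residual connection
    x = x + ffn(norm(x))    # FFN sublayer + residual connection
    return x

# Toy demonstration with stand-in sublayers:
norm = lambda t: (t - t.mean(-1, keepdims=True)) / (t.std(-1, keepdims=True) + 1e-5)
attn = lambda t: t * 0.1
ffn = lambda t: np.tanh(t)
out = transformer_layer(np.ones((4, 8)), attn, ffn, norm)   # shape (4, 8)
```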

The Shapes That Matter for Inference


Key Dimensions (Llama 3 70B)

| Dimension | Symbol | Value | What It Is |
|---|---|---|---|
| Model dimension | d_model | 8192 | Width of each token's representation vector |
| Attention heads | n_heads | 64 | Number of parallel attention computations per layer |
| KV heads | n_kv_heads | 8 | GQA: 8 KV heads shared among 64 Q heads (8x memory savings) |
| Head dimension | d_head | 128 | d_model / n_heads — size of each head's Q/K/V vectors |
| FFN hidden dim | d_ff | 28672 | 3.5x d_model (SwiGLU uses an 8/3 expansion ratio) |
| Vocab size | V | 128256 | Number of possible tokens the model can output |
| Layers | L | 80 | Number of transformer layers stacked |
| Total parameters | N | ~70B | All weights stored in GPU HBM — 140 GB at FP16 |

KV Cache: Why It Exists

During generation, the model produces tokens one at a time (autoregressive). At each step, the attention mechanism needs ALL previous tokens’ Key and Value vectors. Without caching, you’d recompute K and V for all previous tokens every step — $O(n^2)$ total work for $n$ tokens.

KV cache: Store the K and V vectors from all previous tokens in GPU memory. Each new step only computes K, V for the new token and appends to the cache. Attention then reads the full cache but only computes Q for the new token.
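The mechanism reduces to an append-and-read loop. Here is a toy per-layer sketch (batch dimension omitted, class name is mine — nothing like a production allocator):

```python
import numpy as np

class LayerKVCache:
    """Toy per-layer KV cache: append one token's K/V per decode step."""

    def __init__(self, n_kv_heads: int, d_head: int):
        self.k = np.empty((0, n_kv_heads, d_head))   # (tokens_so_far, heads, d_head)
        self.v = np.empty((0, n_kv_heads, d_head))

    def append(self, k_new, v_new):
        # k_new/v_new: (1, n_kv_heads, d_head) for the single new token.
        self.k = np.concatenate([self.k, k_new])
        self.v = np.concatenate([self.v, v_new])
        return self.k, self.v   # attention reads the full history each step

cache = LayerKVCache(n_kv_heads=8, d_head=128)
for _ in range(5):   # five decode steps
    k_all, v_all = cache.append(np.zeros((1, 8, 128)), np.zeros((1, 8, 128)))
# The cache now holds K/V for 5 tokens; step 6 computes K/V for token 6 only.
```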

Cache size per token per layer: $2 \times n_{\text{kv\_heads}} \times d_{\text{head}} \times \text{dtype\_bytes} = 2 \times 8 \times 128 \times 2 = 4{,}096$ bytes (Llama 70B, FP16).

Across 80 layers: $4{,}096 \times 80 = 327{,}680$ bytes = 320 KB per token.

At batch=32 with 4,096 tokens per sequence: $32 \times 4096 \times 320\ \text{KB} = 40$ GB of KV cache alone — half of an H100’s 80 GB memory. This is why KV cache management is the central problem in inference optimization, and why later posts in this series cover techniques like paged memory allocation and cache quantization.
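The same sizing arithmetic, checked in code (constants from the table above; the "40 GB" is 40 GiB, which the last line reproduces exactly):

```python
# KV-cache sizing for Llama 70B at FP16, using the constants from the text.
n_layers, n_kv_heads, d_head, dtype_bytes = 80, 8, 128, 2

# K and V for one token in one layer:
per_token_per_layer = 2 * n_kv_heads * d_head * dtype_bytes   # 4,096 bytes
per_token = per_token_per_layer * n_layers                    # 327,680 bytes = 320 KB

batch, seq_len = 32, 4096
cache_gib = batch * seq_len * per_token / 2**30               # 40.0 GiB
```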

Prefill vs Decode: Two Different Bottlenecks

LLM inference has two distinct phases with different performance characteristics:

The Two Phases of LLM Inference

  • PREFILL: process the entire prompt at once (e.g., 1,000 tokens). Compute-bound: large matrix multiplies saturate the GPU tensor cores.
  • DECODE: generate one token per forward pass, fed back as input. Memory-bandwidth-bound: load 140 GB of weights to produce 1 token.

Prefill processes all prompt tokens simultaneously as a single large matrix multiply. With 1000+ tokens, the arithmetic intensity is high enough to saturate tensor cores. GPU utilization: 60-80%. This is compute-bound.

Decode generates one token at a time. Each forward pass loads ALL 140 GB of model weights from HBM to compute just one token’s worth of output. Arithmetic intensity: $\frac{2 \times 70\text{B FLOPs}}{140\ \text{GB}} = 1$ FLOP/byte. At the H100’s ridge point of 295 FLOP/byte, decode is 295x below the compute ceiling — purely memory-bandwidth-limited.

The math: At batch=1, decode throughput = HBM bandwidth / model size = $3.35\ \text{TB/s} / 140\ \text{GB} \approx 24$ tokens/second. At batch=64, you amortize the weight loading: $3.35\ \text{TB/s} / (140\ \text{GB} / 64) \approx 1{,}530$ tokens/second.
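The bandwidth-only estimate behind those numbers is one division. It ignores KV-cache and activation traffic, which is one reason real systems fall short of it at large batch sizes:

```python
# Bandwidth-only decode throughput estimate for Llama 70B on an H100.
# This is an upper bound: KV-cache and activation reads are not counted.
HBM_BANDWIDTH = 3.35e12   # bytes/s
WEIGHT_BYTES = 140e9      # 70B parameters at FP16 (2 bytes each)

def decode_tokens_per_sec(batch_size: int) -> float:
    # One pass over the weights serves every sequence in the batch,
    # so the per-token cost of weight loading amortizes with batch size.
    return HBM_BANDWIDTH / (WEIGHT_BYTES / batch_size)

tps_1 = decode_tokens_per_sec(1)     # ~24 tokens/s
tps_64 = decode_tokens_per_sec(64)   # ~1,530 tokens/s
```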

Decode Throughput Scaling with Batch Size (Llama 70B, H100)

  • Batch 1: 24 tokens/sec — bandwidth-bound, GPU ~0.3% utilized
  • Batch 16: 360 tokens/sec — 15x better via amortized weight loading
  • Batch 64: 1,200 tokens/sec — near-optimal bandwidth utilization
  • Batch 256: 2,400 tokens/sec — approaching compute saturation

The Optimization Landscape

Every technique in this series attacks one of these bottlenecks:


How Inference Optimizations Map to Bottlenecks

| Technique | Covered In | What It Does | Which Bottleneck |
|---|---|---|---|
| Batching | Part 1 | Process multiple requests simultaneously to amortize weight loading | Memory bandwidth |
| KV Cache | Part 2 | Avoid recomputing attention keys/values for past tokens | Compute (eliminates redundant work) |
| Quantization | Part 3 | Reduce weight precision (FP16 to INT8/INT4) — less data to load from HBM | Memory bandwidth + capacity |
| FlashAttention | Part 4 | Tile attention computation to stay in fast SRAM instead of slow HBM | Memory bandwidth |
| Paged KV Cache | Part 5 | Allocate KV cache in fixed-size pages (like OS virtual memory) to eliminate fragmentation | Memory capacity |
| Continuous Batching | Part 6 | Add/remove requests from the batch dynamically instead of waiting for all to finish | GPU utilization |
| Speculative Decoding | Part 7 | Use a small model to draft multiple tokens, verify in one pass of the big model | Memory bandwidth (multiple tokens per weight load) |
💡 Reading Order for This Series

  • If you read the Transformer Anatomy series: skip this bridge and go directly to Part 1.
  • If you’re a systems engineer new to LLMs: read this bridge, then Parts 1-7 (fundamentals through speculative decoding), then pick topics relevant to your work.
  • If you’re deploying vLLM/SGLang: read this bridge, then Parts 5-6 (paged KV cache, continuous batching), then the vLLM Internals series.

Continue to Part 1: LLM Inference Fundamentals for the full treatment with detailed roofline analysis, memory math, and throughput equations.