Part of series: Inference Optimization (1 of 23)

  1. LLM Inference Fundamentals: Prefill, Decode, and the Memory-Compute Divide
  2. KV Cache: The Hidden Memory Giant in LLM Serving
  3. Quantization for LLM Inference: From FP16 to INT4 — A Deep Dive into Precision, Performance, and Production Deployment
  4. FlashAttention: Why Tiling Attention Through the Memory Hierarchy Changes Everything
  5. PagedAttention: How vLLM Borrowed OS Virtual Memory to Fix LLM Serving
  6. Continuous Batching: The Complete Guide to LLM Inference Scheduling
  7. Speculative Decoding: Why Autoregressive LLMs Leave 99% of Your GPU Idle and How to Fix It
  8. Prefix Caching: RadixAttention, Cache Hierarchies, and Reusing Computation Across Requests
  9. LoRA and QLoRA for Serving: Multi-Adapter Inference, S-LoRA, and When to Merge
  10. Disaggregated Prefill-Decode: Why Splitting LLM Inference Changes Everything
  11. Constrained Generation: FSM-Based Decoding, Outlines, and Grammar-Guided LLM Output
  12. Mamba and State Space Models: The O(n) Alternative to Attention
  13. Inference-Time Compute Scaling: When More Thinking Helps (o1, DeepSeek-R1, and the Reasoning Frontier)
  14. CPU and Edge Inference: llama.cpp Internals, GGUF Format, and When CPU Actually Wins
  15. Inference Cost Economics: Tokens per Dollar, GPU-Hours, and the Real Math of LLM Serving
  16. Batched GEMM: Why Matrix Multiply Throughput Determines Everything in LLM Inference
  17. Token Generation Pipeline: Logit Processing, Sampling Strategies, and Stop Criteria
  18. Memory Pool Management: Slab Allocators for GPU Inference
  19. Vision-Language Model Serving: ViT Encoding, Cross-Attention, and KV Cache Paging for Multimodal
  20. Long-Context Serving: Ring Attention, KV Offloading, and Chunked Processing in Production
  21. Inference Profiling: Nsight Systems, torch.profiler, and Finding Where Time Actually Goes
  22. FP8 Inference: E4M3 Format, Per-Tensor Scaling, and the Hardware Support Matrix
  23. Speculative Decoding v2: Medusa, EAGLE, Lookahead, and Token Tree Verification

LLM inference looks simple from the outside — feed in a prompt, get text back. Under the hood, the story is radically different from training. Training processes entire sequences in parallel with known labels. Inference must generate tokens one at a time, where each new token depends on every token before it. This autoregressive bottleneck shapes every aspect of how we design, optimize, and deploy inference systems.

This post is the foundational reference for understanding LLM inference performance. We will cover the two-phase execution model (prefill and decode), derive the arithmetic intensity of each phase from first principles, work through detailed memory calculations for the Llama 3 family (8B, 70B, and 405B), analyze how batching transforms the compute-to-memory ratio, discuss the serving challenges that arise at scale, and ground everything in real measured numbers on A100 and H100 hardware.

If you work on ML infrastructure, model serving, or simply want to understand why your chatbot sometimes feels slow, this is the post to start with.


Why Inference Differs from Training

Before diving into the mechanics of inference, it is worth understanding why inference is a fundamentally different computational problem from training — even though both execute the same neural network architecture.

Training: Full Parallelism

During training, we have the complete input-output sequence available upfront. A training step for a batch of sequences works like this:

  1. Forward pass: Process all tokens in every sequence simultaneously. For a batch of B sequences each of length S, the input to each transformer layer is a matrix of shape [B × S, d_model]. The matrix multiplications are large and keep GPU compute units fully occupied.
  2. Loss computation: Compare predicted tokens against ground-truth labels (which we already know) across all positions.
  3. Backward pass: Compute gradients through the full computational graph. The backward pass involves the same large matrix multiplications as the forward pass, roughly doubling the total compute.
  4. Optimizer step: Update all parameters using the computed gradients.

The critical insight is that training sees the entire sequence at once. There is no sequential dependency between generating token t and token t+1 because we already have the ground truth for both. This means training is embarrassingly parallel across the sequence dimension, and the matrix multiplications are enormous — exactly what GPUs are designed for.

Inference: The Autoregressive Bottleneck

Inference is fundamentally sequential in the output dimension. To generate token t+1, the model must:

  1. Have already generated token t (and all tokens before it).
  2. Run a full forward pass through all layers to produce a probability distribution over the vocabulary.
  3. Sample or select token t+1 from that distribution.
  4. Feed token t+1 back as input and repeat.

This means we cannot generate the 50th output token until we have generated the first 49. Each forward pass during generation processes only a single new token (or, in speculative decoding, a small handful). The matrix multiplications shrink from [B × S, d_model] × [d_model, d_ff] during training to [B × 1, d_model] × [d_model, d_ff] during generation — the input dimension collapses from S (potentially thousands) down to 1.

This is the autoregressive bottleneck. It means that during token generation, GPU compute units sit largely idle while the chip spends most of its time simply reading model weights from memory. The problem is not a lack of compute power; it is a lack of enough work to do per byte of data loaded.

The Core Asymmetry

Training is compute-bound: large matrix multiplications keep tensor cores busy. Inference generation is memory-bandwidth-bound: the model weights must be read from HBM for every single output token, but each token only requires a tiny amount of arithmetic per weight loaded. This asymmetry is the single most important fact in LLM inference.

The Roofline Perspective

The roofline model provides a clean framework for understanding this. A GPU has two ceilings: peak compute (measured in FLOPS) and peak memory bandwidth (measured in bytes/second). The ratio of these gives the ridge point — the arithmetic intensity (FLOP/byte) at which a workload transitions from memory-bound to compute-bound.

For an NVIDIA A100 SXM:

  • Peak FP16 tensor core compute: 312 TFLOPS
  • Peak HBM bandwidth: 2.0 TB/s
  • Ridge point: 312 / 2.0 = 156 FLOP/byte

For an NVIDIA H100 SXM:

  • Peak FP16 tensor core compute: 989 TFLOPS
  • Peak HBM bandwidth: 3.35 TB/s
  • Ridge point: 989 / 3.35 ≈ 295 FLOP/byte

Any operation with arithmetic intensity below the ridge point is memory-bandwidth-bound. Any operation above it is compute-bound. As we will see, decode sits far below the ridge point, while prefill (with sufficiently long prompts) sits above it.
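The ridge-point arithmetic is easy to check in code. A minimal sketch (helper names are ours; the hardware numbers are the published FP16 specs quoted above):

```python
def ridge_point(peak_tflops: float, bandwidth_tbps: float) -> float:
    """Arithmetic intensity (FLOP/byte) where the bottleneck flips:
    TFLOP/s divided by TB/s cancels to FLOP/byte."""
    return peak_tflops / bandwidth_tbps

def bottleneck(intensity_flop_per_byte: float, peak_tflops: float,
               bandwidth_tbps: float) -> str:
    """Classify a workload against a GPU's roofline."""
    if intensity_flop_per_byte < ridge_point(peak_tflops, bandwidth_tbps):
        return "memory-bound"
    return "compute-bound"

A100 = (312.0, 2.0)    # peak FP16 TFLOPS, HBM TB/s
H100 = (989.0, 3.35)

print(round(ridge_point(*A100)))   # 156
print(round(ridge_point(*H100)))   # 295
print(bottleneck(1, *A100))        # decode at B=1: memory-bound
print(bottleneck(512, *A100))      # prefill at P=512: compute-bound
```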


Prefill vs Decode: Two Phases, Two Bottlenecks

Every LLM inference request proceeds through two distinct phases that have opposite performance characteristics.

Phase 1: Prefill (Prompt Processing)

During prefill, all input prompt tokens are processed simultaneously in a single forward pass. If the prompt has P tokens, the input to each transformer layer is a matrix of shape [P, d_model]. This flows through:

  1. Self-attention: Query, key, and value projections are matrix multiplications of shape [P, d_model] × [d_model, d_model], producing O(P × d_model²) FLOPs per projection (three projections plus the output projection). The attention score computation is [P, d_head] × [d_head, P] per head, which is O(P² × d_head) per head.
  2. Feed-forward network (FFN): Two (or three, for SwiGLU) matrix multiplications of shape [P, d_model] × [d_model, d_ff], producing O(P × d_model × d_ff) FLOPs each.

The key observation is that P appears as a dimension of the input matrix. When P = 512 or P = 2048, these are large GEMMs (general matrix multiplications) with high arithmetic intensity. The GPU’s tensor cores are fully utilized.

Arithmetic intensity of prefill (FFN layer):

Consider a single FFN weight matrix of shape [d_model, d_ff]. The input is [P, d_model].

  • Bytes loaded: d_model × d_ff × 2 bytes (FP16 weights) + P × d_model × 2 bytes (input activations). For large d_ff, the weight matrix dominates.
  • FLOPs: 2 × P × d_model × d_ff (multiply-accumulate).
  • Arithmetic intensity: approximately (2 × P × d_model × d_ff) / (d_model × d_ff × 2) = P FLOP/byte.

For a prompt of P = 512 tokens, arithmetic intensity is ~512 FLOP/byte — well above the ridge point of both A100 (156) and H100 (295). Prefill is compute-bound.

In practice, prefill speed scales roughly linearly with prompt length (twice the prompt takes roughly twice as long, because there is twice the compute to do) until the prompt is so short that the operation falls below the ridge point.

Phase 2: Decode (Token Generation)

During decode, each forward pass processes a single new token. The input to each layer is a vector of shape [1, d_model]. This changes the computational profile dramatically:

  1. Self-attention: The QKV projections become matrix-vector multiplications: [1, d_model] × [d_model, d_model]. The attention computation reads the entire KV cache (all previous keys and values) but only produces one new query.
  2. FFN: The multiplications become [1, d_model] × [d_model, d_ff] — again, matrix-vector products.

Arithmetic intensity of decode (FFN layer):

Same weight matrix [d_model, d_ff], but now the input is [1, d_model]:

  • Bytes loaded: d_model × d_ff × 2 bytes (weights still fully loaded).
  • FLOPs: 2 × 1 × d_model × d_ff.
  • Arithmetic intensity: (2 × d_model × d_ff) / (d_model × d_ff × 2) = 1 FLOP/byte.

An arithmetic intensity of 1 FLOP/byte is catastrophically below the ridge point. The GPU’s tensor cores are idle more than 99% of the time during single-request decode. Decode is memory-bandwidth-bound.
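The prefill and decode derivations differ only in the token dimension, so one helper covers both. A sketch (function name is ours; the dimensions are Llama 3 8B's d_model = 4096 and d_ff = 14336):

```python
def ffn_arithmetic_intensity(num_tokens: int, d_model: int, d_ff: int,
                             bytes_per_element: int = 2) -> float:
    """FLOP/byte of one FFN matmul [num_tokens, d_model] x [d_model, d_ff],
    counting both weight bytes and input-activation bytes."""
    flops = 2 * num_tokens * d_model * d_ff             # multiply-accumulate
    weight_bytes = d_model * d_ff * bytes_per_element   # read once per pass
    activation_bytes = num_tokens * d_model * bytes_per_element
    return flops / (weight_bytes + activation_bytes)

print(ffn_arithmetic_intensity(1, 4096, 14336))    # ~1.0 (decode: memory-bound)
print(ffn_arithmetic_intensity(512, 4096, 14336))  # ~494 (prefill: just under P, since activation bytes also count)
```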

Prefill vs Decode: Arithmetic Intensity Comparison

| Property | Prefill (P=512) | Decode (B=1) | Ratio |
| --- | --- | --- | --- |
| Input shape to FFN | [512, d_model] | [1, d_model] | 512x |
| FLOPs (per FFN matrix) | 2 × 512 × d_model × d_ff | 2 × 1 × d_model × d_ff | 512x |
| Bytes loaded (weights) | d_model × d_ff × 2 | d_model × d_ff × 2 | 1x (same!) |
| Arithmetic intensity | ~512 FLOP/byte | ~1 FLOP/byte | 512x |
| Bottleneck | Compute (tensor cores) | Memory bandwidth (HBM) | -- |
| A100 utilization (FP16) | 70-90% | <1% | -- |

This table reveals the fundamental tragedy of LLM decode: you must read the entire model’s weights from memory for every single output token, but each token only does a trivial amount of arithmetic with those weights. The hardware is designed for workloads at >150 FLOP/byte, and decode delivers ~1 FLOP/byte.

The KV Cache: Avoiding Redundant Computation

There is one crucial optimization that makes autoregressive generation practical at all: the KV cache. Without it, generating token t would require recomputing attention over all t−1 previous tokens from scratch — meaning the cost of generating a sequence of length T would be O(T²) in the sequence dimension (and O(T³) total when accounting for the O(T) attention computation at each step).

The KV cache stores the key and value vectors computed during previous forward passes. When generating token t, we only need to:

  1. Compute the new query, key, and value for the current token.
  2. Append the new key and value to the cache.
  3. Compute attention between the single new query and all t cached keys/values.

This reduces the cost from O(T²) per step to O(T) per step — a critical optimization. But it comes at the cost of memory: the KV cache grows linearly with sequence length and must be stored in GPU memory. As we will see, this memory pressure becomes the dominant constraint at scale.
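The three cached-decode steps can be sketched for a single attention head (illustrative NumPy, not an optimized kernel; all names and shapes here are ours):

```python
import numpy as np

def decode_step(x, Wq, Wk, Wv, k_cache, v_cache):
    """One decode step: project q/k/v for the new token only, append k/v
    to the cache, then attend the single query over all cached positions."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv              # three matrix-vector products
    k_cache.append(k)                             # step 2: grow the cache
    v_cache.append(v)
    K, V = np.stack(k_cache), np.stack(v_cache)   # [t, d_head]
    scores = K @ q / np.sqrt(x.shape[0])          # step 3: one row of attention
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V                                  # context vector for the new token

rng = np.random.default_rng(0)
d = 8                                             # toy head dimension
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
k_cache, v_cache = [], []
for _ in range(4):                                # generate 4 tokens
    out = decode_step(rng.standard_normal(d), Wq, Wk, Wv, k_cache, v_cache)
print(len(k_cache), out.shape)                    # cache grew by one entry per step
```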


Memory Math: Llama 3 Family

Let us work through the memory requirements for the three sizes in the Llama 3 family: 8B, 70B, and 405B. Understanding these numbers is essential for capacity planning.

Model Weights

The memory required for model weights depends on the number of parameters and the precision:

  • FP16/BF16: 2 bytes per parameter
  • INT8: 1 byte per parameter
  • INT4: 0.5 bytes per parameter
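These precisions translate into the weight table that follows via one line of arithmetic: billions of parameters times bytes per parameter gives decimal GB directly. A sketch (function name is ours):

```python
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Weight footprint in decimal GB: parameter count times bytes/param."""
    return params_billions * BYTES_PER_PARAM[precision]

print(weight_memory_gb(8.03, "fp16"))   # 16.06
print(weight_memory_gb(70.6, "fp16"))   # 141.2
print(weight_memory_gb(405.1, "int4"))  # 202.55
```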
Llama 3 Model Weight Memory

| Model | Parameters | FP16 (GB) | INT8 (GB) | INT4 (GB) | Min GPUs (FP16, 80GB) |
| --- | --- | --- | --- | --- | --- |
| Llama 3 8B | 8.03B | 16.1 | 8.0 | 4.0 | 1 |
| Llama 3 70B | 70.6B | 141.2 | 70.6 | 35.3 | 2 |
| Llama 3 405B | 405.1B | 810.2 | 405.1 | 202.6 | 11 |

Just the weights for Llama 3 405B in FP16 require over 810 GB — more than 10 A100-80GB GPUs just to hold the parameters with zero room for anything else. This is why quantization and model parallelism are not optional at this scale; they are mandatory.

KV Cache Memory

The KV cache is where things get interesting — and where most practitioners underestimate memory requirements. The KV cache stores key and value vectors for every layer, every head, and every token in every active request.

Per-token KV cache size:

KV per token = 2 × n_layers × n_kv_heads × d_head × bytes_per_element

The factor of 2 accounts for both keys and values. Note that Llama 3 uses Grouped Query Attention (GQA), where the number of KV heads (n_kv_heads) is smaller than the number of query heads (n_q_heads). This is specifically designed to reduce KV cache size.

Let us compute for each model:

Llama 3 8B: n_layers = 32, n_kv_heads = 8 (GQA with 4:1 ratio from 32 query heads), d_head = 128

KV per token = 2 × 32 × 8 × 128 × 2 = 131,072 bytes = 0.125 MB

Llama 3 70B: n_layers = 80, n_kv_heads = 8 (GQA with 8:1 ratio from 64 query heads), d_head = 128

KV per token = 2 × 80 × 8 × 128 × 2 = 327,680 bytes = 0.3125 MB

Llama 3 405B: n_layers = 126, n_kv_heads = 8 (GQA with 16:1 ratio from 128 query heads), d_head = 128

KV per token = 2 × 126 × 8 × 128 × 2 = 516,096 bytes ≈ 0.49 MB
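The three calculations above can be wrapped in a helper and checked directly (a sketch; the function name is ours):

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, d_head: int,
                       bytes_per_element: int = 2) -> int:
    """Keys + values across all layers for one token (factor 2 = K and V)."""
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_element

assert kv_bytes_per_token(32, 8, 128) == 131_072    # Llama 3 8B:   0.125 MB
assert kv_bytes_per_token(80, 8, 128) == 327_680    # Llama 3 70B:  0.3125 MB
assert kv_bytes_per_token(126, 8, 128) == 516_096   # Llama 3 405B: ~0.49 MB
```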

Now let us scale this to realistic serving scenarios. A single request with a context length of 4,096 tokens:

KV Cache Memory Per Request (context = 4,096 tokens, FP16)

| Model | KV/token (MB) | KV/request (GB) | Model weights (GB) | KV as % of weights |
| --- | --- | --- | --- | --- |
| Llama 3 8B | 0.125 | 0.5 | 16.1 | 3.1% |
| Llama 3 70B | 0.3125 | 1.25 | 141.2 | 0.9% |
| Llama 3 405B | 0.49 | 1.96 | 810.2 | 0.2% |

At one request with 4K context, the KV cache is modest. But serving is not about one request — it is about hundreds or thousands of concurrent requests.

KV Cache Memory at Scale (context = 4,096 tokens, FP16)

| Model | Batch=1 | Batch=32 | Batch=128 | Batch=256 |
| --- | --- | --- | --- | --- |
| Llama 3 8B | 0.5 GB | 16 GB | 64 GB | 128 GB |
| Llama 3 70B | 1.25 GB | 40 GB | 160 GB | 320 GB |
| Llama 3 405B | 1.96 GB | 62.7 GB | 250.9 GB | 501.8 GB |
⚠️ KV Cache Dominates at Scale

For Llama 3 8B at batch=128 with 4K context, the KV cache requires 64 GB — four times the 16 GB model weights. For Llama 3 70B at the same batch size, the KV cache requires 160 GB, exceeding the model weights (141 GB). At production batch sizes, the KV cache is the dominant memory consumer, not the model weights. This is why KV cache optimization (GQA, quantization, paged attention) is the most impactful area for serving efficiency.

With longer context lengths (32K, or the full 128K context window that Llama 3 supports), the numbers become even more extreme. At 128K context with batch=32, Llama 3 8B would need 0.125 MB × 131,072 tokens × 32 = 512 GB of KV cache alone — an impossible number for a single GPU. This is why long-context serving requires aggressive KV cache compression techniques like quantization to INT8 or INT4, sliding window attention, or offloading to CPU memory.

Activation Memory

Activation memory is the working memory used during a forward pass to store intermediate results. Unlike model weights (which are constant) and the KV cache (which grows with sequence length), activation memory is proportional to the batch size and sequence length of the current forward pass.

During prefill, activation memory can be significant because the full prompt is processed at once. The dominant terms are:

  • Attention scores: O(B × n_heads × P²) without FlashAttention, or O(B × n_heads × P) with FlashAttention (which computes attention in tiles and does not materialize the full attention matrix).
  • Layer intermediate outputs: O(B × P × d_ff) per layer.

For Llama 3 8B with B = 1, P = 4096: without FlashAttention, the attention matrix alone would be 32 × 4096² × 2 bytes = 1 GB per layer, times 32 layers = 32 GB. With FlashAttention, this drops to a few hundred MB total. FlashAttention is not optional for long-context inference.
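The 1 GB-per-layer figure is worth verifying, since materialized attention scores scale with the square of the prompt length. A sketch (function name is ours):

```python
def attention_scores_gib(n_heads: int, seq_len: int, n_layers: int,
                         bytes_per_element: int = 2, batch: int = 1) -> float:
    """GiB needed to materialize the full [batch, heads, P, P] score matrix,
    i.e. the non-FlashAttention worst case."""
    per_layer = batch * n_heads * seq_len * seq_len * bytes_per_element
    return per_layer * n_layers / 2**30

print(attention_scores_gib(32, 4096, n_layers=1))   # 1.0  (per layer)
print(attention_scores_gib(32, 4096, n_layers=32))  # 32.0 (whole model)
```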

During decode, activation memory is negligible because we process only one token per step. The intermediate tensors are tiny vectors rather than large matrices.

Total Memory Budget

Let us put it all together for a concrete deployment scenario: Llama 3 70B on 2x A100-80GB (160 GB total) with FP16 weights, serving batch=32 at 4K context.

GPU Memory Breakdown: Llama 3 70B on 2x A100-80GB (batch=32, 4K context)

| Component | Memory (GB) |
| --- | --- |
| Model weights (FP16) | 141 |
| KV cache (32 requests × 4K; grows with batch × seq_len) | 40 |
| Activations + workspace | 4 |
| CUDA context + overhead | 4 |
| Remaining headroom | −29 (over budget by 29 GB!) |

We are 29 GB over budget! The solution space includes:

  1. Quantize weights to INT8: reduces weights from 141 GB to 70.6 GB, freeing 70 GB. Total becomes 119 GB — fits with headroom.
  2. Quantize KV cache to INT8: reduces KV from 40 GB to 20 GB.
  3. Reduce batch size: batch=16 halves the KV cache to 20 GB, but the total is still 169 GB — over the 160 GB budget even then.
  4. Add more GPUs: 4x A100 gives 320 GB but increases communication overhead.

In practice, production deployments use a combination of INT8 or INT4 weight quantization, INT8 KV cache quantization, and enough GPUs to provide headroom for burst traffic. The memory budget is always the binding constraint.


Batch Size Effects: How Batching Changes Everything

Batching is the single most important lever in LLM inference performance. It transforms decode from a catastrophically underutilized operation into something approaching reasonable GPU utilization.

The Mechanics of Batched Decode

When we batch BB decode requests together, the forward pass processes BB tokens simultaneously. The FFN computation changes from a matrix-vector product to a matrix-matrix product:

  • Unbatched: [1, d_model] × [d_model, d_ff] — matrix-vector
  • Batched (B): [B, d_model] × [d_model, d_ff] — matrix-matrix (small GEMM)

The weight matrix [d_model, d_ff] is loaded once from HBM regardless of batch size. But with batch size B, we perform B times more FLOPs with that single read.

Arithmetic intensity of batched decode:

Arithmetic intensity = (2 × B × d_model × d_ff) / (d_model × d_ff × 2) = B FLOP/byte

  • B=1: 1 FLOP/byte (memory-bound)
  • B=32: 32 FLOP/byte (still memory-bound on A100, but much better)
  • B=156: 156 FLOP/byte (at the ridge point for A100 — balanced)
  • B=295: 295 FLOP/byte (at the ridge point for H100 — balanced)
  • B>295: compute-bound on H100 — tensor cores become the bottleneck

Arithmetic Intensity vs Batch Size During Decode (chart): intensity equals B, rising linearly through 1, 8, 32, 64, and 128 FLOP/byte until it crosses the A100 ridge point at B=156 and the H100 ridge point at B=295.

Throughput vs Latency Tradeoff

Increasing batch size improves throughput (total tokens generated per second across all requests) but degrades per-request latency (time for an individual token). Here is why:

  • Throughput: Weight reads are amortized across B requests. Total throughput scales roughly linearly with B until we hit the compute ceiling or run out of memory.
  • Latency: With larger batches, each forward pass takes longer because there is more compute to do. The KV cache also grows, making the attention computation slower. For a single user, their tokens come slower.

This is the fundamental tension in LLM serving: maximizing throughput for the provider vs minimizing latency for the user.

Batch Size Impact on Throughput and Latency (Llama 3 8B, A100-80GB, FP16)

| Batch Size | Throughput (tok/s) | Per-Request Latency (ms/tok) | GPU Compute Util | KV Cache (GB, 2K ctx) |
| --- | --- | --- | --- | --- |
| 1 | ~140 | ~7 | <1% | 0.25 |
| 4 | ~540 | ~7.4 | ~3% | 1.0 |
| 16 | ~2,000 | ~8 | ~10% | 4.0 |
| 32 | ~3,600 | ~8.9 | ~20% | 8.0 |
| 64 | ~5,800 | ~11 | ~40% | 16.0 |
| 128 | ~7,200 | ~17.8 | ~75% | 32.0 |
| 256 | ~7,800 | ~32.8 | ~95% | 64.0 |

Several things to notice:

  1. Throughput scales almost linearly from batch 1 to batch 64, then starts to plateau as compute utilization approaches the ceiling.
  2. Per-request latency stays relatively flat up to batch 32 (the operation is still memory-bound, so adding work is “free”), then starts rising as we approach the compute ridge point.
  3. KV cache memory scales linearly with batch size — at batch 256, it is 64 GB, consuming most of the A100’s memory.
  4. The “sweet spot” for serving is typically in the range of batch 32-64, where throughput is high but latency is still acceptable and KV cache has not consumed all memory.

The Memory Wall

The maximum achievable batch size is constrained by GPU memory:

B_max = (GPU memory − weights − overhead) / KV per request

For Llama 3 8B on A100-80GB with FP16 weights and 4K context:

B_max = (80 − 16.1 − 4) / 0.5 = 59.9 / 0.5 ≈ 119

For Llama 3 70B on 2x A100-80GB with FP16 weights and 4K context:

B_max = (160 − 141.2 − 6) / 1.25 = 12.8 / 1.25 ≈ 10

Only 10 concurrent requests! This is why weight quantization is so important for larger models — it frees memory for KV cache, enabling larger batch sizes, which improves throughput. With INT8 weights (70.6 GB), the budget becomes:

B_max = (160 − 70.6 − 6) / 1.25 = 83.4 / 1.25 ≈ 66

Quantizing weights from FP16 to INT8 increased max batch size from 10 to 66 — a 6.6× improvement in maximum throughput, just from weight quantization. This is why quantization is listed as a serving optimization, not just a model compression technique.
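The B_max arithmetic above, as a helper (a sketch; names are ours):

```python
def max_batch_size(gpu_memory_gb: float, weights_gb: float,
                   overhead_gb: float, kv_per_request_gb: float) -> int:
    """Concurrent-request ceiling once weights and runtime overhead are paid."""
    free_gb = gpu_memory_gb - weights_gb - overhead_gb
    return int(free_gb // kv_per_request_gb)

# Llama 3 70B on 2x A100-80GB, 4K context (1.25 GB of KV per request)
print(max_batch_size(160, 141.2, 6, 1.25))  # 10 with FP16 weights
print(max_batch_size(160, 70.6, 6, 1.25))   # 66 with INT8 weights
```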

💡 Quantization is a Serving Optimization

Weight quantization is not just about making models smaller. In a serving context, its primary benefit is freeing GPU memory for KV cache, which enables larger batch sizes, which dramatically improves throughput. Reducing Llama 3 70B from FP16 to INT4 weights frees ~106 GB across 2 GPUs — enough for ~85 additional concurrent 4K-context requests.


The Serving Problem

Moving from a single inference request to a production serving system introduces a cascade of new challenges. Naive inference wastes enormous amounts of GPU resources, and the key innovations in LLM serving are all about closing this efficiency gap.

Why Naive Inference Wastes GPU

Consider a naive serving setup: a queue of requests, processed one at a time. Each request goes through prefill, then generates tokens until completion (or max length), then the next request starts.

The problems are severe:

  1. Zero batching during decode: Each decode step reads the full model weights for a single token. On A100, this means <1% compute utilization during generation — the most expensive hardware in the world is sitting idle.

  2. Wasted prefill capacity: While one request is decoding, the GPU could be prefilling new requests (prefill is compute-bound and decode is memory-bound — they use different resources). But naive sequential processing cannot overlap them.

  3. Variable request lengths: Requests have different prompt lengths and generate different numbers of output tokens. With static batching, the entire batch waits for the longest request to finish. If one request generates 10 tokens and another generates 500, the short request’s slot sits idle for 98% of the batch’s lifetime after it finishes.

  4. Memory fragmentation: Different requests occupy different amounts of KV cache memory. As requests arrive and complete, the KV cache develops holes — allocated but unused memory regions that cannot be reclaimed for new requests.

Static Batching: The First Improvement

The simplest optimization is to batch multiple requests together: collect BB requests, prefill them all, then decode them all simultaneously. This improves throughput proportionally to the batch size (as discussed above).

But static batching has a critical flaw: all requests in a batch must start and end together. When the first request in a batch finishes generating, its slot sits empty until the entire batch completes. In practice, output lengths vary widely (some responses are 20 tokens, others are 2,000), so the average utilization of a static batch can be as low as 40-60%.

Continuous Batching: The Key Innovation

Continuous batching (also called iteration-level scheduling or inflight batching) solves this by making scheduling decisions at each decode step rather than at the batch level.

The core idea:

  1. After each decode iteration, check if any requests in the batch have finished (hit EOS token or max length).
  2. If so, immediately evict them and insert waiting requests from the queue.
  3. The new requests go through prefill (which can be interleaved or chunked) and then join the decode batch.

This means the batch stays full continuously — as soon as one request finishes, another takes its place. Utilization jumps from 40-60% to 85-95% or higher.

The engineering challenge is managing the KV cache for a dynamically changing set of requests. When a request finishes and a new one starts, the new request’s KV cache must be allocated and the old one freed, without fragmenting the contiguous memory blocks that GPU kernels require.
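The evict-and-backfill loop can be sketched as a toy scheduler. Everything here is illustrative: step_fn stands in for one fused decode iteration, and a real system must also schedule prefill and manage KV memory for arriving requests:

```python
from collections import deque

def continuous_batching(waiting: deque, max_batch: int, step_fn) -> list:
    """Iteration-level scheduling: after every decode step, evict finished
    requests and immediately backfill from the queue so the batch stays full."""
    running, completed = [], []
    while waiting or running:
        while waiting and len(running) < max_batch:   # backfill empty slots
            running.append(waiting.popleft())
        finished = step_fn(running)                   # one decode iteration
        completed.extend(finished)
        running = [r for r in running if r not in finished]
    return completed

# Toy workload: each request needs a different number of decode steps.
remaining = {"a": 3, "b": 1, "c": 5, "d": 2}
def step(batch):
    done = []
    for rid in batch:
        remaining[rid] -= 1
        if remaining[rid] == 0:
            done.append(rid)
    return done

order = continuous_batching(deque(["a", "b", "c", "d"]), max_batch=2, step_fn=step)
print(order)  # ['b', 'a', 'd', 'c'] -- short requests exit early, slots refill
```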

Paged Attention and KV Cache Management

PagedAttention, introduced by the vLLM system, applies the operating system concept of virtual memory paging to KV cache management. Instead of allocating a contiguous block of memory for each request’s KV cache (which leads to internal fragmentation when requests have variable lengths), PagedAttention:

  1. Divides KV cache memory into fixed-size blocks (analogous to memory pages).
  2. Maintains a block table mapping each request’s logical KV positions to physical memory blocks.
  3. Allocates blocks on demand as sequences grow, and frees them immediately when sequences complete.

This eliminates nearly all internal fragmentation and enables near-optimal memory utilization. In practice, PagedAttention increases the effective KV cache capacity by 2-4× compared to contiguous allocation, enabling proportionally larger batch sizes.

The tradeoff is a small overhead from the indirection (looking up block tables during attention computation), but this is negligible compared to the throughput gains from better memory utilization.
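A minimal block-table allocator makes the mechanism concrete (illustrative only; vLLM's real allocator also handles block sharing and copy-on-write):

```python
class PagedKVAllocator:
    """Fixed-size KV blocks, a free list, and per-request block tables."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}                      # request id -> [block ids]
        self.num_tokens = {}                        # request id -> tokens stored

    def append_token(self, rid: str) -> None:
        """Reserve KV space for one new token, grabbing a block on demand."""
        tokens = self.num_tokens.get(rid, 0)
        if tokens % self.block_size == 0:           # current block full (or first token)
            if not self.free_blocks:
                raise MemoryError("KV pool exhausted: preempt or swap")
            self.block_tables.setdefault(rid, []).append(self.free_blocks.pop())
        self.num_tokens[rid] = tokens + 1

    def free(self, rid: str) -> None:
        """Return a finished request's blocks to the pool immediately."""
        self.free_blocks.extend(self.block_tables.pop(rid, []))
        self.num_tokens.pop(rid, None)

alloc = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(40):                     # 40 tokens -> ceil(40/16) = 3 blocks
    alloc.append_token("req0")
print(len(alloc.block_tables["req0"]))  # 3
alloc.free("req0")
print(len(alloc.free_blocks))           # 4: all blocks reclaimed
```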

Serving Strategy Comparison (Llama 3 8B, A100, mixed workload)

| Strategy | Avg Throughput (tok/s) | GPU Utilization | Memory Waste | Scheduling Overhead |
| --- | --- | --- | --- | --- |
| Sequential (B=1) | 140 | <1% | None | None |
| Static batching (B=32) | 3,600 | ~20% | 30-50% | Low |
| Continuous batching (B=32) | 5,400 | ~30% | 10-20% | Medium |
| Continuous + PagedAttention | 7,000 | ~45% | <5% | Medium |
| + Chunked prefill + overlap | 8,500 | ~55% | <5% | High |

Chunked Prefill and Prefill-Decode Overlap

Another important serving optimization is chunked prefill: instead of processing the entire prompt in a single forward pass (which can cause latency spikes for long prompts), split the prompt into chunks and process each chunk in a separate iteration. This allows:

  1. Interleaving with decode: While processing a chunk of a new request’s prompt, the decode tokens from existing requests can be included in the same forward pass. This prevents prefill from stalling ongoing decode requests.
  2. Bounded latency: No single iteration takes longer than a fixed time (determined by chunk size), making TTFT (time to first token) more predictable.
  3. Better GPU utilization: Each iteration has a mix of prefill tokens (compute-heavy) and decode tokens (memory-heavy), which better utilizes both compute and memory bandwidth simultaneously.

The optimal chunk size balances prefill throughput (larger chunks = fewer overheads) against decode latency (larger chunks = longer iterations that delay decode tokens). Typical chunk sizes are 256-1024 tokens.

Request Scheduling and Preemption

In a production system with limited GPU memory, sometimes accepting a new request requires evicting (preempting) a partially completed request. The evicted request’s KV cache can be:

  1. Dropped: The request is re-queued and must restart from scratch (re-prefill). Simple but wasteful.
  2. Swapped to CPU memory: The KV cache is copied to CPU RAM (much larger but slower). When GPU memory becomes available, the cache is swapped back and decode resumes. This preserves work but adds swap latency.
  3. Recomputed on demand: A hybrid approach where the prompt portion of the KV cache is recomputed (via prefill) but only on the portions needed for resuming generation.

Modern serving systems like vLLM, TensorRT-LLM, and SGLang implement sophisticated scheduling policies that balance throughput, latency, and fairness across concurrent requests.

The Serving Stack

A modern LLM serving system combines multiple techniques: continuous batching for utilization, PagedAttention for memory efficiency, chunked prefill for latency predictability, weight quantization for capacity, and KV cache quantization for batch size. Each technique addresses a different bottleneck, and they are multiplicatively beneficial. This is why frameworks like vLLM, TensorRT-LLM, and SGLang exist — the engineering surface area is enormous.


Real Numbers: Latency and Throughput on A100 and H100

Theory is essential, but numbers are what matter for production decisions. The following benchmarks represent typical performance numbers for popular models on A100-80GB SXM and H100-80GB SXM GPUs, measured with optimized serving frameworks (vLLM, TensorRT-LLM) under realistic conditions. All numbers are approximate and vary with framework version, driver version, quantization calibration, and workload characteristics.

Hardware Comparison

A100 vs H100 SXM Specifications (Relevant to Inference)

| Specification | A100 SXM | H100 SXM | H100/A100 Ratio |
| --- | --- | --- | --- |
| HBM capacity | 80 GB (HBM2e) | 80 GB (HBM3) | 1.0x |
| HBM bandwidth | 2.0 TB/s | 3.35 TB/s | 1.67x |
| FP16 tensor TFLOPS | 312 | 989 | 3.17x |
| INT8 tensor TOPS | 624 | 1,978 | 3.17x |
| FP8 tensor TFLOPS | N/A | 1,978 | N/A |
| Ridge point (FP16) | 156 FLOP/byte | 295 FLOP/byte | 1.89x |
| NVLink bandwidth (per GPU) | 600 GB/s | 900 GB/s | 1.5x |
| TDP | 400W | 700W | 1.75x |

For memory-bound operations (decode), the H100's advantage comes primarily from its 1.67× higher memory bandwidth. For compute-bound operations (prefill), the advantage is 3.17× (FP16) or even higher with FP8.
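The ridge points in the table follow directly from the peak numbers; a quick check:

```python
def ridge_point(peak_tflops, peak_tb_per_s):
    # Ridge point (FLOP/byte) = peak compute / peak memory bandwidth.
    # TFLOPS divided by TB/s gives FLOP per byte directly.
    return peak_tflops / peak_tb_per_s

a100_fp16 = ridge_point(312, 2.0)    # ~156 FLOP/byte
h100_fp16 = ridge_point(989, 3.35)   # ~295 FLOP/byte
```

Any kernel whose arithmetic intensity sits below these values is bandwidth-bound on that GPU; above them, compute-bound.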

Llama 3 8B: Single GPU Performance

Llama 3 8B Inference Performance (Single GPU, 2K input / 512 output tokens)

| Metric | A100 FP16 | A100 INT8 | H100 FP16 | H100 FP8 |
|---|---|---|---|---|
| TTFT (B=1) | ~45 ms | ~35 ms | ~18 ms | ~12 ms |
| ITL (B=1) | ~7.1 ms | ~5.8 ms | ~4.2 ms | ~3.5 ms |
| Decode tok/s (B=1) | ~140 | ~172 | ~238 | ~286 |
| Decode tok/s (B=32) | ~3,600 | ~4,200 | ~6,100 | ~7,800 |
| Decode tok/s (B=64) | ~5,800 | ~7,000 | ~9,800 | ~13,500 |
| Max batch (4K ctx) | ~120 | ~200 | ~120 | ~240 |

Key observations for the 8B model:

  • At batch=1, throughput is determined almost entirely by memory bandwidth. The H100's 1.67× bandwidth advantage translates to a 1.7× speedup in decode.
  • INT8 quantization on A100 provides a ~1.2× throughput improvement at small batch sizes (from reduced memory reads) and ~1.2× at large batch sizes (from fitting more requests).
  • FP8 on H100 provides the best overall performance: faster compute for prefill and reduced memory footprint for larger batches.
  • The maximum batch size approximately doubles with quantization (halving the weight memory frees room for more KV cache).
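The "max batch ~120" figure for A100 FP16 can be reproduced from the memory budget. A sketch using Llama 3 8B's shapes (32 layers, 8 KV heads via GQA, head dim 128); the 4 GiB overhead allowance for activations and framework state is an assumption:

```python
def max_batch(hbm_gib, weight_gib, ctx_tokens,
              n_layers=32, n_kv_heads=8, d_head=128,
              bytes_per_elem=2, overhead_gib=4):
    """Memory-capacity-limited batch: (HBM - weights - overhead) / KV per request."""
    kv_per_token = 2 * n_layers * n_kv_heads * d_head * bytes_per_elem  # K and V
    kv_per_request = kv_per_token * ctx_tokens
    free_bytes = (hbm_gib - weight_gib - overhead_gib) * 2**30
    return int(free_bytes // kv_per_request)

# A100-80GB, FP16 weights (~16 GiB), 4K context:
b = max_batch(80, 16, 4096)  # → 120, matching the ~120 in the table above
```

Halving `weight_gib` via quantization frees roughly 16 GiB, which at 0.5 GiB of KV cache per 4K-context request is another ~32 slots before any KV quantization.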

Llama 3 70B: Multi-GPU Performance

Llama 3 70B Inference Performance (4K input / 1K output tokens)

| Config | TTFT (B=1) | ITL (B=1) | Throughput B=1 | Throughput B=32 | Max batch |
|---|---|---|---|---|---|
| 2x A100 FP16 (TP=2) | ~320 ms | ~15 ms | ~67 tok/s | ~1,400 tok/s | ~10 |
| 4x A100 INT8 (TP=4) | ~95 ms | ~6.5 ms | ~154 tok/s | ~3,800 tok/s | ~100 |
| 2x H100 FP16 (TP=2) | ~130 ms | ~9 ms | ~111 tok/s | ~2,400 tok/s | ~10 |
| 4x H100 FP8 (TP=4) | ~35 ms | ~3.5 ms | ~286 tok/s | ~7,200 tok/s | ~120 |
| 8x H100 FP8 (TP=8) | ~20 ms | ~2.2 ms | ~455 tok/s | ~12,000 tok/s | ~280 |

Key observations for the 70B model:

  • Tensor parallelism (TP) is essential. On 2x A100 with FP16 weights, max batch is only ~10 — barely viable for production. The memory is almost entirely consumed by weights.
  • 4x H100 with FP8 is the sweet spot for many production deployments: good single-request latency (~3.5 ms/tok ITL), strong throughput at batch 32 (7,200 tok/s), and enough memory for 120 concurrent requests.
  • 8x H100 provides diminishing returns per GPU but offers the lowest absolute latency (2.2 ms/tok ITL) and highest throughput for latency-sensitive applications.
  • TTFT varies dramatically with tensor parallelism: more GPUs = faster prefill (compute splits across GPUs, though NVLink communication adds overhead).

Llama 3 405B: The Scale Challenge

Llama 3 405B Inference Performance (2K input / 512 output tokens)

| Config | TTFT (B=1) | ITL (B=1) | Throughput B=1 | Throughput B=16 | Min GPUs |
|---|---|---|---|---|---|
| 16x A100 INT8 (TP=8, PP=2) | ~380 ms | ~22 ms | ~45 tok/s | ~600 tok/s | 16 |
| 8x H100 FP8 (TP=8) | ~120 ms | ~11 ms | ~91 tok/s | ~1,200 tok/s | 8 |
| 16x H100 FP8 (TP=8, PP=2) | ~75 ms | ~7 ms | ~143 tok/s | ~2,000 tok/s | 16 |

At 405B parameters, the operational reality is stark:

  • Minimum 8 H100s (with FP8) just to fit the model in memory with room for a modest batch. With FP16, you need 16+ GPUs.
  • Pipeline parallelism (PP) becomes necessary alongside tensor parallelism when model weights exceed the combined memory of a single NVLink domain (typically 8 GPUs). PP adds pipeline bubbles that reduce efficiency by 10-20%.
  • Cost per token is extremely high. At ~2,000 tok/s on 16x H100 with batch=16, the cost is roughly 8× higher per token compared to the 70B model on 4x H100 at batch=32.
  • For most production use cases, the 70B model with good quantization provides a much better cost-performance tradeoff than the 405B model.
💡 Practical Guidance on GPU Selection

For Llama 3 8B: a single H100 (or A100 with INT8) handles most workloads comfortably. For Llama 3 70B: 4x H100 with FP8 is the production sweet spot. For Llama 3 405B: evaluate whether the quality improvement over 70B justifies the 4-8× higher serving cost — for many applications, it does not.

Decode Throughput Scaling Curves

To illustrate how throughput scales with batch size on real hardware, here are measured curves for Llama 3 8B:

Llama 3 8B Decode Throughput vs Batch Size (A100 FP16)

| Batch | Throughput (tok/s) | Regime |
|---|---|---|
| 1 | 140 | Memory-latency bound |
| 4 | 530 | |
| 8 | 980 | |
| 16 | 1,850 | |
| 32 | 3,600 | |
| 64 | 5,800 | Approaching BW limit |
| 128 | 7,200 | Near BW saturation |
| 256 | 7,800 | Compute starts limiting |

Llama 3 8B Decode Throughput vs Batch Size (H100 FP16)

| Batch | Throughput (tok/s) | Regime |
|---|---|---|
| 1 | 238 | 1.7x over A100 (BW ratio) |
| 4 | 900 | |
| 8 | 1,700 | |
| 16 | 3,200 | |
| 32 | 6,100 | |
| 64 | 9,800 | |
| 128 | 14,200 | Approaching BW limit |
| 256 | 16,500 | Compute starts limiting |

The scaling pattern is consistent: near-linear throughput growth with batch size in the memory-bound regime, followed by a plateau as compute becomes the bottleneck. The H100’s higher memory bandwidth pushes the transition point to a higher batch size, and its higher compute ceiling raises the plateau.
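The linear-then-plateau shape falls out of a two-term roofline model. This sketch uses A100 FP16 peaks with an assumed 50% achievable fraction of peak (`mfu`), and it ignores KV cache reads and attention FLOPs, so the absolute numbers are rough — but the shape is exactly the one measured above:

```python
def decode_throughput(batch, weight_bytes=16e9,
                      bw=2.0e12, peak_flops=312e12, mfu=0.5):
    """Roofline sketch of decode tokens/s vs. batch size (A100 FP16 defaults)."""
    params = weight_bytes / 2                            # FP16: 2 bytes per param
    t_mem = weight_bytes / (bw * mfu)                    # one weight read per step,
                                                         # shared by the whole batch
    t_compute = batch * 2 * params / (peak_flops * mfu)  # ~2 FLOPs/param/token
    return batch / max(t_mem, t_compute)                 # tokens per second

# Memory-bound region: throughput doubles with batch size.
# Compute-bound region: throughput is flat regardless of batch size.
low, high = decode_throughput(8), decode_throughput(256)
```

In this model the crossover sits exactly at the ridge point (batch ≈ 156 for A100 FP16), after which adding batch only adds latency, not throughput.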


Putting It All Together: The Mental Model

Let us synthesize everything into a coherent mental model for LLM inference performance.

The Three Resources

LLM inference performance is determined by three scarce resources:

  1. Compute (FLOPS): Tensor core throughput. Limits prefill speed and decode throughput at high batch sizes.
  2. Memory bandwidth (GB/s): HBM read throughput. Limits single-request decode speed and decode throughput at low-to-moderate batch sizes.
  3. Memory capacity (GB): Total HBM. Limits model size, maximum batch size (via KV cache), and maximum context length.

At any given moment, one of these is the binding constraint. The art of inference optimization is shifting the bottleneck to a different resource (typically from memory bandwidth to compute, via batching) or expanding the binding resource (more GPUs, faster memory, quantization to reduce capacity pressure).

Decision Framework

When evaluating an inference deployment, ask these questions in order:

1. Can the model fit in memory? Calculate: weights + KV cache (at target batch size and context length) + overhead. If this exceeds available GPU memory, you need more GPUs, quantization, or a smaller model.

2. What is the decode bottleneck? Calculate arithmetic intensity at your target batch size. If B < ridge point, you are memory-bandwidth-bound: faster memory or more batching helps; more compute does not. If B > ridge point, you are compute-bound: faster compute helps; more memory bandwidth does not.

3. What limits batch size? Usually KV cache memory. Calculate B_max as shown above. If B_max is below the ridge point, you will never fully utilize compute — consider KV cache quantization, GQA, or more memory to increase B_max.

4. What are the latency requirements? TTFT is bounded by prefill time (proportional to prompt length, inversely proportional to compute throughput). ITL is bounded by decode step time (inversely proportional to memory bandwidth at low batch, or compute at high batch). If latency requirements are strict, you may need to sacrifice throughput (smaller batch) or add more GPUs (split work).
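Steps 1-3 above can be folded into a quick feasibility check. A sketch — the 4 GiB overhead allowance and the example KV-per-token figure are assumptions to replace with your model's actual shapes:

```python
def plan(hbm_gib, weight_gib, batch, ctx_tokens,
         kv_per_token_bytes, ridge=156, overhead_gib=4):
    """Classify a deployment per the decision framework above.

    kv_per_token_bytes = 2 * n_layers * n_kv_heads * d_head * bytes_per_elem.
    ridge is the GPU's FLOP/byte ridge point (156 = A100 FP16).
    """
    kv_gib = batch * ctx_tokens * kv_per_token_bytes / 2**30
    need = weight_gib + overhead_gib + kv_gib
    if need > hbm_gib:
        return "does not fit: add GPUs, quantize, or use a smaller model"
    regime = "memory-bandwidth-bound" if batch < ridge else "compute-bound"
    return f"fits ({need:.0f} GiB); decode is {regime}"

# Llama 3 8B on A100-80GB at batch 32, 4K context (131072 KV bytes/token):
verdict = plan(80, 16, 32, 4096, 131072)
```

Running the memory check before the intensity check matters: a deployment that does not fit has no decode regime to reason about.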

The Optimization Stack

Each optimization targets a specific bottleneck:

Inference Optimization Hierarchy

| Optimization | Target Bottleneck | Typical Impact | When to Apply |
|---|---|---|---|
| Batching (B=1 to B=32+) | BW utilization | 10-30x throughput | Always -- this is non-negotiable |
| Continuous batching | GPU utilization | 2-3x throughput | Any multi-request serving |
| Weight quantization (FP16 to INT4/8) | Memory capacity | 2-4x batch capacity | When memory-constrained |
| KV cache quantization | Memory capacity | 1.5-2x batch capacity | Long context or large batch |
| PagedAttention | Memory fragmentation | 2-4x effective capacity | Variable-length serving |
| FlashAttention | Prefill compute + memory | 2-4x prefill speed | Always for long prompts |
| Tensor parallelism | Single-GPU limits | ~Nx (N GPUs) w/ overhead | Model exceeds 1 GPU |
| Speculative decoding | Decode latency | 1.5-3x per-request speed | Latency-sensitive apps |
| Prefix caching | Redundant prefill | 2-10x TTFT for repeated prefixes | RAG, system prompts |
| FP8 (H100/B200) | Compute + capacity | 1.5-2x over FP16 | When hardware supports it |

The Cost Equation

Ultimately, inference is about cost per token. The cost equation is:

Cost per token = (GPU cost per second) / (Throughput in tok/s)

Every optimization that increases throughput (without proportionally increasing GPU cost) reduces cost per token. This is why batching is so powerful: it increases throughput by 10-30× with zero additional hardware cost.

For a concrete example, take Llama 3 70B on 4x H100 (cost: ~$16/hour in cloud pricing):

  • Batch=1: ~286 tok/s, cost = $16 / 3600 / 286 ≈ $0.0000155 per token ($15.5 per million tokens)
  • Batch=32: ~7,200 tok/s, cost = $16 / 3600 / 7200 ≈ $0.00000062 per token ($0.62 per million tokens)

Batching reduced cost by 25×. This is why API providers can offer LLM inference at prices that seem impossibly low — they are batching aggressively across thousands of concurrent requests.
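The arithmetic behind those two bullets, in code form (the $16/hour figure is the same illustrative cloud price used above):

```python
def dollars_per_million_tokens(gpu_dollars_per_hour, tok_per_s):
    # Cost per token = GPU cost per second / throughput; scaled to 1M tokens.
    return gpu_dollars_per_hour / 3600 / tok_per_s * 1e6

b1  = dollars_per_million_tokens(16, 286)    # ≈ $15.5 per million tokens
b32 = dollars_per_million_tokens(16, 7200)   # ≈ $0.62 per million tokens
savings = b1 / b32                           # ≈ 25x from batching alone
```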


Conclusion

LLM inference is defined by a fundamental asymmetry: prefill is compute-bound (large matrix multiplications that saturate tensor cores), while decode is memory-bandwidth-bound (full model weight reads for each single generated token). This asymmetry, rooted in the autoregressive nature of language generation, drives every design decision in the inference stack.

The key quantitative relationships to internalize:

  • Arithmetic intensity during decode = batch size (in FLOP/byte). At batch=1, utilization is below 1%. The ridge point is ~156 for A100 and ~295 for H100.
  • KV cache scales as 2 × n_layers × n_kv_heads × d_head × seq_len × batch_size. It dominates memory at production batch sizes.
  • Maximum batch size is memory-limited, and quantization’s primary serving benefit is freeing memory for larger batches.
  • Throughput scales linearly with batch size until compute saturation, then plateaus.

The serving innovations — continuous batching, PagedAttention, chunked prefill, KV cache quantization — all serve a single goal: maximize the number of concurrent requests the GPU can process, thereby amortizing the fixed cost of reading model weights across more useful output tokens.

If you take away one thing from this post: LLM decode performance is not about making the GPU compute faster. It is about giving the GPU more useful work to do with each byte it reads from memory. Every optimization in the inference stack — from batching to quantization to paged memory management — is ultimately in service of this principle.