LLM inference looks simple from the outside — feed in a prompt, get text back. Under the hood, the story is radically different from training. Training processes entire sequences in parallel with known labels. Inference must generate tokens one at a time, where each new token depends on every token before it. This autoregressive bottleneck shapes every aspect of how we design, optimize, and deploy inference systems.
This post is the foundational reference for understanding LLM inference performance. We will cover the two-phase execution model (prefill and decode), derive the arithmetic intensity of each phase from first principles, work through detailed memory calculations for the Llama 3 family (8B, 70B, and 405B), analyze how batching transforms the compute-to-memory ratio, discuss the serving challenges that arise at scale, and ground everything in real measured numbers on A100 and H100 hardware.
If you work on ML infrastructure, model serving, or simply want to understand why your chatbot sometimes feels slow, this is the post to start with.
Why Inference Differs from Training
Before diving into the mechanics of inference, it is worth understanding why inference is a fundamentally different computational problem from training — even though both execute the same neural network architecture.
Training: Full Parallelism
During training, we have the complete input-output sequence available upfront. A training step for a batch of sequences works like this:
- Forward pass: Process all tokens in every sequence simultaneously. For a batch of B sequences each of length S, the input to each transformer layer is a matrix of shape [B * S, d_model]. The matrix multiplications are large and keep GPU compute units fully occupied.
- Loss computation: Compare predicted tokens against ground-truth labels (which we already know) across all positions.
- Backward pass: Compute gradients through the full computational graph. The backward pass involves the same large matrix multiplications as the forward pass, roughly doubling the total compute.
- Optimizer step: Update all parameters using the computed gradients.
The critical insight is that training sees the entire sequence at once. There is no sequential dependency between generating token t and token t+1 because we already have the ground truth for both. This means training is embarrassingly parallel across the sequence dimension, and the matrix multiplications are enormous — exactly what GPUs are designed for.
Inference: The Autoregressive Bottleneck
Inference is fundamentally sequential in the output dimension. To generate token t, the model must:
- Have already generated token t-1 (and all tokens before it).
- Run a full forward pass through all layers to produce a probability distribution over the vocabulary.
- Sample or select token t from that distribution.
- Feed token t back as input and repeat.
This means we cannot generate the 50th output token until we have generated the first 49. Each forward pass during generation processes only a single new token (or, in speculative decoding, a small handful). The matrix multiplications shrink from [S, d_model] x [d_model, d_ff] during training to [1, d_model] x [d_model, d_ff] during generation — the input dimension collapses from S (potentially thousands) down to 1.
This is the autoregressive bottleneck. It means that during token generation, GPU compute units sit largely idle while the chip spends most of its time simply reading model weights from memory. The problem is not a lack of compute power; it is a lack of enough work to do per byte of data loaded.
Training is compute-bound: large matrix multiplications keep tensor cores busy. Inference generation is memory-bandwidth-bound: the model weights must be read from HBM for every single output token, but each token only requires a tiny amount of arithmetic per weight loaded. This asymmetry is the single most important fact in LLM inference.
The Roofline Perspective
The roofline model provides a clean framework for understanding this. A GPU has two ceilings: peak compute (measured in FLOPS) and peak memory bandwidth (measured in bytes/second). The ratio of these gives the ridge point — the arithmetic intensity (FLOP/byte) at which a workload transitions from memory-bound to compute-bound.
For an NVIDIA A100 SXM:
- Peak FP16 tensor core compute: 312 TFLOPS
- Peak HBM bandwidth: 2.0 TB/s
- Ridge point: 312 / 2.0 ≈ 156 FLOP/byte
For an NVIDIA H100 SXM:
- Peak FP16 tensor core compute: 989 TFLOPS
- Peak HBM bandwidth: 3.35 TB/s
- Ridge point: 989 / 3.35 ≈ 295 FLOP/byte
Any operation with arithmetic intensity below the ridge point is memory-bandwidth-bound. Any operation above it is compute-bound. As we will see, decode sits far below the ridge point, while prefill (with sufficiently long prompts) sits above it.
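The ridge points above follow directly from the two peak numbers. As a quick sanity check (the function name here is ours, not any library's API):

```python
def ridge_point(peak_tflops: float, bandwidth_tbps: float) -> float:
    """FLOP/byte at which a workload flips from memory-bound to compute-bound."""
    return (peak_tflops * 1e12) / (bandwidth_tbps * 1e12)

a100_ridge = ridge_point(312, 2.0)    # A100 SXM, FP16 tensor cores
h100_ridge = ridge_point(989, 3.35)   # H100 SXM, FP16 tensor cores
print(round(a100_ridge), round(h100_ridge))  # 156 295
```

Note that the H100's ridge point is higher: its compute grew faster than its bandwidth, so it takes an even more arithmetic-dense workload to saturate it.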
Prefill vs Decode: Two Phases, Two Bottlenecks
Every LLM inference request proceeds through two distinct phases that have opposite performance characteristics.
Phase 1: Prefill (Prompt Processing)
During prefill, all input prompt tokens are processed simultaneously in a single forward pass. If the prompt has P tokens, the input to each transformer layer is a matrix of shape [P, d_model]. This flows through:
- Self-attention: Query, key, and value projections are matrix multiplications of shape [P, d_model] x [d_model, d_model], producing 2 * P * d_model^2 FLOPs per projection (three projections plus the output projection). The attention score computation is Q K^T per head, which is O(P^2 * d_head) per head.
- Feed-forward network (FFN): Two (or three, for SwiGLU) matrix multiplications of shape [P, d_model] x [d_model, d_ff], producing 2 * P * d_model * d_ff FLOPs each.
The key observation is that P appears as a dimension of the input matrix. When P = 512 or P = 2,048, these are large GEMMs (general matrix multiplications) with high arithmetic intensity. The GPU’s tensor cores are fully utilized.
Arithmetic intensity of prefill (FFN layer):
Consider a single FFN weight matrix of shape [d_model, d_ff]. The input is [P, d_model].
- Bytes loaded: 2 * d_model * d_ff bytes (FP16 weights) + 2 * P * d_model bytes (input activations). For P much smaller than d_ff, the weight matrix dominates.
- FLOPs: 2 * P * d_model * d_ff (multiply-accumulate).
- Arithmetic intensity: approximately (2 * P * d_model * d_ff) / (2 * d_model * d_ff) = P FLOP/byte.
For a prompt of P = 512 tokens, arithmetic intensity is ~512 FLOP/byte — well above the ridge point of both the A100 (~156) and the H100 (~295). Prefill is compute-bound.
In practice, prefill speed scales roughly linearly with prompt length (twice the prompt takes roughly twice as long, because there is twice the compute to do) until the prompt is so short that the operation falls below the ridge point.
Phase 2: Decode (Token Generation)
During decode, each forward pass processes a single new token. The input to each layer is a vector of shape [1, d_model]. This changes the computational profile dramatically:
- Self-attention: The QKV projections become matrix-vector multiplications: [1, d_model] x [d_model, d_model]. The attention computation reads the entire KV cache (all previous keys and values) but only produces one new query.
- FFN: The multiplications become [1, d_model] x [d_model, d_ff] — again, matrix-vector products.
Arithmetic intensity of decode (FFN layer):
Same weight matrix [d_model, d_ff], but now the input is [1, d_model]:
- Bytes loaded: 2 * d_model * d_ff bytes (weights still fully loaded).
- FLOPs: 2 * d_model * d_ff.
- Arithmetic intensity: approximately 1 FLOP/byte.
An arithmetic intensity of ~1 FLOP/byte is catastrophically below the ridge point. The GPU’s tensor cores are idle more than 99% of the time during single-request decode. Decode is memory-bandwidth-bound.
Prefill vs Decode: Arithmetic Intensity Comparison
| Property | Prefill (P=512) | Decode (B=1) | Ratio |
|---|---|---|---|
| Input shape to FFN | [512, d_model] | [1, d_model] | 512x |
| FLOPs (per FFN matrix) | 2 * 512 * d * d_ff | 2 * 1 * d * d_ff | 512x |
| Bytes loaded (weights) | d * d_ff * 2 | d * d_ff * 2 | 1x (same!) |
| Arithmetic intensity | ~512 FLOP/byte | ~1 FLOP/byte | 512x |
| Bottleneck | Compute (tensor cores) | Memory bandwidth (HBM) | -- |
| A100 utilization (FP16) | 70-90% | less than 1% | -- |
This table reveals the fundamental tragedy of LLM decode: you must read the entire model’s weights from memory for every single output token, but each token only does a trivial amount of arithmetic with those weights. The hardware is designed for workloads at 150+ FLOP/byte, and decode delivers ~1 FLOP/byte.
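The table's intensity column can be reproduced from the byte and FLOP counts derived above. A small helper (names are ours) using Llama 3 8B's published FFN dimensions:

```python
def ffn_arithmetic_intensity(n_tokens: int, d_model: int, d_ff: int) -> float:
    """FLOP/byte for one FP16 FFN matmul of shape [n_tokens, d_model] x [d_model, d_ff]."""
    flops = 2 * n_tokens * d_model * d_ff
    bytes_loaded = 2 * d_model * d_ff + 2 * n_tokens * d_model  # weights + activations
    return flops / bytes_loaded

# Llama 3 8B: d_model = 4096, d_ff = 14336
print(round(ffn_arithmetic_intensity(1, 4096, 14336), 2))   # decode:  1.0
print(round(ffn_arithmetic_intensity(512, 4096, 14336)))    # prefill: 494
```

The prefill value lands slightly below 512 because the activation bytes are counted too; the intensity approaches n_tokens only while n_tokens is much smaller than d_ff.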
The KV Cache: Avoiding Redundant Computation
There is one crucial optimization that makes autoregressive generation practical at all: the KV cache. Without it, generating token t would require recomputing the keys and values for all previous tokens from scratch — meaning the cost of generating a sequence of length N would be O(N^2) in the sequence dimension (and O(N^3) total when accounting for the attention computation at each step).
The KV cache stores the key and value vectors computed during previous forward passes. When generating token t, we only need to:
- Compute the new query, key, and value for the current token.
- Append the new key and value to the cache.
- Compute attention between the single new query and all cached keys/values.
This reduces the cost from reprocessing all t previous tokens per step to processing a single token per step — a critical optimization. But it comes at the cost of memory: the KV cache grows linearly with sequence length and must be stored in GPU memory. As we will see, this memory pressure becomes the dominant constraint at scale.
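The three cached-decode steps above can be sketched in pure Python for a single attention head (a toy illustration of the mechanism, not a real kernel; all names are ours):

```python
import math

def decode_step_attention(q, new_k, new_v, k_cache, v_cache):
    """One decode step: append the new key/value to the cache, then attend
    the single new query against every cached position."""
    k_cache.append(new_k)
    v_cache.append(new_v)
    d = len(q)
    # scaled dot-product scores: one new query vs all cached keys
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in k_cache]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]   # numerically stable softmax
    z = sum(weights)
    # weighted sum of cached values
    return [sum(w * v[j] for w, v in zip(weights, v_cache)) / z
            for j in range(len(v_cache[0]))]

k_cache, v_cache = [], []
out = decode_step_attention([1.0, 0.0], [1.0, 0.0], [2.0, 3.0], k_cache, v_cache)
print(out)  # with a single cached position, the output is exactly its value: [2.0, 3.0]
```

Note the asymmetry the text describes: each call does O(t) work over the cache but only O(1) new key/value computation, while the weights of the surrounding projections would be re-read in full every step.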
Memory Math: Llama 3 Family
Let us work through the memory requirements for the three sizes in the Llama 3 family: 8B, 70B, and 405B. Understanding these numbers is essential for capacity planning.
Model Weights
The memory required for model weights depends on the number of parameters and the precision:
- FP16/BF16: 2 bytes per parameter
- INT8: 1 byte per parameter
- INT4: 0.5 bytes per parameter
Llama 3 Model Weight Memory
| Model | Parameters | FP16 (GB) | INT8 (GB) | INT4 (GB) | Min GPUs (FP16, 80GB) |
|---|---|---|---|---|---|
| Llama 3 8B | 8.03B | 16.1 | 8.0 | 4.0 | 1 |
| Llama 3 70B | 70.6B | 141.2 | 70.6 | 35.3 | 2 |
| Llama 3 405B | 405.1B | 810.2 | 405.1 | 202.6 | 11 |
Just the weights for Llama 3 405B in FP16 require over 810 GB — more than 10 A100-80GB GPUs just to hold the parameters with zero room for anything else. This is why quantization and model parallelism are not optional at this scale; they are mandatory.
KV Cache Memory
The KV cache is where things get interesting — and where most practitioners underestimate memory requirements. The KV cache stores key and value vectors for every layer, every head, and every token in every active request.
Per-token KV cache size:

KV bytes per token = 2 * n_layers * n_kv_heads * d_head * bytes_per_param

The factor of 2 accounts for both keys and values. Note that Llama 3 uses Grouped Query Attention (GQA), where the number of KV heads (n_kv_heads) is smaller than the number of query heads (n_heads). This is specifically designed to reduce KV cache size.
Let us compute for each model:
Llama 3 8B: n_layers = 32, n_kv_heads = 8 (GQA with 4:1 ratio from 32 query heads), d_head = 128 → 2 * 32 * 8 * 128 * 2 bytes = 0.125 MB per token
Llama 3 70B: n_layers = 80, n_kv_heads = 8 (GQA with 8:1 ratio from 64 query heads), d_head = 128 → 2 * 80 * 8 * 128 * 2 bytes = 0.3125 MB per token
Llama 3 405B: n_layers = 126, n_kv_heads = 8 (GQA with 16:1 ratio from 128 query heads), d_head = 128 → 2 * 126 * 8 * 128 * 2 bytes ≈ 0.49 MB per token
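Plugging the published Llama 3 shapes into the formula reproduces the per-token figures (the helper name is ours):

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, d_head: int,
                       bytes_per_param: int = 2) -> int:
    """Per-token KV cache size; the leading 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_param

print(kv_bytes_per_token(32, 8, 128) / 2**20)    # Llama 3 8B:   0.125 MB
print(kv_bytes_per_token(80, 8, 128) / 2**20)    # Llama 3 70B:  0.3125 MB
print(kv_bytes_per_token(126, 8, 128) / 2**20)   # Llama 3 405B: ~0.492 MB
```

Setting bytes_per_param to 1 models an INT8-quantized KV cache and halves every number, which is exactly the lever used later in the serving discussion.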
Now let us scale this to realistic serving scenarios. Consider a single request with a context length of 4,096 tokens:
KV Cache Memory Per Request (context = 4,096 tokens, FP16)
| Model | KV/token (MB) | KV/request (GB) | Model weights (GB) | KV as % of weights |
|---|---|---|---|---|
| Llama 3 8B | 0.125 | 0.5 | 16.1 | 3.1% |
| Llama 3 70B | 0.3125 | 1.25 | 141.2 | 0.9% |
| Llama 3 405B | 0.49 | 1.96 | 810.2 | 0.2% |
At one request with 4K context, the KV cache is modest. But serving is not about one request — it is about hundreds or thousands of concurrent requests.
KV Cache Memory at Scale (context = 4,096 tokens, FP16)
| Model | Batch=1 | Batch=32 | Batch=128 | Batch=256 |
|---|---|---|---|---|
| Llama 3 8B | 0.5 GB | 16 GB | 64 GB | 128 GB |
| Llama 3 70B | 1.25 GB | 40 GB | 160 GB | 320 GB |
| Llama 3 405B | 1.96 GB | 62.7 GB | 250.9 GB | 501.8 GB |
For Llama 3 8B at batch=128 with 4K context, the KV cache requires 64 GB — four times the 16 GB model weights. For Llama 3 70B at the same batch size, the KV cache requires 160 GB, exceeding the model weights (141 GB). At production batch sizes, the KV cache is the dominant memory consumer, not the model weights. This is why KV cache optimization (GQA, quantization, paged attention) is the most impactful area for serving efficiency.
With longer context lengths (32K, or the full 128K context window that the Llama 3.1 models support), the numbers become even more extreme. At 128K context with batch=32, Llama 3 8B would need 512 GB of KV cache alone — an impossible number for a single GPU. This is why long-context serving requires aggressive KV cache compression techniques like quantization to INT8 or INT4, sliding window attention, or offloading to CPU memory.
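Scaling the per-token figures to batches and long contexts is simple multiplication; the table above and the 128K example both fall out of it (the helper name is ours):

```python
def kv_cache_gb(mb_per_token: float, context_len: int, batch_size: int) -> float:
    """Total KV cache in GB for a batch of requests at a given context length."""
    return mb_per_token * context_len * batch_size / 1024

print(kv_cache_gb(0.125, 4096, 128))     # Llama 3 8B, batch=128, 4K ctx: 64.0 GB
print(kv_cache_gb(0.3125, 4096, 256))    # Llama 3 70B, batch=256, 4K ctx: 320.0 GB
print(kv_cache_gb(0.125, 131072, 32))    # Llama 3 8B, batch=32, 128K ctx: 512.0 GB
```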
Activation Memory
Activation memory is the working memory used during a forward pass to store intermediate results. Unlike model weights (which are constant) and the KV cache (which grows with sequence length), activation memory is proportional to the batch size and sequence length of the current forward pass.
During prefill, activation memory can be significant because the full prompt is processed at once. The dominant terms are:
- Attention scores: O(B * n_heads * S^2) without FlashAttention, or O(B * S * d_model) with FlashAttention (which computes attention in tiles and does not materialize the full attention matrix).
- Layer intermediate outputs: O(B * S * d_ff) per layer.
For Llama 3 8B with B = 1, S = 8,192: without FlashAttention, the attention matrix alone would be 32 heads * 8,192^2 * 2 bytes ≈ 4.3 GB per layer, times 32 layers ≈ 137 GB. With FlashAttention, this drops to a few hundred MB total. FlashAttention is not optional for long-context inference.
During decode, activation memory is negligible because we process only one token per step. The intermediate tensors are tiny vectors rather than large matrices.
Total Memory Budget
Let us put it all together for a concrete deployment scenario: Llama 3 70B on 2x A100-80GB (160 GB total) with FP16 weights, serving batch=32 at 4K context.
GPU Memory Breakdown: Llama 3 70B on 2x A100-80GB (batch=32, 4K context)
| Component | Memory (GB) |
|---|---|
| Model weights (FP16) | 141.2 |
| KV cache (32 x 1.25 GB) | 40.0 |
| Activations and workspace | ~3 |
| Framework and CUDA overhead | ~5 |
| Total required | ~189 |
| Available (2x A100-80GB) | 160 |

We are 29 GB over budget! The solution space includes:
- Quantize weights to INT8: reduces weights from 141 GB to 70.6 GB, freeing 70 GB. Total becomes 119 GB — fits with headroom.
- Quantize KV cache to INT8: reduces KV from 40 GB to 20 GB.
- Reduce batch size: batch=16 halves the KV cache to 20 GB, bringing the total to ~169 GB… still over budget.
- Add more GPUs: 4x A100 gives 320 GB but increases communication overhead.
In practice, production deployments use a combination of INT8 or INT4 weight quantization, INT8 KV cache quantization, and enough GPUs to provide headroom for burst traffic. The memory budget is always the binding constraint.
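The budget arithmetic above can be captured in a tiny checker. The function name and the fixed ~8 GB reserve for activations plus framework overhead are our assumptions, consistent with the breakdown in this section:

```python
def memory_check(weights_gb: float, kv_per_request_gb: float, batch: int,
                 hbm_gb: float, reserve_gb: float = 8.0):
    """Return (total GB required, fits?) for a rough capacity plan."""
    total = weights_gb + kv_per_request_gb * batch + reserve_gb
    return total, total <= hbm_gb

print(memory_check(141.2, 1.25, 32, 160.0))  # FP16 weights: ~189 GB, does not fit
print(memory_check(70.6, 1.25, 32, 160.0))   # INT8 weights: ~119 GB, fits
```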
Batch Size Effects: How Batching Changes Everything
Batching is the single most important lever in LLM inference performance. It transforms decode from a catastrophically underutilized operation into something approaching reasonable GPU utilization.
The Mechanics of Batched Decode
When we batch B decode requests together, the forward pass processes B tokens simultaneously (one per request). The FFN computation changes from a matrix-vector product to a matrix-matrix product:
- Unbatched: [1, d_model] x [d_model, d_ff] — matrix-vector
- Batched (B): [B, d_model] x [d_model, d_ff] — matrix-matrix (small GEMM)
The weight matrix is loaded once from HBM regardless of batch size. But with batch size B, we perform B times more FLOPs with that single read.
Arithmetic intensity of batched decode is approximately B FLOP/byte — B times the FLOPs against the same weight bytes:
- At B = 1: ~1 FLOP/byte (memory-bound).
- At B = 32: ~32 FLOP/byte (still memory-bound on A100, but much better).
- At B = 156: ~156 FLOP/byte (at the ridge point for A100 — balanced).
- At B = 295: ~295 FLOP/byte (at the ridge point for H100 — balanced).
- At B = 512: compute-bound on H100 — tensor cores become the bottleneck.
Arithmetic Intensity vs Batch Size During Decode (chart; y-axis in FLOP/byte)

Throughput vs Latency Tradeoff
Increasing batch size improves throughput (total tokens generated per second across all requests) but degrades per-request latency (time for an individual token). Here is why:
- Throughput: Weight reads are amortized across requests. Total throughput scales roughly linearly with B until we hit the compute ceiling or run out of memory.
- Latency: With larger batches, each forward pass takes longer because there is more compute to do. The KV cache also grows, making the attention computation slower. For a single user, their tokens come slower.
This is the fundamental tension in LLM serving: maximizing throughput for the provider vs minimizing latency for the user.
Batch Size Impact on Throughput and Latency (Llama 3 8B, A100-80GB, FP16)
| Batch Size | Throughput (tok/s) | Per-Request Latency (ms/tok) | GPU Compute Util | KV Cache (GB, 2K ctx) |
|---|---|---|---|---|
| 1 | ~140 | ~7 | less than 1% | 0.25 |
| 4 | ~540 | ~7.4 | ~3% | 1.0 |
| 16 | ~2,000 | ~8 | ~10% | 4.0 |
| 32 | ~3,600 | ~8.9 | ~20% | 8.0 |
| 64 | ~5,800 | ~11 | ~40% | 16.0 |
| 128 | ~7,200 | ~17.8 | ~75% | 32.0 |
| 256 | ~7,800 | ~32.8 | ~95% | 64.0 |
Several things to notice:
- Throughput scales almost linearly from batch 1 to batch 64, then starts to plateau as compute utilization approaches the ceiling.
- Per-request latency stays relatively flat up to batch 32 (the operation is still memory-bound, so adding work is “free”), then starts rising as we approach the compute ridge point.
- KV cache memory scales linearly with batch size — at batch 256, it is 64 GB, consuming most of the A100’s memory.
- The “sweet spot” for serving is typically in the range of batch 32-64, where throughput is high but latency is still acceptable and KV cache has not consumed all memory.
The Memory Wall
The maximum achievable batch size is constrained by GPU memory:

B_max = (HBM capacity - model weights - activation/overhead reserve) / (KV cache per request)

For Llama 3 8B on A100-80GB with FP16 weights and 4K context:

B_max ≈ (80 - 16.1 - ~4) / 0.5 ≈ 120 concurrent requests

For Llama 3 70B on 2x A100-80GB with FP16 weights and 4K context:

B_max ≈ (160 - 141.2 - ~6) / 1.25 ≈ 10 concurrent requests
Only 10 concurrent requests! This is why weight quantization is so important for larger models — it frees memory for KV cache, enabling larger batch sizes, which improves throughput. With INT8 weights (70.6 GB), the budget becomes:

B_max ≈ (160 - 70.6 - ~6) / 1.25 ≈ 66 concurrent requests

Quantizing weights from FP16 to INT8 increased max batch size from 10 to 66 — a ~6.6x improvement in maximum throughput, just from weight quantization. This is why quantization is listed as a serving optimization, not just a model compression technique.
Weight quantization is not just about making models smaller. In a serving context, its primary benefit is freeing GPU memory for KV cache, which enables larger batch sizes, which dramatically improves throughput. Reducing Llama 3 70B from FP16 to INT4 weights frees ~106 GB across 2 GPUs — enough for ~85 additional concurrent 4K-context requests.
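The B_max arithmetic above, as a one-liner. The function name and the fixed ~6 GB reserve for activations and framework overhead are our assumptions, chosen to match the section's worked numbers:

```python
def max_batch(hbm_gb: float, weights_gb: float, kv_per_request_gb: float,
              reserve_gb: float = 6.0) -> int:
    """Largest batch whose KV cache still fits after weights and a fixed reserve."""
    return max(0, int((hbm_gb - weights_gb - reserve_gb) // kv_per_request_gb))

print(max_batch(160, 141.2, 1.25))  # Llama 3 70B, FP16 on 2x A100: 10
print(max_batch(160, 70.6, 1.25))   # same hardware, INT8 weights: 66
print(max_batch(80, 16.1, 0.5, reserve_gb=4.0))  # Llama 3 8B, FP16, one A100: ~119
```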
The Serving Problem
Moving from a single inference request to a production serving system introduces a cascade of new challenges. Naive inference wastes enormous amounts of GPU resources, and the key innovations in LLM serving are all about closing this efficiency gap.
Why Naive Inference Wastes GPU
Consider a naive serving setup: a queue of requests, processed one at a time. Each request goes through prefill, then generates tokens until completion (or max length), then the next request starts.
The problems are severe:
- Zero batching during decode: Each decode step reads the full model weights for a single token. On A100, this means less than 1% compute utilization during generation — the most expensive hardware in the world is sitting idle.
- Wasted prefill capacity: While one request is decoding, the GPU could be prefilling new requests (prefill is compute-bound and decode is memory-bound — they use different resources). But naive sequential processing cannot overlap them.
- Variable request lengths: Requests have different prompt lengths and generate different numbers of output tokens. With static batching, the entire batch waits for the longest request to finish. If one request generates 10 tokens and another generates 500, the short request's slot sits idle for 98% of the batch's lifetime.
- Memory fragmentation: Different requests occupy different amounts of KV cache memory. As requests arrive and complete, the KV cache develops holes — allocated but unused memory regions that cannot be reclaimed for new requests.
Static Batching: The First Improvement
The simplest optimization is to batch multiple requests together: collect requests, prefill them all, then decode them all simultaneously. This improves throughput proportionally to the batch size (as discussed above).
But static batching has a critical flaw: all requests in a batch must start and end together. When the first request in a batch finishes generating, its slot sits empty until the entire batch completes. In practice, output lengths vary widely (some responses are 20 tokens, others are 2,000), so the average slot utilization of a static batch can be as low as 40-60%.
Continuous Batching: The Key Innovation
Continuous batching (also called iteration-level scheduling or inflight batching) solves this by making scheduling decisions at each decode step rather than at the batch level.
The core idea:
- After each decode iteration, check if any requests in the batch have finished (hit EOS token or max length).
- If so, immediately evict them and insert waiting requests from the queue.
- The new requests go through prefill (which can be interleaved or chunked) and then join the decode batch.
This means the batch stays full continuously — as soon as one request finishes, another takes its place. Utilization jumps from roughly 40-60% to 90% or higher.
The engineering challenge is managing the KV cache for a dynamically changing set of requests. When a request finishes and a new one starts, the new request’s KV cache must be allocated and the old one freed, without fragmenting the contiguous memory blocks that GPU kernels require.
Paged Attention and KV Cache Management
PagedAttention, introduced by the vLLM system, applies the operating system concept of virtual memory paging to KV cache management. Instead of allocating a contiguous block of memory for each request’s KV cache (which leads to internal fragmentation when requests have variable lengths), PagedAttention:
- Divides KV cache memory into fixed-size blocks (analogous to memory pages).
- Maintains a block table mapping each request’s logical KV positions to physical memory blocks.
- Allocates blocks on demand as sequences grow, and frees them immediately when sequences complete.
This eliminates nearly all internal fragmentation and enables near-optimal memory utilization. In practice, PagedAttention increases the effective KV cache capacity by 2-4x compared to contiguous allocation, enabling proportionally larger batch sizes.
The tradeoff is a small overhead from the indirection (looking up block tables during attention computation), but this is negligible compared to the throughput gains from better memory utilization.
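The block-table mechanics can be illustrated with a toy allocator. This is our simplified sketch of the idea, not vLLM's actual API or data layout:

```python
class PagedKVCache:
    """Toy block-table allocator in the spirit of PagedAttention."""
    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}                      # request id -> physical block ids
        self.lengths = {}                           # request id -> tokens stored

    def append(self, req: str) -> int:
        """Reserve a slot for one new token, allocating a block on demand.
        Returns the physical slot index for the token's key/value."""
        n = self.lengths.get(req, 0)
        table = self.block_tables.setdefault(req, [])
        if n % self.block_size == 0:                # current block full (or first token)
            table.append(self.free_blocks.pop())    # grab any free physical block
        self.lengths[req] = n + 1
        block = table[n // self.block_size]
        return block * self.block_size + n % self.block_size

    def release(self, req: str) -> None:
        """Free all of a request's blocks the moment it completes."""
        self.free_blocks.extend(self.block_tables.pop(req, []))
        self.lengths.pop(req, None)
```

Because blocks are allocated one at a time and returned to a shared pool, the only wasted space is the unfilled tail of each request's last block — at most block_size - 1 tokens per request.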
Serving Strategy Comparison (Llama 3 8B, A100, mixed workload)
| Strategy | Avg Throughput (tok/s) | GPU Utilization | Memory Waste | Scheduling Overhead |
|---|---|---|---|---|
| Sequential (B=1) | 140 | less than 1% | None | None |
| Static batching (B=32) | 3,600 | ~20% | 30-50% | Low |
| Continuous batching (B=32) | 5,400 | ~30% | 10-20% | Medium |
| Continuous + PagedAttention | 7,000 | ~45% | less than 5% | Medium |
| + Chunked prefill + overlap | 8,500 | ~55% | less than 5% | High |
Chunked Prefill and Prefill-Decode Overlap
Another important serving optimization is chunked prefill: instead of processing the entire prompt in a single forward pass (which can cause latency spikes for long prompts), split the prompt into chunks and process each chunk in a separate iteration. This allows:
- Interleaving with decode: While processing a chunk of a new request’s prompt, the decode tokens from existing requests can be included in the same forward pass. This prevents prefill from stalling ongoing decode requests.
- Bounded latency: No single iteration takes longer than a fixed time (determined by chunk size), making TTFT more predictable.
- Better GPU utilization: Each iteration has a mix of prefill tokens (compute-heavy) and decode tokens (memory-heavy), which better utilizes both compute and memory bandwidth simultaneously.
The optimal chunk size balances prefill throughput (larger chunks amortize per-iteration overhead) against decode latency (larger chunks mean longer iterations that delay decode tokens). Typical chunk sizes are 256-1024 tokens.
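The splitting step itself is trivial; the hard part is the scheduler that interleaves chunks with decode tokens. A minimal sketch of the split (function name is ours):

```python
def chunk_prompt(prompt_len: int, chunk_size: int = 512) -> list[int]:
    """Split a prompt into prefill chunks so no iteration exceeds chunk_size tokens."""
    return [min(chunk_size, prompt_len - start)
            for start in range(0, prompt_len, chunk_size)]

print(chunk_prompt(1300))   # [512, 512, 276]
print(chunk_prompt(100))    # [100]: short prompts need a single chunk
```

Each entry becomes one iteration's prefill budget, with the remaining token slots in that iteration filled by decode tokens from in-flight requests.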
Request Scheduling and Preemption
In a production system with limited GPU memory, sometimes accepting a new request requires evicting (preempting) a partially completed request. The evicted request’s KV cache can be:
- Dropped: The request is re-queued and must restart from scratch (re-prefill). Simple but wasteful.
- Swapped to CPU memory: The KV cache is copied to CPU RAM (much larger but slower). When GPU memory becomes available, the cache is swapped back and decode resumes. This preserves work but adds swap latency.
- Recomputed on demand: A hybrid approach where the evicted KV cache is rebuilt by re-running prefill, but only for the portions needed to resume generation.
Modern serving systems like vLLM, TensorRT-LLM, and SGLang implement sophisticated scheduling policies that balance throughput, latency, and fairness across concurrent requests.
A modern LLM serving system combines multiple techniques: continuous batching for utilization, PagedAttention for memory efficiency, chunked prefill for latency predictability, weight quantization for capacity, and KV cache quantization for batch size. Each technique addresses a different bottleneck, and they are multiplicatively beneficial. This is why frameworks like vLLM, TensorRT-LLM, and SGLang exist — the engineering surface area is enormous.
Real Numbers: Latency and Throughput on A100 and H100
Theory is essential, but numbers are what matter for production decisions. The following benchmarks represent typical performance numbers for popular models on A100-80GB SXM and H100-80GB SXM GPUs, measured with optimized serving frameworks (vLLM, TensorRT-LLM) under realistic conditions. All numbers are approximate and vary with framework version, driver version, quantization calibration, and workload characteristics.
Hardware Comparison
A100 vs H100 SXM Specifications (Relevant to Inference)
| Specification | A100 SXM | H100 SXM | H100/A100 Ratio |
|---|---|---|---|
| HBM capacity | 80 GB (HBM2e) | 80 GB (HBM3) | 1.0x |
| HBM bandwidth | 2.0 TB/s | 3.35 TB/s | 1.67x |
| FP16 tensor TFLOPS | 312 | 989 | 3.17x |
| INT8 tensor TOPS | 624 | 1,978 | 3.17x |
| FP8 tensor TFLOPS | N/A | 1,978 | N/A |
| Ridge point (FP16) | 156 FLOP/byte | 295 FLOP/byte | 1.89x |
| NVLink bandwidth (per GPU) | 600 GB/s | 900 GB/s | 1.5x |
| TDP | 400W | 700W | 1.75x |
For memory-bound operations (decode), the H100’s advantage is primarily its 1.67x higher memory bandwidth. For compute-bound operations (prefill), the advantage is 3.17x (FP16) or even higher with FP8.
Llama 3 8B: Single GPU Performance
Llama 3 8B Inference Performance (Single GPU, 2K input / 512 output tokens)
| Metric | A100 FP16 | A100 INT8 | H100 FP16 | H100 FP8 |
|---|---|---|---|---|
| TTFT (B=1) | ~45 ms | ~35 ms | ~18 ms | ~12 ms |
| ITL (B=1) | ~7.1 ms | ~5.8 ms | ~4.2 ms | ~3.5 ms |
| Decode tok/s (B=1) | ~140 | ~172 | ~238 | ~286 |
| Decode tok/s (B=32) | ~3,600 | ~4,200 | ~6,100 | ~7,800 |
| Decode tok/s (B=64) | ~5,800 | ~7,000 | ~9,800 | ~13,500 |
| Max batch (4K ctx) | ~120 | ~200 | ~120 | ~240 |
Key observations for the 8B model:
- At batch=1, throughput is determined almost entirely by memory bandwidth. The H100’s 1.67x bandwidth advantage translates to a ~1.7x speedup in decode (140 → 238 tok/s).
- INT8 quantization on A100 provides a ~1.2x throughput improvement at small batch sizes (from reduced memory reads) and a further gain at high load (from fitting more requests).
- FP8 on H100 provides the best overall performance: roughly 2x faster compute for prefill and a reduced memory footprint for larger batches.
- The maximum batch size approximately doubles with quantization (halving the weight memory frees room for more KV cache).
Llama 3 70B: Multi-GPU Performance
Llama 3 70B Inference Performance (4K input / 1K output tokens)
| Config | TTFT (B=1) | ITL (B=1) | Throughput B=1 | Throughput B=32 | Max batch |
|---|---|---|---|---|---|
| 2x A100 FP16 (TP=2) | ~320 ms | ~15 ms | ~67 tok/s | ~1,400 tok/s | ~10 |
| 4x A100 INT8 (TP=4) | ~95 ms | ~6.5 ms | ~154 tok/s | ~3,800 tok/s | ~100 |
| 2x H100 FP16 (TP=2) | ~130 ms | ~9 ms | ~111 tok/s | ~2,400 tok/s | ~10 |
| 4x H100 FP8 (TP=4) | ~35 ms | ~3.5 ms | ~286 tok/s | ~7,200 tok/s | ~120 |
| 8x H100 FP8 (TP=8) | ~20 ms | ~2.2 ms | ~455 tok/s | ~12,000 tok/s | ~280 |
Key observations for the 70B model:
- Tensor parallelism (TP) is essential. On 2x A100 with FP16 weights, max batch is only ~10 — barely viable for production. The memory is almost entirely consumed by weights.
- 4x H100 with FP8 is the sweet spot for many production deployments: good single-request latency (~3.5 ms/tok ITL), strong throughput at batch 32 (7,200 tok/s), and enough memory for 120 concurrent requests.
- 8x H100 provides diminishing returns per GPU but offers the lowest absolute latency (2.2 ms/tok ITL) and highest throughput for latency-sensitive applications.
- TTFT varies dramatically with tensor parallelism: more GPUs = faster prefill (compute splits across GPUs, though NVLink communication adds overhead).
Llama 3 405B: The Scale Challenge
Llama 3 405B Inference Performance (2K input / 512 output tokens)
| Config | TTFT (B=1) | ITL (B=1) | Throughput B=1 | Throughput B=16 | Min GPUs |
|---|---|---|---|---|---|
| 16x A100 INT8 (TP=8, PP=2) | ~380 ms | ~22 ms | ~45 tok/s | ~600 tok/s | 16 |
| 8x H100 FP8 (TP=8) | ~120 ms | ~11 ms | ~91 tok/s | ~1,200 tok/s | 8 |
| 16x H100 FP8 (TP=8, PP=2) | ~75 ms | ~7 ms | ~143 tok/s | ~2,000 tok/s | 16 |
At 405B parameters, the operational reality is stark:
- Minimum 8 H100s (with FP8) just to fit the model in memory with room for a modest batch. With FP16, you need 16+ GPUs.
- Pipeline parallelism (PP) becomes necessary alongside tensor parallelism when model weights exceed the combined memory of a single NVLink domain (typically 8 GPUs). PP adds pipeline bubbles that commonly cost 10-30% of efficiency.
- Cost per token is extremely high. At ~2,000 tok/s on 16x H100 with batch=16, the GPU-time per token is roughly 14x that of the 70B model on 4x H100 at batch=32 (~7,200 tok/s).
- For most production use cases, the 70B model with good quantization provides a much better cost-performance tradeoff than the 405B model.
For Llama 3 8B: a single H100 (or A100 with INT8) handles most workloads comfortably. For Llama 3 70B: 4x H100 with FP8 is the production sweet spot. For Llama 3 405B: evaluate whether the quality improvement over 70B justifies a roughly 10-15x higher serving cost — for many applications, it does not.
Decode Throughput Scaling Curves
To illustrate how throughput scales with batch size on real hardware, here are measured curves for Llama 3 8B:
Llama 3 8B Decode Throughput vs Batch Size (A100 FP16) (chart; y-axis in tok/s)

Llama 3 8B Decode Throughput vs Batch Size (H100 FP16) (chart; y-axis in tok/s)

The scaling pattern is consistent: near-linear throughput growth with batch size in the memory-bound regime, followed by a plateau as compute becomes the bottleneck. The H100’s higher memory bandwidth pushes the transition point to a higher batch size, and its higher compute ceiling raises the plateau.
Putting It All Together: The Mental Model
Let us synthesize everything into a coherent mental model for LLM inference performance.
The Three Resources
LLM inference performance is determined by three scarce resources:
- Compute (FLOPS): Tensor core throughput. Limits prefill speed and decode throughput at high batch sizes.
- Memory bandwidth (GB/s): HBM read throughput. Limits single-request decode speed and decode throughput at low-to-moderate batch sizes.
- Memory capacity (GB): Total HBM. Limits model size, maximum batch size (via KV cache), and maximum context length.
At any given moment, one of these is the binding constraint. The art of inference optimization is shifting the bottleneck to a different resource (typically from memory bandwidth to compute, via batching) or expanding the binding resource (more GPUs, faster memory, quantization to reduce capacity pressure).
Decision Framework
When evaluating an inference deployment, ask these questions in order:
1. Can the model fit in memory? Calculate: weights + KV cache (at target batch size and context length) + overhead. If this exceeds available GPU memory, you need more GPUs, quantization, or a smaller model.
2. What is the decode bottleneck? Calculate arithmetic intensity at your target batch size. If it is below the ridge point, you are memory-bandwidth-bound: faster memory or more batching helps; more compute does not. If it is above the ridge point, you are compute-bound: faster compute helps; more memory bandwidth does not.
3. What limits batch size? Usually KV cache memory. Calculate the maximum batch size as shown above. If that maximum is below the ridge point, you will never fully utilize compute — consider KV cache quantization, GQA, or more memory to increase it.
4. What are the latency requirements? TTFT is bounded by prefill time (proportional to prompt length, inversely proportional to compute throughput). ITL is bounded by decode step time (inversely proportional to memory bandwidth at low batch, or compute at high batch). If latency requirements are strict, you may need to sacrifice throughput (smaller batch) or add more GPUs (split work).
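Step 2 of this framework reduces to one comparison. A minimal sketch, assuming the earlier result that decode arithmetic intensity is approximately equal to batch size (in FLOP/byte):

```python
# Classify the decode bottleneck at a given batch size (step 2 of the framework).
def decode_bottleneck(batch_size: int, peak_flops: float, peak_bw: float) -> str:
    """Compare arithmetic intensity (~= batch size) against the hardware ridge point."""
    ridge = peak_flops / peak_bw   # FLOP/byte where the roofline bends
    if batch_size < ridge:
        return f"memory-bandwidth-bound (AI={batch_size} < ridge {ridge:.0f})"
    return f"compute-bound (AI={batch_size} >= ridge {ridge:.0f})"

# A100 SXM: ~312 TFLOPS FP16, ~2.0 TB/s HBM
print(decode_bottleneck(1, 312e12, 2.0e12))
print(decode_bottleneck(256, 312e12, 2.0e12))
```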
The Optimization Stack
Each optimization targets a specific bottleneck:
Inference Optimization Hierarchy
| Optimization | Target Bottleneck | Typical Impact | When to Apply |
|---|---|---|---|
| Batching (B=1 to B=32+) | BW utilization | 10-30x throughput | Always -- this is non-negotiable |
| Continuous batching | GPU utilization | 2-3x throughput | Any multi-request serving |
| Weight quantization (FP16 to INT4/8) | Memory capacity | 2-4x batch capacity | When memory-constrained |
| KV cache quantization | Memory capacity | 1.5-2x batch capacity | Long context or large batch |
| PagedAttention | Memory fragmentation | 2-4x effective capacity | Variable-length serving |
| FlashAttention | Prefill compute + memory | 2-4x prefill speed | Always for long prompts |
| Tensor parallelism | Single-GPU limits | ~Nx (N GPUs) w/ overhead | Model exceeds 1 GPU |
| Speculative decoding | Decode latency | 1.5-3x per-request speed | Latency-sensitive apps |
| Prefix caching | Redundant prefill | 2-10x TTFT for repeated prefixes | RAG, system prompts |
| FP8 (H100/B200) | Compute + capacity | 1.5-2x over FP16 | When hardware supports it |
The Cost Equation
Ultimately, inference is about cost per token. The cost equation is:

cost per token = (GPU cost per hour) / (3600 × throughput in tokens per second)
Every optimization that increases throughput (without proportionally increasing GPU cost) reduces cost per token. This is why batching is so powerful: it increases throughput by 10-30x with zero additional hardware cost.
For a concrete example: Llama 3 70B on 4x H100 (cost: ~$16/hour in cloud pricing):
- Batch=1: ~286 tok/s, cost = $16 / 3600 / 286 ≈ $0.0000155 per token (~$15.5 per million tokens)
- Batch=32: ~7200 tok/s, cost = $16 / 3600 / 7200 ≈ $0.00000062 per token (~$0.62 per million tokens)
Batching reduced cost by ~25x. This is why API providers can offer LLM inference at prices that seem impossibly low — they are batching aggressively across thousands of concurrent requests.
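The arithmetic in the example above is simple enough to check in a few lines:

```python
# Cost per million tokens, from the cost equation:
# cost per token = (GPU $/hour) / (3600 * tokens per second)
def cost_per_million_tokens(gpu_dollars_per_hour: float, tokens_per_sec: float) -> float:
    return gpu_dollars_per_hour / 3600 / tokens_per_sec * 1e6

print(cost_per_million_tokens(16, 286))    # batch=1:  ~$15.5 per million tokens
print(cost_per_million_tokens(16, 7200))   # batch=32: ~$0.62 per million tokens
```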
Conclusion
LLM inference is defined by a fundamental asymmetry: prefill is compute-bound (large matrix multiplications that saturate tensor cores), while decode is memory-bandwidth-bound (full model weight reads for each single generated token). This asymmetry, rooted in the autoregressive nature of language generation, drives every design decision in the inference stack.
The key quantitative relationships to internalize:
- Arithmetic intensity during decode ≈ batch size (in FLOP/byte). At batch=1, compute utilization is below 1%. The ridge point is ~156 FLOP/byte for the A100 and ~295 FLOP/byte for the H100.
- KV cache scales linearly with batch size, sequence length, and layer count (2 × B × S × n_layers × n_kv_heads × d_head × bytes per element). It dominates memory at production batch sizes.
- Maximum batch size is memory-limited, and quantization’s primary serving benefit is freeing memory for larger batches.
- Throughput scales linearly with batch size until compute saturation, then plateaus.
The serving innovations — continuous batching, PagedAttention, chunked prefill, KV cache quantization — all serve a single goal: maximize the number of concurrent requests the GPU can process, thereby amortizing the fixed cost of reading model weights across more useful output tokens.
If you take away one thing from this post: LLM decode performance is not about making the GPU compute faster. It is about giving the GPU more useful work to do with each byte it reads from memory. Every optimization in the inference stack — from batching to quantization to paged memory management — is ultimately in service of this principle.