Single-token decode for Llama 70B should be memory-bound: streaming ~140 GB of fp16 weights at an aggregate ~20 TB/s of HBM bandwidth (tensor-parallel across multiple A100s; a single A100 delivers roughly 2 TB/s) takes about 7 ms. Instead it takes 12 ms, because roughly 5 ms goes to Python overhead, dynamic shape checks, and launching 40+ tiny kernels per step. CUDA Graphs fix this by capturing the entire kernel launch sequence once and replaying it with near-zero CPU overhead, cutting that 5 ms to under 0.5 ms. For batch=1 decode, that is the difference between roughly 85 and 135 tokens/sec with identical GPU compute.
## What CUDA Graphs actually capture
A graph records a DAG of kernel launches, memcpy operations, and dependencies between them. On replay, the CPU issues ONE graph launch instead of many individual kernel dispatches.
## Where CUDA Graphs help
| Workload | Graph-friendly? | Why / why not? |
|---|---|---|
| Fixed-shape decode step | Yes | Same kernels, same shapes each token |
| Variable batch / seq len per token | Partial | Need padding or bucketing |
| Data-dependent host control flow | No | The launch sequence must be static |
## Baseline: many launches per token
With 30-60 small kernels per decode step, launch overhead alone can be a few milliseconds:
```python
def decode_step_naive(model, tokens, kv_cache):
    # Every call below dispatches one or more small CUDA kernels from Python.
    hidden = model.embed(tokens)
    for layer in model.layers:
        hidden, kv_cache[layer] = layer(hidden, kv_cache[layer])
    logits = model.lm_head(hidden[:, -1, :])  # logits for the last position only
    return logits
```
## Capturing a graph for a fixed batch/sequence
```python
import torch

def capture_graph(model, batch_size, max_seq, device="cuda"):
    # Static input buffer: the graph records its address, so it must stay
    # alive and in place for as long as the graph is replayed.
    static_tokens = torch.zeros(batch_size, max_seq, dtype=torch.long, device=device)

    # Warm-up on a side stream so lazy initialization (cuBLAS handles,
    # allocator pools) happens outside capture.
    stream = torch.cuda.Stream()
    stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(stream):
        static_logits = model(static_tokens)
    torch.cuda.current_stream().wait_stream(stream)
    torch.cuda.synchronize()

    # Capture: records the kernel DAG into g without executing it eagerly.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_logits = model(static_tokens)
    return g, static_tokens, static_logits
```
```python
def run_step_with_graph(g, static_tokens, static_logits, tokens):
    # Copy the new tokens into the captured input buffer in place,
    # then replay: one CPU launch runs the whole recorded DAG.
    static_tokens[: tokens.shape[0], : tokens.shape[1]].copy_(tokens)
    g.replay()
    return static_logits[: tokens.shape[0]]
```
## Performance impact

### Per-token latency: before vs after graphs
| Setup | Batch | ms/token | Speedup |
|---|---|---|---|
| Eager, Python | 1 | 15.2 | 1.0x |
| CUDA Graph | 1 | 9.1 | 1.67x |
| Eager, Python | 8 | 7.8 | 1.0x |
| CUDA Graph | 8 | 5.0 | 1.56x |
### Launch overhead fraction vs batch size

(Figure: launch + Python overhead as a fraction of step time at batch sizes 1, 2, 4, 8, and 16; the fraction shrinks as batch grows, because the fixed launch cost is amortized over more per-step compute.)
## When graphs do not help (or hurt)
- Shapes that change every step: you must pad to fixed shapes or maintain one graph per bucket
- Dynamic host-side control flow (the kernel sequence itself changes step to step): not capturable
- Frequent model changes (e.g., LoRA adapter swapping): recapture costs add up
Graphs assume static buffer layout. If your KV allocator moves buffers around or changes shapes per step, you must stabilize it (e.g., with a pool or paged allocator) before graphs pay off.
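A minimal form of that stabilization is to preallocate the entire KV region once and hand out fixed-address views; a sketch with illustrative shapes (`StaticKVPool` is a hypothetical helper, not a library API):

```python
import torch

class StaticKVPool:
    """Preallocates KV buffers once so their addresses survive graph capture."""

    def __init__(self, n_layers, batch, max_seq, n_heads, head_dim,
                 dtype=torch.float16, device="cuda"):
        # One contiguous region; 2 = K and V planes per layer.
        shape = (n_layers, 2, batch, n_heads, max_seq, head_dim)
        self.buf = torch.zeros(shape, dtype=dtype, device=device)

    def layer_kv(self, layer_idx):
        # Views share storage with self.buf: the same addresses every step,
        # which is exactly what a captured graph requires.
        return self.buf[layer_idx, 0], self.buf[layer_idx, 1]
```

Attention kernels then write into these views at the current position index instead of concatenating new tensors, so no allocation happens inside the captured region.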
## Practical guidelines
- Start with fixed batch/sequence for the hot path (e.g., batch=1 streaming)
- Capture separate graphs for a few common shapes (buckets)
- Use graphs for the decode loop, not just prefill
- Couple graphs with KV cache pooling so addresses stay stable
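The guidelines above combine naturally into a small cache that lazily captures one graph per (batch, padded-length) bucket; a hypothetical sketch:

```python
import bisect
import torch

class GraphCache:
    """Lazily captures and caches one CUDA graph per (batch, seq bucket)."""

    def __init__(self, model, seq_buckets=(128, 256, 512, 1024)):
        self.model = model
        self.seq_buckets = sorted(seq_buckets)
        self.graphs = {}  # (batch, bucket) -> (graph, static_in, static_out)

    def _bucket(self, seq_len):
        # Smallest bucket that fits seq_len (clamped to the largest bucket).
        i = bisect.bisect_left(self.seq_buckets, seq_len)
        return self.seq_buckets[min(i, len(self.seq_buckets) - 1)]

    def run(self, tokens):
        bsz, seq = tokens.shape
        key = (bsz, self._bucket(seq))
        if key not in self.graphs:
            static_in = torch.zeros(key, dtype=torch.long, device=tokens.device)
            self.model(static_in)              # warm-up before capture
            torch.cuda.synchronize()
            g = torch.cuda.CUDAGraph()
            with torch.cuda.graph(g):
                static_out = self.model(static_in)
            self.graphs[key] = (g, static_in, static_out)
        g, static_in, static_out = self.graphs[key]
        static_in.zero_()
        static_in[:, :seq].copy_(tokens)       # pad up to the bucket length
        g.replay()
        return static_out
```

Capture cost is paid once per bucket; after that, every request landing in a known bucket pays only the copy and the single replay.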
## Conclusion
CUDA Graphs do not make the model itself faster; they remove the tax you pay around it. In low-latency LLM serving, that tax is often the difference between instant and laggy. Use graphs where control flow and shapes are stable, and let eager execution handle the irregular tail.