Single-token decode for Llama 7B in FP16 should take about 7ms on an A100: that is the time to read its ~14 GB of weights from HBM at ~2 TB/s. Instead, it takes 12ms because you are spending roughly 5ms on Python overhead, dynamic shape checks, and launching 40+ tiny kernels per step. CUDA Graphs fix this by capturing the entire kernel launch sequence once and replaying it with near-zero CPU overhead, cutting that 5ms down to under 0.5ms. For batch=1 decode, graphs are the difference between roughly 83 and 133 tokens/sec with identical GPU compute.

What CUDA Graphs actually capture

A graph records a DAG of kernel launches, memcpy operations, and dependencies between them. On replay, the CPU issues ONE graph launch instead of many individual kernel dispatches.
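As a minimal sketch of the mechanism (hypothetical buffer sizes; the CUDA-specific part only runs when a GPU is present), a few elementwise ops can be captured once and then re-executed as a single graph launch:

```python
import torch

def captured_pipeline(n: int = 1024):
    """Capture a small op sequence into one CUDA graph and replay it.

    Returns the output tensor on CUDA, or None when no GPU is available.
    """
    if not torch.cuda.is_available():
        return None

    # Static input buffer: the graph replays against this exact address.
    x = torch.ones(n, device="cuda")

    # Warm-up on a side stream so lazy initialization happens before capture.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        y = (x * 2 + 1).relu()
    torch.cuda.current_stream().wait_stream(s)

    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        y = (x * 2 + 1).relu()  # three kernels recorded into the DAG

    x.fill_(3.0)   # update the static input in place...
    g.replay()     # ...then ONE launch re-runs every recorded kernel
    return y       # y now holds relu(3 * 2 + 1) == 7
```

The key property: after capture, the CPU cost per step is one `g.replay()` call, regardless of how many kernels the DAG contains.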

Where CUDA Graphs help

Workload                          Graph-friendly?   Why / why not?
Fixed-shape decode step           Yes               Same kernels, same shapes each token
Variable batch / seq len          Partial           Needs padding or bucketing
Control-flow heavy kernels        No                Graph needs static structure

Baseline: many launches per token

With 30-60 small kernels per decode step, launch overhead alone can be a few milliseconds:

def decode_step_naive(model, tokens, kv_cache):
    hidden = model.embed(tokens)
    for layer in model.layers:
        hidden, kv_cache[layer] = layer(hidden, kv_cache[layer])
    logits = model.lm_head(hidden[:, -1, :])
    return logits
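A back-of-envelope estimate shows why this adds up. The per-launch cost below is an assumed figure (Python dispatch plus `cudaLaunchKernel`), not a measurement:

```python
# Hypothetical numbers: ~40 kernels per decode step, and an assumed
# ~50 us of CPU-side work per launch (Python dispatch, shape checks,
# cudaLaunchKernel). Actual costs vary by framework and CPU.
kernels_per_step = 40
cpu_cost_per_launch_us = 50

overhead_ms = kernels_per_step * cpu_cost_per_launch_us / 1000
print(overhead_ms)  # 2.0 ms of pure CPU overhead per token
```

At 7ms of actual GPU work per token, 2ms of launch overhead is nearly a 30% tax, which matches the launch-overhead fractions in the chart below.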

Capturing a graph for a fixed batch/sequence

import torch

@torch.inference_mode()
def capture_graph(model, batch_size, max_seq, device="cuda"):
    # Static buffers: the graph replays against these exact addresses.
    static_tokens = torch.zeros(batch_size, max_seq, dtype=torch.long, device=device)

    # Warm-up on a side stream (required before capture) so lazy init,
    # autotuning, and allocator behavior have settled.
    stream = torch.cuda.Stream()
    stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(stream):
        for _ in range(3):
            static_logits = model(static_tokens)
    torch.cuda.current_stream().wait_stream(stream)
    torch.cuda.synchronize()

    # Capture: records the launch DAG without actually running it.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_logits = model(static_tokens)

    return g, static_tokens, static_logits

@torch.inference_mode()
def run_step_with_graph(g, static_tokens, static_logits, tokens):
    # Copy the live tokens into the static input buffer; shapes must fit
    # the captured batch_size/max_seq, with smaller inputs padded.
    static_tokens[: tokens.shape[0], : tokens.shape[1]].copy_(tokens)
    g.replay()  # one launch replays the entire captured DAG
    return static_logits[: tokens.shape[0]]

Performance impact

Per-token latency: before vs after graphs

Setup           Batch   ms/token   Speedup
Eager, Python   1       15.2       1.0x
CUDA Graph      1       9.1        1.67x
Eager, Python   8       7.8        1.0x
CUDA Graph      8       5.0        1.56x

Launch overhead fraction vs batch size

Batch size                              1      2      4      8      16
Launch + Python fraction of step time   0.45   0.32   0.22   0.16   0.12

When graphs do not help (or hurt)

  • Changing shapes every step: you must pad to fixed shapes or maintain one graph per bucket
  • Dynamic control flow inside the captured region: not capturable
  • Frequent model changes (e.g., LoRA swapping): recapture costs add up

⚠️ Graphs and KV cache

Graphs assume static buffer layout. If your KV allocator moves buffers around or changes shapes per step, you must stabilize it (e.g., with a pool or paged allocator) before graphs pay off.
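One way to stabilize addresses, sketched here as a hypothetical `StaticKVPool` (not any particular library's API), is to preallocate max-shape K/V buffers once and only ever write into them in place:

```python
import torch

class StaticKVPool:
    """Preallocates K/V buffers at max shape so their addresses never change.

    A hypothetical sketch: real serving stacks use pool or paged allocators,
    but the invariant graphs need is the same -- stable pointers per layer.
    """

    def __init__(self, num_layers, max_batch, max_seq, num_heads, head_dim,
                 dtype=torch.float16, device="cuda"):
        shape = (max_batch, num_heads, max_seq, head_dim)
        self.k = [torch.zeros(shape, dtype=dtype, device=device)
                  for _ in range(num_layers)]
        self.v = [torch.zeros(shape, dtype=dtype, device=device)
                  for _ in range(num_layers)]

    def layer(self, i):
        # Always hand back the same tensors; callers write new K/V at the
        # current position with slice assignment, never by reallocating.
        return self.k[i], self.v[i]
```

Capture then happens with these buffers already wired into the model's attention, so replay reads and writes the same memory every step.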

Practical guidelines

  • Start with fixed batch/sequence for the hot path (e.g., batch=1 streaming)
  • Capture separate graphs for a few common shapes (buckets)
  • Use graphs for the decode loop, not just prefill
  • Couple graphs with KV cache pooling so addresses stay stable
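The bucketing from the guidelines above is plain bookkeeping. A minimal sketch (with hypothetical helper names) that picks the smallest captured bucket able to hold the live batch:

```python
def pick_bucket(batch_size, buckets):
    """Return the smallest captured bucket that can hold batch_size.

    `buckets` are the batch sizes graphs were captured for, e.g. [1, 2, 4, 8].
    Returns None when nothing fits, meaning: run this step eagerly
    (the irregular tail).
    """
    for b in sorted(buckets):
        if b >= batch_size:
            return b
    return None

# Dispatch then looks roughly like (assuming a capture_graph as above):
#   graphs = {b: capture_graph(model, b, max_seq) for b in [1, 2, 4, 8]}
#   b = pick_bucket(live_batch, graphs.keys())
#   ... pad inputs up to b and replay graphs[b], or fall back to eager if None
```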

Conclusion

CUDA Graphs do not make the model itself faster; they remove the tax you pay around it. In low-latency LLM serving, that tax is often the difference between instant and laggy. Use graphs where control flow and shapes are stable, and let eager execution handle the irregular tail.