Single-token decode for Llama 7B in FP16 should take about 7ms on an A100: that is the time to read its ~14 GB of weights from HBM at ~2 TB/s. Instead, it takes 12ms because you are spending roughly 5ms on Python overhead, dynamic shape checks, and launching 40+ tiny kernels per step. CUDA Graphs fix this by capturing the entire kernel launch sequence once and replaying it with near-zero CPU overhead, cutting that 5ms down to under 0.5ms. For batch=1 decode, graphs are the difference between roughly 83 and 133 tokens/sec with identical GPU compute.

What CUDA Graphs actually capture

A graph records a DAG of kernel launches, memcpy operations, and dependencies between them. On replay, the CPU issues ONE graph launch instead of many individual kernel dispatches.
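As a minimal sketch of the mechanism (hypothetical buffer sizes; the CUDA-specific part only runs when a GPU is present), a few elementwise ops can be captured once and then re-executed as a single graph launch:

```python
import torch

def captured_pipeline(n: int = 1024):
    """Capture a small op sequence into one CUDA graph and replay it.

    Returns the output tensor on CUDA, or None when no GPU is available.
    """
    if not torch.cuda.is_available():
        return None

    # Static input buffer: the graph replays against this exact address.
    x = torch.ones(n, device="cuda")

    # Warm-up on a side stream so lazy initialization happens before capture.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        y = (x * 2 + 1).relu()
    torch.cuda.current_stream().wait_stream(s)

    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        y = (x * 2 + 1).relu()  # three kernels recorded into the DAG

    x.fill_(3.0)   # update the static input in place...
    g.replay()     # ...then ONE launch re-runs every recorded kernel
    return y       # y now holds relu(3 * 2 + 1) == 7
```

The key property: after capture, the CPU cost per step is one `g.replay()` call, regardless of how many kernels the DAG contains.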

Where CUDA Graphs help

Workload                          Graph-friendly?   Why / why not?
Fixed-shape decode step           Yes               Same kernels, same shapes each token
Variable batch / seq len          Partial           Needs padding or bucketing
Control-flow heavy kernels        No                Graph needs static structure

Baseline: many launches per token

With 30-60 small kernels per decode step, launch overhead alone can be a few milliseconds:

def decode_step_naive(model, tokens, kv_cache):
    hidden = model.embed(tokens)
    for layer in model.layers:
        hidden, kv_cache[layer] = layer(hidden, kv_cache[layer])
    logits = model.lm_head(hidden[:, -1, :])
    return logits
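A back-of-envelope estimate shows why this adds up. The per-launch cost below is an assumed figure (Python dispatch plus `cudaLaunchKernel`), not a measurement:

```python
# Hypothetical numbers: ~40 kernels per decode step, and an assumed
# ~50 us of CPU-side work per launch (Python dispatch, shape checks,
# cudaLaunchKernel). Actual costs vary by framework and CPU.
kernels_per_step = 40
cpu_cost_per_launch_us = 50

overhead_ms = kernels_per_step * cpu_cost_per_launch_us / 1000
print(overhead_ms)  # 2.0 ms of pure CPU overhead per token
```

At 7ms of actual GPU work per token, 2ms of launch overhead is nearly a 30% tax, which matches the launch-overhead fractions in the chart below.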

Capturing a graph for a fixed batch/sequence

import torch

@torch.inference_mode()
def capture_graph(model, batch_size, max_seq, device="cuda"):
    # Static buffers: the graph replays against these exact addresses.
    static_tokens = torch.zeros(batch_size, max_seq, dtype=torch.long, device=device)

    # Warm-up on a side stream (required before capture) so lazy init,
    # autotuning, and allocator behavior have settled.
    stream = torch.cuda.Stream()
    stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(stream):
        for _ in range(3):
            static_logits = model(static_tokens)
    torch.cuda.current_stream().wait_stream(stream)
    torch.cuda.synchronize()

    # Capture: records the launch DAG without actually running it.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_logits = model(static_tokens)

    return g, static_tokens, static_logits

@torch.inference_mode()
def run_step_with_graph(g, static_tokens, static_logits, tokens):
    # Copy the live tokens into the static input buffer; shapes must fit
    # the captured batch_size/max_seq, with smaller inputs padded.
    static_tokens[: tokens.shape[0], : tokens.shape[1]].copy_(tokens)
    g.replay()  # one launch replays the entire captured DAG
    return static_logits[: tokens.shape[0]]

Performance impact

Per-token latency: before vs after graphs

Setup           Batch   ms/token   Speedup
Eager, Python   1       15.2       1.0x
CUDA Graph      1       9.1        1.67x
Eager, Python   8       7.8        1.0x
CUDA Graph      8       5.0        1.56x

Launch overhead fraction vs batch size

Batch size                              1      2      4      8      16
Launch + Python fraction of step time   0.45   0.32   0.22   0.16   0.12

When graphs do not help (or hurt)

  • Changing shapes every step: you must pad to fixed shapes or maintain one graph per bucket
  • Dynamic control flow inside the captured region: not capturable
  • Frequent model changes (e.g., LoRA swapping): recapture costs add up

⚠️ Graphs and KV cache

Graphs assume static buffer layout. If your KV allocator moves buffers around or changes shapes per step, you must stabilize it (e.g., with a pool or paged allocator) before graphs pay off.
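One way to stabilize addresses, sketched here as a hypothetical `StaticKVPool` (not any particular library's API), is to preallocate max-shape K/V buffers once and only ever write into them in place:

```python
import torch

class StaticKVPool:
    """Preallocates K/V buffers at max shape so their addresses never change.

    A hypothetical sketch: real serving stacks use pool or paged allocators,
    but the invariant graphs need is the same -- stable pointers per layer.
    """

    def __init__(self, num_layers, max_batch, max_seq, num_heads, head_dim,
                 dtype=torch.float16, device="cuda"):
        shape = (max_batch, num_heads, max_seq, head_dim)
        self.k = [torch.zeros(shape, dtype=dtype, device=device)
                  for _ in range(num_layers)]
        self.v = [torch.zeros(shape, dtype=dtype, device=device)
                  for _ in range(num_layers)]

    def layer(self, i):
        # Always hand back the same tensors; callers write new K/V at the
        # current position with slice assignment, never by reallocating.
        return self.k[i], self.v[i]
```

Capture then happens with these buffers already wired into the model's attention, so replay reads and writes the same memory every step.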

Practical guidelines

  • Start with fixed batch/sequence for the hot path (e.g., batch=1 streaming)
  • Capture separate graphs for a few common shapes (buckets)
  • Use graphs for the decode loop, not just prefill
  • Couple graphs with KV cache pooling so addresses stay stable
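The bucketing from the guidelines above is plain bookkeeping. A minimal sketch (with hypothetical helper names) that picks the smallest captured bucket able to hold the live batch:

```python
def pick_bucket(batch_size, buckets):
    """Return the smallest captured bucket that can hold batch_size.

    `buckets` are the batch sizes graphs were captured for, e.g. [1, 2, 4, 8].
    Returns None when nothing fits, meaning: run this step eagerly
    (the irregular tail).
    """
    for b in sorted(buckets):
        if b >= batch_size:
            return b
    return None

# Dispatch then looks roughly like (assuming a capture_graph as above):
#   graphs = {b: capture_graph(model, b, max_seq) for b in [1, 2, 4, 8]}
#   b = pick_bucket(live_batch, graphs.keys())
#   ... pad inputs up to b and replay graphs[b], or fall back to eager if None
```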

Conclusion

CUDA Graphs do not make the model itself faster; they remove the tax you pay around it. In low-latency LLM serving, that tax is often the difference between instant and laggy. Use graphs where control flow and shapes are stable, and let eager execution handle the irregular tail.