Part of Series: Inference Optimization Timeline (8 of 23)

Every request to a production LLM starts with the same work: computing the KV cache for the system prompt. A chatbot with a 2,000-token system prompt recomputes those same key-value pairs for every single user message. A RAG pipeline with a shared retrieval preamble does the same. A coding assistant with a standardized instruction block does the same. The prefill computation is identical across thousands of requests per second, and you pay for it every time.

Prefix caching eliminates this redundancy. Instead of recomputing KV cache entries for tokens you have already processed, you store them and reuse them when the same prefix appears again. The idea sounds simple, but the implementation involves non-trivial data structures, multi-tier memory hierarchies, eviction policies, and routing decisions that interact with every other part of the serving system.

This post covers the full landscape: the quantitative case for prefix caching, vLLM’s hash-based automatic prefix caching, SGLang’s RadixAttention with its radix tree for hierarchical prefix matching, multi-tier cache hierarchies spanning GPU HBM to CPU DRAM to SSD, multi-turn conversation caching, cache-aware request routing, the failure modes where prefix caching hurts, and production benchmarks.


1. The Problem: Redundant Prefill Computation

The Economics of System Prompts

Modern LLM deployments almost universally use system prompts. A typical ChatGPT-style system prompt is 500-2,000 tokens. Enterprise deployments with detailed instructions, tool definitions, and persona specifications often reach 4,000-8,000 tokens. Every request must process these tokens during prefill before it can begin processing the user’s actual query.

Let us put numbers on this. For Llama-2-70B on an A100-80GB:

  • Prefill throughput: roughly 2,000-4,000 tokens/second (depending on batch size and sequence length)
  • System prompt length: 2,000 tokens
  • Prefill time for system prompt: roughly 5 ms of GPU time, the working figure used in the calculations below

At first glance, 5 ms sounds trivial. But at scale:

\text{Wasted compute per second} = \text{QPS} \times \text{prefill time per prompt} = 1000 \times 5\,\text{ms} = 5\ \text{GPU-seconds/s}

You are spending 5 GPU-seconds every second just recomputing the same system prompt KV cache. That is 5 entire GPUs worth of prefill capacity, burned on redundant work. At 1,000 QPS with an 8-GPU node, that is 62.5% of your prefill capacity wasted on tokens you have already processed.
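The arithmetic is simple enough to script. The sketch below just re-derives the figures in this section from the ~5 ms working figure for a 2,000-token system prompt:

```python
# Redundant prefill burned per wall-clock second, assuming every request
# recomputes a system prompt that costs ~5 ms of GPU prefill time.
PREFILL_MS_PER_PROMPT = 5.0

def wasted_gpu_seconds_per_second(qps: float) -> float:
    return qps * PREFILL_MS_PER_PROMPT / 1000.0

wasted_gpu_seconds_per_second(1000)   # 5.0 GPU-seconds/s
```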

GPU-Seconds Wasted on Redundant System Prompt Prefill

| QPS | Wasted Prefill (GPU-seconds/s) | Severity |
|---|---|---|
| 100 | 0.5 | Manageable |
| 500 | 2.5 | Noticeable |
| 1,000 | 5 | Significant |
| 5,000 | 25 | Dominating |
| 10,000 | 50 | Catastrophic |

Beyond System Prompts

Redundant prefixes appear in many places beyond system prompts:

  • Multi-turn conversations: each turn shares the entire conversation history as a prefix. Turn 5 of a conversation has turns 1-4 as a prefix that was already computed during turn 4.
  • RAG pipelines: many queries retrieve the same popular documents. The prefill for those documents is identical.
  • Few-shot prompting: the few-shot examples are the same across all requests. Only the query at the end changes.
  • Batch processing: when processing a batch of items with the same instructions, every item shares the instruction prefix.
  • A/B testing: different model versions or prompt variants still share large common prefixes.

The fundamental observation is that LLM request prefixes have high entropy at the tail but low entropy at the head. The system prompt, instruction block, and shared context are nearly identical across requests. The user-specific content comes at the end. Prefix caching exploits this asymmetry.

The Prefix Sharing Opportunity

In production deployments, 60-90% of prefill tokens are shared across requests within a short time window. For enterprise chatbots with standardized system prompts, the sharing rate often exceeds 95%. Every shared token that is cached is a token you do not need to recompute.

What Prefix Caching Actually Saves

When you reuse a cached KV prefix, you skip the entire prefill computation for those tokens. The savings are:

  1. Compute: no QKV projections, no attention computation, no FFN forward pass for cached tokens
  2. Time-to-first-token (TTFT): the user sees the first generated token faster because prefill is shorter
  3. GPU utilization: freed prefill capacity can be used for new requests, increasing throughput

The cost you pay is memory: you must store the KV cache entries somewhere. For Llama-2-70B with FP16 KV cache, each token costs approximately 320 KB across all layers. A 2,000-token system prompt cache consumes ~640 MB of GPU memory per cached prefix. The tradeoff is memory for compute — a classic caching problem.
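The ~320 KB figure follows from Llama-2-70B's geometry (80 layers, 8 grouped-query KV heads, head dimension 128, FP16). A quick check:

```python
# KV bytes per token = 2 tensors (K and V) x layers x kv_heads x head_dim x dtype bytes.
# Llama-2-70B uses GQA: 80 layers, 8 KV heads, head_dim 128, FP16 = 2 bytes.
def kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token()      # 327,680 bytes, i.e. ~320 KB
prefix_gb = 2000 * per_token / 1e9    # ~0.66 GB, the "~640 MB" figure above
```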


2. Automatic Prefix Caching (vLLM)

vLLM introduced automatic prefix caching (APC) as a built-in feature. The design is elegant in its simplicity: hash the token content of each KV block, and if the hash matches a previously computed block, reuse it instead of recomputing.

How It Works

Recall that vLLM’s PagedAttention divides KV cache into fixed-size blocks, each holding a fixed number of tokens (typically 16). Automatic prefix caching extends this with a hash table:

  1. Block hashing: For each KV block, compute a hash based on the token IDs contained in that block plus the hash of the preceding block. This creates a chain of hashes that uniquely identifies the prefix up to that point.

  2. Hash lookup: When a new request arrives, hash its tokens block by block. For each block, check if the hash exists in the cache. If yes, reuse the physical KV cache block. If no, allocate a new block and compute the KV entries.

  3. Prefix matching: Because hashes are chained, a match at block k implies matches at blocks 0 through k-1. The matching stops at the first miss.

from typing import Dict, List

INITIAL_HASH = 0  # seed hash: the first block has no predecessor

def compute_block_hash(token_ids: List[int], prev_hash: int) -> int:
    """Hash a KV block based on its tokens and the previous block's hash.

    The chaining ensures that two blocks with the same tokens but different
    prefixes produce different hashes.
    """
    block_content = tuple(token_ids)
    return hash((prev_hash, block_content))

def find_cached_prefix(request_tokens: List[int], block_size: int,
                       hash_table: Dict[int, "PhysicalBlock"]) -> int:
    """Find the longest cached prefix for a request.

    Returns the number of tokens that can be reused from cache.
    """
    prev_hash = INITIAL_HASH
    cached_blocks = 0

    for i in range(0, len(request_tokens), block_size):
        block_tokens = request_tokens[i:i + block_size]
        if len(block_tokens) < block_size:
            break  # Partial block, cannot cache

        block_hash = compute_block_hash(block_tokens, prev_hash)

        if block_hash in hash_table:
            cached_blocks += 1
            prev_hash = block_hash
        else:
            break  # First miss, stop matching

    return cached_blocks * block_size

Hash Chain Design

The critical design decision is chaining the hashes. A naive approach would hash each block independently, but this is incorrect. Consider two sequences:

  • Sequence A: [system_prompt] [user_message_1]
  • Sequence B: [different_prompt] [user_message_1]

If user_message_1 happens to fall at the same block boundary and contains the same tokens, an independent hash would incorrectly match the KV blocks. But the KV values are different because the attention computation depends on all preceding tokens. The hash chain captures this dependency: the hash of block k depends on the hash of block k-1, which depends on block k-2, and so on back to block 0.

H_k = \text{hash}(H_{k-1}, \text{tokens}_k)

This gives us a property similar to a Merkle chain: a match at position k guarantees correctness of the entire prefix.
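A quick demonstration of the chaining property. `compute_block_hash` is restated so the snippet runs on its own, and the token IDs are toy stand-ins for the two sequences above:

```python
# Same user-message block after two different system prompts must hash
# differently, because the hash chain carries the prefix forward.
def compute_block_hash(token_ids, prev_hash):
    return hash((prev_hash, tuple(token_ids)))

INITIAL_HASH = 0
h_sys_a = compute_block_hash([1, 2, 3, 4], INITIAL_HASH)  # [system_prompt]
h_sys_b = compute_block_hash([9, 9, 9, 9], INITIAL_HASH)  # [different_prompt]

user_block = [5, 6, 7, 8]                                 # same user_message_1
h_after_a = compute_block_hash(user_block, h_sys_a)
h_after_b = compute_block_hash(user_block, h_sys_b)
assert h_after_a != h_after_b   # identical content, different chains
```

An independent (unchained) hash of `user_block` would be identical in both sequences, which is exactly the incorrect match described above.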

Block Boundary Alignment

Hash-based prefix caching only matches at block boundaries. If the block size is 16, and two requests share a 1,000-token prefix, the cache can reuse \lfloor 1000 / 16 \rfloor = 62 blocks = 992 tokens. The remaining 8 tokens must be recomputed.

This creates a subtle interaction with prompt design. If your system prompt is 2,000 tokens, you get perfect block alignment. If it is 2,003 tokens, the last 3-token partial block cannot be cached, and worse, it shifts all subsequent blocks out of alignment with other requests that have different-length prefixes after the system prompt.

💡 Practical Advice: Pad System Prompts to Block Boundaries

For maximum cache hit rates, ensure your system prompt length is a multiple of the KV block size (typically 16 tokens). If your prompt is 2,003 tokens, add 13 padding tokens (e.g., newlines) to reach 2,016. This costs ~0.6% extra compute on a miss but ensures every subsequent block aligns correctly.
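The padding rule above reduces to one line of modular arithmetic. The helper below is illustrative; neither vLLM nor SGLang exposes such a function:

```python
def pad_to_block_boundary(n_tokens: int, block_size: int = 16) -> int:
    """Tokens of padding needed so a prompt ends on a KV-block boundary."""
    return (-n_tokens) % block_size

pad_to_block_boundary(2003)   # 13 padding tokens, giving 2,016 total
```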

Limitations of Hash-Based Matching

Hash-based APC has two key limitations:

  1. All-or-nothing at each block: if even one token differs in a block, the entire block misses. There is no partial block reuse.

  2. No longest-prefix matching across different prefixes: the hash table is a flat structure. Walking a request's hash chain deterministically finds the blocks cached for its exact prefix and stops at the first miss, which is sufficient for lookup. But the table has no view of how cached prefixes relate to one another: it cannot enumerate which cached prefixes share a common ancestor, so operations that need the prefix structure (branch-aware eviction, cache introspection, scheduling requests with shared prefixes together) are awkward to implement.

These limitations motivated SGLang’s radix tree approach.


3. RadixAttention (SGLang): Hierarchical Prefix Matching

SGLang introduced RadixAttention, which replaces the flat hash table with a radix tree (also known as a compressed trie or Patricia tree). This data structure is fundamentally better suited to the prefix matching problem.

Why a Radix Tree?

A radix tree stores strings (in this case, token sequences) in a tree where:

  • Each edge represents a sequence of tokens (not just one token, which is the key difference from a plain trie)
  • The path from root to any node represents the full prefix up to that point
  • Shared prefixes are stored exactly once, with branching at the point of divergence

This gives three advantages over hash-based matching:

  1. Efficient longest-prefix lookup: given a new token sequence, traverse the tree from the root, following edges that match the input tokens. When you can no longer follow an edge, you have found the longest cached prefix. This is O(n) in the length of the input, with no need to check block boundaries.

  2. Partial prefix sharing: if two prefixes share the first 1,000 tokens but diverge after that, the tree stores the shared 1,000 tokens once and branches into two paths at the exact point of divergence. The hash-based approach shares those blocks only implicitly through the global table, and only at block granularity; the branch point itself is not represented anywhere.

  3. Multi-turn conversation support: a conversation with 5 turns naturally forms a path in the tree. Each new turn extends the path. If the user backtracks (e.g., regenerates turn 3), the tree branches at turn 3.

Worked Example: The Radix Tree in Action

Consider a serving system handling three types of requests:

  • Type A: System prompt (2,000 tokens) + User message 1 (100 tokens)
  • Type B: System prompt (2,000 tokens) + User message 2 (150 tokens)
  • Type C: Different system prompt (1,500 tokens) + User message 3 (200 tokens)

After processing these requests, the radix tree looks like:

Root
├── [System Prompt: 2000 tokens] ─── Node S
│   ├── [User Message 1: 100 tokens] ─── Leaf A
│   └── [User Message 2: 150 tokens] ─── Leaf B
└── [Different System Prompt: 1500 tokens] ─── Node D
    └── [User Message 3: 200 tokens] ─── Leaf C

Now a fourth request arrives: System prompt (2,000 tokens) + User message 4 (80 tokens). The tree lookup proceeds:

  1. Start at Root. Compare tokens against child edges.
  2. Match the “System Prompt” edge (2,000 tokens) — full match. Arrive at Node S.
  3. Compare remaining tokens (User Message 4) against Node S’s child edges. No match — “User Message 4” differs from both “User Message 1” and “User Message 2.”
  4. Result: reuse the KV cache for the 2,000-token system prompt. Only compute KV for the 80 new tokens.

The tree is updated to:

Root
├── [System Prompt: 2000 tokens] ─── Node S
│   ├── [User Message 1: 100 tokens] ─── Leaf A
│   ├── [User Message 2: 150 tokens] ─── Leaf B
│   └── [User Message 4: 80 tokens] ─── Leaf D  (NEW)
└── [Different System Prompt: 1500 tokens] ─── Node D
    └── [User Message 3: 200 tokens] ─── Leaf C

ℹ️ Radix Tree Node Splitting

When a new sequence partially matches an existing edge, the edge is split. For example, if User Message 5 shares the first 60 tokens with User Message 1 but diverges after that, the “User Message 1” edge is split into a 60-token shared edge and two child edges (one for the remaining 40 tokens of Message 1, one for the remaining tokens of Message 5). This is the standard radix tree split operation.

Multi-Turn Conversations in the Radix Tree

Multi-turn conversations are where RadixAttention truly shines. Consider a 4-turn conversation:

  • Turn 1: [System] [User1] [Assistant1]
  • Turn 2: [System] [User1] [Assistant1] [User2] [Assistant2]
  • Turn 3: [System] [User1] [Assistant1] [User2] [Assistant2] [User3] [Assistant3]
  • Turn 4: [System] [User1] [Assistant1] [User2] [Assistant2] [User3] [Assistant3] [User4]

Each turn is a strict prefix extension of the previous turn. In the radix tree, this is a single path that extends deeper with each turn. When Turn 4 arrives, the tree lookup finds that all tokens through the end of Turn 3 (the assistant’s response) are already cached. Only [User4] needs new KV computation.

With hash-based APC, this works too — as long as all tokens align to block boundaries. But the radix tree handles it more gracefully when token boundaries shift, and the tree structure makes it trivial to implement:

from typing import Dict, List, Tuple

class RadixTreeNode:
    def __init__(self):
        self.children: Dict[int, 'RadixTreeNode'] = {}
        self.token_ids: List[int] = []
        self.kv_block_ids: List[int] = []  # References to physical KV blocks
        self.ref_count: int = 0  # Number of active requests using this node
        self.last_access: float = 0.0  # For LRU eviction

    def match_prefix(self, tokens: List[int]) -> Tuple['RadixTreeNode', int]:
        """Find the deepest node matching the given token sequence.

        Returns (node, matched_length).
        """
        matched = 0
        current = self

        while matched < len(tokens):
            next_token = tokens[matched]
            if next_token not in current.children:
                break

            child = current.children[next_token]
            edge_tokens = child.token_ids

            # Check how many tokens on this edge match
            edge_match = 0
            while (edge_match < len(edge_tokens) and
                   matched + edge_match < len(tokens) and
                   edge_tokens[edge_match] == tokens[matched + edge_match]):
                edge_match += 1

            matched += edge_match

            if edge_match == len(edge_tokens):
                current = child  # Full edge match, continue
            else:
                break  # Partial edge match

        return current, matched
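The split operation described in the note above can be sketched as well. `_Node` is a minimal stand-in for the `RadixTreeNode` class, and the insertion logic is an illustrative reconstruction of the standard radix-tree algorithm, not SGLang's actual code:

```python
# Insert a token sequence into a radix tree, splitting an edge when the new
# sequence only partially matches it.
class _Node:
    def __init__(self):
        self.children = {}    # first token of each outgoing edge -> child node
        self.token_ids = []   # tokens labeling the edge into this node

def insert_sequence(root: "_Node", tokens: list) -> None:
    current, i = root, 0
    while i < len(tokens):
        first = tokens[i]
        if first not in current.children:
            leaf = _Node()                      # no matching edge: new leaf
            leaf.token_ids = tokens[i:]
            current.children[first] = leaf
            return
        child = current.children[first]
        edge = child.token_ids
        j = 0
        while (j < len(edge) and i + j < len(tokens)
               and edge[j] == tokens[i + j]):
            j += 1
        if j == len(edge):                      # whole edge matched: descend
            current, i = child, i + j
            continue
        mid = _Node()                           # partial match: split edge at j
        mid.token_ids = edge[:j]
        child.token_ids = edge[j:]
        mid.children[child.token_ids[0]] = child
        current.children[first] = mid
        if i + j < len(tokens):                 # leftover input becomes a leaf
            leaf = _Node()
            leaf.token_ids = tokens[i + j:]
            mid.children[leaf.token_ids[0]] = leaf
        return

root = _Node()
insert_sequence(root, [1, 2, 3, 4])
insert_sequence(root, [1, 2, 9])
# root now branches after [1, 2]: one child edge [3, 4], one child edge [9]
```

A real implementation would also split the node's `kv_block_ids` list at the same offset as the tokens and fix up reference counts.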

Radix Tree vs. Hash Table: Performance Comparison

Prefix Matching: Radix Tree (SGLang) vs. Hash Table (vLLM APC)

| Scenario | Hash Table Matches | Radix Tree Matches | Tokens Saved (Radix) | Advantage |
|---|---|---|---|---|
| Identical system prompt (2K tokens) | 2,000 | 2,000 | 2,000 | Equal |
| Multi-turn (turn 5, 8K history) | 7,984 (block-aligned) | 8,000 | 8,000 | Slight edge |
| Partial prefix overlap (1,500/2,000) | 1,488 (block-aligned) | 1,500 | 1,500 | Slight edge |
| Branching conversations (3 variants) | Best single match | All 3 paths cached | Variable | Structural advantage |
| Dynamic few-shot (varying examples) | Misses on reordering | Matches shared prefix | Variable | Structural advantage |

Note: Block size = 16 for hash table. Radix tree does not require block alignment for matching.

For the common case of identical system prompts, both approaches perform similarly. The radix tree’s advantages emerge in more complex sharing patterns: branching conversations, dynamic prompt construction, and cases where block alignment matters.


4. Cache Hierarchies: GPU, CPU, and SSD Tiers

KV cache entries are large. A 2,000-token prefix for Llama-70B consumes ~640 MB. You cannot keep unlimited prefixes in GPU HBM. A cache hierarchy addresses this by storing hot prefixes on fast storage and cold prefixes on slower, cheaper storage.

The Three-Tier Architecture

KV Cache Hierarchy for Prefix Caching

| Tier | Capacity | Characteristics |
|---|---|---|
| GPU HBM (hot cache) | 10-30 GB | Active prefixes, sub-microsecond access. Holds 15-50 prefixes for a 70B model. |
| CPU DRAM (warm cache) | 64-256 GB | Recently evicted prefixes, ~1-5 ms to transfer back to GPU over PCIe. |
| SSD (cold cache) | 1-4 TB | Historical prefixes, ~10-50 ms to load. NVMe required for acceptable latency. |

Each tier has different capacity, latency, and bandwidth characteristics:

Tier 1: GPU HBM (Hot)

  • Capacity: Whatever GPU memory remains after model weights and active request KV caches. Llama-70B in FP16 is ~140 GB of weights, so it is served with tensor parallelism; at TP=4, each 80 GB A100 holds a ~35 GB weight shard, and active requests use 20-40 GB. That leaves 5-25 GB for cached prefixes.
  • Access latency: Essentially zero — the KV blocks are already where they need to be.
  • Best for: The most frequently accessed prefixes (e.g., the single system prompt used by 95% of requests).

Tier 2: CPU DRAM (Warm)

  • Capacity: 64-512 GB on a typical server. Can store hundreds of cached prefixes.
  • Access latency: Transfer from CPU to GPU over PCIe 4.0 x16 (~32 GB/s per direction). A 640 MB prefix transfer takes ~20 ms. With PCIe 5.0, this halves to ~10 ms.
  • Best for: Prefixes accessed every few seconds. The transfer cost is amortized over many requests that benefit from the cached prefix.

Tier 3: SSD (Cold)

  • Capacity: 1-8 TB on NVMe. Can store thousands of cached prefixes.
  • Access latency: NVMe sequential read at 5-7 GB/s. A 640 MB prefix load takes ~90-130 ms. With pipelining and direct GPU access (GPUDirect Storage), this can be reduced to ~50 ms.
  • Best for: Infrequently accessed prefixes that are expensive to recompute. An 8,000-token prefix is ~2.5 GB of KV, so even a fast NVMe load takes a few hundred ms, but that is still cheaper than a ~500 ms prefill for the same prefix.

When SSD Tier Makes Sense

The SSD tier is only worthwhile when the prefix recomputation cost exceeds the SSD load cost. For a 2,000-token prefix (5 ms prefill), loading from SSD (50-100 ms) is 10-20x worse than recomputing. For an 8,000-token prefix (50-100 ms prefill), the SSD load time is comparable. The breakeven is roughly 4,000-6,000 tokens, depending on GPU speed and SSD bandwidth.

Eviction Policies

Each tier operates with a bounded memory budget. When the tier is full and a new prefix needs to be cached, something must be evicted. The standard approach is LRU (Least Recently Used) within each tier, with promotion and demotion between tiers.

Eviction flow:

  1. A prefix in GPU HBM is not accessed for a threshold period. It is demoted to CPU DRAM: the KV blocks are copied to CPU memory, and the GPU blocks are freed.
  2. A prefix in CPU DRAM is not accessed for a longer threshold. It is demoted to SSD: the KV blocks are written to NVMe, and the CPU memory is freed.
  3. A prefix in SSD is not accessed for an extended period. It is evicted entirely: the SSD space is freed. The prefix must be recomputed from scratch if needed again.

Promotion flow:

  1. A request arrives that matches a prefix cached in CPU DRAM. The KV blocks are promoted (copied) to GPU HBM. The prefill for those tokens is skipped.
  2. A request matches a prefix cached on SSD. The KV blocks are loaded to GPU HBM (possibly via CPU DRAM as a staging area).

from typing import Optional, Tuple

class CacheHierarchy:
    """Three-tier prefix cache. LRUCache, KVBlocks, and the transfer_*
    methods are assumed to be provided by the surrounding runtime."""

    def __init__(self, gpu_budget_gb: float, cpu_budget_gb: float,
                 ssd_budget_gb: float):
        self.gpu_cache = LRUCache(capacity_bytes=int(gpu_budget_gb * 1e9))
        self.cpu_cache = LRUCache(capacity_bytes=int(cpu_budget_gb * 1e9))
        self.ssd_cache = LRUCache(capacity_bytes=int(ssd_budget_gb * 1e9))

    def lookup(self, prefix_hash: int) -> Tuple[Optional[KVBlocks], str]:
        """Look up a prefix across all tiers. Returns (blocks, tier)."""
        # Check GPU first (fastest)
        blocks = self.gpu_cache.get(prefix_hash)
        if blocks is not None:
            return blocks, "gpu"

        # Check CPU (medium)
        blocks = self.cpu_cache.get(prefix_hash)
        if blocks is not None:
            # Promote to GPU
            gpu_blocks = self.transfer_cpu_to_gpu(blocks)
            self.gpu_cache.put(prefix_hash, gpu_blocks)
            return gpu_blocks, "cpu"

        # Check SSD (slowest)
        blocks = self.ssd_cache.get(prefix_hash)
        if blocks is not None:
            # Promote to GPU (via CPU staging)
            gpu_blocks = self.transfer_ssd_to_gpu(blocks)
            self.gpu_cache.put(prefix_hash, gpu_blocks)
            return gpu_blocks, "ssd"

        return None, "miss"

    def evict_gpu(self, prefix_hash: int, blocks: KVBlocks):
        """Demote from GPU to CPU."""
        cpu_blocks = self.transfer_gpu_to_cpu(blocks)
        self.cpu_cache.put(prefix_hash, cpu_blocks)
        self.gpu_cache.remove(prefix_hash)

Transfer Costs: The Hidden Tax

Tier promotion is not free. The transfer latency becomes part of the TTFT for the request that triggers it.

Prefix Cache Tier Transfer Costs (Llama-70B, 2K-Token Prefix = 640 MB)

| Source Tier | Transfer BW | Transfer Time | vs. Recompute (5 ms) | Worthwhile? |
|---|---|---|---|---|
| GPU HBM (hit) | N/A | 0 ms | Saves 5 ms | Always |
| CPU DRAM (PCIe 4.0) | 32 GB/s | ~20 ms | Costs 15 ms extra | No for 2K prefix |
| CPU DRAM (PCIe 5.0) | 64 GB/s | ~10 ms | Costs 5 ms extra | Breakeven |
| SSD (NVMe Gen4) | 7 GB/s | ~91 ms | Costs 86 ms extra | No for 2K prefix |
| SSD (NVMe Gen5) | 14 GB/s | ~46 ms | Costs 41 ms extra | No for 2K prefix |

Note: Transfer times are for the full 640 MB KV cache. Amortized over N requests sharing the prefix, the per-request cost is transfer_time / N.

The key insight from this table: tier promotion is only worthwhile when the prefix is long enough or popular enough. A 2,000-token prefix that takes 5 ms to recompute is not worth loading from CPU DRAM (20 ms) unless you expect many subsequent requests to use it. The amortized cost is:

\text{Amortized cost per request} = \frac{\text{Transfer time}}{N_{\text{requests sharing prefix}}}

For the CPU DRAM case: if 100 requests will share the promoted prefix, the amortized cost is 20 ms / 100 = 0.2 ms, which is much better than the 5 ms recompute. The critical parameter is the expected number of future requests that will use the prefix before it is evicted again.
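Encoded as a predicate (illustrative, not a vLLM or SGLang API), the promotion decision looks like this:

```python
# Promoting a prefix from a slower tier pays off when the transfer cost,
# amortized over expected future hits, beats recomputing the prefill.
def amortized_cost_ms(transfer_ms: float, expected_hits: int) -> float:
    return transfer_ms / max(expected_hits, 1)

def promote_worthwhile(transfer_ms: float, recompute_ms: float,
                       expected_hits: int) -> bool:
    return amortized_cost_ms(transfer_ms, expected_hits) < recompute_ms

amortized_cost_ms(20.0, 100)   # 0.2 ms, the CPU DRAM example above
```

The hard part in practice is estimating `expected_hits`, which depends on prefix popularity and how soon the prefix will be evicted again.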


5. Multi-Turn Conversation Caching

Multi-turn conversations are the most natural and impactful application of prefix caching. Each turn in a conversation is a strict extension of the previous turn’s context.

The Structure of Multi-Turn Requests

Consider a 5-turn conversation. Each request to the LLM looks like:

| Turn | Tokens Sent to Model | New Tokens | Cached Prefix Tokens |
|---|---|---|---|
| 1 | [Sys] + [U1] = 2,100 | 2,100 | 0 |
| 2 | [Sys] + [U1] + [A1] + [U2] = 2,500 | 400 | 2,100 |
| 3 | [Sys] + [U1] + [A1] + [U2] + [A2] + [U3] = 3,200 | 700 | 2,500 |
| 4 | [Sys] + [...history...] + [U4] = 4,000 | 800 | 3,200 |
| 5 | [Sys] + [...history...] + [U5] = 4,600 | 600 | 4,000 |

Without prefix caching, every turn recomputes the full context from scratch. Turn 5 recomputes 4,600 tokens, of which 4,000 were already computed during turn 4.
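The tokens each turn must actually prefill are just successive differences of the context lengths in the table above. A small illustrative helper:

```python
# With prefix caching, each turn computes KV only for the delta beyond the
# previous turn's full context (context lengths restated from the table).
context_lengths = [2100, 2500, 3200, 4000, 4600]   # tokens sent at turns 1-5

def new_tokens_per_turn(lengths):
    prev, out = 0, []
    for n in lengths:
        out.append(n - prev)    # only the delta needs fresh KV computation
        prev = n
    return out

new_tokens_per_turn(context_lengths)   # [2100, 400, 700, 800, 600]
```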

With prefix caching, turn 5 only computes KV for the 600 new tokens. The TTFT savings grow with each turn:

TTFT per Turn: With vs. Without Prefix Caching (Llama-70B, A100)

| Scenario | TTFT |
|---|---|
| Turn 1, no cache (cold start) | 52 ms |
| Turn 1, system prompt cached | 5 ms |
| Turn 3, no cache (3,200 tokens) | 80 ms |
| Turn 3, cached (700 new tokens) | 18 ms |
| Turn 5, no cache (4,600 tokens) | 115 ms |
| Turn 5, cached (600 new tokens) | 15 ms |
By turn 5, prefix caching reduces TTFT by 87% — from 115 ms to 15 ms. The user experiences dramatically more responsive interactions in later conversation turns.

Cache Invalidation in Multi-Turn

Multi-turn caching has a subtle correctness concern: the cached prefix is only valid if the conversation history is byte-for-byte identical. This means:

  1. No prompt rewriting: if the system modifies previous turns (e.g., summarizing old context, trimming messages), the cache is invalidated.
  2. Deterministic tokenization: the tokens for the cached prefix must be identical. If the tokenizer produces different tokens for the same text (rare but possible with some normalization settings), the cache misses.
  3. Temperature and sampling do not affect cache validity: the KV cache is computed during prefill, which is deterministic regardless of sampling parameters. Different sampling settings for the same prefix share the same KV cache.

Branching Conversations

Users sometimes regenerate a response (click “regenerate” in a chat UI). This creates a branching conversation:

[System] [User1] [Assistant1_v1] [User2] ...
                  [Assistant1_v2] [User3] ...

Both branches share the prefix [System] [User1], but diverge at the assistant response. The radix tree handles this naturally — it creates a branch at the divergence point. The hash-based approach also handles it correctly, since the different assistant tokens produce different block hashes.

The interesting optimization is speculative caching of the branch point. When a user sends a message that could plausibly lead to a regeneration, the system can keep the pre-assistant KV cache pinned in GPU memory for a short window, enabling instant prefix reuse if the user regenerates.


6. Cache-Aware Request Routing

In a multi-GPU or multi-node serving deployment, each GPU maintains its own prefix cache. When a new request arrives, the router must decide which GPU to send it to. The choice of GPU has a profound effect on cache hit rates.

The Routing Tradeoff

There are two competing objectives:

  1. Cache locality: route the request to the GPU that has its prefix cached. This maximizes cache hits and minimizes TTFT.
  2. Load balance: route the request to the least-loaded GPU. This prevents hotspots and maintains throughput.

These objectives conflict. If 90% of requests share the same system prompt, cache locality wants to send them all to GPU 0 (which has the cached prefix). But this overloads GPU 0 while GPUs 1-7 sit idle.

Cache-Aware Routing Strategies

Strategy 1: Prefix-Hash Routing

Compute a hash of the request’s prefix (or its first N tokens) and use consistent hashing to route to a GPU. Requests with the same prefix always go to the same GPU. This maximizes cache hits for the dominant prefix but creates hotspots when one prefix is much more popular than others.

ROUTE_PREFIX_LEN = 512  # leading tokens that identify the routing prefix (illustrative value)

def route_by_prefix_hash(request, gpu_pool):
    """Route based on prefix hash. Good locality, poor balance."""
    prefix_hash = hash(tuple(request.tokens[:ROUTE_PREFIX_LEN]))
    gpu_idx = prefix_hash % len(gpu_pool)
    return gpu_pool[gpu_idx]

Strategy 2: Cache-Aware Load Balancing

Check which GPUs have the request’s prefix cached, then choose the least-loaded among those. If no GPU has the prefix, fall back to pure load balancing.

def route_cache_aware(request, gpu_pool, cache_registry):
    """Route to least-loaded GPU that has the prefix cached."""
    prefix_tokens = tuple(request.tokens[:request.system_prompt_len])

    # Find GPUs with cached prefix
    gpus_with_cache = [
        gpu for gpu in gpu_pool
        if cache_registry.has_prefix(gpu.id, prefix_tokens)
    ]

    if gpus_with_cache:
        # Choose least loaded among GPUs with cache
        return min(gpus_with_cache, key=lambda g: g.active_requests)
    else:
        # No cache hit possible, use pure load balancing
        return min(gpu_pool, key=lambda g: g.active_requests)

Strategy 3: Replicated Caching with Affinity Groups

Replicate popular prefixes across multiple GPUs (not all — a subset called the “affinity group”). Route requests to any GPU in the affinity group, balancing load within the group.

This is the approach most production systems converge on. For a system prompt used by 80% of traffic and an 8-GPU deployment:

  • GPUs 0-5 (6 GPUs): cache the primary system prompt. 80% of traffic is distributed across these 6 GPUs.
  • GPUs 6-7 (2 GPUs): handle requests with rare or unique prefixes. 20% of traffic goes here.

The affinity group size is chosen to balance cache memory overhead (each GPU in the group spends memory on the cached prefix) against load distribution.
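Strategy 3 can be sketched concretely. `Gpu`, the load numbers, and the token IDs below are hypothetical stand-ins, not part of any serving framework's API:

```python
# Affinity-group routing: requests bearing the popular prefix are balanced
# across the affinity group; everything else falls through to the rest.
from dataclasses import dataclass

@dataclass
class Gpu:
    id: int
    active_requests: int

def route_affinity(request_tokens, popular_prefix, affinity_group, others):
    n = len(popular_prefix)
    pool = affinity_group if request_tokens[:n] == popular_prefix else others
    return min(pool, key=lambda g: g.active_requests)

group = [Gpu(i, load) for i, load in enumerate([3, 1, 2, 5, 4, 2])]  # GPUs 0-5
others = [Gpu(6, 0), Gpu(7, 1)]                                      # GPUs 6-7
popular = [101, 102, 103]          # stand-in for the 2K-token system prompt

route_affinity([101, 102, 103, 7], popular, group, others)   # least-loaded in group
route_affinity([55, 56], popular, group, others)             # least-loaded in others
```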

Routing Strategy Comparison (8 GPUs, 80% Shared Prefix Traffic)

| Strategy | Cache Hit Rate | Load Imbalance | Avg TTFT | P99 TTFT |
|---|---|---|---|---|
| Round-robin (no cache awareness) | 12.5% | Low | 52 ms | 115 ms |
| Prefix-hash routing | 95% | Very high (8x) | 8 ms (median) | 280 ms (tail) |
| Cache-aware load balance | 85% | Moderate | 14 ms | 65 ms |
| Affinity groups (6+2 split) | 92% | Low | 10 ms | 48 ms |

Note: Llama-70B on 8x A100-80GB. 1,000 QPS with 80% sharing a 2K-token system prompt.

The affinity group approach achieves the best balance: near-optimal cache hit rates with acceptable load distribution. Pure prefix-hash routing has the highest hit rate but terrible P99 latency because it creates severe hotspots.

⚠️ Routing Staleness

Cache-aware routing requires the router to know which GPUs have which prefixes cached. This information can become stale if the cache evicts a prefix between the routing decision and the request execution. In practice, systems maintain an approximate cache registry that is updated asynchronously, accepting occasional cache misses rather than adding synchronous registry lookups to the critical path.


7. When Prefix Caching Hurts

Prefix caching is not free. There are scenarios where it degrades performance rather than improving it.

Cache Fragmentation

Prefix caching reserves GPU memory for cached KV blocks. These blocks may not be contiguous, and while they are held by the cache they cannot store KV data for active requests. If the cache holds many stale prefixes that are rarely reused, the memory they consume directly reduces the number of concurrent requests the GPU can serve.

Consider an extreme case: 1,000 unique system prompts, each used once. The cache fills with 1,000 prefixes, none of which will ever be reused. Meanwhile, the memory consumed by these cached prefixes reduces the available space for active requests, lowering throughput.

$$\text{Effective KV budget} = \text{Total KV memory} - \text{Cache memory}$$

If the cache consumes 10 GB on an A100 with 40 GB available for KV, the effective KV budget drops by 25%. This means 25% fewer concurrent requests, which directly reduces throughput.

Cache Metadata Overhead

Each cached prefix requires metadata: hash entries (for APC) or tree nodes (for RadixAttention), reference counts, timestamps, block mappings. For a large cache with millions of entries, this metadata can consume significant CPU memory and introduce lookup latency.

In SGLang’s radix tree, each node stores:

  • Token IDs for its edge (variable length)
  • Pointers to child nodes
  • KV block references
  • Reference count and timestamps

For a tree with 100,000 nodes, this is a few hundred MB of CPU memory — typically not a bottleneck. But the tree traversal latency (microseconds) adds to the scheduling critical path.
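The per-node metadata listed above can be sketched as a small structure. This mirrors the fields the text describes, not SGLang's actual implementation; all names are illustrative.

```python
import time
from dataclasses import dataclass, field

@dataclass
class RadixNode:
    token_ids: list                               # tokens on the edge into this node
    kv_block_ids: list                            # KV cache blocks holding those tokens
    children: dict = field(default_factory=dict)  # first edge token -> child node
    ref_count: int = 0                            # active requests pinning this subtree
    last_access: float = field(default_factory=time.monotonic)  # for LRU eviction
```

A node with a non-zero `ref_count` is pinned and cannot be evicted; eviction candidates are unpinned leaves ordered by `last_access`. At a few KB per node (token IDs plus block references dominate), 100,000 nodes lands in the hundreds-of-MB range the text cites.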

Unique Prompts with No Sharing

If every request has a unique prefix (no shared system prompt, no multi-turn conversations, no repeated documents), the cache hit rate is zero. The overhead of hashing, cache lookups, and metadata management is pure waste. Both vLLM and SGLang allow disabling prefix caching for this reason.

Cache Thrashing

When the working set of active prefixes exceeds the cache capacity, prefixes are continuously evicted and re-promoted. Each promotion requires a data transfer (from CPU or SSD), and the promoted prefix may be evicted again before enough requests use it to amortize the transfer cost.

$$\text{Thrash condition:}\quad \sum_i \text{size}(\text{prefix}_i) > \text{cache capacity}$$

where the sum is over all prefixes accessed within one eviction window. The symptom is high cache miss rates despite non-zero sharing — the cache is churning through prefixes faster than it can benefit from them.
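A simple way to surface this condition in monitoring is to sum the sizes of prefixes touched within a sliding window and compare against cache capacity. The sketch below is a hypothetical detector, with a time-based window standing in for the eviction window:

```python
import time
from collections import OrderedDict

class ThrashDetector:
    """Flags thrashing when the working set of recently accessed
    prefixes exceeds the cache capacity (in tokens)."""
    def __init__(self, cache_capacity_tokens, window_seconds=60.0):
        self.capacity = cache_capacity_tokens
        self.window = window_seconds
        self._accesses = OrderedDict()  # prefix_id -> (last_seen, size_tokens)

    def record_access(self, prefix_id, size_tokens, now=None):
        now = time.monotonic() if now is None else now
        self._accesses[prefix_id] = (now, size_tokens)
        self._accesses.move_to_end(prefix_id)
        cutoff = now - self.window
        # Drop entries that fell out of the window (oldest first)
        while self._accesses:
            pid, (seen, _) = next(iter(self._accesses.items()))
            if seen >= cutoff:
                break
            self._accesses.popitem(last=False)

    def is_thrashing(self):
        working_set = sum(size for _, size in self._accesses.values())
        return working_set > self.capacity
```

When `is_thrashing()` fires persistently, the remedies are the ones discussed later in this post: grow the cache (e.g. with a CPU tier) or reduce prefix diversity.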

Throughput Impact of Prefix Caching by Sharing Level (% throughput change):

  • 95% shared prefix: +45%
  • 75% shared prefix: +28%
  • 50% shared prefix: +12%
  • 25% shared prefix: +3%
  • 5% shared (unique): −4%
  • 0% shared (thrashing): −8%

The crossover point is around 20-30% prefix sharing. Below that, the memory overhead of caching outweighs the compute savings. Above it, prefix caching is a clear win.

🚨 Know Your Sharing Level Before Enabling

Prefix caching is not a universal win. Profile your request traffic to measure the actual prefix sharing rate before enabling it. If your workload has diverse, unique prompts with minimal sharing, disable prefix caching to reclaim GPU memory for concurrent requests.
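One quick way to profile the sharing rate offline is to truncate each sampled prompt to a fixed-length prefix and count how many requests share that prefix with at least one other. The function below is a rough sketch: it uses raw characters as a proxy for tokens, and a real analysis would tokenize and align to the serving block size.

```python
from collections import Counter

def prefix_sharing_rate(prompts, prefix_chars=512):
    """Fraction of requests whose leading prefix_chars characters
    match at least one other request in the sample."""
    if not prompts:
        return 0.0
    counts = Counter(p[:prefix_chars] for p in prompts)
    shared = sum(c for c in counts.values() if c > 1)
    return shared / len(prompts)
```

Run it over a day of sampled traffic: a result below the ~20-30% crossover suggests leaving prefix caching off for that workload.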


8. Benchmarks: Hit Rates and TTFT Improvement

Hit Rate by Workload Type

The effectiveness of prefix caching depends entirely on the workload. Here is empirical data from SGLang benchmarks across different deployment scenarios:

📊

Prefix Cache Hit Rates by Workload Type

| Workload | Prefix Length | Sharing Rate | Cache Hit Rate | TTFT Reduction |
|---|---|---|---|---|
| Enterprise chatbot (single system prompt) | 2,048 tokens | 98% | 96% | 78% |
| Multi-tenant chatbot (10 system prompts) | 1,500 tokens | 90% | 88% | 65% |
| RAG pipeline (shared doc chunks) | 4,096 tokens | 60% | 55% | 42% |
| Multi-turn chat (5+ turns avg) | 3,000 tokens | 85% | 82% | 70% |
| Code completion (repo context) | 2,000 tokens | 40% | 35% | 25% |
| Batch processing (shared instructions) | 1,000 tokens | 95% | 93% | 55% |
| Open-ended API (diverse prompts) | 500 tokens | 15% | 12% | 5% |
Note: Measured with SGLang RadixAttention on Llama-70B, A100-80GB. Hit rate accounts for LRU evictions under memory pressure.

TTFT Improvement at Scale

The TTFT improvement from prefix caching compounds with prompt length and request volume. Here is the TTFT distribution for a high-traffic enterprise chatbot deployment:

TTFT Distribution: With vs. Without Prefix Caching (1K QPS)

| Percentile | No cache | With cache |
|---|---|---|
| P50 | 48 ms | 8 ms |
| P95 | 95 ms | 22 ms |
| P99 | 180 ms | 45 ms |

At P50, prefix caching reduces TTFT by 83%. At P99, the reduction is 75%. The tail latency improvement is particularly valuable for user-facing applications with SLA requirements.

Throughput Impact

Prefix caching improves throughput in two ways: (1) freed prefill capacity can serve more requests, and (2) the reduced KV computation per request allows higher concurrency.

📊

Throughput Impact of Prefix Caching (Llama-70B, 8x A100-80GB)

| Configuration | Throughput (req/s) | Avg TTFT (ms) | GPU Utilization | Cache Memory Used |
|---|---|---|---|---|
| No prefix caching | 420 | 52 | 78% | 0 GB |
| APC (vLLM) | 580 | 14 | 82% | 5.2 GB/GPU |
| RadixAttention (SGLang) | 610 | 11 | 84% | 6.1 GB/GPU |
| RadixAttention + CPU tier | 625 | 13 | 85% | 4.8 GB/GPU + 32 GB CPU |
Note: Enterprise chatbot workload: 2K system prompt, 95% prefix sharing, ShareGPT-like conversation length distribution.

RadixAttention achieves ~45% higher throughput than no caching, with 79% lower TTFT. The CPU tier configuration trades slightly higher TTFT (due to occasional CPU-to-GPU transfers) for lower GPU memory usage, enabling even more concurrent requests.

SGLang vs. vLLM: Prefix Caching Performance

In head-to-head comparisons on prefix-heavy workloads, SGLang’s RadixAttention consistently outperforms vLLM’s hash-based APC. The margin is largest on multi-turn conversation workloads where the radix tree’s structural advantages matter most:

SGLang vs. vLLM Prefix Caching (Multi-Turn Chat, Llama-70B)

  • vLLM (no cache): 420 requests/s (baseline)
  • vLLM APC: 560 requests/s (+33.3%)
  • SGLang RadixAttention: 640 requests/s (+52.4%)

The 52% throughput gain from SGLang’s RadixAttention over the no-cache baseline, versus 33% for vLLM’s APC, reflects the radix tree’s better handling of multi-turn prefix sharing patterns. On simple single-prefix workloads (identical system prompt, no multi-turn), the gap narrows to within 5%.


9. Practical Deployment Guidance

When to Enable Prefix Caching

Enable prefix caching when:

  • Your workload has a shared system prompt used by more than 50% of requests
  • You serve multi-turn conversations (average 3+ turns)
  • You use RAG with document chunk sharing across queries
  • You run batch processing with shared instruction templates

Disable prefix caching when:

  • Each request has a unique, non-repeating prompt
  • Your GPU memory is the primary bottleneck and you need maximum concurrent requests
  • Your workload is latency-insensitive (batch processing where TTFT does not matter)

Configuration Recommendations

For vLLM with automatic prefix caching:

# Enable APC
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-70b-hf \
    --enable-prefix-caching \
    --block-size 16 \
    --max-model-len 8192

For SGLang with RadixAttention:

# RadixAttention is enabled by default in SGLang
python -m sglang.launch_server \
    --model-path meta-llama/Llama-2-70b-hf \
    --mem-fraction-static 0.85 \
    --chunked-prefill-size 4096

Monitoring Cache Health

Track these metrics in production:

  • Cache hit rate: should be above 60% for prefix caching to be worthwhile
  • Cache memory usage: if approaching the budget, increase the eviction aggressiveness or add a CPU tier
  • TTFT distribution: compare with and without caching to quantify the benefit
  • Eviction rate: high eviction rates indicate cache thrashing — either increase capacity or reduce the prefix diversity
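All four metrics above can be derived from a handful of counters the serving layer already maintains. A hedged sketch of the bookkeeping, with illustrative names and thresholds taken from the guidance in this post:

```python
from dataclasses import dataclass

@dataclass
class CacheHealth:
    lookups: int = 0
    hits: int = 0
    evictions: int = 0
    cache_bytes_used: int = 0
    cache_bytes_budget: int = 0

    @property
    def hit_rate(self):
        return self.hits / self.lookups if self.lookups else 0.0

    @property
    def memory_pressure(self):
        return (self.cache_bytes_used / self.cache_bytes_budget
                if self.cache_bytes_budget else 0.0)

    def warnings(self):
        w = []
        if self.lookups and self.hit_rate < 0.60:
            w.append("hit rate below 60%: caching may not be worthwhile")
        if self.memory_pressure > 0.90:
            w.append("cache near budget: evict more aggressively or add a CPU tier")
        if self.lookups and self.evictions / self.lookups > 0.5:
            w.append("high eviction rate: likely thrashing")
        return w
```

Exporting these counters to your metrics stack and alerting on `warnings()` turns the checklist above into an automated health signal.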
💡 The Virtuous Cycle

Prefix caching creates a positive feedback loop: by reducing prefill time, it frees GPU cycles, which allows serving more requests, which increases the chance of cache hits (more requests with shared prefixes in a given time window). At high QPS, prefix caching’s benefits compound superlinearly with traffic volume.


Conclusion

Prefix caching is one of the highest-leverage optimizations in LLM serving today. For workloads with shared prefixes — which is the majority of production deployments — it delivers 40-80% TTFT reduction and 25-50% throughput improvement at the cost of some GPU memory.

The key technical decision is the data structure: hash-based matching (vLLM APC) is simpler and works well for single-prefix workloads; the radix tree (SGLang RadixAttention) handles multi-turn conversations and branching patterns more efficiently. The cache hierarchy (GPU/CPU/SSD tiers) extends the effective cache capacity but adds transfer latency that must be amortized. Cache-aware routing ensures requests reach GPUs that have their prefixes cached, turning a per-GPU optimization into a cluster-level one.

The next post in this series covers LoRA and QLoRA for multi-adapter serving — how to personalize models for thousands of customers while sharing a single set of base model weights across all of them.