Every request to a production LLM starts with the same work: computing the KV cache for the system prompt. A chatbot with a 2,000-token system prompt recomputes those same key-value pairs for every single user message. A RAG pipeline with a shared retrieval preamble does the same. A coding assistant with a standardized instruction block does the same. The prefill computation is identical across thousands of requests per second, and you pay for it every time.
Prefix caching eliminates this redundancy. Instead of recomputing KV cache entries for tokens you have already processed, you store them and reuse them when the same prefix appears again. The idea sounds simple, but the implementation involves non-trivial data structures, multi-tier memory hierarchies, eviction policies, and routing decisions that interact with every other part of the serving system.
This post covers the full landscape: the quantitative case for prefix caching, vLLM’s hash-based automatic prefix caching, SGLang’s RadixAttention with its radix tree for hierarchical prefix matching, multi-tier cache hierarchies spanning GPU HBM to CPU DRAM to SSD, multi-turn conversation caching, cache-aware request routing, the failure modes where prefix caching hurts, and production benchmarks.
1. The Problem: Redundant Prefill Computation
The Economics of System Prompts
Modern LLM deployments almost universally use system prompts. A typical ChatGPT-style system prompt is 500-2,000 tokens. Enterprise deployments with detailed instructions, tool definitions, and persona specifications often reach 4,000-8,000 tokens. Every request must process these tokens during prefill before it can begin processing the user’s actual query.
Let us put numbers on this. For Llama-2-70B on an A100-80GB:
- Prefill throughput: roughly 2,000-4,000 tokens/second (depending on batch size and sequence length)
- System prompt length: 2,000 tokens
- Prefill time for system prompt: roughly 5 ms of GPU time for the full 2,000-token prompt
At first glance, 5 ms sounds trivial. But at scale:
You are spending 5 GPU-seconds every second just recomputing the same system prompt KV cache. That is 5 entire GPUs worth of prefill capacity, burned on redundant work. At 1,000 QPS with an 8-GPU node, that is 62.5% of your prefill capacity wasted on tokens you have already processed.
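The arithmetic above fits in a back-of-envelope helper. This is a toy calculation reproducing the numbers in the text, not part of any serving stack:

```python
def redundant_prefill_load(qps: float, prefill_ms_per_request: float,
                           num_gpus: int) -> tuple:
    """GPU-seconds per second spent on redundant prefill, and the
    fraction of the node's total GPU capacity that represents."""
    gpu_seconds_per_second = qps * prefill_ms_per_request / 1000.0
    fraction_of_capacity = gpu_seconds_per_second / num_gpus
    return gpu_seconds_per_second, fraction_of_capacity

# 1,000 QPS, 5 ms of redundant system-prompt prefill per request, 8-GPU node
load, frac = redundant_prefill_load(1000, 5.0, 8)  # -> (5.0, 0.625)
```

At 1,000 QPS the redundant work consumes 5 GPU-seconds every second, i.e. 62.5% of an 8-GPU node.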
GPU-Seconds Wasted on Redundant System Prompt Prefill (GPU-seconds/s)

Beyond System Prompts
Redundant prefixes appear in many places beyond system prompts:
- Multi-turn conversations: each turn shares the entire conversation history as a prefix. Turn 5 of a conversation has turns 1-4 as a prefix that was already computed during turn 4.
- RAG pipelines: many queries retrieve the same popular documents. The prefill for those documents is identical.
- Few-shot prompting: the few-shot examples are the same across all requests. Only the query at the end changes.
- Batch processing: when processing a batch of items with the same instructions, every item shares the instruction prefix.
- A/B testing: different model versions or prompt variants still share large common prefixes.
The fundamental observation is that LLM request prefixes have high entropy at the tail but low entropy at the head. The system prompt, instruction block, and shared context are nearly identical across requests. The user-specific content comes at the end. Prefix caching exploits this asymmetry.
In production deployments, 60-90% of prefill tokens are shared across requests within a short time window. For enterprise chatbots with standardized system prompts, the sharing rate often exceeds 95%. Every shared token that is cached is a token you do not need to recompute.
What Prefix Caching Actually Saves
When you reuse a cached KV prefix, you skip the entire prefill computation for those tokens. The savings are:
- Compute: no QKV projections, no attention computation, no FFN forward pass for cached tokens
- Time-to-first-token (TTFT): the user sees the first generated token faster because prefill is shorter
- GPU utilization: freed prefill capacity can be used for new requests, increasing throughput
The cost you pay is memory: you must store the KV cache entries somewhere. For Llama-2-70B with FP16 KV cache, each token costs approximately 320 KB across all layers. A 2,000-token system prompt cache consumes ~640 MB of GPU memory per cached prefix. The tradeoff is memory for compute — a classic caching problem.
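The per-token figure follows directly from the model's KV geometry. A minimal sketch, assuming Llama-2-70B's published configuration (80 layers, 8 grouped-query KV heads of dimension 128, FP16):

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """KV cache bytes per token: one K and one V vector at every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value

per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
prefix_mb = 2000 * per_token / 1e6  # a 2,000-token system prompt
```

This gives 327,680 bytes per token (the 320 KB above, strictly KiB) and ~655 MB for a 2,000-token prefix, the ~640 MB figure in round binary units.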
2. Automatic Prefix Caching (vLLM)
vLLM introduced automatic prefix caching (APC) as a built-in feature. The design is elegant in its simplicity: hash the token content of each KV block, and if the hash matches a previously computed block, reuse it instead of recomputing.
How It Works
Recall that vLLM’s PagedAttention divides KV cache into fixed-size blocks, each holding a fixed number of tokens (typically 16). Automatic prefix caching extends this with a hash table:
- Block hashing: For each KV block, compute a hash based on the token IDs contained in that block plus the hash of the preceding block. This creates a chain of hashes that uniquely identifies the prefix up to that point.
- Hash lookup: When a new request arrives, hash its tokens block by block. For each block, check if the hash exists in the cache. If yes, reuse the physical KV cache block. If no, allocate a new block and compute the KV entries.
- Prefix matching: Because hashes are chained, a match at block i implies matches at all blocks 0 through i. The matching stops at the first miss.
```python
from typing import Dict, List

INITIAL_HASH = 0  # seed for the chain: block 0 has no predecessor


def compute_block_hash(token_ids: List[int], prev_hash: int) -> int:
    """Hash a KV block based on its tokens and the previous block's hash.

    The chaining ensures that two blocks with the same tokens but different
    prefixes produce different hashes.
    """
    block_content = tuple(token_ids)
    return hash((prev_hash, block_content))


def find_cached_prefix(request_tokens: List[int], block_size: int,
                       hash_table: Dict[int, "PhysicalBlock"]) -> int:
    """Find the longest cached prefix for a request.

    Returns the number of tokens that can be reused from cache.
    """
    prev_hash = INITIAL_HASH
    cached_blocks = 0
    for i in range(0, len(request_tokens), block_size):
        block_tokens = request_tokens[i:i + block_size]
        if len(block_tokens) < block_size:
            break  # Partial block, cannot cache
        block_hash = compute_block_hash(block_tokens, prev_hash)
        if block_hash in hash_table:
            cached_blocks += 1
            prev_hash = block_hash
        else:
            break  # First miss, stop matching
    return cached_blocks * block_size
```
Hash Chain Design
The critical design decision is chaining the hashes. A naive approach would hash each block independently, but this is incorrect. Consider two sequences:
- Sequence A: [system_prompt] [user_message_1]
- Sequence B: [different_prompt] [user_message_1]
If user_message_1 happens to fall at the same block boundary and contains the same tokens, an independent hash would incorrectly match the KV blocks. But the KV values are different because the attention computation depends on all preceding tokens. The hash chain captures this dependency: the hash of block i depends on the hash of block i-1, which depends on block i-2, and so on back to block 0.
This gives us a property similar to a Merkle chain: a match at block i guarantees correctness of the entire prefix through block i.
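The chain property is easy to demonstrate standalone. This toy snippet (not vLLM's actual hashing code) shows that the same block content under different predecessors yields different hashes:

```python
def chain_hashes(blocks, seed=0):
    """Fold block contents into a chained hash, one block at a time."""
    h = seed
    for block in blocks:
        h = hash((h, tuple(block)))
    return h

shared_block = [42, 43, 44]
a = chain_hashes([[1, 2, 3], shared_block])  # prefix A, then the shared block
b = chain_hashes([[7, 8, 9], shared_block])  # prefix B, then the same block
# a != b: the shared block's hash inherits its predecessor's hash
```

An independent (unchained) hash of `shared_block` would collide across the two sequences, which is exactly the bug the chain prevents.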
Block Boundary Alignment
Hash-based prefix caching only matches at block boundaries. If the block size is 16 and two requests share a 1,000-token prefix, the cache can reuse 62 full blocks = 992 tokens. The remaining 8 tokens must be recomputed.
This creates a subtle interaction with prompt design. If your system prompt is 2,000 tokens, you get perfect block alignment. If it is 2,003 tokens, the last 3-token partial block cannot be cached, and worse, it shifts all subsequent blocks out of alignment with other requests that have different-length prefixes after the system prompt.
For maximum cache hit rates, ensure your system prompt length is a multiple of the KV block size (typically 16 tokens). If your prompt is 2,003 tokens, add 13 padding tokens (e.g., newlines) to reach 2,016. This costs ~0.6% extra compute on a miss but ensures every subsequent block aligns correctly.
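The alignment rule is simple modular arithmetic; a small helper might look like:

```python
def pad_to_block_multiple(prompt_len: int, block_size: int = 16) -> int:
    """Tokens of padding needed to align a prompt to the KV block size."""
    return (-prompt_len) % block_size

padding = pad_to_block_multiple(2003)  # -> 13 tokens, reaching 2,016
```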
Limitations of Hash-Based Matching
Hash-based APC has two key limitations:
- All-or-nothing at each block: if even one token differs in a block, the entire block misses. There is no partial block reuse.
- No explicit structure over cached prefixes: the hash table is flat. The chained lookup does find the longest block-aligned match for a given request (the request's hash chain matches whichever cached blocks share its exact prefix content), but the table does not represent how cached prefixes relate to one another. If prefix A (2,000 tokens) and prefix B (1,800 tokens) share their first 1,500 tokens, the branch point between them is invisible: there is no cheap way to ask which cached prefixes share blocks, which makes sharing-aware eviction and scheduling awkward.
These limitations motivated SGLang’s radix tree approach.
3. RadixAttention (SGLang): Hierarchical Prefix Matching
SGLang introduced RadixAttention, which replaces the flat hash table with a radix tree (also known as a compressed trie or Patricia tree). This data structure is fundamentally better suited to the prefix matching problem.
Why a Radix Tree?
A radix tree stores strings (in this case, token sequences) in a tree where:
- Each edge represents a sequence of tokens (not just one token, which is the key difference from a plain trie)
- The path from root to any node represents the full prefix up to that point
- Shared prefixes are stored exactly once, with branching at the point of divergence
This gives three advantages over hash-based matching:
- Efficient longest-prefix lookup: given a new token sequence, traverse the tree from the root, following edges that match the input tokens. When you can no longer follow an edge, you have found the longest cached prefix. This is linear in the length of the input, with no need to check block boundaries.
- Partial prefix sharing: if two prefixes share the first 1,000 tokens but diverge after that, the tree stores the shared 1,000 tokens once and branches into two paths, with the branch point represented explicitly as a node.
- Multi-turn conversation support: a conversation with 5 turns naturally forms a path in the tree. Each new turn extends the path. If the user backtracks (e.g., regenerates turn 3), the tree branches at turn 3.
Worked Example: The Radix Tree in Action
Consider a serving system handling three types of requests:
- Type A: System prompt (2,000 tokens) + User message 1 (100 tokens)
- Type B: System prompt (2,000 tokens) + User message 2 (150 tokens)
- Type C: Different system prompt (1,500 tokens) + User message 3 (200 tokens)
After processing these requests, the radix tree looks like:
Root
├── [System Prompt: 2000 tokens] ─── Node S
│ ├── [User Message 1: 100 tokens] ─── Leaf A
│ └── [User Message 2: 150 tokens] ─── Leaf B
└── [Different System Prompt: 1500 tokens] ─── Node D
└── [User Message 3: 200 tokens] ─── Leaf C
Now a fourth request arrives: System prompt (2,000 tokens) + User message 4 (80 tokens). The tree lookup proceeds:
- Start at Root. Compare tokens against child edges.
- Match the “System Prompt” edge (2,000 tokens) — full match. Arrive at Node S.
- Compare remaining tokens (User Message 4) against Node S’s child edges. No match — “User Message 4” differs from both “User Message 1” and “User Message 2.”
- Result: reuse the KV cache for the 2,000-token system prompt. Only compute KV for the 80 new tokens.
The tree is updated to:
Root
├── [System Prompt: 2000 tokens] ─── Node S
│ ├── [User Message 1: 100 tokens] ─── Leaf A
│ ├── [User Message 2: 150 tokens] ─── Leaf B
│ └── [User Message 4: 80 tokens] ─── Leaf D (NEW)
└── [Different System Prompt: 1500 tokens] ─── Node D
└── [User Message 3: 200 tokens] ─── Leaf C
When a new sequence partially matches an existing edge, the edge is split. For example, if User Message 5 shares the first 60 tokens with User Message 1 but diverges after that, the “User Message 1” edge is split into a 60-token shared edge and two child edges (one for the remaining 40 tokens of Message 1, one for the remaining tokens of Message 5). This is the standard radix tree split operation.
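The split operation can be sketched on a bare-bones node type (hypothetical names, independent of SGLang's actual implementation):

```python
class Node:
    def __init__(self, edge_tokens):
        self.edge_tokens = edge_tokens  # tokens on the edge leading into this node
        self.children = {}              # first edge token -> child Node

def split_edge(parent: Node, child: Node, split_at: int) -> Node:
    """Split child's edge at offset `split_at`, inserting an intermediate node.

    The shared first `split_at` tokens move to the new intermediate node;
    the child keeps the remainder and becomes the intermediate node's child.
    """
    mid = Node(child.edge_tokens[:split_at])
    child.edge_tokens = child.edge_tokens[split_at:]
    parent.children[mid.edge_tokens[0]] = mid
    mid.children[child.edge_tokens[0]] = child
    return mid

root = Node([])
msg1 = Node([5, 6, 7, 8])        # the "User Message 1" edge
root.children[5] = msg1
mid = split_edge(root, msg1, 2)  # a new request diverges after 2 tokens
```

After the split, a diverging sequence attaches its own leaf under `mid`, and both paths share the KV blocks for the first two tokens.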
Multi-Turn Conversations in the Radix Tree
Multi-turn conversations are where RadixAttention truly shines. Consider a 4-turn conversation:
- Turn 1: [System] [User1] [Assistant1]
- Turn 2: [System] [User1] [Assistant1] [User2] [Assistant2]
- Turn 3: [System] [User1] [Assistant1] [User2] [Assistant2] [User3] [Assistant3]
- Turn 4: [System] [User1] [Assistant1] [User2] [Assistant2] [User3] [Assistant3] [User4]
Each turn is a strict prefix extension of the previous turn. In the radix tree, this is a single path that extends deeper with each turn. When Turn 4 arrives, the tree lookup finds that all tokens through the end of Turn 3 (the assistant’s response) are already cached. Only [User4] needs new KV computation.
With hash-based APC, this works too — as long as all tokens align to block boundaries. But the radix tree handles it more gracefully when token boundaries shift, and the tree structure makes it trivial to implement:
```python
from typing import Dict, List, Tuple


class RadixTreeNode:
    def __init__(self):
        # Children keyed by the first token of each child's edge
        self.children: Dict[int, "RadixTreeNode"] = {}
        self.token_ids: List[int] = []     # Tokens on the edge into this node
        self.kv_block_ids: List[int] = []  # References to physical KV blocks
        self.ref_count: int = 0      # Number of active requests using this node
        self.last_access: float = 0.0  # For LRU eviction

    def match_prefix(self, tokens: List[int]) -> Tuple["RadixTreeNode", int]:
        """Find the deepest node matching the given token sequence.

        Returns (node, matched_length). A partial edge match contributes to
        matched_length but leaves `node` at the last fully matched node.
        """
        matched = 0
        current = self
        while matched < len(tokens):
            next_token = tokens[matched]
            if next_token not in current.children:
                break
            child = current.children[next_token]
            edge_tokens = child.token_ids
            # Count how many tokens on this edge match the input
            edge_match = 0
            while (edge_match < len(edge_tokens) and
                   matched + edge_match < len(tokens) and
                   edge_tokens[edge_match] == tokens[matched + edge_match]):
                edge_match += 1
            matched += edge_match
            if edge_match == len(edge_tokens):
                current = child  # Full edge match, continue descending
            else:
                break  # Partial edge match, stop here
        return current, matched
```
Radix Tree vs. Hash Table: Performance Comparison
Prefix Matching: Radix Tree (SGLang) vs. Hash Table (vLLM APC)
| Scenario | Hash Table Matches | Radix Tree Matches | Tokens Saved (Radix) | Advantage |
|---|---|---|---|---|
| Identical system prompt (2K tokens) | 2,000 | 2,000 | 2,000 | Equal |
| Multi-turn (turn 5, 8K history) | 7,984 (block-aligned) | 8,000 | 8,000 | Slight edge |
| Partial prefix overlap (1,500/2,000) | 1,488 (block-aligned) | 1,500 | 1,500 | Slight edge |
| Branching conversations (3 variants) | Best single match | All 3 paths cached | Variable | Structural advantage |
| Dynamic few-shot (varying examples) | Misses on reordering | Matches shared prefix | Variable | Structural advantage |
For the common case of identical system prompts, both approaches perform similarly. The radix tree’s advantages emerge in more complex sharing patterns: branching conversations, dynamic prompt construction, and cases where block alignment matters.
4. Cache Hierarchies: GPU, CPU, and SSD Tiers
KV cache entries are large. A 2,000-token prefix for Llama-70B consumes ~640 MB. You cannot keep unlimited prefixes in GPU HBM. A cache hierarchy addresses this by storing hot prefixes on fast storage and cold prefixes on slower, cheaper storage.
The Three-Tier Architecture
KV Cache Hierarchy for Prefix Caching (GPU HBM: 10-30 GB; CPU DRAM: 64-256 GB; SSD: 1-4 TB)

Each tier has different capacity, latency, and bandwidth characteristics:
Tier 1: GPU HBM (Hot)
- Capacity: Whatever GPU memory remains after model weights and active request KV caches. For Llama-70B in FP16 served tensor-parallel across four A100-80GB GPUs, weights consume ~35 GB per GPU, and active requests use 20-40 GB. That leaves 5-25 GB per GPU for cached prefixes.
- Access latency: Essentially zero — the KV blocks are already where they need to be.
- Best for: The most frequently accessed prefixes (e.g., the single system prompt used by 95% of requests).
Tier 2: CPU DRAM (Warm)
- Capacity: 64-512 GB on a typical server. Can store hundreds of cached prefixes.
- Access latency: Transfer from CPU to GPU over PCIe 4.0 x16 (~32 GB/s per direction). A 640 MB prefix transfer takes ~20 ms. With PCIe 5.0, this halves to ~10 ms.
- Best for: Prefixes accessed every few seconds. The transfer cost is amortized over many requests that benefit from the cached prefix.
Tier 3: SSD (Cold)
- Capacity: 1-8 TB on NVMe. Can store thousands of cached prefixes.
- Access latency: NVMe sequential read at 5-7 GB/s. A 640 MB prefix load takes ~90-130 ms. With pipelining and direct GPU access (GPUDirect Storage), this can be reduced to ~50 ms.
- Best for: Infrequently accessed prefixes that are expensive to recompute. A 50 ms load from SSD is still much cheaper than a 500 ms prefill for an 8,000-token prefix.
The SSD tier is only worthwhile when the prefix recomputation cost exceeds the SSD load cost. For a 2,000-token prefix (5 ms prefill), loading from SSD (50-100 ms) is 10-20x worse than recomputing. For an 8,000-token prefix (50-100 ms prefill), the SSD load time is comparable. The breakeven is roughly 4,000-6,000 tokens, depending on GPU speed and SSD bandwidth.
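The raw load time scales linearly with prefix length. A minimal sketch, assuming the 320 KB/token KV footprint from Section 1 and the NVMe bandwidths above (pipelining and GPUDirect Storage can hide part of this, as noted):

```python
KV_BYTES_PER_TOKEN = 320 * 1024  # Llama-70B, FP16 KV cache

def ssd_load_ms(tokens: int, ssd_gbps: float = 7.0) -> float:
    """Time to stream a prefix's KV cache from NVMe at the given bandwidth."""
    return tokens * KV_BYTES_PER_TOKEN / (ssd_gbps * 1e9) * 1e3

load_2k = ssd_load_ms(2000)  # ~94 ms for a 2,000-token prefix
load_8k = ssd_load_ms(8000)  # ~375 ms for an 8,000-token prefix, before overlap
```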
Eviction Policies
Each tier operates with a bounded memory budget. When the tier is full and a new prefix needs to be cached, something must be evicted. The standard approach is LRU (Least Recently Used) within each tier, with promotion and demotion between tiers.
Eviction flow:
- A prefix in GPU HBM is not accessed for a threshold period. It is demoted to CPU DRAM: the KV blocks are copied to CPU memory, and the GPU blocks are freed.
- A prefix in CPU DRAM is not accessed for a longer threshold. It is demoted to SSD: the KV blocks are written to NVMe, and the CPU memory is freed.
- A prefix in SSD is not accessed for an extended period. It is evicted entirely: the SSD space is freed. The prefix must be recomputed from scratch if needed again.
Promotion flow:
- A request arrives that matches a prefix cached in CPU DRAM. The KV blocks are promoted (copied) to GPU HBM. The prefill for those tokens is skipped.
- A request matches a prefix cached on SSD. The KV blocks are loaded to GPU HBM (possibly via CPU DRAM as a staging area).
```python
from typing import Optional, Tuple

# Assumes LRUCache (a byte-budgeted LRU map), KVBlocks (a set of KV cache
# blocks), and the transfer_* helpers are provided by the serving system.


class CacheHierarchy:
    def __init__(self, gpu_budget_gb: float, cpu_budget_gb: float,
                 ssd_budget_gb: float):
        self.gpu_cache = LRUCache(capacity_bytes=int(gpu_budget_gb * 1e9))
        self.cpu_cache = LRUCache(capacity_bytes=int(cpu_budget_gb * 1e9))
        self.ssd_cache = LRUCache(capacity_bytes=int(ssd_budget_gb * 1e9))

    def lookup(self, prefix_hash: int) -> Tuple[Optional["KVBlocks"], str]:
        """Look up a prefix across all tiers. Returns (blocks, tier)."""
        # Check GPU first (fastest: blocks are already usable in place)
        blocks = self.gpu_cache.get(prefix_hash)
        if blocks is not None:
            return blocks, "gpu"
        # Check CPU (medium: one PCIe transfer away)
        blocks = self.cpu_cache.get(prefix_hash)
        if blocks is not None:
            gpu_blocks = self.transfer_cpu_to_gpu(blocks)  # promote to GPU
            self.gpu_cache.put(prefix_hash, gpu_blocks)
            return gpu_blocks, "cpu"
        # Check SSD (slowest: NVMe read, possibly staged through CPU DRAM)
        blocks = self.ssd_cache.get(prefix_hash)
        if blocks is not None:
            gpu_blocks = self.transfer_ssd_to_gpu(blocks)  # promote to GPU
            self.gpu_cache.put(prefix_hash, gpu_blocks)
            return gpu_blocks, "ssd"
        return None, "miss"

    def evict_gpu(self, prefix_hash: int, blocks: "KVBlocks"):
        """Demote from GPU to CPU."""
        cpu_blocks = self.transfer_gpu_to_cpu(blocks)
        self.cpu_cache.put(prefix_hash, cpu_blocks)
        self.gpu_cache.remove(prefix_hash)
```
Transfer Costs: The Hidden Tax
Tier promotion is not free. The transfer latency becomes part of the TTFT for the request that triggers it.
Prefix Cache Tier Transfer Costs (Llama-70B, 2K-Token Prefix = 640 MB)
| Source Tier | Transfer BW | Transfer Time | vs. Recompute (5 ms) | Worthwhile? |
|---|---|---|---|---|
| GPU HBM (hit) | N/A | 0 ms | Saves 5 ms | Always |
| CPU DRAM (PCIe 4.0) | 32 GB/s | ~20 ms | Costs 15 ms extra | No for 2K prefix |
| CPU DRAM (PCIe 5.0) | 64 GB/s | ~10 ms | Costs 5 ms extra | Breakeven |
| SSD (NVMe Gen4) | 7 GB/s | ~91 ms | Costs 86 ms extra | No for 2K prefix |
| SSD (NVMe Gen5) | 14 GB/s | ~46 ms | Costs 41 ms extra | No for 2K prefix |
The key insight from this table: tier promotion is only worthwhile when the prefix is long enough or popular enough. A 2,000-token prefix that takes 5 ms to recompute is not worth loading from CPU DRAM (20 ms) unless you expect many subsequent requests to use it. The amortized cost is:

amortized cost per request = transfer time / (expected number of requests that reuse the prefix before eviction)
For the CPU DRAM case: if 100 requests will share the promoted prefix, the amortized cost is 20 ms / 100 = 0.2 ms, which is much better than the 5 ms recompute. The critical parameter is the expected number of future requests that will use the prefix before it is evicted again.
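That amortization argument fits in one small decision function (a sketch of the policy, not any framework's API):

```python
def promotion_pays_off(transfer_ms: float, recompute_ms: float,
                       expected_reuses: int) -> bool:
    """Promote from a lower tier only if the per-request amortized transfer
    cost beats recomputing the prefill from scratch."""
    amortized_ms = transfer_ms / max(expected_reuses, 1)
    return amortized_ms < recompute_ms

# CPU DRAM promotion of a 2K-token prefix: 20 ms transfer vs. 5 ms recompute
promotion_pays_off(20.0, 5.0, 1)    # -> False: a one-off request should recompute
promotion_pays_off(20.0, 5.0, 100)  # -> True: 0.2 ms amortized over 100 reuses
```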
5. Multi-Turn Conversation Caching
Multi-turn conversations are the most natural and impactful application of prefix caching. Each turn in a conversation is a strict extension of the previous turn’s context.
The Structure of Multi-Turn Requests
Consider a 5-turn conversation. Each request to the LLM looks like:
| Turn | Tokens Sent to Model | New Tokens | Cached Prefix Tokens |
|---|---|---|---|
| 1 | [Sys] + [U1] = 2,100 | 2,100 | 0 |
| 2 | [Sys] + [U1] + [A1] + [U2] = 2,500 | 400 | 2,100 |
| 3 | [Sys] + [U1] + [A1] + [U2] + [A2] + [U3] = 3,200 | 700 | 2,500 |
| 4 | [Sys] + [...history...] + [U4] = 4,000 | 800 | 3,200 |
| 5 | [Sys] + [...history...] + [U5] = 4,600 | 600 | 4,000 |
Without prefix caching, every turn recomputes the full context from scratch. Turn 5 recomputes 4,600 tokens, of which 4,000 were already computed during turn 4.
With prefix caching, turn 5 only computes KV for the 600 new tokens. The TTFT savings grow with each turn:
TTFT per Turn: With vs. Without Prefix Caching (Llama-70B, A100; ms)

By turn 5, prefix caching reduces TTFT by 87%, from 115 ms to 15 ms. The user experiences dramatically more responsive interactions in later conversation turns.
Cache Invalidation in Multi-Turn
Multi-turn caching has a subtle correctness concern: the cached prefix is only valid if the conversation history is byte-for-byte identical. This means:
- No prompt rewriting: if the system modifies previous turns (e.g., summarizing old context, trimming messages), the cache is invalidated.
- Deterministic tokenization: the tokens for the cached prefix must be identical. If the tokenizer produces different tokens for the same text (rare but possible with some normalization settings), the cache misses.
- Temperature and sampling do not affect cache validity: the KV cache is computed during prefill, which is deterministic regardless of sampling parameters. Different sampling settings for the same prefix share the same KV cache.
Branching Conversations
Users sometimes regenerate a response (click “regenerate” in a chat UI). This creates a branching conversation:
[System] [User1] [Assistant1_v1] [User2] ...
[Assistant1_v2] [User3] ...
Both branches share the prefix [System] [User1], but diverge at the assistant response. The radix tree handles this naturally — it creates a branch at the divergence point. The hash-based approach also handles it correctly, since the different assistant tokens produce different block hashes.
The interesting optimization is speculative caching of the branch point. When a user sends a message that could plausibly lead to a regeneration, the system can keep the pre-assistant KV cache pinned in GPU memory for a short window, enabling instant prefix reuse if the user regenerates.
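A sketch of that speculative pinning policy, as a TTL-based pin table (hypothetical names; real systems would tie this into the cache's reference counting):

```python
import time

class BranchPointPins:
    """Keep pre-assistant KV prefixes pinned briefly in case of a regenerate."""

    def __init__(self, ttl_s: float = 30.0):
        self.ttl_s = ttl_s
        self.pins = {}  # prefix_id -> expiry timestamp

    def pin(self, prefix_id: str) -> None:
        """Pin (or re-pin) a branch-point prefix for the TTL window."""
        self.pins[prefix_id] = time.monotonic() + self.ttl_s

    def is_pinned(self, prefix_id: str) -> bool:
        """Check a pin, lazily dropping it once the window has expired."""
        expiry = self.pins.get(prefix_id)
        if expiry is None or time.monotonic() > expiry:
            self.pins.pop(prefix_id, None)
            return False
        return True
```

A regenerate that arrives inside the window hits the pinned prefix instantly; after the TTL, the blocks fall back to normal LRU eviction.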
6. Cache-Aware Request Routing
In a multi-GPU or multi-node serving deployment, each GPU maintains its own prefix cache. When a new request arrives, the router must decide which GPU to send it to. The choice of GPU has a profound effect on cache hit rates.
The Routing Tradeoff
There are two competing objectives:
- Cache locality: route the request to the GPU that has its prefix cached. This maximizes cache hits and minimizes TTFT.
- Load balance: route the request to the least-loaded GPU. This prevents hotspots and maintains throughput.
These objectives conflict. If 90% of requests share the same system prompt, cache locality wants to send them all to GPU 0 (which has the cached prefix). But this overloads GPU 0 while GPUs 1-7 sit idle.
Cache-Aware Routing Strategies
Strategy 1: Prefix-Hash Routing
Compute a hash of the request’s prefix (or its first N tokens) and use consistent hashing to route to a GPU. Requests with the same prefix always go to the same GPU. This maximizes cache hits for the dominant prefix but creates hotspots when one prefix is much more popular than others.
```python
ROUTE_PREFIX_LEN = 256  # number of leading tokens used as the routing key

def route_by_prefix_hash(request, gpu_pool):
    """Route based on prefix hash. Good locality, poor balance.

    Modulo is used here for brevity; a production router would use
    consistent hashing so the mapping survives pool resizes.
    """
    prefix_hash = hash(tuple(request.tokens[:ROUTE_PREFIX_LEN]))
    gpu_idx = prefix_hash % len(gpu_pool)
    return gpu_pool[gpu_idx]
```
Strategy 2: Cache-Aware Load Balancing
Check which GPUs have the request’s prefix cached, then choose the least-loaded among those. If no GPU has the prefix, fall back to pure load balancing.
```python
def route_cache_aware(request, gpu_pool, cache_registry):
    """Route to the least-loaded GPU that has the prefix cached."""
    prefix_tokens = tuple(request.tokens[:request.system_prompt_len])
    # Find GPUs with the prefix cached
    gpus_with_cache = [
        gpu for gpu in gpu_pool
        if cache_registry.has_prefix(gpu.id, prefix_tokens)
    ]
    if gpus_with_cache:
        # Choose the least loaded among GPUs with the cache
        return min(gpus_with_cache, key=lambda g: g.active_requests)
    else:
        # No cache hit possible, use pure load balancing
        return min(gpu_pool, key=lambda g: g.active_requests)
```
Strategy 3: Replicated Caching with Affinity Groups
Replicate popular prefixes across multiple GPUs (not all — a subset called the “affinity group”). Route requests to any GPU in the affinity group, balancing load within the group.
This is the approach most production systems converge on. For a system prompt used by 80% of traffic and an 8-GPU deployment:
- GPUs 0-5 (6 GPUs): cache the primary system prompt. 80% of traffic is distributed across these 6 GPUs.
- GPUs 6-7 (2 GPUs): handle requests with rare or unique prefixes. 20% of traffic goes here.
The affinity group size is chosen to balance cache memory overhead (each GPU in the group spends memory on the cached prefix) against load distribution.
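A minimal sketch of the 6+2 split described above (illustrative only; group sizes and the load table are assumptions):

```python
def route_affinity(has_primary_prefix: bool, loads: list,
                   primary_group=(0, 1, 2, 3, 4, 5),
                   overflow_group=(6, 7)) -> int:
    """Route to the least-loaded GPU within the request's affinity group."""
    group = primary_group if has_primary_prefix else overflow_group
    return min(group, key=lambda gpu: loads[gpu])

loads = [4, 2, 5, 3, 6, 1, 7, 0]  # active requests per GPU
route_affinity(True, loads)   # -> 5: least loaded within GPUs 0-5
route_affinity(False, loads)  # -> 7: least loaded within GPUs 6-7
```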
Routing Strategy Comparison (8 GPUs, 80% Shared Prefix Traffic)
| Strategy | Cache Hit Rate | Load Imbalance | Avg TTFT | P99 TTFT |
|---|---|---|---|---|
| Round-robin (no cache awareness) | 12.5% | Low | 52 ms | 115 ms |
| Prefix-hash routing | 95% | Very high (8x) | 8 ms (median) | 280 ms (tail) |
| Cache-aware load balance | 85% | Moderate | 14 ms | 65 ms |
| Affinity groups (6+2 split) | 92% | Low | 10 ms | 48 ms |
The affinity group approach achieves the best balance: near-optimal cache hit rates with acceptable load distribution. Pure prefix-hash routing has the highest hit rate but terrible P99 latency because it creates severe hotspots.
Cache-aware routing requires the router to know which GPUs have which prefixes cached. This information can become stale if the cache evicts a prefix between the routing decision and the request execution. In practice, systems maintain an approximate cache registry that is updated asynchronously, accepting occasional cache misses rather than adding synchronous registry lookups to the critical path.
7. When Prefix Caching Hurts
Prefix caching is not free. There are scenarios where it degrades performance rather than improving it.
Cache Fragmentation
Prefix caching reserves GPU memory for cached KV blocks. These blocks may not be contiguous, and they cannot be used for active request KV cache storage. If the cache holds many stale prefixes that are rarely reused, the memory they consume directly reduces the number of concurrent requests the GPU can serve.
Consider an extreme case: 1,000 unique system prompts, each used once. The cache fills with 1,000 prefixes, none of which will ever be reused. Meanwhile, the memory consumed by these cached prefixes reduces the available space for active requests, lowering throughput.
If the cache consumes 10 GB on an A100 with 40 GB available for KV, the effective KV budget drops by 25%. This means 25% fewer concurrent requests, which directly reduces throughput.
Cache Metadata Overhead
Each cached prefix requires metadata: hash entries (for APC) or tree nodes (for RadixAttention), reference counts, timestamps, block mappings. For a large cache with millions of entries, this metadata can consume significant CPU memory and introduce lookup latency.
In SGLang’s radix tree, each node stores:
- Token IDs for its edge (variable length)
- Pointers to child nodes
- KV block references
- Reference count and timestamps
For a tree with 100,000 nodes, this is a few hundred MB of CPU memory — typically not a bottleneck. But the tree traversal latency (microseconds) adds to the scheduling critical path.
Unique Prompts with No Sharing
If every request has a unique prefix (no shared system prompt, no multi-turn conversations, no repeated documents), the cache hit rate is zero. The overhead of hashing, cache lookups, and metadata management is pure waste. Both vLLM and SGLang allow disabling prefix caching for this reason.
Cache Thrashing
When the working set of active prefixes exceeds the cache capacity, prefixes are continuously evicted and re-promoted. Each promotion requires a data transfer (from CPU or SSD), and the promoted prefix may be evicted again before enough requests use it to amortize the transfer cost.
Thrashing sets in when the transfers cost more than the prefill they save:

Σp transfer_cost(p) > Σp (accesses(p) - 1) × prefill_time(p)

where the sum is over all prefixes p accessed within one eviction window. The symptom is high cache miss rates despite non-zero sharing: the cache is churning through prefixes faster than it can benefit from them.
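This thrashing condition can be checked empirically from an access log over one eviction window. A simplified model: one promotion per distinct prefix, and savings only on repeat accesses:

```python
from collections import Counter

def caching_pays_off(access_log, transfer_ms: float, prefill_ms: float) -> bool:
    """Within one eviction window, do prefill savings repay transfer costs?"""
    counts = Counter(access_log)
    transfer_cost = len(counts) * transfer_ms                     # one promotion each
    savings = sum((n - 1) * prefill_ms for n in counts.values())  # repeats only
    return savings > transfer_cost

caching_pays_off(["sys_a"] * 100, transfer_ms=20, prefill_ms=5)  # -> True
caching_pays_off(list(range(50)), transfer_ms=20, prefill_ms=5)  # -> False: churn
```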
Throughput Impact of Prefix Caching by Sharing Level (% throughput change)

The crossover point is around 20-30% prefix sharing. Below that, the memory overhead of caching outweighs the compute savings. Above it, prefix caching is a clear win.
Prefix caching is not a universal win. Profile your request traffic to measure the actual prefix sharing rate before enabling it. If your workload has diverse, unique prompts with minimal sharing, disable prefix caching to reclaim GPU memory for concurrent requests.
8. Benchmarks: Hit Rates and TTFT Improvement
Hit Rate by Workload Type
The effectiveness of prefix caching depends entirely on the workload. Here is empirical data from SGLang benchmarks across different deployment scenarios:
Prefix Cache Hit Rates by Workload Type
| Workload | Prefix Length | Sharing Rate | Cache Hit Rate | TTFT Reduction |
|---|---|---|---|---|
| Enterprise chatbot (single system prompt) | 2,048 tokens | 98% | 96% | 78% |
| Multi-tenant chatbot (10 system prompts) | 1,500 tokens | 90% | 88% | 65% |
| RAG pipeline (shared doc chunks) | 4,096 tokens | 60% | 55% | 42% |
| Multi-turn chat (5+ turns avg) | 3,000 tokens | 85% | 82% | 70% |
| Code completion (repo context) | 2,000 tokens | 40% | 35% | 25% |
| Batch processing (shared instructions) | 1,000 tokens | 95% | 93% | 55% |
| Open-ended API (diverse prompts) | 500 tokens | 15% | 12% | 5% |
TTFT Improvement at Scale
The TTFT improvement from prefix caching compounds with prompt length and request volume. Here is the TTFT distribution for a high-traffic enterprise chatbot deployment:
TTFT Distribution: With vs. Without Prefix Caching (1K QPS; ms)

At P50, prefix caching reduces TTFT by 83%. At P99, the reduction is 75%. The tail latency improvement is particularly valuable for user-facing applications with SLA requirements.
Throughput Impact
Prefix caching improves throughput in two ways: (1) freed prefill capacity can serve more requests, and (2) the reduced KV computation per request allows higher concurrency.
Throughput Impact of Prefix Caching (Llama-70B, 8x A100-80GB)
| Configuration | Throughput (req/s) | Avg TTFT (ms) | GPU Utilization | Cache Memory Used |
|---|---|---|---|---|
| No prefix caching | 420 | 52 | 78% | 0 GB |
| APC (vLLM) | 580 | 14 | 82% | 5.2 GB/GPU |
| RadixAttention (SGLang) | 610 | 11 | 84% | 6.1 GB/GPU |
| RadixAttention + CPU tier | 625 | 13 | 85% | 4.8 GB/GPU + 32 GB CPU |
RadixAttention achieves ~45% higher throughput than no caching, with 79% lower TTFT. The CPU tier configuration trades slightly higher TTFT (due to occasional CPU-to-GPU transfers) for lower GPU memory usage, enabling even more concurrent requests.
SGLang vs. vLLM: Prefix Caching Performance
In head-to-head comparisons on prefix-heavy workloads, SGLang’s RadixAttention consistently outperforms vLLM’s hash-based APC. The margin is largest on multi-turn conversation workloads where the radix tree’s structural advantages matter most:
SGLang vs. vLLM Prefix Caching (Multi-Turn Chat, Llama-70B, requests/s)

The 52% throughput gain from SGLang’s RadixAttention over the no-cache baseline, versus 33% for vLLM’s APC, reflects the radix tree’s better handling of multi-turn prefix sharing patterns. On simple single-prefix workloads (identical system prompt, no multi-turn), the gap narrows to within 5%.
9. Practical Deployment Guidance
When to Enable Prefix Caching
Enable prefix caching when:
- Your workload has a shared system prompt used by more than 50% of requests
- You serve multi-turn conversations (average 3+ turns)
- You use RAG with document chunk sharing across queries
- You run batch processing with shared instruction templates
Disable prefix caching when:
- Each request has a unique, non-repeating prompt
- Your GPU memory is the primary bottleneck and you need maximum concurrent requests
- Your workload is latency-insensitive (batch processing where TTFT does not matter)
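The enable/disable decision can be grounded in data rather than intuition: scan a sample of recent request logs and measure how often a leading token span repeats across requests. A rough sketch — the 256-token window is an arbitrary choice for illustration (real engines match at block granularity), and `prompts` is assumed to be a list of token-ID sequences:

```python
def prefix_sharing_rate(prompts, prefix_tokens=256):
    """Fraction of requests whose leading tokens were already seen in an
    earlier request -- a cheap proxy for the achievable cache hit rate."""
    seen, hits = set(), 0
    for tokens in prompts:
        key = tuple(tokens[:prefix_tokens])  # leading token-ID span
        if key in seen:
            hits += 1
        seen.add(key)
    return hits / len(prompts) if prompts else 0.0
```

If this proxy comes back below the 20-30% crossover discussed earlier, prefix caching is unlikely to pay for its memory on your traffic.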
Configuration Recommendations
For vLLM with automatic prefix caching:
```bash
# Enable automatic prefix caching (APC)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-70b-hf \
    --enable-prefix-caching \
    --block-size 16 \
    --max-model-len 8192
```
For SGLang with RadixAttention:
```bash
# RadixAttention is enabled by default in SGLang
python -m sglang.launch_server \
    --model-path meta-llama/Llama-2-70b-hf \
    --mem-fraction-static 0.85 \
    --chunked-prefill-size 4096
```
Monitoring Cache Health
Track these metrics in production:
- Cache hit rate: should stay above 60% for prefix caching to be worthwhile
- Cache memory usage: if it approaches the budget, evict more aggressively or add a CPU tier
- TTFT distribution: compare with and without caching to quantify the benefit
- Eviction rate: a high eviction rate indicates cache thrashing — either increase capacity or reduce prefix diversity
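A minimal sketch of how these counters might feed an alerting rule. The thresholds mirror the guidance above; the `CacheStats` structure is hypothetical, not a vLLM or SGLang API — in practice you would populate it from the engine's exported metrics:

```python
from dataclasses import dataclass

@dataclass
class CacheStats:
    """Hypothetical counters scraped from the serving engine's metrics."""
    hits: int = 0
    lookups: int = 0
    evicted_blocks: int = 0
    inserted_blocks: int = 0

    @property
    def hit_rate(self):
        return self.hits / self.lookups if self.lookups else 0.0

    @property
    def eviction_rate(self):
        # Evictions per inserted block; close to 1.0 means blocks are
        # pushed out about as fast as they arrive (thrashing).
        return self.evicted_blocks / self.inserted_blocks if self.inserted_blocks else 0.0

def cache_health_alerts(stats):
    """Return human-readable alerts for the two unhealthy regimes above."""
    alerts = []
    if stats.hit_rate < 0.60:
        alerts.append("hit rate below 60%: caching may not justify its memory")
    if stats.eviction_rate > 0.90:
        alerts.append("evictions ~ insertions: thrashing; grow capacity or add a CPU tier")
    return alerts
```

Checking these on a fixed interval (or wiring the same thresholds into your metrics dashboard) catches both failure modes: a cache that nobody hits, and a cache too small for its working set.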
Prefix caching creates a positive feedback loop: by reducing prefill time, it frees GPU cycles, which allows serving more requests, which increases the chance of cache hits (more requests with shared prefixes in a given time window). At high QPS, prefix caching’s benefits compound superlinearly with traffic volume.
Conclusion
Prefix caching is one of the highest-leverage optimizations in LLM serving today. For workloads with shared prefixes — which is the majority of production deployments — it delivers 40-80% TTFT reduction and 25-50% throughput improvement at the cost of some GPU memory.
The key technical decision is the data structure: hash-based matching (vLLM APC) is simpler and works well for single-prefix workloads; the radix tree (SGLang RadixAttention) handles multi-turn conversations and branching patterns more efficiently. The cache hierarchy (GPU/CPU/SSD tiers) extends the effective cache capacity but adds transfer latency that must be amortized. Cache-aware routing ensures requests reach GPUs that have their prefixes cached, turning a per-GPU optimization into a cluster-level one.
The next post in this series covers LoRA and QLoRA for multi-adapter serving — how to personalize models for thousands of customers while sharing a single set of base model weights across all of them.