I’ve watched production chatbots send the same 512-token system prompt through the prefill pipeline 10,000 times in a single day. “You are a helpful assistant trained by OpenAI…” — compute the query projections, multiply by key projections, run softmax, weight the value projections, do this across 80 layers, 40 attention heads, for 512 tokens. Takes 80 milliseconds on an H100. Then the next user message arrives and we do it all over again. The system prompt never changes. The KV cache for those 512 tokens is identical every single time. Without prefix caching, we burn 80ms per request on redundant computation. With prefix caching, the second request that shares the same system prompt skips the prefill entirely and jumps straight to processing the user’s actual message. The savings compound: at 100 requests per second, prefix caching recovers 8 full seconds of GPU time every second — time that can process more requests or serve longer contexts.
Prefix caching reuses KV cache blocks from previous requests that share the same prompt prefix. A chatbot system prompt prepended to every user message is the canonical example: the system prompt’s KV cache is computed once and reused by all subsequent requests. Without prefix caching, every request pays the full prefill cost for the shared prefix. With prefix caching, the second request that shares the same system prompt skips the prefill for those tokens entirely.
vLLM v0 had prefix caching, but it imposed overhead even when the hit rate was zero. The cache lookup added latency to every block allocation, and the hash computation consumed CPU cycles regardless of whether any prefix was actually shared. vLLM v1 redesigns prefix caching with a zero-overhead-at-miss architecture: when no prefix is shared, the cache lookup degenerates to a hash table probe that costs under 100 nanoseconds per block. When prefixes are shared, the savings are proportional to the shared prefix length.
Block-Level KV Cache Review
vLLM’s KV cache is divided into fixed-size blocks, each holding tokens (default ). For a sequence of length , the number of blocks is . Each block stores the K and V tensors for its tokens across all layers:
For Llama 70B with GQA-8 (8 KV heads, 128 head dim, 80 layers) in FP16:
On an H100 with 80 GB HBM, after the model weights (35 GB for INT4 quantized 70B), approximately 45 GB is available for KV cache, holding blocks, or tokens of cached context.
H100 Memory Layout: Llama 70B INT4 with KV Cache
Content-Addressable Hash Chains
The Hash Function
The core idea: each KV cache block is identified by the hash of its content (the token IDs it contains) chained with the hash of the preceding block. This creates a content-addressable chain where any prefix can be looked up by computing its block hashes.
import hashlib
import struct
class BlockHasher:
"""
Compute content-addressable hashes for KV cache blocks.
Each block's hash depends on:
1. The token IDs in the block
2. The hash of the previous block (chain)
3. The block position index (prevents collisions between
identical token sequences at different positions)
Hash chain:
h_0 = H(tokens[0:B], position=0, prev=INITIAL)
h_1 = H(tokens[B:2B], position=1, prev=h_0)
h_i = H(tokens[i*B:(i+1)*B], position=i, prev=h_{i-1})
"""
INITIAL_HASH = b'\x00' * 32 # SHA-256 of empty
def __init__(self, block_size=16):
self.block_size = block_size
def compute_chain(self, token_ids):
"""
Compute the full hash chain for a token sequence.
Returns: list of (block_hash, block_start, block_end)
"""
num_blocks = (len(token_ids) + self.block_size - 1) // self.block_size
chain = []
prev_hash = self.INITIAL_HASH
for i in range(num_blocks):
start = i * self.block_size
end = min(start + self.block_size, len(token_ids))
block_tokens = token_ids[start:end]
# Pad incomplete blocks with a sentinel value
if len(block_tokens) < self.block_size:
block_tokens = block_tokens + [-1] * (self.block_size - len(block_tokens))
block_hash = self._hash_block(block_tokens, i, prev_hash)
chain.append((block_hash, start, end))
prev_hash = block_hash
return chain
def _hash_block(self, tokens, position, prev_hash):
"""
Hash a single block.
Using SHA-256 for collision resistance. In production,
xxHash or CityHash is used for speed (collision risk is
acceptable given the application).
"""
h = hashlib.sha256()
h.update(prev_hash)
h.update(struct.pack('<I', position))
for token in tokens:
h.update(struct.pack('<i', token))
return h.digest()
def compute_prefix_hashes(self, token_ids):
"""
Compute hashes only for complete blocks.
Incomplete trailing block is not hashed (it may still
be in the process of being filled during prefill).
"""
num_complete_blocks = len(token_ids) // self.block_size
chain = []
prev_hash = self.INITIAL_HASH
for i in range(num_complete_blocks):
start = i * self.block_size
end = start + self.block_size
block_tokens = token_ids[start:end]
block_hash = self._hash_block(block_tokens, i, prev_hash)
chain.append(block_hash)
prev_hash = block_hash
return chain
Why Chaining Matters
Without chaining, two different prompts that happen to share the same 16 tokens at different positions would produce the same block hash, causing incorrect cache hits. Chaining ensures that a block’s hash encodes the entire prefix up to that block. Two blocks hash to the same value only if every preceding block also matches — which means the entire prefix is identical.
# Example: Why position matters
prompt_a = "The capital of France is Paris. The weather is"
prompt_b = "Paris. The weather is nice. The capital of France is"
# Without chaining, if "Paris. The weather is" appears in both,
# the block containing those tokens would have the same hash.
# With chaining, the hashes differ because the preceding blocks differ.
hasher = BlockHasher(block_size=16)
chain_a = hasher.compute_chain(tokenize(prompt_a))
chain_b = hasher.compute_chain(tokenize(prompt_b))
# chain_a[2] != chain_b[0] even if the tokens are identical,
# because the previous block hashes differ.
The Prefix Cache: Hash Table Design
from enum import Enum
from collections import OrderedDict
import time
class BlockStatus(Enum):
FREE = 0 # Block is unallocated
ALLOCATED = 1 # Block is in use by an active request
CACHED = 2 # Block is in the prefix cache (no active reference)
class CachedBlock:
"""Metadata for a block in the prefix cache."""
__slots__ = [
'block_id', 'block_hash', 'ref_count',
'last_access_time', 'status', 'prev_block', 'token_ids'
]
def __init__(self, block_id, block_hash, token_ids):
self.block_id = block_id # Physical block index in KV pool
self.block_hash = block_hash # Content hash (from hash chain)
self.ref_count = 0 # Number of active requests using this block
self.last_access_time = time.monotonic()
self.status = BlockStatus.FREE
self.prev_block = None # Previous block in chain (for chain validation)
self.token_ids = token_ids # Token content (for verification)
class PrefixCache:
"""
Content-addressable prefix cache for KV blocks.
Design goals:
1. O(1) lookup per block (hash table probe)
2. Zero overhead when hit rate is 0% (just the probe cost)
3. LRU eviction within unreferenced blocks
4. Reference counting for safe eviction
"""
def __init__(self, num_blocks, block_size=16):
self.num_blocks = num_blocks
self.block_size = block_size
# Hash table: block_hash -> CachedBlock
self.hash_table = {}
# Block pool: block_id -> CachedBlock
self.blocks = {}
for i in range(num_blocks):
block = CachedBlock(i, None, None)
block.status = BlockStatus.FREE
self.blocks[i] = block
# Free list: unallocated blocks (O(1) allocation)
self.free_blocks = list(range(num_blocks))
# Unreferenced LRU: cached blocks with ref_count=0
# OrderedDict for O(1) LRU access
self.unreferenced_lru = OrderedDict()
# Statistics
self.stats = PrefixCacheStats()
def lookup_prefix(self, token_ids):
"""
Find the longest cached prefix for a token sequence.
Returns: (num_cached_tokens, list of matched block_ids)
This is the hot path for prefix caching.
Cost: O(num_blocks_in_prefix) hash table lookups.
"""
hasher = BlockHasher(self.block_size)
block_hashes = hasher.compute_prefix_hashes(token_ids)
matched_blocks = []
for block_hash in block_hashes:
cached = self.hash_table.get(block_hash)
if cached is None:
# Cache miss: prefix ends here
break
# Verify: check that block content matches
# (defense against hash collisions)
expected_start = len(matched_blocks) * self.block_size
expected_tokens = token_ids[expected_start:expected_start + self.block_size]
if cached.token_ids != tuple(expected_tokens):
# Hash collision: treat as miss
self.stats.collisions += 1
break
# Cache hit: reference this block
self._reference_block(cached)
matched_blocks.append(cached.block_id)
self.stats.block_hits += 1
self.stats.block_misses += len(block_hashes) - len(matched_blocks)
num_cached_tokens = len(matched_blocks) * self.block_size
return num_cached_tokens, matched_blocks
def allocate_blocks(self, token_ids, num_cached_tokens):
"""
Allocate blocks for the uncached portion of a request.
The first `num_cached_tokens / block_size` blocks are already
matched (from lookup_prefix). This allocates blocks for the rest.
Returns: list of newly allocated block_ids
"""
total_blocks_needed = (len(token_ids) + self.block_size - 1) // self.block_size
cached_blocks = num_cached_tokens // self.block_size
new_blocks_needed = total_blocks_needed - cached_blocks
allocated = []
for _ in range(new_blocks_needed):
block_id = self._allocate_one_block()
if block_id is None:
# Out of blocks: must evict
block_id = self._evict_one_block()
if block_id is None:
raise OutOfBlocksError(
f"Cannot allocate {new_blocks_needed} blocks, "
f"no evictable blocks available"
)
allocated.append(block_id)
return allocated
def insert_blocks(self, token_ids, block_ids, start_token_offset=0):
"""
Insert completed blocks into the prefix cache.
Called after prefill completes: the KV cache for these blocks
is now valid and can be reused by future requests.
"""
hasher = BlockHasher(self.block_size)
chain = hasher.compute_chain(token_ids)
for i, (block_hash, start, end) in enumerate(chain):
if i >= len(block_ids):
break
block_id = block_ids[i]
block = self.blocks[block_id]
block.block_hash = block_hash
block.token_ids = tuple(token_ids[start:end])
block.status = BlockStatus.ALLOCATED
block.ref_count = 1
# Insert into hash table
self.hash_table[block_hash] = block
def release_blocks(self, block_ids):
"""
Release blocks when a request completes.
Decrements ref_count. If ref_count reaches 0, the block
moves to the unreferenced LRU (eligible for eviction)
but is NOT freed -- it stays in the cache for potential reuse.
"""
for block_id in block_ids:
block = self.blocks[block_id]
block.ref_count -= 1
if block.ref_count == 0:
block.status = BlockStatus.CACHED
block.last_access_time = time.monotonic()
self.unreferenced_lru[block_id] = block
def _reference_block(self, block):
"""Increment ref_count when a request matches a cached block."""
block.ref_count += 1
block.last_access_time = time.monotonic()
block.status = BlockStatus.ALLOCATED
# Remove from unreferenced LRU (it is now referenced)
self.unreferenced_lru.pop(block.block_id, None)
def _allocate_one_block(self):
"""Allocate a free block. Returns block_id or None."""
if self.free_blocks:
return self.free_blocks.pop()
return None
def _evict_one_block(self):
"""
Evict the least recently used unreferenced block.
Only blocks with ref_count=0 are evictable.
Referenced blocks (in use by active requests) are never evicted.
"""
if not self.unreferenced_lru:
return None
# Pop the oldest (least recently used) entry
block_id, block = self.unreferenced_lru.popitem(last=False)
# Remove from hash table
if block.block_hash in self.hash_table:
del self.hash_table[block.block_hash]
# Reset block metadata
block.block_hash = None
block.token_ids = None
block.ref_count = 0
block.status = BlockStatus.FREE
self.stats.evictions += 1
return block_id
When no prefix is shared, lookup_prefix() performs hash table lookups, each hitting None immediately. For a 4,096-token prompt with , that is 256 lookups. Each lookup is a Python dict probe: ~50-100 nanoseconds. Total overhead: 12.8-25.6 microseconds. The prefill for 4,096 tokens takes ~80 milliseconds on H100. The cache lookup is of the prefill time. This is what “zero overhead” means: the cache cost is unmeasurable in practice.
Eviction Policy: Referenced vs Unreferenced Partitions
Two-Partition Design
The block pool is logically partitioned into three categories:
- Referenced blocks (ref_count > 0): in use by active requests. Never evictable.
- Unreferenced cached blocks (ref_count = 0, in hash table): completed requests’ KV blocks. Evictable by LRU.
- Free blocks (never allocated or fully evicted): available for immediate allocation.
class EvictionPolicy:
"""
Two-tier eviction:
1. First, allocate from free blocks (no eviction needed)
2. Then, evict from unreferenced cached blocks (LRU order)
3. Never touch referenced blocks
This ensures that active requests are never disrupted by
prefix caching, and cached prefixes are preserved as long
as possible.
"""
def __init__(self, prefix_cache):
self.cache = prefix_cache
def allocate_with_eviction(self, num_blocks_needed):
"""Allocate blocks, evicting if necessary."""
allocated = []
for _ in range(num_blocks_needed):
# Tier 1: Free blocks
if self.cache.free_blocks:
block_id = self.cache.free_blocks.pop()
allocated.append(block_id)
continue
# Tier 2: Evict unreferenced cached blocks (LRU)
if self.cache.unreferenced_lru:
block_id = self.cache._evict_one_block()
allocated.append(block_id)
continue
# Tier 3: No blocks available
# This means all blocks are referenced by active requests.
# The scheduler must preempt a request to free blocks.
raise OutOfBlocksError(
f"All {self.cache.num_blocks} blocks are referenced. "
f"Need to preempt requests to free blocks."
)
return allocated
def get_eviction_candidates(self, num_needed):
"""
Preview which blocks would be evicted without actually evicting them.
Used by the scheduler to estimate whether a new request can be admitted.
"""
available = len(self.cache.free_blocks) + len(self.cache.unreferenced_lru)
can_satisfy = min(num_needed, available)
candidates = []
# Free blocks first (no eviction needed)
free_count = min(num_needed, len(self.cache.free_blocks))
candidates.extend(self.cache.free_blocks[:free_count])
# Then LRU unreferenced blocks
remaining = num_needed - free_count
if remaining > 0:
lru_items = list(self.cache.unreferenced_lru.keys())[:remaining]
candidates.extend(lru_items)
return candidates, can_satisfy >= num_needed
Eviction Ordering Within the LRU
The unreferenced LRU is ordered by last access time. When a block transitions from referenced to unreferenced (request completes), it enters the LRU at the tail (most recently used). Eviction always removes from the head (least recently used).
There is a subtlety: when a cached block is re-referenced (a new request matches the same prefix), it is removed from the LRU entirely. When that request completes, the block re-enters the LRU at the tail. This means frequently-accessed prefixes naturally stay in the cache even under memory pressure, because they keep cycling through the referenced partition.
# Timeline example:
# t=0: Request A arrives with system prompt "You are a helpful assistant..."
# - Blocks 0-3 allocated, prefilled, inserted into cache, ref_count=1
# t=1: Request A completes
# - Blocks 0-3: ref_count=0, enter LRU at tail
# - LRU: [0, 1, 2, 3] (oldest to newest)
# t=2: Request B arrives with same system prompt
# - lookup_prefix() matches blocks 0-3
# - Blocks 0-3: ref_count=1, removed from LRU
# - LRU: [] (empty)
# t=3: Request C arrives with different prompt, needs 4 blocks
# - No cache hits. Free blocks 4-7 allocated.
# t=4: Request B completes
# - Blocks 0-3: ref_count=0, re-enter LRU at tail
# - LRU: [0, 1, 2, 3]
# t=5: Request C completes
# - Blocks 4-7: ref_count=0, enter LRU at tail
# - LRU: [0, 1, 2, 3, 4, 5, 6, 7]
# t=6: New request needs 6 blocks, only 2 free blocks remain
# - Allocate 2 free blocks
# - Evict 4 from LRU head: blocks 0, 1, 2, 3 (the system prompt!)
# - The system prompt cache is now gone. Next request pays full prefill.
The LRU policy can evict frequently-used prefixes during traffic bursts. If many diverse requests arrive between two system-prompt requests, the system prompt blocks get pushed to the LRU head and evicted. A frequency-aware eviction policy (LFU or ARC) would preserve hot prefixes better, but vLLM v1 uses pure LRU for simplicity and predictability. The hit rate data below shows this is sufficient for most production workloads.
Hit Rate by Workload Type
Measurement Methodology
Hit rate is measured as the fraction of prefix blocks that are found in the cache:
Prefix Cache Hit Rate by Workload Pattern
| Workload | Shared Prefix Length | Request Rate | Hit Rate | TTFT Reduction |
|---|---|---|---|---|
| Chatbot (fixed system prompt) | 512 tokens | 100 req/s | 92-97% | 78% |
| Multi-turn chat (system + history) | 1,024-4,096 tokens | 50 req/s | 80-95% | 65-85% |
| RAG (shared instructions, diverse docs) | 256 tokens shared, 2K unique | 200 req/s | 25-40% | 10-18% |
| Code completion (file context) | Variable, 500-8K tokens | 300 req/s | 15-35% | 8-25% |
| Batch summarization (diverse docs) | 64 tokens (instruction only) | 500 req/s | 8-15% | 2-5% |
| Random prompts (benchmark) | 0 tokens | 100 req/s | 0-2% | 0% |
TTFT with and without Prefix Caching (Llama 70B, TP=4, H100)
(ms)For the chatbot workload, prefix caching reduces TTFT by 4.5x. For multi-turn chat with long conversation histories, the reduction is up to 7.6x. The savings scale with prefix length because longer shared prefixes mean more prefill compute is skipped.
Hit Rate Optimization Strategies
class PrefixCacheOptimizer:
"""
Strategies to maximize prefix cache hit rate.
"""
def optimize_prompt_ordering(self, requests):
"""
Strategy 1: Reorder requests to maximize prefix sharing.
Sort requests by prompt prefix so that requests with the
same prefix are processed consecutively. This maximizes
the chance that the prefix is still in cache when the next
request with the same prefix arrives.
"""
# Compute prefix hash for each request (first N tokens)
prefix_length = 64 # Compare first 64 tokens
request_prefixes = []
for req in requests:
prefix = tuple(req.token_ids[:prefix_length])
request_prefixes.append((prefix, req))
# Sort by prefix (groups same-prefix requests together)
request_prefixes.sort(key=lambda x: x[0])
return [req for _, req in request_prefixes]
def estimate_cache_budget(self, workload_profile):
"""
Strategy 2: Size the cache based on working set.
The working set = number of unique prefixes * blocks per prefix.
Cache should hold at least the working set for high hit rate.
"""
unique_prefixes = workload_profile.num_unique_prefixes
avg_prefix_blocks = workload_profile.avg_prefix_length // 16
working_set_blocks = unique_prefixes * avg_prefix_blocks
# Add 20% headroom for LRU churn
recommended_blocks = int(working_set_blocks * 1.2)
# Convert to memory
bytes_per_block = workload_profile.block_size_bytes
recommended_memory_gb = recommended_blocks * bytes_per_block / (1024**3)
return recommended_blocks, recommended_memory_gb
def monitor_and_adjust(self, cache, interval_seconds=60):
"""
Strategy 3: Dynamic monitoring.
Track hit rate over time. If hit rate drops below threshold,
increase cache size (if memory permits) or alert operator.
"""
snapshot = cache.stats.snapshot()
hit_rate = snapshot.block_hits / max(
snapshot.block_hits + snapshot.block_misses, 1
)
eviction_rate = snapshot.evictions / max(interval_seconds, 1)
metrics = {
'hit_rate': hit_rate,
'eviction_rate_per_sec': eviction_rate,
'cache_utilization': len(cache.hash_table) / cache.num_blocks,
'unreferenced_fraction': len(cache.unreferenced_lru) / cache.num_blocks,
}
return metrics
The Full Prefix Cache Manager
Combining hash chains, lookup, allocation, eviction, and insertion into a single manager:
class PrefixCacheManager:
"""
Complete prefix cache manager for vLLM v1.
Lifecycle:
1. New request arrives: lookup_prefix() to find cached blocks
2. Allocate blocks for uncached portion
3. Prefill the uncached blocks (KV computation)
4. Insert newly prefilled blocks into cache
5. Decode until completion
6. Release all blocks (move to unreferenced LRU)
"""
def __init__(self, num_gpu_blocks, block_size=16):
self.block_size = block_size
self.cache = PrefixCache(num_gpu_blocks, block_size)
self.hasher = BlockHasher(block_size)
self.eviction = EvictionPolicy(self.cache)
# Per-request tracking: request_id -> list of block_ids
self.request_blocks = {}
def on_request_arrive(self, request_id, token_ids):
"""
Handle new request: find cached prefix, allocate remaining blocks.
Returns:
cached_token_count: number of tokens found in cache
all_block_ids: ordered list of block IDs for the full sequence
needs_prefill_from: token position to start prefill
"""
# Step 1: Find longest cached prefix
cached_tokens, cached_block_ids = self.cache.lookup_prefix(token_ids)
# Step 2: Allocate blocks for uncached portion
total_tokens = len(token_ids)
total_blocks = (total_tokens + self.block_size - 1) // self.block_size
new_blocks_needed = total_blocks - len(cached_block_ids)
new_block_ids = self.eviction.allocate_with_eviction(new_blocks_needed)
# Step 3: Combine cached + new blocks
all_block_ids = cached_block_ids + new_block_ids
self.request_blocks[request_id] = all_block_ids
return cached_tokens, all_block_ids, cached_tokens
def on_prefill_complete(self, request_id, token_ids):
"""
After prefill: insert new blocks into the cache.
Only blocks that were NOT cached need insertion.
"""
block_ids = self.request_blocks[request_id]
self.cache.insert_blocks(token_ids, block_ids)
def on_new_decode_block(self, request_id, block_position):
"""
During decode: allocate a new block when the current block is full.
Called when the sequence length crosses a block boundary.
"""
new_block_id = self.eviction.allocate_with_eviction(1)[0]
self.request_blocks[request_id].append(new_block_id)
return new_block_id
def on_request_complete(self, request_id):
"""
Request finished: release all blocks to the cache.
Blocks remain in the hash table for future reuse.
Their ref_count drops to 0, making them evictable.
"""
block_ids = self.request_blocks.pop(request_id)
self.cache.release_blocks(block_ids)
def on_request_preempted(self, request_id):
"""
Request preempted by scheduler (e.g., out of KV cache space).
Same as completion: release blocks. The KV data is still valid,
so the blocks stay in the cache. If the request is resumed later,
it can re-match its prefix from the cache.
"""
block_ids = self.request_blocks.pop(request_id)
self.cache.release_blocks(block_ids)
def get_stats(self):
"""Return cache statistics."""
stats = self.cache.stats
total = stats.block_hits + stats.block_misses
hit_rate = stats.block_hits / max(total, 1)
return {
'hit_rate': hit_rate,
'block_hits': stats.block_hits,
'block_misses': stats.block_misses,
'evictions': stats.evictions,
'collisions': stats.collisions,
'cached_blocks': len(self.cache.hash_table),
'free_blocks': len(self.cache.free_blocks),
'unreferenced_blocks': len(self.cache.unreferenced_lru),
'referenced_blocks': sum(
1 for b in self.cache.blocks.values() if b.ref_count > 0
),
}
class PrefixCacheStats:
"""Mutable statistics counters."""
def __init__(self):
self.block_hits = 0
self.block_misses = 0
self.evictions = 0
self.collisions = 0
def snapshot(self):
"""Return a copy of current stats."""
snap = PrefixCacheStats()
snap.block_hits = self.block_hits
snap.block_misses = self.block_misses
snap.evictions = self.evictions
snap.collisions = self.collisions
return snap
Integration with the v1 Block Manager
The prefix cache integrates with vLLM v1’s block manager at two points: allocation and the scheduler’s capacity check.
class V1BlockManagerWithPrefixCache:
"""
Extended block manager with prefix caching.
The block manager's can_allocate() and allocate() methods
are modified to account for cached blocks.
"""
def __init__(self, num_blocks, block_size, enable_prefix_cache=True):
self.num_blocks = num_blocks
self.block_size = block_size
if enable_prefix_cache:
self.prefix_cache = PrefixCacheManager(num_blocks, block_size)
else:
self.prefix_cache = None
def can_allocate(self, request):
"""
Check if a request can be allocated without preemption.
With prefix caching, the answer depends on:
1. How many blocks are already cached (free prefill)
2. How many free + evictable blocks are available
"""
total_blocks_needed = (
(len(request.token_ids) + self.block_size - 1) // self.block_size
)
if self.prefix_cache is not None:
cached_tokens, _ = self.prefix_cache.cache.lookup_prefix(
request.token_ids
)
cached_blocks = cached_tokens // self.block_size
new_blocks_needed = total_blocks_needed - cached_blocks
else:
new_blocks_needed = total_blocks_needed
available = (
len(self.prefix_cache.cache.free_blocks) +
len(self.prefix_cache.cache.unreferenced_lru)
if self.prefix_cache else self.num_blocks
)
return new_blocks_needed <= available
def allocate(self, request):
"""Allocate blocks for a request, using prefix cache if available."""
if self.prefix_cache is not None:
return self.prefix_cache.on_request_arrive(
request.request_id, request.token_ids
)
else:
# No prefix cache: allocate all blocks fresh
total_blocks = (
(len(request.token_ids) + self.block_size - 1) // self.block_size
)
block_ids = list(range(total_blocks)) # simplified
return 0, block_ids, 0
def free(self, request):
"""Release blocks for a completed/preempted request."""
if self.prefix_cache is not None:
self.prefix_cache.on_request_complete(request.request_id)
Measuring Cache Effectiveness
Prefix Length Distribution Analysis
class PrefixAnalyzer:
"""
Analyze workload to determine prefix caching effectiveness.
Run on a sample of requests to estimate potential savings.
"""
def __init__(self, block_size=16):
self.block_size = block_size
self.hasher = BlockHasher(block_size)
def analyze(self, request_log):
"""
Analyze a log of requests to compute prefix sharing statistics.
request_log: list of token_id lists (one per request)
"""
# Compute block hashes for all requests
all_chains = []
for token_ids in request_log:
chain = self.hasher.compute_prefix_hashes(token_ids)
all_chains.append(chain)
# Count hash occurrences
hash_counts = {}
for chain in all_chains:
for block_hash in chain:
h = block_hash.hex()
hash_counts[h] = hash_counts.get(h, 0) + 1
# Compute shared prefix statistics
shared_blocks = sum(1 for count in hash_counts.values() if count > 1)
total_blocks = sum(len(chain) for chain in all_chains)
unique_blocks = len(hash_counts)
# Compute potential savings
# If all shared blocks are cached, prefill compute savings:
reusable_blocks = sum(
count - 1 for count in hash_counts.values() if count > 1
)
potential_savings = reusable_blocks / max(total_blocks, 1)
return {
'total_requests': len(request_log),
'total_blocks': total_blocks,
'unique_blocks': unique_blocks,
'shared_blocks': shared_blocks,
'reusable_block_instances': reusable_blocks,
'potential_prefill_savings': potential_savings,
'avg_shared_prefix_length': (
reusable_blocks * self.block_size / max(len(request_log), 1)
),
}
Prefix Cache Memory Overhead
| Component | Per Block | 8,587 Blocks (45 GB KV pool) | Impact |
|---|---|---|---|
| CachedBlock metadata | 64 bytes | 537 KB | 0.001% of KV pool |
| Hash table entry | 96 bytes | 805 KB | 0.002% of KV pool |
| LRU linked list node | 24 bytes | 201 KB | negligible |
| Token ID storage (verification) | 64 bytes (16 x int32) | 537 KB | 0.001% of KV pool |
| Total metadata overhead | 248 bytes | 2.08 MB | 0.004% of KV pool |
The prefix cache adds 2 MB of CPU memory overhead for an 8,587-block KV pool. This is negligible. The GPU memory layout is unchanged: blocks are allocated identically whether prefix caching is enabled or not. The only difference is the allocation and eviction logic.
Summary
vLLM v1’s prefix caching achieves zero overhead at 0% hit rate through content-addressable hash chains and a simple hash table lookup. The design separates blocks into referenced (never evictable), unreferenced cached (LRU evictable), and free partitions. The hash chain construction ensures that block identity encodes the entire prefix, preventing false matches across different prompt contexts. For chatbot workloads with shared system prompts, hit rates of 92-97% reduce TTFT by 4-8x. The metadata overhead is 2 MB for a 45 GB KV cache pool. The prefix cache manager integrates cleanly with v1’s block manager, requiring modifications only to the allocation and free paths.