Part of the series vLLM v1 & Omni Internals (10 of 25)
vLLM v1 Prefix Caching: Hash Chains, LRU Eviction, and Hit Rate Optimization

I’ve watched production chatbots send the same 512-token system prompt through the prefill pipeline 10,000 times in a single day. “You are a helpful assistant trained by OpenAI…” — compute the query and key projections, take their dot products, run softmax, weight the value projections, do this across 80 layers, 40 attention heads, for 512 tokens. Takes 80 milliseconds on an H100. Then the next user message arrives and we do it all over again. The system prompt never changes. The KV cache for those 512 tokens is identical every single time. Without prefix caching, we burn 80ms per request on redundant computation. With prefix caching, the second request that shares the same system prompt skips the prefill entirely and jumps straight to processing the user’s actual message. The savings compound: at 100 requests per second, prefix caching recovers 8 full seconds of GPU time every second — time that can process more requests or serve longer contexts.

Prefix caching reuses KV cache blocks from previous requests that share the same prompt prefix. A chatbot system prompt prepended to every user message is the canonical example: the system prompt’s KV cache is computed once and reused by all subsequent requests. Without prefix caching, every request pays the full prefill cost for the shared prefix. With prefix caching, the second request that shares the same system prompt skips the prefill for those tokens entirely.

vLLM v0 had prefix caching, but it imposed overhead even when the hit rate was zero. The cache lookup added latency to every block allocation, and the hash computation consumed CPU cycles regardless of whether any prefix was actually shared. vLLM v1 redesigns prefix caching with a zero-overhead-at-miss architecture: when no prefix is shared, the cache lookup degenerates to a hash table probe that costs under 100 nanoseconds per block. When prefixes are shared, the savings are proportional to the shared prefix length.

Block-Level KV Cache Review

vLLM’s KV cache is divided into fixed-size blocks, each holding $B$ tokens (default $B = 16$). For a sequence of length $L$, the number of blocks is $\lceil L / B \rceil$. Each block stores the K and V tensors for its $B$ tokens across all layers:

$$\text{block\_size} = B \times 2 \times n_{\text{layers}} \times n_{\text{kv\_heads}} \times d_{\text{head}} \times \text{dtype\_bytes}$$

For Llama 70B with GQA-8 (8 KV heads, 128 head dim, 80 layers) in FP16:

$$= 16 \times 2 \times 80 \times 8 \times 128 \times 2 = 5{,}242{,}880 \text{ bytes} \approx 5.24 \text{ MB per block}$$

On an H100 with 80 GB HBM, after the model weights (35 GB for INT4-quantized 70B), approximately 45 GB is available for KV cache, holding $\lfloor 45{,}000 / 5.24 \rfloor = 8{,}587$ blocks, or $8{,}587 \times 16 = 137{,}392$ tokens of cached context.

H100 Memory Layout: Llama 70B INT4 with KV Cache

| Region | Size | Lifetime |
|---|---|---|
| Model weights (INT4) | 35 GB | Static, loaded at startup |
| KV cache pool (8,587 blocks) | 45 GB | Dynamic, managed by block manager + prefix cache |
| Activations + workspace | ~0.5 GB | Transient, per-iteration |
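The block-size arithmetic above is easy to sanity-check in a few lines. This mirrors the article's own rounding of the per-block size to 5.24 MB:

```python
B, layers, kv_heads, head_dim, dtype_bytes = 16, 80, 8, 128, 2

# Per-block KV footprint: B tokens x (K and V) x layers x KV heads x head dim x bytes
block_bytes = B * 2 * layers * kv_heads * head_dim * dtype_bytes
print(block_bytes)                    # 5242880

# KV pool capacity on an H100 after 35 GB of INT4 weights
kv_pool_mb = 45_000                   # ~45 GB, expressed in MB
num_blocks = int(kv_pool_mb / 5.24)   # using the rounded per-block size
print(num_blocks, num_blocks * B)     # 8587 137392
```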

Content-Addressable Hash Chains

The Hash Function

The core idea: each KV cache block is identified by the hash of its content (the token IDs it contains) chained with the hash of the preceding block. This creates a content-addressable chain where any prefix can be looked up by computing its block hashes.

import hashlib
import struct

class BlockHasher:
    """
    Compute content-addressable hashes for KV cache blocks.

    Each block's hash depends on:
    1. The token IDs in the block
    2. The hash of the previous block (chain)
    3. The block position index (prevents collisions between
       identical token sequences at different positions)

    Hash chain:
      h_0 = H(tokens[0:B], position=0, prev=INITIAL)
      h_1 = H(tokens[B:2B], position=1, prev=h_0)
      h_i = H(tokens[i*B:(i+1)*B], position=i, prev=h_{i-1})
    """

    INITIAL_HASH = b'\x00' * 32  # 32-byte sentinel marking the start of a chain

    def __init__(self, block_size=16):
        self.block_size = block_size

    def compute_chain(self, token_ids):
        """
        Compute the full hash chain for a token sequence.

        Returns: list of (block_hash, block_start, block_end)
        """
        num_blocks = (len(token_ids) + self.block_size - 1) // self.block_size
        chain = []
        prev_hash = self.INITIAL_HASH

        for i in range(num_blocks):
            start = i * self.block_size
            end = min(start + self.block_size, len(token_ids))
            block_tokens = token_ids[start:end]

            # Pad incomplete blocks with a sentinel value
            if len(block_tokens) < self.block_size:
                block_tokens = block_tokens + [-1] * (self.block_size - len(block_tokens))

            block_hash = self._hash_block(block_tokens, i, prev_hash)
            chain.append((block_hash, start, end))
            prev_hash = block_hash

        return chain

    def _hash_block(self, tokens, position, prev_hash):
        """
        Hash a single block.

        Using SHA-256 for collision resistance. In production,
        xxHash or CityHash is used for speed (collision risk is
        acceptable given the application).
        """
        h = hashlib.sha256()
        h.update(prev_hash)
        h.update(struct.pack('<I', position))
        for token in tokens:
            h.update(struct.pack('<i', token))
        return h.digest()

    def compute_prefix_hashes(self, token_ids):
        """
        Compute hashes only for complete blocks.

        Incomplete trailing block is not hashed (it may still
        be in the process of being filled during prefill).
        """
        num_complete_blocks = len(token_ids) // self.block_size
        chain = []
        prev_hash = self.INITIAL_HASH

        for i in range(num_complete_blocks):
            start = i * self.block_size
            end = start + self.block_size
            block_tokens = token_ids[start:end]

            block_hash = self._hash_block(block_tokens, i, prev_hash)
            chain.append(block_hash)
            prev_hash = block_hash

        return chain

Why Chaining Matters

Without chaining, two different prompts that happen to share the same 16 tokens at different positions would produce the same block hash, causing incorrect cache hits. Chaining ensures that a block’s hash encodes the entire prefix up to that block. Two blocks hash to the same value only if every preceding block also matches — which means the entire prefix is identical.

# Example: why position matters. `tokenize` here is a toy whitespace
# tokenizer standing in for a real BPE tokenizer.

_vocab = {}
def tokenize(text):
    return [_vocab.setdefault(word, len(_vocab)) for word in text.split()]

prompt_a = "The capital of France is Paris. The weather is"
prompt_b = "Paris. The weather is nice. The capital of France is"

# Without chaining, any block whose tokens happen to match across the two
# prompts would hash to the same value. With chaining, the first blocks
# differ, and that difference propagates through prev_hash to every
# later block in the chain.

hasher = BlockHasher(block_size=4)  # small block size so the prompts span several blocks
chain_a = hasher.compute_chain(tokenize(prompt_a))
chain_b = hasher.compute_chain(tokenize(prompt_b))

# Every hash in chain_b differs from its counterpart in chain_a:
assert all(ha != hb for (ha, _, _), (hb, _, _) in zip(chain_a, chain_b))

The Prefix Cache: Hash Table Design

from enum import Enum
from collections import OrderedDict
import time

class OutOfBlocksError(RuntimeError):
    """Raised when no free or evictable blocks remain."""

class BlockStatus(Enum):
    FREE = 0         # Block is unallocated
    ALLOCATED = 1    # Block is in use by an active request
    CACHED = 2       # Block is in the prefix cache (no active reference)

class CachedBlock:
    """Metadata for a block in the prefix cache."""

    __slots__ = [
        'block_id', 'block_hash', 'ref_count',
        'last_access_time', 'status', 'prev_block', 'token_ids'
    ]

    def __init__(self, block_id, block_hash, token_ids):
        self.block_id = block_id        # Physical block index in KV pool
        self.block_hash = block_hash    # Content hash (from hash chain)
        self.ref_count = 0              # Number of active requests using this block
        self.last_access_time = time.monotonic()
        self.status = BlockStatus.FREE
        self.prev_block = None          # Previous block in chain (for chain validation)
        self.token_ids = token_ids      # Token content (for verification)

class PrefixCache:
    """
    Content-addressable prefix cache for KV blocks.

    Design goals:
    1. O(1) lookup per block (hash table probe)
    2. Zero overhead when hit rate is 0% (just the probe cost)
    3. LRU eviction within unreferenced blocks
    4. Reference counting for safe eviction
    """

    def __init__(self, num_blocks, block_size=16):
        self.num_blocks = num_blocks
        self.block_size = block_size

        # Hash table: block_hash -> CachedBlock
        self.hash_table = {}

        # Block pool: block_id -> CachedBlock
        self.blocks = {}
        for i in range(num_blocks):
            block = CachedBlock(i, None, None)
            block.status = BlockStatus.FREE
            self.blocks[i] = block

        # Free list: unallocated blocks (O(1) allocation)
        self.free_blocks = list(range(num_blocks))

        # Unreferenced LRU: cached blocks with ref_count=0
        # OrderedDict for O(1) LRU access
        self.unreferenced_lru = OrderedDict()

        # Statistics
        self.stats = PrefixCacheStats()

    def lookup_prefix(self, token_ids):
        """
        Find the longest cached prefix for a token sequence.

        Returns: (num_cached_tokens, list of matched block_ids)

        This is the hot path for prefix caching.
        Cost: O(num_blocks_in_prefix) hash table lookups.
        """
        hasher = BlockHasher(self.block_size)
        block_hashes = hasher.compute_prefix_hashes(token_ids)

        matched_blocks = []
        for block_hash in block_hashes:
            cached = self.hash_table.get(block_hash)

            if cached is None:
                # Cache miss: prefix ends here
                break

            # Verify: check that block content matches
            # (defense against hash collisions)
            expected_start = len(matched_blocks) * self.block_size
            expected_tokens = token_ids[expected_start:expected_start + self.block_size]

            if cached.token_ids != tuple(expected_tokens):
                # Hash collision: treat as miss
                self.stats.collisions += 1
                break

            # Cache hit: reference this block
            self._reference_block(cached)
            matched_blocks.append(cached.block_id)
            self.stats.block_hits += 1

        self.stats.block_misses += len(block_hashes) - len(matched_blocks)
        num_cached_tokens = len(matched_blocks) * self.block_size

        return num_cached_tokens, matched_blocks

    def allocate_blocks(self, token_ids, num_cached_tokens):
        """
        Allocate blocks for the uncached portion of a request.

        The first `num_cached_tokens / block_size` blocks are already
        matched (from lookup_prefix). This allocates blocks for the rest.

        Returns: list of newly allocated block_ids
        """
        total_blocks_needed = (len(token_ids) + self.block_size - 1) // self.block_size
        cached_blocks = num_cached_tokens // self.block_size
        new_blocks_needed = total_blocks_needed - cached_blocks

        allocated = []
        for _ in range(new_blocks_needed):
            block_id = self._allocate_one_block()
            if block_id is None:
                # Out of blocks: must evict
                block_id = self._evict_one_block()
                if block_id is None:
                    raise OutOfBlocksError(
                        f"Cannot allocate {new_blocks_needed} blocks, "
                        f"no evictable blocks available"
                    )

            allocated.append(block_id)

        return allocated

    def insert_blocks(self, token_ids, block_ids):
        """
        Insert completed blocks into the prefix cache.

        Called after prefill completes: the KV cache for these blocks
        is now valid and can be reused by future requests. Blocks that
        were matched from the cache are skipped, so the ref_count taken
        in lookup_prefix() is preserved.
        """
        hasher = BlockHasher(self.block_size)
        chain = hasher.compute_chain(token_ids)

        for i, (block_hash, start, end) in enumerate(chain):
            if i >= len(block_ids):
                break

            block_id = block_ids[i]
            block = self.blocks[block_id]

            if block.block_hash == block_hash:
                # Already in the cache (hit in lookup_prefix); nothing to do.
                continue

            block.block_hash = block_hash
            block.token_ids = tuple(token_ids[start:end])
            block.status = BlockStatus.ALLOCATED
            block.ref_count = 1

            # Insert into hash table
            self.hash_table[block_hash] = block

    def release_blocks(self, block_ids):
        """
        Release blocks when a request completes.

        Decrements ref_count. If ref_count reaches 0, the block
        moves to the unreferenced LRU (eligible for eviction)
        but is NOT freed -- it stays in the cache for potential reuse.
        """
        for block_id in block_ids:
            block = self.blocks[block_id]
            block.ref_count -= 1

            if block.ref_count == 0:
                block.status = BlockStatus.CACHED
                block.last_access_time = time.monotonic()
                self.unreferenced_lru[block_id] = block

    def _reference_block(self, block):
        """Increment ref_count when a request matches a cached block."""
        block.ref_count += 1
        block.last_access_time = time.monotonic()
        block.status = BlockStatus.ALLOCATED

        # Remove from unreferenced LRU (it is now referenced)
        self.unreferenced_lru.pop(block.block_id, None)

    def _allocate_one_block(self):
        """Allocate a free block. Returns block_id or None."""
        if self.free_blocks:
            return self.free_blocks.pop()
        return None

    def _evict_one_block(self):
        """
        Evict the least recently used unreferenced block.

        Only blocks with ref_count=0 are evictable.
        Referenced blocks (in use by active requests) are never evicted.
        """
        if not self.unreferenced_lru:
            return None

        # Pop the oldest (least recently used) entry
        block_id, block = self.unreferenced_lru.popitem(last=False)

        # Remove from hash table
        if block.block_hash in self.hash_table:
            del self.hash_table[block.block_hash]

        # Reset block metadata
        block.block_hash = None
        block.token_ids = None
        block.ref_count = 0
        block.status = BlockStatus.FREE

        self.stats.evictions += 1
        return block_id
ℹ️ Zero Overhead at 0% Hit Rate

When no prefix is shared, lookup_prefix() performs $\lceil L / B \rceil$ hash table lookups, each hitting None immediately. For a 4,096-token prompt with $B = 16$, that is 256 lookups. Each lookup is a Python dict probe: ~50-100 nanoseconds. Total overhead: 12.8-25.6 microseconds. The prefill for 4,096 tokens takes ~80 milliseconds on H100, so the cache lookup is at most $0.032\%$ of the prefill time. This is what “zero overhead” means: the cache cost is unmeasurable in practice.
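The probe cost is easy to sanity-check. A sketch, not a rigorous benchmark: absolute numbers vary by CPU and Python version, and include lambda call overhead:

```python
import os
import timeit

# Simulate a populated hash table: 8,587 cached blocks keyed by 32-byte digests
table = {os.urandom(32): i for i in range(8587)}
missing_key = os.urandom(32)   # not in the table: every probe is a miss

assert table.get(missing_key) is None   # a miss returns immediately

n = 1_000_000
seconds = timeit.timeit(lambda: table.get(missing_key), number=n)
print(f"{seconds / n * 1e9:.0f} ns per miss probe")
```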

Eviction Policy: Referenced vs Unreferenced Partitions

Two Cache Partitions Plus a Free Pool

Cached blocks fall into two partitions, referenced and unreferenced, alongside a pool of free blocks:

  1. Referenced blocks (ref_count > 0): in use by active requests. Never evictable.
  2. Unreferenced cached blocks (ref_count = 0, in hash table): completed requests’ KV blocks. Evictable in LRU order.
  3. Free blocks (never allocated, or fully evicted): available for immediate allocation.

class EvictionPolicy:
    """
    Two-tier eviction:
    1. First, allocate from free blocks (no eviction needed)
    2. Then, evict from unreferenced cached blocks (LRU order)
    3. Never touch referenced blocks

    This ensures that active requests are never disrupted by
    prefix caching, and cached prefixes are preserved as long
    as possible.
    """

    def __init__(self, prefix_cache):
        self.cache = prefix_cache

    def allocate_with_eviction(self, num_blocks_needed):
        """Allocate blocks, evicting if necessary."""
        allocated = []

        for _ in range(num_blocks_needed):
            # Tier 1: Free blocks
            if self.cache.free_blocks:
                block_id = self.cache.free_blocks.pop()
                allocated.append(block_id)
                continue

            # Tier 2: Evict unreferenced cached blocks (LRU)
            if self.cache.unreferenced_lru:
                block_id = self.cache._evict_one_block()
                allocated.append(block_id)
                continue

            # Tier 3: No blocks available
            # This means all blocks are referenced by active requests.
            # The scheduler must preempt a request to free blocks.
            raise OutOfBlocksError(
                f"All {self.cache.num_blocks} blocks are referenced. "
                f"Need to preempt requests to free blocks."
            )

        return allocated

    def get_eviction_candidates(self, num_needed):
        """
        Preview which blocks would be evicted without actually evicting them.
        Used by the scheduler to estimate whether a new request can be admitted.
        """
        available = len(self.cache.free_blocks) + len(self.cache.unreferenced_lru)
        can_satisfy = min(num_needed, available)

        candidates = []
        # Free blocks first (no eviction needed)
        free_count = min(num_needed, len(self.cache.free_blocks))
        candidates.extend(self.cache.free_blocks[:free_count])

        # Then LRU unreferenced blocks
        remaining = num_needed - free_count
        if remaining > 0:
            lru_items = list(self.cache.unreferenced_lru.keys())[:remaining]
            candidates.extend(lru_items)

        return candidates, can_satisfy >= num_needed
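The two-tier order — drain the free list first, then evict from the LRU head — can be sketched standalone. The block IDs and pool sizes here are made up for illustration:

```python
from collections import OrderedDict

free = [0, 1]                              # tier 1: free blocks
lru = OrderedDict([(2, None), (3, None)])  # tier 2: unreferenced cached blocks, oldest first

def take(n):
    """Allocate n blocks: free list first, then evict from the LRU head."""
    got = []
    for _ in range(n):
        if free:
            got.append(free.pop())
        elif lru:
            block_id, _ = lru.popitem(last=False)  # evict least recently used
            got.append(block_id)
        else:
            raise RuntimeError("all blocks referenced; scheduler must preempt")
    return got

allocated = take(3)
print(allocated)    # [1, 0, 2]: two free blocks, then block 2 evicted
print(list(lru))    # [3]: block 3 survives
```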

Eviction Ordering Within the LRU

The unreferenced LRU is ordered by last access time. When a block transitions from referenced to unreferenced (request completes), it enters the LRU at the tail (most recently used). Eviction always removes from the head (least recently used).

There is a subtlety: when a cached block is re-referenced (a new request matches the same prefix), it is removed from the LRU entirely. When that request completes, the block re-enters the LRU at the tail. This means frequently-accessed prefixes naturally stay in the cache even under memory pressure, because they keep cycling through the referenced partition.

# Timeline example:
# t=0: Request A arrives with system prompt "You are a helpful assistant..."
#       - Blocks 0-3 allocated, prefilled, inserted into cache, ref_count=1
# t=1: Request A completes
#       - Blocks 0-3: ref_count=0, enter LRU at tail
#       - LRU: [0, 1, 2, 3] (oldest to newest)
# t=2: Request B arrives with same system prompt
#       - lookup_prefix() matches blocks 0-3
#       - Blocks 0-3: ref_count=1, removed from LRU
#       - LRU: [] (empty)
# t=3: Request C arrives with different prompt, needs 4 blocks
#       - No cache hits. Free blocks 4-7 allocated.
# t=4: Request B completes
#       - Blocks 0-3: ref_count=0, re-enter LRU at tail
#       - LRU: [0, 1, 2, 3]
# t=5: Request C completes
#       - Blocks 4-7: ref_count=0, enter LRU at tail
#       - LRU: [0, 1, 2, 3, 4, 5, 6, 7]
# t=6: New request needs 6 blocks, only 2 free blocks remain
#       - Allocate 2 free blocks
#       - Evict 4 from LRU head: blocks 0, 1, 2, 3 (the system prompt!)
#       - The system prompt cache is now gone. Next request pays full prefill.
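The timeline above can be replayed with a minimal OrderedDict model of the unreferenced LRU. Block IDs follow the comments; this is a sketch of the bookkeeping, not the real block manager:

```python
from collections import OrderedDict

lru = OrderedDict()   # unreferenced blocks, oldest at the head

def release(block_ids):
    """Request completes: blocks enter the LRU at the tail."""
    for b in block_ids:
        lru[b] = True

def reference(block_ids):
    """Cache hit: blocks leave the LRU while a request holds them."""
    for b in block_ids:
        lru.pop(b, None)

release([0, 1, 2, 3])      # t=1: request A completes
reference([0, 1, 2, 3])    # t=2: request B matches the system prompt
release([0, 1, 2, 3])      # t=4: request B completes (re-enters at tail)
release([4, 5, 6, 7])      # t=5: request C completes

evicted = [lru.popitem(last=False)[0] for _ in range(4)]   # t=6: need 4 blocks
print(evicted)   # [0, 1, 2, 3] -- the system prompt blocks go first
```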
⚠️ Eviction of Hot Prefixes

The LRU policy can evict frequently-used prefixes during traffic bursts. If many diverse requests arrive between two system-prompt requests, the system prompt blocks get pushed to the LRU head and evicted. A frequency-aware eviction policy (LFU or ARC) would preserve hot prefixes better, but vLLM v1 uses pure LRU for simplicity and predictability. The hit rate data below shows this is sufficient for most production workloads.

Hit Rate by Workload Type

Measurement Methodology

Hit rate is measured as the fraction of prefix blocks that are found in the cache:

$$\text{hit\_rate} = \frac{\sum_{\text{requests}} \text{cached\_blocks}}{\sum_{\text{requests}} \text{total\_prefix\_blocks}}$$

📊 Prefix Cache Hit Rate by Workload Pattern

| Workload | Shared Prefix Length | Request Rate | Hit Rate | TTFT Reduction |
|---|---|---|---|---|
| Chatbot (fixed system prompt) | 512 tokens | 100 req/s | 92-97% | 78% |
| Multi-turn chat (system + history) | 1,024-4,096 tokens | 50 req/s | 80-95% | 65-85% |
| RAG (shared instructions, diverse docs) | 256 tokens shared, 2K unique | 200 req/s | 25-40% | 10-18% |
| Code completion (file context) | Variable, 500-8K tokens | 300 req/s | 15-35% | 8-25% |
| Batch summarization (diverse docs) | 64 tokens (instruction only) | 500 req/s | 8-15% | 2-5% |
| Random prompts (benchmark) | 0 tokens | 100 req/s | 0-2% | 0% |

Note: Hit rate depends on prefix overlap between consecutive requests. TTFT reduction = (cached tokens / total prompt tokens) × hit_rate.

TTFT with and without Prefix Caching (Llama 70B, TP=4, H100)

| Configuration | TTFT |
|---|---|
| No cache, 512-token prefix | 82 ms |
| Cached, 512-token prefix (97% hit) | 18 ms |
| No cache, 2K prefix | 245 ms |
| Cached, 2K prefix (90% hit) | 32 ms |
| No cache, 8K prefix | 890 ms |
| Cached, 8K prefix (85% hit) | 145 ms |

For the chatbot workload, prefix caching reduces TTFT by 4.5x. For multi-turn chat with long conversation histories, the reduction is up to 7.6x. The savings scale with prefix length because longer shared prefixes mean more prefill compute is skipped.
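The speedup figures follow directly from the TTFT measurements (the quoted 4.5x and 7.6x are these ratios, rounded):

```python
# TTFT in milliseconds, keyed by (prefix length, cached?)
ttft = {
    ("512", False): 82, ("512", True): 18,
    ("2K", False): 245, ("2K", True): 32,
    ("8K", False): 890, ("8K", True): 145,
}

speedups = {p: ttft[(p, False)] / ttft[(p, True)] for p in ("512", "2K", "8K")}
for prefix, s in speedups.items():
    print(f"{prefix} prefix: {s:.2f}x faster TTFT with caching")
```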

Hit Rate Optimization Strategies

class PrefixCacheOptimizer:
    """
    Strategies to maximize prefix cache hit rate.
    """

    def optimize_prompt_ordering(self, requests):
        """
        Strategy 1: Reorder requests to maximize prefix sharing.

        Sort requests by prompt prefix so that requests with the
        same prefix are processed consecutively. This maximizes
        the chance that the prefix is still in cache when the next
        request with the same prefix arrives.
        """
        # Compute prefix hash for each request (first N tokens)
        prefix_length = 64  # Compare first 64 tokens
        request_prefixes = []

        for req in requests:
            prefix = tuple(req.token_ids[:prefix_length])
            request_prefixes.append((prefix, req))

        # Sort by prefix (groups same-prefix requests together)
        request_prefixes.sort(key=lambda x: x[0])
        return [req for _, req in request_prefixes]

    def estimate_cache_budget(self, workload_profile):
        """
        Strategy 2: Size the cache based on working set.

        The working set = number of unique prefixes * blocks per prefix.
        Cache should hold at least the working set for high hit rate.
        """
        unique_prefixes = workload_profile.num_unique_prefixes
        avg_prefix_blocks = workload_profile.avg_prefix_length // 16

        working_set_blocks = unique_prefixes * avg_prefix_blocks

        # Add 20% headroom for LRU churn
        recommended_blocks = int(working_set_blocks * 1.2)

        # Convert to memory
        bytes_per_block = workload_profile.block_size_bytes
        recommended_memory_gb = recommended_blocks * bytes_per_block / (1024**3)

        return recommended_blocks, recommended_memory_gb

    def monitor_and_adjust(self, cache, interval_seconds=60):
        """
        Strategy 3: Dynamic monitoring.

        Track hit rate over time. If hit rate drops below threshold,
        increase cache size (if memory permits) or alert operator.
        """
        snapshot = cache.stats.snapshot()

        hit_rate = snapshot.block_hits / max(
            snapshot.block_hits + snapshot.block_misses, 1
        )
        eviction_rate = snapshot.evictions / max(interval_seconds, 1)

        metrics = {
            'hit_rate': hit_rate,
            'eviction_rate_per_sec': eviction_rate,
            'cache_utilization': len(cache.hash_table) / cache.num_blocks,
            'unreferenced_fraction': len(cache.unreferenced_lru) / cache.num_blocks,
        }

        return metrics
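Plugging representative numbers into the cache-sizing logic of estimate_cache_budget makes the strategy concrete. The workload figures below are hypothetical; the 5,242,880-byte block size is the Llama 70B GQA-8 FP16 value from earlier:

```python
# Hypothetical workload: 20 distinct system prompts, 512 tokens each
unique_prefixes = 20
avg_prefix_blocks = 512 // 16                        # 32 blocks per prefix
working_set = unique_prefixes * avg_prefix_blocks    # 640 blocks

recommended_blocks = int(working_set * 1.2)          # 20% headroom for LRU churn
bytes_per_block = 5_242_880                          # Llama 70B GQA-8, FP16
memory_gb = recommended_blocks * bytes_per_block / 1024**3

print(recommended_blocks, round(memory_gb, 2))       # 768 3.75
```

So a 3.75 GB slice of the 45 GB KV pool would cover this working set with headroom, leaving the rest for active-request blocks.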

The Full Prefix Cache Manager

Combining hash chains, lookup, allocation, eviction, and insertion into a single manager:

class PrefixCacheManager:
    """
    Complete prefix cache manager for vLLM v1.

    Lifecycle:
    1. New request arrives: lookup_prefix() to find cached blocks
    2. Allocate blocks for uncached portion
    3. Prefill the uncached blocks (KV computation)
    4. Insert newly prefilled blocks into cache
    5. Decode until completion
    6. Release all blocks (move to unreferenced LRU)
    """

    def __init__(self, num_gpu_blocks, block_size=16):
        self.block_size = block_size
        self.cache = PrefixCache(num_gpu_blocks, block_size)
        self.hasher = BlockHasher(block_size)
        self.eviction = EvictionPolicy(self.cache)

        # Per-request tracking: request_id -> list of block_ids
        self.request_blocks = {}

    def on_request_arrive(self, request_id, token_ids):
        """
        Handle new request: find cached prefix, allocate remaining blocks.

        Returns:
            cached_token_count: number of tokens found in cache
            all_block_ids: ordered list of block IDs for the full sequence
            needs_prefill_from: token position to start prefill
        """
        # Step 1: Find longest cached prefix
        cached_tokens, cached_block_ids = self.cache.lookup_prefix(token_ids)

        # Step 2: Allocate blocks for uncached portion
        total_tokens = len(token_ids)
        total_blocks = (total_tokens + self.block_size - 1) // self.block_size
        new_blocks_needed = total_blocks - len(cached_block_ids)

        new_block_ids = self.eviction.allocate_with_eviction(new_blocks_needed)

        # Step 3: Combine cached + new blocks
        all_block_ids = cached_block_ids + new_block_ids
        self.request_blocks[request_id] = all_block_ids

        return cached_tokens, all_block_ids, cached_tokens

    def on_prefill_complete(self, request_id, token_ids):
        """
        After prefill: insert new blocks into the cache.

        Only blocks that were NOT cached need insertion.
        """
        block_ids = self.request_blocks[request_id]
        self.cache.insert_blocks(token_ids, block_ids)

    def on_new_decode_block(self, request_id, block_position):
        """
        During decode: allocate a new block when the current block is full.

        Called when the sequence length crosses a block boundary.
        """
        new_block_id = self.eviction.allocate_with_eviction(1)[0]
        self.request_blocks[request_id].append(new_block_id)
        return new_block_id

    def on_request_complete(self, request_id):
        """
        Request finished: release all blocks to the cache.

        Blocks remain in the hash table for future reuse.
        Their ref_count drops to 0, making them evictable.
        """
        block_ids = self.request_blocks.pop(request_id)
        self.cache.release_blocks(block_ids)

    def on_request_preempted(self, request_id):
        """
        Request preempted by scheduler (e.g., out of KV cache space).

        Same as completion: release blocks. The KV data is still valid,
        so the blocks stay in the cache. If the request is resumed later,
        it can re-match its prefix from the cache.
        """
        block_ids = self.request_blocks.pop(request_id)
        self.cache.release_blocks(block_ids)

    def get_stats(self):
        """Return cache statistics."""
        stats = self.cache.stats
        total = stats.block_hits + stats.block_misses
        hit_rate = stats.block_hits / max(total, 1)

        return {
            'hit_rate': hit_rate,
            'block_hits': stats.block_hits,
            'block_misses': stats.block_misses,
            'evictions': stats.evictions,
            'collisions': stats.collisions,
            'cached_blocks': len(self.cache.hash_table),
            'free_blocks': len(self.cache.free_blocks),
            'unreferenced_blocks': len(self.cache.unreferenced_lru),
            'referenced_blocks': sum(
                1 for b in self.cache.blocks.values() if b.ref_count > 0
            ),
        }

class PrefixCacheStats:
    """Mutable statistics counters."""

    def __init__(self):
        self.block_hits = 0
        self.block_misses = 0
        self.evictions = 0
        self.collisions = 0

    def snapshot(self):
        """Return a copy of current stats."""
        snap = PrefixCacheStats()
        snap.block_hits = self.block_hits
        snap.block_misses = self.block_misses
        snap.evictions = self.evictions
        snap.collisions = self.collisions
        return snap
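To see the lifecycle end to end, here is a deliberately tiny, self-contained model: 4-token blocks, a bare dict for the hash table, and no ref counting or eviction. It shows the second request hitting the shared system-prompt blocks:

```python
import hashlib
import struct

BLOCK = 4  # tiny block size for the demo

def chain_hashes(tokens):
    """Hash chain over complete blocks, as in BlockHasher."""
    prev, out = b'\x00' * 32, []
    for i in range(len(tokens) // BLOCK):
        h = hashlib.sha256()
        h.update(prev)
        h.update(struct.pack('<I', i))
        for t in tokens[i * BLOCK:(i + 1) * BLOCK]:
            h.update(struct.pack('<i', t))
        prev = h.digest()
        out.append(prev)
    return out

cache = {}        # block_hash -> block_id
next_block = 0

def serve(tokens):
    """Return how many prompt tokens were served from cache."""
    global next_block
    hashes = chain_hashes(tokens)
    hits = 0
    for h in hashes:
        if h not in cache:
            break
        hits += 1
    for h in hashes[hits:]:          # "prefill" and insert the rest
        cache[h] = next_block
        next_block += 1
    return hits * BLOCK

system_prompt = list(range(12))      # 3 full blocks, shared by both requests
a = serve(system_prompt + [100, 101, 102, 103])
b = serve(system_prompt + [200, 201, 202, 203])
print(a, b)   # 0 12 -- cold cache, then the system prompt is reused
```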

Integration with the v1 Block Manager

The prefix cache integrates with vLLM v1’s block manager at two points: allocation and the scheduler’s capacity check.

class V1BlockManagerWithPrefixCache:
    """
    Extended block manager with prefix caching.

    The block manager's can_allocate() and allocate() methods
    are modified to account for cached blocks.
    """

    def __init__(self, num_blocks, block_size, enable_prefix_cache=True):
        self.num_blocks = num_blocks
        self.block_size = block_size

        if enable_prefix_cache:
            self.prefix_cache = PrefixCacheManager(num_blocks, block_size)
        else:
            self.prefix_cache = None

    def can_allocate(self, request):
        """
        Check if a request can be allocated without preemption.

        With prefix caching, the answer depends on:
        1. How many blocks are already cached (free prefill)
        2. How many free + evictable blocks are available
        """
        total_blocks_needed = (
            (len(request.token_ids) + self.block_size - 1) // self.block_size
        )

        if self.prefix_cache is not None:
            cached_tokens, _ = self.prefix_cache.cache.lookup_prefix(
                request.token_ids
            )
            cached_blocks = cached_tokens // self.block_size
            new_blocks_needed = total_blocks_needed - cached_blocks
        else:
            new_blocks_needed = total_blocks_needed

        if self.prefix_cache is not None:
            # Free blocks plus unreferenced cached blocks (evictable on demand)
            available = (
                len(self.prefix_cache.cache.free_blocks) +
                len(self.prefix_cache.cache.unreferenced_lru)
            )
        else:
            available = self.num_blocks

        return new_blocks_needed <= available

    def allocate(self, request):
        """Allocate blocks for a request, using prefix cache if available."""
        if self.prefix_cache is not None:
            return self.prefix_cache.on_request_arrive(
                request.request_id, request.token_ids
            )
        else:
            # No prefix cache: allocate all blocks fresh
            total_blocks = (
                (len(request.token_ids) + self.block_size - 1) // self.block_size
            )
            block_ids = list(range(total_blocks))  # simplified
            return 0, block_ids, 0

    def free(self, request):
        """Release blocks for a completed/preempted request."""
        if self.prefix_cache is not None:
            self.prefix_cache.on_request_complete(request.request_id)
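
A worked example of the capacity arithmetic inside can_allocate, assuming a 16-token block size and a 100-token prompt whose first 48 tokens hit the cache (all numbers are illustrative):

```python
def blocks_needed(num_tokens: int, block_size: int = 16) -> int:
    # Ceil division: a partially filled last block still occupies a full block.
    return (num_tokens + block_size - 1) // block_size

total = blocks_needed(100)   # 7 blocks for a 100-token prompt
cached = 48 // 16            # 3 full blocks already resident in the cache
new_needed = total - cached  # only 4 fresh blocks must be free or evictable
print(total, cached, new_needed)  # 7 3 4
```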

Measuring Cache Effectiveness

Prefix Length Distribution Analysis

class PrefixAnalyzer:
    """
    Analyze workload to determine prefix caching effectiveness.

    Run on a sample of requests to estimate potential savings.
    """

    def __init__(self, block_size=16):
        self.block_size = block_size
        self.hasher = BlockHasher(block_size)

    def analyze(self, request_log):
        """
        Analyze a log of requests to compute prefix sharing statistics.

        request_log: list of token_id lists (one per request)
        """
        # Compute block hashes for all requests
        all_chains = []
        for token_ids in request_log:
            chain = self.hasher.compute_prefix_hashes(token_ids)
            all_chains.append(chain)

        # Count hash occurrences
        hash_counts = {}
        for chain in all_chains:
            for block_hash in chain:
                h = block_hash.hex()
                hash_counts[h] = hash_counts.get(h, 0) + 1

        # Compute shared prefix statistics
        shared_blocks = sum(1 for count in hash_counts.values() if count > 1)
        total_blocks = sum(len(chain) for chain in all_chains)
        unique_blocks = len(hash_counts)

        # Compute potential savings
        # If all shared blocks are cached, prefill compute savings:
        reusable_blocks = sum(
            count - 1 for count in hash_counts.values() if count > 1
        )
        potential_savings = reusable_blocks / max(total_blocks, 1)

        return {
            'total_requests': len(request_log),
            'total_blocks': total_blocks,
            'unique_blocks': unique_blocks,
            'shared_blocks': shared_blocks,
            'reusable_block_instances': reusable_blocks,
            'potential_prefill_savings': potential_savings,
            'avg_shared_prefix_length': (
                reusable_blocks * self.block_size / max(len(request_log), 1)
            ),
        }
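
To make the sharing statistics concrete, here is a self-contained sketch of the same hash-chain counting. The `prefix_hashes` function is a toy stand-in for the article's BlockHasher (it simply SHA-256-chains block contents), not the real implementation:

```python
import hashlib

def prefix_hashes(token_ids, block_size=16):
    # Chain construction: each block's hash covers the entire prefix,
    # so equal hashes imply equal full prefixes, not just equal blocks.
    hashes, prev = [], b""
    for i in range(0, len(token_ids) // block_size * block_size, block_size):
        block = token_ids[i:i + block_size]
        prev = hashlib.sha256(prev + repr(block).encode()).digest()
        hashes.append(prev)
    return hashes

# Two requests share a 32-token system prompt, then diverge.
shared = list(range(32))
req_a = shared + list(range(100, 132))
req_b = shared + list(range(200, 232))

counts = {}
for req in (req_a, req_b):
    for h in prefix_hashes(req):
        counts[h] = counts.get(h, 0) + 1

reusable = sum(c - 1 for c in counts.values() if c > 1)
total = sum(counts.values())
print(reusable, total, reusable / total)  # 2 8 0.25
```

The two shared system-prompt blocks produce identical chain hashes across both requests; the first block after the divergence point hashes differently, and the chains never re-converge even where later token blocks happen to match.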

Prefix Cache Memory Overhead

| Component | Per Block | 8,587 Blocks (45 GB KV pool) | Impact |
|---|---|---|---|
| CachedBlock metadata | 64 bytes | 537 KB | 0.001% of KV pool |
| Hash table entry | 96 bytes | 805 KB | 0.002% of KV pool |
| LRU linked list node | 24 bytes | 201 KB | negligible |
| Token ID storage (verification) | 64 bytes (16 x int32) | 537 KB | 0.001% of KV pool |
| Total metadata overhead | 248 bytes | 2.08 MB | 0.004% of KV pool |
Note: The prefix cache metadata is stored in CPU memory, not GPU HBM. The KV cache blocks themselves are on the GPU.

The prefix cache adds 2 MB of CPU memory overhead for an 8,587-block KV pool. This is negligible. The GPU memory layout is unchanged: blocks are allocated identically whether prefix caching is enabled or not. The only difference is the allocation and eviction logic.
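
The per-component numbers can be checked with a few lines of arithmetic (component sizes are taken from the table above; the block count corresponds to the 45 GB pool):

```python
num_blocks = 8587  # blocks in the 45 GB KV pool
components = {     # per-block metadata bytes, from the table above
    "CachedBlock metadata": 64,
    "hash table entry": 96,
    "LRU linked list node": 24,
    "token ID storage": 16 * 4,  # 16 token IDs x int32
}
for name, size in components.items():
    print(f"{name}: {size * num_blocks / 1024:.0f} KB")
print("per-block total:", sum(components.values()), "bytes")  # 248 bytes
```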

Summary

vLLM v1’s prefix caching achieves zero overhead at 0% hit rate through content-addressable hash chains and a simple hash table lookup. The design separates blocks into referenced (never evictable), unreferenced cached (LRU evictable), and free partitions. The hash chain construction ensures that block identity encodes the entire prefix, preventing false matches across different prompt contexts. For chatbot workloads with shared system prompts, hit rates of 92-97% reduce TTFT by 4-8x. The metadata overhead is 2 MB for a 45 GB KV cache pool. The prefix cache manager integrates cleanly with v1’s block manager, requiring modifications only to the allocation and free paths.