Every LLM serving system faces the same bottleneck: GPU memory. Not compute, not network bandwidth — memory. Specifically, the KV cache that stores intermediate attention state for every in-flight request consumes the majority of available GPU HBM during inference. Before PagedAttention, the standard approach was to pre-allocate a contiguous chunk of memory for each request sized to the maximum possible sequence length. The result was catastrophic waste — on average, only 32% of allocated KV cache memory held actual data. The other 68% sat empty, reserving space for tokens that would never be generated.
vLLM’s PagedAttention, introduced by Kwon et al. in 2023, solved this by applying an idea that operating systems perfected decades ago: virtual memory with demand paging. Instead of pre-allocating contiguous memory, PagedAttention divides GPU HBM into fixed-size blocks and allocates them on demand as tokens are generated. The result is near-zero internal fragmentation, dramatically higher memory utilization, and 2-4x throughput improvements from fitting more concurrent requests into the same GPU.
This post dissects PagedAttention from the ground up: the memory waste problem it solves, the OS virtual memory analogy that inspired it, the block allocation mechanics that make it work, the indirection cost you pay for it, and the advanced techniques (copy-on-write, prefix caching, preemption) that build on top of it. We will also examine when PagedAttention’s overhead is not worth it.
The Memory Waste Crisis in LLM Serving
To understand why PagedAttention matters, you need to understand how KV cache memory was managed before it existed.
How Traditional KV Cache Allocation Works
During autoregressive generation, a transformer model computes key and value vectors at every layer for every token in the sequence. These vectors are cached so they do not need to be recomputed on subsequent generation steps. For a model with L layers, H KV attention heads, and head dimension d, the KV cache for a single token requires 2 × L × H × d × bytes_per_element bytes (the factor of 2 covers both K and V).
For Llama-2-70B with 80 layers, 8 KV heads (GQA), head dimension 128, and FP16 storage (2 bytes per element):

2 × 80 × 8 × 128 × 2 bytes = 320 KB per token

For a sequence of 2048 tokens, that is 2048 × 320 KB = 640 MB per request. On an A100 with 80 GB of HBM, after loading the model weights (~35 GB for FP16 Llama-70B), you have roughly 40-42 GB left for KV cache. That means you can serve at most ~64 concurrent requests at max sequence length.
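As a sanity check, the sizing arithmetic above can be reproduced in a few lines (a back-of-envelope sketch, not serving code):

```python
# KV cache sizing for Llama-2-70B, using the values from the text above.
num_layers, num_kv_heads, head_dim, dtype_bytes = 80, 8, 128, 2  # FP16 = 2 bytes

# Factor of 2 covers both K and V vectors.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
assert kv_bytes_per_token == 320 * 1024            # 320 KB per token

per_request = 2048 * kv_bytes_per_token            # max_seq_len = 2048
assert per_request == 640 * 1024**2                # 640 MB per request

kv_budget = 40 * 1024**3                           # ~40 GB left after weights
print(kv_budget // per_request)                    # → 64 concurrent requests
```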
The traditional approach pre-allocates this entire 640 MB block per request at the time of admission:
```python
import torch

class TraditionalKVCache:
    def __init__(self, max_seq_len: int = 2048, num_layers: int = 80,
                 num_kv_heads: int = 8, head_dim: int = 128):
        # Pre-allocate for MAXIMUM sequence length, regardless of actual length
        self.k_cache = torch.zeros(
            num_layers, max_seq_len, num_kv_heads, head_dim,
            dtype=torch.float16, device='cuda'
        )  # 320 MB
        self.v_cache = torch.zeros(
            num_layers, max_seq_len, num_kv_heads, head_dim,
            dtype=torch.float16, device='cuda'
        )  # 320 MB
        self.current_length = 0

    def append_token(self, layer_idx: int, k: torch.Tensor, v: torch.Tensor):
        self.k_cache[layer_idx, self.current_length] = k
        self.v_cache[layer_idx, self.current_length] = v

    def advance(self):
        # Called once per token, after all layers have written their K/V.
        self.current_length += 1
```
The problem is obvious: if a request generates only 200 tokens, 90% of the allocation is wasted. But the system cannot know in advance how long a sequence will be — it depends on the model’s output, which is determined token by token.
Quantifying the Waste
In production LLM serving workloads, sequence lengths follow a heavy-tailed distribution. Most requests are short (chatbot responses, summaries), but max_seq_len must accommodate the longest possible output. Empirical measurements from the vLLM paper show:
KV Cache Memory Utilization in Traditional Allocation
| Workload | Avg Seq Len | Max Seq Len | Avg Utilization | Memory Wasted |
|---|---|---|---|---|
| Chatbot (ShareGPT) | 215 tokens | 2048 tokens | 28% | 72% |
| Code Generation | 412 tokens | 4096 tokens | 32% | 68% |
| Summarization | 524 tokens | 2048 tokens | 38% | 62% |
| Translation | 186 tokens | 1024 tokens | 30% | 70% |
| Weighted Average | - | - | 32% | 68% |
The average utilization across workloads is approximately 32%. That means 68% of GPU memory dedicated to KV cache is wasted on empty reservations.
If you can fit only 20 requests in memory and each uses about 30% of its allocation, the resident KV data amounts to roughly 6 requests' worth — yet all 20 slots are occupied. Requests that could have been batched concurrently are instead stuck in a queue, waiting. Memory waste translates directly into throughput loss.
The Two Types of Fragmentation
Traditional KV cache allocation suffers from both internal and external fragmentation, mirroring the classic memory management problems studied in operating systems:
Internal fragmentation occurs within each allocation. A request allocated 2048 tokens worth of memory but only using 300 tokens has 2048 − 300 = 1,748 token slots of internal fragmentation. This is the dominant source of waste, accounting for roughly 60-70% of total memory loss.
External fragmentation occurs between allocations. When requests complete at different times, they leave gaps in the contiguous memory space. A newly arriving request needing 2048 contiguous tokens may find that the total free memory is sufficient but no single contiguous block is large enough. This accounts for another 10-20% of waste, depending on the allocation pattern.
Reservation fragmentation is a third, subtler issue. Even before a request begins generating tokens, the system must reserve memory for the maximum possible output. If the scheduler reserves max_seq_len tokens for 50 requests but only 30 could actually fit if their true lengths were known, 20 requests are unnecessarily delayed.
Memory Waste Breakdown (A100-80GB, Llama-70B, 50 concurrent requests)
The OS Virtual Memory Analogy
The insight behind PagedAttention is that KV cache management maps almost perfectly onto the virtual memory problem that operating systems solved in the 1960s. Let us trace this analogy precisely.
Virtual Memory in Operating Systems
In a modern OS, each process sees a contiguous virtual address space. The hardware (MMU) and OS kernel collaborate to map this virtual space onto physical RAM in fixed-size units called pages (typically 4 KB). Key properties:
- No pre-allocation: A process can declare a large virtual address space without consuming physical memory. Pages are allocated on demand when first accessed (demand paging).
- Non-contiguous physical storage: Virtual pages 0, 1, 2 of a process might map to physical frames 847, 12, 503. The process sees a contiguous sequence; the hardware handles the indirection.
- Page table: A per-process data structure maps virtual page numbers to physical frame numbers. The MMU consults this on every memory access.
- Fragmentation elimination: Since all allocation units are the same fixed size, external fragmentation is impossible. Internal fragmentation is limited to the last page (at most `page_size - 1` bytes per allocation).
The PagedAttention Mapping
PagedAttention applies this exact scheme to KV cache management:
| OS Virtual Memory | PagedAttention KV Cache |
|---|---|
| Physical RAM | GPU HBM (block pool) |
| Page frame (4 KB) | KV block (16 tokens of KV data) |
| Virtual address space | Logical KV cache per request |
| Page table (per process) | Block table (per sequence) |
| MMU translation | Attention kernel block table lookup |
| Demand paging | Allocate blocks as tokens are generated |
| Page fault handler | Block allocator (pop from free list) |
| Swapping to disk | Swapping KV blocks to CPU memory |
| Copy-on-write fork | Shared prefix blocks in beam search |
PagedAttention GPU HBM Layout (A100-80GB, Llama-70B)
(figure: GPU HBM address-space layout — ~35 GB of model weights, a ~40 GB KV block pool, plus two smaller regions of ~512 MB and ~128 MB)

The crucial difference from OS virtual memory: there is no hardware MMU on the GPU performing address translation. The indirection must be done in software, inside the attention kernel itself. This is the fundamental cost of PagedAttention — every attention computation must dereference the block table. We will quantify this cost in a later section.
How It Eliminates Waste
With PagedAttention, a new request starts with zero allocated blocks. As the prefill phase processes prompt tokens, blocks are allocated to hold the resulting KV data. During decode, each time the sequence grows past a block boundary, one new block is allocated. When the request completes, all its blocks are returned to the free list.
The only waste is in the last block of each sequence, which may be partially filled. With a block size of 16 tokens, the average waste per request is 8 tokens — about 2.5 MB (8 × 320 KB) for Llama-70B. Compare this to the traditional approach, where average waste is hundreds of megabytes per request.
For max_seq_len = 2048, avg_seq_len = 300, block_size = 16:

- Traditional: (2048 − 300) × 320 KB ≈ 546 MB wasted per request (1,748 empty token slots)
- PagedAttention: at most 16 × 320 KB = 5 MB wasted per request (≤ 16 empty token slots)

That is a 109x reduction in per-request memory waste (1,748 / 16 ≈ 109).
Block Allocation Mechanics
The block allocator is the core data structure of PagedAttention. Its design is deceptively simple but carefully optimized for the constraints of GPU serving.
The Free Block List
The allocator maintains a free list implemented as a stack (LIFO). This is a deliberate choice:
```python
from typing import List, Dict, Optional

class BlockAllocator:
    """GPU KV cache block allocator. O(1) alloc and free."""

    def __init__(self, num_blocks: int, block_size: int, device: str = 'cuda'):
        self.num_blocks = num_blocks
        self.block_size = block_size
        self.device = device
        # Free list as a stack: most recently freed blocks are reused first.
        # LIFO order improves GPU L2 cache hit rates because recently-active
        # blocks are "warm" in the cache hierarchy.
        self.free_list: List[int] = list(range(num_blocks))

    def allocate(self) -> Optional[int]:
        """Pop a block from the free list. O(1)."""
        if not self.free_list:
            return None  # Out of memory — trigger preemption
        return self.free_list.pop()

    def free(self, block_id: int) -> None:
        """Push a block back onto the free list. O(1)."""
        self.free_list.append(block_id)

    def num_free_blocks(self) -> int:
        return len(self.free_list)
```
A stack (LIFO) reuses the most recently freed blocks first. On GPUs, the L2 cache is typically 40-50 MB (A100: 40 MB). A recently freed block’s data may still reside in L2. When the block is reallocated and written with new KV data, the write may partially hit L2 instead of going to HBM. With FIFO, freed blocks cycle through the entire pool before reuse, guaranteeing cold cache lines. In practice, LIFO improves L2 hit rates by 5-12% for the block write path.
Both allocate() and free() are O(1). There is no searching, no coalescing, no buddy system. This simplicity is possible because all blocks are the same size — the same insight that makes OS page allocation trivial compared to general-purpose malloc.
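A toy demonstration of the stack discipline, standalone with the free list inlined as a plain Python list rather than the class above:

```python
# Demonstrates LIFO reuse: the most recently freed block comes back first.
free_list = list(range(8))          # toy pool of 8 blocks

a = free_list.pop()                 # allocate two blocks
b = free_list.pop()
assert (a, b) == (7, 6)             # the stack pops the highest IDs first here

free_list.append(a)                 # free block `a`
assert free_list.pop() == a         # reused immediately (likely still warm in L2)
```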
The Block Table
Each active sequence maintains a block table — an array that maps logical block indices to physical block IDs:
```python
class SequenceBlockTable:
    """Per-sequence mapping from logical to physical blocks."""

    def __init__(self):
        self.table: List[int] = []  # table[logical_idx] = physical_block_id

    def append_block(self, physical_block_id: int):
        """Sequence grew past current block boundary. Add new mapping."""
        self.table.append(physical_block_id)

    def get_physical_block(self, logical_idx: int) -> int:
        return self.table[logical_idx]

    def num_blocks(self) -> int:
        return len(self.table)

    def release_all(self, allocator: BlockAllocator):
        """Sequence completed. Return all blocks to free list."""
        for block_id in self.table:
            allocator.free(block_id)
        self.table.clear()
```
On the GPU side, the block tables for all active sequences are packed into a single 2D tensor of shape [max_num_seqs, max_blocks_per_seq] and passed to the attention kernel:
```cuda
// GPU kernel helper: resolve logical token position to physical HBM address
__device__ inline void* resolve_kv_address(
    const int* __restrict__ block_table,  // [max_seqs, max_blocks_per_seq]
    int seq_idx,
    int token_position,
    int block_size,
    int stride,            // max_blocks_per_seq
    void* block_pool,      // base address of KV block pool in HBM
    size_t block_bytes     // bytes per block
) {
    int logical_block = token_position / block_size;
    int offset_in_block = token_position % block_size;
    int physical_block = block_table[seq_idx * stride + logical_block];
    return (char*)block_pool + physical_block * block_bytes
           + offset_in_block * TOKEN_KV_BYTES;  // TOKEN_KV_BYTES: KV bytes per token
}
```
Allocation Flow: Life of a Request
Let us trace the complete allocation flow for a single request with block_size = 16:
- Request arrives with a prompt of 45 tokens.
- Prefill phase: The scheduler computes ⌈45 / 16⌉ = 3 blocks needed. Three blocks are popped from the free list. Block table: [block_412, block_87, block_1903]. The third block is only 13/16 full (3 empty token slots).
- Decode step 1: Token 46 is generated. It fits in the third block (slot 14 of 16). No new allocation needed.
- Decode step 2: Token 47 fills slot 15. Still no allocation.
- Decode step 3: Token 48 fills slot 16, and the third block is now full.
- Decode step 4: Token 49 triggers allocation of a fourth block: [block_412, block_87, block_1903, block_55]. Token 49 goes into the first slot of the new block.
- Steps 5-N: Generation continues, allocating one new block every 16 tokens.
- Request completes at token 112. The block table holds ⌈112 / 16⌉ = 7 blocks. All 7 are returned to the free list.
Total memory used: 7 × 16 × 320 KB = 35 MB. Traditional allocation would have used: 2048 × 320 KB = 640 MB. Savings: 94.5%.
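The lifecycle above can be simulated in a few lines. This is a standalone sketch with an inlined free list and a hypothetical 5,000-block pool, not vLLM's scheduler:

```python
import math

BLOCK_SIZE = 16
BYTES_PER_TOKEN = 320 * 1024   # Llama-70B KV bytes per token (FP16, GQA)

free_list = list(range(5000))  # hypothetical pool of 5,000 blocks
block_table = []

def ensure_capacity(num_tokens):
    """Allocate blocks on demand so the table covers num_tokens tokens."""
    needed = math.ceil(num_tokens / BLOCK_SIZE)
    while len(block_table) < needed:
        block_table.append(free_list.pop())

ensure_capacity(45)            # prefill: ceil(45/16) = 3 blocks
assert len(block_table) == 3

for pos in range(45, 112):     # decode tokens 46..112
    ensure_capacity(pos + 1)

assert len(block_table) == 7   # ceil(112/16) = 7 blocks
used_mb = len(block_table) * BLOCK_SIZE * BYTES_PER_TOKEN / 2**20
traditional_mb = 2048 * BYTES_PER_TOKEN / 2**20
print(f"paged: {used_mb:.0f} MB, traditional: {traditional_mb:.0f} MB")
```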
The Indirection Cost: What You Pay for Paging
PagedAttention is not free. The block table indirection introduces measurable overhead in the attention kernel. Understanding this cost is essential for deciding whether PagedAttention is the right choice for your workload.
Source of Overhead
In a traditional contiguous KV cache, the attention kernel accesses K and V vectors with a simple stride:
```
k_addr = k_base + seq_idx * max_seq_len * kv_size + token_pos * kv_size
```
This is a single multiply-add, and consecutive tokens are adjacent in memory, enabling coalesced HBM reads. With PagedAttention, the access pattern becomes:
```
logical_block  = token_pos / block_size
physical_block = block_table[seq_idx * stride + logical_block]  // table lookup
offset         = token_pos % block_size
k_addr         = block_pool + physical_block * block_bytes + offset * kv_size
```
This introduces three costs:
- Pointer chasing: The block table lookup is a dependent memory access. The kernel cannot compute the HBM address until the table lookup completes. If the block table is not in L2/L1 cache, this is a full HBM round-trip (~400-600 ns on A100).
- Non-contiguous memory access: Physical blocks for a single sequence are scattered across HBM. When the attention kernel iterates over the context (all past tokens), it jumps between non-contiguous blocks. This defeats HBM's preference for sequential access and reduces effective bandwidth by 15-30%.
- Reduced kernel fusion opportunities: The scattered access pattern makes it harder to fuse the attention kernel with subsequent operations, because the memory layout is unpredictable at compile time.
Measured Overhead
Attention Kernel Latency: Contiguous vs PagedAttention (A100-80GB)
| Sequence Length | Contiguous (ms) | PagedAttention (ms) | Overhead |
|---|---|---|---|
| 128 tokens | 0.42 | 0.44 | +4.8% |
| 512 tokens | 1.21 | 1.31 | +8.3% |
| 1024 tokens | 2.38 | 2.65 | +11.3% |
| 2048 tokens | 4.71 | 5.38 | +14.2% |
| 4096 tokens | 9.42 | 10.83 | +15.0% |
The overhead ranges from ~4% for short sequences to ~15% for long sequences. Longer sequences require more block table lookups and more scattered memory accesses, increasing the relative overhead.
Why It Is Worth It
The 4-15% attention kernel overhead is the cost per request. But the benefit is measured in aggregate throughput — how many requests per second the system can serve. Because PagedAttention eliminates memory waste, it can fit 2-4x more concurrent requests in the same GPU memory:
System Throughput: Contiguous vs PagedAttention (A100-80GB, Llama-70B)
(figure: requests/sec vs. batch sizes 8-128 for contiguous allocation and PagedAttention)
At batch size 8, PagedAttention is actually slightly slower (~7% lower throughput) due to the per-request overhead with no memory pressure benefit. But at batch size 16, the contiguous approach is already at the memory wall — it cannot fit more requests. PagedAttention continues scaling to batch sizes of 64-128, achieving 3.4x higher peak throughput.
PagedAttention trades 4-15% per-request attention kernel latency for 2-4x system-level throughput. This trade-off is overwhelmingly favorable for serving workloads with multiple concurrent users. It is unfavorable only for single-user, latency-critical inference.
Copy-on-Write for Beam Search
Beam search is a decoding strategy where the model maintains candidate sequences (beams) at each step, expanding the most promising ones. All beams share the same prompt prefix. Without memory sharing, each beam requires its own copy of the prefix KV cache.
The Memory Problem
For beam width 4, prefix length 512 tokens, and Llama-70B:

- Without sharing: 4 × 512 × 320 KB = 640 MB of prefix KV cache (one copy per beam)
- With shared prefix blocks: 512 × 320 KB = 160 MB (a single copy)

Savings: 480 MB — a 75% reduction for the prefix portion alone.
How Copy-on-Write Works
PagedAttention implements copy-on-write (CoW) semantics borrowed from fork() in Unix:
- When a beam is forked, the child beam’s block table is a copy of the parent’s block table — pointing to the same physical blocks. No data is copied.
- A reference count is maintained per physical block. After forking, shared blocks have refcount > 1.
- When a beam needs to write new KV data into a block that is shared (refcount > 1), a new physical block is allocated, the old block’s data is copied into it, the beam’s block table is updated to point to the new block, and the old block’s refcount is decremented.
- If a beam needs to write into a block with refcount = 1, it writes in place. No copy needed.
```python
from collections import defaultdict
from typing import Dict, List

class CopyOnWriteBlockManager:
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.ref_counts: Dict[int, int] = defaultdict(lambda: 1)

    def fork_beam(self, parent_table: List[int]) -> List[int]:
        """Create a child beam sharing all parent blocks. O(num_blocks)."""
        child_table = parent_table.copy()  # Copy the table, not the data
        for block_id in child_table:
            self.ref_counts[block_id] += 1
        return child_table

    def write_token(self, block_table: List[int], logical_block_idx: int,
                    token_offset: int, kv_data) -> int:
        """Write KV data, copying the block first if it is shared."""
        physical_block = block_table[logical_block_idx]
        if self.ref_counts[physical_block] > 1:
            # Shared block: copy-on-write
            new_block = self.allocator.allocate()
            if new_block is None:
                raise MemoryError("Cannot allocate block for CoW")
            self._gpu_copy_block(physical_block, new_block)
            self.ref_counts[physical_block] -= 1
            self.ref_counts[new_block] = 1
            block_table[logical_block_idx] = new_block
            physical_block = new_block
        # Write KV data into the (now exclusive) block
        self._gpu_write_kv(physical_block, token_offset, kv_data)
        return physical_block

    def free_beam(self, block_table: List[int]):
        """Release a beam's blocks, respecting reference counts."""
        for block_id in block_table:
            self.ref_counts[block_id] -= 1
            if self.ref_counts[block_id] == 0:
                self.allocator.free(block_id)
                del self.ref_counts[block_id]

    def _gpu_copy_block(self, src: int, dst: int):
        # cuda memcpy from src block to dst block in block pool
        pass

    def _gpu_write_kv(self, block_id: int, offset: int, data):
        # write one token's K/V vectors into the block at the given offset
        pass
```
Memory Savings at Scale
Beam Search Memory Usage: Without vs With Copy-on-Write
(figure: memory usage in GB vs. beam width)

For beam width 8 with a 1024-token prefix on Llama-70B, CoW saves approximately (8 − 1) × 1024 × 320 KB ≈ 2.2 GB per request — the difference between fitting the workload on one GPU versus needing two.
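The beam-search prefix arithmetic can be checked directly (a sketch using the 320 KB/token figure from earlier in the post):

```python
# Prefix KV memory for beam search, with and without copy-on-write sharing.
KB = 1024
kv_per_token = 320 * KB            # Llama-70B, FP16
beam_width, prefix_len = 8, 1024

without_cow = beam_width * prefix_len * kv_per_token   # each beam owns a copy
with_cow = prefix_len * kv_per_token                   # one shared copy
saved_gb = (without_cow - with_cow) / KB**3
print(f"{saved_gb:.2f} GB")                            # → 2.19 GB
```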
Block Size Selection Analysis
Block size is the most important hyperparameter in PagedAttention. It determines the trade-off between three competing concerns: internal fragmentation, page table overhead, and kernel efficiency.
The Three-Way Trade-off
Internal fragmentation scales with block size. On average, the last block of each sequence is half full, wasting block_size / 2 token slots. Larger blocks mean more waste.
Page table overhead scales inversely with block size. More blocks per sequence means a larger block table, more block table entries to transfer to the GPU, and more pointer chasing in the kernel.
Kernel efficiency generally improves with larger blocks. GPU kernels operate most efficiently when processing data in large contiguous chunks. Larger blocks mean fewer loop iterations in the attention kernel and better utilization of GPU vector units.
Quantitative Analysis
Let us work through the numbers for Llama-70B on A100-80GB, serving 64 concurrent requests with average sequence length 512 tokens:
Block size = 4 tokens:

- Blocks per sequence: 512 / 4 = 128
- Internal fragmentation per request: ~2 tokens = 640 KB
- Total fragmentation (64 reqs): ~40 MB
- Block table size per request: 128 entries × 4 bytes = 512 B
- Total block table: 64 × 512 B = 32 KB
- Kernel loop iterations per attention: 128 (one per block)
- Block table lookups per attention: 128

Block size = 16 tokens:

- Blocks per sequence: 512 / 16 = 32
- Internal fragmentation per request: ~8 tokens = 2.5 MB
- Total fragmentation (64 reqs): ~160 MB
- Block table size per request: 32 entries × 4 bytes = 128 B
- Total block table: 64 × 128 B = 8 KB
- Kernel loop iterations per attention: 32
- Block table lookups per attention: 32

Block size = 64 tokens:

- Blocks per sequence: 512 / 64 = 8
- Internal fragmentation per request: ~32 tokens = 10 MB
- Total fragmentation (64 reqs): ~640 MB
- Block table size per request: 8 entries × 4 bytes = 32 B
- Total block table: 64 × 32 B = 2 KB
- Kernel loop iterations per attention: 8
- Block table lookups per attention: 8
Block Size Trade-off Analysis (Llama-70B, 64 concurrent reqs, avg 512 tokens)
| Block Size | Fragmentation (total) | Table Overhead | Kernel Iterations | Attention Overhead | Net Throughput |
|---|---|---|---|---|---|
| 4 tokens | 40 MB (0.1%) | 32 KB | 128/seq | +18% | 87% baseline |
| 8 tokens | 80 MB (0.2%) | 16 KB | 64/seq | +10% | 94% baseline |
| 16 tokens | 160 MB (0.4%) | 8 KB | 32/seq | +6% | 100% (baseline) |
| 32 tokens | 320 MB (0.8%) | 4 KB | 16/seq | +3% | 98% baseline |
| 64 tokens | 640 MB (1.6%) | 2 KB | 8/seq | +2% | 93% baseline |
The sweet spot is 16 tokens per block. At this size:
- Internal fragmentation is negligible (0.4% of KV pool)
- Block table overhead is minimal (fits in GPU L2 cache for all sequences)
- Kernel loop count (32 iterations for 512 tokens) is manageable
- Each block is 16 × 320 KB = 5 MB — large enough for efficient HBM burst reads
For long-context workloads (>32K tokens), increasing block size to 32 reduces the number of block table lookups proportionally. The increased fragmentation is less significant when sequences are long (e.g., 32 tokens wasted out of 32,768 is 0.1%). vLLM defaults to 16 but allows configuration via --block-size.
Effective Throughput vs Block Size (Normalized)
(figure: normalized throughput vs. block sizes 4-64, for short, medium, and long average sequence lengths)
Note that for long sequences, the optimal block size shifts toward 32 tokens, while for short sequences, 16 remains clearly optimal. This is because the relative cost of block table lookups is higher for long sequences (more lookups per attention computation), and fragmentation matters less when sequences are long.
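The fragmentation and table-overhead columns of the trade-off table can be reproduced with a short script (4-byte block IDs assumed, consistent with the table-overhead figures above):

```python
# Fragmentation and block-table overhead vs. block size
# (Llama-70B, 64 concurrent requests, avg 512 tokens).
import math

KB, MB = 1024, 1024 ** 2
kv_per_token = 320 * KB            # FP16 KV bytes per token
num_seqs, avg_len = 64, 512

rows = []
for block_size in (4, 16, 64):
    blocks_per_seq = math.ceil(avg_len / block_size)
    frag_tokens = block_size / 2                         # last block half full on avg
    frag_total_mb = num_seqs * frag_tokens * kv_per_token / MB
    table_total_kb = num_seqs * blocks_per_seq * 4 / KB  # 4-byte block IDs
    rows.append((block_size, blocks_per_seq, frag_total_mb, table_total_kb))
    print(f"block={block_size:2d}  blocks/seq={blocks_per_seq:3d}  "
          f"frag={frag_total_mb:.0f} MB  tables={table_total_kb:.0f} KB")
```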
Memory Watermark and Preemption
Even with PagedAttention, GPU memory is finite. When all blocks are allocated, new requests cannot be admitted and existing requests cannot grow. vLLM handles this through a watermark mechanism and preemption policies.
The Watermark
The block allocator maintains a configurable watermark — a minimum number of free blocks that must be available at all times. This serves two purposes:
- Admission control: New requests are only admitted if enough free blocks exist to hold at least their prompt tokens plus the watermark reserve.
- Growth headroom: Active requests that need new blocks (sequence growing past a block boundary) must always succeed. The watermark ensures blocks are available for in-progress requests.
```python
class WatermarkViolation(Exception):
    pass

class WatermarkBlockAllocator(BlockAllocator):
    def __init__(self, num_blocks: int, block_size: int,
                 watermark_fraction: float = 0.01):
        super().__init__(num_blocks, block_size)
        self.watermark = int(num_blocks * watermark_fraction)

    def can_admit(self, num_blocks_needed: int) -> bool:
        """Check if a new request can be admitted."""
        return self.num_free_blocks() - num_blocks_needed >= self.watermark

    def allocate_or_preempt(self) -> int:
        """Allocate a block, triggering preemption if below watermark."""
        if self.num_free_blocks() <= self.watermark:
            raise WatermarkViolation("Below watermark — preempt a request")
        return self.allocate()
```
A typical watermark is 1% of total blocks. On a system with 5000 blocks, that is 50 blocks reserved — enough to handle short-term allocation bursts without triggering preemption unnecessarily.
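A toy check of the admission rule with those numbers (the free-block count of 180 is hypothetical):

```python
# Admission control with a 5,000-block pool and a 1% watermark.
num_blocks = 5000
watermark = int(num_blocks * 0.01)      # 50 blocks held in reserve
free_blocks = 180                       # hypothetical current free count

def can_admit(needed: int) -> bool:
    return free_blocks - needed >= watermark

assert can_admit(130)        # leaves exactly the watermark free
assert not can_admit(131)    # would dip into the reserve: queue instead
print(watermark)             # → 50
```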
Preemption Strategies
When memory usage exceeds the watermark, vLLM must preempt one or more running requests to free blocks. The scheduler selects the lowest-priority request (typically the most recently arrived, following FCFS priority). Two preemption strategies exist:
Strategy 1: Swap to CPU memory
The preempted request’s KV cache blocks are copied from GPU HBM to CPU DRAM via PCIe. The blocks are freed on the GPU. When the request is later resumed, its KV blocks are copied back from CPU to GPU.
- Cost: PCIe transfer latency. On PCIe Gen4 x16 (~25 GB/s effective), swapping 100 blocks of 5 MB each = 500 MB takes ~20 ms each way.
- Advantage: The request resumes exactly where it left off. No recomputation needed.
- Best for: Long sequences where recomputation would be expensive.
Strategy 2: Recompute (discard and re-prefill)
The preempted request’s KV blocks are simply freed. When the request is resumed, it must re-run the prefill phase over all its prompt + previously generated tokens to reconstruct the KV cache.
- Cost: Recomputation latency. Prefilling 500 tokens on Llama-70B takes ~80-150 ms depending on batch size.
- Advantage: No CPU memory needed. No PCIe bandwidth consumed.
- Best for: Short sequences where recomputation is fast.
When to Swap vs Recompute
The decision depends on the sequence length at the time of preemption: the swap cost scales with the KV cache size moved over PCIe (seq_len × 320 KB, out and back in), while the recompute cost scales with prefill time over the same tokens. For Llama-70B on A100 with PCIe Gen4:
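A hedged sketch of the break-even arithmetic. The PCIe figure comes from the text; the ~0.093 ms/token recompute constant is an assumption read off the measured preemption costs:

```python
# Swap: move KV bytes over PCIe, out and back in. Recompute: re-run prefill.
kv_per_token_mb = 320 / 1024            # 320 KB per token (Llama-70B)
pcie_mb_per_s = 25 * 1024               # ~25 GB/s effective, PCIe Gen4 x16

swap_ms_per_token = 2 * kv_per_token_mb / pcie_mb_per_s * 1000  # round trip
recompute_ms_per_token = 0.093          # assumed per-token prefill latency

def swap_ms(seq_len: int) -> float:
    return seq_len * swap_ms_per_token

def recompute_ms(seq_len: int) -> float:
    return seq_len * recompute_ms_per_token

print(f"{swap_ms(128):.1f} ms vs {recompute_ms(128):.1f} ms")  # → 3.1 ms vs 11.9 ms
```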
Preemption Strategy Comparison (Llama-70B, A100)
| Seq Length at Preemption | Swap Cost (ms) | Recompute Cost (ms) | Better Strategy |
|---|---|---|---|
| 128 tokens | 3.2 | 12 | Swap |
| 256 tokens | 6.4 | 24 | Swap |
| 512 tokens | 12.8 | 48 | Swap |
| 1024 tokens | 25.6 | 95 | Swap |
| 2048 tokens | 51.2 | 190 | Swap |
| 4096 tokens | 102.4 | 380 | Swap |
For these model sizes, swapping is almost always cheaper than recomputation because PCIe bandwidth is much faster than prefill computation. However, swapping requires CPU memory headroom (potentially gigabytes for many preempted requests), and it consumes PCIe bandwidth that could be used for other transfers. In memory-constrained CPU environments, recomputation may be the only option.
If the system is persistently overloaded, preemption can cascade: the resumed request preempts another, which is later resumed and preempts yet another. This thrashing behavior — directly analogous to OS page thrashing — devastates throughput. vLLM mitigates this by reducing the admission rate when preemption frequency exceeds a threshold.
Prefix Caching: Sharing KV Blocks Across Requests
In many production deployments, multiple requests share the same system prompt. A chatbot application might prepend the same 500-token system prompt to every user message. Without sharing, each request computes and stores its own copy of the system prompt’s KV cache — pure waste.
Hash-Based Prefix Deduplication
PagedAttention enables prefix caching by recognizing that KV cache blocks are a deterministic function of their input tokens. Two requests with identical prefix tokens will produce identical KV cache blocks. vLLM exploits this with a hash-based deduplication scheme:
- Hash computation: For each block, compute a hash of the token IDs it contains (plus the block's position in the sequence, to account for positional encodings). The hash key is hash(token_ids, block_position, model_id).
- Cache lookup: Before allocating a new block, check if a block with the same hash already exists in the block pool.
- Cache hit: If found, increment the block's reference count and add it to the new request's block table. No KV computation is needed for this block.
- Cache miss: Allocate a new block, compute the KV data, and insert it into the hash table.
```python
import hashlib
from collections import defaultdict
from typing import Dict, Tuple

class PrefixCacheManager:
    def __init__(self, allocator: BlockAllocator, block_size: int):
        self.allocator = allocator
        self.block_size = block_size
        # hash -> physical_block_id
        self.cache: Dict[str, int] = {}
        self.ref_counts: Dict[int, int] = defaultdict(int)

    def _compute_hash(self, token_ids: Tuple[int, ...],
                      block_position: int) -> str:
        """Deterministic hash for a block's content."""
        data = f"{token_ids}:{block_position}".encode()
        return hashlib.sha256(data).hexdigest()[:16]

    def get_or_allocate(self, token_ids: Tuple[int, ...],
                        block_position: int) -> Tuple[int, bool]:
        """
        Returns (physical_block_id, cache_hit).
        If cache_hit is True, KV computation can be skipped for this block.
        """
        h = self._compute_hash(token_ids, block_position)
        if h in self.cache:
            block_id = self.cache[h]
            self.ref_counts[block_id] += 1
            return block_id, True  # Cache hit - skip KV computation
        # Cache miss - allocate new block
        block_id = self.allocator.allocate()
        if block_id is None:
            # Evict an unreferenced cached block
            block_id = self._evict_lru()
        self.cache[h] = block_id
        self.ref_counts[block_id] = 1
        return block_id, False  # Cache miss - must compute KV

    def release(self, block_id: int, token_ids: Tuple[int, ...],
                block_position: int):
        """Decrement ref count. Block stays in cache for future reuse."""
        self.ref_counts[block_id] -= 1
        # Note: we do NOT free the block or remove it from the cache.
        # It remains available for future requests with the same prefix.
        # Only evicted under memory pressure.

    def _evict_lru(self) -> int:
        """Evict a cached block with ref_count == 0 (not actively used)."""
        for h, block_id in list(self.cache.items()):
            if self.ref_counts[block_id] == 0:
                del self.cache[h]
                del self.ref_counts[block_id]
                return block_id
        raise MemoryError("All cached blocks are actively referenced")
```
Memory Savings in Production
The savings depend on the ratio of shared prefix length to total sequence length:

savings ≈ (N − 1) / N × prefix_len / (prefix_len + response_len)

where N is the number of concurrent requests with the same prefix.
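This savings relationship can be sketched as a one-line function. It is a simplification: real savings also depend on block-boundary alignment, so some scenarios deviate from it:

```python
def prefix_savings(n_reqs: int, prefix_len: int, response_len: int) -> float:
    """Approximate fraction of KV memory saved by sharing one prefix copy."""
    shared_fraction = prefix_len / (prefix_len + response_len)
    return (n_reqs - 1) / n_reqs * shared_fraction

print(f"{prefix_savings(64, 500, 200):.0%}")   # chatbot-style workload → 70%
print(f"{prefix_savings(64, 100, 400):.0%}")   # translation-style workload → 20%
```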
Prefix Caching Memory Savings (Llama-70B)
| Scenario | System Prompt | Avg Response | Concurrent Reqs | Without Caching | With Caching | Savings |
|---|---|---|---|---|---|---|
| Chatbot | 500 tokens | 200 tokens | 64 | 28.0 GB | 8.4 GB | 70% |
| RAG Pipeline | 2000 tokens | 300 tokens | 32 | 23.5 GB | 4.7 GB | 80% |
| Code Assistant | 800 tokens | 500 tokens | 48 | 20.0 GB | 12.0 GB | 40% |
| Translation | 100 tokens | 400 tokens | 64 | 20.5 GB | 16.4 GB | 20% |
| Mixed (no shared prefix) | 0 tokens | 500 tokens | 64 | 10.0 GB | 10.0 GB | 0% |
For chatbot workloads with substantial system prompts, prefix caching saves 30-80% of KV cache memory. This translates directly to higher batch sizes and throughput. In RAG (Retrieval-Augmented Generation) pipelines where the retrieved context is often repeated across related queries, savings can exceed 80%.
Prefix caching and PagedAttention address different sources of waste. PagedAttention eliminates fragmentation (wasted space within allocations). Prefix caching eliminates redundancy (duplicate data across allocations). Together, they can reduce KV cache memory by 80-90% compared to traditional allocation without sharing.
Performance Results: End-to-End Throughput
Let us put it all together and examine end-to-end throughput with all PagedAttention optimizations enabled.
Throughput vs Batch Size
Throughput: Traditional Allocation vs PagedAttention (A100-80GB, Llama-70B)
(Line chart: throughput in tokens/sec vs. batch size from 1 to 128, with three series: Traditional, PagedAttention, and PagedAttention + Prefix Cache.)
Key observations:
- At low batch sizes (1-8), PagedAttention is slightly slower (~3-5%) due to the indirection overhead. Both approaches have plenty of memory headroom.
- The traditional approach hits OOM at batch size 32 on this configuration. PagedAttention continues scaling.
- PagedAttention achieves 4.2x peak throughput over the traditional approach (5234 vs 1247 tok/s).
- Adding prefix caching (with a shared 500-token system prompt) pushes throughput to 6.5x at high batch sizes.
Memory Utilization Comparison
GPU Memory Utilization vs Concurrent Requests
(Line chart: GPU memory utilization in percent vs. concurrent requests from 10 to 120, with four series: Allocated and Actually Used, for both Traditional allocation and PagedAttention.)
The gap between “allocated” and “actually used” tells the story. For the traditional approach, about 68% of allocated memory is wasted regardless of load. For PagedAttention, the gap between allocated and used is <3% at all load levels — only the last-block internal fragmentation.
End-to-End Performance Summary (A100-80GB, Llama-70B, ShareGPT workload)
| Metric | Traditional | PagedAttention | PA + Prefix Cache | Improvement |
|---|---|---|---|---|
| Max Batch Size | 16-24 | 64-96 | 96-128 | 4-6x |
| Peak Throughput (tok/s) | 1,247 | 5,234 | 8,150 | 4.2-6.5x |
| Memory Utilization | 31% | 95% | 96% | 3.1x |
| P50 Latency (ms/tok) | 42 | 48 | 48 | 0.87x (worse) |
| P99 Latency (ms/tok) | 89 | 148 | 142 | 0.60x (worse) |
| Throughput per GPU-$ | 1x | 4.2x | 6.5x | 4.2-6.5x |
The per-token latency increases under PagedAttention, but this is because the system is serving 4-6x more concurrent requests. On a per-request basis, the latency is similar — the requests that previously waited in a queue are now being served concurrently, improving their total completion time.
When PagedAttention Overhead Is Not Worth It
PagedAttention is not universally beneficial. There are scenarios where the indirection overhead costs more than the memory savings provide:
1. Very Short Sequences (<64 tokens)
When sequences are short, the traditional approach wastes less memory (since actual_length is closer to max_seq_len if max_seq_len is also set low). Meanwhile, the PagedAttention overhead (block table lookups, scattered memory access) still applies. If your workload consists entirely of short prompt-response pairs with max_seq_len < 128, the traditional approach may be faster.
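A back-of-envelope comparison of the two waste sources makes the point. The helper names are mine; the sketch assumes uniformly distributed sequence lengths, so the last block averages (block_size − 1)/2 empty slots:

```python
def traditional_waste(avg_len: float, max_seq_len: int) -> float:
    """Fraction of reserved token slots left empty under contiguous pre-allocation."""
    return 1 - avg_len / max_seq_len

def paged_waste(avg_len: float, block_size: int = 16) -> float:
    """Approximate internal fragmentation under paging: only the last
    block of each sequence is partly empty."""
    empty = (block_size - 1) / 2  # average empty slots in the last block
    return empty / (avg_len + empty)

# Short sequences with a tight max_seq_len: the gap nearly vanishes
print(f"{traditional_waste(56, 64):.1%}")  # → 12.5%
print(f"{paged_waste(56):.1%}")            # → 11.8%
```

For long variable-length sequences the picture reverses sharply: at avg_len 500 with max_seq_len 2048, traditional allocation wastes ~76% while paging wastes under 2%.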
2. Single-Request Inference
If you are serving a single user at a time (e.g., a personal GPU running a local model), there is no batching benefit from saving memory. PagedAttention’s throughput advantage comes from fitting more requests — but if there is only one request, the 4-15% attention overhead is pure cost with no compensating benefit.
3. Latency-Critical Single-User Scenarios
Applications requiring absolute minimum per-token latency for a single stream (e.g., real-time voice synthesis, interactive code completion with aggressive latency SLAs) should avoid PagedAttention overhead. The contiguous memory layout enables faster attention kernel execution and more aggressive kernel fusion.
4. Models with Very Small KV Caches
For small models (e.g., <1B parameters) where the KV cache is a tiny fraction of GPU memory, the memory savings from PagedAttention are negligible. You can fit hundreds of concurrent requests with traditional allocation on modern GPUs. The complexity of block management is not justified.
5. Fixed-Length Workloads
If all requests have exactly the same length (e.g., fixed-format classification tasks), there is no internal fragmentation in the traditional approach. PagedAttention adds overhead without removing any waste.
When to Use PagedAttention
| Scenario | Avg Seq Len | Concurrent Reqs | PagedAttention Benefit | Recommendation |
|---|---|---|---|---|
| Multi-user chatbot | 200-500 | 50-200 | 3-6x throughput | Always use |
| RAG pipeline | 1000-4000 | 20-100 | 2-4x throughput | Always use |
| Batch processing | 500-2000 | 32-128 | 2-3x throughput | Always use |
| Single-user local | 200-1000 | 1 | -5% to 0% | Skip |
| Short classification | 32-64 | 10-50 | 0-10% throughput | Marginal |
| Real-time voice gen | 100-500 | 1 | -15% latency | Skip |
The rule of thumb: if you are serving more than ~4 concurrent requests with variable-length sequences, PagedAttention is almost certainly worth it. If you are serving a single user or dealing with very short fixed-length sequences, the overhead may not be justified.
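That rule of thumb can be codified as a rough deployment heuristic. The thresholds below are judgment calls distilled from the table above, not measured cutoffs:

```python
def recommend_paged_attention(concurrent_requests: int,
                              variable_lengths: bool,
                              avg_seq_len: int) -> bool:
    """Rough heuristic: should this workload enable PagedAttention?"""
    if concurrent_requests <= 1:
        return False  # single stream: indirection overhead with no batching gain
    if not variable_lengths and avg_seq_len < 128:
        return False  # short fixed-length: nothing to fragment
    return concurrent_requests > 4 or variable_lengths

print(recommend_paged_attention(64, True, 400))  # → True  (multi-user chatbot)
print(recommend_paged_attention(1, True, 500))   # → False (single-user local)
```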
Implementation Details: The Attention Kernel
For completeness, let us examine the structure of the PagedAttention CUDA kernel — the code that actually executes on the GPU and must handle the block table indirection on every attention computation.
Kernel Structure
// Simplified PagedAttention V2 kernel (vLLM)
// Two-pass approach: first compute partial results per block,
// then reduce across blocks.
template <int BLOCK_SIZE, int HEAD_DIM, int NUM_THREADS>
__global__ void paged_attention_v2_kernel(
    const half*  __restrict__ q,            // [num_seqs, num_heads, head_dim]
    const half*  __restrict__ k_cache,      // [num_blocks, num_kv_heads, block_size, head_dim]
    const half*  __restrict__ v_cache,      // [num_blocks, num_kv_heads, block_size, head_dim]
    const int*   __restrict__ block_tables, // [num_seqs, max_blocks_per_seq]
    const int*   __restrict__ context_lens, // [num_seqs]
    half*  __restrict__ output,             // [num_seqs, num_heads, head_dim]
    float* __restrict__ exp_sums,           // [num_seqs, num_heads, num_partitions]
    float* __restrict__ max_logits,         // [num_seqs, num_heads, num_partitions]
    float* __restrict__ partial_out,        // [num_seqs, num_heads, num_partitions, head_dim]
    float scale,
    int max_blocks_per_seq
) {
    const int seq_idx = blockIdx.x;
    const int head_idx = blockIdx.y;
    const int partition_idx = blockIdx.z;  // Each partition handles a subset of KV blocks
    // Map query head to its KV head (grouped-query attention); NUM_HEADS and
    // NUM_KV_HEADS are compile-time constants in this simplified sketch.
    const int kv_head_idx = head_idx / (NUM_HEADS / NUM_KV_HEADS);
    const int context_len = context_lens[seq_idx];
    const int num_kv_blocks = (context_len + BLOCK_SIZE - 1) / BLOCK_SIZE;

    // Load query vector into registers (strided across the thread block)
    float q_reg[HEAD_DIM / NUM_THREADS];
    // ... load from global memory ...

    float local_max = -INFINITY;
    float local_sum = 0.0f;
    float local_out[HEAD_DIM / NUM_THREADS] = {0};

    // Iterate over KV blocks assigned to this partition
    for (int block_offset = partition_idx; block_offset < num_kv_blocks;
         block_offset += gridDim.z) {
        // === BLOCK TABLE LOOKUP (the indirection cost) ===
        int physical_block_id = block_tables[
            seq_idx * max_blocks_per_seq + block_offset
        ];

        // === SCATTERED HBM READ (non-contiguous access) ===
        // Load K block from a potentially distant HBM location;
        // physical_block_id determines the HBM address.
        const half* k_ptr = k_cache
            + physical_block_id * (NUM_KV_HEADS * BLOCK_SIZE * HEAD_DIM)
            + kv_head_idx * (BLOCK_SIZE * HEAD_DIM);

        // Compute QK^T for this block
        float scores[BLOCK_SIZE];
        for (int t = 0; t < BLOCK_SIZE; t++) {
            float dot = 0.0f;
            for (int d = threadIdx.x; d < HEAD_DIM; d += NUM_THREADS) {
                dot += (float)q_reg[d / NUM_THREADS] * (float)k_ptr[t * HEAD_DIM + d];
            }
            // NB: the real kernel reduces `dot` across threads
            // (warp shuffle) before scaling.
            scores[t] = dot * scale;
        }

        // Online softmax + value accumulation
        // ... (standard online softmax algorithm) ...
    }

    // Write partial results for reduction
    // ... (store local_max, local_sum, local_out) ...
}
The two key performance-critical points are marked: the block table lookup and the scattered HBM read. These are where the PagedAttention overhead manifests.
V1 vs V2 Kernel
vLLM implements two versions of the PagedAttention kernel:
- V1: Single-pass. One CUDA block per (sequence, head). Simple but limited parallelism for long sequences — a single CUDA block must iterate over all KV blocks.
- V2: Two-pass with partitioning. The KV blocks are partitioned across multiple CUDA blocks (the partition_idx dimension). Each partition computes partial attention results, which are then reduced. This enables much better GPU utilization for long sequences.
V2 is used when the context length exceeds a threshold (typically 512 tokens). The partitioning adds a small overhead for the reduction step but enables much higher parallelism.
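The reduction step merges the per-partition results (partial_out, exp_sums, max_logits) in a numerically stable way. This NumPy sketch of the merge math is mine, not vLLM's actual reduction kernel:

```python
import numpy as np

def merge_partitions(partial_out: np.ndarray,  # [P, head_dim] normalized per-partition outputs
                     exp_sums: np.ndarray,     # [P] sum of exp(score - local_max) per partition
                     max_logits: np.ndarray    # [P] local max score per partition
                     ) -> np.ndarray:
    """Numerically stable merge of partial attention outputs (V2 second pass)."""
    global_max = max_logits.max()
    # Rescale each partition's exp-sum from its local max to the global max
    rescaled = exp_sums * np.exp(max_logits - global_max)  # [P]
    # Weighted average of the per-partition outputs
    return (partial_out * rescaled[:, None]).sum(axis=0) / rescaled.sum()
```

Because each partition stores its own softmax max and denominator, the merged result is bit-for-bit equivalent (up to floating-point rounding) to running softmax over the full context in one pass.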
Putting It All Together: Memory Budget Calculation
When deploying vLLM in production, you need to calculate the memory budget to determine how many concurrent requests your GPU can serve:
For Llama-70B FP16 on A100-80GB:
- GPU HBM: 80 GB
- Model weights: ~35 GB
- Activations + overhead: ~3 GB
- KV pool: 80 − 35 − 3 = 42 GB
- Block bytes: 2 (K and V) × 80 layers × 8 KV heads × 16 tokens × 128 head dim × 2 bytes = 5,242,880 bytes ≈ 5 MB per block
- Number of blocks: 42 GB ÷ 5 MB ≈ 8,600 blocks
- Max concurrent tokens: 8,600 × 16 ≈ 137,000 tokens
With an average sequence length of 500 tokens, that is ~275 concurrent requests — compared to ~67 with traditional allocation sized to a 2,048-token maximum. A 4.1x improvement in concurrent request capacity.
For FP16 70B-class models on A100-80GB, each token's KV cache costs ~0.32 MB, so the available KV pool in MB divided by 0.32 gives the approximate token capacity, roughly 3,200 KV tokens per GB. With PagedAttention, budget for actual average sequence lengths. Without it, budget for max sequence lengths. The ratio between the two is your throughput multiplier.
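The whole budget calculation can be scripted. A sketch assuming the standard Llama-70B shape (80 layers, 8 KV heads via grouped-query attention, head dim 128, FP16); kv_budget is a hypothetical helper, not a vLLM API:

```python
def kv_budget(hbm_gb: int = 80, weights_gb: int = 35, overhead_gb: int = 3,
              layers: int = 80, kv_heads: int = 8, head_dim: int = 128,
              dtype_bytes: int = 2, block_size: int = 16):
    """Back-of-envelope KV cache capacity for a PagedAttention deployment."""
    pool_bytes = (hbm_gb - weights_gb - overhead_gb) * 2**30
    token_bytes = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V
    block_bytes = token_bytes * block_size
    num_blocks = pool_bytes // block_bytes
    max_tokens = num_blocks * block_size
    return block_bytes, num_blocks, max_tokens

block_bytes, num_blocks, max_tokens = kv_budget()
print(block_bytes, num_blocks, max_tokens)   # → 5242880 8601 137616
print(max_tokens // 500, max_tokens // 2048) # → 275 67 (paged vs. traditional)
```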
Conclusion
PagedAttention is fundamentally a memory management technique, not an attention algorithm improvement. It does not change the mathematical computation of attention — it changes how and where the inputs to that computation are stored. By borrowing the virtual memory abstraction from operating systems and applying it to GPU KV cache management, it transforms LLM serving economics.
The key insights:
- The problem was memory management, not compute: Traditional KV cache allocation wasted 68% of GPU memory on average. This directly limited batch sizes and throughput.
- OS virtual memory maps perfectly: Physical blocks in GPU HBM, virtual block tables per sequence, demand paging as tokens are generated, copy-on-write for beam search — every OS concept has a direct analog.
- The indirection cost is real but small: 4-15% attention kernel overhead, depending on sequence length. This is overwhelmed by the 2-4x throughput improvement from higher memory utilization.
- Block size 16 is the sweet spot: Balances internal fragmentation (~0.4% of pool), page table overhead, and kernel efficiency. Increase to 32 for long-context workloads.
- Advanced techniques compound: Copy-on-write for beam search, prefix caching for shared system prompts, and watermark-based preemption each add independent memory savings on top of the core paging mechanism.
- It is not always the right choice: Single-user inference, very short sequences, and latency-critical single-stream applications should use contiguous allocation.
PagedAttention is now the standard KV cache management strategy in production LLM serving. vLLM, TensorRT-LLM, and other major inference engines all implement some variant of it. Understanding its mechanics — the trade-offs, the overhead sources, the configuration knobs — is essential for anyone deploying LLMs at scale.