The KV cache memory management problem in LLM inference is fundamentally a systems problem—fragmentation, allocation overhead, and memory bandwidth utilization. vLLM’s PagedAttention addresses this through OS-inspired virtual memory techniques. Let’s examine the implementation at the memory hierarchy level.
The Memory Fragmentation Problem
Traditional inference engines allocate contiguous memory blocks for each sequence’s KV cache. This creates two critical problems:
Internal Fragmentation: Pre-allocated blocks sized for maximum sequence length waste memory on shorter sequences. With a 2048-token max and 256-token average, we waste ~87.5% of allocated memory.
External Fragmentation: Variable sequence completion times leave scattered free blocks that can’t accommodate new sequences.
On an A100-80GB with Llama-70B using standard allocation, we observed 68% memory waste at 50% GPU utilization. PagedAttention reduced this to 4% waste while achieving 95% utilization.
PagedAttention Memory Layout
PagedAttention divides GPU memory into fixed-size blocks (typically 16 tokens × num_heads × head_dim × 2 [K+V] × dtype_size). Each sequence maintains a block table—a mapping from logical token positions to physical block indices.
GPU HBM Layout with PagedAttention
0x0FFFFFFF 0x00000000 0x1FFFFFFF 0x10000000 0x207FFFFF 0x20000000 0x20FFFFFF 0x20800000 ~35GB ~40GB ~128MB ~8MB Block Allocation Strategy
The block allocator maintains a free list implemented as a stack for O(1) allocation/deallocation:
class BlockAllocator:
def __init__(self, device: Device, block_size: int, num_blocks: int):
self.block_size = block_size
self.num_blocks = num_blocks
# Free list as stack - most recently freed blocks allocated first
# This improves cache locality for recently-active blocks
self.free_blocks: List[int] = list(range(num_blocks))
def allocate(self) -> int:
if not self.free_blocks:
raise OutOfMemoryError("KV cache exhausted")
return self.free_blocks.pop() # O(1) allocation
def free(self, block_id: int) -> None:
self.free_blocks.append(block_id) # O(1) deallocation
Using a stack (LIFO) instead of queue (FIFO) for the free list improves L2 cache hit rates. Recently freed blocks are “warm” in cache and get reused immediately.
Block Table Structure and Access Patterns
Each sequence maintains a block table mapping logical blocks to physical blocks:
// GPU kernel perspective
struct BlockTable {
int32_t* data; // [max_seqs, max_blocks_per_seq]
int32_t stride; // max_blocks_per_seq
};
__device__ int get_physical_block(
const BlockTable& table,
int seq_idx,
int logical_block_idx
) {
return table.data[seq_idx * table.stride + logical_block_idx];
}
The memory access pattern during attention computation becomes:
- Compute logical block index:
logical_block = token_position / block_size - Table lookup:
physical_block = block_table[seq_idx][logical_block] - Compute block offset:
offset = token_position % block_size - Calculate HBM address:
addr = block_pool_base + physical_block * block_bytes + offset * token_bytes
Attention Kernel Memory Access Analysis
The PagedAttention kernel differs fundamentally from standard FlashAttention in its memory access pattern:
// Simplified PagedAttention kernel structure
template<int BLOCK_SIZE, int HEAD_DIM>
__global__ void paged_attention_kernel(
const float* __restrict__ q, // [num_seqs, num_heads, head_dim]
const float* __restrict__ k_cache, // [num_blocks, num_heads, block_size, head_dim]
const float* __restrict__ v_cache, // [num_blocks, num_heads, block_size, head_dim]
const int* __restrict__ block_tables, // [num_seqs, max_blocks]
const int* __restrict__ context_lens, // [num_seqs]
float* __restrict__ output // [num_seqs, num_heads, head_dim]
) {
const int seq_idx = blockIdx.x;
const int head_idx = blockIdx.y;
const int context_len = context_lens[seq_idx];
const int num_blocks = (context_len + BLOCK_SIZE - 1) / BLOCK_SIZE;
// Load query into shared memory - coalesced read
__shared__ float q_shared[HEAD_DIM];
if (threadIdx.x < HEAD_DIM) {
q_shared[threadIdx.x] = q[seq_idx * num_heads * HEAD_DIM +
head_idx * HEAD_DIM + threadIdx.x];
}
__syncthreads();
float acc[HEAD_DIM] = {0.0f};
float max_score = -INFINITY;
float sum_exp = 0.0f;
// Iterate over blocks - non-contiguous memory access
for (int block_idx = 0; block_idx < num_blocks; block_idx++) {
// Block table lookup - potential cache miss
int physical_block = block_tables[seq_idx * max_blocks + block_idx];
// Load K block - strided access pattern
float k_block[BLOCK_SIZE][HEAD_DIM];
#pragma unroll
for (int t = 0; t < BLOCK_SIZE; t++) {
for (int d = threadIdx.x; d < HEAD_DIM; d += blockDim.x) {
k_block[t][d] = k_cache[physical_block * num_heads * BLOCK_SIZE * HEAD_DIM +
head_idx * BLOCK_SIZE * HEAD_DIM +
t * HEAD_DIM + d];
}
}
// Compute attention scores and accumulate...
// [Implementation continues with online softmax]
}
}
The non-contiguous block access pattern can reduce effective HBM bandwidth by 15-30% compared to contiguous access. This is the primary performance trade-off of PagedAttention.
Performance Characteristics
We measured PagedAttention performance across different workloads:
PagedAttention vs Contiguous KV Cache (A100-80GB)
| Configuration | Throughput | Memory Util | Latency P99 |
|---|---|---|---|
| Contiguous (batch=8) | 1,247 tok/s | 34% | 89ms |
| PagedAttention (batch=8) | 1,198 tok/s | 89% | 94ms |
| Contiguous (batch=32) | OOM | - | - |
| PagedAttention (batch=32) | 3,891 tok/s | 94% | 112ms |
| PagedAttention (batch=64) | 5,234 tok/s | 97% | 148ms |
The key insight: PagedAttention trades ~4% single-sequence performance for 2-4x throughput improvement through higher batch sizes.
Memory Copy Optimization: Copy-on-Write for Beam Search
PagedAttention implements copy-on-write (CoW) semantics for beam search scenarios:
class CopyOnWriteHandler:
def __init__(self, allocator: BlockAllocator):
self.allocator = allocator
self.ref_counts: Dict[int, int] = defaultdict(int)
def fork_sequence(self, parent_block_table: List[int]) -> List[int]:
"""Fork a sequence without copying blocks."""
child_table = parent_block_table.copy()
for block_id in child_table:
self.ref_counts[block_id] += 1
return child_table
def write_block(self, seq_id: int, block_idx: int,
block_tables: Dict[int, List[int]]) -> int:
"""Copy block on write if shared."""
old_block = block_tables[seq_id][block_idx]
if self.ref_counts[old_block] > 1:
# Shared block - allocate new and copy
new_block = self.allocator.allocate()
self._copy_block(old_block, new_block)
self.ref_counts[old_block] -= 1
block_tables[seq_id][block_idx] = new_block
return new_block
return old_block # Exclusive access - write in place
Beam Search Memory Savings with CoW
(GB)Block Size Selection Analysis
Block size selection involves multiple trade-offs:
| Block Size | Table Overhead | Fragmentation | Kernel Efficiency |
|---|---|---|---|
| 8 tokens | 2x baseline | ~3% waste | Poor (small tiles) |
| 16 tokens | 1x baseline | ~6% waste | Good |
| 32 tokens | 0.5x baseline | ~12% waste | Excellent |
| 64 tokens | 0.25x baseline | ~24% waste | Excellent |
The default of 16 tokens balances these factors. For long-context workloads (>32K tokens), larger blocks (32-64) may improve kernel efficiency at the cost of higher internal fragmentation.
Profiling PagedAttention
To profile PagedAttention memory behavior:
# Capture memory allocation patterns
nsys profile --trace=cuda,nvtx \
--cuda-memory-usage=true \
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-2-70b-hf
# Analyze block allocation frequency
ncu --set full \
--kernel-name "paged_attention" \
--launch-skip 100 --launch-count 10 \
python benchmark.py
Key metrics to monitor:
- L2 Cache Hit Rate: Should be >80% with warm blocks
- HBM Bandwidth Utilization: Target >70% during decode
- Block Table Access Latency: Measure via
__ldginstruction counts
Conclusion
PagedAttention’s genius lies in recognizing that LLM inference’s KV cache problem maps directly to virtual memory management. By sacrificing some memory access locality, it eliminates fragmentation and enables dramatically higher throughput. Understanding these trade-offs is essential for optimizing inference deployments.
The next post in this series will examine continuous batching and how it interacts with PagedAttention’s block management.