Part 5 of 23 in the series: Inference Optimization Timeline

Every LLM serving system faces the same bottleneck: GPU memory. Not compute, not network bandwidth — memory. Specifically, the KV cache that stores intermediate attention state for every in-flight request consumes the majority of available GPU HBM during inference. Before PagedAttention, the standard approach was to pre-allocate a contiguous chunk of memory for each request sized to the maximum possible sequence length. The result was catastrophic waste — on average, only 32% of allocated KV cache memory held actual data. The other 68% sat empty, reserving space for tokens that would never be generated.

vLLM’s PagedAttention, introduced by Kwon et al. in 2023, solved this by applying an idea that operating systems perfected decades ago: virtual memory with demand paging. Instead of pre-allocating contiguous memory, PagedAttention divides GPU HBM into fixed-size blocks and allocates them on demand as tokens are generated. The result is near-zero internal fragmentation, dramatically higher memory utilization, and 2-4x throughput improvements from fitting more concurrent requests into the same GPU.

This post dissects PagedAttention from the ground up: the memory waste problem it solves, the OS virtual memory analogy that inspired it, the block allocation mechanics that make it work, the indirection cost you pay for it, and the advanced techniques (copy-on-write, prefix caching, preemption) that build on top of it. We will also examine when PagedAttention’s overhead is not worth it.

The Memory Waste Crisis in LLM Serving

To understand why PagedAttention matters, you need to understand how KV cache memory was managed before it existed.

How Traditional KV Cache Allocation Works

During autoregressive generation, a transformer model computes key and value vectors at every layer for every token in the sequence. These vectors are cached so they do not need to be recomputed on subsequent generation steps. For a model with $L$ layers, $H$ KV heads, and head dimension $d$, the KV cache for a single token requires:

$$\text{KV per token} = 2 \times L \times H \times d \times \text{dtype\_size}$$

For Llama-2-70B with 80 layers, 8 KV heads (GQA), head dimension 128, and FP16 storage:

$$\text{KV per token} = 2 \times 80 \times 8 \times 128 \times 2 = 327{,}680 \text{ bytes} \approx 320 \text{ KB}$$

For a sequence of 2048 tokens, that is $320 \text{ KB} \times 2048 \approx 640 \text{ MB}$ per request. On an A100 with 80 GB of HBM, after loading the model weights (~35 GB per GPU for FP16 Llama-2-70B under 4-way tensor parallelism), you have roughly 40-42 GB left for KV cache. That means you can serve at most ~64 concurrent requests at max sequence length.
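These sizing numbers are easy to verify with a few lines of Python. The helper below is a sketch of the formula above; the function name is mine, and the constants are the Llama-2-70B figures from the text (80 layers, 8 KV heads, head dimension 128, FP16):

```python
# Sketch: KV cache size per token for a GQA model, following the formula above.
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_size: int = 2) -> int:
    # Factor of 2 accounts for both K and V vectors
    return 2 * num_layers * num_kv_heads * head_dim * dtype_size

per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
print(per_token)                      # 327680 bytes, i.e. 320 KB
print(per_token * 2048 // (1024**2))  # 640 (MB for a full 2048-token sequence)
```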

The traditional approach pre-allocates this entire 640 MB block per request at the time of admission:

import torch

class TraditionalKVCache:
    def __init__(self, max_seq_len: int = 2048, num_layers: int = 80,
                 num_kv_heads: int = 8, head_dim: int = 128):
        self.num_layers = num_layers
        # Pre-allocate for MAXIMUM sequence length, regardless of actual length
        self.k_cache = torch.zeros(
            num_layers, max_seq_len, num_kv_heads, head_dim,
            dtype=torch.float16, device='cuda'
        )  # 320 MB
        self.v_cache = torch.zeros(
            num_layers, max_seq_len, num_kv_heads, head_dim,
            dtype=torch.float16, device='cuda'
        )  # 320 MB
        self.current_length = 0

    def append_token(self, layer_idx: int, k: torch.Tensor, v: torch.Tensor):
        self.k_cache[layer_idx, self.current_length] = k
        self.v_cache[layer_idx, self.current_length] = v
        # Advance the write position only after the last layer writes this token
        if layer_idx == self.num_layers - 1:
            self.current_length += 1

The problem is obvious: if a request generates only 200 tokens, 90% of the allocation is wasted. But the system cannot know in advance how long a sequence will be — it depends on the model’s output, which is determined token by token.

Quantifying the Waste

In production LLM serving workloads, sequence lengths follow a heavy-tailed distribution. Most requests are short (chatbot responses, summaries), but max_seq_len must accommodate the longest possible output. Empirical measurements from the vLLM paper show:

KV Cache Memory Utilization in Traditional Allocation

| Workload | Avg Seq Len | Max Seq Len | Avg Utilization | Memory Wasted |
|---|---|---|---|---|
| Chatbot (ShareGPT) | 215 tokens | 2048 tokens | 28% | 72% |
| Code Generation | 412 tokens | 4096 tokens | 32% | 68% |
| Summarization | 524 tokens | 2048 tokens | 38% | 62% |
| Translation | 186 tokens | 1024 tokens | 30% | 70% |
| Weighted Average | - | - | 32% | 68% |

Note: Measured across production workloads. Max Seq Len is the configured allocation size.

The average utilization across workloads is approximately 32%. That means 68% of GPU memory dedicated to KV cache is wasted on empty reservations.

⚠️ This Is Not Just Wasteful — It Directly Limits Throughput

If you can fit only 20 requests in memory and each uses roughly 30% of its pre-allocated slot, the actual KV data amounts to about 6 requests' worth of memory, yet all 20 slots are occupied. Requests that could have been batched concurrently are instead stuck in a queue, waiting. Memory waste translates directly into throughput loss.

The Two Types of Fragmentation

Traditional KV cache allocation suffers from both internal and external fragmentation, mirroring the classic memory management problems studied in operating systems:

Internal fragmentation occurs within each allocation. A request allocated 2048 tokens worth of memory but only using 300 tokens has $2048 - 300 = 1748$ tokens of internal fragmentation. This is the dominant source of waste, accounting for roughly 60-70% of total memory loss.

External fragmentation occurs between allocations. When requests complete at different times, they leave gaps in the contiguous memory space. A newly arriving request needing 2048 contiguous tokens may find that the total free memory is sufficient but no single contiguous block is large enough. This accounts for another 10-20% of waste, depending on the allocation pattern.

Reservation fragmentation is a third, subtler issue. Even before a request begins generating tokens, the system must reserve memory for the maximum possible output. If the scheduler reserves max_seq_len tokens for 50 requests but only 30 could actually fit if their true lengths were known, 20 requests are unnecessarily delayed.

Memory Waste Breakdown (A100-80GB, Llama-70B, 50 concurrent requests)

| Category | Share of KV Memory |
|---|---|
| Actual KV Data | 32% |
| Internal Fragmentation | 45% |
| External Fragmentation | 13% |
| Reservation Waste | 10% |

The OS Virtual Memory Analogy

The insight behind PagedAttention is that KV cache management maps almost perfectly onto the virtual memory problem that operating systems solved in the 1960s. Let us trace this analogy precisely.

Virtual Memory in Operating Systems

In a modern OS, each process sees a contiguous virtual address space. The hardware (MMU) and OS kernel collaborate to map this virtual space onto physical RAM in fixed-size units called pages (typically 4 KB). Key properties:

  1. No pre-allocation: A process can declare a large virtual address space without consuming physical memory. Pages are allocated on demand when first accessed (demand paging).
  2. Non-contiguous physical storage: Virtual pages 0, 1, 2 of a process might map to physical frames 847, 12, 503. The process sees a contiguous sequence; the hardware handles the indirection.
  3. Page table: A per-process data structure maps virtual page numbers to physical frame numbers. The MMU consults this on every memory access.
  4. Fragmentation elimination: Since all allocation units are the same fixed size, external fragmentation is impossible. Internal fragmentation is limited to the last page (at most page_size - 1 bytes per allocation).
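The demand-paging behavior in points 1-3 can be sketched in a few lines of Python. This is a toy model, not OS code; the class and frame numbers are illustrative (the frames echo the 847/12/503 example above):

```python
# Toy demand paging: virtual pages map to physical frames only on first touch.
class ToyPageTable:
    def __init__(self, free_frames):
        self.free_frames = list(free_frames)  # pool of physical frame numbers
        self.table = {}                       # virtual page -> physical frame

    def translate(self, vaddr: int, page_size: int = 4096) -> int:
        vpage, offset = divmod(vaddr, page_size)
        if vpage not in self.table:           # "page fault": allocate on demand
            self.table[vpage] = self.free_frames.pop()
        return self.table[vpage] * page_size + offset

pt = ToyPageTable(free_frames=[503, 12, 847])
print(pt.translate(0) // 4096)     # 847: virtual page 0 landed in frame 847
print(pt.translate(4100) // 4096)  # 12: virtual page 1 landed in frame 12
```

Declaring a huge virtual range costs nothing here; physical frames are consumed only as pages are touched, which is exactly the property PagedAttention borrows.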

The PagedAttention Mapping

PagedAttention applies this exact scheme to KV cache management:

| OS Virtual Memory | PagedAttention KV Cache |
|---|---|
| Physical RAM | GPU HBM (block pool) |
| Page frame (4 KB) | KV block (16 tokens of KV data) |
| Virtual address space | Logical KV cache per request |
| Page table (per process) | Block table (per sequence) |
| MMU translation | Attention kernel block table lookup |
| Demand paging | Allocate blocks as tokens are generated |
| Page fault handler | Block allocator (pop from free list) |
| Swapping to disk | Swapping KV blocks to CPU memory |
| Copy-on-write fork | Shared prefix blocks in beam search |

PagedAttention GPU HBM Layout (A100-80GB, Llama-70B)

| Region | Size | Contents |
|---|---|---|
| Model Weights (FP16) | ~35 GB | Llama-2-70B parameters |
| KV Block Pool | ~40 GB | Fixed-size blocks, allocated on demand |
| Block Tables | ~128 MB | Per-sequence logical-to-physical mappings |
| Workspace + Activations | ~512 MB | Temporary buffers for forward pass |

The crucial difference from OS virtual memory: there is no hardware MMU on the GPU performing address translation. The indirection must be done in software, inside the attention kernel itself. This is the fundamental cost of PagedAttention — every attention computation must dereference the block table. We will quantify this cost in a later section.

How It Eliminates Waste

With PagedAttention, a new request starts with zero allocated blocks. As the prefill phase processes prompt tokens, blocks are allocated to hold the resulting KV data. During decode, each time the sequence grows past a block boundary, one new block is allocated. When the request completes, all its blocks are returned to the free list.

The only waste is in the last block of each sequence, which may be partially filled. With a block size of 16 tokens, the average waste per request is 8 tokens — that is $8 \times 320 \text{ KB/token} = 2{,}560 \text{ KB} = 2.5 \text{ MB}$ for Llama-70B. Compare this to the traditional approach, where average waste is $1748 \times 320 \text{ KB} \approx 546 \text{ MB}$ per request.

$$\text{Traditional waste per request} \approx (\text{max\_seq\_len} - \text{avg\_seq\_len}) \times \text{kv\_per\_token}$$

$$\text{PagedAttention waste per request} \approx \frac{\text{block\_size}}{2} \times \text{kv\_per\_token}$$

For max_seq_len = 2048, avg_seq_len = 300, block_size = 16:

  • Traditional: $(2048 - 300) \times 320 \text{ KB} = 559{,}360 \text{ KB} \approx 546 \text{ MB}$
  • PagedAttention: $16 / 2 \times 320 \text{ KB} = 2{,}560 \text{ KB} = 2.5 \text{ MB}$

That is a roughly 220x reduction in per-request memory waste.
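As a sanity check, the comparison fits in a short Python sketch. The constants are from the worked example above; the sketch treats a mean-length request's waste as max_seq_len minus avg_seq_len tokens, consistent with the per-request figure in the preceding paragraph:

```python
# Back-of-envelope waste comparison (illustrative function names).
KV_PER_TOKEN_KB = 320  # Llama-70B figure from the text

def traditional_waste_mb(max_seq_len: int, avg_seq_len: int) -> float:
    # Pre-allocation wastes every slot beyond the mean sequence length
    return (max_seq_len - avg_seq_len) * KV_PER_TOKEN_KB / 1024

def paged_waste_mb(block_size: int) -> float:
    # Only the last block is partial; half full on average
    return block_size / 2 * KV_PER_TOKEN_KB / 1024

trad = traditional_waste_mb(2048, 300)   # ~546 MB
paged = paged_waste_mb(16)               # 2.5 MB
print(round(trad, 1), paged, round(trad / paged))
```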

Block Allocation Mechanics

The block allocator is the core data structure of PagedAttention. Its design is deceptively simple but carefully optimized for the constraints of GPU serving.

The Free Block List

The allocator maintains a free list implemented as a stack (LIFO). This is a deliberate choice:

from typing import List, Dict, Optional

class BlockAllocator:
    """GPU KV cache block allocator. O(1) alloc and free."""

    def __init__(self, num_blocks: int, block_size: int, device: str = 'cuda'):
        self.num_blocks = num_blocks
        self.block_size = block_size
        self.device = device

        # Free list as a stack: most recently freed blocks are reused first.
        # LIFO order improves GPU L2 cache hit rates because recently-active
        # blocks are "warm" in the cache hierarchy.
        self.free_list: List[int] = list(range(num_blocks))

    def allocate(self) -> Optional[int]:
        """Pop a block from the free list. O(1)."""
        if not self.free_list:
            return None  # Out of memory — trigger preemption
        return self.free_list.pop()

    def free(self, block_id: int) -> None:
        """Push a block back onto the free list. O(1)."""
        self.free_list.append(block_id)

    def num_free_blocks(self) -> int:
        return len(self.free_list)
💡 Why LIFO, Not FIFO?

A stack (LIFO) reuses the most recently freed blocks first. On GPUs, the L2 cache is typically 40-50 MB (A100: 40 MB). A recently freed block’s data may still reside in L2. When the block is reallocated and written with new KV data, the write may partially hit L2 instead of going to HBM. With FIFO, freed blocks cycle through the entire pool before reuse, guaranteeing cold cache lines. In practice, LIFO improves L2 hit rates by 5-12% for the block write path.

Both allocate() and free() are O(1). There is no searching, no coalescing, no buddy system. This simplicity is possible because all blocks are the same size — the same insight that makes OS page allocation trivial compared to general-purpose malloc.
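The LIFO reuse property is easy to demonstrate with a stripped-down allocator equivalent to the sketch above (names here are illustrative):

```python
# Minimal demo of LIFO reuse: the most recently freed block comes back first.
class TinyAllocator:
    def __init__(self, num_blocks: int):
        self.free_list = list(range(num_blocks))  # stack of free block IDs

    def allocate(self):
        return self.free_list.pop() if self.free_list else None

    def free(self, block_id: int):
        self.free_list.append(block_id)

alloc = TinyAllocator(num_blocks=4)
a = alloc.allocate()          # block 3 (popped from the top of the stack)
b = alloc.allocate()          # block 2
alloc.free(a)
print(alloc.allocate() == a)  # True: the just-freed block is reused first
```

With a FIFO queue (`pop(0)` instead of `pop()`), the freed block would sit behind every other free block before being reused, which is exactly the cache-cold behavior the callout above describes.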

The Block Table

Each active sequence maintains a block table — an array that maps logical block indices to physical block IDs:

class SequenceBlockTable:
    """Per-sequence mapping from logical to physical blocks."""

    def __init__(self):
        self.table: List[int] = []  # table[logical_idx] = physical_block_id

    def append_block(self, physical_block_id: int):
        """Sequence grew past current block boundary. Add new mapping."""
        self.table.append(physical_block_id)

    def get_physical_block(self, logical_idx: int) -> int:
        return self.table[logical_idx]

    def num_blocks(self) -> int:
        return len(self.table)

    def release_all(self, allocator: BlockAllocator):
        """Sequence completed. Return all blocks to free list."""
        for block_id in self.table:
            allocator.free(block_id)
        self.table.clear()

On the GPU side, the block tables for all active sequences are packed into a single 2D tensor of shape [max_num_seqs, max_blocks_per_seq] and passed to the attention kernel:

// GPU kernel: resolve logical token position to physical HBM address
__device__ inline void* resolve_kv_address(
    const int* __restrict__ block_table,  // [max_seqs, max_blocks_per_seq]
    int seq_idx,
    int token_position,
    int block_size,
    int stride,          // max_blocks_per_seq
    void* block_pool,    // base address of KV block pool in HBM
    size_t block_bytes   // bytes per block
) {
    int logical_block = token_position / block_size;
    int offset_in_block = token_position % block_size;
    int physical_block = block_table[seq_idx * stride + logical_block];
    // TOKEN_KV_BYTES: per-token KV size within a block (compile-time constant)
    return (char*)block_pool + physical_block * block_bytes
           + offset_in_block * TOKEN_KV_BYTES;
}

Allocation Flow: Life of a Request

Let us trace the complete allocation flow for a single request with block_size = 16:

  1. Request arrives with a prompt of 45 tokens.
  2. Prefill phase: The scheduler computes $\lceil 45 / 16 \rceil = 3$ blocks needed. Three blocks are popped from the free list. Block table: [block_412, block_87, block_1903]. The third block is only 13/16 full (3 slots of internal fragmentation).
  3. Decode step 1: Token 46 is generated. It fits in the third block (slot 14 of 16). No new allocation needed.
  4. Decode step 2: Token 47 is generated. It fits in slot 15 of block 3. Still no allocation.
  5. Decode step 3: Token 48 is generated, filling slot 16. Block 3 is now full, but no allocation is needed yet.
  6. Decode step 4: Token 49 does not fit. A fourth block is allocated: [block_412, block_87, block_1903, block_55]. Token 49 goes into the first slot of the new block.
  7. Steps 5-N: Continue generating. A new block is allocated every 16 tokens.
  8. Request completes at token 112. Block table has $\lceil 112 / 16 \rceil = 7$ blocks. All 7 are returned to the free list.

Total memory used: $7 \times 16 \times 320 \text{ KB} = 35{,}840 \text{ KB} = 35 \text{ MB}$. Traditional allocation would have used $2048 \times 320 \text{ KB} = 640 \text{ MB}$. Savings: 94.5%.
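The block-count arithmetic in this trace can be verified directly. The helper below is illustrative, using the same ceiling division the scheduler applies:

```python
import math

# Block demand as a function of sequence length, block_size = 16.
def blocks_needed(num_tokens: int, block_size: int = 16) -> int:
    return math.ceil(num_tokens / block_size)

print(blocks_needed(45))   # 3: prefill of the 45-token prompt
print(blocks_needed(48))   # 3: tokens 46-48 still fit in the third block
print(blocks_needed(49))   # 4: token 49 triggers a new allocation
print(blocks_needed(112))  # 7: blocks held at completion

# Memory at completion vs the traditional 2048-token pre-allocation:
paged_mb = blocks_needed(112) * 16 * 320 / 1024  # 35.0 MB
traditional_mb = 2048 * 320 / 1024               # 640.0 MB
print(round(1 - paged_mb / traditional_mb, 3))   # 0.945 -> 94.5% saved
```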

The Indirection Cost: What You Pay for Paging

PagedAttention is not free. The block table indirection introduces measurable overhead in the attention kernel. Understanding this cost is essential for deciding whether PagedAttention is the right choice for your workload.

Source of Overhead

In a traditional contiguous KV cache, the attention kernel accesses K and V vectors with a simple stride:

k_addr = k_base + seq_idx * max_seq_len * kv_size + token_pos * kv_size

This is a single multiply-add, and consecutive tokens are adjacent in memory, enabling coalesced HBM reads. With PagedAttention, the access pattern becomes:

logical_block = token_pos / block_size
physical_block = block_table[seq_idx * stride + logical_block]   // table lookup
offset = token_pos % block_size
k_addr = block_pool + physical_block * block_bytes + offset * kv_size

This introduces three costs:

  1. Pointer chasing: The block table lookup is a dependent memory access. The kernel cannot compute the HBM address until the table lookup completes. If the block table is not in L2/L1 cache, this is a full HBM round-trip (~400-600 ns on A100).

  2. Non-contiguous memory access: Physical blocks for a single sequence are scattered across HBM. When the attention kernel iterates over the context (all past tokens), it jumps between non-contiguous blocks. This defeats HBM’s preference for sequential access and reduces effective bandwidth by 15-30%.

  3. Reduced kernel fusion opportunities: The scattered access pattern makes it harder to fuse the attention kernel with subsequent operations, because the memory layout is unpredictable at compile time.

Measured Overhead

Attention Kernel Latency: Contiguous vs PagedAttention (A100-80GB)

| Sequence Length | Contiguous (ms) | PagedAttention (ms) | Overhead |
|---|---|---|---|
| 128 tokens | 0.42 | 0.44 | +4.8% |
| 512 tokens | 1.21 | 1.31 | +8.3% |
| 1024 tokens | 2.38 | 2.65 | +11.3% |
| 2048 tokens | 4.71 | 5.38 | +14.2% |
| 4096 tokens | 9.42 | 10.83 | +15.0% |

Note: Single-sequence decode step, Llama-70B, FP16, block_size=16. Measured with CUDA event timers, averaged over 1000 iterations.

The overhead ranges from ~4% for short sequences to ~15% for long sequences. Longer sequences require more block table lookups and more scattered memory accesses, increasing the relative overhead.

Why It Is Worth It

The 4-15% attention kernel overhead is the cost per request. But the benefit is measured in aggregate throughput — how many requests per second the system can serve. Because PagedAttention eliminates memory waste, it can fit 2-4x more concurrent requests in the same GPU memory:

System Throughput: Contiguous vs PagedAttention (A100-80GB, Llama-70B)

| Batch Size | Contiguous (req/s) | PagedAttention (req/s) |
|---|---|---|
| 8 | 4.2 | 3.9 |
| 16 | 7.8 | 7.5 |
| 32 | OOM | 13.1 |
| 48 | OOM | 17.8 |
| 64 | OOM | 21.2 |
| 96 | OOM | 24.6 |
| 128 | OOM | 25.8 |

Note: OOM means contiguous allocation cannot fit that many max-length requests in memory.

At batch size 8, PagedAttention is actually slightly slower (~7% lower throughput) due to the per-request overhead with no memory pressure benefit. But at batch size 16, the contiguous approach is already at the memory wall — it cannot fit more requests. PagedAttention continues scaling to batch sizes of 64-128, achieving 3.3x higher peak throughput (25.8 vs 7.8 requests/sec).

The Core Trade-off

PagedAttention trades 4-15% per-request attention kernel latency for 2-4x system-level throughput. This trade-off is overwhelmingly favorable for serving workloads with multiple concurrent users. It is unfavorable only for single-user, latency-critical inference.

Sharing Memory: Copy-on-Write for Beam Search

Beam search is a decoding strategy where the model maintains $B$ candidate sequences (beams) at each step, expanding the most promising ones. All beams share the same prompt prefix. Without memory sharing, each beam requires its own copy of the prefix KV cache.

The Memory Problem

For beam width $B = 4$, prefix length 512 tokens, and Llama-70B:

$$\text{Prefix KV per beam} = 512 \times 320 \text{ KB} = 160 \text{ MB}$$
$$\text{Total without sharing} = 4 \times 160 \text{ MB} = 640 \text{ MB}$$
$$\text{With sharing} = 160 \text{ MB (one copy)} + \text{divergent tokens only}$$

Savings: $640 - 160 = 480 \text{ MB}$ — a 75% reduction for the prefix portion alone.

How Copy-on-Write Works

PagedAttention implements copy-on-write (CoW) semantics borrowed from fork() in Unix:

  1. When a beam is forked, the child beam’s block table is a copy of the parent’s block table — pointing to the same physical blocks. No data is copied.
  2. A reference count is maintained per physical block. After forking, shared blocks have refcount > 1.
  3. When a beam needs to write new KV data into a block that is shared (refcount > 1), a new physical block is allocated, the old block’s data is copied into it, the beam’s block table is updated to point to the new block, and the old block’s refcount is decremented.
  4. If a beam needs to write into a block with refcount = 1, it writes in place. No copy needed.
from collections import defaultdict
from typing import Dict, List

class CopyOnWriteBlockManager:
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.ref_counts: Dict[int, int] = defaultdict(lambda: 1)

    def fork_beam(self, parent_table: List[int]) -> List[int]:
        """Create a child beam sharing all parent blocks. O(num_blocks)."""
        child_table = parent_table.copy()  # Copy the table, not the data
        for block_id in child_table:
            self.ref_counts[block_id] += 1
        return child_table

    def write_token(self, block_table: List[int], logical_block_idx: int,
                    token_offset: int, kv_data) -> int:
        """Write KV data, copying block if shared."""
        physical_block = block_table[logical_block_idx]

        if self.ref_counts[physical_block] > 1:
            # Shared block: copy-on-write
            new_block = self.allocator.allocate()
            if new_block is None:
                raise MemoryError("Cannot allocate block for CoW")
            self._gpu_copy_block(physical_block, new_block)
            self.ref_counts[physical_block] -= 1
            self.ref_counts[new_block] = 1
            block_table[logical_block_idx] = new_block
            physical_block = new_block

        # Write KV data into the (now exclusive) block
        self._gpu_write_kv(physical_block, token_offset, kv_data)
        return physical_block

    def free_beam(self, block_table: List[int]):
        """Release a beam's blocks, respecting reference counts."""
        for block_id in block_table:
            self.ref_counts[block_id] -= 1
            if self.ref_counts[block_id] == 0:
                self.allocator.free(block_id)
                del self.ref_counts[block_id]

    def _gpu_copy_block(self, src: int, dst: int):
        # cuda memcpy from src block to dst block in block pool
        pass

    def _gpu_write_kv(self, block_id: int, offset: int, data):
        pass
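To see the refcount rules in action, here is a standalone toy version of the manager above, with GPU block copies simulated as Python list copies (all names illustrative):

```python
# Toy copy-on-write block manager: refcounted blocks, copy only on shared write.
class ToyCoW:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # free block IDs (LIFO stack)
        self.ref = {}                        # block ID -> reference count
        self.data = {}                       # block ID -> simulated KV contents

    def alloc(self) -> int:
        b = self.free.pop()
        self.ref[b] = 1
        self.data[b] = []
        return b

    def fork(self, table):
        # Child shares every parent block; no data is copied
        for b in table:
            self.ref[b] += 1
        return list(table)

    def write(self, table, idx, token):
        b = table[idx]
        if self.ref[b] > 1:                  # shared block: copy-on-write
            nb = self.alloc()
            self.data[nb] = list(self.data[b])
            self.ref[b] -= 1
            table[idx] = nb
            b = nb
        self.data[b].append(token)           # exclusive: write in place

cow = ToyCoW(8)
parent = [cow.alloc()]
cow.write(parent, 0, "the")
child = cow.fork(parent)            # refcount -> 2, zero bytes copied
cow.write(child, 0, "cat")          # diverging write triggers the copy
print(parent[0] != child[0])        # True: child now owns a private block
print(cow.data[parent[0]])          # ['the']  (parent unaffected)
```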

Memory Savings at Scale

Beam Search Memory Usage: Without vs With Copy-on-Write

| Beam Width | Without CoW | With CoW |
|---|---|---|
| 4 | 32.4 GB | 12.8 GB |
| 8 | 64.8 GB | 18.2 GB |
| 16 | 129.6 GB | 28.4 GB |

For beam width 8 with a 1024-token prefix on Llama-70B, CoW saves approximately $7 \times 1024 \times 320 \text{ KB} \approx 2.19 \text{ GB}$ per request (the seven redundant prefix copies) — the difference between fitting the workload on one GPU versus needing two.

Block Size Selection Analysis

Block size is the most important hyperparameter in PagedAttention. It determines the trade-off between three competing concerns: internal fragmentation, page table overhead, and kernel efficiency.

The Three-Way Trade-off

Internal fragmentation scales with block size. On average, the last block of each sequence is half full, wasting $\text{block\_size} / 2$ token slots. Larger blocks mean more waste.

Page table overhead scales inversely with block size. More blocks per sequence means a larger block table, more block table entries to transfer to the GPU, and more pointer chasing in the kernel.

Kernel efficiency generally improves with larger blocks. GPU kernels operate most efficiently when processing data in large contiguous chunks. Larger blocks mean fewer loop iterations in the attention kernel and better utilization of GPU vector units.

Quantitative Analysis

Let us work through the numbers for Llama-70B on A100-80GB, serving 64 concurrent requests with average sequence length 512 tokens:

Block size = 4 tokens:

  • Blocks per sequence: $512 / 4 = 128$
  • Internal fragmentation per request: $4/2 \times 320 \text{ KB} = 640 \text{ KB}$
  • Total fragmentation (64 reqs): $64 \times 640 \text{ KB} = 40 \text{ MB}$
  • Block table size per request: $128 \times 4 \text{ bytes} = 512 \text{ bytes}$
  • Total block table: $64 \times 512 \text{ bytes} = 32 \text{ KB}$
  • Kernel loop iterations per attention: 128 (one per block)
  • Block table lookups per attention: 128

Block size = 16 tokens:

  • Blocks per sequence: $512 / 16 = 32$
  • Internal fragmentation per request: $16/2 \times 320 \text{ KB} = 2{,}560 \text{ KB} \approx 2.5 \text{ MB}$
  • Total fragmentation (64 reqs): $64 \times 2.5 \text{ MB} = 160 \text{ MB}$
  • Block table size per request: $32 \times 4 \text{ bytes} = 128 \text{ bytes}$
  • Total block table: $64 \times 128 \text{ bytes} = 8 \text{ KB}$
  • Kernel loop iterations per attention: 32
  • Block table lookups per attention: 32

Block size = 64 tokens:

  • Blocks per sequence: $512 / 64 = 8$
  • Internal fragmentation per request: $64/2 \times 320 \text{ KB} = 10{,}240 \text{ KB} = 10 \text{ MB}$
  • Total fragmentation (64 reqs): $64 \times 10 \text{ MB} = 640 \text{ MB}$
  • Block table size per request: $8 \times 4 \text{ bytes} = 32 \text{ bytes}$
  • Total block table: $64 \times 32 \text{ bytes} = 2 \text{ KB}$
  • Kernel loop iterations per attention: 8
  • Block table lookups per attention: 8
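These per-block-size numbers can be reproduced for any block size with a small helper. The function is illustrative; the constants (320 KB/token, 64 requests, average 512 tokens) come from the scenario above:

```python
import math

# Fragmentation and block table cost as a function of block size.
def block_size_stats(block_size: int, avg_seq_len: int = 512,
                     num_reqs: int = 64, kv_per_token_kb: int = 320):
    blocks_per_seq = math.ceil(avg_seq_len / block_size)
    # Last block half full on average, per request
    frag_mb = num_reqs * (block_size / 2) * kv_per_token_kb / 1024
    table_kb = num_reqs * blocks_per_seq * 4 / 1024  # 4-byte table entries
    return blocks_per_seq, frag_mb, table_kb

for bs in (4, 16, 64):
    print(bs, block_size_stats(bs))
# 4  -> (128 blocks/seq, 40.0 MB total fragmentation, 32.0 KB of tables)
# 16 -> (32, 160.0, 8.0)
# 64 -> (8, 640.0, 2.0)
```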
Block Size Trade-off Analysis (Llama-70B, 64 concurrent reqs, avg 512 tokens)

| Block Size | Fragmentation (total) | Table Overhead | Kernel Iterations | Attention Overhead | Net Throughput |
|---|---|---|---|---|---|
| 4 tokens | 40 MB (0.1%) | 32 KB | 128/seq | +18% | 87% of baseline |
| 8 tokens | 80 MB (0.2%) | 16 KB | 64/seq | +10% | 94% of baseline |
| 16 tokens | 160 MB (0.4%) | 8 KB | 32/seq | +6% | 100% (baseline) |
| 32 tokens | 320 MB (0.8%) | 4 KB | 16/seq | +3% | 98% of baseline |
| 64 tokens | 640 MB (1.6%) | 2 KB | 8/seq | +2% | 93% of baseline |

Note: Net throughput accounts for both fragmentation-reduced batch capacity and kernel overhead.

The sweet spot is 16 tokens per block. At this size:

  • Internal fragmentation is negligible (0.4% of KV pool)
  • Block table overhead is minimal (fits in GPU L2 cache for all sequences)
  • Kernel loop count (32 iterations for 512 tokens) is manageable
  • Each block is $16 \times 320 \text{ KB} = 5 \text{ MB}$ — large enough for efficient HBM burst reads
💡 Context-Length-Dependent Tuning

For long-context workloads (>32K tokens), increasing block size to 32 reduces the number of block table lookups proportionally. The increased fragmentation is less significant when sequences are long (e.g., 32 tokens wasted out of 32,768 is 0.1%). vLLM defaults to 16 but allows configuration via --block-size.

Effective Throughput vs Block Size (Normalized)

| Block Size | Short Seqs (avg 128 tokens) | Medium Seqs (avg 512 tokens) | Long Seqs (avg 4096 tokens) |
|---|---|---|---|
| 4 | 82 | 87 | 75 |
| 8 | 91 | 94 | 88 |
| 16 | 100 | 100 | 97 |
| 32 | 95 | 98 | 100 |
| 64 | 85 | 93 | 98 |

Note that for long sequences, the optimal block size shifts toward 32 tokens, while for short sequences, 16 remains clearly optimal. This is because the relative cost of block table lookups is higher for long sequences (more lookups per attention computation), and fragmentation matters less when sequences are long.

Memory Watermark and Preemption

Even with PagedAttention, GPU memory is finite. When all blocks are allocated, new requests cannot be admitted and existing requests cannot grow. vLLM handles this through a watermark mechanism and preemption policies.

The Watermark

The block allocator maintains a configurable watermark — a minimum number of free blocks that must be available at all times. This serves two purposes:

  1. Admission control: New requests are only admitted if enough free blocks exist to hold at least their prompt tokens plus the watermark reserve.
  2. Growth headroom: Active requests that need new blocks (sequence growing past a block boundary) must always succeed. The watermark ensures blocks are available for in-progress requests.
class WatermarkViolation(Exception):
    """Raised when an allocation would dip into the watermark reserve."""

class WatermarkBlockAllocator(BlockAllocator):
    def __init__(self, num_blocks: int, block_size: int,
                 watermark_fraction: float = 0.01):
        super().__init__(num_blocks, block_size)
        self.watermark = int(num_blocks * watermark_fraction)

    def can_admit(self, num_blocks_needed: int) -> bool:
        """Check if a new request can be admitted."""
        return self.num_free_blocks() - num_blocks_needed >= self.watermark

    def allocate_or_preempt(self) -> int:
        """Allocate a block, triggering preemption if at or below watermark."""
        if self.num_free_blocks() <= self.watermark:
            raise WatermarkViolation("Below watermark — preempt a request")
        return self.allocate()

A typical watermark is 1% of total blocks. On a system with 5000 blocks, that is 50 blocks reserved — enough to handle short-term allocation bursts without triggering preemption unnecessarily.

Preemption Strategies

When memory usage exceeds the watermark, vLLM must preempt one or more running requests to free blocks. The scheduler selects the lowest-priority request (typically the most recently arrived, following FCFS priority). Two preemption strategies exist:

Strategy 1: Swap to CPU memory

The preempted request’s KV cache blocks are copied from GPU HBM to CPU DRAM via PCIe. The blocks are freed on the GPU. When the request is later resumed, its KV blocks are copied back from CPU to GPU.

  • Cost: PCIe transfer latency. On PCIe Gen4 x16 (~25 GB/s effective), swapping 100 blocks of 5 MB each = 500 MB takes ~20 ms each way.
  • Advantage: The request resumes exactly where it left off. No recomputation needed.
  • Best for: Long sequences where recomputation would be expensive.

Strategy 2: Recompute (discard and re-prefill)

The preempted request’s KV blocks are simply freed. When the request is resumed, it must re-run the prefill phase over all its prompt + previously generated tokens to reconstruct the KV cache.

  • Cost: Recomputation latency. Prefilling 500 tokens on Llama-70B takes ~80-150 ms depending on batch size.
  • Advantage: No CPU memory needed. No PCIe bandwidth consumed.
  • Best for: Short sequences where recomputation is fast.

When to Swap vs Recompute

The decision depends on the sequence length at the time of preemption:

$$\text{Swap cost} = \frac{2 \times \text{num\_blocks} \times \text{block\_bytes}}{\text{PCIe bandwidth}}$$

$$\text{Recompute cost} = \frac{\text{seq\_length}}{\text{prefill\_throughput}}$$

For Llama-70B on A100 with PCIe Gen4:

Preemption Strategy Comparison (Llama-70B, A100)

| Seq Length at Preemption | Swap Cost (ms) | Recompute Cost (ms) | Better Strategy |
|---|---|---|---|
| 128 tokens | 3.2 | 12 | Swap |
| 256 tokens | 6.4 | 24 | Swap |
| 512 tokens | 12.8 | 48 | Swap |
| 1024 tokens | 25.6 | 95 | Swap |
| 2048 tokens | 51.2 | 190 | Swap |
| 4096 tokens | 102.4 | 380 | Swap |

Note: Swap cost = round-trip (out + in). Recompute cost = full prefill of prompt + generated tokens.

For these model sizes, swapping is almost always cheaper than recomputation because PCIe bandwidth is much faster than prefill computation. However, swapping requires CPU memory headroom (potentially gigabytes for many preempted requests), and it consumes PCIe bandwidth that could be used for other transfers. In memory-constrained CPU environments, recomputation may be the only option.
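The swap-vs-recompute decision follows directly from the two cost formulas. A sketch of the crossover logic, where the bandwidth and prefill-rate constants are assumptions chosen to reproduce the table above:

```python
def preemption_strategy(seq_len: int,
                        block_size: int = 16,
                        block_bytes: int = 5_000_000,      # ~5 MB/block, Llama-70B FP16
                        pcie_bytes_per_s: float = 25e9,    # PCIe Gen4 x16, effective
                        prefill_tok_per_ms: float = 10.8   # assumed A100 prefill rate
                        ) -> tuple:
    """Return (strategy, swap_ms, recompute_ms) for a preempted request."""
    num_blocks = -(-seq_len // block_size)  # ceiling division
    # Swap is a round trip: blocks out to CPU now, back to GPU on resume
    swap_ms = 2 * num_blocks * block_bytes / pcie_bytes_per_s * 1e3
    # Recompute re-runs prefill over every prompt + generated token
    recompute_ms = seq_len / prefill_tok_per_ms
    strategy = "swap" if swap_ms < recompute_ms else "recompute"
    return strategy, swap_ms, recompute_ms

strategy, swap_ms, recompute_ms = preemption_strategy(1024)
print(strategy, round(swap_ms, 1), round(recompute_ms))  # swap 25.6 95
```

With these constants, swap wins at every sequence length in the table; recompute only becomes competitive when PCIe bandwidth is shared or CPU memory is exhausted.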

⚠️ Preemption Cascades

If the system is persistently overloaded, preemption can cascade: the resumed request preempts another, which is later resumed and preempts yet another. This thrashing behavior — directly analogous to OS page thrashing — devastates throughput. vLLM mitigates this by reducing the admission rate when preemption frequency exceeds a threshold.

Prefix Caching: Sharing KV Blocks Across Requests

In many production deployments, multiple requests share the same system prompt. A chatbot application might prepend the same 500-token system prompt to every user message. Without sharing, each request computes and stores its own copy of the system prompt’s KV cache — pure waste.

Hash-Based Prefix Deduplication

PagedAttention enables prefix caching by recognizing that KV cache blocks are a deterministic function of their input tokens. Two requests with identical prefix tokens will produce identical KV cache blocks. vLLM exploits this with a hash-based deduplication scheme:

  1. Hash computation: For each block, compute a hash of the token IDs it contains (plus the block’s position in the sequence, to account for positional encodings). The hash key is: hash(token_ids, block_position, model_id).
  2. Cache lookup: Before allocating a new block, check if a block with the same hash already exists in the block pool.
  3. Cache hit: If found, increment the block’s reference count and add it to the new request’s block table. No KV computation needed for this block.
  4. Cache miss: Allocate a new block, compute the KV data, and insert it into the hash table.
from collections import defaultdict
from typing import Dict, Tuple
import hashlib

class PrefixCacheManager:
    def __init__(self, allocator: BlockAllocator, block_size: int):
        self.allocator = allocator
        self.block_size = block_size
        # hash -> physical_block_id
        self.cache: Dict[str, int] = {}
        self.ref_counts: Dict[int, int] = defaultdict(int)

    def _compute_hash(self, token_ids: Tuple[int, ...],
                      block_position: int) -> str:
        """Deterministic hash for a block's content."""
        data = f"{token_ids}:{block_position}".encode()
        return hashlib.sha256(data).hexdigest()[:16]

    def get_or_allocate(self, token_ids: Tuple[int, ...],
                        block_position: int) -> Tuple[int, bool]:
        """
        Returns (physical_block_id, cache_hit).
        If cache_hit is True, KV computation can be skipped for this block.
        """
        h = self._compute_hash(token_ids, block_position)

        if h in self.cache:
            block_id = self.cache[h]
            self.ref_counts[block_id] += 1
            return block_id, True  # Cache hit - skip KV computation

        # Cache miss - allocate new block
        block_id = self.allocator.allocate()
        if block_id is None:
            # Evict least-recently-used cached block
            block_id = self._evict_lru()
        self.cache[h] = block_id
        self.ref_counts[block_id] = 1
        return block_id, False  # Cache miss - must compute KV

    def release(self, block_id: int, token_ids: Tuple[int, ...],
                block_position: int):
        """Decrement ref count. Block stays in cache for future reuse."""
        self.ref_counts[block_id] -= 1
        # Note: we do NOT free the block or remove from cache.
        # It remains available for future requests with the same prefix.
        # Only evicted under memory pressure.

    def _evict_lru(self) -> int:
        """Evict an unreferenced cached block (ref_count == 0).
        Scans in insertion order, a cheap approximation of LRU."""
        for h, block_id in list(self.cache.items()):
            if self.ref_counts[block_id] == 0:
                del self.cache[h]
                del self.ref_counts[block_id]
                return block_id
        raise MemoryError("All cached blocks are actively referenced")

Memory Savings in Production

The savings depend on the ratio of shared prefix length to total sequence length:

$$\text{Savings} = 1 - \frac{1 + (N-1) \times \frac{\text{unique\_len}}{\text{total\_len}}}{N}$$

where $N$ is the number of concurrent requests with the same prefix.
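As a worked example, the formula can be evaluated directly. The chatbot numbers (a 500-token shared system prompt, 200-token responses, 64 concurrent requests) come from the scenarios discussed in this section:

```python
def prefix_cache_savings(n_requests: int, shared_len: int, unique_len: int) -> float:
    """Fraction of KV cache memory saved by deduplicating a shared prefix."""
    total_len = shared_len + unique_len
    # Without caching: n_requests * total_len tokens of KV.
    # With caching: one shared prefix + n_requests * unique_len.
    return 1 - (1 + (n_requests - 1) * unique_len / total_len) / n_requests

# Chatbot: 500-token system prompt shared across 64 requests, 200-token responses
print(f"{prefix_cache_savings(64, 500, 200):.0%}")  # 70%
```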

Prefix Caching Memory Savings (Llama-70B)

| Scenario | System Prompt | Avg Response | Concurrent Reqs | Without Caching | With Caching | Savings |
|---|---|---|---|---|---|---|
| Chatbot | 500 tokens | 200 tokens | 64 | 28.0 GB | 8.4 GB | 70% |
| RAG Pipeline | 2000 tokens | 300 tokens | 32 | 23.5 GB | 4.7 GB | 80% |
| Code Assistant | 800 tokens | 500 tokens | 48 | 20.0 GB | 12.0 GB | 40% |
| Translation | 100 tokens | 400 tokens | 64 | 20.5 GB | 16.4 GB | 20% |
| Mixed (no shared prefix) | 0 tokens | 500 tokens | 64 | 10.0 GB | 10.0 GB | 0% |

Note: Memory values are KV cache only. Savings = 0% when no prefix is shared.

For chatbot workloads with substantial system prompts, prefix caching saves 30-80% of KV cache memory. This translates directly to higher batch sizes and throughput. In RAG (Retrieval-Augmented Generation) pipelines where the retrieved context is often repeated across related queries, savings can exceed 80%.

Prefix Caching + PagedAttention = Compounding Gains

Prefix caching and PagedAttention address different sources of waste. PagedAttention eliminates fragmentation (wasted space within allocations). Prefix caching eliminates redundancy (duplicate data across allocations). Together, they can reduce KV cache memory by 80-90% compared to traditional allocation without sharing.
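As a rough sanity check on the compounding claim (illustrative arithmetic using figures from this article, not a new measurement): because the two techniques remove different waste, their savings multiply on the remaining fractions rather than add.

```python
frag_savings = 0.68    # fragmentation eliminated by paging (figure from this article)
dedup_savings = 0.70   # prefix dedup savings, chatbot scenario
# Each technique shrinks what the other left behind
combined = 1 - (1 - frag_savings) * (1 - dedup_savings)
print(f"{combined:.0%}")  # 90%
```

The result, 90%, lands inside the 80-90% range cited above.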

Performance Results: End-to-End Throughput

Let us put it all together and examine end-to-end throughput with all PagedAttention optimizations enabled.

Throughput vs Batch Size

Throughput: Traditional Allocation vs PagedAttention (A100-80GB, Llama-70B)

| Batch Size | Traditional (tok/s) | PagedAttention (tok/s) | PagedAttention + Prefix Cache (tok/s) |
|---|---|---|---|
| 1 | 142 | 138 | 138 |
| 4 | 548 | 531 | 531 |
| 8 | 1,047 | 998 | 998 |
| 16 | 1,247 | 1,198 | 1,198 |
| 32 | OOM | 2,891 | 2,891 |
| 48 | OOM | 3,891 | 4,120 |
| 64 | OOM | 5,234 | 5,890 |
| 96 | OOM | 6,102 | 7,245 |
| 128 | OOM | 6,380 | 8,150 |

Key observations:

  • At low batch sizes (1-8), PagedAttention is slightly slower (~3-5%) due to the indirection overhead. Both approaches have plenty of memory headroom.
  • The traditional approach hits OOM at batch size 32 on this configuration. PagedAttention continues scaling.
  • PagedAttention achieves 4.2x peak throughput over the traditional approach (5234 vs 1247 tok/s).
  • Adding prefix caching (with a shared 500-token system prompt) pushes throughput to 6.5x at high batch sizes.

Memory Utilization Comparison

GPU Memory Utilization vs Concurrent Requests

| Concurrent Requests | Traditional Allocated (%) | Traditional Actually Used (%) | PagedAttention Allocated (%) | PagedAttention Actually Used (%) |
|---|---|---|---|---|
| 10 | 98 | 31 | 15 | 14 |
| 20 | 98 | 31 | 30 | 29 |
| 40 | 98 | 31 | 58 | 56 |
| 60 | 98 | 31 | 82 | 79 |
| 80 | 98 | 31 | 92 | 89 |
| 100 | 98 | 31 | 96 | 93 |
| 120 | 98 | 31 | 98 | 95 |

The gap between “allocated” and “actually used” tells the story. For the traditional approach, 67% of allocated memory is wasted regardless of load. For PagedAttention, the gap between allocated and used is <3% at all load levels — only the last-block internal fragmentation.

End-to-End Performance Summary (A100-80GB, Llama-70B, ShareGPT workload)

| Metric | Traditional | PagedAttention | PA + Prefix Cache | Improvement |
|---|---|---|---|---|
| Max Batch Size | 16-24 | 64-96 | 96-128 | 4-6x |
| Peak Throughput (tok/s) | 1,247 | 5,234 | 8,150 | 4.2-6.5x |
| Memory Utilization | 31% | 95% | 96% | 3.1x |
| P50 Latency (ms/tok) | 42 | 48 | 48 | 0.87x (worse) |
| P99 Latency (ms/tok) | 89 | 148 | 142 | 0.60x (worse) |
| Throughput per GPU-$ | 1x | 4.2x | 6.5x | 4.2-6.5x |
Note: P50/P99 latency is per-token decode latency under full load. Latency increases because more requests are batched.

The per-token latency increases under PagedAttention, but this is because the system is serving 4-6x more concurrent requests. On a per-request basis, the latency is similar — the requests that previously waited in a queue are now being served concurrently, improving their total completion time.

When PagedAttention Overhead Is Not Worth It

PagedAttention is not universally beneficial. There are scenarios where the indirection overhead costs more than the memory savings provide:

1. Very Short Sequences (<64 tokens)

When sequences are short, the traditional approach wastes less memory (since actual_length is closer to max_seq_len if max_seq_len is also set low). Meanwhile, the PagedAttention overhead (block table lookups, scattered memory access) still applies. If your workload consists entirely of short prompt-response pairs with max_seq_len < 128, the traditional approach may be faster.

2. Single-Request Inference

If you are serving a single user at a time (e.g., a personal GPU running a local model), there is no batching benefit from saving memory. PagedAttention’s throughput advantage comes from fitting more requests — but if there is only one request, the 4-15% attention overhead is pure cost with no compensating benefit.

3. Latency-Critical Single-User Scenarios

Applications requiring absolute minimum per-token latency for a single stream (e.g., real-time voice synthesis, interactive code completion with aggressive latency SLAs) should avoid PagedAttention overhead. The contiguous memory layout enables faster attention kernel execution and more aggressive kernel fusion.

4. Models with Very Small KV Caches

For small models (e.g., <1B parameters) where the KV cache is a tiny fraction of GPU memory, the memory savings from PagedAttention are negligible. You can fit hundreds of concurrent requests with traditional allocation on modern GPUs. The complexity of block management is not justified.

5. Fixed-Length Workloads

If all requests have exactly the same length (e.g., fixed-format classification tasks), there is no internal fragmentation in the traditional approach. PagedAttention adds overhead without removing any waste.

When to Use PagedAttention

| Scenario | Avg Seq Len | Concurrent Reqs | PagedAttention Benefit | Recommendation |
|---|---|---|---|---|
| Multi-user chatbot | 200-500 | 50-200 | 3-6x throughput | Always use |
| RAG pipeline | 1000-4000 | 20-100 | 2-4x throughput | Always use |
| Batch processing | 500-2000 | 32-128 | 2-3x throughput | Always use |
| Single-user local | 200-1000 | 1 | -5% to 0% | Skip |
| Short classification | 32-64 | 10-50 | 0-10% throughput | Marginal |
| Real-time voice gen | 100-500 | 1 | -15% latency | Skip |

The rule of thumb: if you are serving more than ~4 concurrent requests with variable-length sequences, PagedAttention is almost certainly worth it. If you are serving a single user or dealing with very short fixed-length sequences, the overhead may not be justified.

Implementation Details: The Attention Kernel

For completeness, let us examine the structure of the PagedAttention CUDA kernel — the code that actually executes on the GPU and must handle the block table indirection on every attention computation.

Kernel Structure

// Simplified PagedAttention V2 kernel (vLLM)
// Two-pass approach: first compute partial results per block,
// then reduce across blocks.

template <int BLOCK_SIZE, int HEAD_DIM, int NUM_THREADS>
__global__ void paged_attention_v2_kernel(
    const half* __restrict__ q,              // [num_seqs, num_heads, head_dim]
    const half* __restrict__ k_cache,        // [num_blocks, num_kv_heads, block_size, head_dim]
    const half* __restrict__ v_cache,        // [num_blocks, num_kv_heads, block_size, head_dim]
    const int* __restrict__ block_tables,    // [num_seqs, max_blocks_per_seq]
    const int* __restrict__ context_lens,    // [num_seqs]
    half* __restrict__ output,               // [num_seqs, num_heads, head_dim]
    float* __restrict__ exp_sums,            // [num_seqs, num_heads, num_partitions]
    float* __restrict__ max_logits,          // [num_seqs, num_heads, num_partitions]
    float* __restrict__ partial_out,         // [num_seqs, num_heads, num_partitions, head_dim]
    float scale,
    int max_blocks_per_seq
) {
    const int seq_idx = blockIdx.x;
    const int head_idx = blockIdx.y;
    const int partition_idx = blockIdx.z;  // Each partition handles a subset of KV blocks
    // GQA mapping: num_heads / num_kv_heads query heads share each KV head
    const int kv_head_idx = head_idx / NUM_QUERIES_PER_KV;

    const int context_len = context_lens[seq_idx];
    const int num_kv_blocks = (context_len + BLOCK_SIZE - 1) / BLOCK_SIZE;

    // Load query vector into registers
    float q_reg[HEAD_DIM / NUM_THREADS];
    // ... load from global memory ...

    float local_max = -INFINITY;
    float local_sum = 0.0f;
    float local_out[HEAD_DIM / NUM_THREADS] = {0};

    // Iterate over KV blocks assigned to this partition
    for (int block_offset = partition_idx; block_offset < num_kv_blocks;
         block_offset += gridDim.z) {

        // === BLOCK TABLE LOOKUP (the indirection cost) ===
        int physical_block_id = block_tables[
            seq_idx * max_blocks_per_seq + block_offset
        ];

        // === SCATTERED HBM READ (non-contiguous access) ===
        // Load K block from potentially distant HBM location
        // physical_block_id determines the HBM address
        const half* k_ptr = k_cache
            + physical_block_id * (NUM_KV_HEADS * BLOCK_SIZE * HEAD_DIM)
            + kv_head_idx * (BLOCK_SIZE * HEAD_DIM);

        // Compute QK^T for this block
        float scores[BLOCK_SIZE];
        for (int t = 0; t < BLOCK_SIZE; t++) {
            float dot = 0.0f;
            for (int d = threadIdx.x; d < HEAD_DIM; d += NUM_THREADS) {
                dot += (float)q_reg[d / NUM_THREADS] * (float)k_ptr[t * HEAD_DIM + d];
            }
            // (cross-thread reduction of the partial dot products elided)
            scores[t] = dot * scale;
        }

        // Online softmax + value accumulation
        // ... (standard online softmax algorithm) ...
    }

    // Write partial results for reduction
    // ... (store local_max, local_sum, local_out) ...
}

The two key performance-critical points are marked: the block table lookup and the scattered HBM read. These are where the PagedAttention overhead manifests.

V1 vs V2 Kernel

vLLM implements two versions of the PagedAttention kernel:

  • V1: Single-pass. One CUDA block per (sequence, head). Simple but limited parallelism for long sequences — a single CUDA block must iterate over all KV blocks.
  • V2: Two-pass with partitioning. The KV blocks are partitioned across multiple CUDA blocks (the partition_idx dimension). Each partition computes partial attention results, which are then reduced. This enables much better GPU utilization for long sequences.

V2 is used when the context length exceeds a threshold (typically 512 tokens). The partitioning adds a small overhead for the reduction step but enables much higher parallelism.
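The V2 reduction step is a log-sum-exp merge of the per-partition statistics (max_logits, exp_sums, partial outputs). A NumPy sketch, with shapes simplified to a single head and the 1/√d scale omitted, shows the merge reproducing single-pass attention exactly:

```python
import numpy as np

def partial_attention(q, k, v):
    """One partition's pass: return (max_logit, exp_sum, unnormalized output)."""
    logits = k @ q                      # [tokens_in_partition]
    m = logits.max()
    e = np.exp(logits - m)              # numerically safe exponentials
    return m, e.sum(), e @ v            # output weighted by unnormalized softmax

def merge_partitions(parts):
    """Reduce per-partition (max, sum, out) triples: a log-sum-exp merge."""
    m_global = max(m for m, _, _ in parts)
    total, out = 0.0, 0.0
    for m, s, o in parts:
        scale = np.exp(m - m_global)    # rescale partition stats to the global max
        total += s * scale
        out = out + o * scale
    return out / total

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
k = rng.standard_normal((32, 8))
v = rng.standard_normal((32, 8))

# Single-pass reference: full softmax attention over all 32 KV tokens
w = np.exp(k @ q - (k @ q).max())
ref = (w / w.sum()) @ v

# Two partitions of 16 KV tokens each, merged as the V2 kernel would
merged = merge_partitions([partial_attention(q, k[:16], v[:16]),
                           partial_attention(q, k[16:], v[16:])])
assert np.allclose(ref, merged)  # two-pass result matches the reference
```

Because the merge only needs three scalars (plus the partial output vector) per partition, the reduction cost is tiny relative to the parallelism gained on long contexts.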

Putting It All Together: Memory Budget Calculation

When deploying vLLM in production, you need to calculate the memory budget to determine how many concurrent requests your GPU can serve:

$$\text{KV pool size} = \text{GPU HBM} - \text{model weights} - \text{activations} - \text{overhead}$$

$$\text{Num blocks} = \frac{\text{KV pool size}}{\text{block size in bytes}}$$

$$\text{Block size in bytes} = \text{block\_size} \times 2 \times L \times H_{kv} \times d \times \text{dtype\_size}$$

$$\text{Max concurrent tokens} = \text{Num blocks} \times \text{block\_size}$$

For Llama-70B FP16 on A100-80GB:

  • GPU HBM: 80 GB
  • Model weights: ~35 GB
  • Activations + overhead: ~3 GB
  • KV pool: 80 − 35 − 3 = 42 GB
  • Block bytes: 16 × 2 × 80 × 8 × 128 × 2 = 5,242,880 bytes ≈ 5 MB per block
  • Number of blocks: 42,000 MB / 5 MB = 8,400 blocks
  • Max concurrent tokens: 8,400 × 16 = 134,400 tokens

With an average sequence length of 500 tokens, that is 134,400 / 500 ≈ 268 concurrent requests, compared to 42,000 / 640 ≈ 65 with traditional allocation. A 4.1x improvement in concurrent request capacity.
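The same budget can be scripted for other models. This sketch uses the exact 5,242,880-byte block rather than the rounded 5 MB, so the block count lands slightly below the rounded arithmetic above:

```python
def kv_budget(hbm_gb: float, weights_gb: float, overhead_gb: float,
              layers: int, kv_heads: int, head_dim: int,
              block_size: int = 16, dtype_bytes: int = 2) -> tuple:
    """Return (num_blocks, max_concurrent_tokens) for the KV block pool."""
    pool_bytes = (hbm_gb - weights_gb - overhead_gb) * 1e9
    # block_size tokens * (K and V) * layers * KV heads * head dim * dtype width
    block_bytes = block_size * 2 * layers * kv_heads * head_dim * dtype_bytes
    num_blocks = int(pool_bytes // block_bytes)
    return num_blocks, num_blocks * block_size

# Llama-70B FP16 on A100-80GB: 80 layers, 8 KV heads, head_dim 128
blocks, tokens = kv_budget(80, 35, 3, layers=80, kv_heads=8, head_dim=128)
print(blocks, tokens)  # 8010 128160
```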

Production Sizing Rule of Thumb

For FP16 70B-class models, each KV token costs roughly 0.32 MB, so divide the available KV pool (in MB) by 0.32 to estimate token capacity (about 3,100 tokens per GB). With PagedAttention, budget for actual average sequence lengths; without it, budget for max sequence lengths. The ratio between the two is your throughput multiplier.

Conclusion

PagedAttention is fundamentally a memory management technique, not an attention algorithm improvement. It does not change the mathematical computation of attention — it changes how and where the inputs to that computation are stored. By borrowing the virtual memory abstraction from operating systems and applying it to GPU KV cache management, it transforms LLM serving economics.

The key insights:

  1. The problem was memory management, not compute: Traditional KV cache allocation wasted 68% of GPU memory on average. This directly limited batch sizes and throughput.

  2. OS virtual memory maps perfectly: Physical blocks in GPU HBM, virtual block tables per sequence, demand paging as tokens are generated, copy-on-write for beam search — every OS concept has a direct analog.

  3. The indirection cost is real but small: 4-15% attention kernel overhead, depending on sequence length. This is overwhelmed by the 2-4x throughput improvement from higher memory utilization.

  4. Block size 16 is the sweet spot: Balances internal fragmentation (~0.4% of pool), page table overhead, and kernel efficiency. Increase to 32 for long-context workloads.

  5. Advanced techniques compound: Copy-on-write for beam search, prefix caching for shared system prompts, and watermark-based preemption each add independent memory savings on top of the core paging mechanism.

  6. It is not always the right choice: Single-user inference, very short sequences, and latency-critical single-stream applications should use contiguous allocation.

PagedAttention is now the standard KV cache management strategy in production LLM serving. vLLM, TensorRT-LLM, and other major inference engines all implement some variant of it. Understanding its mechanics — the trade-offs, the overhead sources, the configuration knobs — is essential for anyone deploying LLMs at scale.