The PagedAttention kernel is the single most important piece of CUDA code in vLLM. It is what makes paged KV cache — the virtual memory system for attention — actually work at GPU speeds. Without this kernel, the block table indirection would be prohibitively slow, and vLLM’s entire memory management architecture would collapse.

This post goes inside the kernel: how it maps thread blocks to attention heads, how it iterates over KV cache blocks via the block table, how it computes online softmax with non-contiguous memory, and why the v2 kernel is faster than v1 for long sequences.

Two Attention Paths in vLLM

vLLM uses two completely different attention implementations depending on the phase:


vLLM's Two Attention Paths

| Property | Prefill Path | Decode Path |
|---|---|---|
| Kernel | FlashAttention-2 | PagedAttention v1/v2 |
| KV layout | Contiguous tensor | Non-contiguous blocks via block table |
| Q tokens per call | Full prompt (hundreds to thousands) | 1 per sequence |
| Bottleneck | Compute (large GEMMs) | Memory bandwidth (reading KV) |
| Why this kernel | Maximum throughput for batch GEMM | Must handle paged memory layout |

Why two paths? During prefill, all KV entries are freshly computed and stored contiguously — FlashAttention’s tiled algorithm works perfectly. During decode, KV entries were accumulated over many iterations and stored in scattered blocks allocated on-demand by the block manager. A contiguous-memory kernel cannot read these blocks. The PagedAttention kernel can.

The Block Table: Indirection Layer

Each sequence has a block table stored in GPU memory:

block_table[seq_idx] = [block_ptr_0, block_ptr_1, block_ptr_2, ...]

Block i holds KV cache for token positions [i·B, (i+1)·B), where B is the block size (default 16 tokens). To find the KV for token t:

block_idx = t // block_size          # Which block?
block_offset = t % block_size        # Where within the block?
physical_ptr = block_table[seq][block_idx]  # Physical location in HBM
kv_address = physical_ptr + block_offset * head_dim * sizeof(dtype)

This indirection — reading the block table to find the physical address, then reading the actual KV data — is pointer chasing. It adds latency because the GPU must complete one memory read before it can issue the next. This is the fundamental cost of paging.
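As a quick sanity check of the arithmetic above, in plain Python with the default 16-token block size:

```python
block_size = 16  # vLLM default

# Token 37 lives in the third block (index 2), sixth slot (offset 5)
t = 37
block_idx = t // block_size
block_offset = t % block_size
print(block_idx, block_offset)  # 2 5
```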

KV Cache Block Layout in GPU HBM

  • Block Pool: contiguous pool of fixed-size blocks; each block holds [block_size, num_kv_heads, head_dim] per K and per V
  • Block Table: per-sequence mapping array of shape [max_num_seqs, max_num_blocks_per_seq] holding block pointers
  • Free List: stack of available block indices; O(1) alloc and free
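The free list's O(1) behavior is easy to see in a toy allocator (a sketch of the scheme described above, not vLLM's actual allocator class):

```python
class BlockAllocator:
    """Toy free-list allocator: a stack of block indices gives
    O(1) allocate and O(1) free."""

    def __init__(self, num_blocks: int):
        self.free_list = list(range(num_blocks))  # every block starts free

    def allocate(self) -> int:
        if not self.free_list:
            raise MemoryError("KV cache exhausted: no free blocks")
        return self.free_list.pop()   # O(1) pop from the stack top

    def free(self, block: int) -> None:
        self.free_list.append(block)  # O(1) push back onto the stack
```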

PagedAttention v1: The Core Algorithm

The v1 kernel processes one query head for one sequence per thread block. Here is the algorithm:

CUDA kernel: paged_attention_v1
  Grid: [num_heads, num_seqs, 1]
  Block: [NUM_THREADS]  (typically 128 or 256)

  For each (head_idx, seq_idx):
    1. Load query vector q[head_dim] into registers
    2. Initialize: max_score = -INF, sum_exp = 0, output = zeros(head_dim)

    3. For block_idx in 0..num_blocks[seq_idx]:
       a. physical_block = block_table[seq_idx][block_idx]
       b. For token in block:
          - Load K[token] from physical_block location
          - Compute score = dot(q, K[token]) / sqrt(head_dim)
          - Online softmax update:
            new_max = max(max_score, score)
            correction = exp(max_score - new_max)
            sum_exp = sum_exp * correction + exp(score - new_max)
            output = output * correction + exp(score - new_max) * V[token]
            max_score = new_max

    4. output = output / sum_exp  # Final normalization
    5. Write output to global memory

The key operations within each thread block:

  • Threads cooperate to load K and V vectors (head_dim = 128 elements per vector, distributed across threads)
  • Warp-level reductions compute the dot product (parallel sum across 32 threads)
  • Shared memory stores partial scores and the running softmax state
  • Online softmax (same algorithm as FlashAttention) allows processing blocks sequentially without seeing all scores first
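The loop above can be run as plain Python to confirm that the online softmax produces exactly the standard softmax result. This is a scalar reference sketch; the real kernel distributes each vector across threads:

```python
import math

def paged_attention_v1_ref(q, k_blocks, v_blocks):
    """Scalar reference of the v1 loop: online softmax over KV blocks
    visited one at a time, never holding all scores at once.
    k_blocks/v_blocks stand in for the block-table indirection."""
    head_dim = len(q)
    scale = 1.0 / math.sqrt(head_dim)
    max_score, sum_exp = float("-inf"), 0.0
    output = [0.0] * head_dim
    for k_block, v_block in zip(k_blocks, v_blocks):
        for k, v in zip(k_block, v_block):
            score = sum(qi * ki for qi, ki in zip(q, k)) * scale
            new_max = max(max_score, score)
            correction = math.exp(max_score - new_max)  # rescale old state
            w = math.exp(score - new_max)
            sum_exp = sum_exp * correction + w
            output = [o * correction + w * vi for o, vi in zip(output, v)]
            max_score = new_max
    return [o / sum_exp for o in output]  # final normalization
```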

GQA Mapping in the Kernel

With GQA (e.g., 64 query heads sharing 8 KV heads), the kernel maps each query head to its KV group: kv_head_idx = head_idx / (num_heads / num_kv_heads). Multiple thread blocks (different query heads) read the same KV block simultaneously, benefiting from L2 cache hits.
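That mapping in plain Python, using the 64-query-head / 8-KV-head example from above:

```python
num_heads, num_kv_heads = 64, 8             # Llama-70B-style GQA
queries_per_kv = num_heads // num_kv_heads  # 8 query heads per KV head

# kv_head_idx = head_idx / (num_heads / num_kv_heads), as in the kernel
kv_head = [h // queries_per_kv for h in range(num_heads)]

assert kv_head[:8] == [0] * 8   # query heads 0-7 all read KV head 0
assert kv_head[8] == 1          # head 8 starts the next group
assert kv_head[63] == 7         # last group maps to the last KV head
```

Because eight thread blocks walk the same KV blocks, the later readers in a group frequently hit in L2 instead of going out to HBM.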

PagedAttention v2: Parallelizing Over Blocks

The v1 kernel processes all KV blocks sequentially within one thread block. For long sequences (thousands of blocks), this means one thread block does all the work — poor GPU utilization.

v2 fixes this by parallelizing across KV blocks:

CUDA kernel: paged_attention_v2
  Grid: [num_heads, num_seqs, NUM_PARTITIONS]
  Block: [NUM_THREADS]

  Phase 1: Each partition processes a subset of KV blocks
    partition_idx = blockIdx.z
    blocks_per_partition = ceil(num_blocks / NUM_PARTITIONS)
    start_block = partition_idx * blocks_per_partition
    end_block = min(start_block + blocks_per_partition, num_blocks)

    // Compute partial attention over [start_block, end_block)
    // Store: partial_max, partial_sum_exp, partial_output

  Phase 2: Reduce across partitions
    // One thread block combines all partitions' results
    // Using the online softmax rescaling trick
    for partition in partitions:
      correction = exp(partition.max - global_max)
      global_sum += partition.sum * correction
      global_output += partition.output * correction
    global_output /= global_sum
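Phase 2 is small enough to express directly. A host-side Python sketch of the reduction, assuming each partial record holds the running max, exp-sum, and unnormalized output from Phase 1:

```python
import math

def reduce_partitions(partials):
    """Combine per-partition (max, sum_exp, unnormalized output) records
    with the same rescaling trick the online softmax uses within v1."""
    global_max = max(p["max"] for p in partials)
    dim = len(partials[0]["output"])
    global_sum, global_out = 0.0, [0.0] * dim
    for p in partials:
        c = math.exp(p["max"] - global_max)   # rescale to the global max
        global_sum += p["sum"] * c
        global_out = [g + o * c for g, o in zip(global_out, p["output"])]
    return [g / global_sum for g in global_out]
```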

PagedAttention v1 vs v2 Performance

| Sequence Length | v1 Latency | v2 Latency | Speedup |
|---|---|---|---|
| 512 tokens | 0.8 ms | 0.9 ms | 0.9x (v1 faster: partition overhead) |
| 2048 tokens | 2.1 ms | 1.8 ms | 1.17x |
| 8192 tokens | 7.5 ms | 4.2 ms | 1.79x |
| 32768 tokens | 28.0 ms | 12.1 ms | 2.31x |
Note: Measured on A100-80GB, Llama 70B, batch=1. v2 benefits grow with sequence length due to better parallelism.

v2 is faster for long sequences because it uses more of the GPU’s SMs simultaneously. The partition reduction adds overhead, so v1 can be faster for very short sequences. vLLM chooses between the two at runtime with a heuristic based on sequence length and partition count.

The Indirection Cost

The block table lookup adds overhead compared to contiguous FlashAttention:

  1. Extra memory read: For each KV block, load block_table[seq][block_idx] (4-8 bytes) before loading the actual KV data
  2. Non-contiguous access: KV blocks can be anywhere in HBM, so the memory access pattern is scattered rather than sequential. This reduces L2 cache efficiency.
  3. Reduced fusion opportunities: FlashAttention’s contiguous layout enables aggressive tiling within SRAM. Paged access makes tiling harder.

Attention Latency: FlashAttention vs PagedAttention (Llama 70B, A100)

| Kernel | Latency (relative to FlashAttention) |
|---|---|
| FlashAttention (prefill, contiguous) | 100% (baseline) |
| PagedAttention v2 (decode, paged) | 112% (+12% overhead) |
| PagedAttention v1 (decode, paged) | 118% (+18% overhead) |

The 12-18% overhead is the “tax” for paged memory. vLLM pays this tax gladly because the paging enables 2-4x more concurrent sequences (less memory waste), which more than compensates for the per-sequence overhead.

ℹ️ Why Not Use FlashAttention for Decode Too?

FlashAttention requires contiguous K,V tensors. During decode, the KV cache has been accumulated across many iterations into scattered blocks. You could compact the blocks into a contiguous buffer before calling FlashAttention, but the compaction itself would cost more than the PagedAttention overhead. The paged kernel avoids this copy entirely.

Cache Operation Kernels

Beyond attention, vLLM has CUDA kernels for KV cache management:

reshape_and_cache: When a new token is generated, its K and V vectors must be written into the correct block at the correct offset:

// Write each new token's K and V vectors into the block pool at the
// (block, offset) slot assigned by the block manager. The layout here
// is simplified (flattened) relative to vLLM's actual cache layout.
__global__ void reshape_and_cache_kernel(
    const float* __restrict__ key,        // [num_tokens, num_heads, head_dim]
    const float* __restrict__ value,      // [num_tokens, num_heads, head_dim]
    float* __restrict__ key_cache,        // the block pool
    float* __restrict__ value_cache,      // the block pool
    const int* __restrict__ slot_mapping, // token index -> flat slot
    int block_size, int num_heads, int head_dim) {
  const int token_idx = blockIdx.x;       // one thread block per token
  const int slot = slot_mapping[token_idx];
  const int block_number = slot / block_size;
  const int block_offset = slot % block_size;
  // Each thread copies a strided subset of this token's K/V elements
  const int n = num_heads * head_dim;
  for (int i = threadIdx.x; i < n; i += blockDim.x) {
    const int dst = (block_number * block_size + block_offset) * n + i;
    key_cache[dst]   = key[token_idx * n + i];
    value_cache[dst] = value[token_idx * n + i];
  }
}
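The slot_mapping input is built on the host from each sequence's block table. A hypothetical sketch of that derivation (build_slot_mapping is illustrative, not vLLM's actual helper):

```python
def build_slot_mapping(block_tables, positions, block_size=16):
    """For each sequence, convert the position of the token being written
    into a flat slot index: physical_block * block_size + offset."""
    slots = []
    for seq_id, pos in positions.items():
        physical_block = block_tables[seq_id][pos // block_size]
        slots.append(physical_block * block_size + pos % block_size)
    return slots

# One sequence whose logical blocks 0 and 1 map to physical blocks 7 and 3;
# the next token goes at position 17, i.e. slot 3*16 + 1 = 49
print(build_slot_mapping({0: [7, 3]}, {0: 17}))  # [49]
```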

swap_blocks: Copy blocks between GPU and CPU for preemption:

// Async GPU->CPU copy for each block being swapped out
cudaMemcpyAsync(cpu_cache + dst_block * block_bytes,
                gpu_cache + src_block * block_bytes,
                block_bytes, cudaMemcpyDeviceToHost, stream);

copy_blocks: Duplicate blocks for copy-on-write (beam search):

// Copy one block's content to another block (same GPU)
cudaMemcpyAsync(gpu_cache + dst_block * block_bytes,
                gpu_cache + src_block * block_bytes,
                block_bytes, cudaMemcpyDeviceToDevice, stream);

These kernels are simpler than the attention kernel but still performance-critical — they execute on every iteration.

FlashInfer as Alternative Backend

FlashInfer is an alternative attention backend that vLLM can use. Key differences:


vLLM Native vs FlashInfer

| Feature | vLLM Native PagedAttention | FlashInfer |
|---|---|---|
| Language | Custom CUDA (csrc/) | CUDA + Triton hybrid |
| Decode kernel | v1/v2 paged attention | BatchDecodeWithPagedKVCache |
| Prefill kernel | Delegates to FlashAttention | Own prefill implementation |
| FP8 KV cache | Supported | Supported, with different quantization |
| Best for | General deployments | When FlashInfer's kernel tuning is better for your GPU |

FlashInfer can be faster on specific GPU architectures because it uses different tiling strategies and thread mappings. vLLM allows switching backends via configuration.

Performance Profile

Where does time go in the attention step during decode?

Decode Attention Time Breakdown (Llama 70B, B=32, S=2048, A100)

| Component | Share of attention time |
|---|---|
| KV cache reads from HBM (memory-bandwidth-bound) | 65% |
| QK dot products + softmax (compute) | 15% |
| AV multiply + accumulate (compute) | 12% |
| Block table lookup (indirection cost) | 5% |
| Output write (memory) | 3% |

KV cache reads dominate at 65% — this is the fundamental memory-bandwidth bottleneck of decode. The block table indirection adds only 5% overhead. The kernel’s performance is determined almost entirely by HBM bandwidth, which is why higher-bandwidth GPUs (H100 vs A100) directly translate to faster decode.
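A back-of-envelope calculation makes the bandwidth bound concrete. With hypothetical but representative numbers (a Llama-70B-like GQA configuration, FP16 KV cache, roughly 2 TB/s of HBM bandwidth on A100), just reading the KV cache sets a hard floor on per-step attention time:

```python
layers = 80         # transformer layers
kv_heads = 8        # GQA: 8 KV heads shared by 64 query heads
head_dim = 128
bytes_per_elem = 2  # FP16
batch, seq_len = 32, 2048

# K and V for every cached token must be read once per decode step
kv_bytes = batch * seq_len * layers * kv_heads * head_dim * 2 * bytes_per_elem
hbm_bandwidth = 2.0e12  # bytes/s, roughly A100-80GB class

floor_ms = kv_bytes / hbm_bandwidth * 1e3
print(f"{kv_bytes / 1e9:.1f} GB of KV reads -> at least {floor_ms:.1f} ms per step")
```

No amount of kernel tuning can push attention below this floor; only reading fewer bytes (quantization, GQA/MLA) or buying more bandwidth can.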

💡 The Performance Optimization Hierarchy for PagedAttention

To make PagedAttention faster: (1) More HBM bandwidth (hardware upgrade — biggest impact). (2) KV cache quantization to INT8/FP8 (2x less data to read — covered in Inference Timeline Part 2). (3) Fewer KV heads via GQA/MLA (less data per token — covered in Transformer Anatomy Part 6). (4) Kernel tuning (tile sizes, thread mapping — marginal). The kernel itself is already well-optimized; the bottleneck is fundamental physics.

Conclusion

The PagedAttention kernel is a masterclass in trading a small amount of compute overhead (block table indirection) for a massive system-level benefit (zero-fragmentation KV cache memory). The 12-18% attention overhead enables 2-4x more concurrent requests through better memory utilization — a clear win.

Understanding this kernel explains why vLLM makes certain architectural choices: why prefill and decode use different attention paths, why block size is 16 tokens (balancing indirection overhead vs fragmentation), and why KV cache quantization has outsized impact (reducing the dominant memory-bandwidth cost).

The three posts in this series give you a complete picture of vLLM’s internals: the architecture (Part 1), the scheduling brain (Part 2), and the attention muscle (this post). With this understanding, you can tune, debug, and extend vLLM with confidence.