The PagedAttention kernel is the single most important piece of CUDA code in vLLM. It is what makes paged KV cache — the virtual memory system for attention — actually work at GPU speeds. Without this kernel, the block table indirection would be prohibitively slow, and vLLM’s entire memory management architecture would collapse.

This post goes inside the kernel: how it maps thread blocks to attention heads, how it iterates over KV cache blocks via the block table, how it computes online softmax with non-contiguous memory, and why the v2 kernel is faster than v1 for long sequences.

Two Attention Paths in vLLM

vLLM uses two completely different attention implementations depending on the phase:


vLLM's Two Attention Paths

| Property | Prefill Path | Decode Path |
|---|---|---|
| Kernel | FlashAttention-2 | PagedAttention v1/v2 |
| KV layout | Contiguous tensor | Non-contiguous blocks via block table |
| Q tokens per call | Full prompt (hundreds to thousands) | 1 per sequence |
| Bottleneck | Compute (large GEMMs) | Memory bandwidth (reading KV) |
| Why this kernel | Maximum throughput for batch GEMM | Must handle paged memory layout |

Why two paths? During prefill, all KV entries are freshly computed and stored contiguously — FlashAttention’s tiled algorithm works perfectly. During decode, KV entries were accumulated over many iterations and stored in scattered blocks allocated on-demand by the block manager. A contiguous-memory kernel cannot read these blocks. The PagedAttention kernel can.

The Block Table: Indirection Layer

Each sequence has a block table stored in GPU memory:

block_table[seq_idx] = [block_ptr_0, block_ptr_1, block_ptr_2, ...]

Block i holds KV cache for token positions [i·B, (i+1)·B), where B is the block size (default 16 tokens). To find the KV for token t:

block_idx = t // block_size          # Which block?
block_offset = t % block_size        # Where within the block?
physical_ptr = block_table[seq][block_idx]  # Physical location in HBM
kv_address = physical_ptr + block_offset * head_dim * sizeof(dtype)

This indirection — reading the block table to find the physical address, then reading the actual KV data — is pointer chasing. It adds latency because the GPU must complete one memory read before it can issue the next. This is the fundamental cost of paging.
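As a quick sanity check of the arithmetic above, in plain Python with the default 16-token block size:

```python
block_size = 16  # vLLM default

# Token 37 lives in the third block (index 2), sixth slot (offset 5)
t = 37
block_idx = t // block_size
block_offset = t % block_size
print(block_idx, block_offset)  # 2 5
```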

KV Cache Block Layout in GPU HBM

  • Block Pool: contiguous pool of fixed-size blocks; each block holds [block_size, num_kv_heads, head_dim] per K and per V
  • Block Table: per-sequence mapping array of shape [max_num_seqs, max_num_blocks_per_seq] holding block pointers
  • Free List: stack of available block indices; O(1) alloc and free
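The free list's O(1) behavior is easy to see in a toy allocator (a sketch of the scheme described above, not vLLM's actual allocator class):

```python
class BlockAllocator:
    """Toy free-list allocator: a stack of block indices gives
    O(1) allocate and O(1) free."""

    def __init__(self, num_blocks: int):
        self.free_list = list(range(num_blocks))  # every block starts free

    def allocate(self) -> int:
        if not self.free_list:
            raise MemoryError("KV cache exhausted: no free blocks")
        return self.free_list.pop()   # O(1) pop from the stack top

    def free(self, block: int) -> None:
        self.free_list.append(block)  # O(1) push back onto the stack
```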

PagedAttention v1: The Core Algorithm

The v1 kernel processes one query head for one sequence per thread block. Here is the algorithm:

CUDA kernel: paged_attention_v1
  Grid: [num_heads, num_seqs, 1]
  Block: [NUM_THREADS]  (typically 128 or 256)

  For each (head_idx, seq_idx):
    1. Load query vector q[head_dim] into registers
    2. Initialize: max_score = -INF, sum_exp = 0, output = zeros(head_dim)

    3. For block_idx in 0..num_blocks[seq_idx]:
       a. physical_block = block_table[seq_idx][block_idx]
       b. For token in block:
          - Load K[token] from physical_block location
          - Compute score = dot(q, K[token]) / sqrt(head_dim)
          - Online softmax update:
            new_max = max(max_score, score)
            correction = exp(max_score - new_max)
            sum_exp = sum_exp * correction + exp(score - new_max)
            output = output * correction + exp(score - new_max) * V[token]
            max_score = new_max

    4. output = output / sum_exp  # Final normalization
    5. Write output to global memory

The key operations within each thread block:

  • Threads cooperate to load K and V vectors (head_dim = 128 elements per vector, distributed across threads)
  • Warp-level reductions compute the dot product (parallel sum across 32 threads)
  • Shared memory stores partial scores and the running softmax state
  • Online softmax (same algorithm as FlashAttention) allows processing blocks sequentially without seeing all scores first
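The loop above can be run as plain Python to confirm that the online softmax produces exactly the standard softmax result. This is a scalar reference sketch; the real kernel distributes each vector across threads:

```python
import math

def paged_attention_v1_ref(q, k_blocks, v_blocks):
    """Scalar reference of the v1 loop: online softmax over KV blocks
    visited one at a time, never holding all scores at once.
    k_blocks/v_blocks stand in for the block-table indirection."""
    head_dim = len(q)
    scale = 1.0 / math.sqrt(head_dim)
    max_score, sum_exp = float("-inf"), 0.0
    output = [0.0] * head_dim
    for k_block, v_block in zip(k_blocks, v_blocks):
        for k, v in zip(k_block, v_block):
            score = sum(qi * ki for qi, ki in zip(q, k)) * scale
            new_max = max(max_score, score)
            correction = math.exp(max_score - new_max)  # rescale old state
            w = math.exp(score - new_max)
            sum_exp = sum_exp * correction + w
            output = [o * correction + w * vi for o, vi in zip(output, v)]
            max_score = new_max
    return [o / sum_exp for o in output]  # final normalization
```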

GQA Mapping in the Kernel

With GQA (e.g., 64 query heads sharing 8 KV heads), the kernel maps each query head to its KV group: kv_head_idx = head_idx / (num_heads / num_kv_heads). Multiple thread blocks (different query heads) read the same KV block simultaneously, benefiting from L2 cache hits.
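That mapping in plain Python, using the 64-query-head / 8-KV-head example from above:

```python
num_heads, num_kv_heads = 64, 8             # Llama-70B-style GQA
queries_per_kv = num_heads // num_kv_heads  # 8 query heads per KV head

# kv_head_idx = head_idx / (num_heads / num_kv_heads), as in the kernel
kv_head = [h // queries_per_kv for h in range(num_heads)]

assert kv_head[:8] == [0] * 8   # query heads 0-7 all read KV head 0
assert kv_head[8] == 1          # head 8 starts the next group
assert kv_head[63] == 7         # last group maps to the last KV head
```

Because eight thread blocks walk the same KV blocks, the later readers in a group frequently hit in L2 instead of going out to HBM.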

PagedAttention v2: Parallelizing Over Blocks

The v1 kernel processes all KV blocks sequentially within one thread block. For long sequences (thousands of blocks), this means one thread block does all the work — poor GPU utilization.

v2 fixes this by parallelizing across KV blocks:

CUDA kernel: paged_attention_v2
  Grid: [num_heads, num_seqs, NUM_PARTITIONS]
  Block: [NUM_THREADS]

  Phase 1: Each partition processes a subset of KV blocks
    partition_idx = blockIdx.z
    blocks_per_partition = ceil(num_blocks / NUM_PARTITIONS)
    start_block = partition_idx * blocks_per_partition
    end_block = min(start_block + blocks_per_partition, num_blocks)

    // Compute partial attention over [start_block, end_block)
    // Store: partial_max, partial_sum_exp, partial_output

  Phase 2: Reduce across partitions
    // One thread block combines all partitions' results
    // Using the online softmax rescaling trick
    for partition in partitions:
      correction = exp(partition.max - global_max)
      global_sum += partition.sum * correction
      global_output += partition.output * correction
    global_output /= global_sum
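Phase 2 is small enough to express directly. A host-side Python sketch of the reduction, assuming each partial record holds the running max, exp-sum, and unnormalized output from Phase 1:

```python
import math

def reduce_partitions(partials):
    """Combine per-partition (max, sum_exp, unnormalized output) records
    with the same rescaling trick the online softmax uses within v1."""
    global_max = max(p["max"] for p in partials)
    dim = len(partials[0]["output"])
    global_sum, global_out = 0.0, [0.0] * dim
    for p in partials:
        c = math.exp(p["max"] - global_max)   # rescale to the global max
        global_sum += p["sum"] * c
        global_out = [g + o * c for g, o in zip(global_out, p["output"])]
    return [g / global_sum for g in global_out]
```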

PagedAttention v1 vs v2 Performance

| Sequence Length | v1 Latency | v2 Latency | Speedup |
|---|---|---|---|
| 512 tokens | 0.8 ms | 0.9 ms | 0.9x (v1 faster: partition overhead) |
| 2048 tokens | 2.1 ms | 1.8 ms | 1.17x |
| 8192 tokens | 7.5 ms | 4.2 ms | 1.79x |
| 32768 tokens | 28.0 ms | 12.1 ms | 2.31x |
Note: Measured on A100-80GB, Llama 70B, batch=1. v2 benefits grow with sequence length due to better parallelism.

v2 is faster for long sequences because it uses more of the GPU’s SMs simultaneously. The partition reduction adds overhead, so v1 can be faster for very short sequences. vLLM chooses between the two at runtime with a heuristic based on sequence length and partition count.

The Indirection Cost

The block table lookup adds overhead compared to contiguous FlashAttention:

  1. Extra memory read: For each KV block, load block_table[seq][block_idx] (4-8 bytes) before loading the actual KV data
  2. Non-contiguous access: KV blocks can be anywhere in HBM, so the memory access pattern is scattered rather than sequential. This reduces L2 cache efficiency.
  3. Reduced fusion opportunities: FlashAttention’s contiguous layout enables aggressive tiling within SRAM. Paged access makes tiling harder.

Attention Latency: FlashAttention vs PagedAttention (Llama 70B, A100)

| Kernel | Latency (relative to FlashAttention) |
|---|---|
| FlashAttention (prefill, contiguous) | 100% (baseline) |
| PagedAttention v2 (decode, paged) | 112% (+12% overhead) |
| PagedAttention v1 (decode, paged) | 118% (+18% overhead) |

The 12-18% overhead is the “tax” for paged memory. vLLM pays this tax gladly because the paging enables 2-4x more concurrent sequences (less memory waste), which more than compensates for the per-sequence overhead.

ℹ️ Why Not Use FlashAttention for Decode Too?

FlashAttention requires contiguous K,V tensors. During decode, the KV cache has been accumulated across many iterations into scattered blocks. You could compact the blocks into a contiguous buffer before calling FlashAttention, but the compaction itself would cost more than the PagedAttention overhead. The paged kernel avoids this copy entirely.

Cache Operation Kernels

Beyond attention, vLLM has CUDA kernels for KV cache management:

reshape_and_cache: When a new token is generated, its K and V vectors must be written into the correct block at the correct offset:

// Write each new token's K and V vectors into the block pool at the
// (block, offset) slot assigned by the block manager. The layout here
// is simplified (flattened) relative to vLLM's actual cache layout.
__global__ void reshape_and_cache_kernel(
    const float* __restrict__ key,        // [num_tokens, num_heads, head_dim]
    const float* __restrict__ value,      // [num_tokens, num_heads, head_dim]
    float* __restrict__ key_cache,        // the block pool
    float* __restrict__ value_cache,      // the block pool
    const int* __restrict__ slot_mapping, // token index -> flat slot
    int block_size, int num_heads, int head_dim) {
  const int token_idx = blockIdx.x;       // one thread block per token
  const int slot = slot_mapping[token_idx];
  const int block_number = slot / block_size;
  const int block_offset = slot % block_size;
  // Each thread copies a strided subset of this token's K/V elements
  const int n = num_heads * head_dim;
  for (int i = threadIdx.x; i < n; i += blockDim.x) {
    const int dst = (block_number * block_size + block_offset) * n + i;
    key_cache[dst]   = key[token_idx * n + i];
    value_cache[dst] = value[token_idx * n + i];
  }
}
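The slot_mapping input is built on the host from each sequence's block table. A hypothetical sketch of that derivation (build_slot_mapping is illustrative, not vLLM's actual helper):

```python
def build_slot_mapping(block_tables, positions, block_size=16):
    """For each sequence, convert the position of the token being written
    into a flat slot index: physical_block * block_size + offset."""
    slots = []
    for seq_id, pos in positions.items():
        physical_block = block_tables[seq_id][pos // block_size]
        slots.append(physical_block * block_size + pos % block_size)
    return slots

# One sequence whose logical blocks 0 and 1 map to physical blocks 7 and 3;
# the next token goes at position 17, i.e. slot 3*16 + 1 = 49
print(build_slot_mapping({0: [7, 3]}, {0: 17}))  # [49]
```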

swap_blocks: Copy blocks between GPU and CPU for preemption:

// Async GPU->CPU copy for each block being swapped out
cudaMemcpyAsync(cpu_cache + dst_block * block_bytes,
                gpu_cache + src_block * block_bytes,
                block_bytes, cudaMemcpyDeviceToHost, stream);

copy_blocks: Duplicate blocks for copy-on-write (beam search):

// Copy one block's content to another block (same GPU)
cudaMemcpyAsync(gpu_cache + dst_block * block_bytes,
                gpu_cache + src_block * block_bytes,
                block_bytes, cudaMemcpyDeviceToDevice, stream);

These kernels are simpler than the attention kernel but still performance-critical — they execute on every iteration.

FlashInfer as Alternative Backend

FlashInfer is an alternative attention backend that vLLM can use. Key differences:


vLLM Native vs FlashInfer

| Feature | vLLM Native PagedAttention | FlashInfer |
|---|---|---|
| Language | Custom CUDA (csrc/) | CUDA + Triton hybrid |
| Decode kernel | v1/v2 paged attention | BatchDecodeWithPagedKVCache |
| Prefill kernel | Delegates to FlashAttention | Own prefill implementation |
| FP8 KV cache | Supported | Supported, with different quantization |
| Best for | General deployments | When FlashInfer's kernel tuning is better for your GPU |

FlashInfer can be faster on specific GPU architectures because it uses different tiling strategies and thread mappings. vLLM allows switching backends via configuration.

Performance Profile

Where does time go in the attention step during decode?

Decode Attention Time Breakdown (Llama 70B, B=32, S=2048, A100)

| Component | Share of attention time |
|---|---|
| KV cache reads from HBM (memory-bandwidth-bound) | 65% |
| QK dot products + softmax (compute) | 15% |
| AV multiply + accumulate (compute) | 12% |
| Block table lookup (indirection cost) | 5% |
| Output write (memory) | 3% |

KV cache reads dominate at 65% — this is the fundamental memory-bandwidth bottleneck of decode. The block table indirection adds only 5% overhead. The kernel’s performance is determined almost entirely by HBM bandwidth, which is why higher-bandwidth GPUs (H100 vs A100) directly translate to faster decode.
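A back-of-envelope calculation makes the bandwidth bound concrete. With hypothetical but representative numbers (a Llama-70B-like GQA configuration, FP16 KV cache, roughly 2 TB/s of HBM bandwidth on A100), just reading the KV cache sets a hard floor on per-step attention time:

```python
layers = 80         # transformer layers
kv_heads = 8        # GQA: 8 KV heads shared by 64 query heads
head_dim = 128
bytes_per_elem = 2  # FP16
batch, seq_len = 32, 2048

# K and V for every cached token must be read once per decode step
kv_bytes = batch * seq_len * layers * kv_heads * head_dim * 2 * bytes_per_elem
hbm_bandwidth = 2.0e12  # bytes/s, roughly A100-80GB class

floor_ms = kv_bytes / hbm_bandwidth * 1e3
print(f"{kv_bytes / 1e9:.1f} GB of KV reads -> at least {floor_ms:.1f} ms per step")
```

No amount of kernel tuning can push attention below this floor; only reading fewer bytes (quantization, GQA/MLA) or buying more bandwidth can.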

💡 The Performance Optimization Hierarchy for PagedAttention

To make PagedAttention faster: (1) More HBM bandwidth (hardware upgrade — biggest impact). (2) KV cache quantization to INT8/FP8 (2x less data to read — covered in Inference Timeline Part 2). (3) Fewer KV heads via GQA/MLA (less data per token — covered in Transformer Anatomy Part 6). (4) Kernel tuning (tile sizes, thread mapping — marginal). The kernel itself is already well-optimized; the bottleneck is fundamental physics.

Conclusion

The PagedAttention kernel is a masterclass in trading a small amount of compute overhead (block table indirection) for a massive system-level benefit (zero-fragmentation KV cache memory). The 12-18% attention overhead enables 2-4x more concurrent requests through better memory utilization — a clear win.

Understanding this kernel explains why vLLM makes certain architectural choices: why prefill and decode use different attention paths, why block size is 16 tokens (balancing indirection overhead vs fragmentation), and why KV cache quantization has outsized impact (reducing the dominant memory-bandwidth cost).

The three posts in this series give you a complete picture of vLLM’s internals: the architecture (Part 1), the scheduling brain (Part 2), and the attention muscle (this post). With this understanding, you can tune, debug, and extend vLLM with confidence.