A single 128K-token request on Llama 70B consumes 8,192 KV cache blocks, about 40 GB of memory. An A100-80GB, after loading the model, has only about 35 GB available for KV cache. The math does not work: you cannot serve that request on that GPU without offloading. Tiered KV cache solves this by treating GPU HBM, CPU DRAM, and NVMe SSD as a unified memory hierarchy: hot blocks stay on the GPU, warm blocks spill to CPU, and cold blocks land on SSD. The latency cost is real (roughly 55 us per block from CPU over PCIe, roughly 190 us from SSD), but the capacity gain is two orders of magnitude: 35 GB of GPU memory becomes 500 GB with CPU offloading, or 4 TB with SSD.
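The arithmetic behind those numbers can be checked directly. A quick sketch, assuming Llama 2 70B's published dimensions (80 layers, 8 KV heads under GQA, head dim 128) and fp16 KV entries:

```python
# KV-cache footprint for one 128K-token request on Llama 2 70B.
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 80, 8, 128, 2
TOKENS, TOKENS_PER_BLOCK = 128 * 1024, 16

bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES  # K and V
total_gib = TOKENS * bytes_per_token / 2**30
num_blocks = TOKENS // TOKENS_PER_BLOCK

print(f"{bytes_per_token // 1024} KiB/token, {total_gib:.0f} GiB, {num_blocks} blocks")
```

Each token costs 320 KiB of KV cache, so 128K tokens land at 40 GiB across 8,192 blocks of 16 tokens each.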
This post details vLLM v1’s offloading implementation: the async transfer pipeline, eviction policy, prefetch strategy, and the measured latency costs at each memory tier.
Memory Tier Characteristics
The three tiers have dramatically different bandwidth and capacity:
Memory Tier Specifications — Typical A100 Server
| Tier | Capacity | Read BW | Write BW | Latency (4KB) | Cost/GB |
|---|---|---|---|---|---|
| GPU HBM2e | 80 GB | 2,039 GB/s | 2,039 GB/s | ~0.3 us | $$$ |
| CPU DDR5 | 512 GB | 204 GB/s | 204 GB/s | ~0.1 us | $$ |
| NVMe SSD | 3.84 TB | 7 GB/s | 5 GB/s | ~10 us | $ |
The bandwidth ratio between GPU HBM and CPU DRAM is 10x. Between GPU HBM and NVMe, it is nearly 300x. In practice, CPU-to-GPU traffic is bounded by PCIe rather than by DRAM bandwidth, and these ceilings determine how many blocks can be transferred per decode step without stalling the GPU.
Offloading Architecture
vLLM’s offloading operates at the block level. The block manager tracks which blocks are on which tier:
import numpy as np
import torch

class TieredBlockManager:
    def __init__(self, num_gpu_blocks: int, num_cpu_blocks: int,
                 num_ssd_blocks: int, block_size_bytes: int):
        # GPU tier: pre-allocated contiguous tensor
        self.gpu_pool = torch.empty(
            num_gpu_blocks, block_size_bytes,
            dtype=torch.uint8, device="cuda"
        )
        # CPU tier: pinned memory for fast DMA
        self.cpu_pool = torch.empty(
            num_cpu_blocks, block_size_bytes,
            dtype=torch.uint8, device="cpu", pin_memory=True
        )
        # SSD tier: memory-mapped file
        self.ssd_path = "/tmp/vllm_kv_cache.bin"
        self.ssd_pool = np.memmap(
            self.ssd_path, dtype=np.uint8, mode="w+",
            shape=(num_ssd_blocks, block_size_bytes)
        )
        # Block location tracking
        self.block_tier = {}         # block_id -> "gpu" | "cpu" | "ssd"
        self.block_last_access = {}  # block_id -> step_number
        self.gpu_free = list(range(num_gpu_blocks))
        self.cpu_free = list(range(num_cpu_blocks))
        self.ssd_free = list(range(num_ssd_blocks))
CPU memory must be pinned (pin_memory=True) for efficient DMA transfers. Pinned (page-locked) memory cannot be paged out by the OS, so the GPU's PCIe DMA engines can read and write it directly without CPU involvement. Transfers from unpinned memory go through an intermediate pinned staging buffer, roughly halving effective bandwidth.
The Transfer Pipeline
Offloading and retrieval use CUDA streams to overlap transfers with computation:
class AsyncTransferPipeline:
    def __init__(self, gpu_pool: torch.Tensor, cpu_pool: torch.Tensor):
        # The pools are the pre-allocated GPU and pinned-CPU tensors owned
        # by the block manager; transfers copy between matching slots.
        self.gpu_pool = gpu_pool
        self.cpu_pool = cpu_pool
        self.offload_stream = torch.cuda.Stream()
        self.prefetch_stream = torch.cuda.Stream()
        self.pending_offloads = []    # (block_id, event)
        self.pending_prefetches = []  # (block_id, event)

    def offload_to_cpu(self, block_id: int, gpu_slot: int, cpu_slot: int):
        """Non-blocking GPU -> CPU transfer."""
        with torch.cuda.stream(self.offload_stream):
            self.cpu_pool[cpu_slot].copy_(
                self.gpu_pool[gpu_slot], non_blocking=True
            )
        event = torch.cuda.Event()
        event.record(self.offload_stream)
        self.pending_offloads.append((block_id, event))

    def prefetch_from_cpu(self, block_id: int, cpu_slot: int, gpu_slot: int):
        """Non-blocking CPU -> GPU transfer."""
        with torch.cuda.stream(self.prefetch_stream):
            self.gpu_pool[gpu_slot].copy_(
                self.cpu_pool[cpu_slot], non_blocking=True
            )
        event = torch.cuda.Event()
        event.record(self.prefetch_stream)
        self.pending_prefetches.append((block_id, event))

    def wait_offloads(self):
        """Block until all pending offloads complete."""
        for block_id, event in self.pending_offloads:
            event.synchronize()
        self.pending_offloads.clear()

    def wait_prefetches(self):
        """Block until all pending prefetches complete."""
        for block_id, event in self.pending_prefetches:
            event.synchronize()
        self.pending_prefetches.clear()
The key to performance is that offload_stream and prefetch_stream operate concurrently with the compute stream. While the GPU executes the forward pass on the compute stream, offload and prefetch transfers proceed on PCIe DMA engines without stealing GPU SMs.
Transfer Time Budget
Each decode step has a fixed time budget. Transfers that exceed this budget stall the GPU:
# Decode step duration for Llama 70B, batch=64, 4xA100: ~15 ms
# Available PCIe bandwidth: 32 GB/s (Gen4 x16, per direction)
# Practical bandwidth with pinned memory: ~24 GB/s
# Block size for Llama 70B (per GPU with TP=4): 1.31 MB
# Transfer time per block: 1.31 MB / 24 GB/s = 54.6 us
# Blocks transferable per decode step: 15 ms / 54.6 us = 274 blocks
# At 16 tokens/block, that's 4,384 tokens of cache per step
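The same budget arithmetic as a small reusable helper (function and parameter names are illustrative):

```python
# How many blocks fit through PCIe within one decode step without stalling.
def blocks_per_step(step_ms: float, block_mb: float,
                    pcie_gb_per_s: float = 24.0) -> int:
    per_block_s = (block_mb * 1e6) / (pcie_gb_per_s * 1e9)
    return int((step_ms / 1e3) / per_block_s)

n = blocks_per_step(15.0, 1.31)
print(n, n * 16)  # → 274 blocks per step, 4384 tokens of cache
```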
Blocks Transferable per Decode Step (15ms budget)
Eviction Policy
When the GPU block pool is full and a new block is needed, the eviction policy selects which blocks to offload. vLLM uses an LRU (Least Recently Used) policy with priority classes:
class EvictionPolicy:
    PRIORITY_ACTIVE = 0     # Currently in a running sequence's attention window
    PRIORITY_RECENT = 1     # Recently used, might be needed again (prefix cache)
    PRIORITY_IDLE = 2       # Belongs to a waiting/preempted sequence
    PRIORITY_EVICTABLE = 3  # Safe to offload

    def select_victims(self, num_needed: int,
                       block_metadata: dict) -> list:
        """Select blocks to evict from GPU to CPU."""
        candidates = []
        for block_id, meta in block_metadata.items():
            if meta["tier"] != "gpu":
                continue
            if meta["ref_count"] > 0 and meta["priority"] == self.PRIORITY_ACTIVE:
                continue  # Never evict blocks in active attention
            candidates.append((
                meta["priority"],     # Primary sort: higher value = evict sooner
                meta["last_access"],  # Secondary sort: LRU within a priority class
                block_id
            ))
        # Highest priority value first, oldest access first within each class
        candidates.sort(key=lambda x: (-x[0], x[1]))
        return [c[2] for c in candidates[:num_needed]]
The priority system ensures:
- Blocks actively used by the current batch are never evicted.
- Prefix-cached blocks (shared across sequences) are retained in preference to idle blocks, since evicting them forces recomputation of the shared prefix.
- Blocks from preempted sequences are evicted ahead of recently used ones, since those sequences are already stalled.
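A toy run of the priority-plus-LRU ordering makes the behavior concrete. This standalone snippet mirrors the sort in select_victims with hypothetical block metadata (priority 3 = evictable, 2 = idle, 0 = active):

```python
# Four GPU-resident blocks with made-up metadata.
blocks = {
    1: {"tier": "gpu", "priority": 3, "last_access": 100},  # evictable, old
    2: {"tier": "gpu", "priority": 3, "last_access": 900},  # evictable, recent
    3: {"tier": "gpu", "priority": 2, "last_access": 50},   # idle, oldest of all
    4: {"tier": "gpu", "priority": 0, "last_access": 999},  # active: protected
}
candidates = sorted(
    ((m["priority"], m["last_access"], bid)
     for bid, m in blocks.items()
     if m["tier"] == "gpu" and m["priority"] != 0),
    key=lambda x: (-x[0], x[1]),
)
victims = [bid for _, _, bid in candidates[:2]]
print(victims)  # → [1, 2]
```

Both evictable blocks go before the idle block even though the idle block has the oldest access time: the priority class dominates, and LRU only breaks ties within a class.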
Prefetch Strategy
The prefetch strategy predicts which offloaded blocks will be needed and retrieves them before the compute step:
class PrefetchScheduler:
    def __init__(self, block_manager, transfer, lookahead: int = 2):
        self.block_manager = block_manager
        self.transfer = transfer
        self.lookahead = lookahead

    def compute_prefetch_set(self, scheduled_batch, block_tables):
        """Determine which blocks need to be on GPU for the next N steps."""
        prefetch_blocks = set()
        for seq_id in scheduled_batch.seq_ids:
            table = block_tables[seq_id]
            # Current decode position
            current_block_idx = table.num_tokens // table.block_size
            # Prefetch current block + lookahead blocks
            for i in range(self.lookahead + 1):
                block_idx = current_block_idx + i
                if block_idx < len(table.block_ids):
                    block_id = table.block_ids[block_idx]
                    if self.block_manager.get_tier(block_id) != "gpu":
                        prefetch_blocks.add(block_id)
        return prefetch_blocks

    def execute_prefetch(self, prefetch_blocks):
        """Prefetch blocks from CPU/SSD to GPU."""
        # Split by tier: CPU-resident blocks can go straight to GPU
        cpu_blocks = [b for b in prefetch_blocks
                      if self.block_manager.get_tier(b) == "cpu"]
        ssd_blocks = [b for b in prefetch_blocks
                      if self.block_manager.get_tier(b) == "ssd"]
        # SSD -> CPU first (slow), then CPU -> GPU (fast)
        for block_id in ssd_blocks:
            self.transfer.ssd_to_cpu(block_id)
        self.transfer.wait_ssd_reads()
        for block_id in cpu_blocks + ssd_blocks:
            # CPU slot lookup is assumed to live on the block manager
            cpu_slot = self.block_manager.get_cpu_slot(block_id)
            gpu_slot = self.block_manager.allocate_gpu()
            self.transfer.prefetch_from_cpu(block_id, cpu_slot, gpu_slot)
The lookahead parameter controls how aggressively blocks are prefetched. A lookahead of 1 prefetches only the next block needed. A lookahead of 2-3 hides more transfer latency but consumes more GPU blocks for staging. For decode (sequential token generation), a lookahead of 1 is sufficient because the access pattern is perfectly predictable.
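The lookahead computation for a single sequence can be sketched standalone. This uses a hypothetical block table (10 blocks, decoding at token 70, 16-token blocks) with the last six blocks offloaded to CPU:

```python
block_ids = list(range(100, 110))   # physical block ids 100..109
tier = {bid: ("gpu" if bid < 104 else "cpu") for bid in block_ids}
num_tokens, block_size, lookahead = 70, 16, 2

current = num_tokens // block_size  # logical block index 4
prefetch = {
    block_ids[i]
    for i in range(current, min(current + lookahead + 1, len(block_ids)))
    if tier[block_ids[i]] != "gpu"
}
print(sorted(prefetch))  # → [104, 105, 106]
```

Token 70 sits in logical block 4, so with a lookahead of 2 the scheduler wants blocks 4 through 6 resident; all three map to CPU-tier blocks and are queued for prefetch.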
SSD Tier Implementation
The SSD tier uses direct I/O to bypass the OS page cache:
import mmap
import os

import torch

class SSDTier:
    def __init__(self, path: str, num_blocks: int, block_size: int):
        # O_DIRECT (Linux) requires file offset, I/O length, and buffer
        # address all aligned to the logical block size (here: 4 KB)
        assert block_size % 4096 == 0
        self.block_size = block_size
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_DIRECT, 0o600)
        # Pre-allocate file
        os.ftruncate(self.fd, num_blocks * block_size)
        # Anonymous mmap allocations are page-aligned, satisfying O_DIRECT
        self.aligned_buf = mmap.mmap(-1, block_size)

    def write_block(self, slot: int, data: torch.Tensor):
        """Write a block to SSD at the given slot."""
        offset = slot * self.block_size
        # Stage through the aligned buffer; a plain bytes object would
        # not satisfy O_DIRECT's alignment requirement
        self.aligned_buf[:] = data.view(-1).cpu().numpy().tobytes()
        os.pwritev(self.fd, [self.aligned_buf], offset)

    def read_block(self, slot: int, dest: torch.Tensor):
        """Read a block from SSD into pinned CPU memory."""
        offset = slot * self.block_size
        os.preadv(self.fd, [self.aligned_buf], offset)
        dest.copy_(torch.frombuffer(self.aligned_buf, dtype=torch.uint8))
O_DIRECT bypasses the page cache, which is critical because:
- KV cache blocks are large (1-5 MB) and would pollute the page cache
- Access patterns are managed by the eviction policy, not the OS
- We need predictable latency without page cache eviction jitter
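The alignment constraint is worth checking up front. A small helper (hypothetical name), using the fact that the 1.31 MB Llama 70B block is 1,310,720 bytes (16 tokens x 80 KiB of KV per token per GPU at TP=4):

```python
# O_DIRECT requires file offsets, I/O sizes, and buffer addresses aligned
# to the device's logical block size (commonly 512 B or 4 KB).
def odirect_aligned(block_size: int, alignment: int = 4096) -> bool:
    return block_size % alignment == 0

print(odirect_aligned(1_310_720))  # → True: the 1.31 MB block is 4 KB-aligned
print(odirect_aligned(5_000))      # → False: would need padding
```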
SSD Transfer Pipeline
SSD reads are slower than CPU memory access, so a two-stage pipeline is used:
Step N: [GPU compute] [SSD->CPU transfer for step N+2]
Step N+1: [GPU compute] [CPU->GPU transfer for step N+2] [SSD->CPU for N+3]
Step N+2: [GPU compute using prefetched blocks] [SSD->CPU for N+4]
This requires a lookahead of 2 steps for SSD-resident blocks.
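Why two stages instead of one? Per-block read time off the SSD is several times the PCIe copy time, so the two legs must be pipelined independently. Using the bandwidth figures from the tier table and budget analysis above:

```python
# Per-block transfer time at each pipeline stage (1.31 MB blocks).
BLOCK_BYTES = 1.31e6
ssd_us = BLOCK_BYTES / 7e9 * 1e6    # 7 GB/s NVMe read
pcie_us = BLOCK_BYTES / 24e9 * 1e6  # 24 GB/s effective PCIe
print(f"SSD: {ssd_us:.0f} us/block, PCIe: {pcie_us:.0f} us/block")
```

The SSD leg takes about 187 us per block versus about 55 us for the PCIe leg, so issuing the SSD read a full step earlier keeps the slow stage off the critical path.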
Performance Impact of Offloading
Decode Latency Impact of Offloading — Llama 70B, 4xA100, Batch=64
| Scenario | Decode Step (ms) | Overhead | Effective Throughput (tok/s) |
|---|---|---|---|
| All on GPU | 14.8 | baseline | 4,324 |
| 5% blocks from CPU | 15.1 | +2.0% | 4,238 |
| 20% blocks from CPU | 16.4 | +10.8% | 3,902 |
| 50% blocks from CPU | 19.2 | +29.7% | 3,333 |
| 5% blocks from SSD | 16.8 | +13.5% | 3,809 |
| 20% blocks from SSD | 22.5 | +52.0% | 2,844 |
Throughput Degradation by Offload Fraction
The overhead is manageable when offloading is limited to 5-20% of blocks. Beyond 50% CPU offload, the PCIe bus becomes saturated and decode steps stall waiting for transfers.
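The throughput column in the table follows directly from decode-step time, which makes it easy to sanity-check:

```python
# Effective throughput = batch size / decode-step time.
def tok_per_s(batch: int, step_ms: float) -> int:
    return int(batch / (step_ms / 1e3))

print(tok_per_s(64, 14.8))  # → 4324 (all on GPU)
print(tok_per_s(64, 16.4))  # → 3902 (20% from CPU)
print(tok_per_s(64, 22.5))  # → 2844 (20% from SSD)
```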
Capacity Analysis
The real benefit of offloading is increased serving capacity — more concurrent sequences:
# Without offloading (GPU only):
# 80 GB - 32.5 GB (model) - 2 GB (overhead) = 45.5 GB for KV cache
# 45.5 GB / 1.31 MB per block = 34,732 blocks
# At 4096 tokens/seq: 34,732 / (4096/16) = 135 concurrent sequences
# With CPU offloading (512 GB CPU RAM, 256 GB allocated):
# GPU: 34,732 blocks (active)
# CPU: 256 GB / 1.31 MB = 195,420 blocks (cold storage)
# Total: 230,152 blocks
# At 4096 tokens/seq: 230,152 / 256 = 899 sequences stored
# But only 135 can decode simultaneously (GPU constraint)
# The rest are paused with their KV cache preserved on CPU
# With SSD offloading (3.84 TB NVMe):
# SSD: 3.84 TB / 1.31 MB = 2,931,298 blocks
# Total: 3.16M blocks = thousands of sequences cached
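The capacity arithmetic above can be reproduced as a short script (1.31 MB blocks, 16 tokens per block, 4096-token sequences, so 256 blocks per sequence):

```python
BLOCK_MB = 1.31
BLOCKS_PER_SEQ = 4096 // 16  # 256

def tier_blocks(capacity_gb: float) -> int:
    """Number of KV-cache blocks a tier of the given size can hold."""
    return int(capacity_gb * 1e3 / BLOCK_MB)

gpu_blocks = tier_blocks(45.5)  # ≈ 34,732
cpu_blocks = tier_blocks(256)   # ≈ 195,419
total_seqs = (gpu_blocks + cpu_blocks) // BLOCKS_PER_SEQ
print(gpu_blocks, cpu_blocks, total_seqs)  # → 34732 195419 899
```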
Sequence Capacity by Tier Configuration — Llama 70B, TP=4
| Config | Active Seqs (GPU) | Cached Seqs (CPU) | Archived Seqs (SSD) | Total Capacity |
|---|---|---|---|---|
| GPU only | 135 | 0 | 0 | 135 |
| GPU + 256GB CPU | 135 | 764 | 0 | 899 |
| GPU + 256GB CPU + 1TB SSD | 135 | 764 | 2,981 | 3,880 |
| GPU + 512GB CPU + 4TB SSD | 135 | 1,528 | 11,927 | 13,590 |
When to Use Each Tier
The decision tree for tiered offloading:
def should_enable_offloading(workload, gpu_seq_capacity: int) -> str:
    """gpu_seq_capacity: concurrent sequences the GPU tier alone can hold."""
    avg_seq_len = workload.avg_input_len + workload.avg_output_len
    if workload.concurrent_requests <= gpu_seq_capacity:
        return "GPU_ONLY"     # Everything fits, no offloading needed
    if workload.concurrent_requests <= gpu_seq_capacity * 5:
        return "GPU_CPU"      # CPU offloading handles moderate overflow
    if workload.request_pattern == "bursty":
        return "GPU_CPU_SSD"  # SSD absorbs bursts, CPU handles steady state
    if avg_seq_len > 32768:
        return "GPU_CPU"      # Long contexts benefit from the CPU tier;
                              # SSD is too slow for frequent long-context retrieval
    return "GPU_CPU"          # Default: SSD rarely worth the complexity
SSD offloading is only beneficial for workloads with high request churn (many short-lived requests generating cache that can be archived) or extreme concurrency (thousands of paused sessions). For steady-state serving with moderate concurrency, CPU offloading alone is sufficient.
Configuration
Enable tiered offloading in vLLM:
# CPU offloading only; --swap-space is GB of CPU RAM for KV cache offloading
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-70b-hf \
    --tensor-parallel-size 4 \
    --swap-space 256 \
    --gpu-memory-utilization 0.90

# CPU + SSD (experimental); --kv-cache-ssd-size is in GB
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-70b-hf \
    --tensor-parallel-size 4 \
    --swap-space 256 \
    --kv-cache-ssd-path /mnt/nvme/vllm_kv_cache \
    --kv-cache-ssd-size 1024
The --swap-space parameter specifies how many gigabytes of CPU pinned memory to allocate for KV cache swap. This memory is reserved at startup and cannot be used by other processes.
Summary
vLLM v1’s tiered KV cache offloading extends serving capacity far beyond GPU memory limits. The GPU tier (HBM) holds active blocks at 2 TB/s bandwidth. The CPU tier (pinned DRAM) holds warm blocks with 24 GB/s effective PCIe transfer bandwidth, adding 2% latency overhead at a 5% offload rate and 30% at 50%. The SSD tier (NVMe) provides archival storage at 7 GB/s with a 2-step prefetch pipeline. The eviction policy uses priority classes (active, recent, idle, evictable) with LRU ordering within each class. Async CUDA streams overlap transfers with compute, making offloading nearly invisible when the offload fraction stays below 20%. The primary use case is serving many concurrent long-context sessions where total KV cache exceeds GPU memory but only a fraction of sessions are actively generating tokens.