A single 128K-token request on Llama 70B consumes 8,192 KV cache blocks, about 40 GB of memory. An A100-80GB, after loading the model, has only about 35 GB available for KV cache. The math does not work: you cannot serve that request on that GPU without offloading. Tiered KV cache solves this by treating GPU HBM, CPU DRAM, and NVMe SSD as a unified memory hierarchy: hot blocks stay on the GPU, warm blocks spill to CPU, and cold blocks land on SSD. The latency cost is real (roughly 55 us per block from CPU over PCIe, roughly 190 us from SSD), but the capacity gain is two orders of magnitude: 35 GB of GPU memory becomes 500 GB with CPU offloading, or 4 TB with SSD.
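The arithmetic behind those numbers can be checked directly. A quick sketch, assuming Llama 2 70B's published dimensions (80 layers, 8 KV heads under GQA, head dim 128) and fp16 KV entries:

```python
# KV-cache footprint for one 128K-token request on Llama 2 70B.
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 80, 8, 128, 2
TOKENS, TOKENS_PER_BLOCK = 128 * 1024, 16

bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES  # K and V
total_gib = TOKENS * bytes_per_token / 2**30
num_blocks = TOKENS // TOKENS_PER_BLOCK

print(f"{bytes_per_token // 1024} KiB/token, {total_gib:.0f} GiB, {num_blocks} blocks")
```

Each token costs 320 KiB of KV cache, so 128K tokens land at 40 GiB across 8,192 blocks of 16 tokens each.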
This post details vLLM v1’s offloading implementation: the async transfer pipeline, eviction policy, prefetch strategy, and the measured latency costs at each memory tier.
Memory Tier Characteristics
The three tiers have dramatically different bandwidth and capacity:
Memory Tier Specifications — Typical A100 Server
| Tier | Capacity | Read BW | Write BW | Latency (4KB) | Cost/GB |
|---|---|---|---|---|---|
| GPU HBM2e | 80 GB | 2,039 GB/s | 2,039 GB/s | ~0.3 us | $$$ |
| CPU DDR5 | 512 GB | 204 GB/s | 204 GB/s | ~0.1 us | $$ |
| NVMe SSD | 3.84 TB | 7 GB/s | 5 GB/s | ~10 us | $ |
The bandwidth ratio between GPU HBM and CPU DRAM is 10x. Between GPU HBM and NVMe, it is nearly 300x. In practice, CPU-to-GPU traffic is bounded by PCIe rather than by DRAM bandwidth, and these ceilings determine how many blocks can be transferred per decode step without stalling the GPU.
Offloading Architecture
vLLM’s offloading operates at the block level. The block manager tracks which blocks are on which tier:
import numpy as np
import torch

class TieredBlockManager:
    def __init__(self, num_gpu_blocks: int, num_cpu_blocks: int,
                 num_ssd_blocks: int, block_size_bytes: int):
        # GPU tier: pre-allocated contiguous tensor
        self.gpu_pool = torch.empty(
            num_gpu_blocks, block_size_bytes,
            dtype=torch.uint8, device="cuda"
        )
        # CPU tier: pinned memory for fast DMA
        self.cpu_pool = torch.empty(
            num_cpu_blocks, block_size_bytes,
            dtype=torch.uint8, device="cpu", pin_memory=True
        )
        # SSD tier: memory-mapped file
        self.ssd_path = "/tmp/vllm_kv_cache.bin"
        self.ssd_pool = np.memmap(
            self.ssd_path, dtype=np.uint8, mode="w+",
            shape=(num_ssd_blocks, block_size_bytes)
        )
        # Block location tracking
        self.block_tier = {}         # block_id -> "gpu" | "cpu" | "ssd"
        self.block_last_access = {}  # block_id -> step_number
        self.gpu_free = list(range(num_gpu_blocks))
        self.cpu_free = list(range(num_cpu_blocks))
        self.ssd_free = list(range(num_ssd_blocks))
CPU memory must be pinned (pin_memory=True) for efficient DMA transfers. Pinned (page-locked) memory cannot be paged out by the OS, so the GPU's PCIe DMA engines can read and write it directly without CPU involvement. Transfers from unpinned memory go through an intermediate pinned staging buffer, roughly halving effective bandwidth.
The Transfer Pipeline
Offloading and retrieval use CUDA streams to overlap transfers with computation:
class AsyncTransferPipeline:
    def __init__(self, gpu_pool: torch.Tensor, cpu_pool: torch.Tensor):
        # The pools are the pre-allocated GPU and pinned-CPU tensors owned
        # by the block manager; transfers copy between matching slots.
        self.gpu_pool = gpu_pool
        self.cpu_pool = cpu_pool
        self.offload_stream = torch.cuda.Stream()
        self.prefetch_stream = torch.cuda.Stream()
        self.pending_offloads = []    # (block_id, event)
        self.pending_prefetches = []  # (block_id, event)

    def offload_to_cpu(self, block_id: int, gpu_slot: int, cpu_slot: int):
        """Non-blocking GPU -> CPU transfer."""
        with torch.cuda.stream(self.offload_stream):
            self.cpu_pool[cpu_slot].copy_(
                self.gpu_pool[gpu_slot], non_blocking=True
            )
        event = torch.cuda.Event()
        event.record(self.offload_stream)
        self.pending_offloads.append((block_id, event))

    def prefetch_from_cpu(self, block_id: int, cpu_slot: int, gpu_slot: int):
        """Non-blocking CPU -> GPU transfer."""
        with torch.cuda.stream(self.prefetch_stream):
            self.gpu_pool[gpu_slot].copy_(
                self.cpu_pool[cpu_slot], non_blocking=True
            )
        event = torch.cuda.Event()
        event.record(self.prefetch_stream)
        self.pending_prefetches.append((block_id, event))

    def wait_offloads(self):
        """Block until all pending offloads complete."""
        for block_id, event in self.pending_offloads:
            event.synchronize()
        self.pending_offloads.clear()

    def wait_prefetches(self):
        """Block until all pending prefetches complete."""
        for block_id, event in self.pending_prefetches:
            event.synchronize()
        self.pending_prefetches.clear()
The key to performance is that offload_stream and prefetch_stream operate concurrently with the compute stream. While the GPU executes the forward pass on the compute stream, offload and prefetch transfers proceed on PCIe DMA engines without stealing GPU SMs.
Transfer Time Budget
Each decode step has a fixed time budget. Transfers that exceed this budget stall the GPU:
# Decode step duration for Llama 70B, batch=64, 4xA100: ~15 ms
# Available PCIe bandwidth: 32 GB/s (Gen4 x16, per direction)
# Practical bandwidth with pinned memory: ~24 GB/s
# Block size for Llama 70B (per GPU with TP=4): 1.31 MB
# Transfer time per block: 1.31 MB / 24 GB/s = 54.6 us
# Blocks transferable per decode step: 15 ms / 54.6 us = 274 blocks
# At 16 tokens/block, that's 4,384 tokens of cache per step
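The same budget arithmetic as a small reusable helper (function and parameter names are illustrative):

```python
# How many blocks fit through PCIe within one decode step without stalling.
def blocks_per_step(step_ms: float, block_mb: float,
                    pcie_gb_per_s: float = 24.0) -> int:
    per_block_s = (block_mb * 1e6) / (pcie_gb_per_s * 1e9)
    return int((step_ms / 1e3) / per_block_s)

n = blocks_per_step(15.0, 1.31)
print(n, n * 16)  # → 274 blocks per step, 4384 tokens of cache
```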
Blocks Transferable per Decode Step (15ms budget)
Eviction Policy
When the GPU block pool is full and a new block is needed, the eviction policy selects which blocks to offload. vLLM uses an LRU (Least Recently Used) policy with priority classes:
class EvictionPolicy:
    PRIORITY_ACTIVE = 0     # Currently in a running sequence's attention window
    PRIORITY_RECENT = 1     # Recently used, might be needed again (prefix cache)
    PRIORITY_IDLE = 2       # Belongs to a waiting/preempted sequence
    PRIORITY_EVICTABLE = 3  # Safe to offload

    def select_victims(self, num_needed: int,
                       block_metadata: dict) -> list:
        """Select blocks to evict from GPU to CPU."""
        candidates = []
        for block_id, meta in block_metadata.items():
            if meta["tier"] != "gpu":
                continue
            if meta["ref_count"] > 0 and meta["priority"] == self.PRIORITY_ACTIVE:
                continue  # Never evict blocks in active attention
            candidates.append((
                meta["priority"],     # Primary sort: higher value = evict sooner
                meta["last_access"],  # Secondary sort: LRU within a priority class
                block_id
            ))
        # Highest priority value first, oldest access first within each class
        candidates.sort(key=lambda x: (-x[0], x[1]))
        return [c[2] for c in candidates[:num_needed]]
The priority system ensures:
- Blocks actively used by the current batch are never evicted.
- Prefix-cached blocks (shared across sequences) are retained in preference to idle blocks, since evicting them forces recomputation of the shared prefix.
- Blocks from preempted sequences are evicted ahead of recently used ones, since those sequences are already stalled.
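A toy run of the priority-plus-LRU ordering makes the behavior concrete. This standalone snippet mirrors the sort in select_victims with hypothetical block metadata (priority 3 = evictable, 2 = idle, 0 = active):

```python
# Four GPU-resident blocks with made-up metadata.
blocks = {
    1: {"tier": "gpu", "priority": 3, "last_access": 100},  # evictable, old
    2: {"tier": "gpu", "priority": 3, "last_access": 900},  # evictable, recent
    3: {"tier": "gpu", "priority": 2, "last_access": 50},   # idle, oldest of all
    4: {"tier": "gpu", "priority": 0, "last_access": 999},  # active: protected
}
candidates = sorted(
    ((m["priority"], m["last_access"], bid)
     for bid, m in blocks.items()
     if m["tier"] == "gpu" and m["priority"] != 0),
    key=lambda x: (-x[0], x[1]),
)
victims = [bid for _, _, bid in candidates[:2]]
print(victims)  # → [1, 2]
```

Both evictable blocks go before the idle block even though the idle block has the oldest access time: the priority class dominates, and LRU only breaks ties within a class.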
Prefetch Strategy
The prefetch strategy predicts which offloaded blocks will be needed and retrieves them before the compute step:
class PrefetchScheduler:
    def __init__(self, block_manager, transfer, lookahead: int = 2):
        self.block_manager = block_manager
        self.transfer = transfer
        self.lookahead = lookahead

    def compute_prefetch_set(self, scheduled_batch, block_tables):
        """Determine which blocks need to be on GPU for the next N steps."""
        prefetch_blocks = set()
        for seq_id in scheduled_batch.seq_ids:
            table = block_tables[seq_id]
            # Current decode position
            current_block_idx = table.num_tokens // table.block_size
            # Prefetch current block + lookahead blocks
            for i in range(self.lookahead + 1):
                block_idx = current_block_idx + i
                if block_idx < len(table.block_ids):
                    block_id = table.block_ids[block_idx]
                    if self.block_manager.get_tier(block_id) != "gpu":
                        prefetch_blocks.add(block_id)
        return prefetch_blocks

    def execute_prefetch(self, prefetch_blocks):
        """Prefetch blocks from CPU/SSD to GPU."""
        # Split by tier: CPU-resident blocks can go straight to GPU
        cpu_blocks = [b for b in prefetch_blocks
                      if self.block_manager.get_tier(b) == "cpu"]
        ssd_blocks = [b for b in prefetch_blocks
                      if self.block_manager.get_tier(b) == "ssd"]
        # SSD -> CPU first (slow), then CPU -> GPU (fast)
        for block_id in ssd_blocks:
            self.transfer.ssd_to_cpu(block_id)
        self.transfer.wait_ssd_reads()
        for block_id in cpu_blocks + ssd_blocks:
            # CPU slot lookup is assumed to live on the block manager
            cpu_slot = self.block_manager.get_cpu_slot(block_id)
            gpu_slot = self.block_manager.allocate_gpu()
            self.transfer.prefetch_from_cpu(block_id, cpu_slot, gpu_slot)
The lookahead parameter controls how aggressively blocks are prefetched. A lookahead of 1 prefetches only the next block needed. A lookahead of 2-3 hides more transfer latency but consumes more GPU blocks for staging. For decode (sequential token generation), a lookahead of 1 is sufficient because the access pattern is perfectly predictable.
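The lookahead computation for a single sequence can be sketched standalone. This uses a hypothetical block table (10 blocks, decoding at token 70, 16-token blocks) with the last six blocks offloaded to CPU:

```python
block_ids = list(range(100, 110))   # physical block ids 100..109
tier = {bid: ("gpu" if bid < 104 else "cpu") for bid in block_ids}
num_tokens, block_size, lookahead = 70, 16, 2

current = num_tokens // block_size  # logical block index 4
prefetch = {
    block_ids[i]
    for i in range(current, min(current + lookahead + 1, len(block_ids)))
    if tier[block_ids[i]] != "gpu"
}
print(sorted(prefetch))  # → [104, 105, 106]
```

Token 70 sits in logical block 4, so with a lookahead of 2 the scheduler wants blocks 4 through 6 resident; all three map to CPU-tier blocks and are queued for prefetch.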
SSD Tier Implementation
The SSD tier uses direct I/O to bypass the OS page cache:
import mmap
import os

import torch

class SSDTier:
    def __init__(self, path: str, num_blocks: int, block_size: int):
        # O_DIRECT (Linux) requires file offset, I/O length, and buffer
        # address all aligned to the logical block size (here: 4 KB)
        assert block_size % 4096 == 0
        self.block_size = block_size
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_DIRECT, 0o600)
        # Pre-allocate file
        os.ftruncate(self.fd, num_blocks * block_size)
        # Anonymous mmap allocations are page-aligned, satisfying O_DIRECT
        self.aligned_buf = mmap.mmap(-1, block_size)

    def write_block(self, slot: int, data: torch.Tensor):
        """Write a block to SSD at the given slot."""
        offset = slot * self.block_size
        # Stage through the aligned buffer; a plain bytes object would
        # not satisfy O_DIRECT's alignment requirement
        self.aligned_buf[:] = data.view(-1).cpu().numpy().tobytes()
        os.pwritev(self.fd, [self.aligned_buf], offset)

    def read_block(self, slot: int, dest: torch.Tensor):
        """Read a block from SSD into pinned CPU memory."""
        offset = slot * self.block_size
        os.preadv(self.fd, [self.aligned_buf], offset)
        dest.copy_(torch.frombuffer(self.aligned_buf, dtype=torch.uint8))
O_DIRECT bypasses the page cache, which is critical because:
- KV cache blocks are large (1-5 MB) and would pollute the page cache
- Access patterns are managed by the eviction policy, not the OS
- We need predictable latency without page cache eviction jitter
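The alignment constraint is worth checking up front. A small helper (hypothetical name), using the fact that the 1.31 MB Llama 70B block is 1,310,720 bytes (16 tokens x 80 KiB of KV per token per GPU at TP=4):

```python
# O_DIRECT requires file offsets, I/O sizes, and buffer addresses aligned
# to the device's logical block size (commonly 512 B or 4 KB).
def odirect_aligned(block_size: int, alignment: int = 4096) -> bool:
    return block_size % alignment == 0

print(odirect_aligned(1_310_720))  # → True: the 1.31 MB block is 4 KB-aligned
print(odirect_aligned(5_000))      # → False: would need padding
```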
SSD Transfer Pipeline
SSD reads are slower than CPU memory access, so a two-stage pipeline is used:
Step N: [GPU compute] [SSD->CPU transfer for step N+2]
Step N+1: [GPU compute] [CPU->GPU transfer for step N+2] [SSD->CPU for N+3]
Step N+2: [GPU compute using prefetched blocks] [SSD->CPU for N+4]
This requires a lookahead of 2 steps for SSD-resident blocks.
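Why two stages instead of one? Per-block read time off the SSD is several times the PCIe copy time, so the two legs must be pipelined independently. Using the bandwidth figures from the tier table and budget analysis above:

```python
# Per-block transfer time at each pipeline stage (1.31 MB blocks).
BLOCK_BYTES = 1.31e6
ssd_us = BLOCK_BYTES / 7e9 * 1e6    # 7 GB/s NVMe read
pcie_us = BLOCK_BYTES / 24e9 * 1e6  # 24 GB/s effective PCIe
print(f"SSD: {ssd_us:.0f} us/block, PCIe: {pcie_us:.0f} us/block")
```

The SSD leg takes about 187 us per block versus about 55 us for the PCIe leg, so issuing the SSD read a full step earlier keeps the slow stage off the critical path.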
Performance Impact of Offloading
Decode Latency Impact of Offloading — Llama 70B, 4xA100, Batch=64
| Scenario | Decode Step (ms) | Overhead | Effective Throughput (tok/s) |
|---|---|---|---|
| All on GPU | 14.8 | baseline | 4,324 |
| 5% blocks from CPU | 15.1 | +2.0% | 4,238 |
| 20% blocks from CPU | 16.4 | +10.8% | 3,902 |
| 50% blocks from CPU | 19.2 | +29.7% | 3,333 |
| 5% blocks from SSD | 16.8 | +13.5% | 3,809 |
| 20% blocks from SSD | 22.5 | +52.0% | 2,844 |
Throughput Degradation by Offload Fraction
The overhead is manageable when offloading is limited to 5-20% of blocks. Beyond 50% CPU offload, the PCIe bus becomes saturated and decode steps stall waiting for transfers.
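The throughput column in the table follows directly from decode-step time, which makes it easy to sanity-check:

```python
# Effective throughput = batch size / decode-step time.
def tok_per_s(batch: int, step_ms: float) -> int:
    return int(batch / (step_ms / 1e3))

print(tok_per_s(64, 14.8))  # → 4324 (all on GPU)
print(tok_per_s(64, 16.4))  # → 3902 (20% from CPU)
print(tok_per_s(64, 22.5))  # → 2844 (20% from SSD)
```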
Capacity Analysis
The real benefit of offloading is increased serving capacity — more concurrent sequences:
# Without offloading (GPU only):
# 80 GB - 32.5 GB (model) - 2 GB (overhead) = 45.5 GB for KV cache
# 45.5 GB / 1.31 MB per block = 34,732 blocks
# At 4096 tokens/seq: 34,732 / (4096/16) = 135 concurrent sequences
# With CPU offloading (512 GB CPU RAM, 256 GB allocated):
# GPU: 34,732 blocks (active)
# CPU: 256 GB / 1.31 MB = 195,420 blocks (cold storage)
# Total: 230,152 blocks
# At 4096 tokens/seq: 230,152 / 256 = 899 sequences stored
# But only 135 can decode simultaneously (GPU constraint)
# The rest are paused with their KV cache preserved on CPU
# With SSD offloading (3.84 TB NVMe):
# SSD: 3.84 TB / 1.31 MB = 2,931,298 blocks
# Total: 3.16M blocks = thousands of sequences cached
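The capacity arithmetic above can be reproduced as a short script (1.31 MB blocks, 16 tokens per block, 4096-token sequences, so 256 blocks per sequence):

```python
BLOCK_MB = 1.31
BLOCKS_PER_SEQ = 4096 // 16  # 256

def tier_blocks(capacity_gb: float) -> int:
    """Number of KV-cache blocks a tier of the given size can hold."""
    return int(capacity_gb * 1e3 / BLOCK_MB)

gpu_blocks = tier_blocks(45.5)  # ≈ 34,732
cpu_blocks = tier_blocks(256)   # ≈ 195,419
total_seqs = (gpu_blocks + cpu_blocks) // BLOCKS_PER_SEQ
print(gpu_blocks, cpu_blocks, total_seqs)  # → 34732 195419 899
```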
Sequence Capacity by Tier Configuration — Llama 70B, TP=4
| Config | Active Seqs (GPU) | Cached Seqs (CPU) | Archived Seqs (SSD) | Total Capacity |
|---|---|---|---|---|
| GPU only | 135 | 0 | 0 | 135 |
| GPU + 256GB CPU | 135 | 764 | 0 | 899 |
| GPU + 256GB CPU + 1TB SSD | 135 | 764 | 2,981 | 3,880 |
| GPU + 512GB CPU + 4TB SSD | 135 | 1,528 | 11,927 | 13,590 |
When to Use Each Tier
The decision tree for tiered offloading:
def should_enable_offloading(workload, gpu_seq_capacity: int) -> str:
    """gpu_seq_capacity: concurrent sequences the GPU tier alone can hold."""
    avg_seq_len = workload.avg_input_len + workload.avg_output_len
    if workload.concurrent_requests <= gpu_seq_capacity:
        return "GPU_ONLY"     # Everything fits, no offloading needed
    if workload.concurrent_requests <= gpu_seq_capacity * 5:
        return "GPU_CPU"      # CPU offloading handles moderate overflow
    if workload.request_pattern == "bursty":
        return "GPU_CPU_SSD"  # SSD absorbs bursts, CPU handles steady state
    if avg_seq_len > 32768:
        return "GPU_CPU"      # Long contexts benefit from the CPU tier;
                              # SSD is too slow for frequent long-context retrieval
    return "GPU_CPU"          # Default: SSD rarely worth the complexity
SSD offloading is only beneficial for workloads with high request churn (many short-lived requests generating cache that can be archived) or extreme concurrency (thousands of paused sessions). For steady-state serving with moderate concurrency, CPU offloading alone is sufficient.
Configuration
Enable tiered offloading in vLLM:
# CPU offloading only; --swap-space is GB of CPU RAM for KV cache offloading
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-70b-hf \
    --tensor-parallel-size 4 \
    --swap-space 256 \
    --gpu-memory-utilization 0.90

# CPU + SSD (experimental); --kv-cache-ssd-size is in GB
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-70b-hf \
    --tensor-parallel-size 4 \
    --swap-space 256 \
    --kv-cache-ssd-path /mnt/nvme/vllm_kv_cache \
    --kv-cache-ssd-size 1024
The --swap-space parameter specifies how many gigabytes of CPU pinned memory to allocate for KV cache swap. This memory is reserved at startup and cannot be used by other processes.
Summary
vLLM v1’s tiered KV cache offloading extends serving capacity far beyond GPU memory limits. The GPU tier (HBM) holds active blocks at 2 TB/s bandwidth. The CPU tier (pinned DRAM) holds warm blocks with 24 GB/s effective PCIe transfer bandwidth, adding 2% latency overhead at a 5% offload rate and 30% at 50%. The SSD tier (NVMe) provides archival storage at 7 GB/s with a 2-step prefetch pipeline. The eviction policy uses priority classes (active, recent, idle, evictable) with LRU ordering within each class. Async CUDA streams overlap transfers with compute, making offloading nearly invisible when the offload fraction stays below 20%. The primary use case is serving many concurrent long-context sessions where total KV cache exceeds GPU memory but only a fraction of sessions are actively generating tokens.