GPU HBM is the most expensive real estate in the data center. An H100 with 80 GB of memory costs around $30,000. After loading a 70B model’s weights (sharded tensor-parallel — at fp16 and TP=4, roughly 35 GB per GPU), you have maybe 45 GB left for KV cache — at roughly 320 KB of fp16 KV per token for a model that size, that is about 137,000 tokens of conversation history across all active requests. When you’re serving a chat application where a single user can reference 50,000 tokens of conversation history, that 45 GB fills up fast. The naive solution is to evict old KV cache to make room for new requests, forcing you to recompute everything if the user references an old message. But here’s the insight: your server has 512 GB of CPU DRAM sitting mostly empty, 4-16 TB of NVMe SSD with single-digit-millisecond access times, and potentially hundreds of other GPUs in the cluster with their own idle HBM. Dynamo’s KVBM treats all of this as a four-tier memory hierarchy where KV cache blocks can live anywhere in the cluster.
vLLM’s block manager operates within a single GPU: blocks live in HBM or get swapped to CPU DRAM. Dynamo’s KV Block Manager (KVBM) extends this to a four-tier hierarchy spanning the entire cluster. A KV cache block can reside in GPU HBM, CPU DRAM, NVMe SSD, or on a remote GPU — and KVBM decides where each block should live based on access recency, transfer cost, and capacity constraints.
The Four-Tier Architecture
KVBM Tier Hierarchy
Block Size and Transfer Math
Each KV cache block holds a fixed number of tokens (16 by default). For Llama 70B with GQA-8 (80 layers, 8 KV heads of dimension 128, fp16), a token’s K and V entries take 2 × 80 × 8 × 128 × 2 bytes = 320 KB, so each 16-token block occupies 5.24 MB.
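The arithmetic is easy to check in a few lines (80 layers, 8 KV heads, 128-dim heads, and 2-byte fp16 entries are Llama 70B’s published shape; the function name is just for illustration):

```python
def kv_block_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                   tokens_per_block=16, dtype_bytes=2):
    """Bytes per KV cache block: K and V (the factor of 2) for every layer."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return per_token * tokens_per_block

print(f"{kv_block_bytes() / 1e6:.2f} MB per block")       # 5.24 MB
print(f"{45e9 / (kv_block_bytes() / 16):,.0f} tokens")    # ~137,000 in 45 GB
```

The second print recovers the 137,000-token figure from the introduction: 45 GB divided by 320 KB of KV per token.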
Block Transfer Latency by Tier
| Transfer Path | Bandwidth | Latency (5.24 MB block) | Decision Threshold |
|---|---|---|---|
| HBM read (local) | 3,350 GB/s | 1.6 us | Always fastest |
| HBM to CPU DRAM | 28 GB/s (PCIe Gen5) | 187 us | If idle for 100+ iterations |
| CPU DRAM to HBM (restore) | 28 GB/s | 187 us | Request resumes |
| CPU to NVMe SSD | 7 GB/s | 749 us | If idle for 1000+ iterations |
| NVMe to CPU to HBM | 7 GB/s + 28 GB/s | 936 us total | Cold restore path |
| Remote GPU via NVLink | 900 GB/s | 5.8 us | Cross-GPU cache hit |
| Remote GPU via IB NDR | 50 GB/s | 105 us | Cross-node cache hit |
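The latency column falls straight out of block size over path bandwidth. A sketch that reproduces the table’s numbers (bandwidths taken from the table; the function and dict names are illustrative):

```python
BLOCK_BYTES = 5_242_880  # 5.24 MB block (Llama 70B, GQA-8, 16 tokens)

BANDWIDTH = {            # effective bytes/sec, from the table above
    "hbm_local": 3350e9,
    "pcie_gen5": 28e9,
    "nvme": 7e9,
    "nvlink": 900e9,
    "ib_ndr": 50e9,
}

def transfer_us(path: str) -> float:
    """One-hop transfer time for a single block, in microseconds."""
    return BLOCK_BYTES / BANDWIDTH[path] * 1e6

# SSD restores bounce through DRAM: SSD -> CPU, then CPU -> GPU over PCIe
cold_restore_us = transfer_us("nvme") + transfer_us("pcie_gen5")

print(f"{transfer_us('pcie_gen5'):.0f} us")  # 187 us, HBM <-> DRAM
print(f"{cold_restore_us:.0f} us")           # 936 us, cold restore path
```

The 936 µs cold-restore figure is why the SSD tier is reserved for blocks idle for 1000+ iterations: at that point the restore cost is amortized over a long absence.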
The KVBM Implementation
```python
from enum import Enum
from dataclasses import dataclass
from collections import OrderedDict

# 5.24 MB per block: 16 tokens of fp16 KV for Llama 70B with GQA-8
BLOCK_SIZE_BYTES = 5_242_880


class Tier(Enum):
    GPU_HBM = 0
    CPU_DRAM = 1
    NVME_SSD = 2
    REMOTE_GPU = 3


@dataclass
class KVBlock:
    block_id: int
    tier: Tier
    physical_addr: int         # Memory address in current tier
    sequence_id: int
    block_index: int           # Position within the sequence
    ref_count: int = 1
    last_access_iter: int = 0  # Iteration when last accessed


class KVBM:
    """Multi-tier KV Block Manager for Dynamo."""

    def __init__(
        self,
        gpu_capacity_blocks: int,
        cpu_capacity_blocks: int,
        ssd_capacity_blocks: int,
        gpu_watermark: float = 0.90,  # Trigger offload at 90%
    ):
        self.gpu_capacity = gpu_capacity_blocks
        self.cpu_capacity = cpu_capacity_blocks
        self.ssd_capacity = ssd_capacity_blocks
        self.watermark = gpu_watermark
        # Per-tier block storage (LRU order)
        self.gpu_blocks = OrderedDict()  # block_id -> KVBlock
        self.cpu_blocks = OrderedDict()
        self.ssd_blocks = OrderedDict()
        # Free lists of physical slot indices per tier
        self.gpu_free = list(range(gpu_capacity_blocks))
        self.cpu_free = list(range(cpu_capacity_blocks))
        self.ssd_free = list(range(ssd_capacity_blocks))
        self.current_iter = 0

    def allocate(self, sequence_id, num_blocks):
        """Allocate GPU blocks for a new sequence."""
        if len(self.gpu_free) < num_blocks:
            self._offload_to_cpu(num_blocks - len(self.gpu_free))
        allocated = []
        for i in range(num_blocks):
            gpu_slot = self.gpu_free.pop()
            block = KVBlock(
                block_id=gpu_slot,
                tier=Tier.GPU_HBM,
                physical_addr=gpu_slot * BLOCK_SIZE_BYTES,
                sequence_id=sequence_id,
                block_index=i,
                last_access_iter=self.current_iter,
            )
            self.gpu_blocks[block.block_id] = block
            allocated.append(block)
        return allocated

    def access(self, block_id):
        """Mark block as accessed (update LRU position)."""
        if block_id in self.gpu_blocks:
            self.gpu_blocks.move_to_end(block_id)
            self.gpu_blocks[block_id].last_access_iter = self.current_iter

    def _offload_to_cpu(self, num_blocks):
        """Offload LRU GPU blocks to CPU DRAM."""
        for _ in range(num_blocks):
            if not self.gpu_blocks:
                break
            # Evict least recently used GPU block
            block_id, block = self.gpu_blocks.popitem(last=False)
            gpu_slot = block.physical_addr // BLOCK_SIZE_BYTES
            if not self.cpu_free:
                self._offload_to_ssd(1)  # Cascade to SSD
            cpu_slot = self.cpu_free.pop()
            block.tier = Tier.CPU_DRAM
            block.physical_addr = cpu_slot * BLOCK_SIZE_BYTES
            self.cpu_blocks[block_id] = block
            self.gpu_free.append(gpu_slot)
            # Here the real system kicks off async DMA on the offload
            # stream: GPU HBM -> CPU DRAM (see Async DMA Pipeline below)

    def _offload_to_ssd(self, num_blocks):
        """Offload LRU CPU blocks to NVMe SSD."""
        for _ in range(num_blocks):
            if not self.cpu_blocks or not self.ssd_free:
                break  # SSD full or nothing to cascade
            block_id, block = self.cpu_blocks.popitem(last=False)
            cpu_slot = block.physical_addr // BLOCK_SIZE_BYTES
            ssd_slot = self.ssd_free.pop()
            block.tier = Tier.NVME_SSD
            block.physical_addr = ssd_slot * BLOCK_SIZE_BYTES
            self.ssd_blocks[block_id] = block
            self.cpu_free.append(cpu_slot)

    def restore_to_gpu(self, block_id):
        """Bring an offloaded block back to GPU HBM."""
        if block_id in self.cpu_blocks:
            block = self.cpu_blocks.pop(block_id)
            self.cpu_free.append(block.physical_addr // BLOCK_SIZE_BYTES)
            # Transfer: CPU -> GPU (187 us for 5.24 MB)
        elif block_id in self.ssd_blocks:
            block = self.ssd_blocks.pop(block_id)
            self.ssd_free.append(block.physical_addr // BLOCK_SIZE_BYTES)
            # Transfer: SSD -> CPU -> GPU (936 us total)
        else:
            raise KeyError(f"Block {block_id} not found in any tier")
        if not self.gpu_free:
            self._offload_to_cpu(1)
        gpu_slot = self.gpu_free.pop()
        block.tier = Tier.GPU_HBM
        block.physical_addr = gpu_slot * BLOCK_SIZE_BYTES
        block.last_access_iter = self.current_iter
        self.gpu_blocks[block_id] = block
        return block

    def check_watermark(self):
        """Proactively offload if GPU occupancy exceeds watermark."""
        occupancy = 1.0 - len(self.gpu_free) / self.gpu_capacity
        if occupancy > self.watermark:
            excess = int((occupancy - self.watermark) * self.gpu_capacity) + 1
            self._offload_to_cpu(excess)

    def step(self):
        """Called each iteration. Check watermarks, update counters."""
        self.current_iter += 1
        self.check_watermark()
```
The watermark threshold (90%) reserves 10% of GPU blocks as headroom for incoming requests. Without it, every new request would trigger synchronous offloading — blocking the forward pass while blocks transfer to CPU. With the watermark, proactive offloading happens asynchronously between iterations, ensuring blocks are already available when needed.
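To see the watermark arithmetic concretely (the capacities here are chosen for illustration): with 1,000 GPU blocks and a 0.90 watermark, a decode burst that leaves only 60 free blocks triggers a proactive offload that restores the full 10% headroom.

```python
gpu_capacity = 1000
watermark = 0.90
gpu_free = 60                                 # a burst left only 60 free

occupancy = 1.0 - gpu_free / gpu_capacity     # 0.94, above the watermark
excess = int((occupancy - watermark) * gpu_capacity) + 1
print(excess)                                 # ~40 blocks offloaded to CPU
print(gpu_free + excess)                      # free count back to >= 100
```

After the offload, occupancy is back at or below 0.90, so the next batch of incoming requests allocates from the free list without blocking the forward pass.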
Async DMA Pipeline
Offloading must not block inference. KVBM uses dedicated CUDA streams for transfers:
```python
import torch

# Dedicated streams for tier transfers; work queued here overlaps with
# inference running on the default stream. (CPU <-> SSD traffic is plain
# file I/O — or GPUDirect Storage — and does not go through a CUDA stream.)
offload_stream = torch.cuda.Stream()   # GPU -> CPU
restore_stream = torch.cuda.Stream()   # CPU -> GPU

def async_offload(gpu_block: torch.Tensor, cpu_block: torch.Tensor):
    """Non-blocking GPU -> CPU copy. cpu_block must live in pinned
    (page-locked) memory, or the copy silently degrades to synchronous."""
    with torch.cuda.stream(offload_stream):
        cpu_block.copy_(gpu_block, non_blocking=True)
    # Record an event so the scheduler can poll for completion
    return offload_stream.record_event()

def async_restore(cpu_block: torch.Tensor, gpu_block: torch.Tensor):
    """Non-blocking CPU -> GPU copy on the restore stream."""
    with torch.cuda.stream(restore_stream):
        gpu_block.copy_(cpu_block, non_blocking=True)
    return restore_stream.record_event()
```
The key: offloading runs on offload_stream while inference runs on the default stream. They execute concurrently. The scheduler ensures that a block being offloaded is not accessed until the transfer completes (tracked via CUDA events).
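The gating pattern itself is independent of CUDA: keep one completion handle per in-flight block and poll it non-blockingly before granting access. A dependency-free sketch of that bookkeeping (`threading.Event` stands in for `torch.cuda.Event`, whose non-blocking check is `event.query()`; the `pending` map and function names are illustrative, not Dynamo's API):

```python
import threading

pending = {}  # block_id -> threading.Event (stand-in for torch.cuda.Event)

def begin_offload(block_id):
    """Register an in-flight transfer; set() plays the role of DMA completion."""
    pending[block_id] = threading.Event()

def transfer_done(block_id):
    pending[block_id].set()

def is_safe_to_access(block_id) -> bool:
    """True once no transfer is in flight for this block."""
    ev = pending.get(block_id)
    if ev is None or ev.is_set():      # no transfer, or it has finished
        pending.pop(block_id, None)
        return True
    return False                       # still mid-transfer: do not touch
```

A scheduler using this would simply skip (or stall) any sequence whose blocks report unsafe, retrying on the next iteration once the event has fired.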