Part 4 of 30 in the series NVIDIA Dynamo & llm-d

NVIDIA Dynamo Part 4: KVBM — Multi-Tier KV Cache Offloading Across GPU, CPU, SSD, and Remote

GPU HBM is the most expensive real estate in the data center. An H100 with 80 GB of memory costs $30,000. After loading a 70B model’s weights, you have maybe 45 GB left for KV cache — enough to hold 137,000 tokens of conversation history across all active requests. When you’re serving a chat application where users can reference 50,000 tokens of conversation history, that 45 GB fills up fast. The naive solution is to evict old KV cache to make room for new requests, forcing you to recompute everything if the user references an old message. But here’s the insight: your server has 512 GB of CPU DRAM sitting mostly empty, 4-16 TB of NVMe SSD with single-digit millisecond access times, and potentially hundreds of other GPUs in the cluster with their own idle HBM. Dynamo’s KVBM treats all of this as a four-tier memory hierarchy where KV cache blocks can live anywhere — and migrates them between tiers based on access recency, transfer cost, and capacity constraints.
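The headroom arithmetic above is easy to check. A back-of-envelope sketch (dimensions for Llama 70B with GQA-8 and FP16 KV cache; the 45 GB figure is the illustrative headroom from the paragraph, not a measured value):

```python
# Back-of-envelope KV cache capacity for Llama 70B (GQA-8, FP16).
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 80, 8, 128, 2

# Per token: keys + values (factor of 2), per layer, per KV head
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES
print(kv_bytes_per_token)            # 327680 bytes = 320 KB per token

headroom_bytes = 45e9                # ~45 GB of HBM left after weights
max_tokens = int(headroom_bytes // kv_bytes_per_token)
print(max_tokens)                    # 137329 -> "about 137,000 tokens"
```

At 320 KB per token, a single 50,000-token conversation pins roughly 16 GB of HBM — over a third of the headroom — which is why multi-tier offloading matters.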

vLLM’s block manager operates within a single GPU: blocks live in HBM or get swapped to CPU DRAM. Dynamo’s KV Block Manager (KVBM) extends this to a four-tier hierarchy spanning the entire cluster. A KV cache block can reside in GPU HBM, CPU DRAM, NVMe SSD, or on a remote GPU — and KVBM decides where each block should live based on access recency, transfer cost, and capacity constraints.

The Four-Tier Architecture

KVBM Tier Hierarchy

| Tier | Medium | Capacity / Bandwidth | Role | Access latency |
|------|--------|----------------------|------|----------------|
| 0 | GPU HBM | 80 GB per GPU, 3.35 TB/s | Active sequences, hot cache | 0.3 us |
| 1 | CPU DRAM | 512 GB - 2 TB per node, 50 GB/s via PCIe | Recently preempted, warm cache | 20 us |
| 2 | NVMe SSD | 4-16 TB per node, 7 GB/s | Long-idle sequences, cold cache | 143 us |
| 3 | Remote GPU (NVLink/IB) | Cluster-wide pool, 25-900 GB/s | Cross-GPU cache sharing | 1.5-80 us |

Block Size and Transfer Math

Each KV cache block holds $B_s$ tokens (default 16). For Llama 70B with GQA-8:

$$\text{block\_bytes} = B_s \times 2 \times n_{\text{layers}} \times n_{\text{kv\_heads}} \times d_{\text{head}} \times \text{dtype\_bytes}$$

$$= 16 \times 2 \times 80 \times 8 \times 128 \times 2 = 5{,}242{,}880 \text{ bytes} = 5.24 \text{ MB}$$
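The same formula as a reusable helper (a sketch; the parameter names are mine, with Llama 70B defaults):

```python
def kv_block_bytes(block_tokens=16, n_layers=80, n_kv_heads=8,
                   d_head=128, dtype_bytes=2):
    """Bytes per KV cache block: keys + values (the factor of 2),
    across all layers and KV heads, for block_tokens tokens."""
    return block_tokens * 2 * n_layers * n_kv_heads * d_head * dtype_bytes

print(kv_block_bytes())              # 5242880 bytes = 5.24 MB (Llama 70B)
print(kv_block_bytes(n_layers=32))   # 2097152 bytes = 2.10 MB (a 32-layer,
                                     # GQA-8 model of the Llama 8B class)
```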


Block Transfer Latency by Tier

| Transfer path | Bandwidth | 5.24 MB block | Decision threshold |
|---------------|-----------|---------------|--------------------|
| HBM read (local) | 3,350 GB/s | 1.6 us | Always fastest |
| HBM to CPU DRAM | 28 GB/s (PCIe Gen5) | 187 us | If idle for 100+ iterations |
| CPU DRAM to HBM (restore) | 28 GB/s | 187 us | Request resumes |
| CPU to NVMe SSD | 7 GB/s | 749 us | If idle for 1000+ iterations |
| NVMe to CPU to HBM | 7 GB/s + 28 GB/s | 936 us total | Cold restore path |
| Remote GPU via NVLink | 900 GB/s | 5.8 us | Cross-GPU cache hit |
| Remote GPU via IB NDR | 50 GB/s | 105 us | Cross-node cache hit |
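Every latency column is just block size over path bandwidth; a quick sketch that reproduces the table's numbers (decimal GB, bandwidths as listed):

```python
BLOCK_BYTES = 5_242_880  # 5.24 MB per block (Llama 70B, GQA-8, 16 tokens)

def transfer_us(bandwidth_gb_s: float) -> float:
    """Microseconds to move one block over a path of the given bandwidth."""
    return BLOCK_BYTES / (bandwidth_gb_s * 1e9) * 1e6

for path, bw in [("HBM read", 3350), ("PCIe Gen5 to DRAM", 28),
                 ("DRAM to NVMe", 7), ("NVLink remote", 900),
                 ("IB NDR remote", 50)]:
    print(f"{path:20s} {transfer_us(bw):7.1f} us")

# The cold restore path is additive: NVMe -> DRAM, then DRAM -> HBM
print(f"{'Cold restore':20s} {transfer_us(7) + transfer_us(28):7.1f} us")
```

The 100x gap between a PCIe restore (187 us) and an NVLink fetch (5.8 us) is what makes remote-GPU cache hits a first-class tier rather than a fallback.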

The KVBM Implementation

from enum import Enum
from dataclasses import dataclass
from collections import OrderedDict

BLOCK_SIZE_BYTES = 5_242_880  # 5.24 MB per block (Llama 70B, GQA-8)

class Tier(Enum):
    GPU_HBM = 0
    CPU_DRAM = 1
    NVME_SSD = 2
    REMOTE_GPU = 3

@dataclass
class KVBlock:
    block_id: int
    tier: Tier
    physical_addr: int        # Byte offset within the current tier's pool
    sequence_id: int
    block_index: int          # Position within the sequence
    ref_count: int = 1
    last_access_iter: int = 0 # Iteration when last accessed

class KVBM:
    """Multi-tier KV Block Manager for Dynamo."""

    def __init__(
        self,
        gpu_capacity_blocks: int,
        cpu_capacity_blocks: int,
        ssd_capacity_blocks: int,
        gpu_watermark: float = 0.90,  # Trigger offload at 90%
    ):
        self.gpu_capacity = gpu_capacity_blocks
        self.cpu_capacity = cpu_capacity_blocks
        self.ssd_capacity = ssd_capacity_blocks
        self.watermark = gpu_watermark

        # Per-tier block storage (LRU order: least recently used first)
        self.gpu_blocks = OrderedDict()  # block_id -> KVBlock
        self.cpu_blocks = OrderedDict()
        self.ssd_blocks = OrderedDict()

        # Free slot lists, one slot per physical block in each tier
        self.gpu_free = list(range(gpu_capacity_blocks))
        self.cpu_free = list(range(cpu_capacity_blocks))
        self.ssd_free = list(range(ssd_capacity_blocks))

        self.current_iter = 0
        # Block identity must be independent of slot index: a block keeps
        # its ID as it migrates between tiers, while its old GPU slot is
        # recycled for new allocations.
        self._next_block_id = 0

    def allocate(self, sequence_id, num_blocks):
        """Allocate GPU blocks for a new sequence."""
        if len(self.gpu_free) < num_blocks:
            self._offload_to_cpu(num_blocks - len(self.gpu_free))
        if len(self.gpu_free) < num_blocks:
            raise RuntimeError("Out of KV cache capacity across all tiers")

        allocated = []
        for i in range(num_blocks):
            slot = self.gpu_free.pop()
            block = KVBlock(
                block_id=self._next_block_id,
                tier=Tier.GPU_HBM,
                physical_addr=slot * BLOCK_SIZE_BYTES,
                sequence_id=sequence_id,
                block_index=i,
                last_access_iter=self.current_iter,
            )
            self._next_block_id += 1
            self.gpu_blocks[block.block_id] = block
            allocated.append(block)
        return allocated

    def access(self, block_id):
        """Mark a GPU-resident block as accessed (update LRU position)."""
        if block_id in self.gpu_blocks:
            self.gpu_blocks.move_to_end(block_id)
            self.gpu_blocks[block_id].last_access_iter = self.current_iter

    def _offload_to_cpu(self, num_blocks):
        """Offload LRU GPU blocks to CPU DRAM, cascading to SSD if full."""
        for _ in range(num_blocks):
            if not self.gpu_blocks:
                break
            # Evict the least recently used GPU block
            block_id, block = self.gpu_blocks.popitem(last=False)

            if not self.cpu_free:
                self._offload_to_ssd(1)  # Cascade: make room in DRAM
            if not self.cpu_free:
                # Every tier is full; put the block back and stop
                self.gpu_blocks[block_id] = block
                self.gpu_blocks.move_to_end(block_id, last=False)
                break

            # Recycle the GPU slot; the block keeps its identity
            self.gpu_free.append(block.physical_addr // BLOCK_SIZE_BYTES)
            cpu_slot = self.cpu_free.pop()
            block.tier = Tier.CPU_DRAM
            block.physical_addr = cpu_slot * BLOCK_SIZE_BYTES
            self.cpu_blocks[block_id] = block

            # Initiate async DMA: GPU HBM -> CPU DRAM
            # cuda.memcpy_async(dst=cpu_addr, src=gpu_addr, size=BLOCK_SIZE_BYTES)

    def _offload_to_ssd(self, num_blocks):
        """Offload LRU CPU blocks to NVMe SSD."""
        for _ in range(num_blocks):
            if not self.cpu_blocks or not self.ssd_free:
                break
            block_id, block = self.cpu_blocks.popitem(last=False)
            self.cpu_free.append(block.physical_addr // BLOCK_SIZE_BYTES)
            ssd_slot = self.ssd_free.pop()
            block.tier = Tier.NVME_SSD
            block.physical_addr = ssd_slot * BLOCK_SIZE_BYTES
            self.ssd_blocks[block_id] = block

    def restore_to_gpu(self, block_id):
        """Bring an offloaded block back to GPU HBM."""
        if block_id in self.cpu_blocks:
            block = self.cpu_blocks.pop(block_id)
            self.cpu_free.append(block.physical_addr // BLOCK_SIZE_BYTES)
            # Transfer: CPU -> GPU (187 us for 5.24 MB)
        elif block_id in self.ssd_blocks:
            block = self.ssd_blocks.pop(block_id)
            self.ssd_free.append(block.physical_addr // BLOCK_SIZE_BYTES)
            # Transfer: SSD -> CPU -> GPU (936 us total)
        else:
            raise KeyError(f"Block {block_id} not found in any tier")

        if not self.gpu_free:
            self._offload_to_cpu(1)

        gpu_slot = self.gpu_free.pop()
        block.tier = Tier.GPU_HBM
        block.physical_addr = gpu_slot * BLOCK_SIZE_BYTES
        block.last_access_iter = self.current_iter
        self.gpu_blocks[block_id] = block
        return block

    def check_watermark(self):
        """Proactively offload if GPU occupancy exceeds the watermark."""
        occupancy = 1.0 - len(self.gpu_free) / self.gpu_capacity
        if occupancy > self.watermark:
            excess = int((occupancy - self.watermark) * self.gpu_capacity) + 1
            self._offload_to_cpu(excess)

    def step(self):
        """Called each iteration: advance the clock and enforce watermarks."""
        self.current_iter += 1
        self.check_watermark()

The 90% Watermark

The watermark threshold (90%) reserves 10% of GPU blocks as headroom for incoming requests. Without it, every new request would trigger synchronous offloading — blocking the forward pass while blocks transfer to CPU. With the watermark, proactive offloading happens asynchronously between iterations, ensuring blocks are already available when needed.
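A quick sanity check on what 10% headroom buys. Illustrative numbers, assuming the 5.24 MB block size from earlier and ~45 GB of HBM left for cache:

```python
# Illustrative: an 80 GB H100 with ~45 GB of HBM available for KV cache
BLOCK_BYTES = 5_242_880
gpu_capacity = int(45e9 // BLOCK_BYTES)       # ~8,583 blocks total
headroom = int(gpu_capacity * (1 - 0.90))     # blocks kept free at watermark

# A 512-token prompt at 16 tokens per block needs 32 fresh blocks
blocks_per_request = 512 // 16
print(gpu_capacity, headroom, headroom // blocks_per_request)
# ~858 free blocks -> ~26 new 512-token requests admitted back-to-back
# before any allocation has to wait on an offload
```

That headroom is what keeps offloading off the critical path: the transfers run between iterations, not in front of a waiting request.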

Async DMA Pipeline

Offloading must not block inference. KVBM uses dedicated CUDA streams for transfers:

import torch

# Dedicated streams so tier transfers overlap with compute on the default
# stream. (SSD I/O is host-side file I/O, not a CUDA operation, so it runs
# on a separate host thread pool rather than a CUDA stream.)
offload_stream = torch.cuda.Stream()   # GPU -> CPU
restore_stream = torch.cuda.Stream()   # CPU -> GPU

def async_offload(gpu_block, cpu_block):
    """Non-blocking GPU -> CPU copy of one KV block."""
    # Pinned (page-locked) host memory is required for truly async DMA
    assert cpu_block.is_pinned()
    with torch.cuda.stream(offload_stream):
        cpu_block.copy_(gpu_block, non_blocking=True)
        # Record an event so the scheduler can poll for completion
        return offload_stream.record_event()

def async_restore(cpu_block, gpu_block):
    """Non-blocking CPU -> GPU copy on the restore stream."""
    with torch.cuda.stream(restore_stream):
        gpu_block.copy_(cpu_block, non_blocking=True)
        return restore_stream.record_event()

The key: offloading runs on offload_stream while inference runs on the default stream. They execute concurrently. The scheduler ensures that a block being offloaded is not accessed until the transfer completes (tracked via CUDA events).
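That scheduler-side bookkeeping can be sketched tier-agnostically. Nothing below is Dynamo's actual API; the `event` objects stand in for anything with a `.query() -> bool` method, mirroring `torch.cuda.Event.query()`, which returns True once all work captured by the event has completed:

```python
class TransferTracker:
    """Maps in-flight block transfers to their completion events, so the
    scheduler never schedules attention over a block that is mid-copy."""

    def __init__(self):
        self.pending = {}  # block_id -> event with .query() -> bool

    def start(self, block_id, event):
        """Record that a transfer for this block is now in flight."""
        self.pending[block_id] = event

    def is_ready(self, block_id):
        """True if the block has no transfer in flight (or it finished)."""
        event = self.pending.get(block_id)
        if event is None:
            return True
        if event.query():          # Transfer done; drop the event
            del self.pending[block_id]
            return True
        return False               # Scheduler must skip this block for now

# Demo with a stub standing in for torch.cuda.Event
class StubEvent:
    def __init__(self):
        self.done = False
    def query(self):
        return self.done

tracker, ev = TransferTracker(), StubEvent()
tracker.start(block_id=7, event=ev)
print(tracker.is_ready(7))   # False: offload still in flight
ev.done = True
print(tracker.is_ready(7))   # True: safe to touch the block again
```

Polling with `query()` instead of calling `event.synchronize()` keeps the scheduler itself non-blocking: a not-yet-ready block just stays out of the next batch.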