Cold starts kill autoscaling. When a traffic spike hits and Dynamo needs a new model replica, the standard path is: allocate GPU, load 140 GB of weights from storage, construct CUDA graphs, warm up the KV cache. On NVMe SSD at 14 GB/s, weight loading alone takes 10 seconds. With network storage, 30-60 seconds. By the time the replica is ready, the traffic spike may have passed.
ModelExpress eliminates this bottleneck by streaming weights directly over NVLink from a GPU that already holds the model, with NIXL handling the low-level peer-to-peer transfers. The result: weight loading drops from tens of seconds to under 200 ms on Hopper and under 100 ms on Blackwell.
## The Cold Start Timeline

**Cold Start Breakdown: Traditional vs ModelExpress**
| Phase | Traditional (NVMe) | ModelExpress (NVLink 4.0) | ModelExpress (NVLink 5.0) |
|---|---|---|---|
| Weight loading | 10,000 ms | 156 ms | 78 ms |
| CUDA graph construction | 2,000 ms | 0 ms (binary checkpoint) | 0 ms |
| KV cache warmup | 500 ms | 500 ms | 500 ms |
| Total cold start | 12,500 ms | 656 ms | 578 ms |
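The totals in the table reduce to a single division (model size over link bandwidth) plus the fixed graph and warmup phases; a quick sketch of the arithmetic, with all values taken from the table:

```python
# Cold-start total = weight loading + CUDA graph work + KV cache warmup.
MODEL_GB = 140

def weight_load_ms(bw_gbps: float) -> float:
    """Time to move the full model at the given link bandwidth."""
    return MODEL_GB / bw_gbps * 1000

def cold_start_ms(bw_gbps: float, graph_ms: float, warmup_ms: float = 500) -> float:
    return weight_load_ms(bw_gbps) + graph_ms + warmup_ms

traditional = cold_start_ms(14, graph_ms=2000)   # NVMe: 12,500 ms
hopper = cold_start_ms(900, graph_ms=0)          # NVLink 4.0: ~656 ms
blackwell = cold_start_ms(1800, graph_ms=0)      # NVLink 5.0: ~578 ms
```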
## Weight Streaming Protocol
ModelExpress does not load-then-serve. It streams layer by layer:
Timeline:

```
t=0ms:    Request to spin up replica on GPU_new
t=0.5ms:  GPU_source begins NVLink DMA of Layer 0 weights
t=2ms:    Layer 0 weights arrive on GPU_new (~1.75 GB at ~900 GB/s)
t=2ms:    GPU_new begins CUDA graph setup for Layer 0
t=3ms:    Layer 0 ready. GPU_new can process tokens through Layer 0.
t=3ms:    Layer 1 weights streaming in parallel
...
t=156ms:  All 80 layers loaded. Full model operational.
```
The key optimization: pipelined streaming. Execution on early layers begins while later layers are still transferring. The effective time-to-first-token for a new replica is the time to load the FIRST layer (~2ms), not the full model (~156ms).
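A minimal sketch of that pipelining claim, using the ~1.75 GB/layer and 900 GB/s figures from this article: the link serializes transfers, so layer i becomes resident at (i + 1) × ~1.94 ms, and layer 0 gates time-to-first-token while the full model trails in at ~156 ms.

```python
# Layer-by-layer streaming: layer i is executable as soon as its
# transfer finishes, long before the whole model has arrived.
NUM_LAYERS = 80
LAYER_GB = 1.75
NVLINK_GBPS = 900

per_layer_ms = LAYER_GB / NVLINK_GBPS * 1000  # ~1.94 ms per layer

# The link is serialized, so layer i finishes at (i + 1) * per_layer_ms.
ready_at_ms = [(i + 1) * per_layer_ms for i in range(NUM_LAYERS)]

time_to_first_layer = ready_at_ms[0]   # ~2 ms: gates time-to-first-token
time_to_full_model = ready_at_ms[-1]   # ~156 ms: all layers resident
```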
## Transfer Math
Each layer of Llama 70B has approximately:
- Attention (Q+K+V+O): ~300 MB (GQA shrinks K and V)
- FFN (SwiGLU, 3 matrices): ~1.4 GB
- Norms: negligible
- Total per layer: ~1.75 GB
80 layers: 140 GB total.
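A sketch of where those per-layer numbers come from, assuming the published Llama 70B shapes (not stated in this article): hidden size 8192, 8 KV heads of dimension 128 under GQA, FFN intermediate 28672, BF16 weights.

```python
# Assumed Llama 70B dimensions; 2 bytes per BF16 parameter.
H, KV_HEADS, HEAD_DIM, FFN_DIM, BYTES = 8192, 8, 128, 28672, 2

q_and_o = 2 * H * H                     # Q and O projections stay full-width
k_and_v = 2 * H * KV_HEADS * HEAD_DIM   # GQA shrinks K and V to 8 heads
attn_gb = (q_and_o + k_and_v) * BYTES / 1e9   # ~0.30 GB
ffn_gb = 3 * H * FFN_DIM * BYTES / 1e9        # ~1.41 GB (gate, up, down)
layer_gb = attn_gb + ffn_gb                   # ~1.71 GB, i.e. the ~1.75 GB above
total_gb = 80 * layer_gb                      # ~137 GB; embeddings and norms round it toward 140 GB
```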
**Per-Layer Transfer Time by Interconnect**
| Interconnect | Bandwidth | 1.75 GB/layer | Full Model (140 GB) |
|---|---|---|---|
| NVLink 4.0 (H100) | 900 GB/s | 1.94 ms | 156 ms |
| NVLink 5.0 (B200) | 1,800 GB/s | 0.97 ms | 78 ms |
| NVSwitch (multi-hop) | ~600 GB/s effective | 2.92 ms | 233 ms |
| InfiniBand NDR (inter-node) | 50 GB/s | 35 ms | 2,800 ms |
| NVMe SSD | 14 GB/s | 125 ms | 10,000 ms |
The bandwidth gap between NVLink 4.0 (900 GB/s) and NVMe SSD (14 GB/s) is 64x. This single fact explains why GPU-to-GPU streaming transforms cold starts from a 10-second problem to a 156-millisecond problem.
## NIXL: The Inference Exchange Layer
NIXL is the low-level library that implements GPU-to-GPU data movement for both ModelExpress (weight streaming) and KV cache transfer (disaggregated serving).
### Architecture
NIXL Stack:

```
     Application (Dynamo Router / Planner)
                      |
       NIXL API (Python + Rust bindings)
                      |
     Transfer Engine (selects optimal path)
                      |
     ┌───────────┬───────────┬───────────┐
     │  NVLink   │ PCIe P2P  │  RDMA/IB  │
     │  (intra-  │  (intra-  │  (inter-  │
     │   node)   │   node,   │   node)   │
     │           │  slower)  │           │
     └───────────┴───────────┴───────────┘
                      |
GPU Memory (source) -----> GPU Memory (destination)
```
### Key NIXL Operations

**Weight transfer (ModelExpress):**
```python
# Pseudocode for NIXL weight streaming
def stream_model_weights(source_gpu, dest_gpu, model):
    for layer_idx in range(model.num_layers):
        # Get source memory address for this layer's weights
        src_addr = source_gpu.get_layer_addr(layer_idx)
        layer_size = model.layer_size_bytes(layer_idx)

        # Initiate async DMA transfer
        transfer_handle = nixl.async_transfer(
            src_device=source_gpu.device_id,
            dst_device=dest_gpu.device_id,
            src_addr=src_addr,
            dst_addr=dest_gpu.allocate(layer_size),
            size=layer_size,
            protocol="nvlink",  # or "rdma" for inter-node
        )

        # Signal dest GPU that this layer is ready. Bind layer_idx as a
        # default argument so each callback captures its own index, not
        # the loop variable's final value.
        transfer_handle.on_complete(
            callback=lambda idx=layer_idx: dest_gpu.activate_layer(idx)
        )
```
**KV cache transfer (disaggregated prefill-decode):**
```python
def transfer_kv_cache(prefill_gpu, decode_gpu, sequence):
    """Transfer KV cache from prefill worker to decode worker."""
    kv_blocks = sequence.get_kv_block_addrs()
    total_bytes = len(kv_blocks) * BLOCK_SIZE_BYTES

    # Choose transfer protocol based on topology
    if nixl.are_nvlink_connected(prefill_gpu, decode_gpu):
        protocol = "nvlink"    # 900 GB/s, 1.5 ms for 1.34 GB
    elif nixl.same_node(prefill_gpu, decode_gpu):
        protocol = "pcie_p2p"  # 28 GB/s, 48 ms for 1.34 GB
    else:
        protocol = "rdma"      # 50 GB/s, 27 ms for 1.34 GB

    # Batch transfer all blocks
    nixl.bulk_transfer(
        src_device=prefill_gpu.device_id,
        dst_device=decode_gpu.device_id,
        src_addrs=[b.addr for b in kv_blocks],
        dst_addrs=decode_gpu.allocate_blocks(len(kv_blocks)),
        sizes=[BLOCK_SIZE_BYTES] * len(kv_blocks),
        protocol=protocol,
    )
```
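For scale, the 1.34 GB used in the protocol comments is roughly the KV cache of a 4,096-token Llama 70B sequence; a sketch assuming 80 layers, 8 KV heads of dimension 128 (GQA), and BF16:

```python
# Assumed Llama 70B KV shapes; 2 bytes per BF16 value.
LAYERS, KV_HEADS, HEAD_DIM, BYTES = 80, 8, 128, 2

def kv_cache_gb(tokens: int) -> float:
    # K and V tensors, per layer, per token
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES
    return tokens * per_token / 1e9

seq_4k = kv_cache_gb(4096)  # ~1.34 GB
```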
### Multi-Path Transfers

When source and destination share multiple NVLink links (an H100 exposes 18 NVLink 4.0 links at 50 GB/s each; each GPU in a GB200 NVL72 likewise has 18 NVLink 5.0 links), NIXL stripes the transfer across all available paths:
```
Single NVLink lane:   900 GB/s / 18 = 50 GB/s per lane
All 18 lanes:         900 GB/s aggregate
Stripe 140 GB across: each lane transfers ~7.8 GB
Wall time:            7.8 GB / 50 GB/s = 156 ms
```
This is the same 156 ms total, because the 900 GB/s NVLink figure is already the aggregate across all 18 links. But for non-uniform topologies (some GPU pairs connected by more links than others), NIXL's path selection avoids bottlenecked links.
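The striping arithmetic can be sketched as follows; the 50 GB/s-per-link figure comes from the breakdown above, and the 9-link case is a hypothetical degraded topology:

```python
# Even striping across parallel links: wall time is one stripe's time.
def striped_wall_time_ms(total_gb: float, lanes: int, lane_gbps: float = 50.0) -> float:
    per_lane_gb = total_gb / lanes          # even stripe across lanes
    return per_lane_gb / lane_gbps * 1000   # lanes transfer in parallel

full_links = striped_wall_time_ms(140, lanes=18)  # ~156 ms, same as aggregate
half_links = striped_wall_time_ms(140, lanes=9)   # ~311 ms: half the links, double the time
```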
## Binary-Level Checkpointing
Traditional model loading: load weights from disk, construct PyTorch model, compile CUDA graphs. The CUDA graph construction phase takes 2-3 seconds because it must trace the model forward pass, capture kernel launches, and optimize the execution graph.
ModelExpress skips this: it transfers not just weights but the entire CUDA graph binary from the source GPU. The destination GPU receives a pre-constructed CUDA graph that maps to the same GPU architecture. No reconstruction needed.
```
Traditional:
  Load weights (10 s) -> Build model (0.5 s) -> Construct CUDA graphs (2 s) -> Ready

ModelExpress:
  Stream weights (156 ms) -> Restore CUDA graph binary (5 ms) -> Ready
```
The CUDA graph binary is small (10-50 MB) compared to model weights (140 GB). It transfers in under 1ms over NVLink.
Binary checkpoint restore only works between GPUs of the same architecture (H100-to-H100, B200-to-B200). Cross-architecture transfers (H100-to-B200) require CUDA graph reconstruction on the destination. Dynamo tracks GPU types in its cluster topology and only uses binary checkpoints for same-architecture transfers.
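The topology check reduces to an architecture-equality gate; a sketch (the helper name is illustrative, not Dynamo's actual API):

```python
# Binary CUDA-graph restore is only valid between identical GPU
# architectures; otherwise the destination must rebuild its graphs.
def can_restore_binary_graph(src_arch: str, dst_arch: str) -> bool:
    return src_arch == dst_arch

can_restore_binary_graph("H100", "H100")  # True: restore graph binary (~5 ms)
can_restore_binary_graph("H100", "B200")  # False: reconstruct graphs on dest
```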