Cold starts kill autoscaling. When a traffic spike hits and Dynamo needs a new model replica, the standard path is: allocate GPU, load 140 GB of weights from storage, construct CUDA graphs, warm up the KV cache. On NVMe SSD at 14 GB/s, weight loading alone takes 10 seconds. With network storage, 30-60 seconds. By the time the replica is ready, the traffic spike may have passed.
ModelExpress eliminates this bottleneck by streaming weights directly over NVLink from a GPU that already holds the model, with NIXL handling the low-level peer-to-peer transfers. The result: weight loading drops from tens of seconds to under 200 ms on Hopper and under 100 ms on Blackwell.
## The Cold Start Timeline

**Cold Start Breakdown: Traditional vs ModelExpress**
| Phase | Traditional (NVMe) | ModelExpress (NVLink 4.0) | ModelExpress (NVLink 5.0) |
|---|---|---|---|
| Weight loading | 10,000 ms | 156 ms | 78 ms |
| CUDA graph construction | 2,000 ms | 0 ms (binary checkpoint) | 0 ms |
| KV cache warmup | 500 ms | 500 ms | 500 ms |
| Total cold start | 12,500 ms | 656 ms | 578 ms |
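The totals in the table reduce to a single division (model size over link bandwidth) plus the fixed graph and warmup phases; a quick sketch of the arithmetic, with all values taken from the table:

```python
# Cold-start total = weight loading + CUDA graph work + KV cache warmup.
MODEL_GB = 140

def weight_load_ms(bw_gbps: float) -> float:
    """Time to move the full model at the given link bandwidth."""
    return MODEL_GB / bw_gbps * 1000

def cold_start_ms(bw_gbps: float, graph_ms: float, warmup_ms: float = 500) -> float:
    return weight_load_ms(bw_gbps) + graph_ms + warmup_ms

traditional = cold_start_ms(14, graph_ms=2000)   # NVMe: 12,500 ms
hopper = cold_start_ms(900, graph_ms=0)          # NVLink 4.0: ~656 ms
blackwell = cold_start_ms(1800, graph_ms=0)      # NVLink 5.0: ~578 ms
```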
## Weight Streaming Protocol
ModelExpress does not load-then-serve. It streams layer by layer:
Timeline:

```
t=0ms:    Request to spin up replica on GPU_new
t=0.5ms:  GPU_source begins NVLink DMA of Layer 0 weights
t=2ms:    Layer 0 weights arrive on GPU_new (~1.75 GB at ~900 GB/s)
t=2ms:    GPU_new begins CUDA graph setup for Layer 0
t=3ms:    Layer 0 ready. GPU_new can process tokens through Layer 0.
t=3ms:    Layer 1 weights streaming in parallel
...
t=156ms:  All 80 layers loaded. Full model operational.
```
The key optimization: pipelined streaming. Execution on early layers begins while later layers are still transferring. The effective time-to-first-token for a new replica is the time to load the FIRST layer (~2ms), not the full model (~156ms).
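A minimal sketch of that pipelining claim, using the ~1.75 GB/layer and 900 GB/s figures from this article: the link serializes transfers, so layer i becomes resident at (i + 1) × ~1.94 ms, and layer 0 gates time-to-first-token while the full model trails in at ~156 ms.

```python
# Layer-by-layer streaming: layer i is executable as soon as its
# transfer finishes, long before the whole model has arrived.
NUM_LAYERS = 80
LAYER_GB = 1.75
NVLINK_GBPS = 900

per_layer_ms = LAYER_GB / NVLINK_GBPS * 1000  # ~1.94 ms per layer

# The link is serialized, so layer i finishes at (i + 1) * per_layer_ms.
ready_at_ms = [(i + 1) * per_layer_ms for i in range(NUM_LAYERS)]

time_to_first_layer = ready_at_ms[0]   # ~2 ms: gates time-to-first-token
time_to_full_model = ready_at_ms[-1]   # ~156 ms: all layers resident
```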
## Transfer Math
Each layer of Llama 70B has approximately:
- Attention (Q+K+V+O): ~300 MB (GQA shrinks K and V)
- FFN (SwiGLU, 3 matrices): ~1.4 GB
- Norms: negligible
- Total per layer: ~1.75 GB
80 layers: 140 GB total.
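A sketch of where those per-layer numbers come from, assuming the published Llama 70B shapes (not stated in this article): hidden size 8192, 8 KV heads of dimension 128 under GQA, FFN intermediate 28672, BF16 weights.

```python
# Assumed Llama 70B dimensions; 2 bytes per BF16 parameter.
H, KV_HEADS, HEAD_DIM, FFN_DIM, BYTES = 8192, 8, 128, 28672, 2

q_and_o = 2 * H * H                     # Q and O projections stay full-width
k_and_v = 2 * H * KV_HEADS * HEAD_DIM   # GQA shrinks K and V to 8 heads
attn_gb = (q_and_o + k_and_v) * BYTES / 1e9   # ~0.30 GB
ffn_gb = 3 * H * FFN_DIM * BYTES / 1e9        # ~1.41 GB (gate, up, down)
layer_gb = attn_gb + ffn_gb                   # ~1.71 GB, i.e. the ~1.75 GB above
total_gb = 80 * layer_gb                      # ~137 GB; embeddings and norms round it toward 140 GB
```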
**Per-Layer Transfer Time by Interconnect**
| Interconnect | Bandwidth | 1.75 GB/layer | Full Model (140 GB) |
|---|---|---|---|
| NVLink 4.0 (H100) | 900 GB/s | 1.94 ms | 156 ms |
| NVLink 5.0 (B200) | 1,800 GB/s | 0.97 ms | 78 ms |
| NVSwitch (multi-hop) | ~600 GB/s effective | 2.92 ms | 233 ms |
| InfiniBand NDR (inter-node) | 50 GB/s | 35 ms | 2,800 ms |
| NVMe SSD | 14 GB/s | 125 ms | 10,000 ms |
The bandwidth gap between NVLink 4.0 (900 GB/s) and NVMe SSD (14 GB/s) is 64x. This single fact explains why GPU-to-GPU streaming transforms cold starts from a 10-second problem to a 156-millisecond problem.
## NIXL: The Inference Exchange Layer
NIXL is the low-level library that implements GPU-to-GPU data movement for both ModelExpress (weight streaming) and KV cache transfer (disaggregated serving).
### Architecture
NIXL Stack:

```
     Application (Dynamo Router / Planner)
                      |
       NIXL API (Python + Rust bindings)
                      |
     Transfer Engine (selects optimal path)
                      |
     ┌───────────┬───────────┬───────────┐
     │  NVLink   │ PCIe P2P  │  RDMA/IB  │
     │  (intra-  │  (intra-  │  (inter-  │
     │   node)   │   node,   │   node)   │
     │           │  slower)  │           │
     └───────────┴───────────┴───────────┘
                      |
GPU Memory (source) -----> GPU Memory (destination)
```
### Key NIXL Operations

**Weight transfer (ModelExpress):**
```python
# Pseudocode for NIXL weight streaming
def stream_model_weights(source_gpu, dest_gpu, model):
    for layer_idx in range(model.num_layers):
        # Get source memory address for this layer's weights
        src_addr = source_gpu.get_layer_addr(layer_idx)
        layer_size = model.layer_size_bytes(layer_idx)

        # Initiate async DMA transfer
        transfer_handle = nixl.async_transfer(
            src_device=source_gpu.device_id,
            dst_device=dest_gpu.device_id,
            src_addr=src_addr,
            dst_addr=dest_gpu.allocate(layer_size),
            size=layer_size,
            protocol="nvlink",  # or "rdma" for inter-node
        )

        # Signal dest GPU that this layer is ready. Bind layer_idx as a
        # default argument so each callback captures its own index, not
        # the loop variable's final value.
        transfer_handle.on_complete(
            callback=lambda idx=layer_idx: dest_gpu.activate_layer(idx)
        )
```
**KV cache transfer (disaggregated prefill-decode):**
```python
def transfer_kv_cache(prefill_gpu, decode_gpu, sequence):
    """Transfer KV cache from prefill worker to decode worker."""
    kv_blocks = sequence.get_kv_block_addrs()
    total_bytes = len(kv_blocks) * BLOCK_SIZE_BYTES

    # Choose transfer protocol based on topology
    if nixl.are_nvlink_connected(prefill_gpu, decode_gpu):
        protocol = "nvlink"    # 900 GB/s, 1.5 ms for 1.34 GB
    elif nixl.same_node(prefill_gpu, decode_gpu):
        protocol = "pcie_p2p"  # 28 GB/s, 48 ms for 1.34 GB
    else:
        protocol = "rdma"      # 50 GB/s, 27 ms for 1.34 GB

    # Batch transfer all blocks
    nixl.bulk_transfer(
        src_device=prefill_gpu.device_id,
        dst_device=decode_gpu.device_id,
        src_addrs=[b.addr for b in kv_blocks],
        dst_addrs=decode_gpu.allocate_blocks(len(kv_blocks)),
        sizes=[BLOCK_SIZE_BYTES] * len(kv_blocks),
        protocol=protocol,
    )
```
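For scale, the 1.34 GB used in the protocol comments is roughly the KV cache of a 4,096-token Llama 70B sequence; a sketch assuming 80 layers, 8 KV heads of dimension 128 (GQA), and BF16:

```python
# Assumed Llama 70B KV shapes; 2 bytes per BF16 value.
LAYERS, KV_HEADS, HEAD_DIM, BYTES = 80, 8, 128, 2

def kv_cache_gb(tokens: int) -> float:
    # K and V tensors, per layer, per token
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES
    return tokens * per_token / 1e9

seq_4k = kv_cache_gb(4096)  # ~1.34 GB
```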
### Multi-Path Transfers

When source and destination share multiple NVLink links (an H100 exposes 18 NVLink 4.0 links at 50 GB/s each; each GPU in a GB200 NVL72 likewise has 18 NVLink 5.0 links), NIXL stripes the transfer across all available paths:
```
Single NVLink lane:   900 GB/s / 18 = 50 GB/s per lane
All 18 lanes:         900 GB/s aggregate
Stripe 140 GB across: each lane transfers ~7.8 GB
Wall time:            7.8 GB / 50 GB/s = 156 ms
```
This is the same 156 ms total, because the 900 GB/s NVLink figure is already the aggregate across all 18 links. But for non-uniform topologies (some GPU pairs connected by more links than others), NIXL's path selection avoids bottlenecked links.
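The striping arithmetic can be sketched as follows; the 50 GB/s-per-link figure comes from the breakdown above, and the 9-link case is a hypothetical degraded topology:

```python
# Even striping across parallel links: wall time is one stripe's time.
def striped_wall_time_ms(total_gb: float, lanes: int, lane_gbps: float = 50.0) -> float:
    per_lane_gb = total_gb / lanes          # even stripe across lanes
    return per_lane_gb / lane_gbps * 1000   # lanes transfer in parallel

full_links = striped_wall_time_ms(140, lanes=18)  # ~156 ms, same as aggregate
half_links = striped_wall_time_ms(140, lanes=9)   # ~311 ms: half the links, double the time
```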
## Binary-Level Checkpointing
Traditional model loading: load weights from disk, construct PyTorch model, compile CUDA graphs. The CUDA graph construction phase takes 2-3 seconds because it must trace the model forward pass, capture kernel launches, and optimize the execution graph.
ModelExpress skips this: it transfers not just weights but the entire CUDA graph binary from the source GPU. The destination GPU receives a pre-constructed CUDA graph that maps to the same GPU architecture. No reconstruction needed.
```
Traditional:
  Load weights (10 s) -> Build model (0.5 s) -> Construct CUDA graphs (2 s) -> Ready

ModelExpress:
  Stream weights (156 ms) -> Restore CUDA graph binary (5 ms) -> Ready
```
The CUDA graph binary is small (10-50 MB) compared to model weights (140 GB). It transfers in under 1ms over NVLink.
Binary checkpoint restore only works between GPUs of the same architecture (H100-to-H100, B200-to-B200). Cross-architecture transfers (H100-to-B200) require CUDA graph reconstruction on the destination. Dynamo tracks GPU types in its cluster topology and only uses binary checkpoints for same-architecture transfers.
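The topology check reduces to an architecture-equality gate; a sketch (the helper name is illustrative, not Dynamo's actual API):

```python
# Binary CUDA-graph restore is only valid between identical GPU
# architectures; otherwise the destination must rebuild its graphs.
def can_restore_binary_graph(src_arch: str, dst_arch: str) -> bool:
    return src_arch == dst_arch

can_restore_binary_graph("H100", "H100")  # True: restore graph binary (~5 ms)
can_restore_binary_graph("H100", "B200")  # False: reconstruct graphs on dest
```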