Deploying a new Llama 70B checkpoint takes 45-120 seconds of weight loading, during which the target GPUs serve zero requests: immediate revenue loss. Worse, if the new checkpoint carries a quality regression (a 2.1-point perplexity increase has been reported in production at Meta), every request for the next 4 hours hits the bad model until someone notices and rolls back. Dynamo eliminates both risks with blue-green deployment: load the new model on standby GPUs, send 5% canary traffic, measure quality with LLM-as-judge, then either promote (shift all traffic in under 100 ms) or roll back (instant revert). Zero downtime, zero bad traffic.
The Model Version Registry
Dynamo maintains a registry of all deployed model versions with their metadata:
```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class ModelVersion:
    version_id: str          # Unique version identifier (e.g., "v2.3.1")
    model_path: str          # Path to model weights
    config_hash: str         # Hash of model config for integrity check
    quantization: str        # "fp16", "fp8", "gptq-int4", etc.
    loaded_at: datetime      # When this version was loaded
    status: str              # "loading", "ready", "active", "standby", "retired"
    gpu_assignment: list     # Which GPUs hold this version
    health_metrics: dict     # Latency percentiles, error rate
    rollback_eligible: bool  # Can we roll back to this version?

class ModelVersionRegistry:
    def __init__(self):
        self.versions: dict = {}  # version_id -> ModelVersion
        self.active_version: Optional[ModelVersion] = None
        self.standby_version: Optional[ModelVersion] = None
        self.history: list = []   # Ordered list of past active version ids

    def register(self, version: ModelVersion):
        self.versions[version.version_id] = version

    def promote(self, version_id: str):
        """Make a version the active serving version."""
        old_active = self.active_version
        new_version = self.versions[version_id]
        if new_version.status != "ready":
            raise ValueError(f"Version {version_id} not ready")
        # Demote current active to standby
        if old_active:
            old_active.status = "standby"
            self.standby_version = old_active
        new_version.status = "active"
        self.active_version = new_version
        self.history.append(version_id)

    def rollback(self) -> str:
        """Rollback to the standby version."""
        if self.standby_version is None:
            raise ValueError("No standby version available")
        current = self.active_version
        standby = self.standby_version
        current.status = "standby"
        standby.status = "active"
        self.active_version = standby
        self.standby_version = current
        return standby.version_id
```
Blue-Green Architecture
Blue-green deployment requires two complete sets of model weights in GPU memory simultaneously:
```
Blue Environment (Active — serving traffic):
    GPU 0-3:  Llama 70B v2.3.0, TP=4
    Status:   active
    KV cache: 37.5 GB per GPU

Green Environment (Standby — loaded, idle):
    GPU 4-7:  Llama 70B v2.3.1, TP=4
    Status:   standby
    KV cache: 37.5 GB per GPU

Router: sends all requests to Blue
Swap:   router switches to Green in < 100 ms
        Blue becomes standby, Green becomes active
```
```python
class BlueGreenDeployer:
    def __init__(self, cluster, registry, required_gpus: int):
        self.cluster = cluster
        self.registry = registry            # ModelVersionRegistry
        self.required_gpus = required_gpus  # e.g., 4 for TP=4
        self.blue_pool = None               # Currently serving
        self.green_pool = None              # Standby / loading

    async def deploy_new_version(self, model_path: str, version_id: str):
        """Load new version into the green pool."""
        # Step 1: Identify available GPUs for green
        green_gpus = self.cluster.get_standby_gpus()
        if len(green_gpus) < self.required_gpus:
            raise InsufficientResourcesError(
                f"Need {self.required_gpus} GPUs, "
                f"have {len(green_gpus)} standby"
            )
        # Step 2: Load model into green pool (slow — 45-120 s)
        self.green_pool = await self.cluster.create_worker_pool(
            gpus=green_gpus,
            model_path=model_path,
            version_id=version_id
        )
        # Step 3: Warm up green pool
        await self.warmup(self.green_pool)
        # Step 4: Mark green as ready
        version = self.registry.versions[version_id]
        version.status = "ready"
        version.gpu_assignment = green_gpus

    async def swap(self):
        """Atomic swap: green becomes active, blue becomes standby."""
        if self.green_pool is None:
            raise ValueError("No green pool ready")
        # Atomic router update
        await self.cluster.router.update_target(self.green_pool.endpoint)
        # Swap references
        old_blue = self.blue_pool
        self.blue_pool = self.green_pool
        self.green_pool = old_blue
        # Update registry
        self.registry.promote(self.blue_pool.version_id)
```
Blue-green deployment requires double the GPU resources. For Llama 70B with TP=4, you need 8 A100-80GB GPUs — 4 for active serving and 4 for the standby version. This is the cost of zero-downtime deployment. If GPU budget is constrained, use rolling deployment (Section 5) instead.
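The memory arithmetic behind the double-resource requirement can be checked with a quick back-of-envelope sketch (all figures approximate, taken from the environment layout above):

```python
# Memory budget for blue-green on Llama 70B, FP16, TP=4, A100-80GB.
# Figures are approximate; adjust for your checkpoint and runtime overheads.
PARAMS_B = 70            # billions of parameters
BYTES_PER_PARAM = 2      # FP16
TP = 4                   # tensor-parallel degree
GPU_MEM_GB = 80          # A100-80GB capacity
KV_CACHE_GB = 37.5       # per-GPU KV cache budget from the layout above

weights_gb = PARAMS_B * BYTES_PER_PARAM       # 140 GB of weights total
weights_per_gpu = weights_gb / TP             # 35 GB per GPU
used_per_gpu = weights_per_gpu + KV_CACHE_GB  # 72.5 GB per GPU
assert used_per_gpu < GPU_MEM_GB  # fits, with ~7.5 GB left for activations

# Blue-green doubles the footprint across the cluster, not per GPU:
total_gpus = 2 * TP  # 8 GPUs for both environments
print(f"{weights_per_gpu:.1f} GB weights/GPU, "
      f"{used_per_gpu:.1f} GB used/GPU, {total_gpus} GPUs total")
```

Each environment fits on its own 4 GPUs; the cost is entirely in the second set of GPUs, not in per-GPU memory pressure.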
The Swap Protocol
The swap must be atomic from the client’s perspective — no requests should be lost or receive mixed responses from two model versions:
```python
import asyncio
import time

class AtomicSwapProtocol:
    async def execute_swap(self, router, old_pool, new_pool):
        """Execute blue-green swap without dropping requests."""
        # Phase 1: Drain — stop sending NEW requests to old pool.
        # In-flight requests on old pool continue to completion.
        router.set_drain_mode(old_pool)

        # Phase 2: Wait for in-flight requests to complete,
        # with a timeout to prevent infinite wait.
        drain_timeout = 30.0  # seconds
        start = time.time()
        while old_pool.in_flight_count > 0:
            if time.time() - start > drain_timeout:
                old_pool.abort_all()  # Force-abort remaining requests
                break
            await asyncio.sleep(0.1)

        # Phase 3: Atomic route switch
        router.set_active(new_pool)

        # Phase 4: Old pool enters standby
        old_pool.set_standby()
        # Total swap time: drain time + route update (~100 ms)
```
Swap Timing Analysis
Swap Phase Durations
| Phase | Duration | Requests Affected | Client Impact |
|---|---|---|---|
| Drain (new requests queued) | 0-30 s | Queued, not dropped | Higher TTFT |
| Route update | < 100 ms | 0 (atomic) | None |
| Standby transition | < 1 s | 0 | None |
| Total | < 31 s | Queued during drain | Temporary TTFT spike |
During the drain phase, new requests are queued at the router. Once the green pool goes active, queued requests are dispatched immediately. The maximum additional latency for a request arriving during drain is the drain duration.
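The worst-case added latency is easy to state precisely: a request arriving at time t during a drain window ending at drain_end waits exactly drain_end - t. A minimal helper (a sketch, times in seconds) makes the bound concrete:

```python
def extra_ttft(arrival_t: float, drain_start: float, drain_duration: float) -> float:
    """Extra queueing delay for a request that arrives during the drain window.

    Requests arriving outside the window see no added delay.
    """
    drain_end = drain_start + drain_duration
    if drain_start <= arrival_t < drain_end:
        # Held at the router until the green pool goes active at drain_end
        return drain_end - arrival_t
    return 0.0

print(extra_ttft(10.0, 0.0, 30.0))  # request 10 s into a 30 s drain waits 20.0 s
```

A request arriving at the very start of a 30 s drain sees the full 30 s; one arriving a moment before the switch sees almost none.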
Canary Rollout
Canary deployment sends a fraction of traffic to the new version before full promotion:
```python
class CanaryRollout:
    def __init__(self, router, registry):
        self.router = router
        self.registry = registry
        self.canary_percentage = 0
        self.metrics_window = 300  # 5 minutes per stage

    async def start_canary(self, new_version_id: str, stages: list = None):
        """Gradual traffic shift with automated rollback."""
        if stages is None:
            stages = [1, 5, 10, 25, 50, 100]  # percentage stages
        for pct in stages:
            self.canary_percentage = pct
            self.router.set_traffic_split(
                primary=self.registry.active_version.version_id,
                canary=new_version_id,
                canary_pct=pct
            )
            # Monitor for metrics_window duration
            metrics = await self.monitor(
                duration=self.metrics_window,
                version_id=new_version_id
            )
            # Check health gates
            if not self.check_health_gates(metrics):
                await self.rollback_canary(new_version_id)
                return False
        # All stages passed — promote
        self.registry.promote(new_version_id)
        return True

    def check_health_gates(self, metrics: dict) -> bool:
        """Automated health check for canary."""
        gates = {
            "p99_ttft_ms": 500,       # Max acceptable p99 TTFT
            "p99_tpot_ms": 50,        # Max acceptable p99 TPOT
            "error_rate": 0.001,      # Max 0.1% error rate
            "throughput_ratio": 0.9,  # At least 90% of baseline throughput
        }
        if metrics["p99_ttft_ms"] > gates["p99_ttft_ms"]:
            return False
        if metrics["p99_tpot_ms"] > gates["p99_tpot_ms"]:
            return False
        if metrics["error_rate"] > gates["error_rate"]:
            return False
        if metrics["throughput_ratio"] < gates["throughput_ratio"]:
            return False
        return True

    async def rollback_canary(self, version_id: str):
        """Remove canary and send all traffic back to primary."""
        self.router.set_traffic_split(
            primary=self.registry.active_version.version_id,
            canary=None,
            canary_pct=0
        )
        self.canary_percentage = 0
```
(Figure: canary traffic percentage over time, stepping through the rollout stages.)
Rolling Deployment (Resource-Efficient)
When GPU budget does not allow blue-green (double resources), rolling deployment updates workers one at a time:
```python
class RollingDeployer:
    def __init__(self, cluster, tp_size: int):
        self.cluster = cluster
        self.tp_size = tp_size

    async def rolling_update(self, new_model_path: str, new_version_id: str):
        """Update workers one TP group at a time."""
        tp_groups = self.cluster.get_tp_groups()
        for group in tp_groups:
            # Step 1: Remove this TP group from the serving pool
            self.cluster.router.remove_backend(group.endpoint)
            # Step 2: Drain in-flight requests on this group
            await group.drain(timeout=30.0)
            # Step 3: Unload old model
            await group.unload_model()
            # Step 4: Load new model (GPU memory freed by unload)
            await group.load_model(new_model_path, new_version_id)
            # Step 5: Warmup
            await group.warmup()
            # Step 6: Add back to serving pool
            self.cluster.router.add_backend(group.endpoint)
            # During steps 2-6, serving capacity is reduced
            # by 1/num_tp_groups
```
Rolling deployment provides zero-downtime updates with no extra GPUs, but serving capacity drops by 1/N during each worker update, where N is the number of TP groups. For 4 TP groups, capacity drops by 25% for the duration of each group's update (45-120 seconds). Total update time is N times the per-group update time (drain + load + warmup).
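The capacity arithmetic can be sketched directly (a toy calculation, not a Dynamo API):

```python
# Rolling-update capacity math: N independent TP groups updated sequentially,
# each taking per_group_s seconds (drain + load + warmup).
def rolling_profile(num_groups: int, per_group_s: float):
    capacity_drop = 1.0 / num_groups       # fraction of capacity offline at any moment
    total_time = num_groups * per_group_s  # groups are updated one at a time
    return capacity_drop, total_time

drop, total = rolling_profile(num_groups=4, per_group_s=120.0)
print(f"capacity drop {drop:.0%}, total deploy time {total:.0f} s")
# 4 groups at 120 s each: 25% drop, 480 s total
```

Fewer groups finish faster but dip deeper: 2 groups at 120 s each means a 50% drop for 240 s total, which matches the comparison table.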
Deployment Strategy Comparison — Llama 70B, 4xA100 TP=4
| Strategy | Extra GPUs | Downtime | Max Capacity Drop | Rollback Time | Total Deploy Time |
|---|---|---|---|---|---|
| Blue-Green | 4 (100%) | 0 | 0% | < 1 s | ~120 s |
| Canary | 4 (100%) | 0 | 0% | < 1 s | ~30 min |
| Rolling (2 groups) | 0 | 0 | 50% | ~120 s | ~240 s |
| Rolling (4 groups) | 0 | 0 | 25% | ~120 s | ~480 s |
| Stop-Start | 0 | 120 s | 100% | ~120 s | ~120 s |
Instant Rollback Mechanism
The key feature of blue-green is sub-second rollback. The standby version remains loaded in GPU memory:
```python
import asyncio
import time

class RollbackController:
    def __init__(self, registry, router, deployer):
        self.registry = registry
        self.router = router
        self.deployer = deployer

    async def instant_rollback(self) -> dict:
        """Rollback to standby version. Sub-second operation."""
        standby = self.registry.standby_version
        if standby is None or standby.status != "standby":
            raise RollbackError("No standby version available")
        start_time = time.time()
        # Step 1: Route traffic to standby pool
        self.router.set_active(
            self.deployer.green_pool  # Former blue, now standby
        )
        # Step 2: Update registry
        rollback_version = self.registry.rollback()
        elapsed = time.time() - start_time
        return {
            "rolled_back_to": rollback_version,
            "elapsed_ms": elapsed * 1000,
            "method": "instant_route_switch"
        }

    async def monitored_rollback(self, threshold_error_rate: float):
        """Automated rollback triggered by error rate."""
        while True:
            metrics = await self.get_current_metrics()
            if metrics["error_rate"] > threshold_error_rate:
                result = await self.instant_rollback()
                await self.alert(
                    f"Auto-rollback triggered. "
                    f"Error rate {metrics['error_rate']:.4f} "
                    f"exceeded threshold {threshold_error_rate}. "
                    f"Rolled back to {result['rolled_back_to']} "
                    f"in {result['elapsed_ms']:.1f}ms."
                )
                return result
            await asyncio.sleep(5.0)  # Check every 5 seconds
```
Rollback Speed Analysis
```python
# What happens during instant rollback:
#   Router updates target endpoint (in-memory pointer swap): ~0.01 ms
#   New requests go to standby pool: immediate
#   In-flight requests on old active continue to completion
#   (no abort needed — they finish naturally)
#
# Total rollback latency:            < 1 ms for the route switch
# First request on rollback version: < 50 ms (already warm)
# In-flight requests on bad version: drain over 0-30 seconds
```
(Figure: rollback latency compared across deployment strategies.)
Weight Storage and Caching
Fast model loading requires efficient weight storage:
```python
import os
import shutil

class WeightCache:
    def __init__(self, cache_dir: str, max_versions: int = 3):
        self.cache_dir = cache_dir
        self.max_versions = max_versions

    def cache_weights(self, version_id: str, model_path: str):
        """Pre-download and cache model weights locally."""
        cache_path = os.path.join(self.cache_dir, version_id)
        if os.path.exists(cache_path):
            return  # Already cached
        # Download from model registry (S3, GCS, etc.)
        download_model(model_path, cache_path)
        # Evict oldest cached version if over limit
        cached = self.list_cached()  # oldest first
        if len(cached) > self.max_versions:
            oldest = cached[0]
            shutil.rmtree(os.path.join(self.cache_dir, oldest))

    def list_cached(self) -> list:
        """Cached version ids, oldest first (by directory mtime)."""
        entries = os.listdir(self.cache_dir)
        return sorted(entries, key=lambda v: os.path.getmtime(
            os.path.join(self.cache_dir, v)))

    def get_local_path(self, version_id: str) -> str:
        path = os.path.join(self.cache_dir, version_id)
        if not os.path.exists(path):
            raise CacheMissError(f"Version {version_id} not cached")
        return path
```
Model Load Time by Storage Medium — Llama 70B FP16
| Source | Read BW | Load Time | Includes Repack |
|---|---|---|---|
| Local NVMe SSD | 7 GB/s | 20 s | No |
| Local NVMe + GPTQ repack | 7 GB/s | 24 s | Yes |
| NFS (10 GbE) | 1.1 GB/s | 127 s | No |
| S3 (single stream) | 0.5 GB/s | 280 s | No |
| S3 (parallel 32 streams) | 4 GB/s | 35 s | No |
| GPU-to-GPU (NVLink) | 300 GB/s | 0.5 s | No |
The last row shows why keeping the standby version resident is critical: even the fastest reload path, a GPU-to-GPU NVLink copy, takes about half a second, and reloading from any storage medium takes 20-280 seconds, while an in-memory rollback is a sub-millisecond route switch.
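The load-time column follows directly from size divided by bandwidth. A quick sketch, assuming ~140 GB of FP16 weights and ignoring repack and per-stream overheads:

```python
# Reproduce the load-time column: time = model_size / read_bandwidth.
# Llama 70B FP16 is ~140 GB of weights (70B params x 2 bytes).
MODEL_GB = 140

def load_time_s(bandwidth_gb_per_s: float) -> float:
    """Idealized sequential load time; real loads add repack and init cost."""
    return MODEL_GB / bandwidth_gb_per_s

for name, bw in [("Local NVMe", 7), ("NFS 10GbE", 1.1),
                 ("S3 x32 streams", 4), ("NVLink GPU-to-GPU", 300)]:
    print(f"{name}: {load_time_s(bw):.1f} s")
```

The printed values reproduce the table's 20 s, 127 s, 35 s, and 0.5 s rows to within rounding, confirming the table is pure bandwidth arithmetic.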
Version Compatibility
When switching between model versions, the KV cache from the old version is incompatible with the new version:
```python
class VersionCompatibilityChecker:
    def check_kv_cache_compatible(self, old_version, new_version) -> bool:
        """Check if KV cache can be reused across versions."""
        # KV cache is compatible ONLY if:
        #   1. Same architecture (num_layers, num_heads, head_dim)
        #   2. Same quantization method
        #   3. Same max_model_len and block_size
        # In practice, this means the same model with different LoRA
        # adapters, or the same model with only output-head changes.
        return (
            old_version.architecture == new_version.architecture and
            old_version.quantization == new_version.quantization and
            old_version.num_layers == new_version.num_layers and
            old_version.kv_heads == new_version.kv_heads and
            old_version.head_dim == new_version.head_dim
        )
```
Model version swaps invalidate ALL KV cache. Every in-flight request must either complete before the swap or be aborted and restarted. For sessions with long context (32K+ tokens), recomputing the KV cache costs 200-800ms of prefill time. The drain phase of the swap protocol exists specifically to let these requests finish naturally.
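The recompute cost scales linearly with context length and prefill throughput. The throughput figure below is an illustrative assumption, not a measured Dynamo number:

```python
# Estimated cost of recomputing a session's KV cache after a swap,
# assuming an illustrative prefill throughput of 40K tokens/s.
def recompute_cost_ms(context_tokens: int, prefill_tok_per_s: float = 40_000) -> float:
    """Prefill time needed to rebuild the KV cache from scratch."""
    return context_tokens / prefill_tok_per_s * 1000

print(f"{recompute_cost_ms(32_768):.0f} ms")  # ~819 ms for a 32K-token session
```

At that assumed throughput, a 32K-token session pays roughly 800 ms of extra prefill, which is why letting such requests drain naturally is cheaper than aborting them.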
Deployment Pipeline Integration
```yaml
# deployment.yaml — CI/CD integration
pipeline:
  stages:
    - name: validate
      steps:
        - run: python validate_model.py --model $MODEL_PATH
        - run: python check_quant_compatibility.py --model $MODEL_PATH
    - name: cache_weights
      steps:
        - run: dynamo cache download --model $MODEL_PATH --version $VERSION
    - name: load_green
      steps:
        - run: dynamo deploy green --model $MODEL_PATH --version $VERSION
        - wait_for: green_ready
    - name: canary
      steps:
        - run: dynamo canary start --version $VERSION --stages 1,5,25,50,100
        - monitor: health_gates
        - on_failure: dynamo rollback --instant
    - name: promote
      steps:
        - run: dynamo swap --confirm
        - run: dynamo notify --channel ops --message "Promoted $VERSION"

rollback:
  trigger: error_rate > 0.001 OR p99_latency > 500ms
  action: dynamo rollback --instant
  notify: pagerduty
```
```shell
# CLI commands for operations
dynamo version list
# VERSION   STATUS    LOADED_AT          GPU_ASSIGNMENT
# v2.3.0    standby   2025-03-22 08:00   GPU 4-7
# v2.3.1    active    2025-03-22 10:30   GPU 0-3

dynamo rollback --instant
# Rolled back to v2.3.0 in 0.8ms
# Traffic now serving from GPU 4-7

dynamo version retire v2.2.0
# Retired v2.2.0, freed 0 GPUs (was not loaded)
```
Cost of Safety
Resource Cost of Deployment Safety Strategies
| Strategy | GPU Overhead | Monthly Cost (4xA100) | Rollback SLA | Risk Level |
|---|---|---|---|---|
| None (stop-start) | 0% | $0 | 2 min | High |
| Blue-Green | 100% | $8,640 | < 1 ms | Very Low |
| Canary + Blue-Green | 100% | $8,640 | < 1 ms | Lowest |
| Rolling | 0% | $0 | 2 min | Medium |
| Blue-Green (time-shared) | 50% | $4,320 | 20 s | Low |
The time-shared variant keeps the standby version's weights in CPU memory on the same hosts and copies them to the GPUs only on rollback. Rollback then takes approximately 20 seconds instead of sub-millisecond, but it halves the GPU cost.
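The monthly-cost column is plain arithmetic; this sketch reconstructs it, assuming an illustrative ~$3/GPU-hour A100 rate and a 720-hour (30-day) month:

```python
# Reconstruct the "Monthly Cost" column from GPU overhead alone.
# The $3/GPU-hour rate is an illustrative on-demand figure, not a quote.
GPU_HOUR_USD = 3.0
HOURS_PER_MONTH = 24 * 30  # 720 hours in a 30-day month

def overhead_cost(extra_gpu_equivalents: float) -> float:
    """Monthly cost of the extra GPU capacity a safety strategy holds idle."""
    return extra_gpu_equivalents * GPU_HOUR_USD * HOURS_PER_MONTH

print(f"Blue-green (4 extra GPUs):   ${overhead_cost(4):,.0f}/month")
print(f"Time-shared (50% overhead):  ${overhead_cost(2):,.0f}/month")
```

This reproduces the $8,640 and $4,320 figures in the table; at other rental rates the ratios hold even as the absolute dollars change.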
Summary
NVIDIA Dynamo’s model versioning system provides zero-downtime deployment through blue-green architecture with instant rollback. The standby model version remains loaded in GPU memory, enabling sub-millisecond route switches when issues are detected. Canary rollout progressively shifts traffic (1%, 5%, 25%, 50%, 100%) with automated health gate checking at each stage. Rolling deployment offers a resource-efficient alternative when double GPU allocation is not feasible, at the cost of 25-50% temporary capacity reduction and 2-minute rollback time. KV cache invalidation is the primary constraint during version swaps — all in-flight requests must drain before the swap completes. For production deployments serving critical traffic, blue-green with canary provides the safest path: automated rollback triggers on error rate or latency regression, with recovery in under 1 millisecond.