Dynamo Model Versioning: Blue-Green Deployment and Safe Rollback

Part of the series "NVIDIA Dynamo & llm-d"

Deploying a new Llama 70B checkpoint takes 45-120 seconds to load weights, during which those GPUs serve zero requests—immediate revenue loss. Worse: if the new checkpoint has a quality regression (2.1 perplexity increase found in production at Meta), every request for the next 4 hours uses the bad model until someone notices and rolls back. Dynamo eliminates both risks with blue-green deployment: load the new model on standby GPUs, run 5% canary traffic, measure quality with LLM-as-judge, and either promote (shift all traffic in 200ms) or rollback (instant revert). Zero downtime, zero bad traffic.

The Model Version Registry

Dynamo maintains a registry of all deployed model versions with their metadata:

from dataclasses import dataclass
from datetime import datetime

@dataclass
class ModelVersion:
    version_id: str           # Unique version identifier (e.g., "v2.3.1")
    model_path: str           # Path to model weights
    config_hash: str          # Hash of model config for integrity check
    quantization: str         # "fp16", "fp8", "gptq-int4", etc.
    loaded_at: datetime       # When this version was loaded
    status: str               # "loading", "ready", "active", "standby", "retired"
    gpu_assignment: list      # Which GPUs hold this version
    health_metrics: dict      # Latency percentiles, error rate
    rollback_eligible: bool   # Can we roll back to this version?

class ModelVersionRegistry:
    def __init__(self):
        self.versions = {}      # version_id -> ModelVersion
        self.active_version = None
        self.standby_version = None
        self.history = []       # Ordered list of past active versions

    def register(self, version: ModelVersion):
        self.versions[version.version_id] = version

    def promote(self, version_id: str):
        """Make a version the active serving version."""
        old_active = self.active_version
        new_version = self.versions[version_id]

        if new_version.status != "ready":
            raise ValueError(f"Version {version_id} not ready")

        # Demote current active to standby
        if old_active:
            old_active.status = "standby"
            self.standby_version = old_active

        new_version.status = "active"
        self.active_version = new_version
        self.history.append(version_id)

    def rollback(self) -> str:
        """Rollback to the standby version."""
        if self.standby_version is None:
            raise ValueError("No standby version available")

        current = self.active_version
        standby = self.standby_version

        current.status = "standby"
        standby.status = "active"

        self.active_version = standby
        self.standby_version = current

        return standby.version_id
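To make the promote/rollback semantics concrete, here is a self-contained walk-through that mirrors the registry above with stripped-down classes (`MV` and `Registry` are illustrative stand-ins, not Dynamo APIs):

```python
# Minimal sketch of the promote/rollback state machine above,
# keeping only the fields that drive the transitions.
from dataclasses import dataclass

@dataclass
class MV:  # stand-in for ModelVersion
    version_id: str
    status: str = "ready"

class Registry:
    def __init__(self):
        self.versions, self.history = {}, []
        self.active_version = self.standby_version = None

    def register(self, v):
        self.versions[v.version_id] = v

    def promote(self, version_id):
        v = self.versions[version_id]
        if v.status != "ready":
            raise ValueError(f"{version_id} not ready")
        if self.active_version:                 # demote current active
            self.active_version.status = "standby"
            self.standby_version = self.active_version
        v.status = "active"
        self.active_version = v
        self.history.append(version_id)

    def rollback(self):
        if self.standby_version is None:
            raise ValueError("no standby")
        cur, sby = self.active_version, self.standby_version
        cur.status, sby.status = "standby", "active"
        self.active_version, self.standby_version = sby, cur
        return sby.version_id

reg = Registry()
reg.register(MV("v2.3.0"))
reg.register(MV("v2.3.1"))
reg.promote("v2.3.0")                    # initial deploy
reg.promote("v2.3.1")                    # blue-green promote: v2.3.0 -> standby
assert reg.rollback() == "v2.3.0"        # instant revert
assert reg.standby_version.version_id == "v2.3.1"
```

Note that a rollback is just a pointer swap between `active_version` and `standby_version`; nothing is reloaded.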

Blue-Green Architecture

Blue-green deployment requires two complete sets of model weights in GPU memory simultaneously:

Blue Environment (Active — serving traffic):
  GPU 0-3: Llama 70B v2.3.0, TP=4
  Status: active
  KV Cache: 37.5 GB per GPU

Green Environment (Standby — loaded, idle):
  GPU 4-7: Llama 70B v2.3.1, TP=4
  Status: standby
  KV Cache: 37.5 GB per GPU

Router: sends all requests to Blue

Swap: Router switches to Green in < 100ms
      Blue becomes standby
      Green becomes active

class BlueGreenDeployer:
    def __init__(self, cluster, registry, required_gpus: int):
        self.cluster = cluster
        self.registry = registry            # ModelVersionRegistry
        self.required_gpus = required_gpus  # GPUs needed per pool
        self.blue_pool = None   # Currently serving
        self.green_pool = None  # Standby / loading

    async def deploy_new_version(self, model_path: str,
                                   version_id: str):
        """Load new version into green pool."""
        # Step 1: Identify available GPUs for green
        green_gpus = self.cluster.get_standby_gpus()
        if len(green_gpus) < self.required_gpus:
            raise InsufficientResourcesError(
                f"Need {self.required_gpus} GPUs, "
                f"have {len(green_gpus)} standby"
            )

        # Step 2: Load model into green pool (slow — 45-120s)
        self.green_pool = await self.cluster.create_worker_pool(
            gpus=green_gpus,
            model_path=model_path,
            version_id=version_id
        )

        # Step 3: Warm up green pool
        await self.warmup(self.green_pool)

        # Step 4: Mark green as ready
        version = self.registry.versions[version_id]
        version.status = "ready"
        version.gpu_assignment = green_gpus

    async def swap(self):
        """Atomic swap: green becomes active, blue becomes standby."""
        if self.green_pool is None:
            raise ValueError("No green pool ready")

        # Atomic router update
        await self.cluster.router.update_target(
            self.green_pool.endpoint
        )

        # Swap references
        old_blue = self.blue_pool
        self.blue_pool = self.green_pool
        self.green_pool = old_blue

        # Update registry
        self.registry.promote(self.blue_pool.version_id)
⚠️ Warning

Blue-green deployment requires double the GPU resources. For Llama 70B with TP=4, you need 8 A100-80GB GPUs — 4 for active serving and 4 for the standby version. This is the cost of zero-downtime deployment. If GPU budget is constrained, use rolling deployment (covered below) instead.
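Back-of-envelope math behind the 37.5 GB KV-cache figure in the diagram above (assuming FP16 weights at 2 bytes/param and roughly 7.5 GB of runtime overhead per GPU; exact overhead varies by engine):

```python
# Rough per-GPU memory budget for Llama 70B, TP=4 on A100-80GB.
params = 70e9
weight_gb = params * 2 / 1e9            # FP16: 140 GB of weights total
tp = 4
weights_per_gpu = weight_gb / tp        # 35 GB per GPU
hbm = 80.0                              # A100-80GB capacity
overhead = 7.5                          # assumed: CUDA context, activations, etc.
kv_cache_per_gpu = hbm - weights_per_gpu - overhead
print(f"{kv_cache_per_gpu:.1f} GB KV cache per GPU")  # 37.5 GB
```

Doubling this footprint for the green environment is exactly what buys the sub-millisecond rollback later in this article.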

The Swap Protocol

The swap must be atomic from the client’s perspective — no requests should be lost or receive mixed responses from two model versions:

import asyncio
import time

class AtomicSwapProtocol:
    async def execute_swap(self, router, old_pool, new_pool):
        """Execute blue-green swap without dropping requests."""

        # Phase 1: Drain — stop sending NEW requests to old pool
        # In-flight requests on old pool continue to completion
        router.set_drain_mode(old_pool)

        # Phase 2: Wait for in-flight requests to complete
        # With timeout to prevent infinite wait
        drain_timeout = 30.0  # seconds
        start = time.time()
        while old_pool.in_flight_count > 0:
            if time.time() - start > drain_timeout:
                # Force-abort remaining requests
                old_pool.abort_all()
                break
            await asyncio.sleep(0.1)

        # Phase 3: Atomic route switch
        router.set_active(new_pool)

        # Phase 4: Old pool enters standby
        old_pool.set_standby()

        # Total swap time: drain time + route update (~100ms)
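The drain phase can be exercised against a toy pool. Everything here (`FakePool`, its 50 ms completion rate, the 1 s timeout) is invented for illustration:

```python
import asyncio
import time

class FakePool:
    """Toy worker pool: one in-flight request finishes every 50 ms."""
    def __init__(self, in_flight: int):
        self.in_flight_count = in_flight
        self.aborted = False

    async def tick(self):
        while self.in_flight_count > 0:
            await asyncio.sleep(0.05)
            self.in_flight_count -= 1

    def abort_all(self):
        self.aborted = True
        self.in_flight_count = 0

async def drain(pool: FakePool, timeout: float = 1.0) -> float:
    """Phase 2 of the swap protocol: wait for in-flight to hit zero."""
    asyncio.create_task(pool.tick())
    start = time.monotonic()
    while pool.in_flight_count > 0:
        if time.monotonic() - start > timeout:
            pool.abort_all()          # force-abort stragglers
            break
        await asyncio.sleep(0.01)
    return time.monotonic() - start

elapsed = asyncio.run(drain(FakePool(5)))
print(f"drained in {elapsed * 1000:.0f} ms")  # ~250 ms, no abort needed
```

With five in-flight requests completing at 50 ms intervals the drain finishes naturally; raise the count past the timeout budget and `abort_all` fires instead.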

Swap Timing Analysis

📊

Swap Phase Durations

| Phase                       | Duration | Requests Affected   | Client Impact        |
|-----------------------------|----------|---------------------|----------------------|
| Drain (new requests queued) | 0-30 s   | Queued, not dropped | Higher TTFT          |
| Route update                | < 100 ms | 0 (atomic)          | None                 |
| Standby transition          | < 1 s    | 0                   | None                 |
| Total                       | < 31 s   | Queued during drain | Temporary TTFT spike |

During the drain phase, new requests are queued at the router. Once the green pool goes active, queued requests are dispatched immediately. The maximum additional latency for a request arriving during drain is the drain duration.
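The extra latency is easy to bound: a request arriving t seconds into a D-second drain waits D − t before dispatch, so the worst case is D and the mean over uniformly distributed arrivals is D/2:

```python
# Extra TTFT for requests that arrive during an active drain window.
drain_s = 30.0
worst_case = drain_s        # request arrives at the instant drain starts
mean_extra = drain_s / 2    # uniform arrivals over the window
print(worst_case, mean_extra)  # 30.0 15.0
```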

Canary Rollout

Canary deployment sends a fraction of traffic to the new version before full promotion:

class CanaryRollout:
    def __init__(self, router, registry):
        self.router = router
        self.registry = registry
        self.canary_percentage = 0
        self.metrics_window = 300  # 5 minutes per stage

    async def start_canary(self, new_version_id: str,
                            stages: list = None):
        """Gradual traffic shift with automated rollback."""
        if stages is None:
            stages = [1, 5, 10, 25, 50, 100]  # percentage stages

        for pct in stages:
            self.canary_percentage = pct
            self.router.set_traffic_split(
                primary=self.registry.active_version.version_id,
                canary=new_version_id,
                canary_pct=pct
            )

            # Monitor for metrics_window duration
            metrics = await self.monitor(
                duration=self.metrics_window,
                version_id=new_version_id
            )

            # Check health gates
            if not self.check_health_gates(metrics):
                await self.rollback_canary(new_version_id)
                return False

        # All stages passed — promote
        self.registry.promote(new_version_id)
        return True

    def check_health_gates(self, metrics: dict) -> bool:
        """Automated health check for canary."""
        gates = {
            "p99_ttft_ms": 500,      # Max acceptable p99 TTFT
            "p99_tpot_ms": 50,       # Max acceptable p99 TPOT
            "error_rate": 0.001,     # Max 0.1% error rate
            "throughput_ratio": 0.9, # At least 90% of baseline throughput
        }

        if metrics["p99_ttft_ms"] > gates["p99_ttft_ms"]:
            return False
        if metrics["p99_tpot_ms"] > gates["p99_tpot_ms"]:
            return False
        if metrics["error_rate"] > gates["error_rate"]:
            return False
        if metrics["throughput_ratio"] < gates["throughput_ratio"]:
            return False

        return True

    async def rollback_canary(self, version_id: str):
        """Remove canary and send all traffic back to primary."""
        self.router.set_traffic_split(
            primary=self.registry.active_version.version_id,
            canary=None,
            canary_pct=0
        )
        self.canary_percentage = 0
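The health-gate logic can be tried standalone. The thresholds below are copied from the class above; the two metric snapshots are made up for illustration:

```python
# Standalone version of the canary health gates, applied to sample metrics.
GATES = {
    "p99_ttft_ms": 500,      # max acceptable p99 TTFT
    "p99_tpot_ms": 50,       # max acceptable p99 TPOT
    "error_rate": 0.001,     # max 0.1% error rate
    "throughput_ratio": 0.9, # at least 90% of baseline throughput
}

def check_health_gates(metrics: dict) -> bool:
    return (
        metrics["p99_ttft_ms"] <= GATES["p99_ttft_ms"]
        and metrics["p99_tpot_ms"] <= GATES["p99_tpot_ms"]
        and metrics["error_rate"] <= GATES["error_rate"]
        and metrics["throughput_ratio"] >= GATES["throughput_ratio"]
    )

healthy = {"p99_ttft_ms": 320, "p99_tpot_ms": 41,
           "error_rate": 0.0004, "throughput_ratio": 0.97}
regressed = dict(healthy, p99_tpot_ms=63)   # TPOT regression on canary

assert check_health_gates(healthy)
assert not check_health_gates(regressed)    # would trigger rollback_canary
```

A single failed gate at any stage aborts the whole rollout; there is no partial credit, which keeps the decision logic trivially auditable.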

Canary Traffic Percentage Over Time

Canary traffic ramps in six stages, each held for 5 minutes: 1% → 5% → 10% → 25% → 50% → 100% (full rollout).

Rolling Deployment (Resource-Efficient)

When GPU budget does not allow blue-green (double resources), rolling deployment updates workers one at a time:

class RollingDeployer:
    def __init__(self, cluster, tp_size: int):
        self.cluster = cluster
        self.tp_size = tp_size

    async def rolling_update(self, new_model_path: str,
                              new_version_id: str):
        """Update workers one TP group at a time."""
        tp_groups = self.cluster.get_tp_groups()

        for group in tp_groups:
            # Step 1: Remove this TP group from the serving pool
            self.cluster.router.remove_backend(group.endpoint)

            # Step 2: Drain in-flight requests on this group
            await group.drain(timeout=30.0)

            # Step 3: Unload old model
            await group.unload_model()

            # Step 4: Load new model (GPU memory freed by unload)
            await group.load_model(new_model_path, new_version_id)

            # Step 5: Warmup
            await group.warmup()

            # Step 6: Add back to serving pool
            self.cluster.router.add_backend(group.endpoint)

            # During steps 2-6, serving capacity is reduced
            # by 1/num_tp_groups
ℹ️ Note

Rolling deployment provides zero-downtime updates with no extra GPUs, but serving capacity drops by 1/N during each worker update, where N is the number of TP groups. For 4 TP groups, capacity drops by 25% for the duration of each group's update (45-120 seconds). Total update time is N × (drain + load + warmup).
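Plugging assumed per-group timings (30 s drain, 90 s load, 10 s warmup — illustrative, not measured) into that formula:

```python
# Total rolling-update time: N * (drain + load + warmup).
n_groups = 4
drain, load, warmup = 30, 90, 10   # assumed per-group timings, seconds
per_group = drain + load + warmup  # 130 s with capacity down 25%
total = n_groups * per_group       # 520 s end to end
print(per_group, total)  # 130 520
```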

📊

Deployment Strategy Comparison — Llama 70B, 4xA100 TP=4

| Strategy           | Extra GPUs | Downtime | Max Capacity Drop | Rollback Time | Total Deploy Time |
|--------------------|------------|----------|-------------------|---------------|-------------------|
| Blue-Green         | 4 (100%)   | 0        | 0%                | < 1 s         | ~120 s            |
| Canary             | 4 (100%)   | 0        | 0%                | < 1 s         | ~30 min           |
| Rolling (2 groups) | 0          | 0        | 50%               | ~120 s        | ~240 s            |
| Rolling (4 groups) | 0          | 0        | 25%               | ~120 s        | ~480 s            |
| Stop-Start         | 0          | 120 s    | 100%              | ~120 s        | ~120 s            |

Instant Rollback Mechanism

The key feature of blue-green is sub-second rollback. The standby version remains loaded in GPU memory:

class RollbackController:
    def __init__(self, registry, router, deployer):
        self.registry = registry
        self.router = router
        self.deployer = deployer

    async def instant_rollback(self) -> dict:
        """Rollback to standby version. Sub-second operation."""
        standby = self.registry.standby_version
        if standby is None or standby.status != "standby":
            raise RollbackError("No standby version available")

        start_time = time.time()

        # Step 1: Route traffic to standby pool
        self.router.set_active(
            self.deployer.green_pool  # Former blue, now standby
        )

        # Step 2: Update registry
        rollback_version = self.registry.rollback()

        elapsed = time.time() - start_time

        return {
            "rolled_back_to": rollback_version,
            "elapsed_ms": elapsed * 1000,
            "method": "instant_route_switch"
        }

    async def monitored_rollback(self, threshold_error_rate: float):
        """Automated rollback triggered by error rate."""
        while True:
            metrics = await self.get_current_metrics()

            if metrics["error_rate"] > threshold_error_rate:
                result = await self.instant_rollback()
                await self.alert(
                    f"Auto-rollback triggered. "
                    f"Error rate {metrics['error_rate']:.4f} "
                    f"exceeded threshold {threshold_error_rate}. "
                    f"Rolled back to {result['rolled_back_to']} "
                    f"in {result['elapsed_ms']:.1f}ms."
                )
                return result

            await asyncio.sleep(5.0)  # Check every 5 seconds

Rollback Speed Analysis

# What happens during instant rollback:
# Router updates target endpoint (in-memory pointer swap): ~0.01 ms
# New requests go to standby pool: immediate
# In-flight requests on old active continue to completion
#    (no abort needed — they finish naturally)

# Total rollback latency: < 1 ms for the route switch
# First request on rollback version: < 50 ms (already warm)
# In-flight requests on bad version: drain over 0-30 seconds

Rollback Latency by Strategy

Rollback latency by strategy: blue-green instant switch ~1 ms; canary abort ~500 ms; rolling revert ~120,000 ms (full reload); stop-start reload ~120,000 ms.

Weight Storage and Caching

Fast model loading requires efficient weight storage:

import os
import shutil

class WeightCache:
    def __init__(self, cache_dir: str, max_versions: int = 3):
        self.cache_dir = cache_dir
        self.max_versions = max_versions

    def list_cached(self) -> list:
        """Cached version IDs, oldest first (by modification time)."""
        entries = os.listdir(self.cache_dir)
        return sorted(entries, key=lambda v: os.path.getmtime(
            os.path.join(self.cache_dir, v)))

    def cache_weights(self, version_id: str, model_path: str):
        """Pre-download and cache model weights locally."""
        cache_path = os.path.join(self.cache_dir, version_id)
        if os.path.exists(cache_path):
            return  # Already cached

        # Download from model registry (S3, GCS, etc.)
        download_model(model_path, cache_path)

        # Evict oldest cached versions while over limit
        cached = self.list_cached()
        while len(cached) > self.max_versions:
            oldest = cached.pop(0)
            shutil.rmtree(os.path.join(self.cache_dir, oldest))

    def get_local_path(self, version_id: str) -> str:
        path = os.path.join(self.cache_dir, version_id)
        if not os.path.exists(path):
            raise CacheMissError(f"Version {version_id} not cached")
        return path

📊

Model Load Time by Storage Medium — Llama 70B FP16

| Source                    | Read BW  | Load Time | Includes Repack |
|---------------------------|----------|-----------|-----------------|
| Local NVMe SSD            | 7 GB/s   | 20 s      | No              |
| Local NVMe + GPTQ repack  | 7 GB/s   | 24 s      | Yes             |
| NFS (10 GbE)              | 1.1 GB/s | 127 s     | No              |
| S3 (single stream)        | 0.5 GB/s | 280 s     | No              |
| S3 (parallel, 32 streams) | 4 GB/s   | 35 s      | No              |
| GPU-to-GPU (NVLink)       | 300 GB/s | 0.5 s     | No              |

The last row shows why keeping the standby version in GPU memory is critical: reloading from any storage medium takes 20-280 seconds, while an in-memory rollback takes under 1 millisecond.
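The table's load times fall straight out of weight size divided by read bandwidth (140 GB for Llama 70B FP16; these are lower bounds that ignore repacking):

```python
# Load time = weight bytes / read bandwidth, Llama 70B FP16 (~140 GB).
weights_gb = 140
for source, bw_gbps in [("NVMe", 7), ("NFS 10GbE", 1.1),
                        ("S3 x32 streams", 4), ("NVLink", 300)]:
    print(f"{source:>15}: {weights_gb / bw_gbps:6.1f} s")
# NVMe 20.0 s, NFS 127.3 s, S3 parallel 35.0 s, NVLink 0.5 s
```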

Version Compatibility

When switching between model versions, the KV cache from the old version is incompatible with the new version:

class VersionCompatibilityChecker:
    def check_kv_cache_compatible(self, old_version, new_version) -> bool:
        """Check if KV cache can be reused across versions."""
        # KV cache is compatible ONLY if:
        # 1. Same architecture (num_layers, num_heads, head_dim)
        # 2. Same quantization method
        # 3. Same max_model_len and block_size
        # In practice, this means same model with different LoRA adapters
        # or same model with only output head changes

        return (
            old_version.architecture == new_version.architecture and
            old_version.quantization == new_version.quantization and
            old_version.num_layers == new_version.num_layers and
            old_version.kv_heads == new_version.kv_heads and
            old_version.head_dim == new_version.head_dim
        )
🚨 Danger

Model version swaps invalidate ALL KV cache. Every in-flight request must either complete before the swap or be aborted and restarted. For sessions with long context (32K+ tokens), recomputing the KV cache costs 200-800ms of prefill time. The drain phase of the swap protocol exists specifically to let these requests finish naturally.
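To see how much state a swap throws away, here is the KV-cache footprint for a Llama-2-70B-style config (80 layers, 8 KV heads via GQA, head_dim 128, FP16 — assumed values; check your model's config.json):

```python
# KV bytes per token = 2 (K and V) * layers * kv_heads * head_dim * dtype_bytes
layers, kv_heads, head_dim, dtype_bytes = 80, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
ctx = 32_768
total_gb = per_token * ctx / 1e9
print(f"{per_token / 1024:.0f} KiB/token, {total_gb:.1f} GB at 32K context")
# 320 KiB/token, 10.7 GB at 32K context
```

Roughly 10 GB of cache per long-context session is discarded at swap time, which is why the drain phase prefers letting such requests finish over aborting and recomputing them.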

Deployment Pipeline Integration

# deployment.yaml — CI/CD integration
pipeline:
  stages:
    - name: validate
      steps:
        - run: python validate_model.py --model $MODEL_PATH
        - run: python check_quant_compatibility.py --model $MODEL_PATH

    - name: cache_weights
      steps:
        - run: dynamo cache download --model $MODEL_PATH --version $VERSION

    - name: load_green
      steps:
        - run: dynamo deploy green --model $MODEL_PATH --version $VERSION
        - wait_for: green_ready

    - name: canary
      steps:
        - run: dynamo canary start --version $VERSION --stages 1,5,25,50,100
        - monitor: health_gates
        - on_failure: dynamo rollback --instant

    - name: promote
      steps:
        - run: dynamo swap --confirm
        - run: dynamo notify --channel ops --message "Promoted $VERSION"

  rollback:
    trigger: error_rate > 0.001 OR p99_latency > 500ms
    action: dynamo rollback --instant
    notify: pagerduty

# CLI commands for operations
dynamo version list
# VERSION    STATUS    LOADED_AT           GPU_ASSIGNMENT
# v2.3.0     standby   2025-03-22 08:00    GPU 4-7
# v2.3.1     active    2025-03-22 10:30    GPU 0-3

dynamo rollback --instant
# Rolled back to v2.3.0 in 0.8ms
# Traffic now serving from GPU 4-7

dynamo version retire v2.2.0
# Retired v2.2.0, freed 0 GPUs (was not loaded)

Cost of Safety

📊

Resource Cost of Deployment Safety Strategies

| Strategy                 | GPU Overhead | Monthly Cost (4xA100) | Rollback SLA | Risk Level |
|--------------------------|--------------|-----------------------|--------------|------------|
| None (stop-start)        | 0%           | $0                    | 2 min        | High       |
| Blue-Green               | 100%         | $8,640                | < 1 ms       | Very Low   |
| Canary + Blue-Green      | 100%         | $8,640                | < 1 ms       | Lowest     |
| Rolling                  | 0%           | $0                    | 2 min        | Medium     |
| Blue-Green (time-shared) | 50%          | $4,320                | 20 s         | Low        |

The time-shared variant loads the standby version onto the same GPUs but keeps weights in CPU memory, swapping to GPU on rollback. This takes approximately 20 seconds instead of sub-millisecond but halves the GPU cost.

Summary

NVIDIA Dynamo’s model versioning system provides zero-downtime deployment through blue-green architecture with instant rollback. The standby model version remains loaded in GPU memory, enabling sub-millisecond route switches when issues are detected. Canary rollout progressively shifts traffic (1% → 5% → 10% → 25% → 50% → 100%) with automated health gate checking at each stage. Rolling deployment offers a resource-efficient alternative when double GPU allocation is not feasible, at the cost of 25-50% temporary capacity reduction and 2-minute rollback time. KV cache invalidation is the primary constraint during version swaps — all in-flight requests must drain before the swap completes. For production deployments serving critical traffic, blue-green with canary provides the safest path: automated rollback triggers on error rate or latency regression, with recovery in under 1 millisecond.