Part 22 of 30 in the NVIDIA Dynamo & llm-d series.

Dynamo was designed for cloud datacenters: reliable networking, homogeneous GPU nodes, centralized orchestration. Production deployments increasingly require on-premise and hybrid configurations. A hospital running medical AI needs inference on local hardware for HIPAA compliance. A factory running quality inspection AI needs low-latency inference without cloud round-trips. A defense contractor needs air-gapped inference with no external connectivity.

Extending Dynamo to edge and hybrid deployments introduces three challenges that cloud deployments avoid. First, network connectivity between edge nodes and cloud is intermittent or high-latency (50-200ms WAN vs 0.1ms LAN). Second, hardware is heterogeneous (a mix of A100, L40S, T4, and even Jetson devices). Third, the control plane must function during disconnection — edge nodes cannot depend on a cloud-hosted scheduler that they cannot reach.

This post covers the architecture for extending Dynamo to hybrid and edge deployments: local orchestration, model sharding across heterogeneous hardware, latency-aware routing, offline operation, and synchronization protocols.

Hybrid Architecture

Edge-Cloud Topology

from dataclasses import dataclass, field
from enum import Enum
import time

class NodeLocation(Enum):
    CLOUD = "cloud"
    EDGE_DATACENTER = "edge_datacenter"
    ON_PREMISE = "on_premise"
    FIELD_DEVICE = "field_device"

class ConnectivityState(Enum):
    CONNECTED = "connected"
    DEGRADED = "degraded"
    DISCONNECTED = "disconnected"

@dataclass
class EdgeNode:
    """An edge inference node."""
    node_id: str
    location: NodeLocation
    gpu_type: str
    gpu_count: int
    gpu_vram_gb: float
    cpu_cores: int
    ram_gb: int
    connectivity: ConnectivityState
    latency_to_cloud_ms: float
    bandwidth_to_cloud_mbps: float
    models_loaded: list = field(default_factory=list)
    last_heartbeat: float = 0.0

class HybridDynamoArchitecture:
    """
    Hybrid Dynamo deployment with cloud and edge tiers.

    Architecture:
    - Tier 1 (Cloud): Full Dynamo cluster with centralized
      planner, KV-aware routing, and autoscaling.
      Handles overflow and complex requests.
    - Tier 2 (Edge Datacenter): Local Dynamo instance
      with reduced planner. Handles most requests locally.
      Syncs with cloud when connected.
    - Tier 3 (On-Premise): Single-node or small cluster.
      Runs independently with local routing.
      Syncs configuration periodically.
    - Tier 4 (Field Device): Jetson/embedded with
      quantized small models. Fully autonomous.

    Request routing priority:
    1. Route to local edge if model available and
       latency requirement met
    2. Route to nearest edge datacenter if local cannot serve
    3. Route to cloud as fallback
    """

    def __init__(self, config):
        self.cloud_endpoint = config.get("cloud_endpoint")
        self.edge_nodes = {}
        # LocalPlanner and SyncManager are assumed to be
        # defined separately; only the routing logic is shown here.
        self.local_planner = LocalPlanner(config)
        self.sync_manager = SyncManager(config)

    def register_edge_node(self, node):
        """Register an edge node with the hybrid orchestrator."""
        self.edge_nodes[node.node_id] = node

    def route_request(self, request):
        """
        Route a request to the best available tier.

        Decision factors:
        1. Latency requirement (strict SLA)
        2. Model availability (is the model loaded locally?)
        3. Current load on each tier
        4. Network connectivity state
        """
        latency_budget_ms = request.get(
            "max_latency_ms", 1000
        )
        required_model = request.get("model")

        # Try local edge nodes first
        local_candidates = self._find_local_candidates(
            required_model, latency_budget_ms
        )

        if local_candidates:
            # Simple tie-break: prefer the node with the best link
            # to the cloud. A production router would also weigh
            # current load and queue depth.
            best = min(
                local_candidates,
                key=lambda n: n.latency_to_cloud_ms,
            )
            return {
                "tier": "edge",
                "node_id": best.node_id,
                "estimated_latency_ms": 10,
                "reason": "Local node has model loaded",
            }

        # Try edge datacenter
        edge_dc = self._find_edge_datacenter(
            required_model, latency_budget_ms
        )

        if edge_dc:
            return {
                "tier": "edge_datacenter",
                "node_id": edge_dc.node_id,
                "estimated_latency_ms": edge_dc.latency_to_cloud_ms,
                "reason": "Edge DC available within SLA",
            }

        # Fall back to cloud
        cloud_latency = self._estimate_cloud_latency()
        if cloud_latency <= latency_budget_ms:
            return {
                "tier": "cloud",
                "endpoint": self.cloud_endpoint,
                "estimated_latency_ms": cloud_latency,
                "reason": "Cloud fallback within SLA",
            }

        # Cannot meet SLA
        return {
            "tier": "none",
            "reason": (
                f"No tier can meet {latency_budget_ms}ms SLA"
            ),
        }

    def _find_local_candidates(self, model, latency_ms):
        """Find local nodes that have the model loaded."""
        candidates = []
        for node in self.edge_nodes.values():
            if (
                model in node.models_loaded
                and node.connectivity != ConnectivityState.DISCONNECTED
                and self._is_healthy(node)
            ):
                candidates.append(node)
        return candidates

    def _find_edge_datacenter(self, model, latency_ms):
        """Find an edge datacenter within latency budget."""
        for node in self.edge_nodes.values():
            if (
                node.location == NodeLocation.EDGE_DATACENTER
                and model in node.models_loaded
                and node.latency_to_cloud_ms < latency_ms * 0.5
            ):
                return node
        return None

    def _estimate_cloud_latency(self):
        """Estimate latency to cloud endpoint."""
        return 100.0  # Placeholder

    def _is_healthy(self, node):
        """Check if node has sent recent heartbeat."""
        return (time.time() - node.last_heartbeat) < 60
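The tiered fallback policy above can be distilled into a few lines. The sketch below is self-contained and uses hypothetical per-tier latency constants in place of live telemetry and node registries; it is an illustration of the decision order, not the full router.

```python
# Minimal sketch of the edge -> edge DC -> cloud fallback policy.
# The latency constants are illustrative assumptions, not measurements.
def route(model: str, latency_budget_ms: float,
          edge_models: set, edge_dc_models: set) -> str:
    """Return the first tier that can serve within the latency budget."""
    EDGE_LATENCY_MS = 10       # assumed local round-trip
    EDGE_DC_LATENCY_MS = 40    # assumed metro-area round-trip
    CLOUD_LATENCY_MS = 150     # assumed WAN round-trip

    if model in edge_models and EDGE_LATENCY_MS <= latency_budget_ms:
        return "edge"
    if model in edge_dc_models and EDGE_DC_LATENCY_MS <= latency_budget_ms:
        return "edge_datacenter"
    if CLOUD_LATENCY_MS <= latency_budget_ms:
        return "cloud"
    return "none"  # no tier can meet the SLA
```

Note that the decision is purely ordinal: the first tier that holds the model and fits the budget wins, which is why pre-loading models at the edge matters so much.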
⚠️ Warning

Edge deployments must handle the “split brain” problem: when an edge node loses connectivity to the cloud, both the edge and cloud may continue serving requests with diverging model versions or KV cache states. The resolution protocol must be defined before deployment: does the edge version take precedence? Does the cloud version? Is there a merge strategy?
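One way to make that resolution protocol concrete is a pluggable precedence policy evaluated at reconnect time. The sketch below is a hypothetical illustration (not a Dynamo API): each side carries a version string and an update timestamp, and the policy picks the winner.

```python
# Hypothetical reconciliation policy for diverged model versions
# after a network partition. Records look like:
#   {"version": "v3", "updated_at": 1700000000.0}
def resolve_version(edge: dict, cloud: dict,
                    policy: str = "cloud_wins") -> dict:
    """Pick the winning model-version record after reconnect."""
    if edge["version"] == cloud["version"]:
        return cloud                      # no divergence to resolve
    if policy == "cloud_wins":
        return cloud                      # cloud is the source of truth
    if policy == "edge_wins":
        return edge                       # air-gapped sites may pin local
    if policy == "newest_wins":
        return max(edge, cloud, key=lambda r: r["updated_at"])
    raise ValueError(f"unknown policy: {policy}")
```

Whatever the policy, it must be chosen and tested before the first disconnection, not improvised during an incident.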

Heterogeneous Hardware Support

Sharding Across Mixed GPUs

class HeterogeneousShardPlanner:
    """
    Plan model sharding across mixed GPU hardware.

    Challenge: standard tensor parallelism assumes
    homogeneous GPUs. An A100 (80GB) and a T4 (16GB)
    cannot each hold half a 70B model. The A100 must
    hold more layers, and the T4 handles fewer layers.

    Solution: pipeline parallelism with uneven stage
    assignment. Assign layers proportional to GPU memory
    and compute capability.
    """

    GPU_SPECS = {
        "H100": {
            "vram_gb": 80,
            "tflops_fp16": 989,
            "bandwidth_gbs": 3350,
        },
        "A100": {
            "vram_gb": 80,
            "tflops_fp16": 312,
            "bandwidth_gbs": 2039,
        },
        "L40S": {
            "vram_gb": 48,
            "tflops_fp16": 362,
            "bandwidth_gbs": 864,
        },
        "A10G": {
            "vram_gb": 24,
            "tflops_fp16": 125,
            "bandwidth_gbs": 600,
        },
        "T4": {
            "vram_gb": 16,
            "tflops_fp16": 65,
            "bandwidth_gbs": 300,
        },
        "Jetson_Orin": {
            "vram_gb": 32,  # Shared memory
            "tflops_fp16": 67,
            "bandwidth_gbs": 204,
        },
    }

    def plan_sharding(self, model_config, available_gpus):
        """
        Plan how to shard a model across heterogeneous GPUs.

        model_config: {n_layers, hidden_dim, n_heads, ...}
        available_gpus: list of GPU types
        """
        n_layers = model_config["n_layers"]
        model_size_gb = model_config["size_gb"]

        # Calculate per-layer memory
        per_layer_gb = model_size_gb / n_layers

        # Sort GPUs by memory (largest first)
        gpu_specs = [
            (gpu, self.GPU_SPECS.get(gpu, {}))
            for gpu in available_gpus
        ]
        gpu_specs.sort(
            key=lambda x: x[1].get("vram_gb", 0),
            reverse=True,
        )

        # Assign layers proportional to available memory
        total_usable_memory = sum(
            spec.get("vram_gb", 0) * 0.85  # 85% utilization
            for _, spec in gpu_specs
        )

        if total_usable_memory < model_size_gb:
            return {
                "feasible": False,
                "error": (
                    f"Need {model_size_gb:.1f} GB, "
                    f"have {total_usable_memory:.1f} GB"
                ),
            }

        assignments = []
        remaining_layers = n_layers

        for gpu_name, spec in gpu_specs:
            usable = spec.get("vram_gb", 0) * 0.85
            layers_for_gpu = min(
                int(usable / per_layer_gb),
                remaining_layers,
            )

            assignments.append({
                "gpu": gpu_name,
                "layers": layers_for_gpu,
                "memory_used_gb": layers_for_gpu * per_layer_gb,
                "memory_available_gb": spec.get("vram_gb", 0),
                "compute_tflops": spec.get("tflops_fp16", 0),
            })

            remaining_layers -= layers_for_gpu
            if remaining_layers <= 0:
                break

        # Check for bottleneck (slowest GPU determines throughput)
        if assignments:
            # Stage time scales with layers assigned and
            # inversely with compute capability
            times = [
                a["layers"] / max(a["compute_tflops"], 1)
                for a in assignments
            ]
            bottleneck_idx = max(
                range(len(times)), key=times.__getitem__
            )
            bottleneck = assignments[bottleneck_idx]
        else:
            bottleneck = None

        return {
            "feasible": remaining_layers <= 0,
            "assignments": assignments,
            "bottleneck": bottleneck,
            "total_layers": n_layers,
            "assigned_layers": n_layers - remaining_layers,
        }
Heterogeneous Sharding: Llama 3.1 70B Across Mixed GPUs

| Configuration | GPU Setup | Throughput (tok/s) | TTFT (ms) | Memory Used | Notes |
|---|---|---|---|---|---|
| Homogeneous baseline | 4x A100 80GB | 2400 | 45 | 280 GB | Standard TP=4 |
| Mixed high-end | 2x A100 + 2x L40S | 1800 | 65 | 256 GB | Pipeline parallel, L40S is bottleneck |
| Mixed with consumer | 1x A100 + 3x A10G | 900 | 120 | 152 GB | A10G limits throughput |
| All edge | 4x L40S 48GB | 1600 | 55 | 192 GB | Viable for edge datacenter |
| Quantized edge | 2x A10G (Q4_K_M) | 600 | 90 | 40 GB | Quantized fits 2 GPUs |

Offline Mode and Synchronization

Operating Without Cloud Connectivity

class OfflineOperationManager:
    """
    Manage edge node operation during cloud disconnection.

    Offline capabilities:
    1. Continue serving with locally loaded models
    2. Queue configuration updates for sync
    3. Buffer telemetry data for upload
    4. Local model versioning and rollback
    5. Degraded routing (local-only, no cloud fallback)
    """

    def __init__(self, config):
        self.connectivity_state = ConnectivityState.CONNECTED
        self.pending_sync_queue = []
        self.telemetry_buffer = []
        self.local_model_registry = {}
        self.max_offline_hours = config.get(
            "max_offline_hours", 72
        )
        self.offline_start_time = None

    def enter_offline_mode(self):
        """
        Transition to offline mode.

        Actions:
        1. Cache current routing table locally
        2. Switch to local-only routing
        3. Start buffering telemetry
        4. Set offline timer
        """
        self.connectivity_state = (
            ConnectivityState.DISCONNECTED
        )
        self.offline_start_time = time.time()

        return {
            "status": "offline",
            "cached_models": list(
                self.local_model_registry.keys()
            ),
            "max_offline_hours": self.max_offline_hours,
        }

    def check_offline_health(self):
        """
        Check health during offline operation.

        Concerns:
        - Model staleness (running outdated version)
        - KV cache growth (no distributed eviction)
        - Telemetry buffer overflow
        - Certificates and tokens expiring
        """
        if self.offline_start_time is None:
            return {"healthy": True}

        offline_hours = (
            time.time() - self.offline_start_time
        ) / 3600

        issues = []

        if offline_hours > self.max_offline_hours:
            issues.append(
                f"Offline for {offline_hours:.1f}h "
                f"(max: {self.max_offline_hours}h)"
            )

        if len(self.telemetry_buffer) > 100000:
            issues.append(
                f"Telemetry buffer at {len(self.telemetry_buffer)} "
                f"entries (risk of overflow)"
            )

        return {
            "healthy": len(issues) == 0,
            "offline_hours": round(offline_hours, 1),
            "issues": issues,
            "telemetry_buffered": len(self.telemetry_buffer),
        }

    def sync_on_reconnect(self, cloud_client):
        """
        Synchronize state when connectivity is restored.

        Sync protocol:
        1. Upload buffered telemetry
        2. Pull latest model versions
        3. Pull updated routing configuration
        4. Apply queued configuration changes
        5. Resume normal operation
        """
        sync_results = {
            "telemetry_uploaded": 0,
            "models_updated": 0,
            "config_changes_applied": 0,
        }

        # Upload telemetry
        if self.telemetry_buffer:
            cloud_client.upload_telemetry(
                self.telemetry_buffer
            )
            sync_results["telemetry_uploaded"] = len(
                self.telemetry_buffer
            )
            self.telemetry_buffer = []

        # Pull model updates
        cloud_models = cloud_client.get_model_versions()
        for model_id, version in cloud_models.items():
            local_version = (
                self.local_model_registry.get(model_id, {})
                .get("version", "")
            )
            if version != local_version:
                cloud_client.download_model(
                    model_id, version
                )
                self.local_model_registry[model_id] = {
                    "version": version
                }
                sync_results["models_updated"] += 1

        # Apply queued changes
        for change in self.pending_sync_queue:
            cloud_client.apply_change(change)
            sync_results["config_changes_applied"] += 1
        self.pending_sync_queue = []

        # Resume normal operation
        self.connectivity_state = ConnectivityState.CONNECTED
        self.offline_start_time = None

        return sync_results
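The sync protocol above is only safe if a retried upload cannot double-count telemetry. One common way to get that property, sketched here as a hypothetical illustration (not a Dynamo API), is to attach a client-generated ID to every entry and have the receiving side deduplicate:

```python
import uuid

# Hypothetical idempotent telemetry upload: each entry carries a
# client-generated ID so a retried batch cannot double-count.
class TelemetrySink:
    """Stands in for the cloud side of sync_on_reconnect()."""

    def __init__(self):
        self.seen_ids = set()
        self.entries = []

    def upload(self, batch: list) -> int:
        """Accept new entries; silently skip duplicates. Returns count accepted."""
        accepted = 0
        for entry in batch:
            if entry["id"] in self.seen_ids:
                continue  # duplicate from a retried sync
            self.seen_ids.add(entry["id"])
            self.entries.append(entry)
            accepted += 1
        return accepted

def make_entry(metric: str, value: float) -> dict:
    return {"id": str(uuid.uuid4()), "metric": metric, "value": value}
```

With this shape, an edge node that crashes mid-upload can simply resend the entire buffer on the next reconnect: the first upload accepts everything, the retry accepts nothing, and the cloud-side totals stay correct.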

Edge Inference Latency vs Cloud (by Request Type)

| Configuration (latency, ms) | Simple QA | Code Gen | Long Context (32K) | Multi-turn (10 turns) | Streaming (TTFT) |
|---|---|---|---|---|---|
| Edge (local L40S) | 45 | 120 | 350 | 500 | 25 |
| Cloud (A100, 100ms network) | 150 | 220 | 450 | 600 | 130 |
| Hybrid (edge + cloud overflow) | 50 | 130 | 360 | 510 | 30 |

Key Takeaways

Extending Dynamo to edge and hybrid deployments requires architectural changes to handle intermittent connectivity, heterogeneous hardware, and autonomous operation.

The critical findings:

  1. Edge reduces latency 2-3x for simple requests: Local inference eliminates 100-200ms of network round-trip. For streaming use cases (TTFT), edge delivers 25ms vs 130ms from cloud — a 5x improvement that users perceive as “instant.”

  2. Heterogeneous sharding works with pipeline parallelism: Mixed GPU deployments (A100 + L40S + A10G) are viable using uneven pipeline-parallel stage assignment. The bottleneck is always the weakest GPU — plan accordingly.

  3. 72-hour offline operation is practical: Edge nodes can operate for 72 hours without cloud connectivity if models are pre-loaded and telemetry is buffered. Beyond 72 hours, model staleness and certificate expiration become concerns.

  4. Synchronization on reconnect must be idempotent: The sync protocol must handle partial uploads, duplicate telemetry, and version conflicts. Every sync operation should be idempotent (safe to retry) and order-independent.

  5. Quantized models are essential for edge: Full FP16 models require expensive GPU hardware. Q4_K_M quantization retains 95% of quality while fitting Llama 70B on 2x A10G (48GB total) instead of 4x A100 (320GB total) — an 85% hardware cost reduction.