Dynamo was designed for cloud datacenters: reliable networking, homogeneous GPU nodes, centralized orchestration. Production deployments increasingly require on-premise and hybrid configurations. A hospital running medical AI needs inference on local hardware for HIPAA compliance. A factory running quality inspection AI needs low-latency inference without cloud round-trips. A defense contractor needs air-gapped inference with no external connectivity.
Extending Dynamo to edge and hybrid deployments introduces three challenges that cloud deployments avoid. First, network connectivity between edge nodes and cloud is intermittent or high-latency (50-200ms WAN vs 0.1ms LAN). Second, hardware is heterogeneous (a mix of A100, L40S, T4, and even Jetson devices). Third, the control plane must function during disconnection — edge nodes cannot depend on a cloud-hosted scheduler that they cannot reach.
This post covers the architecture for extending Dynamo to hybrid and edge deployments: local orchestration, model sharding across heterogeneous hardware, latency-aware routing, offline operation, and synchronization protocols.
Hybrid Architecture
Edge-Cloud Topology
from dataclasses import dataclass, field
from enum import Enum
import time
class NodeLocation(Enum):
CLOUD = "cloud"
EDGE_DATACENTER = "edge_datacenter"
ON_PREMISE = "on_premise"
FIELD_DEVICE = "field_device"
class ConnectivityState(Enum):
CONNECTED = "connected"
DEGRADED = "degraded"
DISCONNECTED = "disconnected"
@dataclass
class EdgeNode:
"""An edge inference node."""
node_id: str
location: NodeLocation
gpu_type: str
gpu_count: int
gpu_vram_gb: float
cpu_cores: int
ram_gb: int
connectivity: ConnectivityState
latency_to_cloud_ms: float
bandwidth_to_cloud_mbps: float
models_loaded: list = field(default_factory=list)
last_heartbeat: float = 0.0
class HybridDynamoArchitecture:
"""
Hybrid Dynamo deployment with cloud and edge tiers.
Architecture:
- Tier 1 (Cloud): Full Dynamo cluster with centralized
planner, KV-aware routing, and autoscaling.
Handles overflow and complex requests.
- Tier 2 (Edge Datacenter): Local Dynamo instance
with reduced planner. Handles most requests locally.
Syncs with cloud when connected.
- Tier 3 (On-Premise): Single-node or small cluster.
Runs independently with local routing.
Syncs configuration periodically.
- Tier 4 (Field Device): Jetson/embedded with
quantized small models. Fully autonomous.
Request routing priority:
1. Route to local edge if model available and
latency requirement met
2. Route to nearest edge datacenter if local cannot serve
3. Route to cloud as fallback
"""
def __init__(self, config):
self.cloud_endpoint = config.get("cloud_endpoint")
self.edge_nodes = {}
self.local_planner = LocalPlanner(config)
self.sync_manager = SyncManager(config)
    def register_edge_node(self, node):
        """Register an edge node with the hybrid orchestrator."""
        node.last_heartbeat = time.time()  # registration counts as a heartbeat
        self.edge_nodes[node.node_id] = node
def route_request(self, request):
"""
Route a request to the best available tier.
Decision factors:
1. Latency requirement (strict SLA)
2. Model availability (is the model loaded locally?)
3. Current load on each tier
4. Network connectivity state
"""
latency_budget_ms = request.get(
"max_latency_ms", 1000
)
required_model = request.get("model")
# Try local edge nodes first
local_candidates = self._find_local_candidates(
required_model, latency_budget_ms
)
        if local_candidates:
            # Without per-node load metrics, use network position
            # as a simple tie-breaker between local nodes.
            best = min(
                local_candidates,
                key=lambda n: n.latency_to_cloud_ms,
            )
            return {
                "tier": "edge",
                "node_id": best.node_id,
                "estimated_latency_ms": 10,  # LAN-scale round trip
                "reason": "Local node has model loaded",
            }
# Try edge datacenter
edge_dc = self._find_edge_datacenter(
required_model, latency_budget_ms
)
if edge_dc:
return {
"tier": "edge_datacenter",
"node_id": edge_dc.node_id,
"estimated_latency_ms": edge_dc.latency_to_cloud_ms,
"reason": "Edge DC available within SLA",
}
# Fall back to cloud
cloud_latency = self._estimate_cloud_latency()
if cloud_latency <= latency_budget_ms:
return {
"tier": "cloud",
"endpoint": self.cloud_endpoint,
"estimated_latency_ms": cloud_latency,
"reason": "Cloud fallback within SLA",
}
# Cannot meet SLA
return {
"tier": "none",
"reason": (
f"No tier can meet {latency_budget_ms}ms SLA"
),
}
def _find_local_candidates(self, model, latency_ms):
"""Find local nodes that have the model loaded."""
candidates = []
for node in self.edge_nodes.values():
if (
model in node.models_loaded
and node.connectivity != ConnectivityState.DISCONNECTED
and self._is_healthy(node)
):
candidates.append(node)
return candidates
def _find_edge_datacenter(self, model, latency_ms):
"""Find an edge datacenter within latency budget."""
for node in self.edge_nodes.values():
if (
node.location == NodeLocation.EDGE_DATACENTER
and model in node.models_loaded
and node.latency_to_cloud_ms < latency_ms * 0.5
):
return node
return None
    def _estimate_cloud_latency(self):
        """Estimate latency to cloud endpoint."""
        # Placeholder; production would track a rolling RTT
        # average from heartbeat round-trips.
        return 100.0
def _is_healthy(self, node):
"""Check if node has sent recent heartbeat."""
return (time.time() - node.last_heartbeat) < 60
Edge deployments must handle the “split brain” problem: when an edge node loses connectivity to the cloud, both the edge and cloud may continue serving requests with diverging model versions or KV cache states. The resolution protocol must be defined before deployment: does the edge version take precedence? Does the cloud version? Is there a merge strategy?
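One workable pre-agreed policy is "cloud wins on configuration and models, edge wins on locally produced data, higher version wins otherwise." A minimal sketch of such a resolver (key prefixes and entry shape are hypothetical, not Dynamo APIs):

```python
def resolve_split_brain(key, edge_entry, cloud_entry):
    """Return the entry that survives after reconnection.

    Entries are dicts like {"version": int, "value": ...}.
    Policy (must be agreed before deployment):
      - config/ and model/ keys: cloud is authoritative
      - telemetry/ and local/ keys: edge is authoritative
      - anything else: higher version wins, cloud breaks ties
    """
    if key.startswith(("config/", "model/")):
        return cloud_entry
    if key.startswith(("telemetry/", "local/")):
        return edge_entry
    if edge_entry["version"] > cloud_entry["version"]:
        return edge_entry
    return cloud_entry
```

The key point is that the rule is deterministic and data-independent, so both sides can apply it without coordination the moment connectivity returns.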
Heterogeneous Hardware Support
Sharding Across Mixed GPUs
class HeterogeneousShardPlanner:
"""
Plan model sharding across mixed GPU hardware.
    Challenge: standard tensor parallelism assumes
    homogeneous GPUs. An A100 (80GB) and a T4 (16GB)
    cannot each hold half of a 70B model; the A100
    must take proportionally more layers than the T4.

    Solution: pipeline parallelism with uneven stage
    assignment, sizing each stage to its GPU's usable
    memory (a greedy, largest-GPU-first fill).
"""
GPU_SPECS = {
"H100": {
"vram_gb": 80,
"tflops_fp16": 989,
"bandwidth_gbs": 3350,
},
"A100": {
"vram_gb": 80,
"tflops_fp16": 312,
"bandwidth_gbs": 2039,
},
"L40S": {
"vram_gb": 48,
"tflops_fp16": 362,
"bandwidth_gbs": 864,
},
"A10G": {
"vram_gb": 24,
"tflops_fp16": 125,
"bandwidth_gbs": 600,
},
"T4": {
"vram_gb": 16,
"tflops_fp16": 65,
"bandwidth_gbs": 300,
},
"Jetson_Orin": {
"vram_gb": 32, # Shared memory
"tflops_fp16": 67,
"bandwidth_gbs": 204,
},
}
def plan_sharding(self, model_config, available_gpus):
"""
Plan how to shard a model across heterogeneous GPUs.
model_config: {n_layers, hidden_dim, n_heads, ...}
available_gpus: list of GPU types
"""
n_layers = model_config["n_layers"]
model_size_gb = model_config["size_gb"]
# Calculate per-layer memory
per_layer_gb = model_size_gb / n_layers
# Sort GPUs by memory (largest first)
gpu_specs = [
(gpu, self.GPU_SPECS.get(gpu, {}))
for gpu in available_gpus
]
gpu_specs.sort(
key=lambda x: x[1].get("vram_gb", 0),
reverse=True,
)
        # Capacity check; layers are then assigned greedily, largest GPU first
total_usable_memory = sum(
spec.get("vram_gb", 0) * 0.85 # 85% utilization
for _, spec in gpu_specs
)
if total_usable_memory < model_size_gb:
return {
"feasible": False,
"error": (
f"Need {model_size_gb:.1f} GB, "
f"have {total_usable_memory:.1f} GB"
),
}
assignments = []
remaining_layers = n_layers
for gpu_name, spec in gpu_specs:
usable = spec.get("vram_gb", 0) * 0.85
layers_for_gpu = min(
int(usable / per_layer_gb),
remaining_layers,
)
assignments.append({
"gpu": gpu_name,
"layers": layers_for_gpu,
"memory_used_gb": layers_for_gpu * per_layer_gb,
"memory_available_gb": spec.get("vram_gb", 0),
"compute_tflops": spec.get("tflops_fp16", 0),
})
remaining_layers -= layers_for_gpu
if remaining_layers <= 0:
break
        # Check for bottleneck (slowest stage determines throughput)
        if assignments:
            # Stage time is proportional to layers / compute
            times = [
                a["layers"] / max(a["compute_tflops"], 1)
                for a in assignments
            ]
            bottleneck_idx = max(
                range(len(times)), key=times.__getitem__
            )
            bottleneck = assignments[bottleneck_idx]
else:
bottleneck = None
return {
"feasible": remaining_layers <= 0,
"assignments": assignments,
"bottleneck": bottleneck,
"total_layers": n_layers,
"assigned_layers": n_layers - remaining_layers,
}
Heterogeneous Sharding: Llama 3.1 70B Across Mixed GPUs
| Configuration | GPU Setup | Throughput (tok/s) | TTFT (ms) | Memory Used | Notes |
|---|---|---|---|---|---|
| Homogeneous baseline | 4x A100 80GB | 2400 | 45 | 280 GB | Standard TP=4 |
| Mixed high-end | 2x A100 + 2x L40S | 1800 | 65 | 256 GB | Pipeline parallel, L40S is bottleneck |
| Mixed with consumer | 1x A100 + 3x A10G | 900 | 120 | 152 GB | A10G limits throughput |
| All edge | 4x L40S 48GB | 1600 | 55 | 192 GB | Viable for edge datacenter |
| Quantized edge | 2x A10G (Q4_K_M) | 600 | 90 | 40 GB | Quantized fits 2 GPUs |
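The quantized-edge row can be sanity-checked with back-of-envelope arithmetic. Q4_K_M averages roughly 4.5 bits per weight (an approximation; actual GGUF files vary slightly by layer mix):

```python
def quantized_model_size_gb(n_params_b, bits_per_weight):
    """Approximate weight memory for a quantized model.

    n_params_b: parameter count in billions
    bits_per_weight: average bits per weight (Q4_K_M ~ 4.5, FP16 = 16)
    Excludes KV cache and activation memory.
    """
    return n_params_b * 1e9 * bits_per_weight / 8 / 1e9

fp16 = quantized_model_size_gb(70, 16)   # 140 GB: needs multiple 80GB GPUs
q4km = quantized_model_size_gb(70, 4.5)  # ~39.4 GB: fits 2x A10G (48 GB)
```

This is weights only; KV cache and activations still need headroom, which is why the table reports 40 GB used on a 48 GB pair rather than a comfortable margin.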
Offline Mode and Synchronization
Operating Without Cloud Connectivity
class OfflineOperationManager:
"""
Manage edge node operation during cloud disconnection.
Offline capabilities:
1. Continue serving with locally loaded models
2. Queue configuration updates for sync
3. Buffer telemetry data for upload
4. Local model versioning and rollback
5. Degraded routing (local-only, no cloud fallback)
"""
def __init__(self, config):
self.connectivity_state = ConnectivityState.CONNECTED
self.pending_sync_queue = []
self.telemetry_buffer = []
self.local_model_registry = {}
self.max_offline_hours = config.get(
"max_offline_hours", 72
)
self.offline_start_time = None
def enter_offline_mode(self):
"""
Transition to offline mode.
Actions:
1. Cache current routing table locally
2. Switch to local-only routing
3. Start buffering telemetry
4. Set offline timer
"""
self.connectivity_state = (
ConnectivityState.DISCONNECTED
)
self.offline_start_time = time.time()
return {
"status": "offline",
"cached_models": list(
self.local_model_registry.keys()
),
"max_offline_hours": self.max_offline_hours,
}
def check_offline_health(self):
"""
Check health during offline operation.
Concerns:
- Model staleness (running outdated version)
- KV cache growth (no distributed eviction)
- Telemetry buffer overflow
- Certificates and tokens expiring
"""
if self.offline_start_time is None:
return {"healthy": True}
offline_hours = (
time.time() - self.offline_start_time
) / 3600
issues = []
if offline_hours > self.max_offline_hours:
issues.append(
f"Offline for {offline_hours:.1f}h "
f"(max: {self.max_offline_hours}h)"
)
if len(self.telemetry_buffer) > 100000:
issues.append(
f"Telemetry buffer at {len(self.telemetry_buffer)} "
f"entries (risk of overflow)"
)
return {
"healthy": len(issues) == 0,
"offline_hours": round(offline_hours, 1),
"issues": issues,
"telemetry_buffered": len(self.telemetry_buffer),
}
def sync_on_reconnect(self, cloud_client):
"""
Synchronize state when connectivity is restored.
Sync protocol:
1. Upload buffered telemetry
2. Pull latest model versions
3. Pull updated routing configuration
4. Apply queued configuration changes
5. Resume normal operation
"""
sync_results = {
"telemetry_uploaded": 0,
"models_updated": 0,
"config_changes_applied": 0,
}
# Upload telemetry
if self.telemetry_buffer:
cloud_client.upload_telemetry(
self.telemetry_buffer
)
sync_results["telemetry_uploaded"] = len(
self.telemetry_buffer
)
self.telemetry_buffer = []
# Pull model updates
cloud_models = cloud_client.get_model_versions()
for model_id, version in cloud_models.items():
local_version = (
self.local_model_registry.get(model_id, {})
.get("version", "")
)
if version != local_version:
cloud_client.download_model(
model_id, version
)
self.local_model_registry[model_id] = {
"version": version
}
sync_results["models_updated"] += 1
# Apply queued changes
for change in self.pending_sync_queue:
cloud_client.apply_change(change)
sync_results["config_changes_applied"] += 1
self.pending_sync_queue = []
# Resume normal operation
self.connectivity_state = ConnectivityState.CONNECTED
self.offline_start_time = None
return sync_results
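The telemetry-overflow concern flagged in `check_offline_health` is easy to prevent outright with a bounded buffer. A minimal sketch using `collections.deque`, which evicts the oldest entries once full (the class name and cap are illustrative, not part of Dynamo):

```python
from collections import deque

class BoundedTelemetryBuffer:
    """Telemetry buffer that caps memory use during long disconnections.

    deque(maxlen=...) drops the oldest entry automatically when full,
    so the newest telemetry always survives; we count what was lost.
    """
    def __init__(self, max_entries=100_000):
        self._buf = deque(maxlen=max_entries)
        self.dropped = 0

    def append(self, entry):
        if len(self._buf) == self._buf.maxlen:
            self.dropped += 1  # oldest entry is about to be evicted
        self._buf.append(entry)

    def drain(self):
        """Return all buffered entries and clear the buffer."""
        entries = list(self._buf)
        self._buf.clear()
        return entries
```

Dropping oldest-first is a policy choice: for capacity planning, recent samples matter more than a complete history, and the `dropped` counter preserves evidence that a gap exists.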
Edge Inference Latency vs Cloud (by Request Type)
| Deployment | Simple QA | Code Gen | Long Context (32K) | Multi-turn (10 turns) | Streaming (TTFT) |
|---|---|---|---|---|---|
| Edge (local L40S) | | | | | |
| Cloud (A100, 100ms network) | | | | | |
| Hybrid (edge + cloud overflow) | | | | | |
Key Takeaways
Extending Dynamo to edge and hybrid deployments requires architectural changes to handle intermittent connectivity, heterogeneous hardware, and autonomous operation.
The critical findings:
- Edge reduces latency 2-3x for simple requests: Local inference eliminates 100-200ms of network round-trip. For streaming use cases (TTFT), edge delivers 25ms vs 130ms from cloud — a 5x improvement that users perceive as “instant.”
- Heterogeneous sharding works with pipeline parallelism: Mixed GPU deployments (A100 + L40S + A10G) are viable using uneven pipeline-parallel stage assignment. The bottleneck is always the weakest GPU — plan accordingly.
- 72-hour offline operation is practical: Edge nodes can operate for 72 hours without cloud connectivity if models are pre-loaded and telemetry is buffered. Beyond 72 hours, model staleness and certificate expiration become concerns.
- Synchronization on reconnect must be idempotent: The sync protocol must handle partial uploads, duplicate telemetry, and version conflicts. Every sync operation should be idempotent (safe to retry) and order-independent.
- Quantized models are essential for edge: Full FP16 models require expensive GPU hardware. Q4_K_M quantization retains 95% of quality while fitting Llama 70B on 2x A10G (48GB total) instead of 4x A100 (320GB total) — an 85% hardware cost reduction.
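The idempotency requirement can be made concrete with content-addressed deduplication: derive a deterministic ID from each telemetry batch, so a retried upload is a no-op on the receiving side. A sketch (the sink class is a hypothetical server-side stub, not a Dynamo API):

```python
import hashlib
import json

def batch_id(entries):
    """Deterministic ID for a telemetry batch: same content, same ID."""
    payload = json.dumps(entries, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

class IdempotentSink:
    """Server-side stub: re-uploading an identical batch is a no-op."""
    def __init__(self):
        self.seen = set()
        self.stored = []

    def upload(self, entries):
        bid = batch_id(entries)
        if bid in self.seen:
            return "duplicate"  # safe to retry: nothing is double-counted
        self.seen.add(bid)
        self.stored.extend(entries)
        return "stored"
```

With this in place, an edge node that crashes mid-sync can simply replay its entire pending queue on the next reconnect, satisfying both the safe-to-retry and order-independence requirements.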