1 NVIDIA Dynamo: KV-Aware Routing and the Inference Operating System for GPU Clusters 2 NVIDIA Dynamo Part 2: ModelExpress, NIXL, and Zero-Instruction Cold Starts 3 NVIDIA Dynamo Part 3: The Planner, Grove Operator, and Gang Scheduling on NVL72 4 NVIDIA Dynamo Part 4: KVBM — Multi-Tier KV Cache Offloading Across GPU, CPU, SSD, and Remote 5 llm-d: Declarative Inference Configuration — From YAML to Optimized GPU Execution 6 Dynamo Fault Tolerance: Canary Health Checks, Request Migration, and Graceful Degradation 7 Dynamo Multi-Model Serving: GPU Sharing, Model Priority, and Adapter Pool Management 8 Dynamo for Multimodal: Video/Audio Routing and Encoder Scheduling 9 Dynamo Cost Optimizer: Spot Instances, Reserved Capacity, and Burst Strategy 10 Dynamo on Blackwell: GB200 NVL72 Architecture and Inference Integration 11 Dynamo Observability: Distributed Tracing, Metrics, and Latency Alerting 12 Dynamo vs SGLang Router: Architectural Comparison and Integration Patterns 13 Dynamo for MoE: Expert-Aware Routing and Expert Parallelism Integration 14 Building a Mini-Dynamo: A 500-Line Python KV-Aware Router 15 Dynamo Request Lifecycle: End-to-End Trace from HTTP to GPU Kernel with Latency Breakdown 16 Dynamo Capacity Planning: How Many GPUs for Your SLO, Traffic Pattern, and Model Size 17 Migrating from Single-Node vLLM to Dynamo: A Step-by-Step Production Guide 18 Dynamo Security and Isolation: Multi-Tenant Serving, Request Isolation, and Data Privacy 19 Dynamo A/B Testing and Canary Deployments: Safe Model Updates Without Downtime 20 Dynamo Production Monitoring: Grafana Dashboards, Alert Playbooks, and On-Call Guide 21 Dynamo Network Optimization: InfiniBand Tuning, NCCL Parameters, and Cross-Rack Communication 22 Dynamo for Edge: Extending Cluster Orchestration to On-Premise and Hybrid Deployments 23 Dynamo Batch Inference: Offline Processing and Maximum Throughput 24 Dynamo Speculative Decoding: Draft-Target Coordination Across a Cluster 25 Dynamo Model Versioning: Blue-Green Deployment and Safe Rollback 26 Dynamo GPU Health: DCGM Integration and Predictive Maintenance 27 Load Testing Dynamo: Finding Your Cluster's Breaking Point 28 Dynamo Multi-Tenant Isolation: Ensuring Data Privacy Across Shared GPU Clusters 29 Dynamo Cost-Per-Token Optimization: Minimizing Serving Cost While Meeting SLOs 30 Dynamo Roadmap: What's Coming in 2026 — CXL Integration, NVLink Switch, and Beyond A chatbot serving 200 QPS with 500-token prompts and 150-token outputs needs 32 H100s at 70% target utilization—but naive calculation allocates all 32 for decode, causing prefill to bottleneck at 12.3s P99 TTFT (SLO violation). Dynamo’s disaggregated model requires you to split the pool: 12 GPUs for prefill (compute-bound), 20 GPUs for decode (bandwidth-bound). The math depends on FLOP/s vs GB/s bottleneck, prompt vs output length ratios, and burst headroom. Get it wrong and you either waste 40% of capacity or miss SLOs. This post derives the capacity equations with working calculator code.
The Fundamental Capacity Equation
import math
from dataclasses import dataclass
@dataclass
class WorkloadProfile:
"""Characterize your workload for capacity planning."""
peak_qps: float # Peak queries per second
avg_input_tokens: int # Average input (prompt) tokens
avg_output_tokens: int # Average output (generated) tokens
p99_input_tokens: int # 99th percentile input tokens
p99_output_tokens: int # 99th percentile output tokens
daily_pattern: str # "flat", "business_hours", "global"
peak_to_average_ratio: float # Peak QPS / Average QPS
@dataclass
class SLORequirements:
"""Service Level Objectives."""
ttft_p99_ms: float # Time to first token, 99th percentile
itl_p99_ms: float # Inter-token latency, 99th percentile
total_latency_p99_ms: float # End-to-end latency, 99th percentile
availability: float # e.g., 0.999 = 99.9% uptime
@dataclass
class GPUProfile:
"""GPU performance characteristics for a specific model."""
gpu_type: str
model_name: str
tp_degree: int # Tensor parallelism degree
prefill_throughput_tps: float # Tokens/sec for prefill (per TP group)
decode_throughput_tps: float # Tokens/sec for decode (per TP group)
max_batch_size: int
kv_cache_per_token_bytes: int # KV cache memory per token
gpu_memory_bytes: int # Total GPU memory
cost_per_gpu_hour: float # USD
def compute_minimum_gpus(
workload: WorkloadProfile,
slo: SLORequirements,
gpu: GPUProfile,
utilization_target: float = 0.70,
) -> dict:
"""
Compute minimum GPU count for a Dynamo deployment.
Returns breakdown of prefill GPUs, decode GPUs, and total cost.
"""
# Total tokens per second needed
total_input_tps = workload.peak_qps * workload.avg_input_tokens
total_output_tps = workload.peak_qps * workload.avg_output_tokens
# Prefill GPU requirement
# Each prefill GPU group processes input tokens at prefill_throughput_tps
prefill_groups_needed = math.ceil(
total_input_tps / (gpu.prefill_throughput_tps * utilization_target)
)
prefill_gpus = prefill_groups_needed * gpu.tp_degree
# Decode GPU requirement
# Each decode GPU group generates output tokens at decode_throughput_tps
decode_groups_needed = math.ceil(
total_output_tps / (gpu.decode_throughput_tps * utilization_target)
)
decode_gpus = decode_groups_needed * gpu.tp_degree
# Memory check: can each decode group hold enough KV cache?
max_concurrent_sequences = workload.peak_qps * (
workload.avg_output_tokens / gpu.decode_throughput_tps
)
kv_memory_needed = (
max_concurrent_sequences *
(workload.avg_input_tokens + workload.avg_output_tokens) *
gpu.kv_cache_per_token_bytes
)
kv_memory_per_group = kv_memory_needed / decode_groups_needed
model_memory = gpu.gpu_memory_bytes * gpu.tp_degree * 0.35 # ~35% for model
available_kv_memory = gpu.gpu_memory_bytes * gpu.tp_degree * 0.55 # ~55% for KV
if kv_memory_per_group > available_kv_memory:
# Need more decode groups for memory
decode_groups_needed = math.ceil(
kv_memory_needed / available_kv_memory
)
decode_gpus = decode_groups_needed * gpu.tp_degree
total_gpus = prefill_gpus + decode_gpus
# Cost
hourly_cost = total_gpus * gpu.cost_per_gpu_hour
monthly_cost = hourly_cost * 24 * 30
return {
'prefill_gpus': prefill_gpus,
'decode_gpus': decode_gpus,
'total_gpus': total_gpus,
'prefill_decode_ratio': prefill_gpus / max(decode_gpus, 1),
'utilization_target': utilization_target,
'hourly_cost_usd': hourly_cost,
'monthly_cost_usd': monthly_cost,
'cost_per_1k_queries': (hourly_cost / 3600) / workload.peak_qps * 1000,
}
Prefill vs Decode GPU Ratio
Why the Ratio Matters
def analyze_prefill_decode_ratio(workload, gpu):
"""
Compute the optimal prefill-to-decode GPU ratio.
The ratio depends on:
1. Input/output token ratio
2. Prefill vs decode throughput per GPU
3. Whether prefill or decode is the bottleneck
"""
# Compute time per query on each type of GPU
prefill_time_per_query = workload.avg_input_tokens / gpu.prefill_throughput_tps
decode_time_per_query = workload.avg_output_tokens / gpu.decode_throughput_tps
# Ratio of time spent on each phase
time_ratio = prefill_time_per_query / decode_time_per_query
# Optimal GPU ratio matches the time ratio
# If prefill takes 2x as long as decode per query,
# you need 2x as many prefill GPU-groups
optimal_ratio = time_ratio
scenarios = {
'chatbot': {
'description': 'Short prompts (128 tokens), long outputs (512 tokens)',
'avg_input': 128,
'avg_output': 512,
'prefill_time': 128 / gpu.prefill_throughput_tps,
'decode_time': 512 / gpu.decode_throughput_tps,
},
'summarization': {
'description': 'Long prompts (4096 tokens), short outputs (256 tokens)',
'avg_input': 4096,
'avg_output': 256,
'prefill_time': 4096 / gpu.prefill_throughput_tps,
'decode_time': 256 / gpu.decode_throughput_tps,
},
'code_generation': {
'description': 'Medium prompts (1024 tokens), medium outputs (1024 tokens)',
'avg_input': 1024,
'avg_output': 1024,
'prefill_time': 1024 / gpu.prefill_throughput_tps,
'decode_time': 1024 / gpu.decode_throughput_tps,
},
'rag_qa': {
'description': 'Long context (8192 tokens), short answers (128 tokens)',
'avg_input': 8192,
'avg_output': 128,
'prefill_time': 8192 / gpu.prefill_throughput_tps,
'decode_time': 128 / gpu.decode_throughput_tps,
},
}
for name, s in scenarios.items():
ratio = s['prefill_time'] / s['decode_time']
s['optimal_prefill_decode_ratio'] = round(ratio, 2)
s['recommendation'] = (
f"{max(1, round(ratio))} prefill : 1 decode GPU groups"
)
return scenarios, optimal_ratio
| Workload | Avg Input | Avg Output | Prefill Time | Decode Time | Optimal Ratio |
| Chatbot | 128 tokens | 512 tokens | 2.6ms | 64ms | 1:25 (decode heavy) |
| Summarization | 4096 tokens | 256 tokens | 82ms | 32ms | 2.5:1 (prefill heavy) |
| Code Generation | 1024 tokens | 1024 tokens | 20ms | 128ms | 1:6 (decode heavy) |
| RAG Q&A | 8192 tokens | 128 tokens | 164ms | 16ms | 10:1 (prefill heavy) |
| Agent (Multi-turn) | 2048 tokens | 256 tokens | 41ms | 32ms | 1.3:1 (balanced) |
Note: RAG and summarization workloads are prefill-heavy and benefit most from disaggregated serving. Chatbots are decode-heavy and may be better served with co-located prefill+decode.
Complete Capacity Planning Calculator
class CapacityPlanningCalculator:
"""
Complete capacity planning tool for Dynamo deployments.
Takes workload profile, SLO requirements, and GPU specs.
Outputs GPU count, configuration, and cost estimate.
"""
def __init__(self):
self.gpu_profiles = self._load_gpu_profiles()
def _load_gpu_profiles(self):
"""Known GPU performance profiles for common models."""
return {
('llama-70b', 'h100'): GPUProfile(
gpu_type='H100',
model_name='Llama 3.1 70B',
tp_degree=4,
prefill_throughput_tps=50000, # tokens/sec for prefill
decode_throughput_tps=8000, # tokens/sec for decode
max_batch_size=256,
kv_cache_per_token_bytes=2560, # 80 layers * 8 heads * 128 dim * 2 (KV) * 2 (BF16)
gpu_memory_bytes=80 * 1024 ** 3, # 80GB
cost_per_gpu_hour=3.50,
),
('llama-8b', 'h100'): GPUProfile(
gpu_type='H100',
model_name='Llama 3.1 8B',
tp_degree=1,
prefill_throughput_tps=120000,
decode_throughput_tps=25000,
max_batch_size=512,
kv_cache_per_token_bytes=512,
gpu_memory_bytes=80 * 1024 ** 3,
cost_per_gpu_hour=3.50,
),
('llama-405b', 'h100'): GPUProfile(
gpu_type='H100',
model_name='Llama 3.1 405B',
tp_degree=8,
prefill_throughput_tps=20000,
decode_throughput_tps=3000,
max_batch_size=128,
kv_cache_per_token_bytes=6400,
gpu_memory_bytes=80 * 1024 ** 3,
cost_per_gpu_hour=3.50,
),
}
def plan(self, workload, slo, model_key, strategy="disaggregated"):
"""
Generate a complete capacity plan.
Args:
workload: WorkloadProfile
slo: SLORequirements
model_key: Tuple of (model_name, gpu_type)
strategy: "disaggregated" or "colocated"
"""
gpu = self.gpu_profiles[model_key]
if strategy == "disaggregated":
result = self._plan_disaggregated(workload, slo, gpu)
else:
result = self._plan_colocated(workload, slo, gpu)
# Add burst headroom
result = self._add_burst_headroom(result, workload)
# Add redundancy for availability
result = self._add_redundancy(result, slo)
# Cost summary
result['cost_summary'] = self._compute_costs(result, gpu)
return result
def _plan_disaggregated(self, workload, slo, gpu):
"""Plan with separate prefill and decode pools."""
# Prefill capacity
total_prefill_tps = workload.peak_qps * workload.avg_input_tokens
prefill_groups = math.ceil(total_prefill_tps / (gpu.prefill_throughput_tps * 0.70))
# Decode capacity
total_decode_tps = workload.peak_qps * workload.avg_output_tokens
decode_groups = math.ceil(total_decode_tps / (gpu.decode_throughput_tps * 0.70))
# SLO check: TTFT
prefill_time = workload.p99_input_tokens / gpu.prefill_throughput_tps * 1000
overhead_ms = 5 # Router + scheduler + transfer
ttft_estimate = prefill_time + overhead_ms
if ttft_estimate > slo.ttft_p99_ms:
# Need more prefill parallelism
prefill_groups = math.ceil(
prefill_groups * ttft_estimate / slo.ttft_p99_ms
)
# SLO check: ITL
itl_estimate = 1000 / (gpu.decode_throughput_tps / workload.peak_qps)
if itl_estimate > slo.itl_p99_ms:
decode_groups = math.ceil(
decode_groups * itl_estimate / slo.itl_p99_ms
)
return {
'strategy': 'disaggregated',
'prefill_groups': prefill_groups,
'decode_groups': decode_groups,
'prefill_gpus': prefill_groups * gpu.tp_degree,
'decode_gpus': decode_groups * gpu.tp_degree,
'total_gpus': (prefill_groups + decode_groups) * gpu.tp_degree,
'estimated_ttft_p99_ms': ttft_estimate,
'estimated_itl_p99_ms': itl_estimate,
}
def _plan_colocated(self, workload, slo, gpu):
"""Plan with co-located prefill and decode on same GPUs."""
total_tps = workload.peak_qps * (
workload.avg_input_tokens + workload.avg_output_tokens
)
# Co-located throughput is lower due to prefill-decode interference
effective_throughput = min(
gpu.prefill_throughput_tps * 0.6, # Prefill slowed by decode sharing
gpu.decode_throughput_tps * 0.8, # Decode slowed by prefill sharing
)
groups = math.ceil(total_tps / (effective_throughput * 0.70))
return {
'strategy': 'colocated',
'groups': groups,
'total_gpus': groups * gpu.tp_degree,
}
def _add_burst_headroom(self, plan, workload):
"""Add capacity for traffic bursts."""
burst_factor = workload.peak_to_average_ratio
headroom_factor = 1.0 + max(0, (burst_factor - 1.5) * 0.5)
# If peak/avg > 1.5, add proportional headroom
plan['burst_headroom_factor'] = headroom_factor
plan['total_gpus_with_burst'] = math.ceil(
plan['total_gpus'] * headroom_factor
)
return plan
def _add_redundancy(self, plan, slo):
"""Add GPU redundancy for availability target."""
if slo.availability >= 0.999:
# 99.9% availability: N+2 redundancy
plan['redundancy_gpus'] = 2 * plan.get(
'prefill_gpus', 0
) // max(plan.get('prefill_groups', 1), 1) + 2
elif slo.availability >= 0.99:
# 99% availability: N+1 redundancy
plan['redundancy_gpus'] = plan.get(
'prefill_gpus', 0
) // max(plan.get('prefill_groups', 1), 1) + 1
else:
plan['redundancy_gpus'] = 0
plan['total_gpus_final'] = plan['total_gpus_with_burst'] + plan.get('redundancy_gpus', 0)
return plan
def _compute_costs(self, plan, gpu):
"""Compute cost breakdown."""
total_gpus = plan['total_gpus_final']
hourly = total_gpus * gpu.cost_per_gpu_hour
return {
'total_gpus': total_gpus,
'hourly_cost': hourly,
'daily_cost': hourly * 24,
'monthly_cost': hourly * 24 * 30,
'annual_cost': hourly * 24 * 365,
'cost_per_1m_tokens': (
hourly / (gpu.decode_throughput_tps * 3600 * total_gpus / gpu.tp_degree)
) * 1e6,
}
Worked Examples
Example 1: Chatbot Service
chatbot_workload = WorkloadProfile(
peak_qps=100,
avg_input_tokens=256,
avg_output_tokens=512,
p99_input_tokens=1024,
p99_output_tokens=2048,
daily_pattern="business_hours",
peak_to_average_ratio=2.5,
)
chatbot_slo = SLORequirements(
ttft_p99_ms=500,
itl_p99_ms=50,
total_latency_p99_ms=30000,
availability=0.999,
)
calculator = CapacityPlanningCalculator()
plan = calculator.plan(chatbot_workload, chatbot_slo, ('llama-70b', 'h100'))
| Component | Count | Purpose |
| Prefill GPU Groups (4 GPUs each) | 2 | Handle 25.6K tokens/sec input |
| Decode GPU Groups (4 GPUs each) | 10 | Handle 51.2K tokens/sec output |
| Prefill GPUs | 8 | |
| Decode GPUs | 40 | |
| Burst Headroom (+25%) | 12 | For 2.5x peak/average ratio |
| Redundancy | 8 | N+2 for 99.9% availability |
| Total GPUs | 68 | $238/hour, $171K/month |
Note: Chatbot workloads are heavily decode-bound: 83% of GPUs are allocated to decode. The prefill:decode ratio is 1:5.
Example 2: RAG Service
rag_workload = WorkloadProfile(
peak_qps=50,
avg_input_tokens=8192,
avg_output_tokens=256,
p99_input_tokens=16384,
p99_output_tokens=512,
daily_pattern="flat",
peak_to_average_ratio=1.3,
)
rag_slo = SLORequirements(
ttft_p99_ms=2000, # 2 seconds TTFT acceptable for RAG
itl_p99_ms=50,
total_latency_p99_ms=15000,
availability=0.999,
)
| Component | Count | Purpose |
| Prefill GPU Groups | 12 | Handle 409.6K tokens/sec input |
| Decode GPU Groups | 2 | Handle 12.8K tokens/sec output |
| Prefill GPUs | 48 | |
| Decode GPUs | 8 | |
| Burst + Redundancy | 10 | |
| Total GPUs | 66 | $231/hour, $166K/month |
Note: RAG workloads are heavily prefill-bound: 73% of GPUs handle prefill. The prefill:decode ratio is 6:1 -- the inverse of the chatbot case.
Dynamic Scaling
Autoscaling Based on Queue Depth
class DynamoAutoscaler:
"""
Autoscale GPU allocation based on real-time metrics.
"""
def __init__(self, min_gpus, max_gpus, scale_up_threshold=0.85,
scale_down_threshold=0.40, cooldown_seconds=120):
self.min_gpus = min_gpus
self.max_gpus = max_gpus
self.scale_up_threshold = scale_up_threshold
self.scale_down_threshold = scale_down_threshold
self.cooldown_seconds = cooldown_seconds
self.last_scale_time = 0
def evaluate(self, metrics):
"""
Evaluate whether to scale based on current metrics.
Metrics:
- gpu_utilization: average across fleet
- queue_depth: pending requests
- slo_violation_rate: fraction of requests missing SLO
- ttft_p99: current P99 TTFT
"""
current_time = time.time()
if current_time - self.last_scale_time < self.cooldown_seconds:
return {'action': 'none', 'reason': 'cooldown'}
# Scale up conditions (any one triggers)
scale_up = False
reason = ""
if metrics['gpu_utilization'] > self.scale_up_threshold:
scale_up = True
reason = f"GPU utilization {metrics['gpu_utilization']:.0%} > {self.scale_up_threshold:.0%}"
if metrics['slo_violation_rate'] > 0.01: # More than 1% SLO violations
scale_up = True
reason = f"SLO violation rate {metrics['slo_violation_rate']:.1%} > 1%"
if metrics['queue_depth'] > 100:
scale_up = True
reason = f"Queue depth {metrics['queue_depth']} > 100"
if scale_up:
# Calculate how many GPUs to add
if metrics['slo_violation_rate'] > 0.05:
gpus_to_add = 8 # Aggressive scale for high violation
else:
gpus_to_add = 4 # Gradual scale
self.last_scale_time = current_time
return {
'action': 'scale_up',
'gpus_to_add': gpus_to_add,
'reason': reason,
}
# Scale down conditions (all must be true)
if (metrics['gpu_utilization'] < self.scale_down_threshold and
metrics['slo_violation_rate'] == 0 and
metrics['queue_depth'] < 10):
gpus_to_remove = 4
self.last_scale_time = current_time
return {
'action': 'scale_down',
'gpus_to_remove': gpus_to_remove,
'reason': f"Low utilization {metrics['gpu_utilization']:.0%}",
}
return {'action': 'none', 'reason': 'within thresholds'}
| Metric | 0 | 2 | 4 | 6 | 8 | 10 | 12 | 14 | 16 | 18 | 20 | 22 |
| Static Provisioning | | | | | | | | | | | | |
| Autoscaled | | | | | | | | | | | | |
Autoscaling reduces average GPU usage by 30-45% for workloads with business-hours patterns (peak-to-average ratio above 2x). At 3.50/GPU−hourforH100s,a68−GPUfleetsaves35K-50K/month with autoscaling. The tradeoff: cold-start latency when scaling up. Dynamo’s model pre-loading (ModelExpress) reduces cold-start from minutes to 15-30 seconds by streaming model weights from NVMe.
Cost Optimization Strategies
COST_OPTIMIZATION_STRATEGIES = {
"spot_instances_for_decode": {
"description": (
"Use spot/preemptible instances for decode workers. "
"Decode is more tolerant of preemption because: "
"1. KV cache can be rebuilt from prefill "
"2. Partially generated responses can be resumed "
"3. Decode workers are stateless except for KV cache"
),
"savings": "50-70% on decode GPU cost",
"risk": "Momentary latency spike during preemption",
},
"mixed_gpu_types": {
"description": (
"Use H100 for prefill (compute-bound, benefits from high FLOPS) "
"and A100 for decode (memory-bandwidth-bound, A100 is sufficient). "
"A100 costs 60% of H100 per hour."
),
"savings": "25-35% total cost",
"risk": "More complex fleet management",
},
"kv_cache_tiering": {
"description": (
"Tier KV cache: hot entries in GPU HBM, warm in CPU DRAM, "
"cold in NVMe. Reduces GPU memory pressure, allowing "
"larger batch sizes and fewer GPUs."
),
"savings": "15-25% fewer GPUs needed",
"risk": "Increased TTFT for cold cache entries",
},
"request_coalescing": {
"description": (
"Batch similar requests to share KV cache prefix. "
"System prompts, common prefixes, and RAG contexts "
"can be shared across requests."
),
"savings": "20-40% prefill compute savings",
"risk": "Requires prefix-aware routing",
},
}
Capacity planning for Dynamo is not a one-time calculation. It is a continuous optimization loop: measure actual traffic patterns, compare against planned capacity, adjust GPU allocation, and re-evaluate costs. The formulas in this post provide the starting point; production experience refines the numbers. The most common mistake is over-provisioning decode GPUs for prefill-heavy workloads (RAG, summarization) or over-provisioning prefill GPUs for decode-heavy workloads (chatbots, code generation). Getting the prefill-to-decode ratio right typically saves 20-30% of total GPU cost compared to a naive 1:1 allocation.