Dynamo Cost-Per-Token Optimization: Minimizing Serving Cost While Meeting SLOs (part 29 of 30 in the NVIDIA Dynamo & llm-d series)

At 30 tokens/sec on an H100 ($2.50/hour), you're paying $0.023 per 1,000 tokens. Doubling throughput to 60 tokens/sec via FP8 quantization cuts cost to $0.011 — same quality, half the bill. Increasing batch size from 128 to 512 sequences pushes throughput to 89 tokens/sec ($0.0078 per 1K tokens), but P99 TTFT rises from 340ms to 1.2s, violating your SLO. The optimization problem: maximize tokens/sec while staying under latency constraints. This post provides the cost model and walks through a real optimization: starting at $0.023, reaching $0.009 through quantization + batch tuning + autoscaling.
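The arithmetic behind those figures is worth making explicit. A one-line helper (using the example $2.50/hr H100 price above) reproduces each number to within rounding:

```python
def cost_per_1k_tokens(gpu_cost_per_hour: float, tokens_per_second: float) -> float:
    """Dollars spent per 1,000 generated tokens on one GPU."""
    return gpu_cost_per_hour / 3600 / tokens_per_second * 1000

# The walkthrough numbers above ($2.50/hr H100)
baseline = cost_per_1k_tokens(2.50, 30)  # ~$0.023
fp8      = cost_per_1k_tokens(2.50, 60)  # ~$0.0116
batched  = cost_per_1k_tokens(2.50, 89)  # ~$0.0078
```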

The Cost Model

Cost per token is the fundamental metric for LLM serving economics.

def cost_per_token(
    gpu_cost_per_hour: float,
    tokens_per_second: float
) -> float:
    """
    Cost per token = GPU cost per second / tokens per second

    gpu_cost_per_hour: cloud GPU cost (e.g., $2.00/hr for A100)
    tokens_per_second: throughput of the serving system
    """
    gpu_cost_per_second = gpu_cost_per_hour / 3600
    return gpu_cost_per_second / tokens_per_second

def cost_model_detailed(
    num_gpus: int,
    gpu_cost_per_hour: float,
    throughput_tokens_per_sec: float,
    utilization_pct: float,
    overhead_pct: float = 5.0  # networking, storage, etc.
) -> dict:
    """
    Detailed cost model including utilization and overhead.
    """
    # Raw GPU cost
    gpu_cost_per_second = (num_gpus * gpu_cost_per_hour) / 3600

    # Effective throughput (accounting for utilization)
    effective_throughput = throughput_tokens_per_sec * (utilization_pct / 100)

    # Total cost including overhead
    total_cost_per_second = gpu_cost_per_second * (1 + overhead_pct / 100)

    cost_per_1k_tokens = (total_cost_per_second / effective_throughput) * 1000
    cost_per_1m_tokens = cost_per_1k_tokens * 1000

    # Monthly cost at sustained load
    monthly_cost = total_cost_per_second * 3600 * 24 * 30
    monthly_tokens = effective_throughput * 3600 * 24 * 30

    return {
        "cost_per_1k_tokens": cost_per_1k_tokens,
        "cost_per_1m_tokens": cost_per_1m_tokens,
        "monthly_gpu_cost": monthly_cost,
        "monthly_tokens_millions": monthly_tokens / 1e6,
        "effective_throughput": effective_throughput,
    }

# Example: Llama 3.1 70B on 8x A100
result = cost_model_detailed(
    num_gpus=8,
    gpu_cost_per_hour=2.00,  # per GPU
    throughput_tokens_per_sec=2000,
    utilization_pct=70
)
print(f"Cost per 1M tokens: ${result['cost_per_1m_tokens']:.2f}")
print(f"Monthly cost: ${result['monthly_gpu_cost']:.0f}")

Cost per 1M Output Tokens (Llama 3.1 70B)

| Configuration | GPUs | Throughput (tok/s) | Cost/1M tokens | Monthly Cost |
|---|---|---|---|---|
| 8x A100, BF16, TP=8 | 8 | 1,200 | $2.96 | $34,560 |
| 8x A100, FP8, TP=8 | 8 | 2,000 | $1.78 | $34,560 |
| 4x H100, FP8, TP=4 | 4 | 3,500 | $0.82 | $28,800 |
| 8x H100, FP8, TP=8 | 8 | 6,000 | $0.95 | $57,600 |
| 2x H100, INT4 AWQ, TP=2 | 2 | 1,800 | $0.79 | $14,400 |
Performance

Cost per token and throughput are not the same optimization target. Adding more GPUs increases throughput but may increase cost per token if the additional GPUs are underutilized. The sweet spot is the minimum number of GPUs that can meet your latency SLO at peak load.
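A quick sketch of why extra GPUs can raise cost per token: past some tensor-parallel degree, communication overhead makes throughput scale sublinearly. The 1.7x-per-doubling factor below is an illustrative assumption, not a measured value:

```python
def cost_per_1m(num_gpus: int, gpu_cost_per_hour: float, throughput_tok_s: float) -> float:
    """Cost per 1M tokens for a cluster at a given aggregate throughput."""
    return num_gpus * gpu_cost_per_hour / 3600 / throughput_tok_s * 1e6

# Illustrative: each doubling of GPUs beyond 4 yields only ~1.7x throughput
configs = [(4, 3500), (8, 3500 * 1.7), (16, 3500 * 1.7 * 1.7)]
for gpus, tput in configs:
    print(f"{gpus:2d} GPUs: {tput:5.0f} tok/s -> ${cost_per_1m(gpus, 3.00, tput):.2f}/1M")
```

Throughput rises with each step, but cost per token rises too: the extra GPUs are progressively less utilized.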

Batch Size Optimization

Batch size is the single most impactful lever for cost optimization. Larger batches amortize the fixed cost of model weight loading across more tokens.

def batch_size_analysis(
    model_params_b: float,
    kv_cache_per_token_bytes: int,
    gpu_memory_gb: float,
    num_gpus: int,
    dtype_bytes: int = 2  # FP16
) -> list:
    """
    Analyze throughput vs batch size.

    At small batch sizes: memory-bandwidth-bound (compute sits idle)
    At large batch sizes: KV-cache-limited (cache fills GPU memory)
    Optimal: the largest batch that still meets the latency SLO
    """
    model_memory = model_params_b * 1e9 * dtype_bytes / 1e9 / num_gpus  # GB per GPU
    available_kv_memory = gpu_memory_gb - model_memory - 2  # 2GB overhead

    results = []
    for batch_size in [1, 4, 8, 16, 32, 64, 128, 256]:
        # KV cache memory for this batch (assuming 2048 avg seq len)
        avg_seq_len = 2048
        kv_memory_gb = (
            batch_size * avg_seq_len * kv_cache_per_token_bytes
            / num_gpus  # KV cache is sharded across tensor-parallel ranks
            / 1e9
        )

        if kv_memory_gb > available_kv_memory:
            break

        # Throughput model (simplified):
        # Small batch: memory-bandwidth-limited; throughput grows ~linearly with batch
        # Large batch: compute- and capacity-limited; throughput saturates
        # throughput = batch * single_token_rate * efficiency
        if batch_size <= 32:
            efficiency = 0.9  # Good GPU utilization
        elif batch_size <= 128:
            efficiency = 0.75  # Some memory bandwidth pressure
        else:
            efficiency = 0.6  # Heavy memory pressure

        single_token_rate = 50  # tokens/sec for batch=1
        throughput = batch_size * single_token_rate * efficiency

        # Per-token latency increases with batch size
        tpot_ms = 1000 / (throughput / batch_size)

        results.append({
            "batch_size": batch_size,
            "kv_memory_gb": kv_memory_gb,
            "throughput_tok_s": throughput,
            "tpot_ms": tpot_ms,
            "efficiency": efficiency,
            "cost_relative": 1.0 / throughput,  # Lower is better
        })

    return results

Throughput vs Batch Size (Llama 70B, 8x A100)

| Batch Size | Throughput (tok/s) |
|---|---|
| 1 | 45 |
| 8 | 340 |
| 32 | 1,200 |
| 64 | 2,000 |
| 128 | 2,800 |
| 256 | 3,200 |

Batch Size vs Latency vs Cost (Llama 70B FP8, 4x H100)

| Batch Size | Throughput (tok/s) | TPOT (ms) | Cost/1M tokens |
|---|---|---|---|
| 1 | 50 | 20 | $14.40 |
| 8 | 380 | 21 | $1.89 |
| 32 | 1,400 | 23 | $0.51 |
| 64 | 2,400 | 27 | $0.30 |
| 128 | 3,200 | 40 | $0.23 |
| 256 | 3,600 | 71 | $0.20 |

The table shows the core trade-off: batch 256 is cheapest per token ($0.20/1M) but has 71ms TPOT. If your SLO requires TPOT < 30ms, batch 64 is the cost-optimal configuration at $0.30/1M tokens.
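Choosing the batch is a filter-then-minimize over measured points. A small sketch (the tuples mirror the table above; `cost_optimal_batch` is an illustrative helper, not a Dynamo API):

```python
# (batch_size, throughput tok/s, TPOT ms, $/1M tokens), from the table above
measurements = [
    (1, 50, 20, 14.40), (8, 380, 21, 1.89), (32, 1400, 23, 0.51),
    (64, 2400, 27, 0.30), (128, 3200, 40, 0.23), (256, 3600, 71, 0.20),
]

def cost_optimal_batch(points, slo_tpot_ms: float):
    """Cheapest measured configuration whose TPOT stays within the SLO."""
    feasible = [p for p in points if p[2] <= slo_tpot_ms]
    return min(feasible, key=lambda p: p[3]) if feasible else None

print(cost_optimal_batch(measurements, slo_tpot_ms=30))  # -> (64, 2400, 27, 0.3)
```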

Quantization Selection for Cost

Quantization reduces GPU count requirements, directly cutting cost.

def quantization_cost_analysis(
    model_params_b: float,
    gpu_cost_per_hour: float,
    gpu_memory_gb: float = 80  # A100
) -> list:
    """
    Compare serving cost across quantization levels.
    """
    quantizations = [
        {"name": "BF16", "bytes_per_param": 2, "quality_loss": 0.0,
         "throughput_multiplier": 1.0},
        {"name": "FP8", "bytes_per_param": 1, "quality_loss": 0.001,
         "throughput_multiplier": 1.8},  # 2x less memory, ~1.8x faster
        {"name": "INT8 (W8A8)", "bytes_per_param": 1, "quality_loss": 0.005,
         "throughput_multiplier": 1.7},
        {"name": "INT4 (AWQ)", "bytes_per_param": 0.5, "quality_loss": 0.02,
         "throughput_multiplier": 2.5},  # 4x less memory, ~2.5x faster
        {"name": "INT4 (GPTQ)", "bytes_per_param": 0.5, "quality_loss": 0.025,
         "throughput_multiplier": 2.3},
    ]

    results = []
    for q in quantizations:
        model_memory_gb = model_params_b * 1e9 * q["bytes_per_param"] / 1e9
        # Ceiling division: fewest GPUs whose 85%-usable memory fits the weights
        min_gpus = max(1, int(-(-model_memory_gb // (gpu_memory_gb * 0.85))))
        # Round up to a power of 2 for tensor parallelism
        if min_gpus > 1:
            min_gpus = 2 ** (min_gpus - 1).bit_length()

        # Base throughput for BF16 on min_gpus
        base_throughput = 400 * min_gpus  # Rough: 400 tok/s per GPU at batch=32
        throughput = base_throughput * q["throughput_multiplier"]

        hourly_cost = min_gpus * gpu_cost_per_hour
        cost_per_1m = (hourly_cost / 3600) / throughput * 1e6

        results.append({
            "quantization": q["name"],
            "model_memory_gb": model_memory_gb,
            "min_gpus": min_gpus,
            "throughput_tok_s": throughput,
            "hourly_cost": hourly_cost,
            "cost_per_1m_tokens": cost_per_1m,
            "quality_loss_ppl": q["quality_loss"],
        })

    return results

# Llama 3.1 70B
for r in quantization_cost_analysis(70, 2.00):
    print(f"{r['quantization']:12s}: {r['min_gpus']} GPUs, "
          f"{r['throughput_tok_s']:.0f} tok/s, "
          f"${r['cost_per_1m_tokens']:.2f}/1M tokens")

Cost per 1M Tokens by Quantization (70B, A100)

| Quantization | Cost/1M tokens |
|---|---|
| BF16 (8 GPU) | $2.78 |
| FP8 (4 GPU) | $0.77 |
| INT8 W8A8 (4 GPU) | $0.82 |
| INT4 AWQ (2 GPU) | $0.44 |
| INT4 GPTQ (2 GPU) | $0.48 |
💡 Tip

INT4 quantization (AWQ/GPTQ) provides the best cost per token for most workloads. The 2% perplexity increase from INT4 is acceptable for chatbot and code generation tasks. For tasks requiring maximum quality (medical, legal), FP8 provides the best cost-quality trade-off.

Autoscaling Policies

Autoscaling matches GPU capacity to demand, avoiding paying for idle GPUs.

class CostAwareAutoscaler:
    """
    Autoscaling policy that minimizes cost while meeting SLOs.
    """

    def __init__(self, config: dict):
        self.min_replicas = config["min_replicas"]
        self.max_replicas = config["max_replicas"]
        self.target_utilization = config["target_utilization_pct"]
        self.scale_up_threshold = config["scale_up_threshold_pct"]
        self.scale_down_threshold = config["scale_down_threshold_pct"]
        self.scale_up_cooldown_sec = config.get("scale_up_cooldown", 60)
        self.scale_down_cooldown_sec = config.get("scale_down_cooldown", 300)
        self.slo_ttft_ms = config.get("slo_ttft_ms", 500)
        self.slo_tpot_ms = config.get("slo_tpot_ms", 50)

    def evaluate(self, metrics: dict) -> dict:
        """
        Evaluate current metrics and decide scaling action.

        metrics: {
            current_replicas: int,
            avg_utilization_pct: float,
            p99_ttft_ms: float,
            p99_tpot_ms: float,
            queue_depth: int,
            requests_per_second: float,
        }
        """
        current = metrics["current_replicas"]

        # SLO violation: ALWAYS scale up
        if (metrics["p99_ttft_ms"] > self.slo_ttft_ms or
            metrics["p99_tpot_ms"] > self.slo_tpot_ms):
            new_replicas = min(current + 1, self.max_replicas)
            return {
                "action": "scale_up",
                "reason": "slo_violation",
                "from": current,
                "to": new_replicas,
            }

        # High utilization: scale up proactively
        if metrics["avg_utilization_pct"] > self.scale_up_threshold:
            new_replicas = min(current + 1, self.max_replicas)
            return {
                "action": "scale_up",
                "reason": "high_utilization",
                "from": current,
                "to": new_replicas,
            }

        # Low utilization: scale down to save cost
        if metrics["avg_utilization_pct"] < self.scale_down_threshold:
            new_replicas = max(current - 1, self.min_replicas)
            if new_replicas < current:
                return {
                    "action": "scale_down",
                    "reason": "low_utilization",
                    "from": current,
                    "to": new_replicas,
                    "estimated_savings_per_hour": (
                        (current - new_replicas) * 8 * 2.00  # GPUs * cost
                    ),
                }

        return {"action": "none", "reason": "within_target"}

    def cost_savings_analysis(self, traffic_pattern: list) -> dict:
        """
        Estimate cost savings from autoscaling vs fixed capacity.

        traffic_pattern: list of (hour, requests_per_second) tuples
        representing a 24-hour traffic pattern.
        """
        # Fixed capacity: provision for peak
        peak_rps = max(rps for _, rps in traffic_pattern)
        gpus_per_replica = 8  # Llama 70B
        gpu_cost_per_hour = 2.00

        # Fixed: enough replicas for peak
        replicas_for_peak = max(1, int(peak_rps / 20) + 1)  # 20 rps per replica
        fixed_daily_cost = replicas_for_peak * gpus_per_replica * gpu_cost_per_hour * 24

        # Autoscaled: scale with demand
        autoscaled_daily_cost = 0
        for hour, rps in traffic_pattern:
            replicas_needed = max(self.min_replicas, int(rps / 20) + 1)
            replicas_needed = min(replicas_needed, self.max_replicas)
            autoscaled_daily_cost += replicas_needed * gpus_per_replica * gpu_cost_per_hour

        return {
            "fixed_daily_cost": fixed_daily_cost,
            "autoscaled_daily_cost": autoscaled_daily_cost,
            "daily_savings": fixed_daily_cost - autoscaled_daily_cost,
            "savings_pct": (1 - autoscaled_daily_cost / fixed_daily_cost) * 100,
        }

Autoscaling Cost Savings (70B, Typical Traffic Pattern)

| Strategy | Daily Cost | Peak Capacity | Off-Peak Waste | Savings |
|---|---|---|---|---|
| Fixed (peak) | $3,840 | 10 replicas | 60% idle at night | - |
| Autoscaled (reactive) | $2,112 | 10 replicas | 15% idle | 45% |
| Autoscaled (predictive) | $1,920 | 10 replicas | 8% idle | 50% |
| Fixed (average) | $1,920 | 5 replicas | SLO violations at peak | 50% |


Preemption and Priority-Based Cost Allocation

Preemption allows high-priority requests to interrupt low-priority ones, improving cost efficiency by avoiding over-provisioning for peak priority traffic.

class PreemptionCostOptimizer:
    """
    Use preemption to serve mixed-priority workloads
    on fewer GPUs.
    """

    def __init__(self):
        self.priority_levels = {
            "realtime": {"slo_ttft_ms": 200, "slo_tpot_ms": 30, "preemptable": False},
            "interactive": {"slo_ttft_ms": 1000, "slo_tpot_ms": 50, "preemptable": False},
            "batch": {"slo_ttft_ms": 30000, "slo_tpot_ms": 100, "preemptable": True},
            "background": {"slo_ttft_ms": 60000, "slo_tpot_ms": 200, "preemptable": True},
        }

    def calculate_preemption_savings(
        self,
        realtime_rps: float,
        interactive_rps: float,
        batch_rps: float,
        background_rps: float,
        gpu_capacity_rps: float = 20
    ) -> dict:
        """
        Calculate GPU savings from preemption.

        Without preemption: each priority needs dedicated capacity
        With preemption: batch/background share GPUs with realtime
        """
        # Without preemption: dedicated, always-on capacity for the combined load
        total_rps = realtime_rps + interactive_rps + batch_rps + background_rps
        no_preempt_gpus = max(1, int(total_rps / gpu_capacity_rps) + 1)

        # With preemption: batch and background requests can be preempted
        # during traffic spikes, so they don't need dedicated capacity
        peak_rps = realtime_rps + interactive_rps
        preempt_gpus = max(1, int(peak_rps / gpu_capacity_rps) + 1)
        # Add some capacity for batch (they run when realtime is low)
        preempt_gpus += max(1, int(batch_rps / gpu_capacity_rps / 2))

        return {
            "without_preemption_gpus": no_preempt_gpus,
            "with_preemption_gpus": preempt_gpus,
            "gpu_savings": no_preempt_gpus - preempt_gpus,
            "cost_savings_pct": (1 - preempt_gpus / no_preempt_gpus) * 100,
        }

    def preemption_overhead(self, preempted_tokens: int) -> dict:
        """
        Cost of preemption: recomputing preempted request's KV cache.
        """
        # Preempted request loses its KV cache
        # Must recompute from scratch when resumed
        recompute_time_ms = preempted_tokens * 0.5  # ~0.5ms per token prefill
        wasted_tokens = preempted_tokens  # All generated tokens are lost

        return {
            "recompute_time_ms": recompute_time_ms,
            "wasted_compute_tokens": wasted_tokens,
            "recommendation": (
                "Checkpoint KV cache to CPU before preemption "
                "if preempted_tokens > 1000"
            )
        }

Preemption Cost Analysis

| Scenario | GPUs Without Preemption | GPUs With Preemption | Savings |
|---|---|---|---|
| Low batch (10% bg) | 12 | 10 | 17% |
| Medium batch (30% bg) | 16 | 11 | 31% |
| High batch (50% bg) | 20 | 12 | 40% |
| Batch-dominated (70% bg) | 24 | 13 | 46% |

GPU Instance Selection

Different GPU instances have different price-performance characteristics.

def gpu_instance_comparison(model_params_b: float) -> list:
    """
    Compare GPU instances for cost per token.
    """
    instances = [
        {
            "name": "A100 80GB",
            "memory_gb": 80,
            "bf16_tflops": 312,
            "fp8_tflops": 624,
            "cost_per_hour": 2.00,
            "memory_bw_tb_s": 2.0,
        },
        {
            "name": "H100 80GB",
            "memory_gb": 80,
            "bf16_tflops": 990,
            "fp8_tflops": 1980,
            "cost_per_hour": 3.00,
            "memory_bw_tb_s": 3.35,
        },
        {
            "name": "H200 141GB",
            "memory_gb": 141,
            "bf16_tflops": 990,
            "fp8_tflops": 1980,
            "cost_per_hour": 4.00,
            "memory_bw_tb_s": 4.8,
        },
        {
            "name": "L40S 48GB",
            "memory_gb": 48,
            "bf16_tflops": 183,
            "fp8_tflops": 366,
            "cost_per_hour": 1.00,
            "memory_bw_tb_s": 0.86,
        },
    ]

    results = []
    for inst in instances:
        model_memory = model_params_b  # GB of weights at FP8 (1 byte per parameter)
        # Ceiling division: fewest GPUs whose 85%-usable memory fits the weights
        min_gpus = max(1, int(-(-model_memory // (inst["memory_gb"] * 0.85))))
        # Round up to a power of 2 for tensor parallelism
        if min_gpus > 1:
            min_gpus = 2 ** (min_gpus - 1).bit_length()

        # Throughput scales with memory bandwidth (decode-bound)
        # and compute (prefill-bound)
        throughput_per_gpu = inst["memory_bw_tb_s"] * 200  # rough tok/s per GPU
        total_throughput = throughput_per_gpu * min_gpus

        hourly_cost = min_gpus * inst["cost_per_hour"]
        cost_per_1m = (hourly_cost / 3600 / total_throughput) * 1e6

        results.append({
            "instance": inst["name"],
            "min_gpus": min_gpus,
            "throughput": total_throughput,
            "hourly_cost": hourly_cost,
            "cost_per_1m_tokens": cost_per_1m,
        })

    return results

GPU Instance Cost Comparison (70B FP8)

| GPU | Min GPUs | Throughput (tok/s) | Hourly Cost | Cost/1M tokens |
|---|---|---|---|---|
| A100 80GB | 4 | 1,600 | $8.00 | $1.39 |
| H100 80GB | 4 | 2,680 | $12.00 | $1.24 |
| H200 141GB | 2 | 1,920 | $8.00 | $1.16 |
| L40S 48GB | 8 | 1,376 | $8.00 | $1.62 |
ℹ️ Note

H200 with 141GB HBM3e often provides the best cost per token because fewer GPUs are needed (reducing inter-GPU communication overhead) and the higher memory bandwidth (4.8 TB/s vs 3.35 TB/s for H100) directly improves decode throughput.
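The bandwidth claim can be sanity-checked against the standard decode roofline: each generated token must stream the model weights from HBM, so per-sequence decode rate is bounded by aggregate bandwidth divided by weight bytes. A rough sketch that ignores KV-cache reads and batching effects:

```python
def decode_tokens_per_sec_bound(model_params_b: float, bytes_per_param: float,
                                mem_bw_tb_s: float, num_gpus: int) -> float:
    """Bandwidth roofline: aggregate HBM bandwidth / weight bytes read per token."""
    weight_bytes = model_params_b * 1e9 * bytes_per_param
    return mem_bw_tb_s * 1e12 * num_gpus / weight_bytes

# 70B at FP8 (1 byte/param)
h100_4x = decode_tokens_per_sec_bound(70, 1, 3.35, 4)  # ~191 tok/s ceiling
h200_2x = decode_tokens_per_sec_bound(70, 1, 4.8, 2)   # ~137 tok/s ceiling

# Per dollar-hour (H100 $3/hr, H200 $4/hr): H200 wins despite the lower ceiling
print(h200_2x / 8.00, h100_4x / 12.00)  # ~17.1 vs ~16.0
```

The bound is per forward pass; batched decode amortizes the weight read across the whole batch, which is exactly why large batches are cheaper per token.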

Prefix Caching for Cost Reduction

Prefix caching avoids redundant computation for repeated system prompts, directly reducing cost.

def prefix_caching_savings(
    system_prompt_tokens: int,
    avg_user_tokens: int,
    cache_hit_rate: float,
    prefill_cost_per_token_us: float = 500  # microseconds
) -> dict:
    """
    Calculate cost savings from prefix caching.

    When system prompts are cached, we skip their prefill computation.
    This directly reduces GPU utilization and cost per request.
    """
    # Without caching: prefill all tokens
    total_prefill_tokens = system_prompt_tokens + avg_user_tokens
    prefill_time_no_cache_ms = total_prefill_tokens * prefill_cost_per_token_us / 1000

    # With caching: skip system prompt prefill on cache hits
    avg_prefill_tokens = (
        (1 - cache_hit_rate) * total_prefill_tokens +
        cache_hit_rate * avg_user_tokens
    )
    prefill_time_cached_ms = avg_prefill_tokens * prefill_cost_per_token_us / 1000

    savings_pct = (1 - prefill_time_cached_ms / prefill_time_no_cache_ms) * 100

    return {
        "prefill_time_no_cache_ms": prefill_time_no_cache_ms,
        "prefill_time_cached_ms": prefill_time_cached_ms,
        "ttft_reduction_ms": prefill_time_no_cache_ms - prefill_time_cached_ms,
        "compute_savings_pct": savings_pct,
        "effective_cost_reduction_pct": savings_pct * 0.3,
        # 30% of cost is prefill for typical workloads
    }

# Example: 500-token system prompt, 200-token user query, 90% cache hit
result = prefix_caching_savings(500, 200, 0.90)
print(f"TTFT reduction: {result['ttft_reduction_ms']:.0f}ms")
print(f"Compute savings: {result['compute_savings_pct']:.0f}%")

Prefix Cache Hit Rate vs Cost Savings

| Hit Rate | Effective Cost Reduction (%) |
|---|---|
| 0% (no caching) | 0 |
| 50% | 10 |
| 80% | 17 |
| 95% | 20 |
| 99% | 21 |

Cost Optimization Checklist

def cost_optimization_checklist() -> dict:
    """
    Ordered checklist for reducing cost per token.
    Items ordered by impact (highest first).
    """
    return {
        "1_quantization": {
            "impact": "50-75% cost reduction",
            "action": "Use FP8 or INT4 (AWQ/GPTQ)",
            "risk": "Minor quality loss (0.1-2% perplexity)",
            "effort": "Low (vLLM/TRT-LLM support built-in)",
        },
        "2_batch_size": {
            "impact": "2-10x throughput improvement",
            "action": "Maximize batch size within latency SLO",
            "risk": "Increased tail latency",
            "effort": "Low (configuration change)",
        },
        "3_autoscaling": {
            "impact": "30-50% cost reduction",
            "action": "Scale down during off-peak hours",
            "risk": "Cold start latency on scale-up",
            "effort": "Medium (need monitoring + policy)",
        },
        "4_prefix_caching": {
            "impact": "10-25% compute savings",
            "action": "Enable prefix caching for repeated system prompts",
            "risk": "Memory overhead for cache storage",
            "effort": "Low (vLLM flag)",
        },
        "5_gpu_selection": {
            "impact": "10-30% cost difference",
            "action": "Use H200 or H100 over A100 when available",
            "risk": "Availability, vendor lock-in",
            "effort": "Low (instance type change)",
        },
        "6_preemption": {
            "impact": "15-45% GPU savings for mixed workloads",
            "action": "Use preemption for batch/background requests",
            "risk": "Wasted compute on preempted requests",
            "effort": "Medium (scheduling policy)",
        },
        "7_model_selection": {
            "impact": "Variable (right-size model for task)",
            "action": "Use smaller model for simple tasks, larger for complex",
            "risk": "Quality degradation on edge cases",
            "effort": "High (requires evaluation framework)",
        },
    }

Cost Optimization Levers Summary

| Lever | Impact | Effort | Risk |
|---|---|---|---|
| Quantization (FP8/INT4) | 50-75% reduction | Low | Minor quality loss |
| Batch Size Tuning | 2-10x throughput | Low | Tail latency |
| Autoscaling | 30-50% reduction | Medium | Cold start |
| Prefix Caching | 10-25% compute savings | Low | Memory overhead |
| GPU Instance Selection | 10-30% difference | Low | Availability |
| Preemption Policy | 15-45% GPU savings | Medium | Wasted compute |
| Model Right-Sizing | Variable | High | Quality risk |

The path to minimum cost per token is: quantize first (biggest single impact), then maximize batch size within SLO, then add autoscaling, then enable prefix caching. Each lever compounds — applying all four can reduce cost by 80-90% compared to a naive BF16, batch-1, fixed-capacity deployment. Dynamo provides the infrastructure to implement all of these optimizations in a coordinated serving stack.
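The compounding is multiplicative: if each lever leaves some residual fraction of the cost, the combined cost is their product. A sketch using illustrative fractions drawn from the ranges in the checklist (actual factors are workload-dependent):

```python
# Residual cost fraction left by each lever (illustrative midpoints of the
# ranges quoted in the checklist; real factors depend on the workload)
levers = {
    "quantization":   0.375,  # 50-75% reduction -> ~37.5% of cost remains
    "batch tuning":   0.50,   # ~2x throughput -> 50% remains (conservative)
    "autoscaling":    0.60,   # 30-50% reduction -> ~60% remains
    "prefix caching": 0.825,  # 10-25% compute savings -> ~82.5% remains
}

residual = 1.0
for factor in levers.values():
    residual *= factor
print(f"Combined cost: {residual:.1%} of baseline")  # -> Combined cost: 9.3% of baseline
```

Even with the conservative batch factor, the product lands at roughly a tenth of the naive deployment's cost, consistent with the 80-90% reduction cited above.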