Part of series: NVIDIA Dynamo & llm-d (28 of 30)

Speculative decoding with K=4 draft tokens and a 75% acceptance rate generates roughly 5 tokens in 2 target forward passes instead of 5 — about a 2.5x speedup. But coordinating draft and target models across a cluster introduces network latency: if the draft model on GPU pool A generates 4 tokens, sends them to the target model on GPU pool B, and B rejects 2 of them, part of that network round-trip was wasted. Dynamo optimizes this by co-locating frequently paired draft-target instances, prefetching draft outputs during target compute, and tuning K dynamically based on observed acceptance rates. This post covers the distributed protocol and the math for optimal K selection.

Speculative Decoding Fundamentals

The core algorithm:

import random

import torch
import torch.nn.functional as F

def sample(prob: torch.Tensor) -> int:
    """Sample one token id from a probability distribution."""
    return torch.multinomial(prob, num_samples=1).item()

def speculative_decode_step(
    draft_model, target_model, input_ids: list,
    K: int, temperature: float
) -> list:
    """Generate 1 to K+1 tokens using speculative decoding."""
    # Step 1: Draft model generates K candidate tokens autoregressively
    draft_tokens, draft_probs = [], []
    current = input_ids.copy()

    for _ in range(K):
        logits = draft_model.forward(current)
        prob = F.softmax(logits[-1] / temperature, dim=-1)
        token = sample(prob)
        draft_tokens.append(token)
        draft_probs.append(prob)
        current.append(token)

    # Step 2: Target model verifies all K tokens in ONE forward pass
    # Input: original + K draft tokens
    # Output: one logit vector per position
    target_logits = target_model.forward(input_ids + draft_tokens)

    # Step 3: Accept/reject each draft token in order.
    # logits[j] predicts the token at position j+1, so draft token i
    # (sequence position len(input_ids) + i) is scored by
    # logits[len(input_ids) + i - 1].
    accepted = []
    for i in range(K):
        target_prob = F.softmax(
            target_logits[len(input_ids) + i - 1] / temperature, dim=-1
        )
        draft_prob = draft_probs[i]
        token = draft_tokens[i]

        # Acceptance criterion: accept with probability
        # min(1, p_target(token) / p_draft(token))
        acceptance_ratio = (target_prob[token] / draft_prob[token]).item()
        if random.random() < min(1.0, acceptance_ratio):
            accepted.append(token)
        else:
            # Reject: resample from the adjusted distribution
            # max(0, p_target - p_draft), renormalized
            adjusted = torch.clamp(target_prob - draft_prob, min=0)
            adjusted = adjusted / adjusted.sum()
            accepted.append(sample(adjusted))
            break  # Stop at first rejection

    # Step 4: If all K were accepted, sample one bonus token from the
    # target's final position (it predicts position len(input_ids) + K)
    if len(accepted) == K:
        final_prob = F.softmax(target_logits[-1] / temperature, dim=-1)
        accepted.append(sample(final_prob))

    return accepted  # 1 to K+1 tokens

The expected number of tokens per step:

E[tokens per step] = (1 - α^(K+1)) / (1 - α)

where α is the per-token acceptance rate.

📊

Expected Tokens per Speculative Step

Acceptance Rate   K=3    K=5    K=7    K=10
0.5               1.88   1.97   1.99   2.00
0.7               2.53   2.94   3.14   3.27
0.8               2.95   3.69   4.16   4.57
0.9               3.44   4.69   5.70   6.86
0.95              3.71   5.30   6.73   8.62
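As a sanity check, the formula can be evaluated directly (the helper name here is ours, not Dynamo's):

```python
def expected_tokens(alpha: float, K: int) -> float:
    """Expected tokens per speculation round: (1 - alpha^(K+1)) / (1 - alpha)."""
    return (1 - alpha ** (K + 1)) / (1 - alpha)

print(round(expected_tokens(0.8, 5), 2))   # 3.69
print(round(expected_tokens(0.5, 3), 2))   # 1.88
```

As α approaches 1 this approaches K + 1; for small α it saturates at 1/(1 - α), which is why deep speculation pays off only when the draft model agrees with the target.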

Cluster Architecture

Dynamo deploys speculative decoding across separate GPU pools:

                    ┌─────────────────────────────┐
                    │        Dynamo Router         │
                    │    (request dispatcher)       │
                    └─────────┬───────────────────┘

              ┌───────────────┴──────────────────┐
              │                                  │
    ┌─────────▼─────────┐            ┌──────────▼──────────┐
    │  Draft Pool        │            │  Target Pool        │
    │  4x A10G (24GB)    │            │  4x A100 (80GB)     │
    │  Llama 7B FP16     │            │  Llama 70B FP16     │
    │  High throughput    │            │  TP=4               │
    │  Low latency        │            │  Verification only   │
    └─────────┬──────────┘            └──────────▲──────────┘
              │                                  │
              │         draft tokens             │
              └──────────────────────────────────┘

A representative pool configuration:

class SpeculativeClusterConfig:
    def __init__(self):
        # Draft model: small, fast, cheap GPUs
        self.draft_model = "meta-llama/Llama-2-7b-hf"
        self.draft_gpus = ["a10g-0", "a10g-1", "a10g-2", "a10g-3"]
        self.draft_tp = 1  # No TP needed for 7B

        # Target model: large, expensive GPUs
        self.target_model = "meta-llama/Llama-2-70b-hf"
        self.target_gpus = ["a100-0", "a100-1", "a100-2", "a100-3"]
        self.target_tp = 4  # TP=4 across A100s

        # Speculation parameters
        self.K = 5  # Draft 5 tokens
        self.temperature = 0.7
ℹ️ Note

The draft model runs on cheaper GPUs (A10G at approximately $1/hr) while the target model uses expensive A100s (approximately $3/hr). Since the draft model generates tokens autoregressively (K sequential forward passes), it needs low latency per token. The target model only runs one forward pass per K tokens, so its utilization is lower but each pass is expensive.

Coordination Protocol

The network protocol between draft and target pools:

class SpeculativeCoordinator:
    def __init__(self, draft_pool, target_pool, K: int):
        self.draft_pool = draft_pool
        self.target_pool = target_pool
        self.K = K

    async def process_request(self, request):
        """End-to-end speculative decoding for one request."""
        input_ids = request.token_ids
        output_tokens = []

        while len(output_tokens) < request.max_tokens:
            # Phase 1: Draft generation (on draft pool)
            draft_result = await self.draft_pool.generate_draft(
                input_ids=input_ids + output_tokens,
                num_tokens=self.K,
                return_probs=True
            )

            # Phase 2: Verification (on target pool)
            verify_result = await self.target_pool.verify(
                input_ids=input_ids + output_tokens,
                draft_tokens=draft_result.tokens,
                draft_probs=draft_result.probs,
                temperature=request.temperature
            )

            # Phase 3: Accept tokens
            output_tokens.extend(verify_result.accepted_tokens)

            if verify_result.has_eos:
                break

        return output_tokens
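A runnable sketch of the same flow with stubbed pools. The pool classes, result shapes, and the 3-of-5 acceptance pattern below are illustrative assumptions, not Dynamo's actual API:

```python
import asyncio
from dataclasses import dataclass

@dataclass
class DraftResult:
    tokens: list
    probs: list

@dataclass
class VerifyResult:
    accepted_tokens: list
    has_eos: bool

class StubDraftPool:
    """Stands in for the draft-pool RPC client."""
    async def generate_draft(self, input_ids, num_tokens, return_probs=True):
        await asyncio.sleep(0)  # pretend network hop
        return DraftResult(tokens=[101] * num_tokens, probs=[None] * num_tokens)

class StubTargetPool:
    """Stands in for the target-pool RPC client; accepts 3 of K drafts."""
    async def verify(self, input_ids, draft_tokens, draft_probs, temperature):
        await asyncio.sleep(0)
        return VerifyResult(accepted_tokens=draft_tokens[:3], has_eos=False)

async def decode(max_tokens=9, K=5):
    draft_pool, target_pool = StubDraftPool(), StubTargetPool()
    input_ids, output_tokens, rounds = [1, 2, 3], [], 0
    while len(output_tokens) < max_tokens:
        draft = await draft_pool.generate_draft(input_ids + output_tokens, K)
        verify = await target_pool.verify(input_ids + output_tokens,
                                          draft.tokens, draft.probs,
                                          temperature=0.7)
        output_tokens.extend(verify.accepted_tokens)
        rounds += 1
        if verify.has_eos:
            break
    return output_tokens, rounds

tokens, rounds = asyncio.run(decode())
print(len(tokens), rounds)   # 9 3
```

With 3 tokens accepted per round, 9 output tokens take 3 speculation rounds — each round being one draft RPC plus one verify RPC.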

Network Transfer Analysis

Each speculation round involves up to three network transfers:

# Transfer 1: Draft tokens + probs from draft pool to coordinator
# K token IDs: K * 4 bytes
# K probability distributions: K * vocab_size * 4 bytes
# For K=5, vocab=32000: 5 * 4 + 5 * 32000 * 4 = 640 KB

# Transfer 2: Input IDs + draft tokens to target pool
# (Input + output so far + K draft) token IDs for KV cache lookup
# Just the new K token IDs if KV cache is maintained: K * 4 = 20 bytes

# Transfer 3: Verification result back
# Accepted token IDs: up to (K+1) * 4 bytes = 24 bytes
# Target logits (if needed for next draft adjustment): K * vocab * 4 = 640 KB
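The per-transfer times in the table below follow directly from size × 8 / bandwidth (a hypothetical helper, ignoring propagation latency):

```python
def transfer_ms(size_bytes: float, gbps: float) -> float:
    """Wire time in milliseconds for size_bytes at a given link rate."""
    return size_bytes * 8 / (gbps * 1e9) * 1e3

print(round(transfer_ms(640_000, 100), 2))   # 0.05
print(round(transfer_ms(640_000, 25), 2))    # 0.2
```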
📊

Network Transfer per Speculation Round

Transfer            Direction              Size     Time @ 100 Gbps  Time @ 25 Gbps
Draft probs         Draft -> Coordinator   640 KB   0.05 ms          0.20 ms
Verification input  Coordinator -> Target  20 B     ~0 ms            ~0 ms
Target logits       Target -> Coordinator  640 KB   0.05 ms          0.20 ms
Total round-trip    ---                    1.28 MB  0.10 ms          0.40 ms
Performance

The 640 KB probability transfers can be eliminated by sending only the draft token IDs (20 bytes) and having the target pool compute both the target and draft probabilities locally. This requires the target pool to recompute the draft distribution itself (in practice, hosting a copy of the small draft model), adding modest memory cost but reducing network transfer by over 99.9% (about 44 bytes instead of 1.28 MB per round).

Optimal Speculation Depth K

The optimal K depends on the acceptance rate α, the per-token draft latency t_d, and the target verification latency t_v:

def compute_speedup(alpha: float, K: int,
                    t_draft: float, t_verify: float,
                    t_network: float) -> float:
    """Compute speedup of speculative vs standard decoding.

    Args:
        alpha: per-token acceptance rate
        K: speculation depth
        t_draft: time for one draft token (ms)
        t_verify: time for target verification of K tokens (ms)
        t_network: round-trip network latency (ms)
    Returns:
        speedup ratio
    """
    # Expected tokens per speculation round
    expected_tokens = (1 - alpha**(K + 1)) / (1 - alpha)

    # Time per speculation round
    spec_time = K * t_draft + t_verify + t_network

    # Standard decoding time for same number of tokens
    standard_time = expected_tokens * t_verify

    return standard_time / spec_time
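For example, evaluating this model at α = 0.6, K = 3 with t_draft = 2 ms, t_verify = 15 ms, t_net = 0.5 ms, written out inline:

```python
alpha, K = 0.6, 3
t_draft, t_verify, t_net = 2.0, 15.0, 0.5

expected = (1 - alpha ** (K + 1)) / (1 - alpha)   # ~2.18 tokens per round
speedup = (expected * t_verify) / (K * t_draft + t_verify + t_net)
print(round(speedup, 2))   # 1.52
```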
📊

Speedup by K and Acceptance Rate (t_draft=2ms, t_verify=15ms, t_net=0.5ms)

K     alpha=0.6  alpha=0.7  alpha=0.8  alpha=0.9
3     1.52x      1.82x      2.15x      2.44x
5     1.48x      1.86x      2.36x      2.91x
7     1.38x      1.79x      2.39x      3.15x
10    1.22x      1.64x      2.32x      3.30x
15    1.02x      1.40x      2.10x      3.30x

Optimal K by Acceptance Rate

  alpha=0.6: K=3  (1.52x)
  alpha=0.7: K=5  (1.86x)
  alpha=0.8: K=7  (2.39x)
  alpha=0.9: K=7  (3.15x), K=10 (3.30x)

Key observations:

  1. At α = 0.6, K > 5 actually hurts throughput because too many rejected draft tokens waste draft compute.
  2. At α = 0.9, increasing K to 10 provides diminishing returns (3.30x vs 3.15x at K=7).
  3. The draft model latency t_d is critical: if t_d doubles (4 ms instead of 2 ms), the optimal K drops by 1-2.

Adaptive Speculation Depth

Dynamo implements adaptive K based on observed acceptance rates:

class AdaptiveSpeculationController:
    def __init__(self, min_k: int = 1, max_k: int = 10,
                 window_size: int = 100):
        self.min_k = min_k
        self.max_k = max_k
        self.window_size = window_size
        self.acceptance_history = []
        self.current_k = 5  # Start with K=5

    def update(self, num_accepted: int, num_proposed: int):
        """Update K based on recent acceptance rate."""
        rate = num_accepted / num_proposed if num_proposed > 0 else 0
        self.acceptance_history.append(rate)

        if len(self.acceptance_history) > self.window_size:
            self.acceptance_history.pop(0)

        if len(self.acceptance_history) < 10:
            return  # Not enough data

        avg_alpha = sum(self.acceptance_history) / len(self.acceptance_history)

        # Adjust K based on acceptance rate
        if avg_alpha > 0.9 and self.current_k < self.max_k:
            self.current_k += 1
        elif avg_alpha > 0.8:
            pass  # Keep current K
        elif avg_alpha > 0.6 and self.current_k > 3:
            self.current_k -= 1
        elif avg_alpha <= 0.6 and self.current_k > self.min_k:
            self.current_k = max(self.min_k, self.current_k - 2)

    @property
    def K(self) -> int:
        return self.current_k

KV Cache Management for Speculation

Speculative decoding creates a unique KV cache challenge: draft tokens occupy cache slots during verification, but rejected tokens must have their cache entries discarded:

class SpeculativeKVCacheManager:
    def __init__(self, block_manager):
        self.block_manager = block_manager

    def allocate_speculative_blocks(self, seq_id: int,
                                    K: int) -> list:
        """Pre-allocate enough KV blocks to hold K draft tokens.

        K // block_size + 1 blocks: the extra block covers drafts that
        spill past the sequence's current partially-filled block.
        """
        num_blocks = K // self.block_manager.block_size + 1
        return [self.block_manager.allocate() for _ in range(num_blocks)]

    def commit_accepted(self, seq_id: int,
                        num_accepted: int, K: int):
        """Keep KV cache for accepted tokens, discard the rest."""
        total_allocated = K
        num_rejected = total_allocated - num_accepted

        if num_rejected > 0:
            # Rollback KV cache for rejected tokens
            # This means adjusting the sequence length counter
            # The physical blocks remain allocated but the
            # slot pointers are rewound
            self.block_manager.rollback_tokens(
                seq_id, num_rejected
            )
⚠️ Warning

With K=5 and 256 concurrent sequences, speculation pre-allocates 256 × 5 = 1,280 extra token slots. At 16 tokens per block, that is 80 additional blocks. For Llama 70B, this is 80 × 5.24 MB = 419 MB of speculative KV cache that may be discarded. This overhead must be accounted for in the memory budget.
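The warning's arithmetic, spelled out. The per-block size assumes a Llama 70B-like geometry (80 layers, 8 KV heads with GQA, head dim 128, FP16), which is our assumption here:

```python
# Speculative KV cache overhead. Assumed Llama 70B geometry:
# 80 layers, 8 KV heads (GQA), head dim 128, FP16 (2 bytes).
LAYERS, KV_HEADS, HEAD_DIM, BYTES = 80, 8, 128, 2
BLOCK_TOKENS = 16

bytes_per_token = LAYERS * 2 * KV_HEADS * HEAD_DIM * BYTES  # K and V planes
block_mb = BLOCK_TOKENS * bytes_per_token / 1e6             # ~5.24 MB/block

K, num_seqs = 5, 256
extra_slots = K * num_seqs                   # 1280 speculative token slots
extra_blocks = extra_slots // BLOCK_TOKENS   # 80 blocks
print(round(extra_blocks * block_mb))        # 419 (MB)
```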

Draft Model Selection

The choice of draft model determines the acceptance rate. Options:

# Option 1: Smaller model from same family
# Draft: Llama 7B, Target: Llama 70B
# Acceptance rate: ~0.7-0.8 for general text
# Acceptance rate: ~0.5-0.6 for code/math

# Option 2: Distilled draft model
# Draft: Custom 1B distilled from 70B
# Acceptance rate: ~0.8-0.9 (higher agreement)
# Cost: requires distillation training

# Option 3: n-gram model (Medusa-style)
# Draft: Learned n-gram heads on target model
# Acceptance rate: ~0.6-0.7
# Cost: minimal additional parameters

# Option 4: Self-speculation (layer skipping)
# Draft: target model with early exit after layer N/3
# Acceptance rate: ~0.75-0.85
# Cost: no additional model, but target GPU must run draft too
📊

Draft Model Options for Llama 70B Target

Draft Model                   Draft Params  Draft Latency (ms)  Acceptance Rate  Overall Speedup
Llama 7B                      7B            2.1                 0.75             2.2x
Llama 1B (distilled)          1.3B          0.8                 0.82             2.8x
n-gram heads                  50M           0.3                 0.65             2.0x
Self-spec (L=26/80)           70B partial   5.2                 0.88             2.1x
Eagle (autoregressive head)   0.5B          0.5                 0.78             2.7x

Batched Verification

The target model verifies multiple sequences’ draft tokens simultaneously:

class BatchedVerifier:
    def __init__(self, target_model, max_batch: int):
        self.target_model = target_model
        self.max_batch = max_batch

    def verify_batch(self, batch_drafts: list) -> list:
        """Verify draft tokens for a batch of sequences.

        Args:
            batch_drafts: list of (seq_id, input_ids, draft_tokens, draft_probs)
        Returns:
            list of (seq_id, accepted_tokens)
        """
        # Pad all sequences to same length for batched forward pass
        max_len = max(
            len(d[1]) + len(d[2]) for d in batch_drafts
        )

        input_batch = torch.zeros(len(batch_drafts), max_len, dtype=torch.long)
        attention_mask = torch.zeros(len(batch_drafts), max_len)

        for i, (_, input_ids, draft_tokens, _) in enumerate(batch_drafts):
            full_seq = input_ids + draft_tokens
            input_batch[i, :len(full_seq)] = torch.tensor(full_seq)
            attention_mask[i, :len(full_seq)] = 1.0

        # Single batched forward pass
        logits = self.target_model.forward(
            input_batch, attention_mask=attention_mask
        )

        # Per-sequence acceptance
        results = []
        for i, (seq_id, input_ids, draft_tokens, draft_probs) in enumerate(batch_drafts):
            accepted = self._verify_single(
                logits[i], len(input_ids), draft_tokens, draft_probs
            )
            results.append((seq_id, accepted))

        return results

Batched verification amortizes the target model forward pass across multiple sequences. With 64 sequences each carrying K=5 draft tokens, the target model processes a batch with 64 rows of up to K new tokens — similar in cost to a standard decode step over the same batch.

Production Benchmarks

📊

End-to-End Speculative Decoding — Dynamo Cluster

Config                                Throughput (tok/s)  Avg Latency/tok (ms)  Speedup vs Baseline  Cost/M Tokens
70B standard (4xA100)                 5,120               12.5                  1.0x                 $0.65
70B + 7B spec K=5 (4xA100 + 4xA10G)   9,850               6.8                   1.92x                $0.50
70B + 1B spec K=7 (4xA100 + 1xA10G)   12,400              5.4                   2.42x                $0.37
70B + Eagle K=5 (4xA100)              11,200              5.9                   2.19x                $0.32
70B standard (4xH100)                 7,840               8.2                   1.53x (vs A100)      $0.35


The 1B distilled draft model with K=7 achieves a 2.42x speedup, bringing the cost per million tokens to $0.37 — comparable to H100 without speculation, but on cheaper A100 hardware.

Failure Modes and Mitigations

# Failure mode 1: Low acceptance rate
# Cause: draft model too different from target
# Symptom: avg accepted < 2 tokens per round
# Fix: increase temperature, use better draft model, reduce K

# Failure mode 2: Draft model bottleneck
# Cause: draft model too slow (large draft or slow GPU)
# Symptom: draft time > target verification time
# Fix: use smaller draft, faster GPU, or self-speculation

# Failure mode 3: Network latency dominates
# Cause: draft and target on different datacenters
# Symptom: network round-trip > draft generation time
# Fix: co-locate draft and target, or use self-speculation

# Failure mode 4: KV cache exhaustion
# Cause: speculative blocks waste too much memory
# Symptom: preemptions increase with speculation enabled
# Fix: reduce K, reduce max_num_seqs, or add GPU memory

Summary

Dynamo's cluster-scale speculative decoding separates draft and target models onto different GPU pools, coordinating through a lightweight network protocol. The draft pool (cheap GPUs running a small model) proposes K tokens per round, and the target pool (expensive GPUs running the full model) verifies them in a single batched forward pass. The optimal K depends on the acceptance rate: K=5 for α=0.7, K=7 for α=0.8-0.9. Adaptive K tracking adjusts speculation depth based on rolling acceptance statistics. A 1B distilled draft model with K=7 achieves a 2.42x throughput improvement over standard decoding, reducing cost per million tokens from $0.65 to $0.37. The primary constraints are draft model quality (acceptance rate), draft model speed (sequential generation), and speculative KV cache overhead (approximately K × batch_size extra token slots).