Speculative decoding with K=4 draft tokens and 75% acceptance rate generates 5 tokens in 2 forward passes instead of 5 passes—a 2.5x speedup. But coordinating draft and target models across a cluster introduces network latency: if the draft model on GPU pool A generates 4 tokens, sends them to the target model on GPU pool B, and B rejects 2 of them, you’ve wasted network round-trips. Dynamo optimizes this by co-locating frequently-paired draft-target instances, prefetching draft outputs during target compute, and tuning K dynamically based on acceptance rates. This post covers the distributed protocol and the math for optimal K selection.
Speculative Decoding Fundamentals
The core algorithm:
def speculative_decode_step(
draft_model, target_model, input_ids: list,
K: int, temperature: float
) -> list:
"""Generate up to K+1 tokens using speculative decoding."""
# Step 1: Draft model generates K candidate tokens
draft_tokens = []
draft_probs = []
current = input_ids.copy()
for _ in range(K):
logits = draft_model.forward(current)
prob = softmax(logits[-1] / temperature)
token = sample(prob)
draft_tokens.append(token)
draft_probs.append(prob)
current.append(token)
# Step 2: Target model verifies all K tokens in ONE forward pass
# Input: original + K draft tokens
# Output: logits for all K+1 positions
target_logits = target_model.forward(input_ids + draft_tokens)
# Step 3: Accept/reject each draft token
accepted = []
for i in range(K):
target_prob = softmax(target_logits[len(input_ids) + i] / temperature)
draft_prob = draft_probs[i]
token = draft_tokens[i]
# Acceptance criterion
acceptance_ratio = target_prob[token] / draft_prob[token]
if random.random() < min(1.0, acceptance_ratio):
accepted.append(token)
else:
# Reject: sample from adjusted distribution
adjusted = torch.clamp(target_prob - draft_prob, min=0)
adjusted = adjusted / adjusted.sum()
replacement = sample(adjusted)
accepted.append(replacement)
break # Stop at first rejection
# Step 4: If all K accepted, sample one more from target
if len(accepted) == K:
final_prob = softmax(target_logits[-1] / temperature)
accepted.append(sample(final_prob))
return accepted # 1 to K+1 tokens
The expected number of tokens per step:
where is the per-token acceptance rate.
Expected Tokens per Speculative Step
| Acceptance Rate | K=3 | K=5 | K=7 | K=10 |
|---|---|---|---|---|
| 0.5 | 1.88 | 1.97 | 1.99 | 2.00 |
| 0.7 | 2.76 | 3.16 | 3.29 | 3.33 |
| 0.8 | 3.36 | 4.12 | 4.50 | 4.76 |
| 0.9 | 3.81 | 5.10 | 6.10 | 7.18 |
| 0.95 | 3.95 | 5.42 | 6.78 | 8.58 |
Cluster Architecture
Dynamo deploys speculative decoding across separate GPU pools:
┌─────────────────────────────┐
│ Dynamo Router │
│ (request dispatcher) │
└─────────┬───────────────────┘
│
┌───────────────┴──────────────────┐
│ │
┌─────────▼─────────┐ ┌──────────▼──────────┐
│ Draft Pool │ │ Target Pool │
│ 4x A10G (24GB) │ │ 4x A100 (80GB) │
│ Llama 7B FP16 │ │ Llama 70B FP16 │
│ High throughput │ │ TP=4 │
│ Low latency │ │ Verification only │
└─────────┬──────────┘ └──────────▲──────────┘
│ │
│ draft tokens │
└──────────────────────────────────┘
class SpeculativeClusterConfig:
def __init__(self):
# Draft model: small, fast, cheap GPUs
self.draft_model = "meta-llama/Llama-2-7b-hf"
self.draft_gpus = ["a10g-0", "a10g-1", "a10g-2", "a10g-3"]
self.draft_tp = 1 # No TP needed for 7B
# Target model: large, expensive GPUs
self.target_model = "meta-llama/Llama-2-70b-hf"
self.target_gpus = ["a100-0", "a100-1", "a100-2", "a100-3"]
self.target_tp = 4 # TP=4 across A100s
# Speculation parameters
self.K = 5 # Draft 5 tokens
self.temperature = 0.7
The draft model runs on cheaper GPUs (A10G at approximately 3/hr). Since the draft model generates tokens autoregressively (K sequential forward passes), it needs low latency per token. The target model only runs one forward pass per K tokens, so its utilization is lower but each pass is expensive.
Coordination Protocol
The network protocol between draft and target pools:
class SpeculativeCoordinator:
def __init__(self, draft_pool, target_pool, K: int):
self.draft_pool = draft_pool
self.target_pool = target_pool
self.K = K
async def process_request(self, request):
"""End-to-end speculative decoding for one request."""
input_ids = request.token_ids
output_tokens = []
while len(output_tokens) < request.max_tokens:
# Phase 1: Draft generation (on draft pool)
draft_result = await self.draft_pool.generate_draft(
input_ids=input_ids + output_tokens,
num_tokens=self.K,
return_probs=True
)
# Phase 2: Verification (on target pool)
verify_result = await self.target_pool.verify(
input_ids=input_ids + output_tokens,
draft_tokens=draft_result.tokens,
draft_probs=draft_result.probs,
temperature=request.temperature
)
# Phase 3: Accept tokens
output_tokens.extend(verify_result.accepted_tokens)
if verify_result.has_eos:
break
return output_tokens
Network Transfer Analysis
Each speculation round requires two network transfers:
# Transfer 1: Draft tokens + probs from draft pool to coordinator
# K token IDs: K * 4 bytes
# K probability distributions: K * vocab_size * 4 bytes
# For K=5, vocab=32000: 5 * 4 + 5 * 32000 * 4 = 640 KB
# Transfer 2: Input IDs + draft tokens to target pool
# (Input + output so far + K draft) token IDs for KV cache lookup
# Just the new K token IDs if KV cache is maintained: K * 4 = 20 bytes
# Transfer 3: Verification result back
# Accepted token IDs: up to (K+1) * 4 bytes = 24 bytes
# Target logits (if needed for next draft adjustment): K * vocab * 4 = 640 KB
Network Transfer per Speculation Round
| Transfer | Direction | Size | Time @ 100Gbps | Time @ 25Gbps |
|---|---|---|---|---|
| Draft probs | Draft -> Coordinator | 640 KB | 0.05 ms | 0.20 ms |
| Verification input | Coordinator -> Target | 20 B | ~0 ms | ~0 ms |
| Target logits | Target -> Coordinator | 640 KB | 0.05 ms | 0.20 ms |
| Total round-trip | --- | 1.28 MB | 0.10 ms | 0.40 ms |
The 640 KB probability transfers can be eliminated by sending only the draft token IDs (20 bytes) and having the target pool compute both the target and draft probabilities locally. This requires the target pool to also load the draft model’s LM head, adding minimal memory cost but reducing network transfer by 99.7%.
Optimal Speculation Depth K
The optimal depends on the acceptance rate , draft model latency , and target verification latency :
def compute_speedup(alpha: float, K: int,
t_draft: float, t_verify: float,
t_network: float) -> float:
"""Compute speedup of speculative vs standard decoding.
Args:
alpha: per-token acceptance rate
K: speculation depth
t_draft: time for one draft token (ms)
t_verify: time for target verification of K tokens (ms)
t_network: round-trip network latency (ms)
Returns:
speedup ratio
"""
# Expected tokens per speculation round
expected_tokens = (1 - alpha**(K + 1)) / (1 - alpha)
# Time per speculation round
spec_time = K * t_draft + t_verify + t_network
# Standard decoding time for same number of tokens
standard_time = expected_tokens * t_verify
return standard_time / spec_time
Speedup by K and Acceptance Rate (t_draft=2ms, t_verify=15ms, t_net=0.5ms)
| K | alpha=0.6 | alpha=0.7 | alpha=0.8 | alpha=0.9 |
|---|---|---|---|---|
| 3 | 1.52x | 1.82x | 2.15x | 2.44x |
| 5 | 1.48x | 1.86x | 2.36x | 2.91x |
| 7 | 1.38x | 1.79x | 2.39x | 3.15x |
| 10 | 1.22x | 1.64x | 2.32x | 3.30x |
| 15 | 1.02x | 1.40x | 2.10x | 3.30x |
Optimal K by Acceptance Rate
Key observations:
- At , K > 5 actually hurts throughput because too many rejected draft tokens waste draft compute.
- At , increasing to 10 provides diminishing returns (3.30x vs 3.15x at ).
- The draft model latency is critical: if doubles (4ms instead of 2ms), optimal drops by 1-2.
Adaptive Speculation Depth
Dynamo implements adaptive based on observed acceptance rates:
class AdaptiveSpeculationController:
def __init__(self, min_k: int = 1, max_k: int = 10,
window_size: int = 100):
self.min_k = min_k
self.max_k = max_k
self.window_size = window_size
self.acceptance_history = []
self.current_k = 5 # Start with K=5
def update(self, num_accepted: int, num_proposed: int):
"""Update K based on recent acceptance rate."""
rate = num_accepted / num_proposed if num_proposed > 0 else 0
self.acceptance_history.append(rate)
if len(self.acceptance_history) > self.window_size:
self.acceptance_history.pop(0)
if len(self.acceptance_history) < 10:
return # Not enough data
avg_alpha = sum(self.acceptance_history) / len(self.acceptance_history)
# Adjust K based on acceptance rate
if avg_alpha > 0.9 and self.current_k < self.max_k:
self.current_k += 1
elif avg_alpha > 0.8:
pass # Keep current K
elif avg_alpha > 0.6 and self.current_k > 3:
self.current_k -= 1
elif avg_alpha <= 0.6 and self.current_k > self.min_k:
self.current_k = max(self.min_k, self.current_k - 2)
@property
def K(self) -> int:
return self.current_k
KV Cache Management for Speculation
Speculative decoding creates a unique KV cache challenge: draft tokens occupy cache slots during verification, but rejected tokens must have their cache entries discarded:
class SpeculativeKVCacheManager:
def __init__(self, block_manager):
self.block_manager = block_manager
def allocate_speculative_blocks(self, seq_id: int,
K: int) -> list:
"""Pre-allocate K blocks for draft tokens."""
blocks = []
for _ in range(K // self.block_manager.block_size + 1):
block = self.block_manager.allocate()
blocks.append(block)
return blocks
def commit_accepted(self, seq_id: int,
num_accepted: int, K: int):
"""Keep KV cache for accepted tokens, discard the rest."""
total_allocated = K
num_rejected = total_allocated - num_accepted
if num_rejected > 0:
# Rollback KV cache for rejected tokens
# This means adjusting the sequence length counter
# The physical blocks remain allocated but the
# slot pointers are rewound
self.block_manager.rollback_tokens(
seq_id, num_rejected
)
With and 256 concurrent sequences, speculation pre-allocates extra token slots. At 16 tokens per block, that is 80 additional blocks. For Llama 70B, this is of speculative KV cache that may be discarded. This overhead must be accounted for in the memory budget.
Draft Model Selection
The choice of draft model determines the acceptance rate. Options:
# Option 1: Smaller model from same family
# Draft: Llama 7B, Target: Llama 70B
# Acceptance rate: ~0.7-0.8 for general text
# Acceptance rate: ~0.5-0.6 for code/math
# Option 2: Distilled draft model
# Draft: Custom 1B distilled from 70B
# Acceptance rate: ~0.8-0.9 (higher agreement)
# Cost: requires distillation training
# Option 3: n-gram model (Medusa-style)
# Draft: Learned n-gram heads on target model
# Acceptance rate: ~0.6-0.7
# Cost: minimal additional parameters
# Option 4: Self-speculation (layer skipping)
# Draft: target model with early exit after layer N/3
# Acceptance rate: ~0.75-0.85
# Cost: no additional model, but target GPU must run draft too
Draft Model Options for Llama 70B Target
| Draft Model | Draft Params | Draft Latency (ms) | Acceptance Rate | Overall Speedup |
|---|---|---|---|---|
| Llama 7B | 7B | 2.1 | 0.75 | 2.2x |
| Llama 1B (distilled) | 1.3B | 0.8 | 0.82 | 2.8x |
| n-gram heads | 50M | 0.3 | 0.65 | 2.0x |
| Self-spec (L=26/80) | 70B partial | 5.2 | 0.88 | 2.1x |
| Eagle (autoregressive head) | 0.5B | 0.5 | 0.78 | 2.7x |
Batched Verification
The target model verifies multiple sequences’ draft tokens simultaneously:
class BatchedVerifier:
def __init__(self, target_model, max_batch: int):
self.target_model = target_model
self.max_batch = max_batch
def verify_batch(self, batch_drafts: list) -> list:
"""Verify draft tokens for a batch of sequences.
Args:
batch_drafts: list of (seq_id, input_ids, draft_tokens, draft_probs)
Returns:
list of (seq_id, accepted_tokens)
"""
# Pad all sequences to same length for batched forward pass
max_len = max(
len(d[1]) + len(d[2]) for d in batch_drafts
)
input_batch = torch.zeros(len(batch_drafts), max_len, dtype=torch.long)
attention_mask = torch.zeros(len(batch_drafts), max_len)
for i, (_, input_ids, draft_tokens, _) in enumerate(batch_drafts):
full_seq = input_ids + draft_tokens
input_batch[i, :len(full_seq)] = torch.tensor(full_seq)
attention_mask[i, :len(full_seq)] = 1.0
# Single batched forward pass
logits = self.target_model.forward(
input_batch, attention_mask=attention_mask
)
# Per-sequence acceptance
results = []
for i, (seq_id, input_ids, draft_tokens, draft_probs) in enumerate(batch_drafts):
accepted = self._verify_single(
logits[i], len(input_ids), draft_tokens, draft_probs
)
results.append((seq_id, accepted))
return results
Batched verification amortizes the target model forward pass across multiple sequences. With 64 sequences each having draft tokens, the target model processes a batch with 64 rows of up to new tokens — similar cost to a standard decode step.
Production Benchmarks
End-to-End Speculative Decoding — Dynamo Cluster
| Config | Throughput (tok/s) | Avg Latency/tok (ms) | Speedup vs Baseline | Cost/M Tokens |
|---|---|---|---|---|
| 70B standard (4xA100) | 5,120 | 12.5 | 1.0x | $0.65 |
| 70B + 7B spec K=5 (4xA100 + 4xA10G) | 9,850 | 6.8 | 1.92x | $0.50 |
| 70B + 1B spec K=7 (4xA100 + 1xA10G) | 12,400 | 5.4 | 2.42x | $0.37 |
| 70B + Eagle K=5 (4xA100) | 11,200 | 5.9 | 2.19x | $0.32 |
| 70B standard (4xH100) | 7,840 | 8.2 | 1.53x (vs A100) | $0.35 |
Throughput: Speculative vs Standard Decoding
The 1B distilled draft model with achieves 2.42x speedup, bringing the cost per million tokens to $0.37 — comparable to H100 without speculation but on cheaper A100 hardware.
Failure Modes and Mitigations
# Failure mode 1: Low acceptance rate
# Cause: draft model too different from target
# Symptom: avg accepted < 2 tokens per round
# Fix: increase temperature, use better draft model, reduce K
# Failure mode 2: Draft model bottleneck
# Cause: draft model too slow (large draft or slow GPU)
# Symptom: draft time > target verification time
# Fix: use smaller draft, faster GPU, or self-speculation
# Failure mode 3: Network latency dominates
# Cause: draft and target on different datacenters
# Symptom: network round-trip > draft generation time
# Fix: co-locate draft and target, or use self-speculation
# Failure mode 4: KV cache exhaustion
# Cause: speculative blocks waste too much memory
# Symptom: preemptions increase with speculation enabled
# Fix: reduce K, reduce max_num_seqs, or add GPU memory
Summary
Dynamo’s cluster-scale speculative decoding separates draft and target models onto different GPU pools, coordinating through a lightweight network protocol. The draft pool (cheap GPUs running a small model) proposes tokens per round, and the target pool (expensive GPUs running the full model) verifies them in a single batched forward pass. Optimal depends on the acceptance rate: for , for . Adaptive tracking adjusts speculation depth based on rolling acceptance statistics. A 1B distilled draft model with achieves 2.42x throughput improvement over standard decoding, reducing cost per million tokens from 0.37. The primary constraints are draft model quality (acceptance rate), draft model speed (sequential generation), and speculative KV cache overhead (approximately extra token slots).