Speculative decoding generates draft tokens with a fast model and verifies them in a single forward pass of the target model. If the draft model is cheap enough and its acceptance rate is high enough, the wall-clock latency per token decreases because the verification step processes multiple tokens in parallel (as a prefill-like batch) instead of generating them one at a time.
The draft model’s cost is the critical variable. A smaller draft model is cheaper per token but has a lower acceptance rate. Quantization offers a third axis: keep the same architecture but compress it, reducing per-token cost without reducing the model’s knowledge as much as shrinking the parameter count. An INT4 draft model of the same architecture as the target can be 4x smaller and 2-3x faster per token while maintaining a higher acceptance rate than a smaller FP16 model of equivalent memory size.
This post analyzes the mathematics of quantized draft models, the system design for co-locating draft and target on the same GPU, and the measured performance.
Speculative Decoding Fundamentals
The Acceptance-Rejection Mechanism
Given a target model $p$ and a draft model $q$, speculative decoding generates $k$ draft tokens autoregressively from $q$, then verifies all $k$ tokens in one forward pass of $p$. The acceptance probability for each draft token $x_i$ is:

$$\alpha_i = \min\left(1, \frac{p(x_i \mid x_{<i})}{q(x_i \mid x_{<i})}\right)$$

If a draft token is rejected, it is replaced by a sample from the residual distribution:

$$p_{\text{res}}(x) = \frac{\max\left(0,\; p(x \mid x_{<i}) - q(x \mid x_{<i})\right)}{\sum_{x'} \max\left(0,\; p(x' \mid x_{<i}) - q(x' \mid x_{<i})\right)}$$

The expected number of tokens emitted per speculation round (including the one extra token from the residual sample on rejection, or a bonus target sample on full acceptance) is:

$$\mathbb{E}[\text{tokens per round}] = 1 + \sum_{i=1}^{k} \prod_{j=1}^{i} \alpha_j$$

For a constant acceptance rate $\alpha$, this simplifies to:

$$\mathbb{E}[\text{tokens per round}] = \frac{1 - \alpha^{k+1}}{1 - \alpha}$$
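Before building the latency model, the closed form is worth a sanity check. The sketch below (illustrative, pure Python) simulates rounds in which each draft token is accepted independently with probability $\alpha$ and one extra token is always emitted, then compares the empirical mean to $(1-\alpha^{k+1})/(1-\alpha)$:

```python
import random

random.seed(1)

def simulate_round(alpha, k):
    """One speculation round: accept draft tokens until the first rejection;
    the round always emits one extra token (residual sample on rejection,
    bonus target sample on full acceptance)."""
    accepted = 0
    for _ in range(k):
        if random.random() < alpha:
            accepted += 1
        else:
            break
    return accepted + 1

alpha, k = 0.72, 7
trials = 200_000
mean = sum(simulate_round(alpha, k) for _ in range(trials)) / trials
closed_form = (1 - alpha ** (k + 1)) / (1 - alpha)
print(round(mean, 3), round(closed_form, 3))  # both ~3.31
```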
```python
# Speedup model for speculative decoding
def speculative_speedup(
    alpha,               # Per-token acceptance rate
    k,                   # Number of draft tokens per round
    t_draft_per_token,   # Time for one draft model forward pass
    t_target_verify,     # Time for target model to verify k tokens
    t_target_single,     # Time for target model single-token decode
):
    """
    Calculate the speedup from speculative decoding over standard decoding.
    """
    # Expected tokens per round: (1 - alpha^(k+1)) / (1 - alpha).
    # This closed form already includes the extra token emitted each round
    # (the residual sample on rejection, or a bonus target sample when all
    # k drafts are accepted), so no +1 is added here.
    tokens_per_round = (1 - alpha**(k + 1)) / (1 - alpha)
    # Total time per round: k draft generations + 1 target verification
    time_per_round = k * t_draft_per_token + t_target_verify
    # Effective time per token with speculation
    time_per_token_spec = time_per_round / tokens_per_round
    # Without speculation: one target forward pass per token
    time_per_token_base = t_target_single
    return time_per_token_base / time_per_token_spec
```
Why Draft Model Speed Matters More Than Size
The speedup formula reveals that $t_{\text{draft}}$ appears multiplied by $k$ in the time-per-round expression. Reducing draft model latency by 2x therefore has the same effect on round time as halving $k$ (while keeping the same acceptance rate), which is almost always beneficial.
```python
# Example: Llama-2-70B target, various draft models
# H100 SXM, batch size 1 decode

# Baseline: 70B FP16 standard decode
t_target = 22.0  # ms per token

# Scenario A: 7B FP16 draft model
#   Acceptance rate with FP16 7B: ~0.78
#   Draft time: 5.8 ms per token
#   Verify time (k=5): 24.5 ms (slightly more than single decode)
speedup_A = speculative_speedup(0.78, 5, 5.8, 24.5, 22.0)
# speedup_A ~ 1.45x

# Scenario B: 7B INT4 draft model
#   Acceptance rate with INT4 7B: ~0.72 (slightly lower due to quantization)
#   Draft time: 2.1 ms per token (4x compression -> memory-bound speedup)
#   Verify time (k=7): 25.2 ms (can afford more draft tokens since cheaper)
speedup_B = speculative_speedup(0.72, 7, 2.1, 25.2, 22.0)
# speedup_B ~ 1.83x

# Scenario C: 1.5B FP16 draft model (smaller architecture)
#   Acceptance rate with FP16 1.5B: ~0.58 (much lower quality)
#   Draft time: 1.8 ms per token
#   Verify time (k=8): 26.0 ms
speedup_C = speculative_speedup(0.58, 8, 1.8, 26.0, 22.0)
# speedup_C ~ 1.29x

# The INT4 7B draft (B) wins: it is almost as fast as the tiny 1.5B model
# but has a much higher acceptance rate because it is the same architecture.
```
*Figure: Speculative decoding speedup by draft model configuration (x speedup over standard decode).*

Quantization’s Effect on Acceptance Rate
The Acceptance Rate vs Quantization Bits Trade-off
Quantization introduces noise into the draft model’s probability distribution. This noise reduces the acceptance rate because $q$ deviates from $p$ not just due to model size differences but also due to quantization error in the logits.
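To see why logit noise alone lowers acceptance, here is a toy Monte Carlo sketch (pure Python, not a measurement of any real quantizer). It models the draft distribution $q$ as the softmax of the target’s logits plus Gaussian noise, and uses the identity $\mathbb{E}_{x \sim q}[\min(1, p(x)/q(x))] = \sum_x \min(p(x), q(x))$ to estimate the per-token acceptance rate:

```python
import math
import random

random.seed(0)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def expected_acceptance(noise_std, vocab=500, trials=200):
    """Monte Carlo estimate of the per-token acceptance rate when the draft
    distribution q is the softmax of the target logits plus Gaussian noise
    (a toy stand-in for quantization error).  Uses the identity
    E_{x~q}[min(1, p(x)/q(x))] = sum_x min(p(x), q(x))."""
    total = 0.0
    for _ in range(trials):
        logits = [random.gauss(0, 4) for _ in range(vocab)]
        p = softmax(logits)
        q = softmax([l + random.gauss(0, noise_std) for l in logits])
        total += sum(min(pi, qi) for pi, qi in zip(p, q))
    return total / trials

for std in (0.0, 0.25, 0.5, 1.0):
    print(f"noise_std={std:.2f}  acceptance={expected_acceptance(std):.3f}")
```

With zero noise the draft matches the target exactly and acceptance is 1.0; acceptance falls monotonically as the logit noise grows, which is the same qualitative trend the measurements below show across quantization levels.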
```python
# Measuring acceptance rate vs quantization level
# Target: Llama-2-70B FP16
# Draft: Llama-2-7B at various quantization levels
# 500 prompts from ShareGPT, 256 tokens each
acceptance_rates = {
    # draft_config: (acceptance_rate, draft_ms_per_token, memory_gb)
    "7B FP16":      (0.782, 5.8, 13.5),
    "7B INT8":      (0.771, 3.8, 6.8),
    "7B INT4 g128": (0.724, 2.1, 3.5),
    "7B INT4 g32":  (0.738, 2.4, 3.9),
    "7B INT3 g128": (0.651, 1.8, 2.8),
    "7B INT2 AQLM": (0.542, 2.3, 2.0),  # Codebook overhead hurts speed
}

# Key observations:
# - INT8 barely hurts acceptance rate (-0.011)
# - INT4 with g128 drops acceptance by 0.058 but cuts draft time ~2.8x (5.8 -> 2.1 ms)
# - INT4 with g32 (finer groups) recovers 0.014 acceptance at a slight speed cost
# - Below 4 bits, acceptance drops faster than speed improves
# The optimum is INT4 with group_size in the 32-128 range
```
Why INT4 is the Sweet Spot for Draft Models
```python
# Net speedup analysis accounting for acceptance rate degradation
def net_speedup_analysis(draft_configs, target_ms=22.0, verify_overhead=1.12):
    """
    Calculate net speedup for each draft configuration.
    verify_overhead: ratio of verify time to single decode time
    (verifying k tokens takes slightly more than 1 decode)
    """
    results = {}
    for name, (alpha, draft_ms, mem_gb) in draft_configs.items():
        best_speedup = 0.0
        best_k = 0
        for k in range(1, 15):
            verify_ms = target_ms * verify_overhead  # Approximately constant in k
            expected = (1 - alpha**(k + 1)) / (1 - alpha)
            time_per_round = k * draft_ms + verify_ms
            speedup = target_ms / (time_per_round / expected)
            if speedup > best_speedup:
                best_speedup = speedup
                best_k = k
        results[name] = {
            'speedup': best_speedup,
            'optimal_k': best_k,
            'memory': mem_gb,
        }
    return results

# Results (as computed by the model above):
# 7B FP16:      speedup~1.50x, k=3, mem=13.5 GB
# 7B INT8:      speedup~1.75x, k=4, mem=6.8 GB
# 7B INT4 g128: speedup~1.94x, k=5, mem=3.5 GB  <-- BEST SPEEDUP
# 7B INT4 g32:  speedup~1.92x, k=5, mem=3.9 GB
# 7B INT3 g128: speedup~1.75x, k=4, mem=2.8 GB  <-- Speed gain < acceptance loss
# 7B INT2 AQLM: speedup~1.39x, k=3, mem=2.0 GB  <-- Not worth it
```
INT4 quantization is the optimal compression level for draft models because the per-token latency reduction (2-3x) more than compensates for the acceptance rate drop (5-8%). Below INT4, the acceptance rate degrades faster than latency improves, and the codebook overhead of extreme compression methods (AQLM, QuIP#) erodes the latency advantage.
Memory Budget: Co-Locating Draft and Target
GPU Memory Layout
Both models must reside on the same GPU (or GPU set) for speculative decoding to work without cross-device communication overhead. The memory budget is:
```python
# Memory budget analysis for speculative decoding on H100-80GB
def memory_budget(
    target_params_B, target_bits,
    draft_params_B, draft_bits,
    max_seq_len, max_batch_size,
    n_layers_target, n_layers_draft,
    n_kv_heads_target, n_kv_heads_draft,
    head_dim,
):
    # Model weights: params * bits / 8 bytes each
    target_weight_gb = target_params_B * 1e9 * target_bits / 8 / 1e9
    draft_weight_gb = draft_params_B * 1e9 * draft_bits / 8 / 1e9
    # KV cache: 2 (K+V) * n_layers * n_kv_heads * head_dim
    #           * max_seq_len * batch_size * 2 bytes (FP16)
    kv_target_gb = (2 * n_layers_target * n_kv_heads_target * head_dim *
                    max_seq_len * max_batch_size * 2) / 1e9
    kv_draft_gb = (2 * n_layers_draft * n_kv_heads_draft * head_dim *
                   max_seq_len * max_batch_size * 2) / 1e9
    # Activation memory (temporary, shared between models)
    activation_gb = 2.0  # Rough estimate
    total = (target_weight_gb + draft_weight_gb +
             kv_target_gb + kv_draft_gb + activation_gb)
    return {
        'target_weights': target_weight_gb,
        'draft_weights': draft_weight_gb,
        'target_kv': kv_target_gb,
        'draft_kv': kv_draft_gb,
        'activations': activation_gb,
        'total': total,
    }

# Scenario: Llama-2-70B FP16 target + 7B INT4 draft on H100-80GB
# Note: the 70B target uses GQA (8 KV heads), while Llama-2-7B uses MHA
# (32 KV heads), so the draft's per-token KV cache is actually larger.
budget = memory_budget(
    target_params_B=70, target_bits=16,
    draft_params_B=7, draft_bits=4,
    max_seq_len=4096, max_batch_size=1,
    n_layers_target=80, n_layers_draft=32,
    n_kv_heads_target=8, n_kv_heads_draft=32,
    head_dim=128,
)
# target_weights: 140 GB -> DOES NOT FIT on 1x H100

# With INT8 target + INT4 draft:
budget_int8 = memory_budget(
    target_params_B=70, target_bits=8,
    draft_params_B=7, draft_bits=4,
    max_seq_len=4096, max_batch_size=1,
    n_layers_target=80, n_layers_draft=32,
    n_kv_heads_target=8, n_kv_heads_draft=32,
    head_dim=128,
)
# target_weights: 70 GB + draft_weights: 3.5 GB + KV: ~3.5 GB + activations: 2 GB
# Total: ~79 GB -> just barely fits on 1x H100-80GB

# With INT4 target + INT4 draft (same architecture, different quant):
budget_int4 = memory_budget(
    target_params_B=70, target_bits=4,
    draft_params_B=7, draft_bits=4,
    max_seq_len=4096, max_batch_size=8,
    n_layers_target=80, n_layers_draft=32,
    n_kv_heads_target=8, n_kv_heads_draft=32,
    head_dim=128,
)
# target_weights: 35 GB + draft_weights: 3.5 GB + KV (BS=8): ~28 GB + activations
# Total: ~68 GB -> fits with room to grow the batch size further!
```
Memory Budget for Draft+Target Configurations (H100-80GB)
| Target Model | Draft Model | Weight Total | KV Budget (remaining) | Max Batch |
|---|---|---|---|---|
| 70B FP16 | 7B FP16 | 153.5 GB | DOES NOT FIT | N/A |
| 70B INT8 | 7B INT4 | 73.5 GB | 6.5 GB | BS=1-2 |
| 70B INT4 | 7B INT4 | 38.5 GB | 41.5 GB | BS=8-16 |
| 13B FP16 | 1.5B INT4 | 27.0 GB | 53.0 GB | BS=32+ |
| 13B INT4 | 1.5B INT4 | 7.3 GB | 72.7 GB | BS=128+ |
Optimal Draft Model Selection
Self-Speculative Decoding: Quantized Self-Draft
An elegant approach: use the same model as both draft and target, with different quantization levels. The INT4 version of Llama-70B drafts for the FP16 version.
```python
# Self-speculative decoding: same weights, different precision
import torch

class SelfSpeculativeDecoder:
    def __init__(self, model_name):
        # load_model stands in for your serving stack's model loader
        # Load target model at full precision
        self.target = load_model(model_name, dtype="float16")
        # Load draft model: same architecture, INT4 quantized
        self.draft = load_model(model_name, quantization="int4-awq")
        # Share the embedding and LM head (same vocabulary)
        self.draft.embed_tokens = self.target.embed_tokens
        self.draft.lm_head = self.target.lm_head

    def generate_step(self, input_ids, k=7):
        # Phase 1: draft k tokens autoregressively with the INT4 model
        draft_tokens, draft_probs = [], []
        current_ids = input_ids
        for _ in range(k):
            logits = self.draft(current_ids)[:, -1, :]
            probs = torch.softmax(logits, dim=-1)
            token = torch.multinomial(probs, 1)
            draft_tokens.append(token)
            draft_probs.append(probs)
            current_ids = torch.cat([current_ids, token], dim=-1)

        # Phase 2: verify all k tokens with the FP16 model (single forward pass)
        all_draft_tokens = torch.cat(draft_tokens, dim=-1)
        verify_input = torch.cat([input_ids, all_draft_tokens], dim=-1)
        target_logits = self.target(verify_input)

        # Phase 3: accept/reject using standard speculative sampling
        accepted = []
        for i in range(k):
            pos = input_ids.shape[1] + i
            # The target's distribution for position pos sits at logits index pos-1
            p = torch.softmax(target_logits[:, pos - 1, :], dim=-1)
            q = draft_probs[i]
            token = draft_tokens[i]
            ratio = p[0, token] / q[0, token]
            if torch.rand(1) < ratio:
                accepted.append(token)
            else:
                # Rejected: sample from the residual distribution and stop
                residual = torch.clamp(p - q, min=0)
                residual /= residual.sum()
                accepted.append(torch.multinomial(residual, 1))
                break  # Stop accepting after the first rejection
        else:
            # All k accepted: sample a bonus token from the target's last position
            p_last = torch.softmax(target_logits[:, -1, :], dim=-1)
            accepted.append(torch.multinomial(p_last, 1))
        return torch.cat(accepted, dim=-1)
```
Self-speculative decoding with quantized self-draft has the highest acceptance rate among draft model approaches because the draft model shares the exact same knowledge, just at lower precision. The acceptance rate is typically 0.82-0.90 for INT4 self-draft, compared to 0.70-0.78 for a smaller separate draft model at the same memory cost.
Layer-Skipping Draft
An alternative to quantization: use the target model itself but skip layers during drafting. This avoids the memory cost of a separate draft model entirely.
```python
# Layer-skipping self-draft: skip every other layer during draft generation
class LayerSkipDraft:
    def __init__(self, model, skip_pattern="even"):
        self.model = model
        # The pattern names which layers are SKIPPED during drafting:
        # "even" skips even-indexed layers (draft uses layers 1, 3, 5, ...);
        # "last_half" skips the last half (draft uses the first n // 2 layers).
        # The pattern can also be chosen from measured layer sensitivity.
        if skip_pattern == "even":
            self.draft_layers = list(range(1, model.config.num_hidden_layers, 2))
        elif skip_pattern == "last_half":
            n = model.config.num_hidden_layers
            self.draft_layers = list(range(n // 2))

    def draft_forward(self, input_ids):
        """Run a forward pass through only the draft layers."""
        hidden = self.model.embed_tokens(input_ids)
        for i in self.draft_layers:
            hidden = self.model.layers[i](hidden)
        hidden = self.model.norm(hidden)
        return self.model.lm_head(hidden)

    def target_forward(self, input_ids):
        """Run the full forward pass through all layers."""
        return self.model(input_ids).logits

# Advantages:
# - Zero additional memory (no separate model weights)
# - Draft is ~2x faster (half the layers)
# - Acceptance rate: ~0.65-0.75 (depends on skip pattern)
# Disadvantages:
# - Lower acceptance rate than quantized self-draft
# - Cannot overlap draft and verify (same weights)
```
Kernel Scheduling and Execution Overlap
Overlapping Draft Generation with Target Verification
In a pipeline-parallel setup, draft generation for round $n+1$ can begin while the target model verifies round $n$:
```python
# Overlapping execution with CUDA streams
import torch

class PipelinedSpeculativeDecoder:
    def __init__(self, target_model, draft_model):
        self.target = target_model
        self.draft = draft_model
        self.draft_stream = torch.cuda.Stream()
        self.target_stream = torch.cuda.Stream()

    def generate(self, input_ids, max_new_tokens=256, k=7):
        generated = input_ids
        tokens_generated = 0
        # Initial draft round (no overlap possible yet)
        draft_tokens, draft_probs = self.run_draft(generated, k)
        while tokens_generated < max_new_tokens:
            # Start verification of the current draft tokens
            with torch.cuda.stream(self.target_stream):
                verify_input = torch.cat([generated, draft_tokens], dim=-1)
                target_logits = self.target(verify_input)
            # Simultaneously start the next draft round (speculatively,
            # assuming all current draft tokens will be accepted)
            with torch.cuda.stream(self.draft_stream):
                speculative_input = torch.cat([generated, draft_tokens], dim=-1)
                next_draft_tokens, next_draft_probs = self.run_draft(
                    speculative_input, k
                )
            # Synchronize and process verification results.
            # speculative_sample returns the accepted prefix plus the
            # residual/bonus token appended by the acceptance rule.
            self.target_stream.synchronize()
            accepted = speculative_sample(
                target_logits, draft_tokens, draft_probs
            )
            generated = torch.cat([generated, accepted], dim=-1)
            tokens_generated += accepted.shape[1]
            if accepted.shape[1] == k + 1:
                # All k tokens accepted: the speculative next draft is valid
                self.draft_stream.synchronize()
                draft_tokens = next_draft_tokens
                draft_probs = next_draft_probs
            else:
                # Partial acceptance: discard speculative draft, re-draft
                self.draft_stream.synchronize()  # Wait for it to finish
                draft_tokens, draft_probs = self.run_draft(generated, k)
        return generated

    def run_draft(self, input_ids, k):
        tokens, probs = [], []
        current = input_ids
        for _ in range(k):
            logits = self.draft(current)[:, -1, :]
            p = torch.softmax(logits, dim=-1)
            tok = torch.multinomial(p, 1)
            tokens.append(tok)
            probs.append(p)
            current = torch.cat([current, tok], dim=-1)
        return torch.cat(tokens, dim=-1), probs
```
Batch Scheduling for Multiple Requests
In a serving system, different requests may be at different stages of the draft-verify cycle:
```python
# vLLM-style scheduling with speculative decoding
class SpeculativeScheduler:
    """
    Schedule draft and verify phases across multiple requests.
    Key insight: batch all verifications together for efficiency.
    (collate_requests, sample, append_tokens, store_draft_probs,
    collate_verify_sequences, and speculative_accept are serving-framework
    helpers elided here.)
    """
    def __init__(self, draft_model, target_model, k=7):
        self.draft_model = draft_model
        self.target_model = target_model
        self.k = k

    def schedule_step(self, active_requests):
        # Separate requests by phase
        needs_draft = [r for r in active_requests if r.phase == "draft"]
        needs_verify = [r for r in active_requests if r.phase == "verify"]

        # Draft phase: run the draft model for all requests needing drafts.
        # Serial per-request (autoregressive) but batched across requests.
        if needs_draft:
            batch_draft_input = collate_requests(needs_draft)
            for step in range(self.k):
                draft_logits = self.draft_model(batch_draft_input)
                new_tokens = sample(draft_logits)
                batch_draft_input = append_tokens(batch_draft_input, new_tokens)
                store_draft_probs(needs_draft, draft_logits)
            for r in needs_draft:
                r.phase = "verify"

        # Verify phase: batch all verifications into one target forward pass.
        # Efficient because verification is a single forward pass per request.
        if needs_verify:
            batch_verify_input = collate_verify_sequences(needs_verify)
            target_logits = self.target_model(batch_verify_input)
            for r in needs_verify:
                r.accepted_tokens = speculative_accept(r, target_logits)
                r.phase = "draft"  # Start the next round
```
Production Results and Optimal Configuration
End-to-End Speculative Decoding Performance (H100 SXM)
| Configuration | Tokens/sec (BS=1) | Speedup | Acceptance Rate | GPU Memory |
|---|---|---|---|---|
| Llama-2-70B FP16 (no speculation) | 45 tok/s | 1.00x | N/A | 140 GB (2 GPU) |
| 70B FP16 + 7B FP16 draft | 72 tok/s | 1.60x | 0.78 | 153 GB (2 GPU) |
| 70B FP16 + 7B INT4 draft | 88 tok/s | 1.96x | 0.72 | 143 GB (2 GPU) |
| 70B INT4 + 7B INT4 self-draft | 115 tok/s | 2.56x | 0.85 | 38 GB (1 GPU) |
| 70B INT4 + layer-skip draft | 98 tok/s | 2.18x | 0.68 | 35 GB (1 GPU) |
| 13B FP16 + 1.5B INT4 draft | 185 tok/s | 1.85x | 0.62 | 27 GB (1 GPU) |
Optimal k (Draft Length) Selection
The optimal number of draft tokens depends on acceptance rate and relative cost:
```python
# Optimal k analysis for different acceptance rates
def find_optimal_k(alpha, cost_ratio):
    """
    alpha: acceptance rate
    cost_ratio: t_draft / t_target (how much cheaper the draft is)
    For a geometric acceptance model:
        optimal k ~ -1 / ln(alpha) when cost_ratio is small
    """
    best_k = 1
    best_speedup = 0.0
    for k in range(1, 20):
        expected_tokens = (1 - alpha**(k + 1)) / (1 - alpha)
        time_ratio = k * cost_ratio + 1.0  # k drafts + 1 verify
        speedup = expected_tokens / time_ratio
        if speedup > best_speedup:
            best_speedup = speedup
            best_k = k
    return best_k, best_speedup

# Results (as computed by the model above):
# alpha=0.85, cost_ratio=0.10 (INT4 self-draft): optimal k=7, speedup~2.9x
# alpha=0.72, cost_ratio=0.10 (INT4 separate):   optimal k=4, speedup~2.1x
# alpha=0.72, cost_ratio=0.25 (FP16 separate):   optimal k=3, speedup~1.5x
# alpha=0.58, cost_ratio=0.08 (tiny draft):      optimal k=3, speedup~1.7x
```
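The `-1 / ln(alpha)` heuristic from the docstring can be cross-checked with a standalone brute-force search over the same geometric model (the `brute_force_k` name is introduced here; it ignores verify overhead, as the heuristic does):

```python
import math

def brute_force_k(alpha, cost_ratio, k_max=20):
    """Exhaustively find the k that maximizes expected tokens per unit time
    under the geometric acceptance model (k drafts + 1 verify per round)."""
    def speedup(k):
        expected = (1 - alpha ** (k + 1)) / (1 - alpha)
        return expected / (k * cost_ratio + 1.0)
    return max(range(1, k_max), key=speedup)

for alpha in (0.85, 0.72, 0.58):
    heuristic = -1 / math.log(alpha)
    print(f"alpha={alpha}: brute-force k={brute_force_k(alpha, 0.10)}, "
          f"-1/ln(alpha)={heuristic:.1f}")
```

The heuristic tracks the brute-force optimum to within a few tokens at small cost ratios; the exact optimum shifts down as the draft gets relatively more expensive.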
*Figure: Optimal draft length (k) by acceptance rate and cost ratio.*

When Speculative Decoding With Quantized Drafts Fails
Cases where speculative decoding does not help:
1. High batch size serving (BS > 32):
- Target model is already compute-bound, not memory-bound
- Verification of k tokens is NOT free (adds compute)
- Draft model competes for GPU compute resources
- Continuous batching already amortizes decode latency
2. Short outputs (< 20 tokens):
- Overhead of draft-verify protocol dominates
- Warm-up cost of draft model KV cache is not amortized
3. Very high acceptance rate needed (translation, transcription):
- Distribution shift between draft and target causes systematic rejections
- Domain-specific tokens have low draft model coverage
4. Multi-GPU tensor parallel target:
- Communication overhead for verification across GPUs
- Draft model typically runs on a single GPU (communication-free)
- But verification requires all-reduce across GPUs
Speculative decoding provides the largest speedup for single-request, low-batch-size scenarios (interactive chat, code completion). For high-throughput batch serving, continuous batching without speculation is usually more efficient because the GPU is already saturated with compute from the large batch.
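The first failure mode can be made concrete with a toy extension of the speedup model: assume verification costs about one decode step while the batch's verify tokens fit in the memory-bound regime, and scales linearly with total verify tokens beyond an assumed saturation point. All constants here are illustrative assumptions, not measurements:

```python
def speedup_at_batch(bs, alpha=0.72, k=7, t_draft=2.1, t_target=22.0,
                     saturation_tokens=32):
    """Toy model of speculative speedup vs batch size.  Below the assumed
    compute-saturation point, verifying k+1 tokens per request costs about
    one decode step; above it, verify cost grows with total verify tokens."""
    congestion = max(1.0, bs * (k + 1) / saturation_tokens)
    verify_ms = t_target * 1.12 * congestion
    expected = (1 - alpha ** (k + 1)) / (1 - alpha)
    per_token_ms = (k * t_draft + verify_ms) / expected
    return t_target / per_token_ms

for bs in (1, 4, 16, 64):
    print(f"BS={bs:3d}: {speedup_at_batch(bs):.2f}x")
```

Under these assumptions the speedup holds near its single-request value until the batch saturates the GPU, then falls below 1.0x, matching the observation that high-throughput serving is better off with plain continuous batching.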
Summary
INT4 quantized draft models are the optimal choice for speculative decoding: they provide 2-3x latency reduction per draft token while maintaining 0.70-0.75 acceptance rates, yielding 1.9-2.5x end-to-end speedup over standard decoding. The self-speculative variant (same model at INT4 for drafting, FP16/INT8 for verification) achieves even higher acceptance rates (0.82-0.90) because the draft and target share identical knowledge. The key engineering decisions are: (1) INT4 is the optimal draft quantization level (below INT4, acceptance rate drops faster than speed improves), (2) the optimal draft length typically falls in the 3-8 range depending on acceptance rate and cost ratio, and (3) the memory budget for co-locating both models on the same GPU determines feasibility. For interactive serving at low batch sizes, quantized speculative decoding is one of the most effective latency reduction techniques available.