Standard LLM decode generates one token per forward pass, at roughly 70 ms per token for a 70B model on A100-class hardware. That is about 14 tokens/sec, bottlenecked by reading 140 GB of model weights from HBM for every single token. Speculative decoding breaks this one-token-per-pass ceiling by having a small draft model propose 4-8 candidate tokens (at ~5 ms each), then verifying all candidates in a single forward pass of the large model (~80 ms total). If 3 of the 4 draft tokens are accepted, you generate 4 tokens in 100 ms instead of 280 ms: a 2.8x speedup with zero quality loss.

The Core Idea

Standard decode: 1 forward pass -> 1 token -> ~70 ms per token. Speculative decode: K draft passes (~5 ms each) + 1 verification pass (~80 ms) -> accept 2-4 tokens per round -> roughly 20-33 ms per token.

Speculative Decoding Mechanics

| Step | Model Used | Cost | Output |
|---|---|---|---|
| 1. Draft K tokens | Small draft model (e.g., 7B) | ~5 ms per token | K candidate tokens |
| 2. Verify all K in parallel | Large model (70B) | ~80 ms (1 forward pass) | Accept first N <= K + generate 1 bonus token |
| 3. Output accepted tokens | -- | 0 ms | N+1 tokens total |

Note: Verification is a single forward pass because the large model processes all K positions in parallel (like prefill).
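The draft-and-verify round in the table can be sketched for greedy decoding, where "accept" simply means the draft token matches the target's argmax. Everything here is a toy illustration: `target_next` stands in for a model call, and in a real engine all K+1 target predictions come from a single batched forward pass.

```python
# Toy greedy speculative round: keep the longest prefix of drafts that
# the target model agrees with, then emit one token from the target
# itself (the "bonus"/correction token).

def greedy_verify(ctx, drafts, target_next):
    """ctx: token list so far; drafts: K proposed tokens;
    target_next(seq) -> the target model's argmax next token."""
    out, seq = [], list(ctx)
    for d in drafts:
        t = target_next(seq)
        if t != d:
            return out + [t]        # first mismatch: emit target's token
        out.append(d)               # draft token accepted
        seq.append(d)
    return out + [target_next(seq)]  # all K accepted + 1 bonus token

# Toy "target model" that always predicts previous token + 1.
target = lambda seq: seq[-1] + 1
print(greedy_verify([10], [11, 12, 99], target))  # -> [11, 12, 13]
```

Even on the rejected round above, one correct token still comes out: a round never produces fewer tokens than standard decode would.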

The key insight: the verification forward pass costs about the same as generating one token (both load the same model weights), but it produces N+1 tokens. Relative to standard decode at t_token per token, the effective speedup is (N+1) x t_token / (K x t_draft + t_verify), where t_token ~ t_verify.
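Plugging in illustrative timings (assumptions, not benchmarks: t_token ~ 70 ms standard decode for a 70B target, t_draft ~ 5 ms, t_verify ~ 80 ms) recovers the headline number:

```python
# Quick arithmetic for the speedup formula above.
# Timings are illustrative assumptions, not measured values.

def spec_speedup(k: int, n_accepted: int, t_token: float = 70.0,
                 t_draft: float = 5.0, t_verify: float = 80.0) -> float:
    """Speedup over standard decode: one round emits n_accepted + 1
    tokens (accepted drafts plus the bonus token) for the cost of
    K draft passes plus one verification pass."""
    return (n_accepted + 1) * t_token / (k * t_draft + t_verify)

print(spec_speedup(4, 3))  # K=4 drafts, 3 accepted -> 2.8
```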

Why This Is Free Quality

Speculative decoding is mathematically guaranteed to produce the same distribution as standard sampling from the large model. Accepted tokens match exactly. Rejected tokens are resampled from a corrected distribution. There is zero quality loss — it’s a pure latency optimization.
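That guarantee comes from a simple accept/resample rule applied at each draft position. A minimal pure-Python sketch, where p and q are the target's and draft's next-token distributions (names are illustrative, not any library's API):

```python
import random

def verify_token(token, p, q, rng=random):
    """Accept the draft `token` with probability min(1, p[token]/q[token]);
    on rejection, resample from the corrected distribution proportional
    to max(0, p - q). The output token is then distributed exactly as a
    sample from p, the target model's distribution."""
    if rng.random() < min(1.0, p[token] / q[token]):
        return True, token
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    total = sum(residual)
    return False, rng.choices(range(len(p)), weights=[r / total for r in residual])[0]

p = [0.7, 0.2, 0.1]  # target model's distribution
q = [0.1, 0.6, 0.3]  # draft model's distribution
print(verify_token(0, p, q))  # p[0] >= q[0], so token 0 is always accepted
```

Tokens the draft model over-proposes (q above p) get rejected proportionally often, and the residual distribution puts that probability mass back where the target wanted it.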

Acceptance Rate: The Critical Metric

The speedup depends on how many draft tokens the large model accepts:

Acceptance Rate by Task and Draft Model Quality

| Task | Draft Model | Target Model | Avg Accepted (K=5) | Effective Speedup |
|---|---|---|---|---|
| Code generation | CodeLlama-7B | CodeLlama-70B | 3.8 of 5 | 2.7x |
| Chat (general) | Llama-7B | Llama-70B | 2.5 of 5 | 2.0x |
| Creative writing | Llama-7B | Llama-70B | 1.8 of 5 | 1.6x |
| Factual QA | Llama-7B | Llama-70B | 3.2 of 5 | 2.4x |
| Translation | NLLB-600M | NLLB-3.3B | 3.5 of 5 | 2.6x |

Note: Acceptance rates are higher on predictable tasks (code, QA) and lower on creative tasks, where the draft model diverges.

Effective Tokens per Second (Llama-70B, A100, batch=1)

| Configuration | Speedup | Throughput (tok/s) |
|---|---|---|
| Standard decode | 1.0x | 14 |
| Speculative (K=3, code) | 2.7x | 38 |
| Speculative (K=5, code) | 2.6x (K=5 has more waste) | 36 |
| Speculative (K=3, chat) | 2.0x | 28 |
| Speculative (K=5, creative) | 1.6x | 22 |

The Math: Optimal K

The expected tokens per speculation round is E[accepted] + 1. The cost per round is K x t_draft + t_verify. The optimal K depends on acceptance probability alpha:

Speedup = (E[accepted] + 1) / (K x t_draft/t_verify + 1)

Where E[accepted] = alpha x (1 - alpha^K) / (1 - alpha) for per-token acceptance probability alpha, so E[accepted] + 1 = (1 - alpha^(K+1)) / (1 - alpha).

Optimal K by Acceptance Rate

| Acceptance Rate (alpha) | Optimal K | Expected Accepted | Speedup (t_draft/t_verify = 0.07) |
|---|---|---|---|
| 0.9 (very aligned) | 7-8 | 6.5 | 3.8x |
| 0.8 (well aligned) | 5-6 | 4.0 | 2.8x |
| 0.7 (moderate) | 3-4 | 2.7 | 2.1x |
| 0.5 (poor alignment) | 2-3 | 1.5 | 1.4x |
| 0.3 (misaligned) | 1-2 | 0.8 | 1.1x (barely helps) |

Note: When alpha is below 0.4, speculation overhead exceeds the benefit; use standard decode instead.
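The formula can be swept over K directly. This is a toy model under the same assumptions as the table (geometric acceptance, t_draft/t_verify = 0.07); real engines also pay per-round scheduling overheads, so the practical optimum tends to be a bit smaller than this unconstrained sweep suggests, especially at high alpha.

```python
# Sweep K for the geometric-acceptance speedup model from the text.

def speedup(alpha: float, k: int, cost_ratio: float = 0.07) -> float:
    expected_tokens = (1 - alpha ** (k + 1)) / (1 - alpha)  # E[accepted] + 1
    return expected_tokens / (k * cost_ratio + 1)

def optimal_k(alpha: float, k_max: int = 16) -> int:
    return max(range(1, k_max + 1), key=lambda k: speedup(alpha, k))

for alpha in (0.9, 0.8, 0.7, 0.5, 0.3):
    k = optimal_k(alpha)
    print(f"alpha={alpha}: K*={k}, speedup={speedup(alpha, k):.2f}x")
```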
ℹ️ Draft Model Quality Matters Enormously

The draft model must approximate the target model’s distribution well. A 7B draft for a 70B target works because they share vocabulary and training data distribution. A completely different architecture (e.g., LSTM draft for Transformer target) would have alpha under 0.3 — not worth speculating.

Variants

Self-speculative decoding (Medusa, EAGLE): Instead of a separate draft model, add small prediction heads to the target model itself. Each head predicts a token a fixed number of positions ahead from the last hidden state. This eliminates the need for a separate draft model but adds ~5% parameters.

Prompt lookup decoding: For tasks with repetitive output (structured data, templates), look up candidate tokens from the prompt or a cache. Zero draft cost but only works when output resembles existing text.
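The lookup step can be sketched over token-ID lists (a hypothetical helper, not the implementation in any particular engine): match the trailing n-gram of the generated text against the prompt and propose the tokens that followed it there.

```python
# Minimal prompt-lookup drafting: find the last n-gram of the output
# inside the prompt and copy the tokens that followed it as drafts.

def lookup_draft(prompt_ids, generated_ids, ngram=2, k=4):
    """Return up to k draft tokens copied from the prompt, or [] if the
    trailing n-gram of the output never appears in the prompt."""
    if len(generated_ids) < ngram:
        return []
    key = generated_ids[-ngram:]
    for i in range(len(prompt_ids) - ngram):
        if prompt_ids[i:i + ngram] == key:
            return prompt_ids[i + ngram:i + ngram + k]
    return []

prompt = [5, 8, 13, 21, 34, 55, 89]
out = [1, 2, 8, 13]               # output currently ends with (8, 13)
print(lookup_draft(prompt, out))  # -> [21, 34, 55, 89]
```

Drafting costs only a substring search, which is why this works well when the output quotes or paraphrases the prompt (summarization, structured extraction, code edits).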

Tree speculation: Instead of a single draft sequence, generate a tree of candidates. Verify the entire tree in one forward pass. Accepts more tokens per round at the cost of longer verification.

Speculation Variant Comparison

| Variant | Draft Cost | Acceptance Rate | Complexity | Best For |
|---|---|---|---|---|
| Standard (separate model) | K x t_draft | High (if models aligned) | Medium | General purpose |
| Medusa (self-draft heads) | ~0 extra | Moderate | Low (single model) | When no good draft model exists |
| EAGLE (feature-based) | ~0.05 x t_verify | High | Medium | Best single-model method |
| Prompt lookup | ~0 | Task-dependent | Low | Repetitive/structured output |
| Tree speculation | K x branches x t_draft | Highest (tree covers more) | High | Batch=1, latency-critical |

When Speculation Doesn’t Help

When to Skip Speculative Decoding

| Scenario | Why Speculation Fails | Better Alternative |
|---|---|---|
| Batch size over 8 | Target model already bandwidth-saturated | Standard decode (already efficient) |
| Very short outputs (under 10 tokens) | Setup overhead exceeds savings | Standard decode |
| No good draft model available | Low acceptance rate (alpha under 0.4) | Medusa/EAGLE heads instead |
| Streaming (TTFT-sensitive) | Draft adds latency to first token | Standard decode for first chunk, then speculate |
| Memory-constrained | Draft model consumes additional VRAM | Quantize both models |
⚠️ Batch Size Is the Key Variable

Speculative decoding helps most at batch=1 (pure memory-latency-bound decode). As batch size increases, the target model forward pass already amortizes weight loading across more tokens — the marginal benefit of speculation shrinks. At batch=16+, standard decode is often faster because the speculation overhead (running the draft model) exceeds the savings.

Conclusion

Speculative decoding delivers 2-3x latency reduction for LLM decode at small batch sizes, with zero quality loss. The speedup is proportional to the draft-target acceptance rate, which depends on model alignment and task predictability. Use K=3-5 draft tokens with a well-aligned smaller model (same family, 7B draft for 70B target). For scenarios without a good draft model, self-speculative methods (Medusa, EAGLE) add prediction heads to the target model itself. The technique is most impactful at batch=1 and loses benefit as batch size increases beyond 8.