Standard LLM decode generates one token per forward pass: about 7 ms per token for a 70B model on an A100 (~140 tokens/s), bottlenecked by reading all 140 GB of fp16 weights from HBM for every single token. Speculative decoding breaks this one-token-per-pass ceiling: a small draft model proposes 4-8 candidate tokens (~0.5 ms each), and the large model verifies all of them in a single forward pass (~8 ms). If 3 of 4 draft tokens are accepted, you get 4 tokens in 10 ms instead of 28 ms, a 2.8x speedup with zero quality loss.
The Core Idea
Standard decode: 1 forward pass -> 1 token -> 7 ms per token. Speculative decode: K draft passes (~0.5 ms each, ~2 ms for K = 4) + 1 verification pass (~8 ms) -> accept 2-4 tokens plus 1 bonus -> 2.0-3.3 ms per token.
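The round-trip arithmetic can be checked directly, using the latencies assumed above (0.5 ms per draft token, 8 ms verification, 7 ms per standard decode token):

```python
# Latencies assumed in the text (A100, 70B-class target, 1-2B draft).
T_DRAFT = 0.5   # ms per draft token
T_VERIFY = 8.0  # ms for one verification pass of the target model
T_TARGET = 7.0  # ms per token for standard decode

def ms_per_token(k_draft: int, n_accepted: int) -> float:
    """Cost of one speculation round divided by tokens it produces
    (accepted tokens plus the 1 bonus token from verification)."""
    round_cost = k_draft * T_DRAFT + T_VERIFY
    return round_cost / (n_accepted + 1)

# K=4 drafts, 3 accepted: 4 tokens in 10 ms -> 2.5 ms/token, 2.8x over 7 ms.
spec = ms_per_token(4, 3)
speedup = T_TARGET / spec
```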
Speculative Decoding Mechanics
| Step | Model Used | Cost | Output |
|---|---|---|---|
| 1. Draft K tokens | Small model (1-2B) | ~0.5 ms per token | K candidate tokens |
| 2. Verify all K in parallel | Large model (70B) | ~8 ms (1 forward pass) | Accept first N <= K + generate 1 bonus token |
| 3. Output accepted tokens | -- | 0 ms | N+1 tokens total |
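For greedy decoding, step 2 reduces to a prefix match. A minimal sketch (illustrative function names, not any particular library's API): the target model scores all draft positions in one pass, the longest agreeing prefix is accepted, and the first disagreement supplies a free correction token.

```python
from typing import List

def verify_greedy(draft_tokens: List[int], target_argmax: List[int]) -> List[int]:
    """Greedy verification. target_argmax[i] is the target model's top token
    at position i, computed for all K+1 positions in a single forward pass
    (so it must have len(draft_tokens) + 1 entries). Returns the accepted
    prefix plus one bonus/correction token."""
    accepted = []
    for drafted, preferred in zip(draft_tokens, target_argmax):
        if drafted == preferred:
            accepted.append(drafted)    # draft agrees with the target: keep it
        else:
            accepted.append(preferred)  # first disagreement: emit the target's
            return accepted             # token and stop this round
    # All K drafts accepted: the target's next prediction is the free bonus token.
    accepted.append(target_argmax[len(draft_tokens)])
    return accepted
```

Either way the round emits at least one token, so the loop always makes progress even when every draft is rejected.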
The key insight: the verification forward pass costs about the same as generating 1 token (both load the same model weights), but it produces N+1 tokens. The effective speedup over standard decode is (N+1) / (K x t_draft/t_verify + 1), where t_verify is roughly the standard per-token latency.
Speculative decoding is mathematically guaranteed to produce the same distribution as standard sampling from the large model. Accepted tokens match exactly. Rejected tokens are resampled from a corrected distribution. There is zero quality loss — it’s a pure latency optimization.
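The lossless guarantee comes from the accept/reject rule of the original speculative sampling papers. A minimal sketch over a toy vocabulary, where p is the target distribution and q the draft distribution at one position (and the draft token was sampled from q, so q[token] > 0):

```python
import random

def accept_or_resample(token: int, p: list, q: list) -> tuple:
    """Accept the draft token with probability min(1, p[token] / q[token]);
    on rejection, resample from the corrected (residual) distribution
    max(p - q, 0), renormalized. Returns (accepted_flag, output_token).
    The output token is then distributed exactly according to p."""
    if random.random() < min(1.0, p[token] / q[token]):
        return True, token
    # Rejected: sample from the renormalized residual max(p - q, 0).
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    z = sum(residual)
    r = random.random() * z
    acc = 0.0
    for tok, w in enumerate(residual):
        acc += w
        if r <= acc:
            return False, tok
    return False, len(p) - 1  # guard against floating-point round-off
```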
Acceptance Rate: The Critical Metric
The speedup depends on how many draft tokens the large model accepts:
Acceptance Rate by Task and Draft Model Quality
| Task | Draft Model | Target Model | Avg Accepted (K=5) | Effective Speedup |
|---|---|---|---|---|
| Code generation | CodeLlama-7B | CodeLlama-70B | 3.8 of 5 | 2.7x |
| Chat (general) | Llama-7B | Llama-70B | 2.5 of 5 | 2.0x |
| Creative writing | Llama-7B | Llama-70B | 1.8 of 5 | 1.6x |
| Factual QA | Llama-7B | Llama-70B | 3.2 of 5 | 2.4x |
| Translation | NLLB-600M | NLLB-3.3B | 3.5 of 5 | 2.6x |
[Chart omitted: effective tokens per second (tok/s), Llama-70B, A100, batch=1]
The Math: Optimal K
The expected tokens per speculation round is E[accepted] + 1. The cost per round is K x t_draft + t_verify. The optimal K depends on acceptance probability alpha:
Speedup = (E[accepted] + 1) / (K x t_draft/t_verify + 1)
Where E[accepted] + 1 = (1 - alpha^(K+1)) / (1 - alpha) for geometric acceptance, i.e. each draft token is accepted independently with probability alpha and the round stops at the first rejection.
Optimal K by Acceptance Rate
| Acceptance Rate (alpha) | Optimal K | Expected Accepted | Speedup (t_draft/t_verify = 0.07) |
|---|---|---|---|
| 0.9 (very aligned) | 7-8 | 6.5 | 3.8x |
| 0.8 (well aligned) | 5-6 | 4.0 | 2.8x |
| 0.7 (moderate) | 3-4 | 2.7 | 2.1x |
| 0.5 (poor alignment) | 2-3 | 1.5 | 1.4x |
| 0.3 (misaligned) | 1-2 | 0.8 | 1.1x (barely helps) |
The draft model must approximate the target model’s distribution well. A 7B draft for a 70B target works because they share vocabulary and training data distribution. A completely different architecture (e.g., LSTM draft for Transformer target) would have alpha under 0.3 — not worth speculating.
Variants
Self-speculative decoding (Medusa, EAGLE): Instead of a separate draft model, add small prediction heads to the target model itself. Each head predicts the token N positions ahead from the last hidden state. Eliminates the need for a separate draft model but adds ~5% parameters.
Prompt lookup decoding: For tasks with repetitive output (structured data, templates), look up candidate tokens from the prompt or a cache. Zero draft cost but only works when output resembles existing text.
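Prompt lookup drafting reduces to a pure n-gram match, no model at all. A sketch (illustrative names and parameters): find the most recent earlier occurrence of the trailing n-gram of the context, and propose the tokens that followed it.

```python
from typing import List

def prompt_lookup_draft(context: List[int], ngram: int = 3, k: int = 5) -> List[int]:
    """Propose up to k draft tokens by matching the trailing n-gram of the
    context (prompt + generated tokens) against earlier positions in that
    same context. Zero draft-model cost; returns [] when nothing matches."""
    if len(context) < ngram:
        return []
    tail = context[-ngram:]
    # Scan right-to-left for the most recent earlier occurrence of the tail.
    for start in range(len(context) - ngram - 1, -1, -1):
        if context[start:start + ngram] == tail:
            follow = context[start + ngram:start + ngram + k]
            if follow:
                return follow
    return []
```

On a miss the system simply falls back to standard decode for that step, which is why the worst case costs almost nothing.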
Tree speculation: Instead of a single draft sequence, generate a tree of candidates. Verify the entire tree in one forward pass. Accepts more tokens per round at the cost of longer verification.
Speculation Variant Comparison
| Variant | Draft Cost | Acceptance Rate | Complexity | Best For |
|---|---|---|---|---|
| Standard (separate model) | K x t_draft | High (if models aligned) | Medium | General purpose |
| Medusa (self-draft heads) | ~0 extra | Moderate | Low (single model) | When no good draft model exists |
| EAGLE (feature-based) | ~0.05 x t_verify | High | Medium | Best single-model method |
| Prompt lookup | ~0 | Task-dependent | Low | Repetitive/structured output |
| Tree speculation | K x branches x t_draft | Highest (tree covers more) | High | Batch=1 latency-critical |
When Speculation Doesn’t Help
When to Skip Speculative Decoding
| Scenario | Why Speculation Fails | Better Alternative |
|---|---|---|
| Batch size over 8 | Target model already bandwidth-saturated | Standard decode (already efficient) |
| Very short outputs (under 10 tokens) | Setup overhead exceeds savings | Standard decode |
| No good draft model available | Low acceptance rate (alpha under 0.4) | Medusa/EAGLE heads instead |
| Streaming (TTFT-sensitive) | Draft adds latency to first token | Standard decode for first chunk, then speculate |
| Memory-constrained | Draft model consumes additional VRAM | Quantize both models |
Speculative decoding helps most at batch=1 (pure memory-latency-bound decode). As batch size increases, the target model forward pass already amortizes weight loading across more tokens — the marginal benefit of speculation shrinks. At batch=16+, standard decode is often faster because the speculation overhead (running the draft model) exceeds the savings.
Conclusion
Speculative decoding delivers 2-3x latency reduction for LLM decode at small batch sizes, with zero quality loss. The speedup is proportional to the draft-target acceptance rate, which depends on model alignment and task predictability. Use K=3-5 draft tokens with a well-aligned smaller model (same family, 7B draft for 70B target). For scenarios without a good draft model, self-speculative methods (Medusa, EAGLE) add prediction heads to the target model itself. The technique is most impactful at batch=1 and loses benefit as batch size increases beyond 8.