Standard LLM decode generates one token per forward pass, at roughly 70 ms per token for a 70B model on A100-class hardware. That is about 14 tokens/sec, bottlenecked by reading 140 GB of model weights from HBM for every single token. Speculative decoding breaks this one-token-per-pass ceiling by having a small draft model propose 4-8 candidate tokens (at ~5 ms each), then verifying all candidates in a single forward pass of the large model (~80 ms total). If 3 of the 4 draft tokens are accepted, you generate 4 tokens in 100 ms instead of 280 ms: a 2.8x speedup with zero quality loss.

The Core Idea

Standard decode: 1 forward pass -> 1 token -> ~70 ms per token. Speculative decode: K draft passes (~5 ms each) + 1 verification pass (~80 ms) -> accept 2-4 tokens per round -> roughly 20-33 ms per token.

Speculative Decoding Mechanics

| Step | Model Used | Cost | Output |
|---|---|---|---|
| 1. Draft K tokens | Small draft model (e.g., 7B) | ~5 ms per token | K candidate tokens |
| 2. Verify all K in parallel | Large model (70B) | ~80 ms (1 forward pass) | Accept first N <= K + generate 1 bonus token |
| 3. Output accepted tokens | -- | 0 ms | N+1 tokens total |

Note: Verification is a single forward pass because the large model processes all K positions in parallel (like prefill).
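The draft-and-verify round in the table can be sketched for greedy decoding, where "accept" simply means the draft token matches the target's argmax. Everything here is a toy illustration: `target_next` stands in for a model call, and in a real engine all K+1 target predictions come from a single batched forward pass.

```python
# Toy greedy speculative round: keep the longest prefix of drafts that
# the target model agrees with, then emit one token from the target
# itself (the "bonus"/correction token).

def greedy_verify(ctx, drafts, target_next):
    """ctx: token list so far; drafts: K proposed tokens;
    target_next(seq) -> the target model's argmax next token."""
    out, seq = [], list(ctx)
    for d in drafts:
        t = target_next(seq)
        if t != d:
            return out + [t]        # first mismatch: emit target's token
        out.append(d)               # draft token accepted
        seq.append(d)
    return out + [target_next(seq)]  # all K accepted + 1 bonus token

# Toy "target model" that always predicts previous token + 1.
target = lambda seq: seq[-1] + 1
print(greedy_verify([10], [11, 12, 99], target))  # -> [11, 12, 13]
```

Even on the rejected round above, one correct token still comes out: a round never produces fewer tokens than standard decode would.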

The key insight: the verification forward pass costs about the same as generating one token (both load the same model weights), but it produces N+1 tokens. Relative to standard decode at t_token per token, the effective speedup is (N+1) x t_token / (K x t_draft + t_verify), where t_token ~ t_verify.
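Plugging in illustrative timings (assumptions, not benchmarks: t_token ~ 70 ms standard decode for a 70B target, t_draft ~ 5 ms, t_verify ~ 80 ms) recovers the headline number:

```python
# Quick arithmetic for the speedup formula above.
# Timings are illustrative assumptions, not measured values.

def spec_speedup(k: int, n_accepted: int, t_token: float = 70.0,
                 t_draft: float = 5.0, t_verify: float = 80.0) -> float:
    """Speedup over standard decode: one round emits n_accepted + 1
    tokens (accepted drafts plus the bonus token) for the cost of
    K draft passes plus one verification pass."""
    return (n_accepted + 1) * t_token / (k * t_draft + t_verify)

print(spec_speedup(4, 3))  # K=4 drafts, 3 accepted -> 2.8
```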

Why This Is Free Quality

Speculative decoding is mathematically guaranteed to produce the same distribution as standard sampling from the large model. Accepted tokens match exactly. Rejected tokens are resampled from a corrected distribution. There is zero quality loss — it’s a pure latency optimization.
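That guarantee comes from a simple accept/resample rule applied at each draft position. A minimal pure-Python sketch, where p and q are the target's and draft's next-token distributions (names are illustrative, not any library's API):

```python
import random

def verify_token(token, p, q, rng=random):
    """Accept the draft `token` with probability min(1, p[token]/q[token]);
    on rejection, resample from the corrected distribution proportional
    to max(0, p - q). The output token is then distributed exactly as a
    sample from p, the target model's distribution."""
    if rng.random() < min(1.0, p[token] / q[token]):
        return True, token
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    total = sum(residual)
    return False, rng.choices(range(len(p)), weights=[r / total for r in residual])[0]

p = [0.7, 0.2, 0.1]  # target model's distribution
q = [0.1, 0.6, 0.3]  # draft model's distribution
print(verify_token(0, p, q))  # p[0] >= q[0], so token 0 is always accepted
```

Tokens the draft model over-proposes (q above p) get rejected proportionally often, and the residual distribution puts that probability mass back where the target wanted it.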

Acceptance Rate: The Critical Metric

The speedup depends on how many draft tokens the large model accepts:

Acceptance Rate by Task and Draft Model Quality

| Task | Draft Model | Target Model | Avg Accepted (K=5) | Effective Speedup |
|---|---|---|---|---|
| Code generation | CodeLlama-7B | CodeLlama-70B | 3.8 of 5 | 2.7x |
| Chat (general) | Llama-7B | Llama-70B | 2.5 of 5 | 2.0x |
| Creative writing | Llama-7B | Llama-70B | 1.8 of 5 | 1.6x |
| Factual QA | Llama-7B | Llama-70B | 3.2 of 5 | 2.4x |
| Translation | NLLB-600M | NLLB-3.3B | 3.5 of 5 | 2.6x |

Note: Acceptance rates are higher on predictable tasks (code, QA) and lower on creative tasks, where the draft model diverges.

Effective Tokens per Second (Llama-70B, A100, batch=1)

| Configuration | Speedup | Throughput (tok/s) |
|---|---|---|
| Standard decode | 1.0x | 14 |
| Speculative (K=3, code) | 2.7x | 38 |
| Speculative (K=5, code) | 2.6x (K=5 has more waste) | 36 |
| Speculative (K=3, chat) | 2.0x | 28 |
| Speculative (K=5, creative) | 1.6x | 22 |

The Math: Optimal K

The expected tokens per speculation round is E[accepted] + 1. The cost per round is K x t_draft + t_verify. The optimal K depends on acceptance probability alpha:

Speedup = (E[accepted] + 1) / (K x t_draft/t_verify + 1)

Where E[accepted] = alpha x (1 - alpha^K) / (1 - alpha) for per-token acceptance probability alpha, so E[accepted] + 1 = (1 - alpha^(K+1)) / (1 - alpha).

Optimal K by Acceptance Rate

| Acceptance Rate (alpha) | Optimal K | Expected Accepted | Speedup (t_draft/t_verify = 0.07) |
|---|---|---|---|
| 0.9 (very aligned) | 7-8 | 6.5 | 3.8x |
| 0.8 (well aligned) | 5-6 | 4.0 | 2.8x |
| 0.7 (moderate) | 3-4 | 2.7 | 2.1x |
| 0.5 (poor alignment) | 2-3 | 1.5 | 1.4x |
| 0.3 (misaligned) | 1-2 | 0.8 | 1.1x (barely helps) |

Note: When alpha is below 0.4, speculation overhead exceeds the benefit; use standard decode instead.
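The formula can be swept over K directly. This is a toy model under the same assumptions as the table (geometric acceptance, t_draft/t_verify = 0.07); real engines also pay per-round scheduling overheads, so the practical optimum tends to be a bit smaller than this unconstrained sweep suggests, especially at high alpha.

```python
# Sweep K for the geometric-acceptance speedup model from the text.

def speedup(alpha: float, k: int, cost_ratio: float = 0.07) -> float:
    expected_tokens = (1 - alpha ** (k + 1)) / (1 - alpha)  # E[accepted] + 1
    return expected_tokens / (k * cost_ratio + 1)

def optimal_k(alpha: float, k_max: int = 16) -> int:
    return max(range(1, k_max + 1), key=lambda k: speedup(alpha, k))

for alpha in (0.9, 0.8, 0.7, 0.5, 0.3):
    k = optimal_k(alpha)
    print(f"alpha={alpha}: K*={k}, speedup={speedup(alpha, k):.2f}x")
```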
ℹ️ Draft Model Quality Matters Enormously

The draft model must approximate the target model’s distribution well. A 7B draft for a 70B target works because they share vocabulary and training data distribution. A completely different architecture (e.g., LSTM draft for Transformer target) would have alpha under 0.3 — not worth speculating.

Variants

Self-speculative decoding (Medusa, EAGLE): Instead of a separate draft model, add small prediction heads to the target model itself. Each head predicts a token a fixed number of positions ahead from the last hidden state. This eliminates the need for a separate draft model but adds ~5% parameters.

Prompt lookup decoding: For tasks with repetitive output (structured data, templates), look up candidate tokens from the prompt or a cache. Zero draft cost but only works when output resembles existing text.
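The lookup step can be sketched over token-ID lists (a hypothetical helper, not the implementation in any particular engine): match the trailing n-gram of the generated text against the prompt and propose the tokens that followed it there.

```python
# Minimal prompt-lookup drafting: find the last n-gram of the output
# inside the prompt and copy the tokens that followed it as drafts.

def lookup_draft(prompt_ids, generated_ids, ngram=2, k=4):
    """Return up to k draft tokens copied from the prompt, or [] if the
    trailing n-gram of the output never appears in the prompt."""
    if len(generated_ids) < ngram:
        return []
    key = generated_ids[-ngram:]
    for i in range(len(prompt_ids) - ngram):
        if prompt_ids[i:i + ngram] == key:
            return prompt_ids[i + ngram:i + ngram + k]
    return []

prompt = [5, 8, 13, 21, 34, 55, 89]
out = [1, 2, 8, 13]               # output currently ends with (8, 13)
print(lookup_draft(prompt, out))  # -> [21, 34, 55, 89]
```

Drafting costs only a substring search, which is why this works well when the output quotes or paraphrases the prompt (summarization, structured extraction, code edits).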

Tree speculation: Instead of a single draft sequence, generate a tree of candidates. Verify the entire tree in one forward pass. Accepts more tokens per round at the cost of longer verification.

Speculation Variant Comparison

| Variant | Draft Cost | Acceptance Rate | Complexity | Best For |
|---|---|---|---|---|
| Standard (separate model) | K x t_draft | High (if models aligned) | Medium | General purpose |
| Medusa (self-draft heads) | ~0 extra | Moderate | Low (single model) | When no good draft model exists |
| EAGLE (feature-based) | ~0.05 x t_verify | High | Medium | Best single-model method |
| Prompt lookup | ~0 | Task-dependent | Low | Repetitive/structured output |
| Tree speculation | K x branches x t_draft | Highest (tree covers more) | High | Batch=1, latency-critical |

When Speculation Doesn’t Help

When to Skip Speculative Decoding

| Scenario | Why Speculation Fails | Better Alternative |
|---|---|---|
| Batch size over 8 | Target model already bandwidth-saturated | Standard decode (already efficient) |
| Very short outputs (under 10 tokens) | Setup overhead exceeds savings | Standard decode |
| No good draft model available | Low acceptance rate (alpha under 0.4) | Medusa/EAGLE heads instead |
| Streaming (TTFT-sensitive) | Draft adds latency to first token | Standard decode for first chunk, then speculate |
| Memory-constrained | Draft model consumes additional VRAM | Quantize both models |
⚠️ Batch Size Is the Key Variable

Speculative decoding helps most at batch=1 (pure memory-latency-bound decode). As batch size increases, the target model forward pass already amortizes weight loading across more tokens — the marginal benefit of speculation shrinks. At batch=16+, standard decode is often faster because the speculation overhead (running the draft model) exceeds the savings.

Conclusion

Speculative decoding delivers 2-3x latency reduction for LLM decode at small batch sizes, with zero quality loss. The speedup is proportional to the draft-target acceptance rate, which depends on model alignment and task predictability. Use K=3-5 draft tokens with a well-aligned smaller model (same family, 7B draft for 70B target). For scenarios without a good draft model, self-speculative methods (Medusa, EAGLE) add prediction heads to the target model itself. The technique is most impactful at batch=1 and loses benefit as batch size increases beyond 8.