Speculative decoding (Chen et al. 2023, "Accelerating Large Language Model Decoding with Speculative Sampling") uses a small, fast "draft" model to predict 4-8 tokens ahead and then verifies all of them with the large model in a single forward pass. Normal autoregressive decoding generates one token per forward pass. Speculative decoding can produce several tokens per large-model forward pass when the draft model guesses correctly. The output distribution is mathematically identical to standard decoding, so there is no quality tradeoff. Speed improvements of 2-3x are typical in production deployments.
The Fundamental Bottleneck in LLM Decoding
To understand why speculative decoding works, you need to understand why standard decoding is slow.
Autoregressive language model decoding is sequential by design. The model generates token 1, then uses token 1 as part of the input to generate token 2, then uses tokens 1 and 2 to generate token 3, and so on. Each token generation requires a full forward pass through the model.
The bottleneck is memory bandwidth, not compute. At low batch sizes, every generated token requires reading all of the model's weights (70 billion parameters for a 70B model) from GPU memory, which is slow relative to the GPU's theoretical compute throughput. The model is "memory bandwidth bound," not "compute bound." This means more raw compute (a GPU with higher FLOPS) does not help much. The limit is how fast weights can be read from memory.
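A back-of-the-envelope calculation makes the bandwidth limit concrete. The sketch below assumes FP16 weights (2 bytes per parameter), a batch size of 1, and an illustrative aggregate memory bandwidth of 3,000 GB/s. These values are assumptions chosen for illustration, not measurements, and the estimate ignores KV-cache reads and multi-GPU communication.

```python
def min_seconds_per_token(num_params: float,
                          bytes_per_param: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Lower bound on decode time per token if every weight is read once per forward pass."""
    weight_bytes = num_params * bytes_per_param
    return weight_bytes / bandwidth_bytes_per_s


# Assumed values, for illustration only.
params = 70e9            # 70B-parameter target model
bytes_per_param = 2      # FP16 / BF16 weights
bandwidth = 3.0e12       # 3,000 GB/s aggregate memory bandwidth (assumption)

t = min_seconds_per_token(params, bytes_per_param, bandwidth)
print(f"weights: {params * bytes_per_param / 1e9:.0f} GB")
print(f"lower bound: {t * 1e3:.1f} ms/token, at most {1 / t:.1f} tokens/s")
# weights: 140 GB
# lower bound: 46.7 ms/token, at most 21.4 tokens/s
```

Under these assumptions the ceiling is set entirely by how fast the weights can be streamed, which is why verifying several tokens in the same pass is attractive: the weights are read once either way.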
Speculative decoding breaks the sequential bottleneck by reusing the expensive large model forward pass to verify multiple tokens simultaneously.
How Speculative Decoding Works
The algorithm has two phases per generation step:
Phase 1: Draft. The small draft model generates k candidate tokens autoregressively. The draft model is typically around 10-100x smaller than the target model (e.g., draft = Llama 3 8B, target = Llama 3 70B, roughly a 9x gap). Because the draft model is small, generating k tokens is fast.
Phase 2: Verify. The draft tokens are appended to the current sequence and fed to the large target model in a single forward pass. The target model processes all k draft tokens in parallel (because they are inputs, not outputs) and generates its own probabilities for each position.
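The acceptance step. With greedy decoding, a draft token is accepted at each position if it matches the token the target model would have chosen. With sampling, the draft token x is instead accepted with probability min(1, p(x)/q(x)), where p and q are the target and draft probabilities for x; on rejection, a replacement token is sampled from the normalized positive part of p - q. In both cases, if the first rejection happens at position i, the draft tokens at positions i+1 through k are discarded, and the target model's token at position i is used. If all k draft tokens are accepted, the target model's prediction for the next position comes for free as one extra token. Then the process restarts.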
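The draft, verify, and accept phases can be sketched for the greedy case with toy stand-ins for the two models. Everything here is hypothetical scaffolding: `toy_target` and `toy_draft` are simple functions over integer tokens, not real models, and the verification step calls the target once per prefix, whereas a real model would produce all k+1 predictions in one forward pass.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[int]:
    """One draft/verify round with greedy decoding. Returns the newly produced tokens."""
    # Phase 1: the draft model proposes k tokens autoregressively.
    draft: List[int] = []
    for _ in range(k):
        draft.append(draft_next(prefix + draft))

    # Phase 2: the target's greedy choice at each of the k+1 positions.
    # (Toy simplification: one call per prefix instead of a single batched pass.)
    target_choices = [target_next(prefix + draft[:i]) for i in range(k + 1)]

    # Acceptance: keep draft tokens until the first disagreement.
    produced: List[int] = []
    for i in range(k):
        if draft[i] == target_choices[i]:
            produced.append(draft[i])
        else:
            produced.append(target_choices[i])  # target's token replaces the bad draft
            return produced
    produced.append(target_choices[k])  # all accepted: one extra token from the target
    return produced


def generate(prompt: List[int], draft_next: NextToken, target_next: NextToken,
             k: int, max_new_tokens: int) -> List[int]:
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new_tokens:
        seq.extend(speculative_step(seq, draft_next, target_next, k))
    return seq[: len(prompt) + max_new_tokens]


def greedy(prompt: List[int], target_next: NextToken, max_new_tokens: int) -> List[int]:
    """Plain one-token-at-a-time decoding with the target only, for comparison."""
    seq = list(prompt)
    for _ in range(max_new_tokens):
        seq.append(target_next(seq))
    return seq


# Hypothetical toy models: the target counts upward mod 10; the draft agrees
# except right after a 4, where it guesses wrong.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


print(generate([0], toy_draft, toy_target, k=3, max_new_tokens=12))
print(greedy([0], toy_target, max_new_tokens=12))
```

Both calls print the same sequence: the draft's mistakes cost speed, not correctness, because the target's choice always wins at the first disagreement.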
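When the draft model is accurate, you get up to k+1 tokens from a single large model forward pass instead of one (the k draft tokens plus the target model's own next token). Even when the draft model is wrong some of the time, you still get a speedup as long as the accepted tokens outweigh the cost of drafting. An acceptance rate above roughly 50% is a reasonable rule of thumb, but the exact break-even point depends on how much cheaper the draft model is than the target.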
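The relationship between acceptance rate and speedup can be estimated with a simple model. The sketch below assumes each draft token is accepted independently with the same probability `alpha`, and that one draft forward pass costs a fixed fraction `cost_ratio` of a target forward pass. Both are simplifying assumptions, and the names are illustrative. Under them, the expected number of tokens per target pass is (1 - alpha^(k+1)) / (1 - alpha).

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens produced per target forward pass with k draft tokens,
    assuming each draft token is accepted independently with probability alpha."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def estimated_speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Estimated speedup over plain decoding.

    cost_ratio is the (assumed) cost of one draft pass relative to one target pass,
    so each round costs k * cost_ratio + 1 target-pass equivalents.
    """
    return expected_tokens_per_pass(alpha, k) / (k * cost_ratio + 1)


# Illustrative values only: k = 5 draft tokens, draft pass costs 10% of a target pass.
for alpha in (0.3, 0.5, 0.8, 0.9):
    tokens = expected_tokens_per_pass(alpha, k=5)
    speedup = estimated_speedup(alpha, k=5, cost_ratio=0.1)
    print(f"alpha={alpha:.1f}  tokens/pass={tokens:.2f}  speedup={speedup:.2f}x")
```

With these assumed costs, an acceptance rate of 0.5 gives an estimated speedup of about 1.3x and 0.8 gives about 2.5x. A more expensive draft model shifts the break-even point upward.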
The key mathematical insight from Chen et al. 2023: the acceptance-rejection procedure is designed so that the final output distribution is identical to sampling from the target model alone. It is not an approximation.
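The acceptance-rejection rule can be sketched for a single position. Here `p` is the target model's distribution and `q` is the draft model's distribution over the next token; a token drawn from `q` is accepted with probability min(1, p(x)/q(x)), and on rejection a replacement is drawn from the normalized positive part of p - q. The dictionaries and helper names below are made up for illustration, and a real implementation would work on logit tensors rather than Python dicts.

```python
import random
from typing import Dict

Dist = Dict[str, float]


def residual_distribution(p: Dist, q: Dist) -> Dist:
    """Normalized max(0, p - q): the distribution to resample from after a rejection."""
    raw = {tok: max(0.0, p[tok] - q.get(tok, 0.0)) for tok in p}
    total = sum(raw.values())
    if total == 0.0:  # p == q: rejection never happens, any fallback works
        return dict(p)
    return {tok: w / total for tok, w in raw.items()}


def accept_or_resample(token: str, p: Dist, q: Dist, rng: random.Random) -> str:
    """Apply the accept/reject rule to one draft token that was sampled from q."""
    accept_prob = min(1.0, p.get(token, 0.0) / q[token])
    if rng.random() < accept_prob:
        return token
    residual = residual_distribution(p, q)
    tokens = list(residual)
    return rng.choices(tokens, weights=[residual[t] for t in tokens])[0]


def empirical_distribution(p: Dist, q: Dist, n: int, seed: int = 0) -> Dist:
    """Draw n draft tokens from q, run the rule, and count what comes out."""
    rng = random.Random(seed)
    counts = {tok: 0 for tok in p}
    q_tokens, q_weights = list(q), list(q.values())
    for _ in range(n):
        drafted = rng.choices(q_tokens, weights=q_weights)[0]
        counts[accept_or_resample(drafted, p, q, rng)] += 1
    return {tok: c / n for tok, c in counts.items()}


# Made-up distributions for illustration.
p = {"a": 0.6, "b": 0.3, "c": 0.1}   # target
q = {"a": 0.3, "b": 0.5, "c": 0.2}   # draft
print(empirical_distribution(p, q, n=20_000))
```

Even though every candidate token comes from `q`, the observed frequencies land close to `p`, which is the property that lets speculative decoding match the target model's sampling behavior.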
What "Identical Output" Means
Speculative decoding does not change the output distribution at all. If you ran the same prompt 1,000 times with standard decoding (same temperature setting), you would get a certain distribution of outputs. Speculative decoding with the same temperature produces the exact same distribution.
This is different from quantization or pruning, which change the model and therefore change its outputs. Speculative decoding is a pure inference optimization — the model is unchanged.
Typical Performance Numbers
Speedup depends on the draft model's accuracy (what fraction of draft tokens the target model accepts) and on sequence length. Longer sequences allow more speculation.
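Approximate figures for common configurations (the original paper predates Llama 3, so these reflect subsequent implementations and will vary with hardware and workload):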
- Llama 3 70B with Llama 3 8B as draft: 2.0-2.5x speedup on text generation
- Llama 3 405B with Llama 3 70B as draft: 1.8-2.2x speedup
- Code generation (Python/TypeScript): 2.5-3.5x speedup (code is more predictable, higher acceptance rate)
- Conversational responses: 1.5-2.0x speedup (more unpredictable, lower acceptance rate)
Implementations
Hugging Face Transformers: Supports speculative decoding via the assistant_model parameter:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the large target model
target_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-70B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)

# Load the small draft model (it must share the target's tokenizer)
draft_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-70B-Instruct")
# Put the inputs on the target model's device instead of hard-coding "cuda"
inputs = tokenizer("Write a Python function that", return_tensors="pt").to(target_model.device)

outputs = target_model.generate(
    **inputs,
    assistant_model=draft_model,  # This enables speculative decoding
    max_new_tokens=200,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
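vLLM: Native speculative decoding support in a production inference server. The flags below match older vLLM releases; newer releases configure speculation through a --speculative-config option instead, so check the documentation for your installed version: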
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-70B-Instruct --speculative-model meta-llama/Meta-Llama-3.1-8B-Instruct --num-speculative-tokens 5
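Speculation happens entirely on the server, so clients call the OpenAI-compatible API as usual. The sketch below assumes the server started above is listening on vLLM's default address, http://localhost:8000, and uses only the standard library. The helper names `build_completion_request` and `complete` are made up for this example.

```python
import json
import urllib.request


def build_completion_request(prompt: str,
                             model: str = "meta-llama/Meta-Llama-3.1-70B-Instruct",
                             base_url: str = "http://localhost:8000",  # assumed default address
                             max_tokens: int = 200) -> urllib.request.Request:
    """Build a POST request for the server's /v1/completions endpoint."""
    payload = {
        "model": model,          # the target model; the draft model is not named by clients
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, **kwargs) -> str:
    """Send the request and return the generated text. Requires a running server."""
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]


# With the server running:
#   print(complete("Write a Python function that"))
```

Because the request is identical with or without a draft model configured, the same client can be used to compare latency between the two server setups.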
SGLang: Another production inference framework with speculative decoding support, which can outperform vLLM on some workloads.
Self-Speculative Decoding: No Draft Model Needed
An interesting variant (called "self-speculative decoding" or "layer-skip speculative decoding") skips a subset of the target model's layers during the draft phase, using the full model only during verification. This requires no separate draft model but achieves lower speedup (1.2-1.5x rather than 2-3x).
This approach is useful when you cannot run two models simultaneously due to memory constraints.
When Speculative Decoding Is Most Beneficial
Speculative decoding is most valuable when:
- You are bottlenecked by decoding latency (generating long outputs one token at a time)
- You are running the large model at low batch sizes (interactive use, single-user scenarios)
- The draft model and target model are from the same family (better acceptance rates)
- You are generating code or other predictable content
At high batch sizes, standard batched inference already achieves good GPU utilization and speculative decoding provides less benefit. It is primarily an optimization for low-batch interactive inference.