Speculative Decoding: How to Make LLM Inference 2-3x Faster With Identical Output

Speculative decoding uses a small draft model to predict multiple tokens ahead, then verifies them with the large model in one parallel pass. The result is 2-3x faster inference with the same output distribution as the large model on its own.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 17, 2026

9 min read

// tags

#speculative-decoding #llm-inference #ai-optimization #vllm

Speculative decoding (Chen et al. 2023, "Accelerating Large Language Model Decoding with Speculative Sampling") uses a small, fast "draft" model to predict 4-8 tokens ahead and then verifies all of them with the large model in a single forward pass. Normal autoregressive decoding generates one token per forward pass of the large model. Speculative decoding can produce several tokens per large-model forward pass, up to the full 4-8 drafted tokens when the draft model guesses correctly. The output distribution is mathematically identical to standard decoding, so there is no quality tradeoff. Speed improvements of 2-3x are typical in production deployments.

The Fundamental Bottleneck in LLM Decoding

To understand why speculative decoding works, you need to understand why standard decoding is slow.

Autoregressive language model decoding is sequential by design. The model generates token 1, then uses token 1 as part of the input to generate token 2, then uses tokens 1 and 2 to generate token 3, and so on. Each token generation requires a full forward pass through the model.

The bottleneck is not compute; it is memory bandwidth. Loading all the weights of a 70-billion-parameter model from GPU memory for every single token is slow relative to the GPU's theoretical compute throughput. The model is "memory bandwidth bound," not "compute bound." This means more raw compute (a GPU with higher peak FLOPS) does not help much. The limit is how fast weights can be read from memory.
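A back-of-the-envelope calculation makes the bandwidth limit concrete. The sketch below assumes FP16 weights (2 bytes per parameter) and an aggregate memory bandwidth of 2 TB/s; both figures are illustrative assumptions, not measurements, and the estimate ignores KV-cache reads and any overlap of compute with memory traffic.

```python
def max_tokens_per_second(num_params: float, bytes_per_param: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-stream decode speed when every generated
    token requires reading all model weights from memory once."""
    weight_bytes = num_params * bytes_per_param
    return bandwidth_bytes_per_s / weight_bytes


# Assumed figures: 70e9 parameters, FP16 (2 bytes each), 2e12 bytes/s bandwidth.
rate = max_tokens_per_second(70e9, 2, 2e12)
print(f"{rate:.1f} tokens/s upper bound")
```

Under these assumptions the ceiling is roughly 14 tokens per second no matter how much arithmetic throughput the GPU has, which is why getting more than one token out of each weight load matters.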

Speculative decoding breaks the sequential bottleneck by reusing the expensive large model forward pass to verify multiple tokens simultaneously.

How Speculative Decoding Works

The algorithm has two phases per generation step:

Phase 1: Draft. The small draft model generates k candidate tokens autoregressively. The draft model is 10-100x smaller than the target model (e.g., draft = Llama 3 8B, target = Llama 3 70B). Because the draft model is small, generating k tokens is fast.

Phase 2: Verify. The draft tokens are appended to the current sequence and fed to the large target model in a single forward pass. The target model processes all k draft tokens in parallel (because they are inputs, not outputs) and generates its own probabilities for each position.

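The acceptance step. Under greedy decoding, the draft token at each position is accepted if it matches the token the target model would have chosen. If they diverge at position i, the draft tokens at positions i+1 through k are discarded, and the target model's token at position i is used. Then the process restarts from the extended sequence. When sampling with a temperature, the exact-match test is replaced by a probabilistic one: a draft token x is accepted with probability min(1, p(x)/q(x)), where p and q are the target and draft probabilities for x, and on rejection a replacement token is sampled from the normalized positive part of p - q.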
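The draft, verify, and accept phases can be sketched for the greedy case, where acceptance is an exact token match. This is a toy illustration, not a production implementation: the "models" are hypothetical functions that map a token sequence to the next token, and the target is called once per prefix where a real model would score all positions in one batched forward pass.

```python
from typing import Callable, List, Sequence

NextToken = Callable[[Sequence[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """One draft-then-verify step under greedy decoding.
    Returns the tokens appended to the sequence by this step."""
    # Phase 1: the draft model proposes k tokens autoregressively.
    drafted: List[int] = []
    for _ in range(k):
        drafted.append(draft(prefix + drafted))

    # Phase 2: the target's choice at each of the k + 1 positions.
    # (A real target model computes these in a single forward pass.)
    target_choice = [target(prefix + drafted[:i]) for i in range(k + 1)]

    accepted: List[int] = []
    for i in range(k):
        if drafted[i] != target_choice[i]:
            # First divergence: use the target's token, discard the rest.
            accepted.append(target_choice[i])
            return accepted
        accepted.append(drafted[i])
    # All k draft tokens matched; the target's own next token comes for free.
    accepted.append(target_choice[k])
    return accepted


# Hypothetical toy models: the target counts upward modulo 10,
# and the draft agrees except that it guesses 0 after seeing a 3.
def toy_target(seq: Sequence[int]) -> int:
    return (seq[-1] + 1) % 10


def toy_draft(seq: Sequence[int]) -> int:
    return 0 if seq[-1] == 3 else (seq[-1] + 1) % 10


new_tokens = speculative_step([0], toy_draft, toy_target, k=4)
print(new_tokens)
```

In this toy run the draft is wrong at the fourth position, so the step keeps three draft tokens plus the target's correction. The result is the same sequence the target alone would have produced with four separate passes.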

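When the draft model is accurate, you get k tokens (plus one more from the target's own prediction at the final position) from a single large model forward pass instead of one. Even when the draft model is wrong some of the time, you still get a speedup as long as enough draft tokens are accepted to outweigh the cost of drafting. As a rough rule of thumb that means an acceptance rate above about 50%, though the exact break-even point depends on how cheap the draft model is relative to the target.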
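To see how the acceptance rate translates into tokens per pass, a common simplification is to assume each draft token is accepted independently with the same probability. Under that assumption (which real workloads only approximate), the expected number of tokens per target forward pass is a geometric series. The function name below is illustrative.

```python
def expected_tokens_per_pass(acceptance_rate: float, k: int) -> float:
    """Expected tokens emitted per target forward pass, assuming each of the
    k draft tokens is accepted independently with probability acceptance_rate.
    Counts the accepted draft tokens plus the one token the target supplies."""
    if acceptance_rate >= 1.0:
        return float(k + 1)
    return (1 - acceptance_rate ** (k + 1)) / (1 - acceptance_rate)


for rate in (0.5, 0.8, 0.95):
    print(rate, round(expected_tokens_per_pass(rate, k=4), 2))
```

With k = 4, an acceptance rate of 0.5 yields just under 2 tokens per pass, while 0.8 yields about 3.4, so the payoff grows quickly as the draft model gets more accurate.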

The key mathematical insight from Chen et al. 2023: the acceptance-rejection procedure is designed so that the final output distribution is identical to sampling from the target model alone. It is not an approximation.
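For sampled (non-greedy) decoding, the acceptance-rejection procedure works on probabilities rather than exact matches. The following is a minimal sketch of the rule for a single drafted token, with the two distributions passed in as plain dictionaries; the function name and the dictionary representation are assumptions made for illustration.

```python
import random
from typing import Dict, Tuple


def accept_or_resample(token: str, p_target: Dict[str, float],
                       q_draft: Dict[str, float],
                       rng: random.Random) -> Tuple[str, bool]:
    """Apply the accept/reject rule to one token sampled from the draft.
    Returns (emitted_token, was_accepted)."""
    p = p_target.get(token, 0.0)
    q = q_draft[token]  # the draft sampled this token, so q > 0
    # Accept with probability min(1, p / q).
    if rng.random() < min(1.0, p / q):
        return token, True
    # Rejected: resample from the positive part of (p - q), renormalized
    # implicitly by random.choices via its weights argument.
    residual = {t: max(0.0, p_target[t] - q_draft.get(t, 0.0)) for t in p_target}
    tokens = list(residual)
    weights = [residual[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0], False


rng = random.Random(0)
p = {"a": 0.5, "b": 0.3, "c": 0.2}   # target distribution (made-up numbers)
q = {"a": 0.2, "b": 0.2, "c": 0.6}   # draft distribution (made-up numbers)
print(accept_or_resample("c", p, q, rng))
```

Tokens the draft over-proposes (q larger than p) are sometimes rejected, and the resampling step gives that probability mass back to the tokens the draft under-proposes.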

What "Identical Output" Means

Speculative decoding does not change the output distribution at all. If you ran the same prompt 1,000 times with standard decoding (same temperature setting), you would get a certain distribution of outputs. Speculative decoding with the same temperature produces the exact same distribution.
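The equivalence can be checked exactly for a single token rather than estimated from samples. The sketch below assumes the standard speculative sampling rule (accept a draft token x with probability min(1, p(x)/q(x)), otherwise resample from the normalized positive part of p - q) and uses made-up distributions over a three-token vocabulary shared by both models.

```python
from fractions import Fraction
from typing import Dict


def speculative_output_distribution(p: Dict[str, Fraction],
                                    q: Dict[str, Fraction]) -> Dict[str, Fraction]:
    """Exact distribution of the token emitted by one accept-or-resample step,
    given target distribution p and draft distribution q over the same tokens."""
    # P(draft proposes x and it is accepted) = q(x) * min(1, p(x)/q(x)) = min(p(x), q(x)).
    accepted = {x: min(p[x], q[x]) for x in p}
    reject_prob = 1 - sum(accepted.values())
    # On rejection, the replacement is drawn from max(0, p - q), renormalized.
    residual = {x: max(Fraction(0), p[x] - q[x]) for x in p}
    norm = sum(residual.values())
    out = {}
    for x in p:
        resampled = reject_prob * residual[x] / norm if norm else Fraction(0)
        out[x] = accepted[x] + resampled
    return out


p = {"a": Fraction(1, 2), "b": Fraction(3, 10), "c": Fraction(1, 5)}
q = {"a": Fraction(1, 5), "b": Fraction(1, 5), "c": Fraction(3, 5)}
print(speculative_output_distribution(p, q) == p)
```

Because the arithmetic uses exact fractions, the comparison is an equality check rather than a statistical test: the emitted-token distribution matches the target distribution regardless of how poor the draft distribution is.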

This is different from quantization or pruning, which change the model and therefore change its outputs. Speculative decoding is a pure inference optimization — the model is unchanged.

Typical Performance Numbers

Speedup depends on the draft model's accuracy (what fraction of draft tokens the target model accepts) and on sequence length. Longer sequences allow more speculation.

Indicative figures, in line with the ranges reported by the original speculative decoding papers and by subsequent implementations:

Llama 3 70B with Llama 3 8B as draft: 2.0-2.5x speedup on text generation
Llama 3 405B with Llama 3 70B as draft: 1.8-2.2x speedup
Code generation (Python/TypeScript): 2.5-3.5x speedup (code is more predictable, higher acceptance rate)
Conversational responses: 1.5-2.0x speedup (more unpredictable, lower acceptance rate)
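The link between acceptance rate and speedup can be approximated with a simple cost model. The sketch below makes three simplifying assumptions: each draft token is accepted independently with the same probability, verifying k drafted tokens costs the same as one ordinary target forward pass, and one draft forward pass costs a fixed fraction of a target pass. The 0.1 cost ratio and the acceptance rates are illustrative values, not measurements.

```python
def estimated_speedup(acceptance_rate: float, k: int,
                      draft_cost_ratio: float) -> float:
    """Estimated wall-clock speedup over plain autoregressive decoding.

    draft_cost_ratio is (time of one draft pass) / (time of one target pass).
    """
    if acceptance_rate >= 1.0:
        tokens = float(k + 1)
    else:
        tokens = (1 - acceptance_rate ** (k + 1)) / (1 - acceptance_rate)
    # Each step pays for k draft passes plus one target verification pass.
    cost_in_target_passes = k * draft_cost_ratio + 1.0
    return tokens / cost_in_target_passes


for rate in (0.2, 0.5, 0.8):
    print(rate, round(estimated_speedup(rate, k=5, draft_cost_ratio=0.1), 2))
```

Under these assumptions a highly predictable workload (acceptance around 0.8) lands near 2.5x, a moderately predictable one (0.5) near 1.3x, and a poorly matched draft model (0.2) is slower than not speculating at all.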

Implementations

Hugging Face Transformers: Supports speculative decoding via the assistant_model parameter:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the large target model
target_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-70B-Instruct",
    torch_dtype="auto",
    device_map="auto"
)

# Load the small draft model
draft_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    torch_dtype="auto",
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-70B-Instruct")

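# Place the inputs on the same device as the target model's first parameters
inputs = tokenizer("Write a Python function that", return_tensors="pt").to(target_model.device)

outputs = target_model.generate(
    **inputs,
    assistant_model=draft_model,  # This enables speculative decoding
    max_new_tokens=200
)

# Decode the generated token IDs back into text
print(tokenizer.decode(outputs[0], skip_special_tokens=True))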

vLLM: Native speculative decoding support in a production inference server. The flag names below vary between vLLM releases, so check the documentation for your installed version:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --speculative-model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --num-speculative-tokens 5

SGLang: Another production inference framework with speculative decoding support, which can outperform vLLM on some workloads.

Self-Speculative Decoding: No Draft Model Needed

An interesting variant (called "self-speculative decoding" or "layer-skip speculative decoding") skips a subset of the target model's layers during the draft phase, using the full model only during verification. This requires no separate draft model but achieves lower speedup (1.2-1.5x rather than 2-3x).

This approach is useful when you cannot run two models simultaneously due to memory constraints.

When Speculative Decoding Is Most Beneficial

Speculative decoding is most valuable when:

You are bottlenecked by per-request generation speed (generating many long outputs)
You are running the large model at low batch sizes (interactive use, single-user scenarios)
The draft model and target model are from the same family (better acceptance rates)
You are generating code or other predictable content

At high batch sizes, standard batched inference already achieves good GPU utilization and speculative decoding provides less benefit. It is primarily an optimization for low-batch interactive inference.
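A roofline-style estimate shows why batch size matters. The sketch below models one forward pass as the slower of weight loading and arithmetic, using assumed hardware figures (2 TB/s memory bandwidth, 1e15 FLOP/s peak compute) and the common approximation of about 2 FLOPs per parameter per token. It ignores attention and KV-cache costs, so treat the numbers as illustrative only.

```python
def step_time_s(tokens_in_pass: int, weight_bytes: float, bandwidth: float,
                flops_per_token: float, peak_flops: float) -> float:
    """Time for one forward pass: the slower of reading the weights once
    and doing the arithmetic for every token processed in the pass."""
    memory_time = weight_bytes / bandwidth
    compute_time = tokens_in_pass * flops_per_token / peak_flops
    return max(memory_time, compute_time)


# Assumed figures for a 70B-parameter FP16 model on hypothetical hardware.
hw = dict(weight_bytes=140e9, bandwidth=2e12,
          flops_per_token=2 * 70e9, peak_flops=1e15)
k = 5  # drafted tokens per sequence, so verification scores k + 1 positions

for batch in (1, 256):
    plain = step_time_s(batch, **hw)
    verify = step_time_s(batch * (k + 1), **hw)
    print(batch, round(verify / plain, 2))
```

At batch size 1 the verification pass costs the same as an ordinary decoding pass because both are limited by weight loading, so the extra positions are effectively free. At batch size 256 the pass becomes compute-bound and verification costs several times more, which erodes the benefit.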

Keep Reading

Flash Attention Explained — The memory optimization that complements speculative decoding.
Quantization for Inference Cost — Use a quantized draft model for lower memory overhead during speculation.
Local LLM vs. API Cost Comparison — How inference optimizations change the economics of self-hosting.
