Speculative Decoding: How to Get 3x LLM Speed With a Smaller Draft Model

Speculative decoding uses a small, fast model to draft several tokens and a large model to verify them in parallel, achieving 1.5-3x speedups without changing the output distribution.

Mahmudul Haque Qudrati

CEO & ML Engineer

March 26, 2026

9 min read

// tags

#speculative-decoding #inference #speed #draft-model #latency


Why It Preserves the Output Distribution

The acceptance criterion is derived from rejection sampling. A draft token x is accepted with probability min(1, p_target(x) / p_draft(x)), so when draft and target agree (p_draft ≈ p_target), tokens are almost always accepted. When a token is rejected, a replacement is sampled from the residual distribution, max(0, p_target - p_draft) renormalized, and the remaining draft tokens are discarded. The mathematical guarantee is that the final output distribution is identical to what the large model would have produced alone, with no approximation.

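import torch
import torch.nn.functional as F

def speculative_sample(draft_model, target_model, input_ids, k=5):
    # Assumes batch size 1 and models whose forward output exposes .logits
    n = input_ids.shape[1]
    draft_tokens = []
    draft_dists = []  # full draft distribution at each drafted position
    current_ids = input_ids.clone()

    # Generate k draft tokens autoregressively with the small model
    for _ in range(k):
        with torch.no_grad():
            logits = draft_model(current_ids).logits[:, -1, :]
        probs = F.softmax(logits, dim=-1)
        token = torch.multinomial(probs, 1)
        draft_tokens.append(token)
        draft_dists.append(probs[0])
        current_ids = torch.cat([current_ids, token], dim=1)

    # Verify with the large model in one pass. Keep k+1 positions: the first k
    # score the draft tokens, the last supplies a bonus token if all are accepted.
    with torch.no_grad():
        all_logits = target_model(current_ids).logits
    target_probs = F.softmax(all_logits[0, n - 1:, :], dim=-1)

    # Accept/reject each draft token
    accepted = []
    for i, (tok, q) in enumerate(zip(draft_tokens, draft_dists)):
        p = target_probs[i]
        t = tok.item()
        accept_prob = min(1.0, (p[t] / q[t]).item())
        if torch.rand(1).item() < accept_prob:
            accepted.append(tok)
        else:
            # Rejected: resample from the residual max(0, p - q), renormalized,
            # then stop. This step is what keeps the output distribution exact.
            residual = torch.clamp(p - q, min=0.0)
            residual = residual / residual.sum()
            accepted.append(torch.multinomial(residual, 1).unsqueeze(0))
            break
    else:
        # Every draft token was accepted: sample one extra token from the target
        accepted.append(torch.multinomial(target_probs[k], 1).unsqueeze(0))

    return torch.cat([input_ids] + accepted, dim=1)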
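
The distribution-preserving claim can be checked numerically on a toy vocabulary. The sketch below is a pure-Python illustration, not part of any library: the names speculative_step and empirical_distribution and the two example distributions are made up for this check. It proposes a token from the draft distribution, accepts it with probability min(1, p/q), and otherwise resamples from the renormalized residual max(0, p - q).

```python
import random

def speculative_step(p, q, rng):
    """One accept/reject step over a toy vocabulary.

    p: target probabilities, q: draft probabilities (lists that sum to 1).
    """
    vocab = range(len(p))
    x = rng.choices(vocab, weights=q)[0]        # draft proposes x ~ q
    if rng.random() < min(1.0, p[x] / q[x]):    # accept with prob min(1, p/q)
        return x
    # Rejected: resample from the residual max(0, p - q).
    # rng.choices normalizes the weights for us.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(vocab, weights=residual)[0]

def empirical_distribution(p, q, n_samples=200_000, seed=0):
    """Frequency of each token over many independent steps."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    for _ in range(n_samples):
        counts[speculative_step(p, q, rng)] += 1
    return [c / n_samples for c in counts]

p_target = [0.6, 0.3, 0.1]   # toy target distribution (illustrative values)
q_draft = [0.2, 0.5, 0.3]    # deliberately poor draft distribution
freqs = empirical_distribution(p_target, q_draft)
print([round(f, 3) for f in freqs])
```

Even with a draft that disagrees strongly with the target, the printed frequencies should land close to p_target. A poor draft costs speed (more rejections), not correctness.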

Medusa: Multi-Head Parallel Prediction

Medusa (arXiv:2401.10774) adds multiple decoding heads to the target model itself, each predicting the token at a different future position, so no separate draft model is needed. The heads' top candidates are arranged into a tree, and a tree-based verification scheme checks them in a single forward pass and accepts the longest valid continuation. Medusa achieves a 2-3x speedup.
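
To make the idea of extra decoding heads concrete, here is a minimal PyTorch sketch. It assumes each head is a single residual feed-forward block followed by a vocabulary projection, applied to the backbone's final hidden state. The class names, sizes, and number of heads are illustrative and not taken from the Medusa codebase.

```python
import torch
import torch.nn as nn

class MedusaHead(nn.Module):
    """One residual feed-forward block followed by a vocabulary projection."""

    def __init__(self, hidden_size, vocab_size):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.act = nn.SiLU()
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden):
        # Residual connection keeps the head close to the backbone's features
        return self.lm_head(hidden + self.act(self.proj(hidden)))

class MedusaHeads(nn.Module):
    """A stack of heads; head i scores the token i + 2 positions ahead,
    since the backbone's own LM head already covers the next token."""

    def __init__(self, hidden_size, vocab_size, num_heads=4):
        super().__init__()
        self.heads = nn.ModuleList(
            [MedusaHead(hidden_size, vocab_size) for _ in range(num_heads)]
        )

    def forward(self, last_hidden):
        # last_hidden: (batch, hidden_size), final hidden state at the last position
        # returns:     (batch, num_heads, vocab_size)
        return torch.stack([head(last_hidden) for head in self.heads], dim=1)

torch.manual_seed(0)
heads = MedusaHeads(hidden_size=64, vocab_size=1000, num_heads=4)
logits = heads(torch.randn(2, 64))   # random stand-in for backbone hidden states
top1 = logits.argmax(dim=-1)         # one candidate token per head
```

In a real setup the top few tokens from each head, not just the top one, would be combined into the candidate tree that the target model then verifies.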

EAGLE: Context-Aware Drafting

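EAGLE improves on Medusa by making the draft head autoregressive and context-aware: it takes the target model's hidden states together with the tokens already sampled and predicts the next hidden state, which the target's own output layer turns into draft tokens. This increases acceptance rates and pushes speedups toward 3x on Llama 2 70B.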
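
A rough sketch of what a hidden-state-driven draft head can look like follows. It is an EAGLE-style illustration under several assumptions. The class name is hypothetical, PyTorch's nn.TransformerEncoderLayer with a causal mask stands in for a decoder layer, and the embedding and LM head are freshly initialized stand-ins for the target model's own frozen ones.

```python
import torch
import torch.nn as nn

class EagleStyleDraftHead(nn.Module):
    """Hypothetical draft head: fuse target features with the tokens one step
    ahead, then predict the next feature with one causal transformer layer."""

    def __init__(self, hidden_size, vocab_size, num_attn_heads=4):
        super().__init__()
        # Stand-ins: a real setup would reuse the target's embedding and LM head
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
        self.fuse = nn.Linear(2 * hidden_size, hidden_size)
        self.layer = nn.TransformerEncoderLayer(
            hidden_size, num_attn_heads,
            dim_feedforward=4 * hidden_size, batch_first=True,
        )

    def forward(self, features, next_tokens):
        # features:    (batch, seq, hidden) target hidden states at positions 0..seq-1
        # next_tokens: (batch, seq) token ids at positions 1..seq (one step ahead)
        x = self.fuse(torch.cat([features, self.embed(next_tokens)], dim=-1))
        seq = x.shape[1]
        # Additive causal mask: position i may only attend to positions <= i
        causal = torch.triu(
            torch.full((seq, seq), float("-inf"), device=x.device), diagonal=1
        )
        predicted_features = self.layer(x, src_mask=causal)
        return self.lm_head(predicted_features)   # draft logits per position

head = EagleStyleDraftHead(hidden_size=64, vocab_size=1000)
features = torch.randn(2, 5, 64)                  # random stand-in for target features
next_tokens = torch.randint(0, 1000, (2, 5))
draft_logits = head(features, next_tokens)
```

Feeding the already-sampled token alongside the hidden state is what lets the head condition on the actual context rather than guessing each future position independently.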

When Speculative Decoding Wins

Speculative decoding helps most when inference is memory-bandwidth-bound (batch size 1 or small batches), the draft model is fast and cheap (for example, a 7B draft for a 70B target), and the draft acceptance rate is high (>70%). It is less helpful for large batches, where the target model is already compute-bound and verifying extra draft tokens is no longer nearly free.
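
These conditions can be turned into a back-of-the-envelope estimate. The sketch below makes three simplifying assumptions. Each draft token is accepted independently with probability alpha. Verifying k drafted tokens costs about as much as one ordinary target step, which is the memory-bound regime. One draft step costs cost_ratio times one target step. The function names and example numbers are illustrative.

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens emitted per target forward pass when each of the
    k draft tokens is accepted independently with probability alpha."""
    if alpha >= 1.0:
        return float(k + 1)          # all drafts accepted, plus the bonus token
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def estimated_speedup(alpha, k, cost_ratio):
    """Rough speedup over plain decoding.

    cost_ratio: time of one draft step divided by time of one target step.
    Each round costs k draft steps plus one target verification pass.
    """
    return expected_tokens_per_step(alpha, k) / (k * cost_ratio + 1)

# Example: 80% acceptance, 5 drafted tokens, draft 10x cheaper than target
print(round(expected_tokens_per_step(0.8, 5), 2))   # 3.69
print(round(estimated_speedup(0.8, 5, 0.1), 2))     # 2.46
```

With a lower acceptance rate or a more expensive draft model the estimate falls quickly, which is why the acceptance rate and the draft-to-target cost ratio matter most.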
