Model routing is the practice of automatically directing each query to the cheapest model capable of handling it well. GPT-4o-mini, at $0.15 per million input tokens, handles most production tasks nearly as well as GPT-4o at $2.50 per million input tokens. Even a simple routing strategy, in which complex queries go to the expensive model and everything else goes to the cheap one, typically reduces costs by 50-70% with less than 2% quality degradation on standard tasks.
The Cost Difference Is Enormous
To understand why routing matters, start with the numbers. As of May 2026:
- GPT-4o: $2.50/1M input tokens, $10.00/1M output tokens
- GPT-4o-mini: $0.15/1M input tokens, $0.60/1M output tokens
- Claude 3.5 Sonnet: $3.00/1M input, $15.00/1M output
- Claude 3.5 Haiku: $0.80/1M input, $4.00/1M output
That is roughly a 17x price difference between OpenAI's premium and economy tiers, and just under 4x between Anthropic's. An application processing 10 million input tokens per day costs $25/day on GPT-4o and $1.50/day on GPT-4o-mini, a difference of about $8,600/year for a relatively modest workload, before counting output tokens.
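The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the `PRICES` table copies the GPT-4o and GPT-4o-mini list prices quoted earlier, `daily_cost` is a hypothetical helper name, and the workload treats all 10 million daily tokens as input tokens, as the figures above do.

```python
# List prices in USD per 1M tokens, copied from the price list above.
PRICES = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}


def daily_cost(model: str, input_tokens: int, output_tokens: int = 0) -> float:
    """Return the USD cost of one day's traffic on the given model."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Assumed workload: 10M input tokens per day, output tokens ignored.
premium = daily_cost("gpt-4o", 10_000_000)
economy = daily_cost("gpt-4o-mini", 10_000_000)
yearly_gap = (premium - economy) * 365

print(f"${premium:.2f}/day vs ${economy:.2f}/day, ${yearly_gap:,.2f}/year apart")
```

Passing a non-zero `output_tokens` value widens the gap further, because output tokens are priced at four times the input rate on both models.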
The question is not whether cheaper models exist but whether they are good enough for your specific queries.
When Cheaper Models Are Good Enough
The research here is broadly consistent. For most classification tasks, extraction tasks, and simple question answering, smaller and cheaper models perform within a few percentage points of the larger ones. Arora et al. (2023), "Ask Me Anything," and the LLM routing literature suggest that most production queries (an estimated 60-80%) are straightforward enough for a small model.
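To see what a 60-80% share of simple queries would mean for spend, the blended price can be computed directly. The sketch below is an illustration under two stated assumptions: every query uses the same number of tokens regardless of tier, and only the GPT-4o and GPT-4o-mini input prices quoted earlier are counted. `blended_savings` is a hypothetical helper name.

```python
def blended_savings(cheap_share: float, cheap_price: float, premium_price: float) -> float:
    """Fraction of spend saved versus sending every query to the premium model."""
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be between 0 and 1")
    # Weighted average price per token when cheap_share of traffic is routed down.
    blended = cheap_share * cheap_price + (1.0 - cheap_share) * premium_price
    return 1.0 - blended / premium_price


# Input prices in USD per 1M tokens: GPT-4o-mini vs GPT-4o.
for share in (0.6, 0.7, 0.8):
    saved = blended_savings(share, cheap_price=0.15, premium_price=2.50)
    print(f"{share:.0%} routed to the cheap model -> {saved:.1%} saved")
```

Under these assumptions the savings track the cheap share closely, because the cheap model costs only about 6% of the premium price per token.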
Tasks where cheap models match or nearly match expensive ones:
- Text classification (sentiment, intent, topic)
- Entity extraction from structured text
- Simple question answering with provided context
- Text formatting and reformatting
- Summarization of short documents
- Simple code completion and explanation
Tasks where expensive models pull ahead:
- Multi-step reasoning chains
- Complex code generation (especially with multiple interdependencies)
- Long-context synthesis across multiple documents
- Nuanced writing with specific style requirements
- Math and logic problems
- Anything requiring persistent context management
Routing Strategy 1: Rule-Based Routing
The simplest routing strategy uses hard rules to decide which model handles a request.
```typescript
type ModelTier = "cheap" | "expensive";

// Keywords that suggest a coding task. "class" is matched only as a whole word
// so that "classification" and "classify" do not trigger escalation.
const CODE_KEYWORDS =
  /\b(?:debug|refactor|implement|algorithm|function|error)\w*\b|\bclass(?:es)?\b/i;

function hasCodeBlock(text: string): boolean {
  return text.includes("```") || text.includes("def ") || text.includes("function ");
}

function isCodeRelated(text: string): boolean {
  return CODE_KEYWORDS.test(text);
}

function routeRequest(query: string, context?: string): ModelTier {
  // Check the escalation rules first, so that a short query about code
  // or a short query over a long context is not sent to the cheap model
  if (hasCodeBlock(query) || isCodeRelated(query)) {
    return "expensive";
  }
  // Route to expensive model for long context (length in characters)
  if (context && context.length > 8000) {
    return "expensive";
  }
  // Everything else, including short plain-language queries, goes to the cheap model
  return "cheap";
}
```
Rule-based routing is easy to implement, predictable, and requires no additional API calls. The downside is that rules are brittle and require manual maintenance as your use case evolves.
Routing Strategy 2: Classifier-Based Routing
A classifier-based router uses a small, fast model to predict which model should handle the request. This is the approach taken by Martian (used by major enterprises) and several academic routing systems.
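The classifier is trained on examples of queries labeled with the "correct" model tier. You create the training data by running a sample of your production queries through both models, comparing quality, and labeling which model was sufficient for each. Before you have that labeled data, a zero-shot prompt to a small model is a reasonable starting point, as in the following example: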
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = """You are a query complexity classifier.
Classify the user query as:
- SIMPLE: basic question, extraction, classification, or formatting
- COMPLEX: reasoning, code generation, multi-step analysis, or synthesis
Respond with only SIMPLE or COMPLEX."""


def route_with_classifier(query: str) -> str:
    """Return the name of the model that should handle `query`."""
    # Use a small, fast model as the classifier
    routing_response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": query},
        ],
        max_tokens=10,
        temperature=0,  # keep the label as repeatable as possible
    )
    # content can be None; any reply other than SIMPLE escalates to the larger model
    complexity = (routing_response.choices[0].message.content or "").strip().upper()
    return "gpt-4o-mini" if complexity == "SIMPLE" else "gpt-4o"
```
Using GPT-4o-mini to route (at $0.15/1M input tokens) adds little cost, though it does add one extra API call of latency to every request. In exchange, it is usually more accurate than hand-written rules.
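The labeling step described above (run a sample through both models, compare quality, record which tier was sufficient) can be sketched as follows. This is an illustration, not a prescribed pipeline: it assumes you already have a quality score between 0 and 1 for each answer from some grader, and the `ScoredQuery` type, the `label_tier` helper, and the 0.05 tolerance are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass
class ScoredQuery:
    query: str
    cheap_score: float      # quality of the cheap model's answer, 0.0 to 1.0
    expensive_score: float  # quality of the expensive model's answer, 0.0 to 1.0


def label_tier(sample: ScoredQuery, tolerance: float = 0.05) -> str:
    """Label a query SIMPLE when the cheap answer is within `tolerance` of the expensive one."""
    if sample.cheap_score >= sample.expensive_score - tolerance:
        return "SIMPLE"
    return "COMPLEX"


def build_training_set(
    samples: list[ScoredQuery], tolerance: float = 0.05
) -> list[tuple[str, str]]:
    """Turn scored samples into (query, label) pairs for training a router."""
    return [(s.query, label_tier(s, tolerance)) for s in samples]


# Made-up scores, for illustration only.
samples = [
    ScoredQuery("Classify the sentiment of: 'great product'", 0.95, 0.96),
    ScoredQuery("Prove that the sum of two odd integers is even", 0.40, 0.90),
]
training_set = build_training_set(samples)
for query, label in training_set:
    print(f"{label}: {query}")
```

The resulting pairs can feed any text classifier, or serve as few-shot examples in the routing prompt.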
Routing Strategy 3: Cascade Routing
Cascade routing tries the cheap model first and escalates to the expensive model only when the cheap model is uncertain about its answer.
The challenge is determining when the cheap model is uncertain. One approach: ask the cheap model to rate its own confidence after generating an answer. If it rates itself below a threshold, re-run with the expensive model.
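```python
def cascade_route(query: str) -> str:
    # call_model and extract_confidence_score are helpers you supply:
    # call_model(model, prompt) returns the reply text, and
    # extract_confidence_score(reply) parses the 1-10 rating out of it.
    # First, try the cheap model
    cheap_response = call_model(
        "gpt-4o-mini", query + "\nAlso rate your confidence 1-10."
    )
    confidence = extract_confidence_score(cheap_response)
    if confidence >= 8:
        # Keep the cheap model's answer (it still contains the rating text)
        return cheap_response
    # Escalate to the expensive model
    return call_model("gpt-4o", query)
```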
Cascade routing is elegant but has two problems: it adds latency (two sequential model calls on uncertain queries) and self-reported confidence scores from LLMs are not well-calibrated. Use it when latency is not a concern and you want maximum simplicity.
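For a version of the cascade that runs end to end, the sketch below fills in the parsing step. It is one possible design under stated assumptions: the model call is passed in as a `call_model(model, prompt)` function so the logic can be exercised without an API key, the `Confidence: N` reply format is a convention invented for this example, and a missing or malformed rating is treated as zero confidence so the query escalates.

```python
import re
from typing import Callable

# Assumed convention: the model ends its reply with a line like "Confidence: 7".
CONFIDENCE_SUFFIX = (
    "\n\nEnd your reply with a final line of the form 'Confidence: N', "
    "where N is an integer from 1 to 10."
)
CONFIDENCE_RE = re.compile(
    r"^\s*Confidence:\s*(\d{1,2})\s*$", re.IGNORECASE | re.MULTILINE
)


def split_confidence(reply: str) -> tuple[str, int]:
    """Return (answer without the rating line, confidence); 0 if no valid rating."""
    matches = list(CONFIDENCE_RE.finditer(reply))
    if not matches:
        return reply.strip(), 0
    last = matches[-1]
    score = int(last.group(1))
    if not 1 <= score <= 10:
        return reply.strip(), 0
    return reply[: last.start()].strip(), score


def cascade_route(
    query: str,
    call_model: Callable[[str, str], str],
    cheap: str = "gpt-4o-mini",
    expensive: str = "gpt-4o",
    threshold: int = 8,
) -> str:
    """Try the cheap model first; escalate when its self-rating is below threshold."""
    reply = call_model(cheap, query + CONFIDENCE_SUFFIX)
    answer, confidence = split_confidence(reply)
    if confidence >= threshold:
        return answer  # rating line already stripped
    return call_model(expensive, query)


# Stub standing in for a real API call, for demonstration only.
def fake_call_model(model: str, prompt: str) -> str:
    if model == "gpt-4o-mini":
        return "Probably 42.\nConfidence: 4"
    return "The answer is 42."


print(cascade_route("What is 6 * 7?", fake_call_model))
```

Stripping the rating line before returning keeps the self-assessment out of what the user sees, and defaulting to escalation on a bad parse trades some cost for safety.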
Implementing With OpenRouter or LiteLLM
Rather than building your own routing infrastructure, use an existing proxy:
OpenRouter (openrouter.ai) provides a unified API across 100+ models and has built-in model routing features. Set the model to "openrouter/auto" and OpenRouter selects an appropriate model, or define explicit routing rules in your request.
LiteLLM is an open-source library that standardizes API calls across providers and supports router patterns including least-cost routing and fallback chains.
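```python
from litellm import Router

# Each entry maps an alias ("cheap", "expensive") to a concrete deployment
router = Router(
    model_list=[
        {"model_name": "cheap", "litellm_params": {"model": "gpt-4o-mini"}},
        {"model_name": "expensive", "litellm_params": {"model": "gpt-4o"}},
    ]
)

query = "Summarize this support ticket in one sentence."  # example input

# Your routing logic picks the alias; LiteLLM handles the provider call,
# plus fallbacks and retries when they are configured on the Router
response = router.completion(
    model="cheap",  # or "expensive" based on your routing logic
    messages=[{"role": "user", "content": query}],
)
print(response.choices[0].message.content)
```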
Measuring Routing Quality
After implementing routing, measure whether it is working correctly by sampling routed requests and comparing quality:
- Log which model handled each request
- Run a random sample (5%) of cheap-model responses through an LLM-as-judge comparison against expensive-model responses
- Calculate the percentage where the judge would have preferred the expensive model
- If that percentage is low (under 10%), your routing is working well
- If it is high (over 20%), tighten your routing rules to send more to the expensive tier
The goal is not to minimize the expensive-model percentage but to send queries to the expensive model if and only if they need it.