Model Routing: How to Cut LLM Costs 50-70% Without Sacrificing Quality

Model routing automatically sends simple queries to cheap models and complex ones to expensive models. With GPT-4o-mini at $0.15 per 1M input tokens versus GPT-4o at $2.50, the savings are substantial.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 17, 2026

9 min read


Model routing is the practice of automatically directing each query to the cheapest model capable of handling it well. GPT-4o-mini, at $0.15 per million input tokens, handles most production tasks nearly as well as GPT-4o at $2.50 per million input tokens. Even a simple routing strategy, where straightforward queries go to the cheap model and anything more complex goes to the expensive one, typically reduces costs by 50-70% with less than 2% quality degradation on standard tasks.

The Cost Difference Is Enormous

To understand why routing matters, start with the numbers. As of May 2026:

GPT-4o: $2.50/1M input tokens, $10.00/1M output tokens
GPT-4o-mini: $0.15/1M input tokens, $0.60/1M output tokens
Claude 3.5 Sonnet: $3.00/1M input, $15.00/1M output
Claude 3.5 Haiku: $0.80/1M input, $4.00/1M output

That is roughly a 17x price difference between GPT-4o and GPT-4o-mini, and just under 4x between Claude 3.5 Sonnet and Claude 3.5 Haiku. An application processing 10 million input tokens per day costs $25/day on GPT-4o and $1.50/day on GPT-4o-mini, a difference of about $8,600/year in input costs alone for a relatively modest workload.
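The arithmetic above counts input tokens only. As a rough sketch, the same calculation with output tokens included might look like this. The prices are the ones listed above; the split of 10M input and 2M output tokens per day is an illustrative assumption, not a measurement.

```python
# Prices in USD per 1M tokens, copied from the list above.
PRICES = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}


def daily_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one day's traffic sent entirely to one model."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Hypothetical workload: 10M input tokens and 2M output tokens per day.
INPUT_TOKENS, OUTPUT_TOKENS = 10_000_000, 2_000_000

premium = daily_cost("gpt-4o", INPUT_TOKENS, OUTPUT_TOKENS)
economy = daily_cost("gpt-4o-mini", INPUT_TOKENS, OUTPUT_TOKENS)
print(f"gpt-4o: ${premium:.2f}/day, gpt-4o-mini: ${economy:.2f}/day")
```

Under these assumed volumes the daily bill comes to about $45 on GPT-4o and about $2.70 on GPT-4o-mini, so the absolute gap widens once output tokens are counted while the ratio between the two models stays the same.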

The question is not whether cheaper models exist but whether they are good enough for your specific queries.

When Cheaper Models Are Good Enough

The research here is fairly consistent. For most classification tasks, extraction tasks, and simple question answering, smaller and cheaper models perform within a few percentage points of the larger ones. Arora et al. (2023), "Ask Me Anything," found that a much smaller model with a better prompting strategy can match a far larger one on many benchmarks, and the LLM routing literature suggests that most production queries (an estimated 60-80%) are straightforward enough for a small model.

Tasks where cheap models match or nearly match expensive ones:

Text classification (sentiment, intent, topic)
Entity extraction from structured text
Simple question answering with provided context
Text formatting and reformatting
Summarization of short documents
Simple code completion and explanation

Tasks where expensive models pull ahead:

Multi-step reasoning chains
Complex code generation (especially with multiple interdependencies)
Long-context synthesis across multiple documents
Nuanced writing with specific style requirements
Math and logic problems
Anything requiring persistent context management
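To see how the share of simple queries translates into savings, here is a small sketch of the blended cost. It assumes, as a simplification, that simple and complex queries use the same number of tokens on average, which real traffic rarely does exactly. The prices are the GPT-4o and GPT-4o-mini input prices listed earlier, and `blended_savings` is a hypothetical helper name.

```python
def blended_savings(cheap_share: float, cheap_price: float, expensive_price: float) -> float:
    """Fraction of spend saved versus sending everything to the expensive model.

    Assumes every query consumes the same number of tokens, so cost is
    proportional to the per-token price of the model that handles it.
    """
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be between 0 and 1")
    blended = cheap_share * cheap_price + (1.0 - cheap_share) * expensive_price
    return 1.0 - blended / expensive_price


# Input prices per 1M tokens from the pricing list earlier in the article.
for share in (0.6, 0.7, 0.8):
    saved = blended_savings(share, cheap_price=0.15, expensive_price=2.50)
    print(f"{share:.0%} routed to the cheap model -> {saved:.1%} saved")
```

With 60-80% of traffic on the cheap tier, this simplified model gives savings of roughly 56% to 75%, which is in the same range as the 50-70% figure quoted above.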

Routing Strategy 1: Rule-Based Routing

The simplest routing strategy uses hard rules to decide which model handles a request.

type ModelTier = "cheap" | "expensive";

function routeRequest(query: string, context?: string): ModelTier {
  // Route to expensive model for long context, checked first so that a
  // short query attached to a large context is not sent to the cheap model
  if (context && context.length > 8000) {
    return "expensive";
  }

  // Route to cheap model for short, simple queries
  if (query.length < 200 && !hasCodeBlock(query)) {
    return "cheap";
  }

  // Route to expensive model for longer queries that involve code
  if (hasCodeBlock(query) || isCodeRelated(query)) {
    return "expensive";
  }

  // Default to cheap
  return "cheap";
}

function hasCodeBlock(text: string): boolean {
  return text.includes("```") || text.includes("def ") || text.includes("function ");
}

function isCodeRelated(text: string): boolean {
  // Whole-word match, so "classify" and "terror" do not trigger on "class" and "error"
  const codeKeywords = /\b(debug\w*|refactor\w*|implement\w*|algorithms?|functions?|class(es)?|errors?)\b/i;
  return codeKeywords.test(text);
}

Rule-based routing is easy to implement, predictable, and requires no additional API calls. The downside is that rules are brittle and require manual maintenance as your use case evolves.

Routing Strategy 2: Classifier-Based Routing

A classifier-based router uses a small, fast model to predict which model should handle the request. This is the approach taken by Martian (used by major enterprises) and several academic routing systems.

The classifier is trained on examples of queries labeled with the "correct" model tier. You create the training data by running a sample of your production queries through both models, comparing quality, and labeling which model was sufficient for each.

from openai import OpenAI

client = OpenAI()

def route_with_classifier(query: str) -> str:
    # Use a tiny, fast model as the classifier
    routing_response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": """You are a query complexity classifier.
Classify the user query as:
- SIMPLE: basic question, extraction, classification, or formatting
- COMPLEX: reasoning, code generation, multi-step analysis, or synthesis

Respond with only SIMPLE or COMPLEX."""
        }, {
            "role": "user",
            "content": query
        }],
        max_tokens=10
    )

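    # content can be None; anything other than a clear SIMPLE is treated as complex
    complexity = (routing_response.choices[0].message.content or "").strip().upper()
    return "gpt-4o-mini" if complexity.startswith("SIMPLE") else "gpt-4o"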

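Note that this example does not train a classifier; it prompts GPT-4o-mini to act as one, which is the quickest way to get started. The routing call costs very little (a short prompt and a one-word reply at $0.15/1M input tokens), though it does add one extra round trip of latency, and it usually generalizes better than hand-written rules.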
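The labeling step described earlier in this section, where each sampled query is run through both models and marked with the cheapest tier that was good enough, might look like the sketch below. The `ScoredQuery` record, the 0-1 quality scores, and the `tolerance` value are all illustrative assumptions; how you score quality (human review, an LLM judge, exact-match checks) is up to you.

```python
from dataclasses import dataclass


@dataclass
class ScoredQuery:
    """One sampled production query with a quality score per model (0.0 to 1.0)."""
    text: str
    cheap_score: float
    expensive_score: float


def label_tier(sample: ScoredQuery, tolerance: float = 0.05) -> str:
    """Label a query SIMPLE when the cheap model scores within `tolerance` of the expensive one."""
    if sample.cheap_score >= sample.expensive_score - tolerance:
        return "SIMPLE"
    return "COMPLEX"


def build_training_set(samples: list[ScoredQuery], tolerance: float = 0.05) -> list[tuple[str, str]]:
    """Turn scored samples into (query, label) pairs for training a classifier."""
    return [(s.text, label_tier(s, tolerance)) for s in samples]


# Hypothetical scores, for illustration only.
samples = [
    ScoredQuery("Is this review positive or negative?", 0.97, 0.98),
    ScoredQuery("Refactor this module to remove the circular import.", 0.55, 0.90),
]
for text, label in build_training_set(samples):
    print(f"{label:8} {text}")
```

The resulting pairs can feed any text classifier you prefer. A larger `tolerance` sends more traffic to the cheap tier at the price of more quality misses, so it is worth revisiting once you have real evaluation data.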

Routing Strategy 3: Cascade Routing

Cascade routing tries the cheap model first and escalates to the expensive model only when the cheap model is uncertain about its answer.

The challenge is determining when the cheap model is uncertain. One approach: ask the cheap model to rate its own confidence after generating an answer. If it rates itself below a threshold, re-run with the expensive model.

def cascade_route(query: str) -> str:
    # call_model and extract_confidence_score are placeholder helpers:
    # one sends a prompt to the named model, the other parses the 1-10 rating.
    # First, try cheap model
    cheap_response = call_model(
        "gpt-4o-mini", query + "\n\nAlso rate your confidence 1-10."
    )

    confidence = extract_confidence_score(cheap_response)

    if confidence >= 8:
        return cheap_response  # Keep cheap model answer

    # Escalate to expensive model
    return call_model("gpt-4o", query)

Cascade routing is elegant but has two problems: it adds latency (two sequential model calls on uncertain queries) and self-reported confidence scores from LLMs are not well-calibrated. Use it when latency is not a concern and you want maximum simplicity.
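One way to implement the confidence check is to ask for the rating on a fixed final line and parse it with a regular expression. The sketch below assumes a hypothetical "Confidence: N" line format and takes the model call as a plain function argument, so it is not tied to any particular SDK. A missing or out-of-range rating is treated as zero confidence, which forces escalation.

```python
import re
from typing import Callable

# Hypothetical instruction appended to the cheap model's prompt.
CONFIDENCE_SUFFIX = "\n\nEnd your reply with a line of the form 'Confidence: N' (N from 1 to 10)."
CONFIDENCE_LINE = re.compile(r"^\s*confidence:\s*(\d{1,2})\b.*$", re.IGNORECASE | re.MULTILINE)


def split_confidence(reply: str) -> tuple[str, int]:
    """Return (answer without the rating line, rating); the rating is 0 if absent or invalid."""
    matches = list(CONFIDENCE_LINE.finditer(reply))
    if not matches:
        return reply.strip(), 0
    last = matches[-1]  # use the final rating line if the model wrote several
    score = int(last.group(1))
    if not 1 <= score <= 10:
        score = 0
    answer = (reply[: last.start()] + reply[last.end():]).strip()
    return answer, score


def cascade(query: str, call_model: Callable[[str, str], str], threshold: int = 8) -> str:
    """Try the cheap model first; escalate when its self-rated confidence is below threshold."""
    answer, score = split_confidence(call_model("gpt-4o-mini", query + CONFIDENCE_SUFFIX))
    if score >= threshold:
        return answer
    return call_model("gpt-4o", query)
```

Stripping the rating line before returning keeps the confidence score out of the text the user sees. Here `call_model` is any function that takes a model name and a prompt and returns the reply text.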

Implementing With OpenRouter or LiteLLM

Rather than building your own routing infrastructure, use an existing proxy:

OpenRouter (openrouter.ai) provides a unified API across 100+ models and has built-in model routing features. Set the model to "openrouter/auto" and OpenRouter selects the appropriate model, or define explicit routing rules in your request.

LiteLLM is an open-source library that standardizes API calls across providers and supports router patterns including least-cost routing and fallback chains.

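from litellm import Router

router = Router(
    model_list=[
        {"model_name": "cheap", "litellm_params": {"model": "gpt-4o-mini"}},
        {"model_name": "expensive", "litellm_params": {"model": "gpt-4o"}}
    ],
    # If a "cheap" call fails, retry it on the "expensive" deployment
    fallbacks=[{"cheap": ["expensive"]}],
    num_retries=2
)

# Example query; in practice this comes from your application
query = "Classify the sentiment of this review: great product!"

# LiteLLM handles routing, fallbacks, and retries
response = router.completion(
    model="cheap",  # or "expensive" based on your routing logic
    messages=[{"role": "user", "content": query}]
)
print(response.choices[0].message.content)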

Measuring Routing Quality

After implementing routing, measure whether it is working correctly by sampling routed requests and comparing quality:

Log which model handled each request
Run a random sample (5%) of cheap-model responses through an LLM-as-judge comparison against expensive-model responses
Calculate the percentage where the judge would have preferred the expensive model
If that percentage is low (under 10%), your routing is working well
If it is high (over 20%), tighten your routing rules to send more to the expensive tier

The goal is not to minimize the expensive-model percentage but to send queries to the expensive model if and only if they need it.
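The evaluation loop above reduces to a little bookkeeping. This sketch assumes you already have judge verdicts for the sampled cheap-model responses (how the judge is prompted is out of scope here). The function names are hypothetical, the 10% and 20% thresholds are the ones given in the list, and the "borderline" band between them is an added assumption, since the list does not say what to do there.

```python
import random


def sample_for_review(request_ids: list[str], rate: float = 0.05, seed: int = 0) -> list[str]:
    """Pick a random sample of cheap-model requests to send to the judge."""
    if not request_ids:
        return []
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    k = max(1, round(len(request_ids) * rate))
    return rng.sample(request_ids, k)


def routing_verdict(judge_prefers_expensive: list[bool]) -> tuple[float, str]:
    """Return the share of samples where the judge preferred the expensive model, plus an action."""
    if not judge_prefers_expensive:
        raise ValueError("need at least one judged sample")
    rate = sum(judge_prefers_expensive) / len(judge_prefers_expensive)
    if rate < 0.10:
        return rate, "routing looks healthy"
    if rate > 0.20:
        return rate, "tighten rules: send more traffic to the expensive tier"
    return rate, "borderline: keep monitoring"


# Hypothetical verdicts: the judge preferred the expensive model on 1 of 20 samples.
rate, action = routing_verdict([True] + [False] * 19)
print(f"{rate:.0%} preferred the expensive model: {action}")
```

Small samples make this rate noisy, so it is worth accumulating verdicts over several days before changing routing rules.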

Keep Reading

Cutting LLM API Costs: The Complete Guide — The full framework for reducing LLM spend without sacrificing quality.
LLM API Pricing Comparison 2026 — Current pricing across all major providers to inform your routing thresholds.
Semantic Caching LLM Responses — Combine with routing for maximum cost reduction.
