LLM API costs are the fastest-growing infrastructure line item for AI-powered products in 2026. The techniques for cutting those costs without degrading quality are well-established but rarely documented clearly in one place. The short answer: route simple tasks to cheap models, cache repeated prompts, batch anything that is not time-sensitive, trim prompts and responses, and move high-volume tasks to local models. A startup I worked with cut its monthly bill from $800 to $320 by applying four of these techniques systematically over two weeks.
This guide covers every technique that works, with real pricing numbers and specific implementation guidance.
Current LLM API Pricing (May 2026)
Understanding the price spread is the foundation of cost reduction. In the table below, the most expensive model (Claude 3.5 Sonnet) costs roughly 40x more on input and 50x more on output than the cheapest (Gemini 1.5 Flash).
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-4o-mini | $0.15 | $0.60 |
| Claude 3.5 Sonnet | $3.00 | $15.00 |
| Claude 3 Haiku | $0.25 | $1.25 |
| Claude 3.5 Haiku | $0.80 | $4.00 |
| Gemini 1.5 Pro | $1.25 | $5.00 |
| Gemini 1.5 Flash | $0.075 | $0.30 |
| Gemini 2.0 Flash | $0.10 | $0.40 |
| DeepSeek V3 | $0.14 | $0.28 |
| DeepSeek R1 | $0.55 | $2.19 |
Source: Official pricing pages for each provider, May 2026. Prices fluctuate; verify before budgeting.
The critical insight from this table: Gemini 1.5 Flash at $0.075/$0.30 is the cheapest model listed, roughly 33x cheaper than GPT-4o on both input and output. For tasks where Gemini Flash is good enough, switching from GPT-4o reduces costs by 97%.
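To make the spread concrete, the sketch below prices one request on two models from the table. The `PRICES` figures are copied from the table; the dictionary keys are just labels, and the request size (1,000 input tokens, 200 output tokens) is an illustrative assumption rather than a measurement.

```python
# Prices in USD per 1M tokens, copied from the pricing table above.
PRICES = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gemini-1.5-flash": {"input": 0.075, "output": 0.30},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the cost in USD of a single request on the given model."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Hypothetical request size: 1,000 input tokens and 200 output tokens.
expensive = request_cost("gpt-4o", 1_000, 200)
cheap = request_cost("gemini-1.5-flash", 1_000, 200)
savings = 1 - cheap / expensive

print(f"GPT-4o: ${expensive:.6f}  Flash: ${cheap:.6f}  savings: {savings:.0%}")
```

Because both input and output prices differ by the same factor, the savings percentage stays the same whatever mix of input and output tokens you plug in.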
Technique 1: Model Routing
Model routing means using cheap models for simple tasks and expensive models for complex ones. This is the highest-leverage technique and the first one to implement.
The principle: Most tasks in a typical AI product are not complex. Classification, summarization, short-form content generation, simple Q&A - these do not need GPT-4o. They work fine with GPT-4o-mini, Claude Haiku, or Gemini Flash. Reserve the expensive models for tasks that genuinely require them: complex reasoning, nuanced writing, multi-step problem solving.
How to implement:
Classify your tasks into three tiers:
- Tier 1 (cheap, $0.05-0.30/1M): classification, short summaries, intent detection, simple extraction. Models: GPT-4o-mini, Claude 3 Haiku, Gemini Flash.
- Tier 2 (medium, $0.50-2.00/1M): complex summaries, structured generation, question answering over long documents. Models: Claude 3.5 Haiku, Gemini 1.5 Pro, DeepSeek V3.
- Tier 3 (expensive, $3-15/1M): complex reasoning, agentic tasks, code generation requiring deep context, nuanced writing. Models: GPT-4o, Claude 3.5 Sonnet.
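The three tiers above can be sketched as a lookup table plus a fallback. This is a minimal illustration, not a production router: the task labels are hypothetical names for the task types listed in each tier, and the one-model-per-tier choice is an assumption you would replace with whatever you have validated on your own data.

```python
# Hypothetical task labels mapped to the three tiers described above.
TASK_TIERS = {
    "classification": 1,
    "short_summary": 1,
    "intent_detection": 1,
    "simple_extraction": 1,
    "complex_summary": 2,
    "structured_generation": 2,
    "long_document_qa": 2,
    "complex_reasoning": 3,
    "agentic": 3,
    "deep_context_code": 3,
    "nuanced_writing": 3,
}

# One model per tier, picked from the tier lists above (an assumption).
TIER_MODELS = {
    1: "gpt-4o-mini",
    2: "claude-3-5-haiku-20241022",
    3: "gpt-4o",
}


def pick_model(task_type: str) -> str:
    """Return the model for a task; unknown tasks fall back to Tier 3."""
    tier = TASK_TIERS.get(task_type, 3)
    return TIER_MODELS[tier]


print(pick_model("intent_detection"))
print(pick_model("something_new"))
```

Falling back to the most expensive tier for unrecognized tasks trades cost for safety: a new task type never silently lands on a model that may be too weak for it.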
Real example: A customer support AI application was routing all tickets through GPT-4o at $2.50/$10.00. After analysis, 71% of tickets were simple intent classification (Tier 1), 22% were standard response generation (Tier 2), and only 7% required complex reasoning (Tier 3). Routing each task to the appropriate model reduced the average cost per ticket from $0.043 to $0.009. At 50,000 tickets/month, that is $2,150/month vs $450/month - a 79% reduction.
Technique 2: Prompt Caching
Prompt caching allows you to reuse the processed version of a system prompt across many requests, avoiding the cost of re-processing the same context every time.
Anthropic prompt caching:
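Anthropic's prompt caching gives a 90% discount on cached input tokens. If your system prompt is 2,000 tokens and you send 10,000 requests per day, without caching you pay for 20,000,000 input tokens daily. With caching, the first request writes the cache (billed at a 25% premium over the normal input price), and each later cache hit is billed the equivalent of 200 tokens for the system prompt (10% of 2,000), plus the uncached user message. The cache persists for 5 minutes and the timer resets on every hit, so as long as requests arrive within 5 minutes of each other, the cache stays warm. Anthropic also enforces a model-dependent minimum cacheable prompt length (1,024 tokens or more), so check the documentation for the model you use.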
# Anthropic prompt caching example
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-haiku-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a customer support agent for Acme Corp. "
                    "Here is our complete product documentation: [5000 token document here]",
            "cache_control": {"type": "ephemeral"}  # Mark for caching
        }
    ],
    messages=[
        {"role": "user", "content": "How do I reset my password?"}
    ]
)
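At Claude 3 Haiku rates, the cached portion costs $0.03/1M tokens instead of $0.25/1M tokens. For a 5,000-token system prompt sent with 10,000 requests per day, caching reduces that cost from $12.50/day to $1.50/day. The example above calls Claude 3.5 Haiku, which costs more: at $0.80/1M uncached and $0.08/1M cached, the same prompt drops from $40.00/day to $4.00/day.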
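The daily figures above follow from a one-line formula, sketched here so you can substitute your own prompt size and volume. It is a simplification under stated assumptions: every request is treated as a cache hit, and the one-off cost of writing the cache is ignored.

```python
def daily_prompt_cost(prompt_tokens: int, requests_per_day: int,
                      price_per_million: float) -> float:
    """Daily cost in USD of sending the same prompt with every request."""
    return prompt_tokens * requests_per_day * price_per_million / 1_000_000


# 5,000-token system prompt, 10,000 requests/day, Claude 3 Haiku rates.
uncached = daily_prompt_cost(5_000, 10_000, 0.25)  # normal input price
cached = daily_prompt_cost(5_000, 10_000, 0.03)    # cached-read price

print(f"uncached: ${uncached:.2f}/day  cached: ${cached:.2f}/day")
```

With bursty traffic that lets the 5-minute cache expire, the real saving will be lower, because each expiry triggers a fresh cache write at full price or above.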
OpenAI prompt caching:
OpenAI automatically caches prompts of 1,024 tokens or more at a 50% discount, with no code changes required. Cache hits are matched on the prompt prefix (the first 1,024+ tokens), so keep the stable system prompt at the beginning and variable content at the end.
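A minimal sketch of that ordering rule: the unchanging instructions go in the first message, and anything that varies per request goes last. `STATIC_SYSTEM_PROMPT`, `build_messages`, and the `account_context` parameter are hypothetical names used only for illustration.

```python
# Stable content: identical on every request, so it can form a shared prefix.
STATIC_SYSTEM_PROMPT = (
    "You are a support agent for Acme Corp. "
    "[long, unchanging instructions and documentation here]"
)


def build_messages(user_question: str, account_context: str) -> list[dict]:
    """Put the unchanging prompt first and per-request content last."""
    return [
        {"role": "system", "content": STATIC_SYSTEM_PROMPT},
        {
            "role": "user",
            "content": f"Account context: {account_context}\n\n"
                       f"Question: {user_question}",
        },
    ]


first = build_messages("How do I reset my password?", "plan=pro")
second = build_messages("Where is my invoice?", "plan=free")
print(first[0] == second[0])  # the shared prefix is identical across requests
```

The returned list can be passed as `messages` to `client.chat.completions.create(...)`. To confirm that caching is taking effect, the response's `usage.prompt_tokens_details.cached_tokens` field should report a non-zero count on repeat requests; treat that field name as something to verify against the current API reference.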
When caching helps most: Long system prompts (1,000+ tokens) sent with many requests per day. Customer support bots with detailed product documentation in the system prompt. RAG applications where the system instructions are long and consistent.
When caching does not help: Prompts under 1,024 tokens (OpenAI minimum for auto-caching). Applications where every request has a unique system prompt. Low-volume applications where the setup complexity is not worth the savings.
Technique 3: Batch API
Both OpenAI and Anthropic offer batch processing APIs that give a 50% discount in exchange for asynchronous processing with up to 24-hour completion time.
OpenAI Batch API:
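from openai import OpenAI
import json

client = OpenAI()

# Placeholder input; replace with the documents you want summarized
documents = ["First document text", "Second document text"]

# Build one request object per document
requests = [
    {
        "custom_id": f"request-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "user", "content": f"Summarize: {document}"}
            ],
            "max_tokens": 150
        }
    }
    for i, document in enumerate(documents)
]

# The Batch API expects JSONL: one JSON object per line, not a JSON array
jsonl_bytes = "\n".join(json.dumps(r) for r in requests).encode("utf-8")

# Upload batch file (the tuple supplies a filename for the upload)
batch_file = client.files.create(
    file=("batch_input.jsonl", jsonl_bytes),
    purpose="batch"
)

# Submit batch
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)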
At 50% discount, GPT-4o drops from $2.50/$10.00 to $1.25/$5.00. GPT-4o-mini drops from $0.15/$0.60 to $0.075/$0.30.
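Submitting the batch is only half the job; the results arrive later as a JSONL file. The sketch below shows one way to wait for completion and map each `custom_id` back to its generated text. `wait_for_batch` and `parse_batch_output` are hypothetical helper names, the 60-second polling interval is an arbitrary assumption, and the parser assumes the chat-completions output shape (`response.body.choices[0].message.content`).

```python
import json
import time

# Statuses after which the batch will not change any further.
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}


def wait_for_batch(client, batch_id: str, poll_seconds: float = 60,
                   sleep=time.sleep):
    """Poll the batch until it reaches a terminal status, then return it."""
    while True:
        batch = client.batches.retrieve(batch_id)
        if batch.status in TERMINAL_STATUSES:
            return batch
        sleep(poll_seconds)


def parse_batch_output(jsonl_text: str) -> dict[str, str]:
    """Map each custom_id to the generated text, skipping failed requests."""
    results = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        response = record.get("response")
        if response is None:  # this request errored; see record["error"]
            continue
        message = response["body"]["choices"][0]["message"]
        results[record["custom_id"]] = message["content"]
    return results
```

With the official SDK, usage would look roughly like `done = wait_for_batch(client, batch.id)` followed by `parse_batch_output(client.files.content(done.output_file_id).text)` once `done.status` is `"completed"`.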
When batch API makes sense: Nightly document processing, weekly analytics report generation, bulk content generation tasks that are not user-facing and have no time requirement.
When it does not make sense: Any user-facing task where latency matters. Real-time applications.
Technique 4: Prompt Compression
Long prompts cost more than short prompts. Many prompts include redundant context, excessive formatting, or verbose instructions that do not improve output quality.
What to cut:
Excessive whitespace: Extra blank lines and spacing add tokens with no quality benefit.
Redundant instructions: "Please carefully analyze the following text and provide a detailed, thoughtful response that addresses all the key points" is 24 tokens. "Analyze:" is 2 tokens. In testing on classification and extraction tasks, the verbose version does not produce better results.
Repeated context: If you include system context in both the system prompt and the user message, you are paying twice for the same tokens.
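The whitespace item is the easiest of the three to automate. Below is a minimal sketch using only the standard library; `compact_prompt` is a hypothetical helper name. Note the assumption baked in: it strips leading indentation too, so it is unsuitable for prompts that contain code or other indentation-sensitive text.

```python
import re


def compact_prompt(text: str) -> str:
    """Collapse runs of spaces/tabs, trim each line, and squeeze blank lines."""
    # Normalize spacing inside each line and drop leading/trailing spaces.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    compact = "\n".join(lines)
    # Three or more newlines in a row become a single blank line.
    compact = re.sub(r"\n{3,}", "\n\n", compact)
    return compact.strip()


raw = "Analyze:   \n\n\n\n  the   text \n"
print(repr(compact_prompt(raw)))
```

Redundant instructions and duplicated context need a human read-through; no regular expression can tell which sentence of a prompt is doing useful work.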
How much it saves: A prompt compression exercise across a customer-facing chat application found that 23% of input tokens were redundant whitespace, boilerplate instructions, and duplicated context. Removing them reduced the average prompt length from 680 tokens to 524 tokens, saving 23% of input costs with no measurable quality change.
Tools: LLMLingua (Microsoft Research) is an open-source prompt compression library that reports compression of up to 20x with minimal quality loss on information-dense prompts.
Technique 5: Response Length Control
Output tokens cost 4-5x as much as input tokens for most models in the pricing table above (DeepSeek V3, at 2x, is the exception). Controlling output length meaningfully reduces costs.
Strategies:
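Set max_tokens explicitly. If your use case produces short responses (under 200 words), set max_tokens=300 to prevent the model from over-generating. The cap cuts the response off rather than making it more concise, so treat it as a safety net and pair it with format instructions.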
Use format instructions. "Respond in 2-3 sentences" or "List 5 bullet points, each under 10 words" produce shorter responses than open-ended prompts.
Request structured output. JSON output is often more compact than prose. For data extraction tasks, ask for JSON.
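The three strategies combine naturally in a single request. The sketch below builds the keyword arguments for a chat-completions call for a tagging task; `build_tagging_request` is a hypothetical helper, and the model choice and the 60-token cap are assumptions chosen to comfortably fit a short JSON list.

```python
def build_tagging_request(text: str, max_tags: int = 5) -> dict:
    """Build chat-completion arguments with explicit output limits."""
    return {
        "model": "gpt-4o-mini",
        # Hard cap as a safety net (assumed to fit a short JSON array).
        "max_tokens": 60,
        "messages": [
            {
                "role": "system",
                # Format instruction plus structured output in one line.
                "content": f"Return a JSON array of at most {max_tags} "
                           "short tags. No prose.",
            },
            {"role": "user", "content": text},
        ],
    }


request = build_tagging_request("Quarterly invoice for cloud hosting", max_tags=3)
print(request["messages"][0]["content"])
```

With the OpenAI SDK this would be sent as `client.chat.completions.create(**request)`. The format instruction does the real work of shortening output; the cap only bounds the worst case.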
Impact: On a content tagging application running 50,000 requests/day, the average output was 180 tokens without length control and 85 tokens after adding explicit format instructions. Output cost dropped from $300/month to $141/month on the same model, a 53% reduction that tracks the drop in output tokens.
Technique 6: Local Models for Privacy-Sensitive and High-Volume Tasks
For tasks that do not require cloud-API quality and where data privacy matters or volume is very high, local models via Ollama can reduce per-query cost to zero (excluding infrastructure).
Use cases:
Code review on proprietary code: Many companies prohibit sending source code to external APIs. A self-hosted Ollama instance with Qwen 2.5 7B handles most code review tasks acceptably.
Internal document Q&A: A RAG system over internal documentation using Ollama for the LLM and ChromaDB for retrieval keeps sensitive data on-premises and eliminates per-query API costs.
High-volume classification: If you are classifying 500,000 items per day at roughly 1,000 input tokens each, even GPT-4o-mini at $0.15/1M input tokens costs $75/day, or about $2,250/month. A local Llama 3.1 8B on a $200/month GPU instance handles the same volume for a flat fee with no per-token charge.
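The classification comparison reduces to a break-even calculation. The sketch below reproduces the $75/day figure and estimates the daily volume at which a flat-rate GPU instance becomes cheaper. The assumptions are loud ones: about 1,000 input tokens per item (which is what the $75/day figure implies), a 30-day month, output tokens ignored, and a GPU that can actually sustain the throughput.

```python
def api_cost_per_day(items_per_day: int, tokens_per_item: int,
                     price_per_million: float) -> float:
    """Daily API cost in USD for input tokens only."""
    return items_per_day * tokens_per_item * price_per_million / 1_000_000


TOKENS_PER_ITEM = 1_000       # assumption implied by the $75/day figure
PRICE_PER_MILLION = 0.15      # GPT-4o-mini input price from the table
GPU_MONTHLY = 200.0           # flat instance cost from the text

api_daily = api_cost_per_day(500_000, TOKENS_PER_ITEM, PRICE_PER_MILLION)
gpu_daily = GPU_MONTHLY / 30  # assumes a 30-day month

# Daily volume at which the API bill equals the GPU's flat daily cost.
cost_per_item = TOKENS_PER_ITEM * PRICE_PER_MILLION / 1_000_000
break_even_items = gpu_daily / cost_per_item

print(f"API: ${api_daily:.2f}/day  GPU: ${gpu_daily:.2f}/day")
print(f"break-even: about {break_even_items:,.0f} items/day")
```

Below the break-even volume the API is the cheaper option, which is one reason local models sit last in the priority order.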
Quality threshold: For most classification, extraction, and short-answer tasks, a well-prompted local 7B model produces acceptable results. For tasks requiring sophisticated reasoning, long-context understanding, or high-stakes outputs, cloud models remain worth the cost.
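For the classification use case, a call to a local model might look like the sketch below. It assumes an Ollama server running on its default port with the `qwen2.5:7b` model already pulled, and it uses only the standard library. `build_payload` and `classify` are hypothetical helper names, and the prompt wording is illustrative rather than tuned.

```python
import json
import urllib.request

# Ollama's default local endpoint (assumes a server is running on this host).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(text: str, labels: list[str],
                  model: str = "qwen2.5:7b") -> dict:
    """Build a non-streaming generate request for a classification prompt."""
    prompt = (
        f"Classify the text into one of: {', '.join(labels)}.\n"
        f"Text: {text}\n"
        "Answer with the label only."
    )
    return {"model": model, "prompt": prompt, "stream": False}


def classify(text: str, labels: list[str]) -> str:
    """Send the prompt to the local model and return its raw answer."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(text, labels)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read())["response"].strip()
```

A call such as `classify("My card was charged twice", ["billing", "technical", "other"])` returns whatever the model generates, so validate the answer against the label list before trusting it.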
The Real Cost Reduction: A Case Study
A SaaS startup using AI for customer support, document generation, and content tagging came to us with an $800/month LLM bill growing 15% month-over-month.
Before: All tasks through GPT-4o. No model routing, no caching, no batching.
After (two-week implementation):
- Model routing: 72% of tasks moved to GPT-4o-mini (simple tasks) and Claude Haiku (support responses)
- Prompt caching on the support system's 3,000-token product knowledge base
- Batch API for nightly document tagging (previously done in real-time but not time-sensitive)
- Format instructions added to all generation prompts, reducing average output by 40%
Result: $320/month, down from $800/month. A 60% reduction, achieved without replacing any features or noticeably affecting output quality. The only user-facing quality change was that some responses were slightly more concise, which users actually preferred.
Priority Order for Implementation
If you are looking at a high LLM bill and want to know where to start:
- Model routing first. Audit what models you are using for what tasks. Moving simple tasks to cheap models gives the largest gains with the least implementation work.
- Prompt caching second. If you have long system prompts that repeat across many requests, caching is two lines of code for a large cost reduction.
- Response length control third. Add explicit format instructions to your prompts. Easy, low-risk, and meaningful at scale.
- Batch API fourth. Identify tasks that can be processed async (nightly jobs, bulk analysis). Move them to batch.
- Local models last. The most powerful cost reduction, but also the most infrastructure work. Address the easy wins first.