Here is the complete LLM API pricing comparison for May 2026: GPT-4o costs $2.50 per 1M input tokens and $10 per 1M output tokens. Claude 3.5 Sonnet costs $3 per 1M input and $15 per 1M output. Gemini 1.5 Flash is the cheapest capable model at $0.075 per 1M input and $0.30 per 1M output. Deepseek V3 is competitive at $0.14 per 1M input and $0.28 per 1M output. For most mid-complexity tasks, Deepseek V3 or Gemini 1.5 Flash is the rational default unless your task specifically requires the quality ceiling of GPT-4o or Claude 3.5 Sonnet.
All prices are from official provider pages as of May 2026. Prices change frequently — verify before committing to a cost model.
Complete Pricing Table
| Model | Provider | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
|---|---|---|---|---|
| GPT-4o | OpenAI | $2.50 | $10.00 | 128k |
| GPT-4o-mini | OpenAI | $0.15 | $0.60 | 128k |
| GPT-4.1 | OpenAI | $2.00 | $8.00 | 1M |
| GPT-4.1-mini | OpenAI | $0.40 | $1.60 | 1M |
| o4-mini | OpenAI | $1.10 | $4.40 | 200k |
| Claude 3.5 Sonnet | Anthropic | $3.00 | $15.00 | 200k |
| Claude 3.5 Haiku | Anthropic | $0.80 | $4.00 | 200k |
| Claude 3 Haiku | Anthropic | $0.25 | $1.25 | 200k |
| Claude 3 Opus | Anthropic | $15.00 | $75.00 | 200k |
| Gemini 1.5 Pro | Google | $1.25 | $5.00 | 1M+ |
| Gemini 1.5 Flash | Google | $0.075 | $0.30 | 1M+ |
| Gemini 2.0 Flash | Google | $0.10 | $0.40 | 1M+ |
| Gemini 2.5 Pro | Google | $1.25 | $10.00 | 1M+ |
| Deepseek V3 | Deepseek | $0.14 | $0.28 | 128k |
| Deepseek R1 | Deepseek | $0.55 | $2.19 | 128k |
| Llama 3.3 70B (Groq) | Groq | $0.59 | $0.79 | 128k |
| Mistral Large | Mistral | $2.00 | $6.00 | 128k |
| Mistral Small | Mistral | $0.10 | $0.30 | 128k |
Cost Per 1,000 Requests at Different Task Sizes
To make these numbers concrete, here is the cost per 1,000 API requests at three typical task sizes. "Gemini Flash" in these tables uses the Gemini 1.5 Flash prices.
Short task (200-token prompt, 100-token response = 300 tokens total):

| Model | Cost per 1,000 requests |
|---|---|
| GPT-4o | $1.50 |
| GPT-4o-mini | $0.09 |
| Claude 3.5 Sonnet | $2.10 |
| Claude 3 Haiku | $0.175 |
| Gemini Flash | $0.045 |
| Deepseek V3 | $0.056 |
Medium task (1,000-token prompt, 500-token response = 1,500 tokens total):

| Model | Cost per 1,000 requests |
|---|---|
| GPT-4o | $7.50 |
| GPT-4o-mini | $0.45 |
| Claude 3.5 Sonnet | $10.50 |
| Claude 3 Haiku | $0.875 |
| Gemini Flash | $0.225 |
| Deepseek V3 | $0.28 |
Long task (4,000-token prompt, 2,000-token response = 6,000 tokens total):

| Model | Cost per 1,000 requests |
|---|---|
| GPT-4o | $30.00 |
| GPT-4o-mini | $1.80 |
| Claude 3.5 Sonnet | $42.00 |
| Claude 3 Haiku | $3.50 |
| Gemini Flash | $0.90 |
| Deepseek V3 | $1.12 |
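The per-request figures follow directly from the per-token prices: multiply prompt tokens by the input price, response tokens by the output price, and scale to 1,000 requests. Here is a minimal sketch of that arithmetic, assuming the prices in the pricing table above. The `PRICES` dictionary and `cost_per_1k_requests` function are illustrative names, not part of any provider SDK.

```python
# Illustrative prices in USD per 1M tokens as (input, output),
# copied from the pricing table in this article. Verify before relying on them.
PRICES = {
    "GPT-4o": (2.50, 10.00),
    "GPT-4o-mini": (0.15, 0.60),
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "Claude 3 Haiku": (0.25, 1.25),
    "Gemini 1.5 Flash": (0.075, 0.30),
    "Deepseek V3": (0.14, 0.28),
}


def cost_per_1k_requests(model: str, prompt_tokens: int, response_tokens: int) -> float:
    """Return the USD cost of 1,000 requests with the given token counts."""
    input_price, output_price = PRICES[model]
    # Cost of a single request: tokens times price, divided by 1M tokens.
    per_request = (prompt_tokens * input_price + response_tokens * output_price) / 1_000_000
    return round(per_request * 1_000, 4)
```

Calling `cost_per_1k_requests("GPT-4o", 1_000, 500)` reproduces the medium-task figure for GPT-4o. Swapping in your own average prompt and response lengths is usually more informative than the three reference sizes.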
Hidden Costs: Rate Limits, Retries, and Context Overhead
The published per-token price is not the only cost. Three hidden costs affect real-world spending:
Rate limit retry overhead: Retries add to your bill when the failed attempt already consumed tokens, for example after a timeout or a dropped stream. A request rejected outright by a rate limit is generally not billed, but it still costs latency and engineering effort. At scale, 2-5% of requests typically need retry logic, which can add up to 2-5% to effective costs. OpenAI's Tier 3 and above (developers spending $500+/month) have generous limits; smaller tiers hit limits more often during peak hours.
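Retry logic for rate limits is commonly written as exponential backoff with jitter. The sketch below is one way to do it, not any provider's official client behavior. `RateLimitError` is a hypothetical stand-in for whatever exception your client library raises, and `request_fn` is any zero-argument callable that performs the API call.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the rate-limit exception your client raises."""


def call_with_backoff(request_fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call request_fn, retrying on RateLimitError with exponential backoff.

    Returns (result, attempts) so the real retry rate can be logged and costed.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return request_fn(), attempt
        except RateLimitError:
            if attempt == max_attempts:
                raise  # Give up and surface the error to the caller.
            # Delay doubles each attempt (base, 2x, 4x, ...) plus up to 10% jitter.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay * 0.1))
```

Returning the attempt count alongside the result makes it possible to measure your own retry rate rather than assuming the 2-5% figure.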
Context window overhead: Many applications maintain conversation history or inject retrieved documents. A chat application maintaining 10 turns of context at 200 tokens/turn adds 2,000 tokens of context overhead to every request. At 100,000 daily requests on Claude 3.5 Sonnet ($3/1M input), that overhead is 200M tokens and costs $600 per day, or about $18,000 over a 30-day month.
System prompt tokens: Long system prompts repeat on every request. A 3,000-token system prompt sent with 50,000 daily requests is 150M tokens per day. On GPT-4o, that is $375 per day, or about $11,250 over a 30-day month, just for system prompt tokens. Prompt caching (Technique 2 in the cost-cutting guide) addresses this specifically.
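Context overhead and system prompt overhead are the same multiplication: extra input tokens per request, times request volume, times the input price. Here is a minimal sketch, assuming a 30-day month and no caching discount. `overhead_cost` is an illustrative helper name.

```python
def overhead_cost(tokens_per_request: int, requests_per_day: int,
                  input_price_per_million: float, days: int = 30) -> tuple[float, float]:
    """Return (daily_usd, period_usd) for input tokens repeated on every request."""
    daily_tokens = tokens_per_request * requests_per_day
    daily_usd = daily_tokens / 1_000_000 * input_price_per_million
    return daily_usd, daily_usd * days


# 2,000 context tokens, 100,000 requests/day, $3.00 per 1M input tokens.
context_daily, context_monthly = overhead_cost(2_000, 100_000, 3.00)

# 3,000-token system prompt, 50,000 requests/day, $2.50 per 1M input tokens.
prompt_daily, prompt_monthly = overhead_cost(3_000, 50_000, 2.50)
```

Worked through by hand, 2,000 tokens across 100,000 daily requests is 200M tokens, which at $3/1M comes to $600 per day. A 3,000-token prompt across 50,000 daily requests is 150M tokens, which at $2.50/1M comes to $375 per day.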
Which Model for Which Task
Coding (complex, multi-file): Claude 3.5 Sonnet or GPT-4o. The quality difference between these and cheaper models is meaningful for complex programming tasks. Deepseek V3 is competitive at a fraction of the cost for many coding tasks, though Sonnet still leads on the most complex problems.
Coding (simple, autocomplete-style): GPT-4o-mini, Claude 3 Haiku, or Deepseek V3. Quality is sufficient for boilerplate, standard algorithms, and straightforward implementations.
Summarization: Gemini Flash or Deepseek V3. Summarization is a task where cheap models perform nearly as well as expensive ones. The quality ceiling for summarization is rarely what the most expensive model can do — it is what the document contains.
Classification and extraction: Gemini Flash, GPT-4o-mini, or Claude 3 Haiku. Classification is the clearest model-routing opportunity: expensive models add cost without adding meaningful quality.
Complex reasoning and analysis: GPT-4o, Claude 3.5 Sonnet, or Deepseek R1. These tasks — multi-step problem solving, nuanced argument analysis, strategic planning assistance — genuinely benefit from the quality ceiling of top models.
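Long context (hundreds of thousands of tokens): Gemini 1.5 Pro or Flash. With context windows of 1M+ tokens, Gemini is the most practical option in this comparison for processing very long documents in a single call. GPT-4.1 and GPT-4.1-mini are also listed at 1M in the pricing table, while most other models here stop at 128k or 200k.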
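These recommendations can be encoded as a simple routing table. The sketch below is illustrative only: the task-type keys, `ROUTES`, `DEFAULT_CANDIDATES`, and `pick_model` are made-up names, and the candidate order simply follows the order in which models are mentioned above.

```python
from typing import Optional, Set

# Hypothetical task-type keys mapped to the models suggested in this section.
ROUTES = {
    "coding_complex": ["Claude 3.5 Sonnet", "GPT-4o"],
    "coding_simple": ["GPT-4o-mini", "Claude 3 Haiku", "Deepseek V3"],
    "summarization": ["Gemini Flash", "Deepseek V3"],
    "classification": ["Gemini Flash", "GPT-4o-mini", "Claude 3 Haiku"],
    "reasoning": ["GPT-4o", "Claude 3.5 Sonnet", "Deepseek R1"],
    "long_context": ["Gemini 1.5 Pro", "Gemini 1.5 Flash"],
}

# Fallback for unrecognized task types: the article's default cheap choices.
DEFAULT_CANDIDATES = ["Deepseek V3", "Gemini Flash"]


def pick_model(task_type: str, available: Optional[Set[str]] = None) -> str:
    """Return the first suggested model for task_type that is in `available`.

    If `available` is None, every model is treated as available.
    """
    candidates = ROUTES.get(task_type, DEFAULT_CANDIDATES)
    for model in candidates:
        if available is None or model in available:
            return model
    raise LookupError(f"no available model for task type {task_type!r}")
```

In practice the hard part is classifying the incoming request into a task type. A keyword heuristic or a cheap classifier model in front of this table is a common starting point.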
When to Switch from Expensive to Cheap: Quality Threshold Analysis
The practical question is: "How much quality am I giving up by switching to a cheaper model, and is that tradeoff worth the cost reduction?"
The answer varies by task type. Here is a rough framework based on common AI product use cases:
Tasks where cheap models reach 90%+ of top-model quality:
- Text classification (sentiment, intent, category)
- Short-form summarization under 500 words
- Simple data extraction (names, dates, amounts from documents)
- FAQ-style question answering over well-defined knowledge bases
Tasks where cheap models reach 75-90% of top-model quality:
- Long-form content generation
- Complex summarization requiring nuance
- Coding assistance for common patterns
- Translation
Tasks where cheap models fall below 75% of top-model quality:
- Multi-step reasoning chains
- Complex debugging in novel codebases
- Nuanced analysis of ambiguous information
- Tasks requiring judgment under uncertainty
For the first category, switch to cheap models immediately. For the second, A/B test your specific use case. For the third, the expensive model is probably worth it.
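The three bands translate into a small decision rule once you have a measured quality ratio. This is a minimal sketch, assuming you score both models on your own evaluation set and divide the cheap model's score by the expensive model's score. The function names and return labels are illustrative.

```python
def recommend(quality_ratio: float) -> str:
    """Map a measured quality ratio to one of the three actions described above.

    quality_ratio is the cheap model's score divided by the expensive model's
    score on your own evaluation set (an assumed metric; choose one that fits).
    """
    if quality_ratio < 0:
        raise ValueError("quality_ratio must be non-negative")
    if quality_ratio >= 0.90:
        return "switch"
    if quality_ratio >= 0.75:
        return "ab_test"
    return "keep_expensive"


def savings_fraction(cheap_cost: float, expensive_cost: float) -> float:
    """Fraction of spend saved by moving from the expensive model to the cheap one."""
    if expensive_cost <= 0:
        raise ValueError("expensive_cost must be positive")
    return 1 - cheap_cost / expensive_cost
```

Pairing the two numbers helps frame the tradeoff. At the medium task size, GPT-4o-mini costs $0.45 per 1,000 requests against $7.50 for GPT-4o, a saving of 94%, so even a ratio in the A/B-test band may be worth investigating.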
Keep Reading
- Cutting LLM API Costs by 50%+ — Every technique for reducing your LLM bill with implementation details
- Prompt Caching With Anthropic and OpenAI — How to get 50-90% off repeated system prompt tokens
- How to Evaluate LLMs — How to measure whether a cheaper model actually meets your quality threshold