LLM API Rate Limits: What They Are and How to Handle Them
LLM API rate limits enforce per-minute token and request caps. Exponential backoff with jitter, request queuing, and caching are the standard strategies for handling them gracefully.
LLM API rate limits are per-minute (or per-day) caps on how many tokens and requests you can send. When you exceed them, the API returns a 429 error. Every major LLM provider enforces rate limits for fair use and infrastructure stability. Handling them correctly requires exponential backoff with jitter, request queuing, and — for high-volume applications — caching and tier upgrades.
What Rate Limits Are
Rate limits have two dimensions:
TPM (Tokens Per Minute): the total number of input plus output tokens allowed per minute. This is usually the binding constraint for high-volume applications.
RPM (Requests Per Minute): the number of API calls per minute, regardless of token count. This often constrains applications making many small requests.
Some providers also have daily limits (TPD/RPD) that compound the per-minute limits.
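Which of the two limits binds first depends on how large your requests are. The following is a minimal sketch of that arithmetic, not a provider API: `RateLimits` and `effectiveRequestsPerMinute` are hypothetical names, and the 500 RPM / 30,000 TPM figures are used only as illustrative inputs.

```typescript
// Hypothetical helper: none of these names come from a provider SDK.
interface RateLimits {
  rpm: number; // requests per minute
  tpm: number; // tokens per minute (input + output)
}

// Returns the highest sustainable request rate and which limit caps it.
function effectiveRequestsPerMinute(
  limits: RateLimits,
  avgTokensPerRequest: number,
): { requestsPerMinute: number; boundBy: "rpm" | "tpm" } {
  if (avgTokensPerRequest <= 0) {
    throw new RangeError("avgTokensPerRequest must be positive");
  }
  // How many average-sized requests fit inside the token budget.
  const tokenBound = Math.floor(limits.tpm / avgTokensPerRequest);
  return tokenBound < limits.rpm
    ? { requestsPerMinute: tokenBound, boundBy: "tpm" }
    : { requestsPerMinute: limits.rpm, boundBy: "rpm" };
}

// Large requests (1,500 tokens each): the token limit binds at 20 requests/min.
console.log(effectiveRequestsPerMinute({ rpm: 500, tpm: 30_000 }, 1_500));
// Small requests (50 tokens each): the request limit binds at 500 requests/min.
console.log(effectiveRequestsPerMinute({ rpm: 500, tpm: 30_000 }, 50));
```

With large prompts the token budget runs out long before the request budget, which is why TPM is usually the constraint that matters for high-volume applications.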
Current Limits by Provider and Tier
These change frequently. Always check current documentation. As of early 2025:
OpenAI Tier 1 (new accounts with $5–50 spend):
GPT-4o: 500 RPM, 30,000 TPM
GPT-4o-mini: 500 RPM, 200,000 TPM
OpenAI Tier 4 ($250+ paid):
GPT-4o: 10,000 RPM, 800,000 TPM
Anthropic Tier 1 (new accounts):
Claude 3.5 Sonnet: 50 RPM, 40,000 TPM
Anthropic Tier 4 ($400+ in credit purchases):
Claude 3.5 Sonnet: 4,000 RPM, 400,000 TPM
Groq Free Tier:
Llama 3.3 70B: 30 RPM, 6,000 TPM
Note: Groq's free tier is remarkably low on TPM but useful for development. Their paid tier is significantly higher.
These figures are from official provider documentation and should be verified against current limits before relying on them in production planning.
Why Rate Limits Exist
Fair use: without limits, a single customer could consume all available capacity, degrading service for others.
Cost control: rate limits prevent runaway usage from bugs or misconfigured applications from generating enormous unexpected bills.
Infrastructure stability: LLM inference is computationally expensive. Rate limits allow providers to capacity plan and maintain consistent quality under load.
The Correct Retry Strategy: Exponential Backoff With Jitter
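When you hit a 429, the naive approach is to retry immediately, or after a short fixed delay. This is wrong for two reasons: the rate limit window has probably not reset yet, and if multiple clients all retry on the same schedule (the "thundering herd" problem), they hit the limit again simultaneously.
Exponential backoff with jitter is the correct approach: double the maximum wait after each failed attempt, up to a cap, and pick a random delay within that range so clients spread their retries out instead of retrying in lockstep.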
The retry-after header: most providers include a retry-after header in 429 responses indicating how many seconds to wait. Respect this value when present.
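The retry logic described above can be sketched as follows. This is an illustration under stated assumptions, not any SDK's built-in behavior: `RateLimitedError`, `backoffDelayMs`, and `withBackoff` are hypothetical names, the "full jitter" variant (a random delay between zero and the exponential ceiling) is one of several jitter strategies, and the defaults (5 retries, 1 s base, 60 s cap) are arbitrary starting points. In real code you would translate your client's 429 error, and its retry-after value if present, into `RateLimitedError`.

```typescript
// Sketch only: RateLimitedError and withBackoff are hypothetical names.
class RateLimitedError extends Error {
  retryAfterSeconds?: number;
  constructor(retryAfterSeconds?: number) {
    super("429 Too Many Requests");
    this.retryAfterSeconds = retryAfterSeconds;
  }
}

interface BackoffOptions {
  maxRetries?: number; // retries allowed after the first attempt
  baseDelayMs?: number; // delay ceiling for the first retry
  maxDelayMs?: number; // cap on the exponential growth
  sleep?: (ms: number) => Promise<void>; // injectable for tests
  random?: () => number; // injectable for tests
}

// Full jitter: a random delay in [0, min(maxDelayMs, baseDelayMs * 2^attempt)).
function backoffDelayMs(
  attempt: number,
  baseDelayMs: number,
  maxDelayMs: number,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

async function withBackoff<T>(
  fn: () => Promise<T>,
  opts: BackoffOptions = {},
): Promise<T> {
  const {
    maxRetries = 5,
    baseDelayMs = 1000,
    maxDelayMs = 60_000,
    sleep = (ms: number) =>
      new Promise<void>((resolve) => setTimeout(resolve, ms)),
    random = Math.random,
  } = opts;

  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      // Retry only rate-limit errors, and only while retries remain.
      if (!(error instanceof RateLimitedError) || attempt >= maxRetries) {
        throw error;
      }
      // Prefer the server's retry-after hint when one was provided.
      const delay =
        error.retryAfterSeconds !== undefined
          ? error.retryAfterSeconds * 1000
          : backoffDelayMs(attempt, baseDelayMs, maxDelayMs, random);
      await sleep(delay);
    }
  }
}
```

The cap matters as much as the doubling: without `maxDelayMs`, a long outage would push later retries out to many minutes.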
Request Queuing
For applications that generate bursts of API requests (batch processing jobs, multiple simultaneous users), a request queue prevents hitting rate limits by spacing out requests:
import OpenAI from "openai";
import PQueue from "p-queue";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Allow at most 50 requests per second (3,000 RPM)
const queue = new PQueue({ interval: 1000, intervalCap: 50 });

async function queuedApiCall(prompt: string) {
  return queue.add(() =>
    openai.chat.completions.create({
      model: "gpt-4o",
      messages: [{ role: "user", content: prompt }],
    }),
  );
}

// All calls go through the queue, which spaces them automatically
const prompts = ["First prompt", "Second prompt"]; // placeholder inputs
const results = await Promise.all(prompts.map(queuedApiCall));
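The p-queue library makes this straightforward. Because intervalCap applies per interval rather than per minute, size it so the overall rate is about 80% of your RPM limit, which leaves headroom. With a 1,000 ms interval and a 500 RPM limit, that works out to roughly 6 requests per second.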
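Converting an RPM limit into queue settings is simple arithmetic, sketched below. `queueOptionsForRpm` is a hypothetical helper, not part of p-queue; it only returns an object using p-queue's `interval` and `intervalCap` option names. The 80% headroom default and the choice to stretch the interval for very low limits are assumptions of this sketch.

```typescript
// Hypothetical helper: derives p-queue-style options from an RPM limit.
function queueOptionsForRpm(
  rpmLimit: number,
  headroom = 0.8, // fraction of the limit to actually use
  intervalMs = 1000, // preferred interval length
): { interval: number; intervalCap: number } {
  if (rpmLimit <= 0 || headroom <= 0 || headroom > 1) {
    throw new RangeError("rpmLimit must be > 0 and headroom must be in (0, 1]");
  }
  const targetPerMinute = rpmLimit * headroom;
  const perInterval = (targetPerMinute * intervalMs) / 60_000;

  if (perInterval < 1) {
    // Less than one request fits per interval, so lengthen the interval
    // until exactly one request per interval stays within the target rate.
    return { interval: Math.ceil(60_000 / targetPerMinute), intervalCap: 1 };
  }
  // Round down so the queue never exceeds the target rate.
  return { interval: intervalMs, intervalCap: Math.floor(perInterval) };
}

// 500 RPM -> 400/min target -> 6 requests per 1,000 ms.
console.log(queueOptionsForRpm(500));
// 50 RPM -> 40/min target -> 1 request per 1,500 ms.
console.log(queueOptionsForRpm(50));
// Usage with p-queue (not executed here): new PQueue(queueOptionsForRpm(500))
```

Note that this only paces requests. It does not account for token limits, so a queue sized for RPM can still hit TPM if individual requests are large.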
Model Fallback
When you hit the rate limit on your primary model, fall back to an alternative:
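import OpenAI from "openai";
import Anthropic from "@anthropic-ai/sdk";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Restricted to the roles both APIs accept inside the messages array
type Message = { role: "user" | "assistant"; content: string };

async function callWithFallback(messages: Message[]): Promise<string> {
  try {
    const res = await openai.chat.completions.create({
      model: "gpt-4o",
      messages,
    });
    return res.choices[0].message.content ?? "";
  } catch (error) {
    // Fall back only on rate limiting; rethrow everything else
    if (error instanceof OpenAI.APIError && error.status === 429) {
      // Fall back to Claude when GPT-4o is rate-limited
      const res = await anthropic.messages.create({
        model: "claude-3-5-sonnet-20241022",
        max_tokens: 1024,
        messages,
      });
      // The two APIs return different shapes, so normalize to plain text
      const block = res.content[0];
      return block?.type === "text" ? block.text : "";
    }
    throw error;
  }
}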
This requires maintaining API keys for multiple providers, but dramatically improves reliability for production applications.
Caching to Reduce API Calls
Many applications make redundant API calls for identical or near-identical inputs. Caching eliminates these:
Exact match caching: hash the input messages and cache the response. Works for FAQ-style applications where many users ask the same questions.
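import { createHash } from "crypto";
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

type Message = OpenAI.Chat.ChatCompletionMessageParam;

// In-memory cache: SHA-256 of the serialized messages -> completion text
const cache = new Map<string, string>();

async function cachedCompletion(messages: Message[]): Promise<string> {
  const key = createHash("sha256")
    .update(JSON.stringify(messages))
    .digest("hex");

  const cached = cache.get(key);
  if (cached !== undefined) {
    return cached; // cache hit: no API call
  }

  const response = await openai.chat.completions.create({ model: "gpt-4o", messages });
  const text = response.choices[0].message.content ?? "";
  cache.set(key, text);
  return text;
}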
For production, use Redis instead of an in-memory Map so caching persists across restarts and scales across instances.
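One way to keep the caching logic independent of the backing store is to code against a small interface, sketched below. Everything here is an assumption for illustration: `CacheStore`, `MemoryStore`, `cacheKey`, and `cachedCompletionWithStore` are hypothetical names, the `complete` callback stands in for whatever client call you use, and the one-hour TTL is arbitrary. A Redis-backed class implementing `CacheStore` could replace `MemoryStore` without touching the rest.

```typescript
import { createHash } from "node:crypto";

// Hypothetical minimal store interface; a Redis wrapper could implement it.
interface CacheStore {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

type ChatMessage = { role: string; content: string };

// Include the model in the key so different models never share entries.
function cacheKey(model: string, messages: ChatMessage[]): string {
  return createHash("sha256")
    .update(JSON.stringify({ model, messages }))
    .digest("hex");
}

async function cachedCompletionWithStore(
  store: CacheStore,
  model: string,
  messages: ChatMessage[],
  complete: (model: string, messages: ChatMessage[]) => Promise<string>,
  ttlSeconds = 3600,
): Promise<string> {
  const key = cacheKey(model, messages);
  const hit = await store.get(key);
  if (hit !== null) return hit; // cache hit: skip the API call

  const text = await complete(model, messages);
  await store.set(key, text, ttlSeconds);
  return text;
}

// In-memory implementation for local runs and tests.
class MemoryStore implements CacheStore {
  private entries = new Map<string, { value: string; expiresAt: number }>();

  async get(key: string): Promise<string | null> {
    const entry = this.entries.get(key);
    if (!entry || entry.expiresAt <= Date.now()) {
      this.entries.delete(key); // drop missing or expired entries
      return null;
    }
    return entry.value;
  }

  async set(key: string, value: string, ttlSeconds: number): Promise<void> {
    this.entries.set(key, { value, expiresAt: Date.now() + ttlSeconds * 1000 });
  }
}
```

A TTL is worth having even for exact-match caching, since model updates or prompt changes elsewhere in your system can make old answers stale.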
Upgrading Tiers: How It Works
OpenAI and Anthropic use spend-based tier progression:
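OpenAI: as of early 2025, Tier 2 requires $50 paid, Tier 3 requires $100, Tier 4 requires $250, and Tier 5 requires $1,000, with a minimum account age at each step. Limits increase significantly at each tier.
Anthropic: similar progression, based on cumulative credit purchases. As of early 2025, Tier 2 requires $40, Tier 3 requires $200, and Tier 4 requires $400. Verify both sets of thresholds against current documentation, since they change.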
For new projects expected to hit limits quickly, spending the threshold amount early (even on development/testing) to unlock higher limits before production launch is a practical strategy.
When Rate Limits Block You: Optimize vs Upgrade
Signs you need to optimize first:
You are making redundant API calls (implement caching)
Your requests are unnecessarily large (reduce context, trim system prompts)
You are not using batching where possible
A cheaper, smaller model would meet your quality bar (switch and get higher token limits at lower cost)
Signs you need to upgrade your tier:
Your application has genuine throughput requirements that exceed your current limits
You have implemented all reasonable optimizations
The cost of a higher tier is less than the cost of engineering more complex workarounds
Frequently Asked Questions
What are LLM API rate limits?
LLM API rate limits are caps imposed by providers on the number of requests (RPM) and tokens (TPM) you can send per minute. They prevent abuse and ensure fair resource allocation. Exceeding them returns a 429 HTTP status code.
How do LLM API rate limits work?
Rate limits work on a rolling window basis. Each API call consumes tokens and counts toward your RPM. Once the limit is reached, subsequent requests are rejected until the window resets (usually within a minute). Providers like OpenAI and Anthropic use spend-based tiers to increase limits.
What are the best practices for handling LLM API rate limits?
Best practices include: 1) Exponential backoff with jitter on 429 errors, 2) Request queuing with libraries like p-queue, 3) Caching identical responses, 4) Model fallback to alternative providers, and 5) Upgrading your tier if optimizations aren't enough.
How much does it cost to handle LLM API rate limits?
Handling rate limits itself is free (it's just code). However, upgrading tiers requires meeting spend thresholds: as of early 2025, OpenAI Tier 4 requires $250 paid and Anthropic Tier 4 requires $400 in credit purchases. Caching and optimization reduce costs by minimizing redundant API calls.
Is handling LLM API rate limits worth it in 2026?
Yes, absolutely. As LLM usage scales, rate limits become a bottleneck. Proper handling ensures reliability, reduces latency, and prevents service outages. Investing in retry logic, queuing, and caching pays off quickly for production applications.
Written by
Mahmudul Haque Qudrati
CEO & ML Engineer
CEO and ML Engineer at Pristren. Builds AI-powered software for teams and writes about machine learning, LLMs, developer tools, and practical AI applications.