LLM API rate limits are per-minute (or per-day) caps on how many tokens and requests you can send. When you exceed them, the API returns an HTTP 429 (Too Many Requests) error. Every major LLM provider enforces rate limits for fair use and infrastructure stability. Handling them correctly requires exponential backoff with jitter, request queuing, and, for high-volume applications, caching and tier upgrades.
What Rate Limits Are
Rate limits have two dimensions:
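TPM (Tokens Per Minute): the number of tokens allowed per minute, usually counting input plus output tokens together, although some providers (Anthropic, for example) cap input and output tokens separately. This is usually the binding constraint for high-volume applications.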
RPM (Requests Per Minute): the number of API calls per minute, regardless of token count. This often constrains applications making many small requests.
Some providers also have daily limits (TPD/RPD) that compound the per-minute limits.
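To see which of the two limits binds for a given workload, a rough calculation helps. The sketch below is illustrative only: the helper name `maxRequestsPerMinute` and the idea of using a single average tokens-per-request figure are assumptions, not part of any provider SDK.

```typescript
// Hypothetical helper: estimates the sustainable request rate under both limits.
interface RateLimits {
  rpm: number; // requests per minute
  tpm: number; // tokens per minute (input + output)
}

function maxRequestsPerMinute(
  limits: RateLimits,
  avgTokensPerRequest: number
): { requestsPerMinute: number; bindingLimit: "RPM" | "TPM" } {
  if (avgTokensPerRequest <= 0) {
    throw new RangeError("avgTokensPerRequest must be positive");
  }
  // How many average-sized requests fit inside the token budget
  const byTokens = Math.floor(limits.tpm / avgTokensPerRequest);
  if (byTokens < limits.rpm) {
    return { requestsPerMinute: byTokens, bindingLimit: "TPM" };
  }
  return { requestsPerMinute: limits.rpm, bindingLimit: "RPM" };
}
```

For example, with limits of 500 RPM and 30,000 TPM and requests averaging 2,000 tokens, the token budget allows only 15 requests per minute, so TPM is the binding limit long before RPM matters.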
Current Limits by Provider and Tier
These change frequently. Always check current documentation. As of early 2025:
OpenAI Tier 1 (new accounts with $5-50 spend):
- GPT-4o: 500 RPM, 30,000 TPM
- GPT-4o-mini: 500 RPM, 200,000 TPM
OpenAI Tier 4 ($250+ spend):
- GPT-4o: 10,000 RPM, 800,000 TPM
Anthropic Tier 1 (new accounts):
- Claude 3.5 Sonnet: 50 RPM, 40,000 TPM
Anthropic Tier 4 ($400+ in credit purchases):
- Claude 3.5 Sonnet: 4,000 RPM, 400,000 TPM
Groq Free Tier:
- Llama 3.3 70B: 30 RPM, 6,000 TPM
Note: Groq's free tier is remarkably low on TPM but useful for development. Their paid tier is significantly higher.
These figures are from official provider documentation and should be verified against current limits before relying on them in production planning.
Why Rate Limits Exist
Fair use: without limits, a single customer could consume all available capacity, degrading service for others.
Cost control: rate limits stop runaway usage caused by bugs or misconfigured applications from generating enormous unexpected bills.
Infrastructure stability: LLM inference is computationally expensive. Rate limits allow providers to plan capacity and maintain consistent quality under load.
The Correct Retry Strategy: Exponential Backoff With Jitter
When you hit a 429, the naive approach is to retry immediately, or after a short fixed wait. This is wrong for two reasons: the rate limit window may not have reset yet, and if multiple clients all retry on the same schedule (the "thundering herd" problem), they hit the limit again simultaneously.
Exponential backoff with jitter is the correct approach:
async function callWithRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 5
): Promise<T> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      // `error` is `unknown` under strict TypeScript, so narrow before reading `status`
      const status = (error as { status?: number } | null)?.status;
      // Rethrow anything that is not a rate limit, and rethrow on the final attempt
      if (status !== 429 || attempt === maxRetries - 1) {
        throw error;
      }
      // Exponential backoff: 1s, 2s, 4s, 8s (the fifth attempt rethrows instead of waiting)
      const baseDelay = Math.pow(2, attempt) * 1000;
      // Jitter: randomize ±25% to avoid synchronized retries
      const jitter = baseDelay * 0.25 * (Math.random() * 2 - 1);
      const delay = baseDelay + jitter;
      console.log(`Rate limited. Retrying in ${Math.round(delay)}ms...`);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  // Only reachable when maxRetries <= 0
  throw new Error("Max retries exceeded");
}
The retry-after header: many providers include a retry-after header in 429 responses indicating how many seconds to wait. When it is present, treat it as the minimum wait and never retry sooner than it says.
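One way to combine the header with backoff is sketched below. The helper names `retryAfterMs` and `chooseDelay` are hypothetical, and how you read the header off a failed request depends on your client library, so the sketch takes the raw header value as a string. Per the HTTP specification the value is either a number of seconds or an HTTP date; accepting fractional seconds here is an extra leniency, not something the specification promises.

```typescript
// Hypothetical helper: converts a Retry-After header value into milliseconds.
// Returns null when the header is missing or cannot be parsed.
function retryAfterMs(
  headerValue: string | null | undefined,
  now: number = Date.now()
): number | null {
  if (!headerValue) return null;
  const trimmed = headerValue.trim();
  // Form 1: a delay in seconds, e.g. "20"
  if (/^\d+(\.\d+)?$/.test(trimmed)) {
    return Math.ceil(Number(trimmed) * 1000);
  }
  // Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
  const dateMs = Date.parse(trimmed);
  if (!Number.isNaN(dateMs)) {
    return Math.max(0, dateMs - now);
  }
  return null;
}

// Hypothetical helper: wait at least as long as the server asks,
// falling back to the computed backoff delay when there is no usable header.
function chooseDelay(headerValue: string | null | undefined, backoffMs: number): number {
  const serverMs = retryAfterMs(headerValue);
  return serverMs === null ? backoffMs : Math.max(serverMs, backoffMs);
}
```

Inside a retry loop like the one above, the result of `chooseDelay` would replace the plain backoff delay before the `setTimeout` call.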
Request Queuing
For applications that generate bursts of API requests (batch processing jobs, multiple simultaneous users), a request queue prevents hitting rate limits by spacing out requests:
import OpenAI from "openai";
import PQueue from "p-queue";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Allow at most 50 requests to start per second (3,000 per minute)
const queue = new PQueue({ interval: 1000, intervalCap: 50 });

async function queuedApiCall(prompt: string) {
  return queue.add(() =>
    openai.chat.completions.create({
      model: "gpt-4o",
      messages: [{ role: "user", content: prompt }],
    })
  );
}

// All calls go through the queue, which spaces them automatically.
// `prompts` is your own array of input strings.
const results = await Promise.all(prompts.map(queuedApiCall));
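The p-queue library makes this straightforward. Size intervalCap against the interval you chose: with a 1-second interval, use about 80% of your RPM limit divided by 60 to leave headroom. The 50-per-second cap above suits a limit of roughly 3,750 RPM.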
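For low limits, a per-second cap rounds badly (80% of 50 RPM is less than one request per second). A minimal sketch of an alternative is to allow one request per interval and stretch the interval instead. The helper name `evenSpacingForRpm` is hypothetical; `interval` and `intervalCap` are the p-queue options already used above.

```typescript
// Hypothetical helper: derives p-queue options that space requests evenly
// while staying under `headroom` (a fraction, default 80%) of an RPM limit.
function evenSpacingForRpm(
  rpmLimit: number,
  headroom = 0.8
): { interval: number; intervalCap: number } {
  if (rpmLimit <= 0 || headroom <= 0 || headroom > 1) {
    throw new RangeError("rpmLimit must be positive and headroom must be in (0, 1]");
  }
  const targetPerMinute = rpmLimit * headroom;
  // One request per interval; round the interval up so the target rate is never exceeded
  return { interval: Math.ceil(60_000 / targetPerMinute), intervalCap: 1 };
}
```

With a 50 RPM limit this yields one request every 1,500 ms (40 per minute), and the result can be passed directly to `new PQueue(...)`. Note that this only paces requests; it does nothing about the token limit.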
Model Fallback
When you hit the rate limit on your primary model, fall back to an alternative:
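import OpenAI from "openai";
import Anthropic from "@anthropic-ai/sdk";

const openai = new OpenAI();
const anthropic = new Anthropic();

// Only roles that both APIs accept in `messages`;
// Anthropic takes the system prompt as a separate `system` parameter
type Message = { role: "user" | "assistant"; content: string };

async function callWithFallback(messages: Message[]) {
  try {
    return await openai.chat.completions.create({
      model: "gpt-4o",
      messages,
    });
  } catch (error) {
    // `error` is `unknown` under strict TypeScript, so narrow before reading `status`
    if ((error as { status?: number } | null)?.status === 429) {
      // Fall back to Claude when GPT-4o is rate-limited.
      // Note: the two SDKs return differently shaped responses.
      return await anthropic.messages.create({
        model: "claude-3-5-sonnet-20241022",
        max_tokens: 1024,
        messages,
      });
    }
    throw error;
  }
}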
This requires maintaining API keys for multiple providers, and the caller has to handle each provider's response shape, but it substantially improves reliability for production applications.
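One practical wrinkle with cross-provider fallback is that OpenAI-style conversations carry the system prompt as a message with role "system", while Anthropic's Messages API expects it in a top-level `system` field and accepts only "user" and "assistant" roles in `messages`. The sketch below shows one way to convert; the type and function names (`ChatMessage`, `AnthropicStyleRequest`, `toAnthropicStyle`) are hypothetical, and it handles plain string content only.

```typescript
type ChatRole = "system" | "user" | "assistant";

interface ChatMessage {
  role: ChatRole;
  content: string;
}

interface AnthropicStyleRequest {
  system?: string;
  messages: { role: "user" | "assistant"; content: string }[];
}

// Hypothetical helper: moves system messages into a single `system` string
// and keeps the remaining turns in their original order.
function toAnthropicStyle(messages: ChatMessage[]): AnthropicStyleRequest {
  const systemParts: string[] = [];
  const turns: { role: "user" | "assistant"; content: string }[] = [];
  for (const m of messages) {
    if (m.role === "system") {
      systemParts.push(m.content);
    } else {
      turns.push({ role: m.role, content: m.content });
    }
  }
  return systemParts.length > 0
    ? { system: systemParts.join("\n\n"), messages: turns }
    : { messages: turns };
}
```

The result can be spread into the fallback request alongside `model` and `max_tokens`. Tool calls, images, and other structured content would need their own mapping.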
Caching to Reduce API Calls
Many applications make redundant API calls for identical or near-identical inputs. Caching eliminates these:
Exact match caching: hash the input messages and cache the response. Works for FAQ-style applications where many users ask the same questions.
import { createHash } from "crypto";
import OpenAI from "openai";

const openai = new OpenAI();

type Message = { role: "system" | "user" | "assistant"; content: string };

// In-memory cache: SHA-256 of the messages -> response text
const cache = new Map<string, string>();

async function cachedCompletion(messages: Message[]): Promise<string> {
  const key = createHash("sha256")
    .update(JSON.stringify(messages))
    .digest("hex");
  const cached = cache.get(key);
  if (cached !== undefined) {
    return cached;
  }
  const response = await openai.chat.completions.create({ model: "gpt-4o", messages });
  const text = response.choices[0].message.content ?? "";
  cache.set(key, text);
  return text;
}
For production, use Redis instead of an in-memory Map so caching persists across restarts and scales across instances.
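The example above keys the cache on the messages alone, which is fine while the model and settings are fixed. If one cache serves several models or sampling settings, the key arguably should cover those too, and it should not depend on object key order. The sketch below builds such a key string, which would then be hashed as before; `stableStringify` and `cacheKeyInput` are hypothetical names, not library functions.

```typescript
// Hypothetical helper: JSON serialization with object keys sorted, so that
// { a: 1, b: 2 } and { b: 2, a: 1 } produce the same string.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return "[" + value.map(stableStringify).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const keys = Object.keys(obj)
      .filter((k) => obj[k] !== undefined) // skip unset optional fields
      .sort();
    return "{" + keys.map((k) => JSON.stringify(k) + ":" + stableStringify(obj[k])).join(",") + "}";
  }
  // Primitives; undefined has no JSON form, so serialize it as null
  return JSON.stringify(value) ?? "null";
}

// Hypothetical helper: everything that affects the response goes into the key.
function cacheKeyInput(
  model: string,
  messages: { role: string; content: string }[],
  params: Record<string, unknown> = {}
): string {
  return stableStringify({ model, messages, params });
}
```

Exact-match caching also makes the most sense when responses are meant to be repeatable; with a high sampling temperature, serving a cached answer changes behavior rather than just saving a call.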
Upgrading Tiers: How It Works
OpenAI and Anthropic use spend-based tier progression:
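OpenAI: Tier 2 requires $50 paid, Tier 3 requires $100, Tier 4 requires $250, and Tier 5 requires $1,000. The higher tiers also require a minimum number of days since your first successful payment, so spend alone does not unlock them instantly. Limits increase significantly at each tier.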
Anthropic: similar progression, based on cumulative credit purchases. Tier 2 at $40, Tier 3 at $200, Tier 4 at $400.
For new projects expected to hit limits quickly, spending the threshold amount early (even on development/testing) to unlock higher limits before production launch is a practical strategy.
When Rate Limits Block You: Optimize vs Upgrade
Signs you need to optimize first:
- You are making redundant API calls (implement caching)
- Your requests are unnecessarily large (reduce context, trim system prompts)
- You are not using batching where possible
- A cheaper, smaller model would meet your quality bar (switch and get higher token limits at lower cost)
Signs you need to upgrade your tier:
- Your application has genuine throughput requirements that exceed your current limits
- You have implemented all reasonable optimizations
- The cost of a higher tier is less than the cost of engineering more complex workarounds
Both strategies are often needed in combination.
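As one illustration of the "reduce context" optimization, the sketch below trims a conversation to a token budget by keeping system messages and the most recent turns. It is a rough sketch under stated assumptions: the names `Turn`, `estimateTokens`, and `trimToBudget` are hypothetical, and the four-characters-per-token estimate is only a rule of thumb for English text; a real tokenizer for your model gives accurate counts.

```typescript
interface Turn {
  role: "system" | "user" | "assistant";
  content: string;
}

// Assumption: roughly 4 characters per token. Replace with a real tokenizer for accuracy.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Hypothetical helper: keeps every system message, then adds turns from newest
// to oldest until the next one would exceed the budget.
function trimToBudget(messages: Turn[], maxTokens: number): Turn[] {
  const system = messages.filter((m) => m.role === "system");
  const others = messages.filter((m) => m.role !== "system");
  let used = system.reduce((sum, m) => sum + estimateTokens(m.content), 0);
  const kept: Turn[] = [];
  for (const m of [...others].reverse()) {
    const cost = estimateTokens(m.content);
    if (used + cost > maxTokens) break;
    used += cost;
    kept.unshift(m); // restore chronological order
  }
  return [...system, ...kept];
}
```

Smaller requests consume less of the TPM budget, which is often enough to postpone a tier upgrade.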