LLM API Rate Limits: What They Are and How to Handle Them

LLM API rate limits enforce per-minute token and request caps. Exponential backoff with jitter, request queuing, and caching are the standard strategies for handling them gracefully.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 17, 2026

7 min read

// tags

#rate-limits #api #exponential-backoff #production-llm #openai-api


LLM API rate limits are per-minute (or per-day) caps on how many tokens and requests you can send. When you exceed them, the API returns an HTTP 429 (Too Many Requests) error. Every major LLM provider enforces rate limits for fair use and infrastructure stability. Handling them correctly requires exponential backoff with jitter, request queuing, and, for high-volume applications, caching and tier upgrades.

What Rate Limits Are

Rate limits have two dimensions:

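TPM (Tokens Per Minute): the total number of input plus output tokens allowed per minute. This is usually the binding constraint for high-volume applications. Some providers meter input and output tokens under separate limits, so check which definition applies to your account.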

RPM (Requests Per Minute): the number of API calls per minute, regardless of token count. This often constrains applications making many small requests.

Some providers also have daily limits (TPD/RPD) that compound the per-minute limits.
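To see which dimension binds for a given workload, a rough calculation helps. The sketch below is illustrative only: `effectiveRequestsPerMinute` and its types are hypothetical names rather than part of any provider SDK, and it assumes every request uses about the same number of tokens.

```typescript
// Hypothetical helper: names and shapes are illustrative, not from any provider SDK.
interface RateLimit {
  rpm: number; // requests per minute
  tpm: number; // tokens per minute (input + output)
}

interface Throughput {
  requestsPerMinute: number;
  bindingLimit: "rpm" | "tpm";
}

// Assumes every request consumes roughly `avgTokensPerRequest` tokens.
function effectiveRequestsPerMinute(
  limit: RateLimit,
  avgTokensPerRequest: number
): Throughput {
  if (avgTokensPerRequest <= 0) {
    throw new RangeError("avgTokensPerRequest must be positive");
  }
  // How many requests fit inside the token budget for one minute
  const allowedByTokens = Math.floor(limit.tpm / avgTokensPerRequest);
  return allowedByTokens < limit.rpm
    ? { requestsPerMinute: allowedByTokens, bindingLimit: "tpm" }
    : { requestsPerMinute: limit.rpm, bindingLimit: "rpm" };
}
```

With a limit of 500 RPM and 30,000 TPM, requests averaging 1,500 tokens are capped at 20 per minute by TPM, long before the request cap matters. Requests averaging 50 tokens hit the 500 RPM cap first.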

Current Limits by Provider and Tier

These change frequently. Always check current documentation. As of early 2025:

OpenAI Tier 1 (new accounts with $5-50 spend):

GPT-4o: 500 RPM, 30,000 TPM
GPT-4o-mini: 500 RPM, 200,000 TPM

OpenAI Tier 4 ($1,000+ spend):

GPT-4o: 10,000 RPM, 800,000 TPM

Anthropic Tier 1 (new accounts):

Claude 3.5 Sonnet: 50 RPM, 40,000 TPM

Anthropic Tier 4 ($1,000+ spend):

Claude 3.5 Sonnet: 4,000 RPM, 400,000 TPM

Groq Free Tier:

Llama 3.3 70B: 30 RPM, 6,000 TPM

Note: Groq's free tier is remarkably low on TPM but useful for development. Their paid tier is significantly higher.

These figures are from official provider documentation and should be verified against current limits before relying on them in production planning.

Why Rate Limits Exist

Fair use: without limits, a single customer could consume all available capacity, degrading service for others.

Cost control: rate limits stop runaway usage caused by bugs or misconfigured applications from generating unexpectedly large bills.

Infrastructure stability: LLM inference is computationally expensive. Rate limits allow providers to plan capacity and maintain consistent quality under load.

The Correct Retry Strategy: Exponential Backoff With Jitter

When you hit a 429, the naive approach is to retry immediately or after a short fixed delay. This is wrong for two reasons: the rate limit window has probably not reset yet, and if multiple clients all retry on the same schedule (the "thundering herd" problem), they hit the limit again simultaneously.

Exponential backoff with jitter is the correct approach:

async function callWithRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 5
): Promise<T> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      // The caught value is untyped, so read `status` defensively
      const status = (error as { status?: number } | null)?.status;
      // Give up on non-429 errors, or once the final attempt has failed
      if (status !== 429 || attempt === maxRetries - 1) {
        throw error;
      }

      // Exponential backoff: 1s, 2s, 4s, 8s between the 5 default attempts
      const baseDelay = Math.pow(2, attempt) * 1000;
      // Jitter: randomize ±25% to avoid synchronized retries
      const jitter = baseDelay * 0.25 * (Math.random() * 2 - 1);
      const delay = baseDelay + jitter;

      console.log(`Rate limited. Retrying in ${Math.round(delay)}ms...`);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  // Reached only when maxRetries is 0 or negative, so fn was never called
  throw new Error("Max retries exceeded");
}

The retry-after header: most providers include a retry-after header in 429 responses indicating how many seconds to wait. Respect this value when present.
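One way to honor that header is sketched below. It is an assumption-laden example, not any SDK's API: `parseRetryAfterMs`, `retryDelayMs`, and the 60-second cap are hypothetical, and how you read the header from an error depends on your client library. The HTTP specification allows the value to be either a number of seconds or an HTTP date, so the parser handles both.

```typescript
// Hypothetical helper: converts a Retry-After header value into milliseconds.
// Accepts a number of seconds (leniently allowing decimals) or an HTTP date.
function parseRetryAfterMs(
  value: string | null,
  nowMs: number = Date.now()
): number | null {
  if (value === null || value.trim() === "") return null;
  const trimmed = value.trim();
  if (/^\d+(\.\d+)?$/.test(trimmed)) {
    return Number(trimmed) * 1000;
  }
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) return null;
  // A date in the past means no further wait is needed
  return Math.max(0, dateMs - nowMs);
}

// Prefer the server's hint; otherwise fall back to plain exponential backoff.
// `capMs` is an assumed upper bound so a bad header cannot stall the client indefinitely.
function retryDelayMs(
  retryAfter: string | null,
  attempt: number,
  capMs = 60_000
): number {
  const hinted = parseRetryAfterMs(retryAfter);
  const backoff = Math.pow(2, attempt) * 1000;
  return Math.min(capMs, hinted ?? backoff);
}
```

This sketch leaves jitter out for clarity. When many clients receive the same header value, adding a small random offset on top of the hinted delay keeps them from retrying in lockstep.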

Request Queuing

For applications that generate bursts of API requests (batch processing jobs, multiple simultaneous users), a request queue prevents hitting rate limits by spacing out requests:

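import OpenAI from "openai";
import PQueue from "p-queue";

// Reads OPENAI_API_KEY from the environment by default
const openai = new OpenAI();

// Allow max 50 requests per second (3,000 RPM); lower this to fit your tier
const queue = new PQueue({ interval: 1000, intervalCap: 50 });

async function queuedApiCall(prompt: string) {
  return queue.add(() => openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: prompt }],
  }));
}

// Example inputs
const prompts = ["Summarize document A", "Summarize document B"];

// All calls go through the queue, which caps how many start in each interval
const results = await Promise.all(prompts.map(queuedApiCall));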

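The p-queue library makes this straightforward. Because intervalCap counts requests per interval rather than per minute, with a 1,000 ms interval you should set it to roughly 80% of your RPM limit divided by 60 to leave headroom. Note that this throttles requests only; it does not track token usage.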
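The conversion from an RPM limit to queue settings can be sketched as a small function. `queueWindowForRpm` is a hypothetical helper, not part of p-queue; only the two field names it returns are meant to line up with the `interval` and `intervalCap` options used in this article, and the 80% default is the headroom suggested here.

```typescript
// Hypothetical helper: derives per-window queue settings from a per-minute limit.
interface QueueWindow {
  interval: number;    // window length in milliseconds
  intervalCap: number; // max requests started per window
}

function queueWindowForRpm(
  rpmLimit: number,
  headroom = 0.8,   // fraction of the limit to actually use
  intervalMs = 1000
): QueueWindow {
  if (rpmLimit <= 0 || headroom <= 0 || headroom > 1 || intervalMs <= 0) {
    throw new RangeError("rpmLimit and intervalMs must be positive, headroom in (0, 1]");
  }
  const perWindow = (rpmLimit * headroom * intervalMs) / 60_000;
  if (perWindow >= 1) {
    // Round down so the queue never exceeds the intended budget
    return { interval: intervalMs, intervalCap: Math.floor(perWindow) };
  }
  // Less than one request per window: stretch the window so exactly one fits
  return { interval: Math.ceil(60_000 / (rpmLimit * headroom)), intervalCap: 1 };
}
```

For a 500 RPM limit this yields 6 requests per second. For a 50 RPM limit, a one-second window would allow less than one request, so the sketch widens the window to 1,500 ms with a cap of 1.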

Model Fallback

When you hit the rate limit on your primary model, fall back to an alternative:

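import OpenAI from "openai";
import Anthropic from "@anthropic-ai/sdk";

const openai = new OpenAI();
const anthropic = new Anthropic();

// User/assistant turns only: Anthropic takes the system prompt as a separate `system` parameter
type Message = { role: "user" | "assistant"; content: string };

async function callWithFallback(messages: Message[]): Promise<string> {
  try {
    const response = await openai.chat.completions.create({
      model: "gpt-4o",
      messages,
    });
    return response.choices[0].message.content ?? "";
  } catch (error) {
    if ((error as { status?: number } | null)?.status === 429) {
      // Fall back to Claude when GPT-4o is rate-limited
      const response = await anthropic.messages.create({
        model: "claude-3-5-sonnet-20241022",
        max_tokens: 1024,
        messages,
      });
      // The two SDKs return different shapes, so normalize both to plain text
      const block = response.content[0];
      return block?.type === "text" ? block.text : "";
    }
    throw error;
  }
}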

This requires maintaining API keys for multiple providers, but dramatically improves reliability for production applications.

Caching to Reduce API Calls

Many applications make redundant API calls for identical or near-identical inputs. Caching eliminates these:

Exact match caching: hash the input messages and cache the response. Works for FAQ-style applications where many users ask the same questions.

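import { createHash } from "crypto";
import OpenAI from "openai";

// Reads OPENAI_API_KEY from the environment by default
const openai = new OpenAI();

type Message = { role: "system" | "user" | "assistant"; content: string };

// Maps a hash of the input messages to the cached response text
const cache = new Map<string, string>();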

async function cachedCompletion(messages: Message[]): Promise<string> {
  const key = createHash("sha256")
    .update(JSON.stringify(messages))
    .digest("hex");

  if (cache.has(key)) {
    return cache.get(key)!;
  }

  const response = await openai.chat.completions.create({ model: "gpt-4o", messages });
  const text = response.choices[0].message.content ?? "";
  cache.set(key, text);
  return text;
}

For production, use Redis instead of an in-memory Map so caching persists across restarts and scales across instances.
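Exact-match caching misses inputs that differ only in whitespace or casing. A possible refinement, sketched below under stated assumptions, is to normalize the text before hashing. `normalize` and `cacheKey` are hypothetical helpers; lowercasing is off by default because it is only safe when case does not change the meaning of a prompt, and the model name is folded into the key so different models never share cache entries.

```typescript
import { createHash } from "node:crypto";

type ChatMessage = { role: string; content: string };

// Collapse runs of whitespace so trivially different prompts share a key.
function normalize(text: string, lowercase: boolean): string {
  const collapsed = text.trim().replace(/\s+/g, " ");
  return lowercase ? collapsed.toLowerCase() : collapsed;
}

// Hypothetical helper: builds a stable SHA-256 key from the model and normalized messages.
function cacheKey(
  model: string,
  messages: ChatMessage[],
  lowercase = false
): string {
  const canonical = JSON.stringify({
    model,
    messages: messages.map((m) => ({
      role: m.role,
      content: normalize(m.content, lowercase),
    })),
  });
  return createHash("sha256").update(canonical).digest("hex");
}
```

The same key works unchanged as a Redis key, since it is a fixed-length hex string.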

Upgrading Tiers: How It Works

OpenAI and Anthropic use spend-based tier progression:

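OpenAI: Tier 2 requires $50 paid, Tier 3 requires $100, Tier 4 requires $1,000, and Tier 5 requires $10,000 in spending, as listed at the time of writing. Limits increase significantly at each tier. Higher tiers also require a minimum number of days since your first successful payment, so spend alone may not unlock a tier immediately.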

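Anthropic: a similar progression based on cumulative credit purchases, with Tier 2 at $250 spend, Tier 3 at $2,500, and Tier 4 at $10,000 as listed at the time of writing. These thresholds have shifted over time, so confirm them in the Anthropic Console before budgeting around them.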

For new projects expected to hit limits quickly, a practical strategy is to spend the threshold amount early (even on development and testing) so that higher limits are unlocked before production launch.

When Rate Limits Block You: Optimize vs Upgrade

Signs you need to optimize first:

You are making redundant API calls (implement caching)
Your requests are unnecessarily large (reduce context, trim system prompts)
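You are not using batching where possible (for work that is not time-sensitive, provider batch endpoints such as OpenAI's Batch API process requests asynchronously)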
A cheaper, smaller model would meet your quality bar (switch and get higher token limits at lower cost)
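Reducing context is often the quickest of these wins. The following is a minimal sketch, assuming a crude estimate of about 4 characters per token for English text rather than a real tokenizer; `estimateTokens` and `trimToBudget` are hypothetical names. It keeps all system messages and then as many of the most recent turns as fit in the budget.

```typescript
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Rough heuristic, not a real tokenizer: about 4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Returns system messages first, followed by the newest other messages that fit.
function trimToBudget(messages: ChatMessage[], maxTokens: number): ChatMessage[] {
  const system = messages.filter((m) => m.role === "system");
  let used = system.reduce((sum, m) => sum + estimateTokens(m.content), 0);

  const kept: ChatMessage[] = [];
  const newestFirst = messages.filter((m) => m.role !== "system").reverse();
  for (const message of newestFirst) {
    const cost = estimateTokens(message.content);
    // Stop at the first message that does not fit, so the kept history stays contiguous
    if (used + cost > maxTokens) break;
    kept.unshift(message);
    used += cost;
  }
  return [...system, ...kept];
}
```

For production use, swap the estimate for the tokenizer that matches your model, since character counts can be far off for code or non-English text.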

Signs you need to upgrade your tier:

Your application has genuine throughput requirements that exceed your current limits
You have implemented all reasonable optimizations
The cost of a higher tier is less than the cost of engineering more complex workarounds

Both strategies are often needed in combination.

Keep Reading

Cutting LLM API Costs: Complete Guide — Rate limits and costs are related problems
Streaming LLM Responses Guide — Streaming affects how rate limits apply
Best LLM for Coding 2026 — Different models have different rate limit structures

