What are LLM API rate limits?

LLM API rate limits are caps imposed by providers on the number of requests (RPM) and tokens (TPM) you can send per minute. They prevent abuse and ensure fair resource allocation. Exceeding them returns a 429 HTTP status code.

How do LLM API rate limits work?

Rate limits are measured over a rolling time window, typically one minute. Each API call counts once toward your RPM, and the tokens it consumes count toward your TPM. Once either limit is reached, subsequent requests are rejected until enough earlier usage ages out of the window (usually within a minute). Providers like OpenAI and Anthropic raise limits through spend-based tiers.

What are the best practices for handling LLM API rate limits?

Best practices include: 1) Exponential backoff with jitter on 429 errors, 2) Request queuing with libraries like p-queue, 3) Caching identical responses, 4) Model fallback to alternative providers, and 5) Upgrading your tier if optimizations aren't enough.

How much does it cost to handle LLM API rate limits?

Handling rate limits itself is free (it's just code). However, moving up a tier requires reaching a spend threshold: OpenAI Tier 4 needs $1,000 spent, Anthropic Tier 4 needs $10,000. Caching and optimization reduce costs by minimizing redundant API calls.

Is handling LLM API rate limits worth it in 2026?

Yes, absolutely. As LLM usage scales, rate limits become a bottleneck. Proper handling ensures reliability, reduces latency, and prevents service outages. Investing in retry logic, queuing, and caching pays off quickly for production applications.

LLM API Rate Limits: What They Are and How to Handle Them

LLM API rate limits enforce per-minute token and request caps. Exponential backoff with jitter, request queuing, and caching are the standard strategies for handling them gracefully.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 17, 2026

5 min read

// tags

#rate-limits

Why Rate Limits Exist

Fair use: without limits, a single customer could consume all available capacity, degrading service for others.

Cost control: rate limits stop runaway usage caused by bugs or misconfigured applications from generating enormous unexpected bills.

Infrastructure stability: LLM inference is computationally expensive. Rate limits let providers plan capacity and maintain consistent quality under load.

The Correct Retry Strategy: Exponential Backoff With Jitter

When you hit a 429, the naive approach is to retry immediately. This is wrong for two reasons: the rate limit window has not reset, and if multiple clients all retry at the same time (the "thundering herd" problem), they hit the limit again simultaneously.

Exponential backoff with jitter is the correct approach:

async function callWithRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 5
): Promise<T> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      // `error` is `unknown` under strict TypeScript, so read `status` defensively
      const status = (error as { status?: number } | null)?.status;
      // Only retry rate-limit errors, and give up after the final attempt
      if (status !== 429 || attempt === maxRetries - 1) {
        throw error;
      }

      // Exponential backoff: 1s, 2s, 4s, 8s (a fifth failure is rethrown above)
      const baseDelay = Math.pow(2, attempt) * 1000;
      // Jitter: randomize ±25% to avoid synchronized retries
      const jitter = baseDelay * 0.25 * (Math.random() * 2 - 1);
      const delay = baseDelay + jitter;

      console.log(`Rate limited. Retrying in ${Math.round(delay)}ms...`);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  // Only reachable when maxRetries <= 0, in which case fn was never called
  throw new Error("Max retries exceeded");
}

The Retry-After header: most providers include a Retry-After header in 429 responses indicating how many seconds to wait. When it is present, wait at least that long instead of relying on the computed backoff delay.
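A minimal sketch of honoring that header. How you read it depends on your client (with fetch it would be response.headers.get("retry-after")). The names parseRetryAfterMs, chooseDelayMs, and the 60-second cap are illustrative assumptions, not part of any SDK. The HTTP specification allows the value to be either a number of seconds or an HTTP date, so the sketch handles both.

```typescript
// Parse a Retry-After header value into milliseconds.
// Returns null when the header is missing or unparseable.
function parseRetryAfterMs(
  value: string | null | undefined,
  now: number = Date.now()
): number | null {
  if (!value) return null;
  const trimmed = value.trim();

  // Form 1: a delay in seconds, e.g. "2"
  if (/^\d+(\.\d+)?$/.test(trimmed)) {
    return Math.round(Number(trimmed) * 1000);
  }

  // Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) return null;
  return Math.max(0, dateMs - now);
}

// Prefer the server's hint; otherwise use the computed backoff delay.
// maxDelayMs is a hypothetical safety cap so a bad header cannot stall a worker.
function chooseDelayMs(
  retryAfter: string | null | undefined,
  backoffMs: number,
  maxDelayMs = 60_000
): number {
  const hinted = parseRetryAfterMs(retryAfter);
  return Math.min(hinted ?? backoffMs, maxDelayMs);
}
```

Inside a retry loop, the result of chooseDelayMs would replace the jittered delay whenever the server supplies a hint.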

Request Queuing

For applications that generate bursts of API requests (batch processing jobs, multiple simultaneous users), a request queue prevents hitting rate limits by spacing out requests:

import OpenAI from "openai";
import PQueue from "p-queue";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Start at most 50 requests per 1-second interval (3,000 RPM)
const queue = new PQueue({ interval: 1000, intervalCap: 50 });

async function queuedApiCall(prompt: string) {
  return queue.add(() => openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: prompt }],
  }));
}

// `prompts` is a string[] defined elsewhere.
// All calls go through the queue, which spaces them automatically
const results = await Promise.all(prompts.map(queuedApiCall));

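The p-queue library makes this straightforward. Because intervalCap applies per interval rather than per minute, convert your RPM limit to a per-interval figure and use about 80% of it to leave headroom: with a 3,000 RPM limit and a 1-second interval, that means an intervalCap of 40 rather than 50.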
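A request queue only caps request counts, while providers also cap tokens per minute. The sketch below shows one way to track a rolling token budget on the client side. TokenWindow and its methods are hypothetical names, and the token counts passed in are assumed to be your own estimates (prompt plus expected output), not values returned by any provider API.

```typescript
// Tracks token usage over a rolling window and reports how long to wait
// before a request of a given size fits under the TPM budget.
class TokenWindow {
  private events: { at: number; tokens: number }[] = [];
  private tokensPerMinute: number;
  private windowMs: number;

  constructor(tokensPerMinute: number, windowMs = 60_000) {
    this.tokensPerMinute = tokensPerMinute;
    this.windowMs = windowMs;
  }

  // Record usage for a request that was just sent.
  record(tokens: number, now: number = Date.now()): void {
    this.events.push({ at: now, tokens });
  }

  // Returns 0 if `tokens` fits now, otherwise the milliseconds until enough
  // old usage has aged out. Returns Infinity if the request is larger than
  // the whole budget, in which case the caller should reject or split it.
  msUntilAvailable(tokens: number, now: number = Date.now()): number {
    // Drop usage records that have left the window.
    this.events = this.events.filter((e) => now - e.at < this.windowMs);

    let used = this.events.reduce((sum, e) => sum + e.tokens, 0);
    if (used + tokens <= this.tokensPerMinute) return 0;

    // Walk from oldest to newest, releasing usage until the request fits.
    for (const e of this.events) {
      used -= e.tokens;
      if (used + tokens <= this.tokensPerMinute) {
        return e.at + this.windowMs - now;
      }
    }
    return Number.POSITIVE_INFINITY;
  }
}
```

A caller would check msUntilAvailable before each request, sleep for that long if it is non-zero, and then call record once the request is sent.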

Model Fallback

When you hit the rate limit on your primary model, fall back to an alternative:

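import OpenAI from "openai";
import Anthropic from "@anthropic-ai/sdk";

const openai = new OpenAI();
const anthropic = new Anthropic();

// Only user/assistant turns: Anthropic takes the system prompt as a separate
// `system` parameter rather than as a message role.
type Message = { role: "user" | "assistant"; content: string };

async function callWithFallback(messages: Message[]): Promise<string> {
  try {
    const response = await openai.chat.completions.create({
      model: "gpt-4o",
      messages,
    });
    return response.choices[0].message.content ?? "";
  } catch (error) {
    if ((error as { status?: number } | null)?.status === 429) {
      // Fall back to Claude when GPT-4o is rate-limited
      const response = await anthropic.messages.create({
        model: "claude-3-5-sonnet-20241022",
        max_tokens: 1024,
        messages,
      });
      // The two SDKs return different shapes, so normalize to plain text
      const block = response.content[0];
      return block?.type === "text" ? block.text : "";
    }
    throw error;
  }
}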

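This requires maintaining API keys for multiple providers, but it means a rate limit on one provider no longer takes your application down. The providers return differently shaped responses, so normalize the output before handing it to the rest of your application.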

Caching to Reduce API Calls

Many applications make redundant API calls for identical or near-identical inputs. Caching eliminates these:

Exact match caching: hash the input messages and cache the response. Works for FAQ-style applications where many users ask the same questions.

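import { createHash } from "crypto";
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const MODEL = "gpt-4o";

type Message = { role: "system" | "user" | "assistant"; content: string };

const cache = new Map<string, string>();

async function cachedCompletion(messages: Message[]): Promise<string> {
  // Include the model in the key so switching models never serves stale answers
  const key = createHash("sha256")
    .update(JSON.stringify({ model: MODEL, messages }))
    .digest("hex");

  const cached = cache.get(key);
  if (cached !== undefined) {
    return cached;
  }

  const response = await openai.chat.completions.create({ model: MODEL, messages });
  const text = response.choices[0].message.content ?? "";
  cache.set(key, text);
  return text;
}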

For production, use Redis instead of an in-memory Map so caching persists across restarts and scales across instances.
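One way to keep that swap cheap is to code against a small storage interface from the start. The sketch below is an assumption about how you might structure it: CacheStore, MemoryStore, and getOrCompute are hypothetical names, and the expiry time (TTL) is an addition not covered above. Only the in-memory version is shown; a Redis-backed class would implement the same two methods.

```typescript
// Minimal async key-value interface that a cache backend must satisfy.
interface CacheStore {
  get(key: string): Promise<string | undefined>;
  set(key: string, value: string, ttlMs: number): Promise<void>;
}

// In-memory implementation with per-entry expiry.
class MemoryStore implements CacheStore {
  private entries = new Map<string, { value: string; expiresAt: number }>();
  private now: () => number;

  // `now` is injectable so expiry can be tested without waiting.
  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  async get(key: string): Promise<string | undefined> {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // expired: evict lazily on read
      return undefined;
    }
    return entry.value;
  }

  async set(key: string, value: string, ttlMs: number): Promise<void> {
    this.entries.set(key, { value, expiresAt: this.now() + ttlMs });
  }
}

// Return the cached value for `key`, or compute, store, and return it.
async function getOrCompute(
  store: CacheStore,
  key: string,
  ttlMs: number,
  compute: () => Promise<string>
): Promise<string> {
  const hit = await store.get(key);
  if (hit !== undefined) return hit;
  const value = await compute();
  await store.set(key, value, ttlMs);
  return value;
}
```

Here compute would be the function that actually calls the LLM API, and key would be the SHA-256 hash of the request.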

Upgrading Tiers: How It Works

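OpenAI and Anthropic use spend-based tier progression. Providers revise these thresholds, so confirm the figures below against each provider's current rate limit documentation before planning around them.

OpenAI: Tier 2 requires $50 paid, Tier 3 requires $100, Tier 4 requires $1,000, and Tier 5 requires $10,000. Limits increase significantly at each tier.

Anthropic: a similar progression, with Tier 2 at $250 spend, Tier 3 at $2,500, and Tier 4 at $10,000.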

For new projects expected to hit limits quickly, spending the threshold amount early (even on development/testing) to unlock higher limits before production launch is a practical strategy.

When Rate Limits Block You: Optimize vs Upgrade

Signs you need to optimize first:

You are making redundant API calls (implement caching)
Your requests are unnecessarily large (reduce context, trim system prompts)
You are not using batching where possible
A cheaper, smaller model would meet your quality bar (switch and get higher token limits at lower cost)
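As an illustration of reducing context, the sketch below keeps system messages plus the most recent turns that fit a token budget. trimHistory and estimateTokens are hypothetical helpers, and the 4-characters-per-token figure is a rough rule of thumb for English text rather than a real tokenizer. Use your provider's tokenizer if you need accurate counts.

```typescript
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Rough heuristic, NOT a real tokenizer: about 4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep all system messages plus the newest turns that fit in `maxTokens`.
function trimHistory(messages: ChatMessage[], maxTokens: number): ChatMessage[] {
  const system = messages.filter((m) => m.role === "system");
  const turns = messages.filter((m) => m.role !== "system");

  // Whatever the system messages use is unavailable for conversation turns.
  let budget =
    maxTokens - system.reduce((n, m) => n + estimateTokens(m.content), 0);
  const kept: ChatMessage[] = [];

  // Walk backwards so the newest turns are kept first.
  for (let i = turns.length - 1; i >= 0; i--) {
    const turn = turns[i];
    const cost = estimateTokens(turn.content);
    if (cost > budget) break; // stop at the first turn that does not fit
    budget -= cost;
    kept.unshift(turn);
  }
  return [...system, ...kept];
}
```

Smaller requests consume less of your TPM budget, so trimming like this raises the number of calls you can make before hitting the limit.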

Signs you need to upgrade your tier:

Your application has genuine throughput requirements that exceed your current limits
You have implemented all reasonable optimizations
The cost of a higher tier is less than the cost of engineering more complex workarounds

Both strategies are often needed in combination.

Keep Reading

Cutting LLM API Costs: Complete Guide — Rate limits and costs are related problems
Streaming LLM Responses Guide — Streaming affects how rate limits apply
Best LLM for Coding 2026 — Different models have different rate limit structures
