LLM Rate Limiting and Cost Controls: How to Prevent Runaway API Bills

Runaway LLM bills happen without rate limits and budget alerts. Here is how to implement per-user limits, global budget controls, and circuit breakers that protect your margins.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 17, 2026

9 min read

// tags

#rate-limiting#cost-control#llm-budget#ai-infrastructure

FIG. ART-30

9 min read

“

LLM Rate Limiting and Cost Controls: How to Prevent Runaway API Bills

// reading plan

sections

978

words

min read

// AI Cost & Efficiency

Semantic Caching: How to Serve LLM Responses Without Calling the API

Semantic caching stores LLM responses and returns them when a new query is semantically similar to a cached one. In customer support applications, hit rates of 15-40% are realistic.

8 min read

// Developer Tools

How to Implement API Rate Limiting: A Complete Guide for Application Developers

Without rate limiting and cost controls, a single runaway process or a single heavy user can generate thousands of dollars in API bills overnight. Per-user token limits, global monthly budget alerts, and circuit breakers that pause LLM calls when costs exceed thresholds are the three layers of defense every production LLM application needs. LiteLLM proxy is the most practical tool for adding all three without modifying your core application code.

Why Cost Controls Are Mandatory

LLM API costs are uniquely dangerous compared to traditional infrastructure costs because they scale linearly with usage and there is no natural bottleneck. A traditional web server gets slower under load. An LLM API just keeps processing requests and charging you for them.

Real scenarios where cost controls matter:

A bug in your application causes an infinite retry loop on failed requests: your monthly bill accumulates in hours.
A power user discovers they can extract large amounts of content through your chatbot and automates it: your bill for one user exceeds your entire planned monthly budget.
A model update causes responses to be 3x longer than expected: your output token costs triple without any change in usage volume.
A denial-of-wallet attack: malicious users spam your API-connected endpoint to inflate your bill.

Each of these scenarios is preventable with the controls described below.

Layer 1: Per-User Token Limits

Track token usage per user in your database and enforce limits at the API gateway layer, before the LLM call is made.

import { getDatabase } from "@/lib/mongodb/client";
import { ObjectId } from "mongodb";

interface UserTokenUsage {
  user_id: ObjectId;
  organization_id: ObjectId;
  tokens_used_this_month: number;
  month_key: string; // "2026-05"
  limit: number;
}

async function checkAndIncrementTokenUsage(
  userId: string,
  organizationId: string,
  estimatedTokens: number
): Promise<{ allowed: boolean; remaining: number }> {
  const db = await getDatabase();
  const monthKey = new Date().toISOString().slice(0, 7);

  const usage = await db.collection<UserTokenUsage>("token_usage").findOne({
    user_id: new ObjectId(userId),
    organization_id: new ObjectId(organizationId),
    month_key: monthKey,
  });

  const currentUsage = usage?.tokens_used_this_month ?? 0;
  const limit = usage?.limit ?? 100_000; // default 100k tokens/month

  if (currentUsage + estimatedTokens > limit) {
    return { allowed: false, remaining: Math.max(0, limit - currentUsage) };
  }

  await db.collection("token_usage").updateOne(
    { user_id: new ObjectId(userId), organization_id: new ObjectId(organizationId), month_key: monthKey },
    { $inc: { tokens_used_this_month: estimatedTokens }, $setOnInsert: { limit } },
    { upsert: true }
  );

  return { allowed: true, remaining: limit - currentUsage - estimatedTokens };
}

Update the token count with actual usage after the API call completes (not just the estimate). Store actual token counts from the API response's usage field.

Layer 2: Global Budget Alerts

Set up a monthly budget alarm that notifies you before costs reach dangerous levels. Most providers have billing alert functionality:

OpenAI: Dashboard > Settings > Billing > Usage limits. Set a "soft limit" (notification) and "hard limit" (automatic cutoff).

Anthropic: Billing alerts are available in the Console under account settings. Set email notifications at 50%, 80%, and 100% of your monthly budget.

For custom monitoring, track costs in your own database and alert through Slack or email:

import boto3  # Or your alerting library of choice

def check_monthly_spend():
    # Query your token usage logs
    total_cost = calculate_current_month_cost()
    budget = 500  # $500 monthly budget

    if total_cost > budget * 0.8:
        send_slack_alert(
            f"LLM spend at ${total_cost:.2f} — 80% of ${budget} budget. "
            f"Remaining: ${budget - total_cost:.2f}"
        )

    if total_cost > budget:
        enable_circuit_breaker()
        send_pagerduty_alert("LLM monthly budget exceeded, circuit breaker enabled")

Layer 3: Circuit Breakers

A circuit breaker automatically stops LLM API calls when costs or error rates exceed a threshold. It prevents runaway bills and gives you time to investigate.

interface CircuitBreakerState {
  is_open: boolean;
  opened_at: Date | null;
  reason: string | null;
}

class LLMCircuitBreaker {
  private state: CircuitBreakerState = {
    is_open: false,
    opened_at: null,
    reason: null,
  };

  open(reason: string): void {
    this.state = { is_open: true, opened_at: new Date(), reason };
    console.error(`Circuit breaker opened: ${reason}`);
    // Send alert to on-call engineer
  }

  close(): void {
    this.state = { is_open: false, opened_at: null, reason: null };
  }

  isOpen(): boolean {
    return this.state.is_open;
  }

  async callWithProtection<T>(fn: () => Promise<T>): Promise<T> {
    if (this.isOpen()) {
      throw new Error(`LLM circuit breaker open: ${this.state.reason}. Contact your administrator.`);
    }
    return fn();
  }
}

export const circuitBreaker = new LLMCircuitBreaker();

Triggers for opening the circuit breaker:

Monthly spend exceeds budget
Error rate on LLM calls exceeds 10% over 5 minutes (provider outage)
Average response latency exceeds 30 seconds (provider degradation)

Using LiteLLM Proxy for Built-In Controls

LiteLLM is an open source proxy that adds rate limiting, routing, and cost tracking in front of any LLM API. It exposes an OpenAI-compatible API, so your application connects to LiteLLM instead of directly to OpenAI or Anthropic.

# litellm_config.yaml
model_list:
  - model_name: "gpt-4o-mini"
    litellm_params:
      model: "openai/gpt-4o-mini"
  - model_name: "claude-haiku"
    litellm_params:
      model: "anthropic/claude-3-5-haiku-20241022"

router_settings:
  routing_strategy: "cost-based-routing"

litellm_settings:
  max_budget: 500  # Monthly budget in USD
  budget_duration: "1mo"
  success_callback: ["langsmith"]

pip install litellm
litellm --config litellm_config.yaml --port 4000

LiteLLM handles per-user tracking, global budgets, and routing in a single service. It is the fastest way to add these controls to an existing application without modifying application code.

Monitoring Spend Per Feature

Beyond global and per-user limits, track cost at the feature level. Add metadata tags to your API calls:

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    metadata={
        "feature": "meeting_summary",
        "organization_id": org_id,
        "user_id": user_id
    }
)

Aggregate costs by feature tag weekly. If one feature is consuming 60% of your LLM budget and it is not a core feature, that is a signal to optimize or gate it.

Keep Reading

Cutting LLM API Costs: The Complete Guide — All cost reduction strategies including rate limiting.
Model Routing Guide — Route cheap queries to cheap models as a cost control mechanism.
AI Budget for Startups — How much to budget and when to invest in controls.

Pristren builds AI-powered software for teams. Zlyqor is our all-in-one workspace — chat, projects, time tracking, AI meeting summaries, and invoicing — in one tool. Try it free.

LLM Rate Limiting and Cost Controls: How to Prevent Runaway API Bills

Related Articles

Semantic Caching: How to Serve LLM Responses Without Calling the API

Why Cost Controls Are Mandatory

Layer 1: Per-User Token Limits

Layer 2: Global Budget Alerts

Layer 3: Circuit Breakers

Using LiteLLM Proxy for Built-In Controls

Monitoring Spend Per Feature

Keep Reading

The workspace your team
actually needs

AI & ML insights, weekly

Mahmudul Haque Qudrati

How to Implement API Rate Limiting: A Complete Guide for Application Developers

Flash Attention Explained: The Engineering Trick Behind Long-Context LLMs

LLM Rate Limiting and Cost Controls: How to Prevent Runaway API Bills

Related Articles

Semantic Caching: How to Serve LLM Responses Without Calling the API

Why Cost Controls Are Mandatory

Layer 1: Per-User Token Limits

Layer 2: Global Budget Alerts

Layer 3: Circuit Breakers

Using LiteLLM Proxy for Built-In Controls

Monitoring Spend Per Feature

Keep Reading

The workspace your teamactually needs

AI & ML insights, weekly

Mahmudul Haque Qudrati

How to Implement API Rate Limiting: A Complete Guide for Application Developers

Flash Attention Explained: The Engineering Trick Behind Long-Context LLMs

The workspace your team
actually needs