Without rate limiting and cost controls, a single runaway process or a single heavy user can generate thousands of dollars in API bills overnight. Per-user token limits, global monthly budget alerts, and circuit breakers that pause LLM calls when costs exceed thresholds are the three layers of defense every production LLM application needs. LiteLLM proxy is the most practical tool for adding all three without modifying your core application code.
Why Cost Controls Are Mandatory
LLM API costs are uniquely dangerous compared to traditional infrastructure costs because they scale linearly with usage and there is no natural bottleneck. A traditional web server gets slower under load. An LLM API just keeps processing requests and charging you for them.
Real scenarios where cost controls matter:
- A bug in your application causes an infinite retry loop on failed requests: a month's worth of spend accumulates in hours.
- A power user discovers they can extract large amounts of content through your chatbot and automates it: your bill for one user exceeds your entire planned monthly budget.
- A model update causes responses to be 3x longer than expected: your output token costs triple without any change in usage volume.
- A denial-of-wallet attack: malicious users spam your API-connected endpoint to inflate your bill.
Each of these scenarios is preventable with the controls described below.
Layer 1: Per-User Token Limits
Track token usage per user in your database and enforce limits at the API gateway layer, before the LLM call is made.
import { getDatabase } from "@/lib/mongodb/client";
import { ObjectId } from "mongodb";

interface UserTokenUsage {
  user_id: ObjectId;
  organization_id: ObjectId;
  tokens_used_this_month: number;
  month_key: string; // "2026-05"
  limit: number;
}

async function checkAndIncrementTokenUsage(
  userId: string,
  organizationId: string,
  estimatedTokens: number
): Promise<{ allowed: boolean; remaining: number }> {
  const db = await getDatabase();
  const usageCollection = db.collection<UserTokenUsage>("token_usage");

  // One usage document per user, organization, and calendar month (UTC).
  const monthKey = new Date().toISOString().slice(0, 7);
  const filter = {
    user_id: new ObjectId(userId),
    organization_id: new ObjectId(organizationId),
    month_key: monthKey,
  };

  const usage = await usageCollection.findOne(filter);
  const currentUsage = usage?.tokens_used_this_month ?? 0;
  const limit = usage?.limit ?? 100_000; // default 100k tokens/month

  // Reject before calling the LLM if the estimate would exceed the limit.
  if (currentUsage + estimatedTokens > limit) {
    return { allowed: false, remaining: Math.max(0, limit - currentUsage) };
  }

  // Note: the read above and this write are separate operations, so
  // concurrent requests from one user can briefly overshoot the limit.
  await usageCollection.updateOne(
    filter,
    { $inc: { tokens_used_this_month: estimatedTokens }, $setOnInsert: { limit } },
    { upsert: true }
  );

  return { allowed: true, remaining: limit - currentUsage - estimatedTokens };
}
Update the token count with actual usage after the API call completes (not just the estimate). Store actual token counts from the API response's usage field.
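One way to apply that correction is to charge the estimate up front and then adjust by the difference once the response arrives. The sketch below is in Python and uses an in-memory dictionary as a stand-in for the `token_usage` collection. `UsageLedger`, `reserve`, and `reconcile` are hypothetical names, and the `usage` argument is assumed to be a dict shaped like the usage field of an OpenAI-style response (`prompt_tokens`, `completion_tokens`).

```python
from dataclasses import dataclass, field


@dataclass
class UsageLedger:
    """In-memory stand-in for the token_usage collection (illustration only)."""

    limit: int = 100_000
    used: dict = field(default_factory=dict)  # user_id -> tokens charged this month

    def reserve(self, user_id: str, estimated_tokens: int) -> bool:
        """Charge the estimate before the LLM call; refuse if it would exceed the limit."""
        current = self.used.get(user_id, 0)
        if current + estimated_tokens > self.limit:
            return False
        self.used[user_id] = current + estimated_tokens
        return True

    def reconcile(self, user_id: str, estimated_tokens: int, usage: dict) -> int:
        """Swap the earlier estimate for the actual token count reported by the API."""
        actual = usage["prompt_tokens"] + usage["completion_tokens"]
        # The delta is negative for an over-estimate and positive for an under-estimate.
        self.used[user_id] = self.used.get(user_id, 0) + (actual - estimated_tokens)
        return self.used[user_id]


ledger = UsageLedger(limit=1_000)
ledger.reserve("user-1", 400)
total = ledger.reconcile("user-1", 400, {"prompt_tokens": 120, "completion_tokens": 180})
print(total)  # 300
```

In a real deployment the adjustment would be a single `$inc` on the stored document by `actual - estimated`, so the counter converges on real usage without a second read.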
Layer 2: Global Budget Alerts
Set up a monthly budget alarm that notifies you before costs reach dangerous levels. Most providers have billing alert functionality:
OpenAI: Dashboard > Settings > Billing > Usage limits. Set a "soft limit" (notification) and "hard limit" (automatic cutoff).
Anthropic: Billing alerts are available in the Console under account settings. Set email notifications at 50%, 80%, and 100% of your monthly budget.
For custom monitoring, track costs in your own database and alert through Slack or email:
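# calculate_current_month_cost, send_slack_alert, enable_circuit_breaker, and
# send_pagerduty_alert are placeholders for your own cost query and alerting code.
def check_monthly_spend():
    # Query your token usage logs
    total_cost = calculate_current_month_cost()
    budget = 500  # $500 monthly budget

    if total_cost > budget:
        # Over budget: stop LLM calls and page the on-call engineer
        enable_circuit_breaker()
        send_pagerduty_alert("LLM monthly budget exceeded, circuit breaker enabled")
    elif total_cost > budget * 0.8:
        # Early warning once spend passes 80% of the budget
        send_slack_alert(
            f"LLM spend at ${total_cost:.2f}, {total_cost / budget:.0%} of ${budget} budget. "
            f"Remaining: ${budget - total_cost:.2f}"
        )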
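The cost query itself is not shown above. As a minimal sketch, assume each logged request records its model, input tokens, output tokens, and a month key. The prices in the table are placeholder values in USD per million tokens, not current provider pricing, and `request_cost` and `calculate_month_cost` are hypothetical helper names.

```python
# Placeholder prices in USD per 1M tokens; replace with your provider's current rates.
PRICES_PER_MILLION = {
    "model-small": {"input": 0.50, "output": 2.00},
    "model-large": {"input": 5.00, "output": 20.00},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request: tokens times the per-token price for each direction."""
    price = PRICES_PER_MILLION[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def calculate_month_cost(log_rows: list, month_key: str) -> float:
    """Sum the cost of every logged request that falls in the given month."""
    return sum(
        request_cost(row["model"], row["input_tokens"], row["output_tokens"])
        for row in log_rows
        if row["month_key"] == month_key
    )


rows = [
    {"month_key": "2026-05", "model": "model-small", "input_tokens": 1_000_000, "output_tokens": 500_000},
    {"month_key": "2026-05", "model": "model-large", "input_tokens": 200_000, "output_tokens": 100_000},
    {"month_key": "2026-04", "model": "model-large", "input_tokens": 9_000_000, "output_tokens": 9_000_000},
]
month_cost = calculate_month_cost(rows, "2026-05")
print(f"${month_cost:.2f}")  # $4.50
```

Pricing input and output tokens separately matters because output tokens usually cost more, which is why longer responses raise the bill even when request volume is flat.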
Layer 3: Circuit Breakers
A circuit breaker automatically stops LLM API calls when costs or error rates exceed a threshold. It prevents runaway bills and gives you time to investigate.
interface CircuitBreakerState {
  is_open: boolean;
  opened_at: Date | null;
  reason: string | null;
}

class LLMCircuitBreaker {
  private state: CircuitBreakerState = {
    is_open: false,
    opened_at: null,
    reason: null,
  };

  open(reason: string): void {
    this.state = { is_open: true, opened_at: new Date(), reason };
    console.error(`Circuit breaker opened: ${reason}`);
    // Send alert to on-call engineer
  }

  close(): void {
    this.state = { is_open: false, opened_at: null, reason: null };
  }

  isOpen(): boolean {
    return this.state.is_open;
  }

  async callWithProtection<T>(fn: () => Promise<T>): Promise<T> {
    if (this.isOpen()) {
      throw new Error(`LLM circuit breaker open: ${this.state.reason}. Contact your administrator.`);
    }
    return fn();
  }
}

export const circuitBreaker = new LLMCircuitBreaker();
Triggers for opening the circuit breaker:
- Monthly spend exceeds budget
- Error rate on LLM calls exceeds 10% over 5 minutes (provider outage)
- Average response latency exceeds 30 seconds (provider degradation)
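The error-rate trigger can be sketched as a sliding window over recent call outcomes. The Python example below is an illustration under stated assumptions, not part of the circuit breaker class above: `ErrorRateTrigger` and its parameters are hypothetical, and `min_calls` is an added guard so that a single failure on a quiet system does not count as a 100% error rate.

```python
import time
from collections import deque
from typing import Optional


class ErrorRateTrigger:
    """Tracks call outcomes in a sliding time window and reports when to trip."""

    def __init__(self, threshold: float = 0.10, window_seconds: float = 300.0, min_calls: int = 20):
        self.threshold = threshold            # e.g. 0.10 for a 10% error rate
        self.window_seconds = window_seconds  # e.g. 300 for a 5-minute window
        self.min_calls = min_calls            # ignore windows with too few calls
        self._events = deque()                # (timestamp, failed) pairs, oldest first

    def _evict(self, now: float) -> None:
        # Drop outcomes that have aged out of the window.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            self._events.popleft()

    def record(self, failed: bool, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self._events.append((now, failed))
        self._evict(now)

    def should_open(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        self._evict(now)
        total = len(self._events)
        if total < self.min_calls:
            return False
        failures = sum(1 for _, failed in self._events if failed)
        return failures / total > self.threshold


trigger = ErrorRateTrigger(threshold=0.10, window_seconds=300, min_calls=10)
for i in range(20):
    # Every fourth call fails: 5 failures out of 20 calls.
    trigger.record(failed=(i % 4 == 0), now=float(i))
print(trigger.should_open(now=20.0))  # True
```

Whatever records LLM call outcomes would call `record` after each request and open the breaker when `should_open` returns true. The latency trigger can follow the same pattern with response times in place of failure flags.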
Using LiteLLM Proxy for Built-In Controls
LiteLLM is an open source proxy that adds rate limiting, routing, and cost tracking in front of any LLM API. It exposes an OpenAI-compatible API, so your application connects to LiteLLM instead of directly to OpenAI or Anthropic.
# litellm_config.yaml
model_list:
  - model_name: "gpt-4o-mini"
    litellm_params:
      model: "openai/gpt-4o-mini"
  - model_name: "claude-haiku"
    litellm_params:
      model: "anthropic/claude-3-5-haiku-20241022"

router_settings:
  routing_strategy: "cost-based-routing"

litellm_settings:
  max_budget: 500 # Monthly budget in USD
  budget_duration: "1mo"
  success_callback: ["langsmith"]
pip install 'litellm[proxy]'
litellm --config litellm_config.yaml --port 4000
LiteLLM handles per-user tracking, global budgets, and routing in a single service. It is the fastest way to add these controls to an existing application without modifying application code.
Monitoring Spend Per Feature
Beyond global and per-user limits, track cost at the feature level. Add metadata tags to your API calls:
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    metadata={
        "feature": "meeting_summary",
        "organization_id": org_id,
        "user_id": user_id
    }
)
Aggregate costs by feature tag weekly. If one feature is consuming 60% of your LLM budget and it is not a core feature, that is a signal to optimize or gate it.
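The weekly aggregation can be as simple as grouping logged request costs by their feature tag. A minimal sketch, assuming each log record carries a `feature` tag and a precomputed `cost_usd` value; both field names and the `cost_share_by_feature` helper are hypothetical.

```python
from collections import defaultdict


def cost_share_by_feature(records: list) -> dict:
    """Return each feature's share of total spend, largest first."""
    totals = defaultdict(float)
    for record in records:
        totals[record["feature"]] += record["cost_usd"]

    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}

    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return {feature: cost / grand_total for feature, cost in ranked}


records = [
    {"feature": "meeting_summary", "cost_usd": 4.0},
    {"feature": "chat", "cost_usd": 3.0},
    {"feature": "meeting_summary", "cost_usd": 2.0},
    {"feature": "search", "cost_usd": 1.0},
]
shares = cost_share_by_feature(records)
for feature, share in shares.items():
    print(f"{feature}: {share:.0%}")
```

With this sample data `meeting_summary` accounts for 60% of spend, the kind of result that should prompt a closer look at that feature.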