Helicone: Track LLM Costs, Cache Responses, and Rate-Limit Users

Helicone sits between your app and LLM APIs as a one-line proxy — giving you per-user cost attribution, response caching, and rate limiting without changing your application logic.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 18, 2026

7 min read

// tags

#helicone #cost-monitoring #caching #rate-limiting #observability


The LLM Cost Problem

LLM API costs are invisible until they are not. A single prompt change can double token usage. One user can exhaust your monthly budget in a day. Without visibility, cost optimisation is guesswork. Helicone solves this by acting as a transparent proxy between your application and any LLM API — capturing every request, response, token count, and cost in real time.

One-Line Integration

The entire integration is pointing base_url at Helicone and adding one auth header to your OpenAI client:

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_OPENAI_KEY",
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": "Bearer YOUR_HELICONE_KEY",
    },
)

# All existing code works unchanged
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain PagedAttention"}],
)

Every request is now logged in the Helicone dashboard with model, tokens, cost, and latency. User attribution appears once requests carry a user ID.
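
Because every Helicone feature in this article is switched on through request headers, it can help to build those headers in one place. The following is a minimal sketch, not part of any SDK: the helper name `helicone_headers` and the environment variable `HELICONE_API_KEY` are assumptions chosen for illustration.

```python
import os


def helicone_headers(api_key=None, user_id=None):
    """Build the Helicone request headers used in this article.

    Hypothetical helper: the function name and the HELICONE_API_KEY
    environment variable are illustrative assumptions.
    """
    key = api_key or os.environ.get("HELICONE_API_KEY")
    if not key:
        raise ValueError("a Helicone API key is required")
    headers = {"Helicone-Auth": f"Bearer {key}"}
    if user_id is not None:
        # Optional per-user attribution header
        headers["Helicone-User-Id"] = str(user_id)
    return headers
```

The result could be passed as `default_headers=helicone_headers()` when constructing the client, which keeps the key out of source code.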

User-Level Cost Attribution

Pass a user ID header to see costs broken down per user:

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "..."}],
    extra_headers={"Helicone-User-Id": "user-456"},
)

The dashboard shows cost per user as a histogram, which makes it easy to identify heavy users driving disproportionate spend.
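
To make "disproportionate spend" concrete, the same breakdown can be reproduced offline from any list of per-request costs. This is a rough sketch under assumptions: the `(user_id, cost_usd)` record shape, the function names, and the sample figures are hypothetical and do not reflect Helicone's export format.

```python
from collections import defaultdict


def cost_by_user(records):
    """Sum cost per user from (user_id, cost_usd) pairs."""
    totals = defaultdict(float)
    for user_id, cost_usd in records:
        totals[user_id] += cost_usd
    return dict(totals)


def heavy_users(totals, share=0.5):
    """Return users whose spend is at least `share` of the total."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    return sorted(u for u, c in totals.items() if c / grand_total >= share)


# Hypothetical sample data, for illustration only
records = [("user-456", 0.75), ("user-123", 0.25), ("user-456", 1.0)]
totals = cost_by_user(records)
flagged = heavy_users(totals)
```

Here `user-456` accounts for 1.75 of the 2.00 total, so it is the only user flagged at the default 50% threshold.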

Response Caching — Save Up to 80%

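Enable caching so that repeated requests are answered from Helicone's cache instead of the provider. Matching is on the exact request, so prompts that differ by even one character count as separate entries: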

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is PagedAttention?"}],
    extra_headers={
        "Helicone-Cache-Enabled": "true",
        "Helicone-Cache-Bucket-Max-Size": "3",  # Store up to 3 responses per identical request
    },
)

Cache hits skip the provider entirely, so they cost $0 in tokens and return in under 50ms. For apps with repeated queries (FAQs, knowledge bases, templates), cache hit rates of 30–80% are common.
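
The savings follow directly from the hit rate: if cache hits cost nothing, provider spend scales with the fraction of requests that miss. A small sketch of that arithmetic, where the function name and the example prices are assumptions for illustration:

```python
def monthly_cost_with_cache(requests, cost_per_request, hit_rate):
    """Estimate provider spend when a fraction of requests hit the cache.

    Assumes cache hits cost nothing and every miss costs the same amount.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    misses = requests * (1.0 - hit_rate)
    return misses * cost_per_request
```

With 100,000 requests at a hypothetical $0.002 each, a 50% hit rate brings the estimate from about $200 down to about $100. Real traffic has uneven request costs, so treat this as a first approximation.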

Rate Limiting Per API Key

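Attach a policy header to cap usage for each user:

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "..."}],
    extra_headers={
        "Helicone-User-Id": "user-456",
        # 100 requests per 24 hours (86,400 seconds), counted per user
        "Helicone-RateLimit-Policy": "100;w=86400;u=request;s=user",
    },
)

A policy is a quota followed by a window in seconds (w), a unit (u), and a segment (s). Helicone's documentation lists request and cents as units, so a policy can cap spend as well as call volume. Setting s=user scopes the limit per user; omitting it applies the limit globally.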
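
To illustrate what "100 requests per 24 hours per user" means in practice, here is a minimal in-memory fixed-window counter. It is one common way to enforce such a limit and is offered only as a sketch: this article does not describe Helicone's actual algorithm, and the class name `FixedWindowLimiter` is hypothetical.

```python
import time


class FixedWindowLimiter:
    """Allow at most `quota` requests per `window_seconds` for each user."""

    def __init__(self, quota, window_seconds, clock=time.monotonic):
        self.quota = quota
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self._windows = {}  # user_id -> (window_start, count)

    def allow(self, user_id):
        now = self.clock()
        start, count = self._windows.get(user_id, (now, 0))
        if now - start >= self.window_seconds:
            start, count = now, 0  # window expired: start a new one
        if count >= self.quota:
            self._windows[user_id] = (start, count)
            return False
        self._windows[user_id] = (start, count + 1)
        return True
```

A limiter built as `FixedWindowLimiter(100, 86400)` would mirror the policy above: each user gets 100 allowed calls, after which `allow` returns False until that user's window expires.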

Custom Properties for Filtering

Tag requests with arbitrary metadata to slice your dashboard:

extra_headers={
    "Helicone-Property-Environment": "production",
    "Helicone-Property-Feature": "chat-assistant",
    "Helicone-Property-OrgId": "org-123",
}

Filter the dashboard by any property combination to see cost and latency by feature, environment, or customer segment.
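
Since each property is just a header with the `Helicone-Property-` prefix, the headers can be generated from a plain dictionary. A small sketch, assuming a hypothetical helper named `property_headers`:

```python
def property_headers(properties):
    """Map {"Environment": "production"} to Helicone-Property-* headers.

    Hypothetical helper for illustration; not part of any SDK.
    """
    headers = {}
    for name, value in properties.items():
        # Header names cannot be empty or contain whitespace
        if not name or any(ch.isspace() for ch in name):
            raise ValueError(f"invalid property name: {name!r}")
        headers[f"Helicone-Property-{name}"] = str(value)
    return headers
```

Passing `property_headers({"Environment": "production", "Feature": "chat-assistant"})` as `extra_headers` would tag a request the same way as the literal headers above.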

Dashboard Metrics

The Helicone dashboard shows:

Total cost and tokens over time
Cost per model (compare GPT-4o vs GPT-4o-mini)
Latency percentiles (p50, p95, p99)
Error rate by model and endpoint
Cache hit rate and cost saved
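
For readers less familiar with latency percentiles, the sketch below shows how p50, p95, and p99 can be computed from raw samples using only the standard library. It illustrates the metric itself; how Helicone computes these figures internally is not covered in this article, and the function name is an assumption.

```python
import statistics


def latency_percentiles(latencies_ms):
    """Return p50, p95, and p99 for a list of request latencies."""
    if len(latencies_ms) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

p50 is the median request, while p95 and p99 describe the slow tail that averages tend to hide.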

Works With Any OpenAI-Compatible API

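Change the proxy URL to route other providers through Helicone. Anthropic is not OpenAI-compatible, so its URL goes into the Anthropic SDK; Groq, Together AI, and other OpenAI-compatible endpoints keep the OpenAI client:

# Anthropic via Helicone (set as base_url on the Anthropic client)
base_url = "https://anthropic.helicone.ai"

# Groq via Helicone (set as base_url on the OpenAI client)
base_url = "https://groq.helicone.ai/openai/v1"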
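
When an app talks to more than one provider, the proxy URLs can live in a single lookup table. A minimal sketch using only the URLs quoted in this article; the names `HELICONE_BASE_URLS` and `helicone_base_url` are hypothetical, and other providers are deliberately left out rather than guessed.

```python
# Base URLs exactly as listed in this article
HELICONE_BASE_URLS = {
    "openai": "https://oai.helicone.ai/v1",
    "anthropic": "https://anthropic.helicone.ai",
    "groq": "https://groq.helicone.ai/openai/v1",
}


def helicone_base_url(provider):
    """Look up the Helicone proxy URL for a provider name."""
    try:
        return HELICONE_BASE_URLS[provider.lower()]
    except KeyError:
        raise ValueError(
            f"no Helicone base URL configured for {provider!r}"
        ) from None
```

Failing loudly on an unknown provider avoids silently sending traffic straight to the provider and losing the logging.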

At the time of writing, Helicone has a free tier of 10k requests/month, and paid plans start at $20/month. See Helicone's pricing page for full details.
