The LLM Cost Problem
LLM API costs are invisible until they are not. A single prompt change can double token usage. One user can exhaust your monthly budget in a day. Without visibility, cost optimisation is guesswork. Helicone solves this by acting as a transparent proxy between your application and any LLM API — capturing every request, response, token count, and cost in real time.
One-Line Integration
The entire integration is changing base_url in your OpenAI client:
from openai import OpenAI
client = OpenAI(
api_key="YOUR_OPENAI_KEY",
base_url="https://oai.helicone.ai/v1",
default_headers={
"Helicone-Auth": "Bearer YOUR_HELICONE_KEY",
},
)
# All existing code works unchanged
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": "Explain PagedAttention"}],
)
Every request is now logged in the Helicone dashboard with model, tokens, cost, latency, and user attribution.
User-Level Cost Attribution
Pass a user ID header to see costs broken down per user:
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": "..."}],
extra_headers={"Helicone-User-Id": "user-456"},
)
The dashboard shows cost per user as a histogram — identify heavy users driving disproportionate spend.
Response Caching — Save Up to 80%
Enable semantic or exact caching to avoid re-computing identical or similar requests:
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": "What is PagedAttention?"}],
extra_headers={
"Helicone-Cache-Enabled": "true",
"Helicone-Cache-Bucket-Max-Size": "3", # Up to 3 cached variations
},
)
Cache hits return in under 50ms and cost $0. For apps with repeated queries (FAQs, knowledge bases, templates), cache hit rates of 30–80% are common.
Rate Limiting Per API Key
extra_headers={
"Helicone-User-Id": "user-456",
"Helicone-RateLimit-Policy": "100;w=86400;u=requests;s=user",
# 100 requests per 24 hours per user
}
Rate limit policies support requests, tokens, or cost as the unit, with per-user or global scope.
Custom Properties for Filtering
Tag requests with arbitrary metadata to slice your dashboard:
extra_headers={
"Helicone-Property-Environment": "production",
"Helicone-Property-Feature": "chat-assistant",
"Helicone-Property-OrgId": "org-123",
}
Filter the dashboard by any property combination to see cost and latency by feature, environment, or customer segment.
Dashboard Metrics
The Helicone dashboard shows:
- Total cost and tokens over time
- Cost per model (compare GPT-4o vs GPT-4o-mini)
- Latency percentiles (p50, p95, p99)
- Error rate by model and endpoint
- Cache hit rate and cost saved
Works With Any OpenAI-Compatible API
Change the proxy URL to use Helicone with Anthropic, Groq, Together AI, or any OpenAI-compatible endpoint:
# Anthropic via Helicone
base_url = "https://anthropic.helicone.ai"
# Groq via Helicone
base_url = "https://groq.helicone.ai/openai/v1"
Helicone has a generous free tier (10k requests/month). Paid plans start at $20/month. See pricing for full details.