What an AI Gateway Does
An AI gateway is a proxy layer that sits between your application and one or more LLM providers. It handles:
- Routing — send requests to different models based on rules (cost, capability, latency)
- Fallback — if OpenAI is down, automatically retry on Anthropic
- Caching — return cached responses for identical or semantically similar requests
- Rate limiting — enforce per-user or per-team token budgets
- Cost tracking — log spend per model, per user, per endpoint
- Observability — unified logging across providers
Without a gateway, you write all of this logic in your application code, duplicated across every service that calls an LLM.
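To make the routing and fallback responsibilities concrete, here is a minimal standard-library sketch. The provider functions, the `ProviderError` class, and the length-based routing rule are all hypothetical stand-ins invented for illustration; a real gateway would call provider APIs and use richer rules.

```python
from typing import Callable

# Hypothetical provider callables: each takes a prompt and returns text,
# or raises ProviderError when the provider is unavailable.
Provider = Callable[[str], str]


class ProviderError(Exception):
    """Raised by a provider stub to simulate an outage."""


def call_with_fallback(prompt: str, providers: list[tuple[str, Provider]]) -> tuple[str, str]:
    """Try providers in order; return (name, response) from the first that succeeds."""
    errors = []
    for name, provider in providers:
        try:
            return name, provider(prompt)
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def route(prompt: str, cheap: str, capable: str, max_cheap_chars: int = 200) -> str:
    """Toy routing rule: short prompts go to the cheap model, long ones to the capable one."""
    return cheap if len(prompt) <= max_cheap_chars else capable


def failing_openai(prompt: str) -> str:
    # Simulates the primary provider being down.
    raise ProviderError("503 service unavailable")


def working_anthropic(prompt: str) -> str:
    # Simulates a healthy secondary provider.
    return f"echo: {prompt}"
```

Calling `call_with_fallback("Hello", [("openai", failing_openai), ("anthropic", working_anthropic)])` skips the failing stub and returns the response from the second provider. Centralizing this loop in one place is the main argument for a gateway.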
Architecture Comparison
| Feature | Cloudflare AI Gateway | LiteLLM | Portkey | Custom (Hono/Next.js) |
|---|---|---|---|---|
| Deployment | Managed edge | Self-hosted or cloud | SaaS | Self-hosted |
| Providers supported | Major APIs | 100+ | 100+ | You build it |
| Semantic caching | No | No | Yes | You build it |
| Fallback routing | Basic | Yes | Yes | You build it |
| Analytics | Yes | Basic | Advanced | You build it |
| Pricing | Free tier | Open source / $20+ | Free tier / $49+ | Infrastructure cost only |
| Setup time | 5 minutes | 30 minutes | 10 minutes | Days |
Cloudflare AI Gateway
Cloudflare AI Gateway has a free tier, runs at Cloudflare's edge, and requires no infrastructure to manage. To use it, change the base URL in your existing OpenAI SDK calls:
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    # Point the SDK at your gateway; replace {account_id} and {gateway_name} with your own values
    base_url="https://gateway.ai.cloudflare.com/v1/{account_id}/{gateway_name}/openai",
)

# All requests now flow through the Cloudflare gateway
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)
Features: real-time analytics dashboard, request/response logging, rate limiting by IP or custom header, edge caching for identical requests. The free tier covers most individual and small team use cases.
Limitations: caching is exact-match only (no semantic caching), and provider fallback configuration is limited.
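The exact-match behavior can be illustrated with a small in-memory sketch. This is not Cloudflare's implementation; the `cache_key` function and `ExactMatchCache` class are hypothetical names, and the sketch only assumes that the cache is keyed on the full request body.

```python
import hashlib
import json
from typing import Callable


def cache_key(model: str, messages: list[dict]) -> str:
    """Hash the canonical JSON form of a request so identical requests share a key."""
    canonical = json.dumps(
        {"model": model, "messages": messages},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ExactMatchCache:
    """In-memory stand-in for an edge cache keyed on the exact request body."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get_or_call(self, model: str, messages: list[dict],
                    call: Callable[[str, list[dict]], str]) -> str:
        key = cache_key(model, messages)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # Cache miss: call the (possibly expensive) model and remember the answer.
        self.misses += 1
        response = call(model, messages)
        self._store[key] = response
        return response
```

Because the key is a hash of the whole request, a prompt of "Hello" and a prompt of "Hello!" produce different keys and both miss the cache. That gap is what semantic caching is meant to close.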
LiteLLM
LiteLLM is the open-source choice for teams that need 100+ provider support and want to run everything on their own infrastructure:
import litellm

# Same code works for any provider; swap the model string to change providers
response = litellm.completion(
    model="openai/gpt-4o",  # OpenAI
    # model="anthropic/claude-3-5-sonnet-20241022",  # Anthropic
    # model="together_ai/deepseek-ai/DeepSeek-R1",  # Together
    # model="ollama/llama3.3",  # Local Ollama
    messages=[{"role": "user", "content": "Hello"}],
)
LiteLLM also ships a proxy server that exposes an OpenAI-compatible API in front of all providers:
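pip install 'litellm[proxy]'
litellm --config config.yaml --port 4000
Your application calls http://localhost:4000 using the OpenAI request format, and LiteLLM handles provider routing, fallback, and retry logic. The model list and fallback order (for example, gpt-4o falling back to claude-3-5-sonnet-20241022) live in the proxy's config file, here named config.yaml, rather than in application code.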
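As a rough sketch of what calling the proxy looks like from the application side, the following builds an OpenAI-format request using only the standard library. It assumes a proxy is listening on localhost:4000 and accepts the `/v1/chat/completions` path; the API key is a placeholder, and the helper names are hypothetical.

```python
import json
import urllib.request

# Assumptions: a LiteLLM proxy is listening on localhost:4000 and accepts
# OpenAI-format chat requests; "sk-placeholder" is a hypothetical key.
PROXY_URL = "http://localhost:4000/v1/chat/completions"


def build_chat_request(model: str, prompt: str,
                       api_key: str = "sk-placeholder") -> urllib.request.Request:
    """Build (but do not send) an OpenAI-format chat completion request."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        PROXY_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request; this only succeeds while a proxy is actually running."""
    with urllib.request.urlopen(request, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

In practice you would more likely point the OpenAI SDK's `base_url` at the proxy, as in the Cloudflare example. The raw request is shown here only to make the wire format visible.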
Portkey
Portkey is the enterprise-grade option. Key features that differentiate it:
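- Semantic caching — cache responses to semantically similar (not just identical) requests, using embeddings to match queries. Savings depend on how often queries repeat; reductions of 20-40% in LLM spend are plausible on repetitive workloads, but measure the hit rate on your own traffic.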
- Virtual keys — team members use Portkey virtual keys instead of raw provider API keys, enabling centralized key rotation without updating every service
- Guardrails — content filtering, PII detection, and output validation rules that run on every request/response
- Per-request routing — route different users to different model tiers based on custom attributes
import os

from portkey_ai import Portkey

client = Portkey(
    api_key=os.environ["PORTKEY_API_KEY"],
    config={
        # Try targets in order; move to the next one when a request fails
        "strategy": {"mode": "fallback"},
        "targets": [
            {
                "virtual_key": os.environ["OPENAI_VIRTUAL_KEY"],
                "override_params": {"model": "gpt-4o"},
            },
            {
                "virtual_key": os.environ["ANTHROPIC_VIRTUAL_KEY"],
                "override_params": {"model": "claude-3-5-sonnet-20241022"},
            },
        ],
        # Semantic cache entries expire after one hour (max_age is in seconds)
        "cache": {"mode": "semantic", "max_age": 3600},
    },
)
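The idea behind semantic caching can be sketched without any external service. This is an illustration only, not Portkey's implementation: `toy_embed` is a bag-of-words stand-in for a real embedding model, and the 0.8 similarity threshold is an arbitrary assumption.

```python
import math
import re
from collections import Counter
from typing import Optional


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold
        self._entries: list[tuple[Counter, str]] = []

    def store(self, query: str, response: str) -> None:
        self._entries.append((toy_embed(query), response))

    def lookup(self, query: str) -> Optional[str]:
        """Return the response of the most similar stored query, or None below the threshold."""
        vec = toy_embed(query)
        best_score, best_response = 0.0, None
        for stored_vec, response in self._entries:
            score = cosine(vec, stored_vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None
```

After storing a response for "How do I reset my password?", a lookup for "How can I reset my password?" clears the threshold and returns the cached answer, while an unrelated question returns `None`. The threshold is the main tuning knob: set it too low and users get answers to questions they did not ask.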
Custom Gateway with Hono
For teams that want full control without managing LiteLLM or paying Portkey fees, a custom gateway built on Hono (Cloudflare Workers) is surprisingly lightweight:
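import { Hono } from "hono";
import OpenAI from "openai";

// Bindings configured for the Worker: provider secrets plus an Analytics Engine dataset
type Bindings = {
  OPENAI_KEY: string;
  ANTHROPIC_KEY: string;
  ANALYTICS: AnalyticsEngineDataset;
};

const app = new Hono<{ Bindings: Bindings }>();

app.post("/v1/chat/completions", async (c) => {
  const body = await c.req.json();
  const { model, messages, ...rest } = body;

  // Custom routing logic: Claude models go to Anthropic, everything else to OpenAI
  const provider = model.startsWith("claude") ? "anthropic" : "openai";
  const apiKey = provider === "anthropic" ? c.env.ANTHROPIC_KEY : c.env.OPENAI_KEY;

  const client = new OpenAI({
    apiKey,
    // Anthropic's OpenAI-compatible endpoint; undefined keeps the SDK's OpenAI default
    baseURL: provider === "anthropic" ? "https://api.anthropic.com/v1/" : undefined,
  });

  // Log to your analytics system (writeDataPoint does not need to be awaited)
  c.env.ANALYTICS.writeDataPoint({ indexes: [model], blobs: [JSON.stringify(messages)] });

  // Non-streaming only: a Hono handler must return a Response, so serialize the completion
  const completion = await client.chat.completions.create({ model, messages, ...rest, stream: false });
  return c.json(completion);
});

export default app;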
This runs on Cloudflare Workers (zero cold starts, global edge) and gives you complete control over routing logic. The trade-off is maintenance burden: caching, fallback, rate limiting, and analytics are all yours to build and operate.
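To give a sense of what a custom gateway has to implement on its own, here is a minimal sketch of per-user token budgets and cost tracking. It is written in Python for brevity rather than as Worker code; the class names are hypothetical and the prices in `PRICES` are made-up placeholders, not real provider pricing.

```python
from collections import defaultdict


class TokenBudget:
    """Per-user token budget; a real gateway would also reset usage on a schedule."""

    def __init__(self, limit_per_user: int) -> None:
        self.limit = limit_per_user
        self.used: dict[str, int] = defaultdict(int)

    def try_consume(self, user: str, tokens: int) -> bool:
        """Reserve tokens for a request; return False if it would exceed the budget."""
        if self.used[user] + tokens > self.limit:
            return False
        self.used[user] += tokens
        return True


# Hypothetical prices in USD per 1M tokens as (input, output); not real pricing.
PRICES = {"model-a": (1.0, 2.0), "model-b": (10.0, 30.0)}


class CostTracker:
    """Accumulates spend per (user, model) pair."""

    def __init__(self) -> None:
        self.spend: dict[tuple[str, str], float] = defaultdict(float)

    def record(self, user: str, model: str, input_tokens: int, output_tokens: int) -> float:
        in_price, out_price = PRICES[model]
        cost = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
        self.spend[(user, model)] += cost
        return cost
```

In a Worker, this state would need to live somewhere shared across requests instead of in process memory. That storage and its consistency rules are part of the maintenance burden a managed gateway takes off your hands.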
Which to Choose
- Cloudflare AI Gateway — starting out, need basic analytics and caching, want zero infrastructure
- LiteLLM — open source requirement, 100+ provider support, self-hosted everything
- Portkey — enterprise, need semantic caching and guardrails, willing to pay SaaS fees
- Custom — highest control, existing Cloudflare/Hono expertise, long-term cost optimization at scale