Prompt caching lets you reuse the processed version of a long system prompt across many API requests, instead of paying to re-process the same tokens every time. On Anthropic, cached input tokens are billed at a 90% discount. On OpenAI, cached tokens are billed at a 50% discount, and caching applies to prompts of 1,024 tokens or more. For applications with long, stable system prompts, this is one of the easiest cost reductions available, often requiring fewer than 10 lines of code changes.
Here is how it works on both platforms, when it applies, and the cases where it does not help.
What Prompt Caching Is
When you send a request to an LLM API, the model processes every token in your input from scratch. For a request with a 3,000-token system prompt plus a 50-token user question, you pay for 3,050 input tokens.
If you send 1,000 such requests, you pay for 3,050,000 input tokens. The 3,000-token system prompt is re-processed 1,000 times, even though it is identical every time.
Prompt caching short-circuits this. After the first request, the processed representation of the system prompt is stored in cache. Subsequent requests that include the same prompt prefix pay a heavily discounted rate for the cached portion, and full price only for the new tokens (the unique user message).
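The cache does not persist indefinitely. Anthropic's cache expires after 5 minutes of inactivity, and each cache hit restarts that timer. OpenAI's documentation says cached prefixes are typically cleared after 5 to 10 minutes of inactivity and are always removed within an hour of last use.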
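To see how an inactivity-based expiry interacts with traffic patterns, here is a minimal sketch of a cache whose timer restarts on every use. It is a simplified model for illustration, not either provider's actual eviction logic; the function name and the 300-second default are assumptions chosen to mirror the 5-minute figure above.

```python
def count_cache_writes_and_reads(request_times_s, ttl_s=300.0):
    """Count cache writes and reads for a list of request times in seconds.

    Simplified model (an assumption): the entry lives for ttl_s seconds
    after its most recent use, and a request arriving later rewrites it.
    """
    writes = 0
    reads = 0
    expires_at = None  # no cache entry exists before the first request
    for t in sorted(request_times_s):
        if expires_at is not None and t < expires_at:
            reads += 1   # entry still alive: discounted read
        else:
            writes += 1  # entry missing or expired: pay for a write
        expires_at = t + ttl_s  # every use restarts the inactivity timer
    return writes, reads


# Steady traffic: one request per minute.
steady = count_cache_writes_and_reads([60 * i for i in range(10)])
# Sparse traffic: one request every ten minutes.
sparse = count_cache_writes_and_reads([600 * i for i in range(10)])
print("steady:", steady)
print("sparse:", sparse)
```

Under this model, steady traffic pays for one write and reads the cache from then on, while sparse traffic pays for a write on every request.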
Anthropic Prompt Caching: 90% Discount on Cached Tokens
Anthropic's prompt caching uses explicit cache_control markers to tell the API which portions of the prompt to cache.
Cache pricing (Claude 3.5 Sonnet as example):
- Normal input tokens: $3.00/1M
- Cache write (first request): $3.75/1M (25% more expensive to create the cache)
- Cache read (subsequent requests): $0.30/1M (90% cheaper than normal)
The math: if you cache a 3,000-token system prompt and send 1,000 requests in a 5-minute window:
- Without caching: 3,000 tokens x 1,000 requests x $3.00/1M = $9.00
- With caching: 1 cache write at $3.75/1M ($0.011) + 999 cache reads at $0.30/1M ($0.899) = $0.91
At 1,000 requests on a 3,000-token system prompt, caching saves approximately $8.09 or 90%.
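The arithmetic above can be reproduced with a short helper, which makes it easy to plug in your own prompt size and request count. This is a sketch under the same assumption as the worked example: every request after the first lands inside one unbroken cache window. The function name and parameter names are illustrative, and the default rates are the Claude 3.5 Sonnet prices listed above.

```python
def anthropic_prompt_cost(prompt_tokens, requests,
                          base_per_m=3.00, write_per_m=3.75, read_per_m=0.30):
    """Return (uncached_cost, cached_cost) in dollars for the shared prompt.

    Assumes one cache write followed by (requests - 1) cache reads.
    Only the shared prompt is counted, not the per-request user message.
    """
    if requests < 1:
        raise ValueError("requests must be at least 1")
    uncached = prompt_tokens * requests * base_per_m / 1_000_000
    write_cost = prompt_tokens * write_per_m / 1_000_000
    read_cost = prompt_tokens * (requests - 1) * read_per_m / 1_000_000
    return uncached, write_cost + read_cost


uncached, cached = anthropic_prompt_cost(3_000, 1_000)
savings = uncached - cached
print(f"uncached=${uncached:.2f} cached=${cached:.2f} "
      f"saved=${savings:.2f} ({savings / uncached:.1%})")
```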
Implementation:
import anthropic

client = anthropic.Anthropic()

# Your long, stable system prompt
SYSTEM_PROMPT = """
You are a customer support agent for Acme Corp. Here are our policies:
Return Policy: Customers can return items within 30 days of purchase...
[3000 tokens of product documentation and policies]
"""


def handle_customer_request(user_message: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=512,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},  # Mark for caching
            }
        ],
        messages=[
            {"role": "user", "content": user_message}
        ],
    )
    return response.content[0].text
The cache_control: {"type": "ephemeral"} marker on the system prompt text block tells Anthropic to cache that block after the first request. Subsequent requests with the same marker on the same text will use the cached version.
Important detail: You can have up to 4 cache breakpoints in a single request, each marking a different cacheable section. For RAG applications, you can cache both the system instructions and the retrieved document context separately.
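As an illustration of two breakpoints, the sketch below builds the keyword arguments for a request that caches the system instructions and the document context as separate blocks. The helper name build_rag_request and its parameters are hypothetical; the block structure follows the same cache_control format as the example above. It only constructs the payload, so it runs without an API key.

```python
def build_rag_request(instructions: str, documents: str, question: str) -> dict:
    """Build keyword arguments for client.messages.create with two breakpoints.

    Hypothetical helper: each system block carries its own cache_control
    marker, so the instructions can stay cached even when the documents change.
    """
    return {
        "model": "claude-3-5-sonnet-20241022",
        "max_tokens": 512,
        "system": [
            # Breakpoint 1: instructions that are the same for every request.
            {
                "type": "text",
                "text": instructions,
                "cache_control": {"type": "ephemeral"},
            },
            # Breakpoint 2: document context shared by several questions.
            {
                "type": "text",
                "text": documents,
                "cache_control": {"type": "ephemeral"},
            },
        ],
        # The user question is never marked: it changes on every request.
        "messages": [{"role": "user", "content": question}],
    }


request = build_rag_request("You answer from the documents only.",
                            "[retrieved document text]",
                            "What is the return window?")
print(len(request["system"]), "cacheable system blocks")
```

With a configured client, the payload would be sent as client.messages.create(**request). Put the most stable block first, since a change to an earlier block also invalidates the cached blocks after it.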
How to check if caching is working:
The API response includes usage metadata with cache_creation_input_tokens and cache_read_input_tokens. If cache_read_input_tokens is non-zero on subsequent requests, the cache is active.
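A small helper can turn those two fields into a per-request status for logging. This is a sketch: the function name cache_status is an assumption, and the demo uses SimpleNamespace objects as stand-ins for response.usage so it runs without calling the API.

```python
from types import SimpleNamespace


def cache_status(usage) -> str:
    """Classify a response's usage object as 'hit', 'write', or 'none'.

    Missing or None fields are treated as zero.
    """
    created = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    if read > 0:
        return "hit"    # at least part of the prompt came from cache
    if created > 0:
        return "write"  # this request created the cache entry
    return "none"       # caching did not apply to this request


# Stand-ins for response.usage on a first and a second request.
first = SimpleNamespace(cache_creation_input_tokens=3000, cache_read_input_tokens=0)
second = SimpleNamespace(cache_creation_input_tokens=0, cache_read_input_tokens=3000)
print(cache_status(first), cache_status(second))
```

In real code you would pass response.usage from client.messages.create. A steady stream of "none" on a marked prompt usually means the prompt is too short to cache or is changing between requests.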
OpenAI Prompt Caching: 50% Discount, Automatic
OpenAI's caching is automatic. No code changes are required. For any prompt of 1,024 tokens or more, the cached portion is billed at 50% of the normal input rate, provided the same prefix has been processed recently.
Cache pricing (GPT-4o as example):
- Normal input tokens: $2.50/1M
- Cached input tokens: $1.25/1M (50% discount, applied automatically)
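To estimate what these rates mean for a single request, the sketch below prices the cached and uncached portions separately. The function name and the token counts in the example are assumptions for illustration. The example assumes 2,944 of 3,050 input tokens are cached, because OpenAI documents cache hits in 128-token increments starting at 1,024 tokens, and 2,944 is the largest such value under 3,000.

```python
def openai_input_cost(total_input_tokens, cached_tokens,
                      base_per_m=2.50, cached_per_m=1.25):
    """Return the input cost in dollars for one request.

    cached_tokens is the part of the prompt served from cache; the rest
    is billed at the normal input rate. Defaults are the GPT-4o prices above.
    """
    if not 0 <= cached_tokens <= total_input_tokens:
        raise ValueError("cached_tokens must be between 0 and total_input_tokens")
    uncached_tokens = total_input_tokens - cached_tokens
    return (uncached_tokens * base_per_m + cached_tokens * cached_per_m) / 1_000_000


cold = openai_input_cost(3_050, 0)      # first request, nothing cached yet
warm = openai_input_cost(3_050, 2_944)  # later request with a cached prefix
print(f"cold=${cold:.6f} warm=${warm:.6f} saved={1 - warm / cold:.1%}")
```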
The catch: Caching only applies to prompts of at least 1,024 tokens, and it matches on the prompt prefix, not on arbitrary sections. To maximize caching benefit, put your stable system prompt first and variable content (the user message, dynamic context) at the end. OpenAI's caching uses the prompt prefix as the cache key, so if you put variable content at the beginning, the cache never hits.
Recommended structure for OpenAI prompts to maximize cache hits:
[Position 1: Stable system instructions — 1,000+ tokens]
[Position 2: Stable knowledge base or context — if applicable]
[Position 3: Variable retrieved documents — if RAG]
[Position 4: Dynamic user message — always last]
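The ordering above can be sketched as a message builder that keeps everything stable in the first message and pushes per-request content to the end. The function name build_messages and its parameters are hypothetical; the output is a plain list of role/content dictionaries in the shape chat APIs accept.

```python
def build_messages(system_instructions, knowledge_base, retrieved_docs, user_message):
    """Order prompt parts from most stable to most variable.

    Hypothetical helper: the first message is byte-for-byte identical
    across requests, so it can serve as a shared prefix.
    """
    stable_prefix = system_instructions + "\n\n" + knowledge_base
    variable_suffix = f"Context:\n{retrieved_docs}\n\nQuestion: {user_message}"
    return [
        {"role": "system", "content": stable_prefix},  # positions 1 and 2
        {"role": "user", "content": variable_suffix},  # positions 3 and 4
    ]


a = build_messages("Instructions...", "Knowledge base...", "Doc A", "First question?")
b = build_messages("Instructions...", "Knowledge base...", "Doc B", "Second question?")
print(a[0] == b[0], a[1] == b[1])
```

The two requests share an identical first message and differ only in the last one. Avoid interpolating timestamps, user names, or request IDs into the stable part, since any change there alters the prefix.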
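How to check if caching is working: OpenAI reports the number of cached tokens in the response's usage data (usage.prompt_tokens_details.cached_tokens in the Chat Completions API). A non-zero value means part of the prompt was served from cache, so you can verify hits per request as well as through cost monitoring.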
When Caching Helps Most
Customer support bots with detailed product documentation: A system prompt containing 5,000 tokens of product knowledge, return policies, and support scripts gets cached after the first request. Every subsequent support ticket in the cache window costs 90% less for that portion on Anthropic, or 50% less on OpenAI.
RAG applications with long system instructions: If your RAG system includes a consistent 2,000-token instruction block for how the AI should handle retrieved documents, that instruction block is cacheable even though the retrieved documents change per-request.
Code review tools with large code style guides: A system prompt containing your organization's 4,000-token code style guide and review criteria can be cached, so only the submitted code snippet is billed at full rate.
Document Q&A tools: If users ask multiple questions about the same document in the same session, caching the document content dramatically reduces costs for subsequent questions.
When Caching Does Not Help
Short prompts. OpenAI requires 1,024+ tokens for caching. Anthropic also enforces a minimum cacheable length (1,024 tokens on Claude 3.5 Sonnet, higher on some other models), and shorter prompts are processed without caching even when marked. Above the minimum, savings scale with prompt length, so a prompt that barely qualifies saves very little.
Unique system prompts per request. If every request has a unique system prompt (personalized instructions, dynamic context that changes completely per user), caching provides no benefit because the cache never hits.
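Low request volume. The cache write cost for Anthropic is 25% above normal, and the cache expires after 5 minutes without a hit. If you only send a handful of requests per day to the same endpoint, most of them arrive after the cache has expired, so each one pays the write premium and few or none get the read discount.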
Prompts where dynamic content comes first. For OpenAI, if you put variable content at the beginning of your prompt, the prefix cache key changes every request. Always structure prompts stable-content-first.
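For the low-volume case, the break-even point follows directly from the price ratios. The sketch below expresses costs as multiples of the normal input price, using Anthropic's ratios from earlier (1.25x for a write, 0.10x for a read), and finds the smallest number of requests per cache window at which caching is cheaper than not caching. The function name is an assumption.

```python
def min_requests_to_break_even(write_multiplier=1.25, read_multiplier=0.10):
    """Smallest number of requests in one cache window where caching wins.

    Costs are in units of "one uncached pass over the prompt":
      without caching: n
      with caching:    write_multiplier + read_multiplier * (n - 1)
    """
    if read_multiplier >= 1:
        raise ValueError("reads must be cheaper than normal input to break even")
    n = 1
    while write_multiplier + read_multiplier * (n - 1) >= n:
        n += 1
    return n


print(min_requests_to_break_even())
```

With these ratios, a single request per window costs more with caching (1.25 versus 1.00), while two requests already cost less (1.35 versus 2.00). The practical question is therefore whether a second request reliably arrives before the cache expires.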
Practical Implementation Checklist
- Audit your current system prompts. Identify which ones are long (1,000+ tokens) and stable across requests.
- For Anthropic: add cache_control: {"type": "ephemeral"} to stable system prompt blocks.
- For OpenAI: ensure stable content is at the beginning of your prompts, before any dynamic content.
- Monitor cache_read_input_tokens in Anthropic responses to verify caching is active.
- Track cost before and after to measure actual savings.