Semantic caching stores LLM responses and returns cached answers when a new query is semantically similar to a previously answered one, not just an exact string match. Because many production LLM queries are paraphrases of each other, semantic caching can eliminate 15-40% of API calls in applications with repetitive query patterns. The implementation requires embedding queries, finding nearest neighbors in a vector store, and checking whether similarity exceeds your threshold before making an API call.
Why Exact Match Caching Is Insufficient
Traditional caching returns a stored result when the input exactly matches a previous input. For LLM applications, exact match caching has a near-zero hit rate because users almost never type the same message twice.
"How do I reset my password?" and "What's the process for resetting a password?" are semantically identical but will never match under exact caching. A customer support bot receives hundreds of variations on the same fifteen questions every day. Exact match caching helps none of them.
Semantic caching solves this by measuring semantic similarity rather than string equality. Queries that mean the same thing, even when worded differently, hit the cache and return the stored answer without a new API call.
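To make the gap concrete, here is a minimal sketch of an exact-match cache, assuming a plain dictionary keyed on a lightly normalized query string. The `ExactMatchCache` class and its `normalize` helper are illustrative names, not part of any library. Even with case and whitespace normalization, the paraphrase from the example above still misses.

```python
from typing import Optional


class ExactMatchCache:
    """Illustrative exact-match cache keyed on the normalized query string."""

    def __init__(self):
        self.store = {}

    @staticmethod
    def normalize(query: str) -> str:
        # Lowercase and collapse whitespace. This is about as far as an
        # exact-match cache can go without understanding meaning.
        return " ".join(query.lower().split())

    def get(self, query: str) -> Optional[str]:
        return self.store.get(self.normalize(query))

    def set(self, query: str, response: str) -> None:
        self.store[self.normalize(query)] = response


exact_cache = ExactMatchCache()
exact_cache.set("How do I reset my password?", "Go to Settings > Security.")

# Same words, different casing and spacing: hit.
same_text = exact_cache.get("how do I  reset my password?")
# Same meaning, different words: miss (None).
paraphrase = exact_cache.get("What's the process for resetting a password?")
```

Closing that gap is the job of the embedding step: two differently worded questions can map to nearby vectors even though their strings share few characters.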
How Semantic Caching Works
The process has four steps:
- Receive a new query. The user sends a message.
- Embed the query. Convert the query to a dense vector using an embedding model (text-embedding-3-small, or similar).
- Search the cache. Find the nearest neighbor in your vector store (Redis with vector search, Pinecone, Weaviate, Chroma).
- Check the similarity threshold. If cosine similarity exceeds your threshold (typically 0.92-0.97), return the cached response. Otherwise, call the LLM and store the new response.
```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment


class SemanticCache:
    def __init__(self, similarity_threshold=0.95):
        self.threshold = similarity_threshold
        self.cache = []  # In production: Redis or Pinecone

    def embed(self, text: str) -> list[float]:
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding

    def cosine_similarity(self, a: list[float], b: list[float]) -> float:
        a, b = np.array(a), np.array(b)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def get(self, query: str) -> str | None:
        query_embedding = self.embed(query)
        # Track the nearest neighbor instead of returning the first entry
        # that happens to clear the threshold.
        best_sim, best_response = -1.0, None
        for item in self.cache:
            sim = self.cosine_similarity(query_embedding, item["embedding"])
            if sim > best_sim:
                best_sim, best_response = sim, item["response"]
        return best_response if best_sim >= self.threshold else None

    def set(self, query: str, response: str):
        embedding = self.embed(query)
        self.cache.append({
            "query": query,
            "embedding": embedding,
            "response": response
        })


def query_with_cache(cache: SemanticCache, user_query: str) -> str:
    cached = cache.get(user_query)
    if cached is not None:
        return cached  # Cache hit: no LLM call (the embedding call still happens)

    # Cache miss: call the LLM
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_query}]
    )
    answer = response.choices[0].message.content
    cache.set(user_query, answer)
    return answer
```
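The loop over `self.cache` scores one entry at a time, which gets slow as the cache grows. Below is a rough sketch of the same lookup done with a single matrix-vector product, assuming embeddings are stored as unit-length rows in a NumPy array. `VectorIndex` is a hypothetical stand-in for a real vector store, and the 3-dimensional vectors are toy values rather than real embeddings.

```python
import numpy as np


class VectorIndex:
    """Hypothetical in-memory index; a stand-in for Redis or Pinecone."""

    def __init__(self, dim: int):
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.responses = []

    @staticmethod
    def _unit(v) -> np.ndarray:
        v = np.asarray(v, dtype=np.float32)
        return v / np.linalg.norm(v)

    def add(self, embedding, response: str) -> None:
        # Store unit-length vectors so a dot product equals cosine similarity.
        self.vectors = np.vstack([self.vectors, self._unit(embedding)])
        self.responses.append(response)

    def nearest(self, embedding):
        """Return (similarity, response) for the closest entry, or None if empty."""
        if not self.responses:
            return None
        sims = self.vectors @ self._unit(embedding)  # scores every entry at once
        best = int(np.argmax(sims))
        return float(sims[best]), self.responses[best]


index = VectorIndex(dim=3)
index.add([1.0, 0.0, 0.0], "answer about passwords")
index.add([0.0, 1.0, 0.0], "answer about shipping")

match = index.nearest([0.9, 0.1, 0.0])  # closest to the first entry
```

The caller still compares the returned similarity against its threshold. A dedicated vector store replaces the brute-force product with an approximate nearest-neighbor index, but the interface stays roughly the same.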
Choosing the Right Similarity Threshold
The threshold is the most important tuning parameter. Set it too low and you return wrong cached answers for different-enough questions. Set it too high and you miss valid cache hits.
Typical thresholds:
- 0.97+: Very conservative. Only nearly identical queries hit the cache. Hit rate is low but precision is high.
- 0.93-0.96: Balanced. Works well for customer support and FAQ scenarios.
- 0.90-0.92: Aggressive. Higher hit rate but occasional mismatches for subtly different questions.
The right threshold depends on the consequences of a wrong cache hit. For a customer support bot where a slightly wrong answer is noticed and reported, use 0.95+. For an internal search tool where occasional imprecision is tolerated, 0.92 works.
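One way to choose a threshold empirically is to label a sample of query pairs and sweep candidate values. The sketch below assumes you have already collected pairs of (similarity score, whether the cached answer would have been correct). The `evaluate_threshold` function is an illustrative helper and the data is made up for demonstration, not measured.

```python
def evaluate_threshold(pairs, threshold):
    """pairs: list of (similarity, is_same_intent). Returns (hit_rate, precision)."""
    hits = [same for sim, same in pairs if sim >= threshold]
    hit_rate = len(hits) / len(pairs)
    # Precision: share of cache hits where the cached answer was actually right.
    precision = sum(hits) / len(hits) if hits else 1.0
    return hit_rate, precision


# Made-up labeled pairs for illustration only.
labeled_pairs = [
    (0.98, True), (0.96, True), (0.95, True), (0.94, False),
    (0.93, True), (0.91, False), (0.88, False), (0.80, False),
]

results = {t: evaluate_threshold(labeled_pairs, t) for t in (0.90, 0.93, 0.95, 0.97)}
for threshold, (hit_rate, precision) in results.items():
    print(f"{threshold:.2f}: hit rate {hit_rate:.0%}, precision {precision:.0%}")
```

On this toy data, lowering the threshold from 0.95 to 0.90 doubles the hit rate but lets in wrong answers. Real traffic will have a different curve, so run the sweep on your own labeled sample.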
Libraries That Implement Semantic Caching
Building semantic caching from scratch is straightforward, but several libraries wrap the complexity:
GPTCache (github.com/zilliztech/GPTCache): Open source, integrates with LangChain and LlamaIndex, supports multiple vector backends. Best for teams already using LangChain.
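```python
from gptcache import cache
from gptcache.adapter import openai

# Note: cache.init() with no arguments sets up exact-match caching.
# For similarity matching, pass an embedding function, a vector-backed
# data manager, and a similarity evaluator (see the GPTCache README).
cache.init()
cache.set_openai_key()

# The adapter mirrors the legacy (pre-1.0) OpenAI SDK interface
response = openai.ChatCompletion.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is your return policy?"}]
)
```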
Redis with vector search: Production-grade option. Redis Stack includes vector similarity search. Suitable for high-volume production deployments where you need distributed caching.
Realistic Hit Rates
Hit rates vary significantly by application type:
- Customer support bots: 20-40% hit rate. Users repeatedly ask the same questions about returns, shipping, account issues.
- Internal knowledge bases: 15-30% hit rate. Employees ask variations of similar questions about policies and procedures.
- Code assistant tools: 5-15% hit rate. Code questions are more varied and context-dependent.
- General purpose assistants: 5-10% hit rate. Open-ended queries are more unique.
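For a customer support application handling 50,000 queries per month, a 25% hit rate avoids 12,500 LLM calls. Assuming roughly 500 output tokens per response, that is 6.25M output tokens, or about $3.75/month at $0.60/1M output tokens (GPT-4o-mini). Savings scale linearly with the per-token price: at $10/1M output tokens, roughly GPT-4o's list price, the same traffic saves about $62.50/month. Avoided input tokens add to these figures, while the embedding call made on every query subtracts a small amount.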
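To estimate savings for your own traffic, the arithmetic can be sketched as a small function. The `monthly_savings` helper is illustrative, and the 500-token response length is an assumption. The result is very sensitive to that number, so substitute a measured average.

```python
def monthly_savings(queries_per_month, hit_rate, output_tokens_per_response,
                    price_per_million_output_tokens):
    """Output-token cost avoided by cache hits, in dollars per month."""
    avoided_calls = queries_per_month * hit_rate
    avoided_tokens = avoided_calls * output_tokens_per_response
    return avoided_tokens / 1_000_000 * price_per_million_output_tokens


# 50,000 queries/month, 25% hit rate, $0.60 per 1M output tokens.
# 500 output tokens per response is an assumed figure, not a measurement.
mini_savings = monthly_savings(50_000, 0.25, 500, 0.60)
print(f"${mini_savings:.2f}/month")
```

This counts output tokens only. Input tokens saved on each hit increase the figure, and the per-query embedding cost reduces it slightly.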
Cache Invalidation
LLM responses can become stale. If your product return policy changes, cached answers about returns are now wrong. Implement TTL (time-to-live) for cached entries:
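```python
import time

WEEK_SECONDS = 86400 * 7  # 7 days


def make_entry(query: str, embedding: list[float], response: str,
               ttl_seconds: int = WEEK_SECONDS) -> dict:
    """Build a cache entry stamped with its creation time and TTL."""
    return {
        "query": query,
        "embedding": embedding,
        "response": response,
        "created_at": time.time(),
        "ttl_seconds": ttl_seconds,
    }


def is_fresh(entry: dict) -> bool:
    # An entry is fresh while its age is below its TTL
    return time.time() - entry["created_at"] < entry["ttl_seconds"]
```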
For knowledge-base chatbots, invalidate the full cache when source documents are updated. For general assistants, a 24-48 hour TTL is usually sufficient.
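Flushing the whole cache on every document edit can be more than you need. A finer-grained option, sketched below, tags each entry with the document it was derived from and drops only the affected entries along with anything past its TTL. The `source` field and the `purge` helper are assumptions for illustration; the entries otherwise use the same `created_at` and `ttl_seconds` fields as above.

```python
import time


def purge(entries, now=None, changed_source=None):
    """Keep entries that are within their TTL and not tied to a changed source."""
    now = time.time() if now is None else now
    return [
        e for e in entries
        if now - e["created_at"] < e["ttl_seconds"]
        and (changed_source is None or e.get("source") != changed_source)
    ]


entries = [
    {"response": "30-day returns", "created_at": 1_000, "ttl_seconds": 86_400, "source": "returns.md"},
    {"response": "Ships in 2 days", "created_at": 1_000, "ttl_seconds": 86_400, "source": "shipping.md"},
    {"response": "Old answer", "created_at": 1_000, "ttl_seconds": 60, "source": "shipping.md"},
]

# At t=2000 the 60-second entry has expired, and returns.md was just edited.
remaining = purge(entries, now=2_000, changed_source="returns.md")
```

Only the shipping answer with the long TTL survives. This approach requires knowing which source document each response came from, which is usually available in retrieval-based chatbots but not in general assistants.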