Semantic caching stores LLM responses and returns a cached answer when a new query is semantically similar to a previously answered one, rather than only when the strings match exactly. Because many production LLM queries are paraphrases of each other, semantic caching can eliminate 15-40% of LLM API calls in applications with repetitive query patterns. The implementation requires embedding each query, finding its nearest neighbor in a vector store, and checking whether the similarity meets your threshold before calling the LLM.
Why Exact Match Caching Is Insufficient
Traditional caching returns a stored result when the input exactly matches a previous input. For LLM applications, exact-match caching has a near-zero hit rate because users almost never type the same message twice.
"How do I reset my password?" and "What's the process for resetting a password?" are semantically equivalent but will never match under exact-match caching. A customer support bot receives hundreds of variations on the same fifteen questions every day. Exact-match caching helps with none of them.
Semantic caching solves this by measuring semantic similarity rather than string equality. Queries that mean the same thing, even when worded differently, hit the cache and return the stored answer without a new LLM call.
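To make the gap concrete, here is a minimal sketch of an exact-match cache. The `normalize` helper, the `exact_lookup` function, and the stored answer text are illustrative assumptions, not part of any library. Even with lowercasing and whitespace cleanup, only a query with the same wording hits; the paraphrase misses.

```python
from typing import Optional

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace; about as far as exact-match keys can go."""
    return " ".join(text.lower().split())

# Hypothetical cached entry: normalized query string -> stored answer
exact_cache: dict[str, str] = {
    normalize("How do I reset my password?"): "Go to Settings > Security > Reset password.",
}

def exact_lookup(query: str) -> Optional[str]:
    """Return the stored answer only if the normalized strings are identical."""
    return exact_cache.get(normalize(query))

# Same wording, different casing and spacing: hit
same_wording = exact_lookup("how do I   reset my password?")

# Same meaning, different wording: miss
paraphrase = exact_lookup("What's the process for resetting a password?")

print(same_wording)
print(paraphrase)
```

The second lookup returns `None` even though the stored answer would serve it perfectly well, which is the case semantic caching is meant to cover.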
How Semantic Caching Works
The process has four steps:
- Receive a new query. The user sends a message.
- Embed the query. Convert the query to a dense vector using an embedding model (text-embedding-3-small, or similar).
- Search the cache. Find the nearest neighbor in your vector store (Redis with vector search, Pinecone, Weaviate, Chroma).
- Check the similarity threshold. If the cosine similarity between the query and its nearest neighbor meets or exceeds your threshold (typically 0.92-0.97), return the cached response. Otherwise, call the LLM and store the new response.
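The threshold check in the last step can be illustrated with plain arithmetic. This is a small sketch using made-up 3-dimensional vectors in place of real embeddings (which have far more dimensions); the `THRESHOLD` value and the vector values are assumptions chosen only to show one hit and one miss.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

THRESHOLD = 0.95  # hypothetical value inside the 0.92-0.97 range

# Toy "embeddings" for illustration only
cached_query = [1.0, 0.0, 0.0]
paraphrase = [0.98, 0.2, 0.0]  # points almost the same way as cached_query
unrelated = [0.6, 0.8, 0.0]    # points in a clearly different direction

paraphrase_sim = cosine_similarity(cached_query, paraphrase)
unrelated_sim = cosine_similarity(cached_query, unrelated)

# A hit means: reuse the cached response. A miss means: call the LLM.
print(round(paraphrase_sim, 3), "hit" if paraphrase_sim >= THRESHOLD else "miss")
print(round(unrelated_sim, 3), "hit" if unrelated_sim >= THRESHOLD else "miss")
```

Here the paraphrase vector scores roughly 0.98 and clears the threshold, while the unrelated vector scores roughly 0.6 and falls through to the LLM.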
A minimal in-memory implementation of the four steps:

import numpy as np
from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment


class SemanticCache:
    def __init__(self, similarity_threshold=0.95):
        self.threshold = similarity_threshold
        self.cache = []  # In production: Redis or Pinecone

    def embed(self, text: str) -> list[float]:
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding

    def cosine_similarity(self, a: list[float], b: list[float]) -> float:
        a, b = np.array(a), np.array(b)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def get(self, query: str) -> str | None:
        query_embedding = self.embed(query)
        # Track the most similar entry (the nearest neighbor),
        # not just the first entry that clears the threshold
        best_sim, best_response = -1.0, None
        for item in self.cache:
            sim = self.cosine_similarity(query_embedding, item["embedding"])
            if sim > best_sim:
                best_sim, best_response = sim, item["response"]
        return best_response if best_sim >= self.threshold else None

    def set(self, query: str, response: str):
        # Note: this embeds the query again, so a miss costs two embedding calls
        embedding = self.embed(query)
        self.cache.append({
            "query": query,
            "embedding": embedding,
            "response": response
        })


def query_with_cache(cache: SemanticCache, user_query: str) -> str:
    cached = cache.get(user_query)
    if cached is not None:  # An empty string is still a valid cached response
        return cached  # Cache hit, no LLM call

    # Cache miss: call the LLM and store the answer
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_query}]
    )
    answer = response.choices[0].message.content
    cache.set(user_query, answer)
    return answer
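One way to exercise this flow without network access is to inject the embedder and the LLM as plain functions. The sketch below is an assumption-heavy illustration, not a replacement for the class above: `toy_embed`, `VOCAB`, `fake_llm`, and `InjectedSemanticCache` are hypothetical names, and the keyword-based embedder is only a stand-in for a real embedding model. It also embeds each query once and reuses that vector for both the lookup and the store, whereas the implementation above embeds twice on a miss (once in `get`, once in `set`).

```python
import math
from typing import Callable, Optional

VOCAB = ["reset", "password", "refund", "order"]  # hypothetical toy vocabulary

def toy_embed(text: str) -> list[float]:
    """Stand-in embedder: 1.0 for each vocabulary term that appears in the text."""
    lowered = text.lower()
    return [1.0 if term in lowered else 0.0 for term in VOCAB]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0  # treat zero vectors as dissimilar

class InjectedSemanticCache:
    def __init__(self, embed_fn: Callable[[str], list[float]], threshold: float = 0.95):
        self.embed_fn = embed_fn
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []  # (embedding, response)

    def answer(self, query: str, llm_fn: Callable[[str], str]) -> str:
        vec = self.embed_fn(query)  # embedded once, reused below
        best_sim: float = -1.0
        best_response: Optional[str] = None
        for cached_vec, cached_response in self.entries:
            sim = cosine(vec, cached_vec)
            if sim > best_sim:
                best_sim, best_response = sim, cached_response
        if best_response is not None and best_sim >= self.threshold:
            return best_response  # cache hit: llm_fn is never called
        response = llm_fn(query)  # cache miss
        self.entries.append((vec, response))
        return response

llm_calls: list[str] = []

def fake_llm(query: str) -> str:
    """Stand-in for the chat completion call; records every invocation."""
    llm_calls.append(query)
    return f"answer #{len(llm_calls)}"

cache = InjectedSemanticCache(toy_embed)
first = cache.answer("How do I reset my password?", fake_llm)
second = cache.answer("What's the process for resetting a password?", fake_llm)
third = cache.answer("Where is the refund for my order?", fake_llm)

print(first, second, third, len(llm_calls))
```

With this toy embedder the two password questions map to the same vector, so the second is served from the cache and only two of the three queries reach `fake_llm`. A real embedding model will give paraphrases similarities below 1.0, which is why the threshold needs tuning against your own traffic.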