Semantic Caching: How to Serve LLM Responses Without Calling the API

Semantic caching stores LLM responses and returns them when a new query is semantically similar to a cached one. In customer support applications, hit rates of 15-40% are realistic.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 17, 2026


Semantic caching stores LLM responses and returns cached answers when a new query is semantically similar to a previously answered one, not just an exact string match. Because many production LLM queries are paraphrases of each other, semantic caching can eliminate 15-40% of API calls in applications with repetitive query patterns. The implementation requires embedding queries, finding nearest neighbors in a vector store, and checking whether similarity exceeds your threshold before making an API call.

Why Exact Match Caching Is Insufficient

Traditional caching returns a stored result when the input exactly matches a previous input. For LLM applications, exact match caching has a near-zero hit rate because users almost never type the same message twice.

"How do I reset my password?" and "What's the process for resetting a password?" are semantically identical but will never match under exact caching. A customer support bot receives hundreds of variations on the same fifteen questions every day. Exact match caching helps none of them.

Semantic caching solves this by measuring semantic similarity rather than string equality. Queries that mean the same thing, even when worded differently, hit the cache and return the stored answer without a new API call.

How Semantic Caching Works

The process has four steps:

1. Receive a new query. The user sends a message.
2. Embed the query. Convert the query to a dense vector using an embedding model (text-embedding-3-small, or similar).
3. Search the cache. Find the nearest neighbor in your vector store (Redis with vector search, Pinecone, Weaviate, Chroma).
4. Check the similarity threshold. If cosine similarity exceeds your threshold (typically 0.92-0.97), return the cached response. Otherwise, call the LLM and store the new response.

import numpy as np
from openai import OpenAI

client = OpenAI()

class SemanticCache:
    def __init__(self, similarity_threshold=0.95):
        self.threshold = similarity_threshold
        self.cache = []  # In production: Redis or Pinecone

    def embed(self, text: str) -> list[float]:
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding

    def cosine_similarity(self, a: list[float], b: list[float]) -> float:
        a, b = np.array(a), np.array(b)
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

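    def get(self, query: str, query_embedding: list[float] | None = None) -> str | None:
        # Embed only if the caller did not pass a precomputed embedding
        if query_embedding is None:
            query_embedding = self.embed(query)
        # Track the single most similar entry, not the first one over the threshold
        best_sim, best_response = -1.0, None
        for item in self.cache:
            sim = self.cosine_similarity(query_embedding, item["embedding"])
            if sim > best_sim:
                best_sim, best_response = sim, item["response"]
        return best_response if best_sim >= self.threshold else None

    def set(self, query: str, response: str, embedding: list[float] | None = None):
        # Reuse the lookup embedding when provided to avoid a second embeddings call
        if embedding is None:
            embedding = self.embed(query)
        self.cache.append({
            "query": query,
            "embedding": embedding,
            "response": response
        })

def query_with_cache(cache: SemanticCache, user_query: str) -> str:
    # Embed once and reuse the vector for both lookup and storage
    query_embedding = cache.embed(user_query)
    cached = cache.get(user_query, query_embedding)
    if cached is not None:
        return cached  # Cache hit, no LLM call

    # Cache miss: call LLM
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_query}]
    )
    answer = response.choices[0].message.content
    cache.set(user_query, answer, query_embedding)
    return answer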
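
The linear scan above calls `cosine_similarity` once per cached entry. For an in-memory cache of a few thousand entries, a vectorized lookup is a reasonable intermediate step before moving to a dedicated vector store. The sketch below assumes the cached embeddings are stacked into a NumPy matrix; `nearest_neighbor` is a hypothetical helper, and the toy vectors stand in for real embeddings.

```python
import numpy as np


def nearest_neighbor(query_vec, cached_vecs):
    """Return (index, cosine similarity) of the closest cached embedding.

    cached_vecs has shape (n_entries, dim). Returns (None, 0.0) when empty.
    """
    cached = np.asarray(cached_vecs, dtype=float)
    if cached.size == 0:
        return None, 0.0
    query = np.asarray(query_vec, dtype=float)
    # Normalize rows so a plain dot product equals cosine similarity
    cached_unit = cached / np.linalg.norm(cached, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    sims = cached_unit @ query_unit
    best = int(np.argmax(sims))
    return best, float(sims[best])


# Toy 3-dimensional vectors, for illustration only
cached = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0]]
index, similarity = nearest_neighbor([0.5, 0.9, 0.0], cached)
```

The caller still compares `similarity` against the threshold before serving the entry at `index`.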

Choosing the Right Similarity Threshold

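The threshold is the most important tuning parameter. Set it too low and the cache returns stored answers to questions that are actually different. Set it too high and it misses valid hits. Cosine similarity distributions differ between embedding models, so treat the ranges below as starting points and calibrate on your own queries.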

Typical thresholds:

0.97+: Very conservative. Only nearly identical queries hit the cache. Hit rate is low but precision is high.
0.93-0.96: Balanced. Works well for customer support and FAQ scenarios.
0.90-0.92: Aggressive. Higher hit rate but occasional mismatches for subtly different questions.

The right threshold depends on the consequences of a wrong cache hit. For a customer support bot where a slightly wrong answer is noticed and reported, use 0.95+. For an internal search tool where occasional imprecision is tolerated, 0.92 works.
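
One way to choose among these bands is to measure them on your own traffic. The sketch below assumes you have a small hand-labeled sample of (similarity, same_intent) pairs, where the label records whether serving the cached answer would have been correct. The `evaluate_thresholds` helper and the sample values are made up for illustration.

```python
def evaluate_thresholds(labeled_pairs, thresholds):
    """For each threshold, report hit rate and precision on labeled pairs.

    labeled_pairs: list of (cosine_similarity, same_intent) tuples, where
    same_intent is True if serving the cached answer would have been correct.
    """
    results = {}
    for t in thresholds:
        hits = [same for sim, same in labeled_pairs if sim >= t]
        hit_rate = len(hits) / len(labeled_pairs)
        # Precision: share of cache hits that were actually correct to serve
        precision = sum(hits) / len(hits) if hits else None
        results[t] = {"hit_rate": hit_rate, "precision": precision}
    return results


# Made-up sample: similarity to the nearest cached query, plus a human label
sample = [
    (0.98, True), (0.96, True), (0.95, True), (0.94, False),
    (0.93, True), (0.91, False), (0.88, False), (0.80, False),
]
report = evaluate_thresholds(sample, [0.90, 0.95, 0.97])
```

On this toy sample, lowering the threshold from 0.95 to 0.90 doubles the hit rate but lets two wrong answers through. That is the trade-off the bands above describe.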

Libraries That Implement Semantic Caching

Building semantic caching from scratch is straightforward, but several libraries wrap the complexity:

GPTCache (github.com/zilliztech/GPTCache): Open source, integrates with LangChain and LlamaIndex, supports multiple vector backends. Best for teams already using LangChain.

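from gptcache import cache
from gptcache.adapter import openai
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

# cache.init() with no arguments only matches identical prompts; similarity search needs an embedding function and a vector store
onnx = Onnx()
data_manager = get_data_manager(CacheBase("sqlite"), VectorBase("faiss", dimension=onnx.dimension))
cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
)
cache.set_openai_key()

# The adapter mirrors the legacy openai.ChatCompletion interface
response = openai.ChatCompletion.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is your return policy?"}]
)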

Redis with vector search: Production-grade option. Redis Stack includes vector similarity search. Suitable for high-volume production deployments where you need distributed caching.

Realistic Hit Rates

Hit rates vary significantly by application type:

Customer support bots: 20-40% hit rate. Users repeatedly ask the same questions about returns, shipping, account issues.
Internal knowledge bases: 15-30% hit rate. Employees ask variations of similar questions about policies and procedures.
Code assistant tools: 5-15% hit rate. Code questions are more varied and context-dependent.
General purpose assistants: 5-10% hit rate. Open-ended queries are more unique.

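For a customer support application handling 50,000 queries per month, a 25% hit rate avoids 12,500 LLM calls. The dollar value depends on how many tokens each avoided call would have used. At $0.60/1M output tokens (GPT-4o-mini) and an assumed 500 output tokens per response, that is 6.25M output tokens, or about $3.75/month before input-token savings. Long prompts, such as those carrying retrieved context, raise the figure. Savings scale linearly with the per-token price, so the same hit rate is worth proportionally more on a more expensive model.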
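
The estimate depends heavily on how many tokens each avoided call would have consumed. The sketch below spells out the arithmetic. The `monthly_savings` helper is hypothetical, and the tokens-per-response value is an assumption to replace with your own measurement.

```python
def monthly_savings(queries_per_month, hit_rate, avg_output_tokens,
                    output_price_per_million,
                    avg_input_tokens=0, input_price_per_million=0.0):
    """Estimate monthly LLM spend avoided by cache hits, in dollars.

    Ignores the cost of embedding each query, which is paid on hits and
    misses alike.
    """
    avoided_calls = queries_per_month * hit_rate
    output_cost = avoided_calls * avg_output_tokens * output_price_per_million / 1_000_000
    input_cost = avoided_calls * avg_input_tokens * input_price_per_million / 1_000_000
    return output_cost + input_cost


# 50,000 queries/month, 25% hit rate, $0.60 per 1M output tokens (from the text);
# 500 output tokens per response is an assumed figure
savings = monthly_savings(50_000, 0.25, 500, 0.60)
```

Plugging in measured token counts and your model's current prices gives a more defensible number than a rule of thumb.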

Cache Invalidation

LLM responses can become stale. If your product return policy changes, cached answers about returns are now wrong. Implement TTL (time-to-live) for cached entries:

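import time

def make_entry(query: str, embedding: list[float], response: str,
               ttl_seconds: int = 86400 * 7) -> dict:
    # Record the creation time so staleness can be checked at lookup
    return {
        "query": query,
        "embedding": embedding,
        "response": response,
        "created_at": time.time(),
        "ttl_seconds": ttl_seconds,  # default: 7 days
    }

def is_fresh(entry: dict) -> bool:
    # An entry is fresh while its age is below its TTL
    return time.time() - entry["created_at"] < entry["ttl_seconds"]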

For knowledge-base chatbots, invalidate the full cache when source documents are updated. For general assistants, a 24-48 hour TTL is usually sufficient.

Keep Reading

Model Routing Guide — Combine routing with caching for maximum cost reduction.
Prompt Caching: Anthropic and OpenAI Guide — The provider-level caching for static prompt prefixes.
Cutting LLM API Costs: The Complete Guide — Full framework for all LLM cost reduction techniques.
