Semantic Caching: How to Serve LLM Responses Without Calling the API

Semantic caching stores LLM responses and returns them when a new query is semantically similar to a cached one. In customer support applications, hit rates of 20-40% are realistic.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 17, 2026

8 min read

// tags

#semantic-caching #llm-cost #vector-search #ai-optimization

Choosing the Right Similarity Threshold

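The similarity threshold is the most important tuning parameter: a cached response is returned only when the new query's similarity to a cached query meets or exceeds it. Set it too low and you return cached answers to questions that are actually different. Set it too high and you miss valid cache hits.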
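To make the tradeoff concrete, here is a minimal lookup sketch. It assumes cosine similarity as the metric and uses toy 3-dimensional vectors in place of real embeddings; the `lookup` function, the entry layout, and the sample responses are hypothetical, not taken from any library.

```python
import math
from typing import Optional

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def lookup(query_embedding: list[float], entries: list[dict],
           threshold: float = 0.95) -> Optional[str]:
    """Return the response of the most similar cached entry, or None on a miss."""
    best_score, best_response = -1.0, None
    for entry in entries:
        score = cosine_similarity(query_embedding, entry["embedding"])
        if score > best_score:
            best_score, best_response = score, entry["response"]
    # Only a best match at or above the threshold counts as a cache hit.
    return best_response if best_score >= threshold else None

# Toy embeddings (hypothetical values, not real model output).
entries = [
    {"embedding": [1.0, 0.0, 0.0], "response": "Returns are accepted within 30 days."},
    {"embedding": [0.0, 1.0, 0.0], "response": "Shipping takes 3-5 business days."},
]
near_duplicate = [0.99, 0.05, 0.0]       # similarity to the first entry is about 0.999
related_but_different = [0.8, 0.6, 0.0]  # similarity to the first entry is about 0.8

print(lookup(near_duplicate, entries, threshold=0.95))
print(lookup(related_but_different, entries, threshold=0.95))
print(lookup(related_but_different, entries, threshold=0.75))
```

With the threshold at 0.95, the near-duplicate query hits and the merely related query misses. Dropping the threshold to 0.75 turns that miss into a hit, which is exactly the wrong-answer case described above. A linear scan is fine for a sketch; a real deployment would use a vector index.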

Typical thresholds:

0.97+: Very conservative. Only nearly identical queries hit the cache. Hit rate is low but precision is high.
0.93-0.96: Balanced. Works well for customer support and FAQ scenarios.
0.90-0.92: Aggressive. Higher hit rate but occasional mismatches for subtly different questions.

The right threshold depends on the consequences of a wrong cache hit. For a customer support bot where a slightly wrong answer is noticed and reported, use 0.95+. For an internal search tool where occasional imprecision is tolerated, 0.92 works.
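One way to ground this choice in data is to label a small sample of query pairs as "same intent" or "different intent" and measure hit rate and precision at each candidate threshold. The sketch below assumes you already have a similarity score for each test query against its nearest cached query; the `labeled_pairs` values are made up for illustration.

```python
def evaluate_threshold(pairs: list[tuple[float, bool]],
                       threshold: float) -> tuple[float, float]:
    """Return (hit_rate, precision) for one threshold.

    pairs: (similarity, same_intent) for each test query against its
    nearest cached query.
    """
    hits = [same_intent for similarity, same_intent in pairs if similarity >= threshold]
    hit_rate = len(hits) / len(pairs)
    # With no hits there are no wrong answers, so report precision as 1.0.
    precision = sum(hits) / len(hits) if hits else 1.0
    return hit_rate, precision

# Hypothetical labeled sample: (similarity score, same intent?)
labeled_pairs = [
    (0.98, True), (0.97, True), (0.96, True), (0.95, True),
    (0.94, True), (0.94, False), (0.93, True), (0.92, False),
    (0.91, True), (0.90, False),
]

for threshold in (0.97, 0.95, 0.93, 0.90):
    hit_rate, precision = evaluate_threshold(labeled_pairs, threshold)
    print(f"{threshold:.2f}: hit rate {hit_rate:.0%}, precision {precision:.0%}")
```

On this toy sample, precision holds at 100% down to a threshold of 0.95 and falls below that as the hit rate climbs. Your own data will draw the line somewhere else, which is the reason to measure rather than copy a number.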

Libraries That Implement Semantic Caching

Building semantic caching from scratch is straightforward, but several libraries wrap the complexity:

GPTCache (github.com/zilliztech/GPTCache): Open source, integrates with LangChain and LlamaIndex, supports multiple vector backends. Best for teams already using LangChain.

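from gptcache import cache
from gptcache.adapter import openai
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

# cache.init() with no arguments is an exact-match cache; semantic matching
# needs an embedding function, a vector store, and a similarity evaluator.
onnx = Onnx()  # downloads a small ONNX embedding model on first use
cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=get_data_manager(CacheBase("sqlite"), VectorBase("faiss", dimension=onnx.dimension)),
    similarity_evaluation=SearchDistanceEvaluation(),
)
cache.set_openai_key()  # reads OPENAI_API_KEY from the environment

# Drop-in replacement for the (pre-1.0) openai client interface
response = openai.ChatCompletion.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is your return policy?"}]
)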

Redis with vector search: Production-grade option. Redis Stack includes vector similarity search. Suitable for high-volume production deployments where you need distributed caching.

Realistic Hit Rates

Hit rates vary significantly by application type:

Customer support bots: 20-40% hit rate. Users repeatedly ask the same questions about returns, shipping, account issues.
Internal knowledge bases: 15-30% hit rate. Employees ask variations of similar questions about policies and procedures.
Code assistant tools: 5-15% hit rate. Code questions are more varied and context-dependent.
General purpose assistants: 5-10% hit rate. Open-ended queries are more unique.

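For a customer support application handling 50,000 queries per month, a 25% hit rate avoids 12,500 API calls. The dollar savings depend on how many tokens each avoided call would have consumed. At $0.60/1M output tokens (GPT-4o-mini) and an assumed 500 output tokens per response, that is about 6.25M output tokens, or roughly $3.75/month in output cost, plus the input tokens and latency you no longer pay for. Savings scale linearly with model price: at GPT-4o prices (assuming $10/1M output tokens), the same traffic saves roughly $62.50/month in output cost.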
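Savings estimates like this are very sensitive to how many tokens each avoided call would have used, so it is worth plugging in your own numbers. The sketch below counts output-token cost only and ignores input tokens and the cost of embedding each query for the cache lookup. The function name is hypothetical, and the 500 tokens per response is an assumed figure, not a measurement.

```python
def monthly_output_savings(queries_per_month: int, hit_rate: float,
                           output_tokens_per_call: int,
                           output_price_per_million: float) -> float:
    """Estimated monthly output-token spend avoided by cache hits, in dollars."""
    avoided_calls = queries_per_month * hit_rate
    avoided_tokens = avoided_calls * output_tokens_per_call
    return avoided_tokens / 1_000_000 * output_price_per_million

# 50,000 queries/month and a 25% hit rate, as in the example above;
# 500 output tokens per response is an assumption for illustration.
savings = monthly_output_savings(
    queries_per_month=50_000,
    hit_rate=0.25,
    output_tokens_per_call=500,
    output_price_per_million=0.60,
)
print(f"${savings:.2f}/month")
```

Because the formula is linear in every input, doubling the response length or the price per token doubles the estimate.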

Cache Invalidation

LLM responses can become stale. If your product return policy changes, cached answers about returns are now wrong. Implement TTL (time-to-live) for cached entries:

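import time

def make_entry(query: str, embedding: list, response: str,
               ttl_seconds: int = 86400 * 7) -> dict:
    """Build a cache entry stamped with its creation time and TTL (default 7 days)."""
    return {
        "query": query,
        "embedding": embedding,
        "response": response,
        "created_at": time.time(),
        "ttl_seconds": ttl_seconds,
    }

def is_fresh(entry: dict) -> bool:
    """Return True while the entry is younger than its TTL."""
    return time.time() - entry["created_at"] < entry["ttl_seconds"]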

For knowledge-base chatbots, invalidate the full cache when source documents are updated. For general assistants, a 24-48 hour TTL is usually sufficient.
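Invalidating the full cache does not have to mean deleting every entry at once. A minimal sketch, assuming an in-memory store: stamp each entry with a source version and bump the version whenever documents change, so older entries are treated as misses. The `VersionedCache` class is hypothetical and uses exact string keys to keep the focus on invalidation; in a semantic cache, `get` would be a similarity search instead.

```python
import time
from typing import Optional

class VersionedCache:
    """Toy in-memory cache whose entries expire by TTL or by source-version change."""

    def __init__(self, ttl_seconds: float):
        self.ttl_seconds = ttl_seconds
        self.source_version = 0
        self._entries: dict[str, dict] = {}

    def put(self, key: str, response: str) -> None:
        self._entries[key] = {
            "response": response,
            "created_at": time.time(),
            "source_version": self.source_version,
        }

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expired = time.time() - entry["created_at"] >= self.ttl_seconds
        outdated = entry["source_version"] != self.source_version
        if expired or outdated:
            del self._entries[key]  # lazily evict stale entries on read
            return None
        return entry["response"]

    def on_documents_updated(self) -> None:
        """Call when source documents change; every existing entry becomes a miss."""
        self.source_version += 1

kb_cache = VersionedCache(ttl_seconds=86400 * 7)
kb_cache.put("return policy", "Returns are accepted within 30 days.")
before_update = kb_cache.get("return policy")   # served from cache
kb_cache.on_documents_updated()                 # policy document changed
after_update = kb_cache.get("return policy")    # miss: entry is from an old version
```

Bumping a version counter is constant-time no matter how large the cache is, and stale entries are cleaned up the next time they are read.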

Keep Reading

Model Routing Guide - Combine routing with caching for maximum cost reduction.
Prompt Caching: Anthropic and OpenAI Guide - The provider-level caching for static prompt prefixes.
Cutting LLM API Costs: The Complete Guide - Full framework for all LLM cost reduction techniques.

