Semantic search finds results by meaning rather than exact word match. A user searching "cheap laptop" will get results containing "affordable notebook computer" because the meanings are similar even though the words are different. Plain keyword search (Elasticsearch, Postgres full-text search) cannot do this on its own: it matches on the words themselves, so apart from stemming and hand-maintained synonym lists, the query terms have to appear in the document.
This guide walks through building semantic search from architecture to production.
How Semantic Search Works
The core idea is simple: represent both documents and queries as points in a high-dimensional vector space such that semantically similar text is close together. Finding the most relevant documents for a query becomes a nearest-neighbor search problem.
The three-step process:
- Index time: embed every document using an embedding model. Each document becomes a vector of floating-point numbers (typically 384 to 1536 dimensions). Store these vectors in a vector database or a vector-capable relational database.
- Query time: embed the user's query using the same embedding model. The query becomes a vector.
- Search: find the documents whose vectors are closest to the query vector. "Closeness" is typically measured by cosine similarity or dot product.
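To make the search step concrete, here is a minimal sketch of brute-force nearest-neighbor search by cosine similarity. The three-dimensional vectors are made-up stand-ins for real embeddings, and the function names are illustrative rather than part of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_by_cosine(query_vec, doc_vecs, k=2):
    """Return (index, similarity) pairs for the k most similar documents."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy "embeddings" (assumed values; real ones have hundreds of dimensions).
doc_vecs = [
    [0.9, 0.1, 0.0],  # stands in for "affordable notebook computer"
    [0.8, 0.2, 0.1],  # stands in for "budget laptop"
    [0.0, 0.1, 0.9],  # stands in for an unrelated document
]
query_vec = [1.0, 0.0, 0.0]  # stands in for "cheap laptop"

nearest = top_k_by_cosine(query_vec, doc_vecs, k=2)
print(nearest)
```

Scoring every document like this is fine for a few thousand vectors. Vector databases perform the same comparison but use an index so they do not have to score every document on each query.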
The quality of semantic search depends almost entirely on the quality of the embedding model. A good embedding model places "affordable notebook computer" and "cheap laptop" near each other in vector space because they mean the same thing. A poor embedding model does not.
Choosing an Embedding Model
For most use cases, start with one of these:
OpenAI text-embedding-3-small: 1536 dimensions, $0.02 per million tokens, excellent quality, no infrastructure required. Good default if you are already using the OpenAI API.
OpenAI text-embedding-3-large: 3072 dimensions, higher quality than small, about 6.5x the cost ($0.13 per million tokens as of this writing). Worth it for mission-critical search where quality matters more than cost.
sentence-transformers/all-MiniLM-L6-v2: 384 dimensions, runs locally, fast, free. Surprisingly good quality for its size. The right choice if you want fully local search with no API costs.
BAAI/bge-large-en-v1.5: 1024 dimensions, among the strongest open-source models as of this writing, runs locally. Use this if you need maximum open-source quality.
Critically: you must use the same embedding model for both indexing and querying. If you index with OpenAI embeddings and query with a different model, the vectors are in incompatible spaces and search will be random noise.
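One way to guard against mixing models is to store the model name and dimension alongside the index and check them before every query. The sketch below assumes a hypothetical `EmbeddingSpec` record and `assert_same_space` helper; neither comes from a library.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EmbeddingSpec:
    """Hypothetical record of which model produced a set of vectors."""
    model_name: str
    dimensions: int

def assert_same_space(index_spec, query_spec):
    """Raise if the query vectors come from a different model than the index."""
    if index_spec != query_spec:
        raise ValueError(
            f"index was built with {index_spec.model_name} "
            f"({index_spec.dimensions} dims) but the query uses "
            f"{query_spec.model_name} ({query_spec.dimensions} dims)"
        )

index_spec = EmbeddingSpec("text-embedding-3-small", 1536)

# Same model on both sides: no error.
assert_same_space(index_spec, EmbeddingSpec("text-embedding-3-small", 1536))

# Different model: the mismatch is caught before any search runs.
try:
    assert_same_space(index_spec, EmbeddingSpec("all-MiniLM-L6-v2", 384))
except ValueError as err:
    mismatch_message = str(err)
    print(mismatch_message)
```

The check compares model names as well as dimensions because two different models can share a dimension count and still produce incompatible vectors.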
Implementation Option 1: pgvector (Postgres)
pgvector adds a vector data type and similarity search operators to Postgres. If you already use Postgres, this is the simplest path.
Install the extension and create a table:
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    embedding VECTOR(1536),
    metadata JSONB
);
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);
The HNSW index makes nearest-neighbor search fast even with millions of rows. Without it, each query does a full table scan.
Insert a document (in Python):
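import openai
import psycopg2

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment

# The connection string is a placeholder; adjust it for your database.
conn = psycopg2.connect("dbname=mydb")
cursor = conn.cursor()

def embed(text):
    """Return the embedding for text as a list of floats."""
    response = client.embeddings.create(input=text, model="text-embedding-3-small")
    return response.data[0].embedding

content = "The new MacBook Pro features the M3 chip."
embedding = embed(content)
cursor.execute(
    "INSERT INTO documents (content, embedding) VALUES (%s, %s)",
    (content, embedding)
)
conn.commit()  # psycopg2 does not autocommit by default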
Search:
query_embedding = embed("best apple laptop")
cursor.execute(
    "SELECT content, 1 - (embedding <=> %s::vector) AS similarity "
    "FROM documents ORDER BY embedding <=> %s::vector LIMIT 10",
    (query_embedding, query_embedding)
)
results = cursor.fetchall()
The <=> operator computes cosine distance (1 - cosine similarity), which is why the query subtracts it from 1 to report a similarity. Ordering by the distance ascending gives you the most similar results first.
Implementation Option 2: ChromaDB (Fully Local)
ChromaDB is an open-source vector database that runs entirely locally with no external dependencies. It handles embedding, storage, and retrieval in one package.
import chromadb
from sentence_transformers import SentenceTransformer

client = chromadb.Client()
collection = client.create_collection("documents")
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "The new MacBook Pro features the M3 chip.",
    "Affordable notebooks for everyday computing tasks.",
    "Budget-friendly laptops for students and remote workers.",
]
embeddings = model.encode(documents).tolist()

collection.add(
    embeddings=embeddings,
    documents=documents,
    ids=["doc1", "doc2", "doc3"]
)

query_embedding = model.encode(["cheap laptop"]).tolist()
results = collection.query(query_embeddings=query_embedding, n_results=5)
ChromaDB can persist to disk (chromadb.PersistentClient(path="./chroma_db")) and scales to millions of documents on a single machine with the right indexing configuration.
Hybrid Search: Combining Semantic and Keyword
Pure semantic search is not always better than keyword search. Exact keyword matches matter for proper nouns (product names, people's names, IDs), technical terms, and queries where the user knows the exact phrasing.
Hybrid search combines both signals. The standard approach is Reciprocal Rank Fusion (RRF):
- Run semantic search, get a ranked list of results.
- Run keyword search (BM25 or Postgres full-text search), get another ranked list.
- Combine the rankings using RRF: for each document, its RRF score is the sum of 1/(rank + k) across all result lists (k is typically 60).
- Re-rank by RRF score.
RRF is remarkably robust. It does not require tuning the relative weights of semantic vs keyword search — the rank combination naturally handles cases where one method has a clear winner.
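The RRF steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming each ranked list is a list of document IDs ordered best first; the function name and the sample IDs are made up for the example.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several best-first rankings into (doc_id, score) pairs, best first."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks start at 1, so the top result contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

semantic_ranking = ["a", "b", "c"]  # hypothetical semantic search results
keyword_ranking = ["b", "d", "a"]   # hypothetical keyword search results

fused = reciprocal_rank_fusion([semantic_ranking, keyword_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
print(fused_ids)
```

In this sketch a document missing from one list simply contributes nothing for that list. Documents that appear in both lists ("a" and "b") end up ahead of documents that appear in only one.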
In Postgres, the hybrid query looks like:
WITH semantic AS (
    SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> ${query_vector}::vector) AS rank
    FROM documents
    ORDER BY embedding <=> ${query_vector}::vector
    LIMIT 50
),
keyword AS (
    SELECT id, ROW_NUMBER() OVER (ORDER BY ts_rank(to_tsvector('english', content), query) DESC) AS rank
    FROM documents, to_tsquery('english', ${keyword_query}) query
    WHERE to_tsvector('english', content) @@ query
    ORDER BY rank  -- without this, LIMIT keeps an arbitrary 50 matches
    LIMIT 50
)
SELECT COALESCE(s.id, k.id) AS id,
       1.0/(60 + COALESCE(s.rank, 1000)) + 1.0/(60 + COALESCE(k.rank, 1000)) AS rrf_score
FROM semantic s FULL OUTER JOIN keyword k ON s.id = k.id
ORDER BY rrf_score DESC
LIMIT 10;
Re-Ranking With a Cross-Encoder
Bi-encoder models (the ones that produce independent embeddings for queries and documents) are fast but sacrifice some precision. A cross-encoder reads the query and document together, producing a relevance score that is more accurate but much slower.
The two-stage approach: use the bi-encoder to retrieve the top 50-100 candidates cheaply, then use the cross-encoder to re-rank just those candidates precisely. This gives you the speed of bi-encoder retrieval with the precision of cross-encoder scoring.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# query: the user's search string
# candidate_documents: document texts returned by the first-stage retrieval
pairs = [(query, doc) for doc in candidate_documents]
scores = reranker.predict(pairs)  # one relevance score per (query, document) pair
ranked = sorted(zip(candidate_documents, scores), key=lambda x: x[1], reverse=True)
top_results = [doc for doc, score in ranked[:10]]
Re-ranking improves relevance noticeably, especially for long documents where the relevant content is buried. The overhead depends on hardware and document length; 50-200ms for 100 candidates is typical, which is acceptable for most applications.
Chunking Documents for Indexing
Embedding models have token limits (typically 512-8,192 tokens). Long documents must be split into chunks before embedding. Chunking strategy affects search quality significantly:
Fixed-size chunking — split every N tokens with overlap. Simple but can split mid-sentence. Use 256-512 token chunks with 50 token overlap.
Sentence-based chunking — split on sentence boundaries. Cleaner but chunks vary in size.
Semantic chunking — split when the topic changes (detected by comparing embeddings of consecutive sentences). Best quality but more complex to implement.
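As an illustration of the fixed-size strategy, here is a minimal sketch of chunking with overlap. It treats whitespace-separated words as tokens, which is an assumption made to keep the example self-contained; a real pipeline would count tokens with the embedding model's own tokenizer. The function name `chunk_tokens` is illustrative.

```python
def chunk_tokens(tokens, chunk_size=256, overlap=50):
    """Split a token list into chunks of chunk_size that share overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a chunk reaches the end, so no tail-only duplicate is emitted.
        if start + chunk_size >= len(tokens):
            break
    return chunks

text = "one two three four five six seven eight nine ten"
words = text.split()  # stand-in for real tokenization
small_chunks = chunk_tokens(words, chunk_size=4, overlap=1)
for chunk in small_chunks:
    print(" ".join(chunk))
```

Each chunk starts with the last word of the previous one, so a sentence cut at a chunk boundary still appears with some of its context on both sides.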
When a chunk matches a query, return the surrounding context (the parent document or adjacent chunks) rather than just the matching chunk. This provides better context for downstream use (especially in RAG systems).
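Returning adjacent chunks can be as simple as slicing around the index of the matching chunk. The sketch below assumes the chunks of one document are kept in order in a list; `expand_with_neighbors` and the sample chunk texts are hypothetical.

```python
def expand_with_neighbors(chunks, hit_index, window=1):
    """Return the matching chunk plus up to window chunks on each side."""
    if not 0 <= hit_index < len(chunks):
        raise IndexError("hit_index is outside the chunk list")
    start = max(0, hit_index - window)
    end = min(len(chunks), hit_index + window + 1)
    return chunks[start:end]

doc_chunks = [
    "Intro to the product line.",
    "Battery life and charging.",
    "Display and keyboard.",
    "Pricing and availability.",
]

# Suppose vector search matched the chunk at index 2.
context = expand_with_neighbors(doc_chunks, hit_index=2, window=1)
print(context)
```

If the chunks were created with overlap, joining neighbors verbatim repeats the overlapping text, so it may be worth trimming the overlap or returning the parent document instead.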