A RAG (Retrieval Augmented Generation) system gives an LLM access to specific documents at query time, instead of relying on knowledge baked into the model's weights during training. The model reads the relevant documents on demand and answers the question based on what it just read. This means your LLM can answer questions about internal documentation, recent events, proprietary data, or any knowledge that was not in its training set, and it can do so accurately because you are providing the source documents rather than relying on memory.
A complete RAG system has five steps: chunking documents into pieces, creating embeddings of those chunks, storing the embeddings in a vector database, retrieving relevant chunks at query time, and evaluating the quality of what you built.
Step 1: Chunk Your Documents
Chunking is breaking your documents into pieces small enough to fit in the retrieval context but large enough to contain complete, useful information.
Why chunk size matters: if chunks are too small, a retrieved chunk may not contain enough context to answer the question. If chunks are too large, each retrieved chunk consumes more of the context window and buries the relevant passage in less relevant text.
A starting point for most text documents: 500 to 1,000 tokens per chunk with 100-token overlap between chunks. The overlap prevents losing information that spans a chunk boundary.
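from langchain.text_splitter import RecursiveCharacterTextSplitter
# Note: newer LangChain releases provide this class in the separate
# langchain_text_splitters package; adjust the import if needed.

splitter = RecursiveCharacterTextSplitter(
    # Sizes are measured in characters by default, not tokens. At a rough
    # 4 characters per English token, 800 characters is closer to 200 tokens.
    chunk_size=800,     # target chunk size in characters
    chunk_overlap=100,  # overlap between chunks, also in characters
    separators=["\n\n", "\n", " ", ""],  # prefer splitting on paragraph, then line, then word
)

with open("your_document.txt", "r", encoding="utf-8") as f:
    text = f.read()

chunks = splitter.split_text(text)
print(f"Created {len(chunks)} chunks")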
For PDFs, use a library like pypdf or pdfplumber to extract text first, then chunk. For structured content like markdown, consider chunking at heading boundaries rather than character count. Matching your chunking strategy to your document structure improves retrieval quality.
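To make the heading-boundary idea concrete, here is a minimal sketch using only the standard library. The function name `chunk_markdown_by_heading` and the returned dictionary shape are assumptions for illustration, not part of any library. It treats every line that starts with one to six `#` characters as a heading, so a `#` comment inside a fenced code block would be misread as a heading; a real pipeline would need to handle that case.

```python
import re

def chunk_markdown_by_heading(markdown_text: str) -> list[dict]:
    """Split markdown into one chunk per heading section.

    Text that appears before the first heading is kept under an empty heading.
    """
    heading_pattern = re.compile(r"^(#{1,6})\s+(.*)$")
    sections = []
    current_heading = ""
    current_lines = []

    for line in markdown_text.splitlines():
        match = heading_pattern.match(line)
        if match:
            # Close the section collected so far before starting a new one
            body = "\n".join(current_lines).strip()
            if current_heading or body:
                sections.append({"heading": current_heading, "text": body})
            current_heading = match.group(2).strip()
            current_lines = []
        else:
            current_lines.append(line)

    # Flush the final section
    body = "\n".join(current_lines).strip()
    if current_heading or body:
        sections.append({"heading": current_heading, "text": body})
    return sections


sample = "# Refunds\nRefunds take 30 days.\n\n## Exceptions\nGift cards are final.\n"
sections = chunk_markdown_by_heading(sample)
for section in sections:
    print(section["heading"], "->", section["text"])
```

Sections that are still too long can be passed through a size-based splitter afterwards, and the heading can be stored as metadata so each chunk keeps its place in the document.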
Step 2: Create Embeddings
An embedding is a numerical vector that represents the semantic meaning of a chunk of text. Two chunks with similar meaning produce similar vectors. This is what enables similarity search: find chunks whose meaning is similar to the query's meaning.
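A small sketch of what "similar vectors" means in practice: cosine similarity compares the direction of two vectors and ignores their length. The three-dimensional vectors below are invented for illustration; real embeddings have hundreds or thousands of dimensions and come from an embedding model.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

# Toy vectors, made up for this example (not produced by a real model)
refund_policy = [0.9, 0.1, 0.0]
return_window = [0.8, 0.2, 0.1]
office_hours = [0.0, 0.2, 0.9]

sim_related = cosine_similarity(refund_policy, return_window)
sim_unrelated = cosine_similarity(refund_policy, office_hours)
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```

The two refund-related vectors point in nearly the same direction and score close to 1, while the unrelated pair scores close to 0. Similarity search ranks chunks by exactly this kind of score against the query vector.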
Embedding model options:
OpenAI text-embedding-3-small: the cheapest good embedding model from OpenAI. Produces 1,536-dimensional vectors. Costs approximately $0.02 per million tokens (May 2026). Best for production applications where you are already using OpenAI.
Sentence Transformers all-MiniLM-L6-v2: free, runs locally, produces 384-dimensional vectors. Quality is lower than text-embedding-3-small but sufficient for many use cases and has zero API cost.
# Option 1: OpenAI embeddings (paid)
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def get_openai_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding

# Option 2: Sentence Transformers (free, local)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def get_local_embedding(text: str) -> list[float]:
    return model.encode(text).tolist()
Create embeddings for all your chunks. This is the most time-consuming part of the initial setup for large document collections.
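Because this step dominates setup time, it usually pays to embed chunks in batches rather than one call per chunk. The sketch below is one way to do that; `embed_in_batches` and `fake_embed_batch` are hypothetical names, and the stand-in embedder only encodes text length so the example runs without any model or API key.

```python
from typing import Callable

def embed_in_batches(
    texts: list[str],
    embed_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = 64,
) -> list[list[float]]:
    """Embed texts in fixed-size batches, preserving input order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    vectors: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        batch_vectors = embed_batch(batch)
        if len(batch_vectors) != len(batch):
            raise ValueError("embed_batch returned a different number of vectors than texts")
        vectors.extend(batch_vectors)
    return vectors


# Stand-in embedder for demonstration only: encodes text length, not meaning
def fake_embed_batch(batch: list[str]) -> list[list[float]]:
    return [[float(len(text)), 1.0] for text in batch]

demo_chunks = ["alpha", "beta", "gamma delta", "epsilon", "zeta"]
demo_embeddings = embed_in_batches(demo_chunks, fake_embed_batch, batch_size=2)
print(len(demo_embeddings))
```

With Sentence Transformers, something like `lambda batch: model.encode(batch).tolist()` could serve as `embed_batch`, since `encode` accepts a list of strings. The OpenAI embeddings endpoint also accepts a list as `input`, so a similar wrapper should work there, subject to the API's per-request limits.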
Step 3: Store in a Vector Database
A vector database stores your embeddings alongside the original text and enables fast similarity search.
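To show what "store and search" amounts to, here is a toy in-memory store that scans every vector on each query. `TinyVectorStore` is a hypothetical class written for this illustration, not a real library. Production vector databases typically replace the linear scan with an approximate nearest-neighbor index so that search stays fast at millions of vectors.

```python
import math
from dataclasses import dataclass, field

def _cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

@dataclass
class TinyVectorStore:
    """Brute-force store: compares the query against every stored vector."""
    ids: list[str] = field(default_factory=list)
    texts: list[str] = field(default_factory=list)
    vectors: list[list[float]] = field(default_factory=list)

    def add(self, chunk_id: str, text: str, vector: list[float]) -> None:
        self.ids.append(chunk_id)
        self.texts.append(text)
        self.vectors.append(vector)

    def query(self, query_vector: list[float], top_k: int = 3) -> list[tuple[str, str, float]]:
        """Return (id, text, cosine distance) for the top_k closest chunks."""
        scored = []
        for chunk_id, text, vector in zip(self.ids, self.texts, self.vectors):
            distance = 1.0 - _cosine_similarity(query_vector, vector)
            scored.append((chunk_id, text, distance))
        scored.sort(key=lambda item: item[2])  # smallest distance first
        return scored[:top_k]


# Toy 2-dimensional vectors, invented for the example
store = TinyVectorStore()
store.add("chunk_0", "Refunds are accepted within 30 days.", [0.9, 0.1])
store.add("chunk_1", "Our office opens at 9am.", [0.1, 0.9])
store.add("chunk_2", "Returns need a receipt.", [0.8, 0.3])

results = store.query([1.0, 0.0], top_k=2)
for chunk_id, text, distance in results:
    print(chunk_id, f"{distance:.3f}", text)
```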
ChromaDB is the simplest option for local development and small projects. It needs no separate service, supports in-memory or file-backed storage, and has a straightforward Python API.
import chromadb

# Create or connect to a persistent database.
# Named chroma_client so it does not shadow the OpenAI `client` from Step 2.
chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection = chroma_client.get_or_create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"},  # use cosine similarity
)

# One embedding per chunk, in the same order as `chunks`
embeddings = [get_openai_embedding(chunk) for chunk in chunks]

# Add chunks with their embeddings
for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
    collection.add(
        documents=[chunk],
        embeddings=[embedding],
        ids=[f"chunk_{i}"],
        metadatas=[{"source": "your_document.txt", "chunk_index": i}],
    )

print(f"Stored {collection.count()} chunks")
For production with larger document sets (millions of chunks), use a managed vector database like Pinecone or Weaviate. For Postgres-based infrastructure, pgvector is a practical option that keeps everything in one database.
Step 4: Retrieve and Generate
At query time: embed the user's question, find the most similar chunks, add them to the prompt, and ask the model to answer.
from openai import OpenAI

client = OpenAI()

def rag_query(question: str, collection, top_k: int = 5) -> str:
    # 1. Embed the question (with the same model used for the documents)
    question_embedding = get_openai_embedding(question)

    # 2. Retrieve the top-k most similar chunks
    results = collection.query(
        query_embeddings=[question_embedding],
        n_results=top_k,
        include=["documents", "metadatas", "distances"],
    )
    retrieved_chunks = results["documents"][0]
    context = "\n---\n".join(retrieved_chunks)

    # 3. Build the prompt with retrieved context
    prompt = f"""Answer the question based only on the following context.
If the context does not contain enough information to answer the question,
say "I don't have enough information to answer this question."

Context:
{context}

Question: {question}

Answer:"""

    # 4. Generate the answer
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # minimize randomness for factual Q&A
    )
    return response.choices[0].message.content

# Usage
answer = rag_query("What are the refund policy terms?", collection)
print(answer)
The "only use the context" instruction is important. Without it, the model may blend retrieved information with its training knowledge, which can produce plausible but inaccurate answers.
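Keeping prompt assembly in its own function makes the grounding instruction easy to test and reuse. The sketch below is one possible arrangement, not the only one: `build_grounded_prompt`, `is_abstention`, and the `max_context_chars` budget are hypothetical names and values chosen for illustration. It numbers each chunk so answers can be traced back to a source, and it exposes the fallback sentence as a constant so the application can detect when the model declined to answer.

```python
ABSTAIN_MESSAGE = "I don't have enough information to answer this question."

def build_grounded_prompt(question: str, chunks: list[str], max_context_chars: int = 6000) -> str:
    """Assemble a context-only prompt, adding chunks in rank order until the budget is used."""
    selected = []
    used = 0
    for index, chunk in enumerate(chunks, start=1):
        entry = f"[{index}] {chunk.strip()}"
        if selected and used + len(entry) > max_context_chars:
            break  # always keep at least one chunk, then stop at the budget
        selected.append(entry)
        used += len(entry)
    context = "\n---\n".join(selected)
    return (
        "Answer the question based only on the following context.\n"
        "If the context does not contain enough information to answer the question,\n"
        f'say "{ABSTAIN_MESSAGE}"\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer:"
    )

def is_abstention(answer: str) -> bool:
    """True when the model used the fallback sentence instead of answering."""
    return ABSTAIN_MESSAGE.lower() in answer.lower()


demo_prompt = build_grounded_prompt(
    "What is the refund window?",
    ["Refunds must be requested within 30 days of purchase.", "Gift cards cannot be refunded."],
)
print(demo_prompt)
```

Counting abstentions over a set of test questions gives a quick signal: a high rate on questions the documents do cover suggests retrieval is not surfacing the right chunks.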
Step 5: Evaluate Quality with RAGAS
Building RAG without evaluation means you do not know if it is working. The RAGAS framework (ragas.io) provides four metrics specifically designed for RAG evaluation.
Faithfulness: does the answer use only information from the retrieved context? (Measures hallucination)
Answer Relevance: does the answer address the question? (Measures response quality)
Context Recall: did retrieval find the chunks that contain the information needed to answer? (Measures retrieval quality)
Context Precision: are the retrieved chunks relevant, or are irrelevant chunks being retrieved? (Measures retrieval precision)
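from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall, context_precision
from datasets import Dataset

# generated_answer: the string returned by your RAG pipeline for the question
# retrieved_chunks: the list of chunk strings that were placed in the prompt
# Column names and the evaluate() interface have changed between RAGAS releases;
# check the documentation for the version you have installed.

# Prepare evaluation dataset
eval_data = {
    "question": ["What is the refund window?"],
    "answer": [generated_answer],
    "contexts": [retrieved_chunks],
    "ground_truth": ["Refunds must be requested within 30 days of purchase."],
}
dataset = Dataset.from_dict(eval_data)

result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_recall, context_precision])
print(result)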
Run RAGAS on 20 to 50 representative questions with known correct answers. As a rule of thumb, a score below 0.7 on any metric points to a specific problem: low context recall means retrieval is missing the chunks it needs, and low faithfulness means the model is hallucinating beyond the retrieved context.
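The threshold check can be automated so a regression shows up as a readable message. This is a sketch under stated assumptions: it takes a plain dictionary of metric names to scores (how you extract those from the RAGAS result object depends on the installed version), and the `DIAGNOSES` table and `diagnose` function are hypothetical helpers. The example scores are invented.

```python
# Hypothetical mapping from metric name to the most likely culprit
DIAGNOSES = {
    "faithfulness": "generation: the answer goes beyond the retrieved context",
    "answer_relevancy": "generation: the answer does not address the question",
    "context_recall": "retrieval: needed chunks are not being found",
    "context_precision": "retrieval: irrelevant chunks are being returned",
}

def diagnose(scores: dict[str, float], threshold: float = 0.7) -> list[str]:
    """List metrics below the threshold, worst first, with a likely cause."""
    findings = []
    for metric, score in sorted(scores.items(), key=lambda item: item[1]):
        if score < threshold:
            hint = DIAGNOSES.get(metric, "no diagnosis recorded for this metric")
            findings.append(f"{metric}={score:.2f} -> {hint}")
    return findings


# Invented scores, for illustration only
example_scores = {
    "faithfulness": 0.91,
    "answer_relevancy": 0.84,
    "context_recall": 0.55,
    "context_precision": 0.68,
}
findings = diagnose(example_scores)
for finding in findings:
    print(finding)
```

In this made-up case both failing metrics point at retrieval, which suggests revisiting chunking or top-k before touching the prompt.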
Common Failure Modes
Bad chunking. Chunks that split mid-sentence or mid-concept. The retrieved chunk contains the beginning of an answer but not the end, producing incomplete or wrong answers. Fix: adjust chunk size and overlap, prefer splitting on natural boundaries.
Too many chunks in context. Retrieving top-15 instead of top-5 fills the context window with marginally relevant material that dilutes the actually relevant content. Fix: start with top-3 to top-5, only increase if recall is genuinely low.
Embedding mismatch. Embedding documents with one model and queries with another. The semantic spaces are different, so similarity search produces poor results. Fix: use the same model for both documents and queries. Always.
No re-ranking. The top-k by embedding similarity may not be the top-k by actual relevance. Fix: add a cross-encoder re-ranker (Cohere Rerank API, or a local cross-encoder model) as a second pass, which can significantly improve precision.
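The re-ranking pass can be sketched as a second scoring step over the retrieved candidates. In the example below, `rerank` and `word_overlap_score` are hypothetical helpers; the word-overlap scorer is a toy stand-in that only exists so the example runs without a model. In a real system, `score_pair` would wrap a cross-encoder or a rerank API that scores each (query, chunk) pair jointly.

```python
from typing import Callable

def rerank(
    query: str,
    candidates: list[str],
    score_pair: Callable[[str, str], float],
    top_n: int = 3,
) -> list[str]:
    """Re-order retrieved candidates by a pairwise relevance score, best first."""
    scored = [(score_pair(query, candidate), candidate) for candidate in candidates]
    # Stable sort: candidates with equal scores keep their retrieval order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in scored[:top_n]]


# Toy scorer for demonstration only: fraction of query words found in the candidate
def word_overlap_score(query: str, candidate: str) -> float:
    query_words = set(query.lower().split())
    candidate_words = set(candidate.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & candidate_words) / len(query_words)


# Candidates listed in the order a first-pass embedding search might return them
retrieved = [
    "Our store window displays change monthly.",
    "The refund window is 30 days for all purchases.",
    "Shipping takes 3 to 5 days.",
]
reranked = rerank("refund window for purchases", retrieved, word_overlap_score, top_n=2)
for chunk in reranked:
    print(chunk)
```

A common pattern is to retrieve a wider set in the first pass (for example 20 candidates), re-rank, and pass only the best few to the model, which keeps recall high without flooding the context.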