Building a RAG System From Scratch: A Complete Implementation Guide

RAG retrieves relevant documents at query time and adds them to the prompt. Five steps: chunk, embed, store, retrieve, evaluate. Here is the complete implementation.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 17, 2026

11 min read

Tags: #rag #vector-database #chromadb #langchain #embeddings


A RAG (Retrieval-Augmented Generation) system gives an LLM access to specific documents at query time, instead of relying on knowledge baked into the model's weights during training. The model reads the relevant documents on demand and answers the question based on what it just read. This means your LLM can answer questions about internal documentation, recent events, proprietary data, or any knowledge that was not in its training set, and it can ground those answers in the source documents you provide rather than relying on memory.

A complete RAG system has five steps: chunking documents into pieces, creating embeddings of those chunks, storing the embeddings in a vector database, retrieving relevant chunks at query time, and evaluating the quality of what you built.

Step 1: Chunk Your Documents

Chunking is breaking your documents into pieces small enough to fit in the retrieval context but large enough to contain complete, useful information.

Why chunk size matters: if chunks are too small, retrieved chunks may not contain enough context to answer a question. If chunks are too large, you retrieve more tokens than the context window can handle and dilute the retrieval with less relevant content.

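A starting point for most text documents: 500 to 1,000 tokens per chunk with a 100-token overlap between chunks. The overlap prevents losing information that spans a chunk boundary. Note that the splitter below measures chunk_size and chunk_overlap in characters by default, not tokens, so scale the numbers accordingly (English text averages very roughly four characters per token).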

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,        # target chunk size in characters
    chunk_overlap=100,     # overlap between chunks
    separators=["\n\n", "\n", " ", ""]  # prefer splitting on paragraph, then line, then word
)

with open("your_document.txt", "r") as f:
    text = f.read()

chunks = splitter.split_text(text)
print(f"Created {len(chunks)} chunks")

For PDFs, use a library like pypdf or pdfplumber to extract text first, then chunk. For structured content like markdown, consider chunking at heading boundaries rather than character count. Matching your chunking strategy to your document structure improves retrieval quality.
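As a rough illustration of chunking at heading boundaries, here is a minimal sketch that splits markdown into one section per heading. The helper name `split_markdown_by_heading` and the sample text are hypothetical, and the sketch assumes ATX-style headings (`#` through `######`); it does not handle `#` lines inside code fences.

```python
import re

def split_markdown_by_heading(markdown_text: str) -> list[dict]:
    """Split markdown into sections, one per heading (hypothetical helper)."""
    heading_pattern = re.compile(r"^(#{1,6})\s+(.*)$")
    sections = []
    current = {"heading": "", "lines": []}

    def has_content(section: dict) -> bool:
        # A section is worth keeping if it has a heading or any non-blank line
        return bool(section["heading"]) or any(l.strip() for l in section["lines"])

    for line in markdown_text.splitlines():
        match = heading_pattern.match(line)
        if match:
            # Close the previous section before starting a new one
            if has_content(current):
                sections.append(current)
            current = {"heading": match.group(2).strip(), "lines": []}
        else:
            current["lines"].append(line)

    if has_content(current):
        sections.append(current)

    return [
        {"heading": s["heading"], "text": "\n".join(s["lines"]).strip()}
        for s in sections
    ]

# Made-up sample document for illustration
sample = "# Refunds\nRequest within 30 days.\n\n## Exceptions\nGift cards are final sale.\n"
sections = split_markdown_by_heading(sample)
for section in sections:
    print(section["heading"], "->", section["text"])
```

Sections that are still too long can be passed through the character-based splitter afterwards, and the heading can be stored as metadata so it travels with each chunk.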

Step 2: Create Embeddings

An embedding is a numerical vector that represents the semantic meaning of a chunk of text. Two chunks with similar meaning produce similar vectors. This is what enables similarity search: find chunks whose meaning is similar to the query's meaning.
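To make "similar vectors" concrete, here is a small sketch of cosine similarity, the measure most commonly used to compare embeddings. The three-dimensional vectors are made up for illustration; real embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for zero vectors, which have no direction
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings (values are invented)
refund_policy = [0.9, 0.1, 0.0]
return_rules = [0.8, 0.2, 0.1]
office_hours = [0.0, 0.1, 0.9]

sim_related = cosine_similarity(refund_policy, return_rules)
sim_unrelated = cosine_similarity(refund_policy, office_hours)
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```

The two refund-related vectors score close to 1, while the unrelated pair scores close to 0. Similarity search ranks chunks by exactly this kind of score against the query vector.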

Embedding model options:

OpenAI text-embedding-3-small: the cheapest good embedding model from OpenAI. Produces 1,536-dimensional vectors. Costs approximately $0.02 per million tokens (May 2026). Best for production applications where you are already using OpenAI.

Sentence Transformers all-MiniLM-L6-v2: free, runs locally, produces 384-dimensional vectors. Quality is lower than text-embedding-3-small but sufficient for many use cases and has zero API cost.

# Option 1: OpenAI embeddings (paid)
from openai import OpenAI

client = OpenAI()

def get_openai_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

# Option 2: Sentence Transformers (free, local)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def get_local_embedding(text: str) -> list[float]:
    return model.encode(text).tolist()

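Create embeddings for all your chunks, keeping the vectors in the same order as the chunks so each one can be stored next to its text in Step 3. For large document collections, this is the most time-consuming part of the initial setup.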
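One way to speed this up is to embed chunks in batches rather than one call per chunk. The sketch below assumes you supply an `embed_batch` function that takes a list of texts and returns one vector per text; both the OpenAI embeddings endpoint and `SentenceTransformer.encode` accept a list of strings, so either could be wrapped this way. The helper names and the batch size of 100 are illustrative choices, not recommendations from either library.

```python
from typing import Callable, Iterator

def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_all(
    chunks: list[str],
    embed_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = 100,
) -> list[list[float]]:
    """Embed every chunk in order, calling embed_batch once per batch."""
    embeddings: list[list[float]] = []
    for batch in batched(chunks, batch_size):
        embeddings.extend(embed_batch(batch))
    if len(embeddings) != len(chunks):
        raise ValueError("embed_batch must return exactly one vector per input text")
    return embeddings

# Stand-in embedder for demonstration only: a 1-dimensional "vector" of text length
def fake_embed_batch(texts: list[str]) -> list[list[float]]:
    return [[float(len(t))] for t in texts]

demo_embeddings = embed_all(["a", "bb", "ccc"], fake_embed_batch, batch_size=2)
print(demo_embeddings)
```

The length check guards against silently misaligned chunks and vectors, which would otherwise surface later as confusing retrieval results.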

Step 3: Store in a Vector Database

A vector database stores your embeddings alongside the original text and enables fast similarity search.

ChromaDB is the simplest option for local development and small projects. It needs no separate service, supports in-memory or file-backed storage, and has a straightforward Python API.

import chromadb

# Create or connect to a persistent database
# (named chroma_client so it does not shadow the OpenAI client from Step 2)
chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection = chroma_client.get_or_create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"}  # use cosine similarity
)

# One vector per chunk, in the same order as `chunks` (see Step 2)
embeddings = [get_openai_embedding(chunk) for chunk in chunks]

# Add chunks with their embeddings
for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
    collection.add(
        documents=[chunk],
        embeddings=[embedding],
        ids=[f"chunk_{i}"],
        metadatas=[{"source": "your_document.txt", "chunk_index": i}]
    )

print(f"Stored {collection.count()} chunks")

For production with larger document sets (millions of chunks), use a managed vector database like Pinecone or Weaviate. For Postgres-based infrastructure, pgvector is a practical option that keeps everything in one database.

Step 4: Retrieve and Generate

At query time: embed the user's question, find the most similar chunks, add them to the prompt, and ask the model to answer.

from openai import OpenAI

client = OpenAI()

def rag_query(question: str, collection, top_k: int = 5) -> str:
    # 1. Embed the question with the same model used to embed the documents
    question_embedding = get_openai_embedding(question)

    # 2. Retrieve the top-k most similar chunks
    results = collection.query(
        query_embeddings=[question_embedding],
        n_results=top_k,
        include=["documents", "metadatas", "distances"]
    )

    retrieved_chunks = results["documents"][0]
    context = "\n\n---\n\n".join(retrieved_chunks)

    # 3. Build the prompt with retrieved context
    prompt = f"""Answer the question based only on the following context.
If the context does not contain enough information to answer the question,
say "I don't have enough information to answer this question."

Context:
{context}

Question: {question}

Answer:"""

    # 4. Generate the answer
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0  # deterministic for factual Q&A
    )

    return response.choices[0].message.content

# Usage
answer = rag_query("What are the refund policy terms?", collection)
print(answer)

The "based only on the following context" instruction is important. Without it, the model may blend retrieved information with its training knowledge, which can produce plausible but inaccurate answers.
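The query above asks Chroma for distances but never uses them. A possible refinement, sketched here, is to drop weak matches before building the prompt so the model is not handed loosely related text. With a distance metric, lower means more similar. The `filter_by_distance` helper and the 0.5 cutoff are assumptions for illustration; a sensible cutoff depends on your embedding model and data. The input dict mirrors the nested-list shape that `collection.query` returns, using invented values.

```python
def filter_by_distance(results: dict, max_distance: float = 0.5) -> list[str]:
    """Keep only the documents whose distance to the query is at most max_distance.

    Expects the result shape used in rag_query: one inner list per query,
    so index [0] selects the results for the single question asked.
    """
    documents = results["documents"][0]
    distances = results["distances"][0]
    return [doc for doc, dist in zip(documents, distances) if dist <= max_distance]

# Invented results for demonstration (not real query output)
fake_results = {
    "documents": [["Refunds must be requested within 30 days.", "The office opens at 9am."]],
    "distances": [[0.18, 0.83]],
}

kept = filter_by_distance(fake_results, max_distance=0.5)
print(kept)
```

If the filter leaves no chunks at all, returning the "not enough information" answer directly avoids a generation call on an empty context.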

Step 5: Evaluate Quality with RAGAS

Building RAG without evaluation means you do not know if it is working. The RAGAS framework (ragas.io) provides metrics designed specifically for RAG evaluation. Four of them cover the basics.

Faithfulness: does the answer use only information from the retrieved context? (Measures hallucination)

Answer Relevance: does the answer address the question? (Measures response quality)

Context Recall: did retrieval find the chunks that contain the information needed to answer? (Measures retrieval quality)

Context Precision: are the retrieved chunks relevant, or are irrelevant chunks being retrieved? (Measures retrieval precision)

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall, context_precision
from datasets import Dataset

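# Prepare evaluation dataset
# generated_answer: the answer string returned by rag_query for this question
# retrieved_chunks: the list of chunk texts retrieved for it (rag_query would
# need to return these alongside the answer)
# Column names can differ between ragas releases; check the docs for your installed version
eval_data = {
    "question": ["What is the refund window?"],
    "answer": [generated_answer],
    "contexts": [retrieved_chunks],
    "ground_truth": ["Refunds must be requested within 30 days of purchase."]
}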

dataset = Dataset.from_dict(eval_data)
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_recall, context_precision])
print(result)

Run RAGAS on 20 to 50 representative questions with known correct answers. As a rough rule of thumb, a score below 0.7 on any metric points to a specific problem: low context recall means retrieval is missing the chunks that hold the answer, and low faithfulness means the model is hallucinating beyond the retrieved context.
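To build intuition for what the two retrieval metrics capture, here is a simplified, hand-labeled analogue using plain set arithmetic. These are not the RAGAS formulas, which rely on an LLM to judge relevance. The sketch assumes you have manually marked which chunk IDs are relevant to a question, and the IDs shown are invented.

```python
def retrieval_precision_recall(
    retrieved_ids: list[str], relevant_ids: set[str]
) -> tuple[float, float]:
    """Set-based precision and recall for one question.

    precision: share of retrieved chunks that are relevant
    recall:    share of relevant chunks that were retrieved
    """
    retrieved = set(retrieved_ids)
    hits = retrieved & relevant_ids
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

# Invented example: 4 chunks retrieved, 3 chunks labeled relevant, 2 in common
precision, recall = retrieval_precision_recall(
    ["chunk_3", "chunk_7", "chunk_12", "chunk_20"],
    {"chunk_3", "chunk_12", "chunk_41"},
)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here half of what was retrieved is useful, and one relevant chunk was missed entirely. Low precision suggests retrieving fewer chunks or re-ranking; low recall suggests revisiting chunking or the embedding model.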

Common Failure Modes

Bad chunking. Chunks that split mid-sentence or mid-concept. The retrieved chunk contains the beginning of an answer but not the end, producing incomplete or wrong answers. Fix: adjust chunk size and overlap, prefer splitting on natural boundaries.

Too many chunks in context. Retrieving top-15 instead of top-5 fills the context window with marginally relevant material that dilutes the actually relevant content. Fix: start with top-3 to top-5, only increase if recall is genuinely low.

Embedding mismatch. Embedding documents with one model and queries with another. The semantic spaces are different, so similarity search produces poor results. Fix: use the same model for both documents and queries. Always.

No re-ranking. The top-k by embedding similarity may not be the top-k by actual relevance. Fix: add a cross-encoder re-ranker (Cohere Rerank API, or a local cross-encoder model) as a second pass, which often improves precision.
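The second-pass idea can be sketched independently of any particular re-ranker: retrieve a generous candidate set, score each (query, chunk) pair with a stronger model, and keep the best few. In the sketch below, `rerank` and `word_overlap_score` are hypothetical helpers, and the word-overlap scorer is only a stand-in so the example runs without a model download. In practice `score_fn` could wrap a cross-encoder, for example the `predict` method of a `sentence_transformers.CrossEncoder`.

```python
from typing import Callable

def rerank(
    query: str,
    chunks: list[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 3,
) -> list[str]:
    """Score every (query, chunk) pair and return the top_n chunks, best first."""
    scored = [(score_fn(query, chunk), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]

def word_overlap_score(query: str, chunk: str) -> float:
    """Toy scorer: fraction of query words that appear in the chunk.

    A stand-in for a real cross-encoder, used here only for demonstration.
    """
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words) if query_words else 0.0

# Invented candidates, as if returned by the first-pass vector search
candidates = [
    "Our office is open Monday to Friday.",
    "The refund policy allows returns within 30 days.",
    "Shipping takes 5 business days.",
]

top = rerank("what is the refund policy", candidates, word_overlap_score, top_n=1)
print(top)
```

A common pattern is to retrieve 20 or so candidates from the vector database and re-rank down to the 3 to 5 that go into the prompt, which keeps the context small without sacrificing recall in the first pass.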

Keep Reading

RAG vs Fine-Tuning: Which One Does Your Application Actually Need? — When RAG is the right choice versus fine-tuning
Vector Databases Explained: What They Are and When to Use Them — Deeper comparison of ChromaDB, Pinecone, Weaviate, and pgvector
LLM Context Management: How to Handle Long Conversations Without Losing Quality — Using RAG for long-term conversation memory

