Building a RAG System With Open Source Tools: A Practical Guide

How to build a retrieval-augmented generation system using Ollama, ChromaDB, and Sentence Transformers. When open source RAG beats paid options.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 17, 2026

9 min read

// tags

#rag #chromadb #ollama #llamaindex #open-source-ai #vector-database

Step 1: Install Dependencies

pip install chromadb sentence-transformers llama-index llama-index-llms-ollama llama-index-embeddings-huggingface llama-index-vector-stores-chroma

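Make sure Ollama is running (ollama serve) and that your chosen model has been pulled. Llama 3.3 is a 70B-parameter model, so it needs a machine with substantial memory; on smaller hardware, pull a smaller model and use its name wherever the code refers to llama3.3: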

ollama pull llama3.3

Step 2: Index Your Documents

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, StorageContext
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore

# Initialize ChromaDB
chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_or_create_collection("my_documents")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

# Load your documents from a folder
documents = SimpleDirectoryReader("./docs").load_data()

# Set up embedding model (runs locally, no API key needed)
embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Set up local LLM via Ollama
llm = Ollama(model="llama3.3", request_timeout=120.0)

# Create and store the index
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    embed_model=embed_model,
    llm=llm
)

print(f"Indexed {len(documents)} documents")

This reads every file in ./docs, splits them into chunks, generates embeddings with Sentence Transformers, and stores them in ChromaDB. The index persists to disk in ./chroma_db/.

Step 3: Query the System

from llama_index.core import VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore

# Reconnect to the persisted ChromaDB collection
chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_collection("my_documents")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

# Use the same embedding model that was used for indexing
embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
llm = Ollama(model="llama3.3", request_timeout=120.0)

# Rebuild the index object directly from the vectors already stored in ChromaDB
index = VectorStoreIndex.from_vector_store(vector_store, embed_model=embed_model)

# Query: pass the local LLM explicitly so answers are generated by Ollama
query_engine = index.as_query_engine(llm=llm, similarity_top_k=3)
response = query_engine.query("What is our policy on remote work?")
print(response)

The query engine embeds your question, retrieves the 3 most relevant document chunks from ChromaDB, and passes them to Ollama along with your question. The LLM answers using the retrieved context.
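To make that last step concrete, here is a minimal sketch of how retrieved chunks and a question might be combined into a single prompt. The template wording, the function name build_prompt, and the sample chunks are illustrative assumptions; this is not the prompt LlamaIndex uses internally.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Combine retrieved chunks and the user question into one prompt string."""
    # Number each chunk so the individual passages stay distinguishable
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below.\n"
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


if __name__ == "__main__":
    # Hypothetical retrieved chunks, for illustration only
    retrieved = [
        "Employees may work remotely up to three days per week.",
        "Remote work requests are approved by the team lead.",
    ]
    print(build_prompt("What is our policy on remote work?", retrieved))
```

Keeping the instruction "use only the context" in the template is one common way to discourage the model from answering from its own training data when retrieval comes back empty.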

How Retrieval Works Under the Hood

When you index a document, it is split into chunks (typically 512-1024 tokens each). Each chunk is converted to a numerical vector (the embedding) by the Sentence Transformers model. This vector captures the semantic meaning of the text.

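When you submit a query, the same embedding model converts your question into a vector. ChromaDB finds the stored document chunks whose vectors are closest to your query vector. Its default metric is squared Euclidean (L2) distance; for length-normalized embeddings this gives the same ranking as cosine similarity, and a collection can also be configured to use cosine distance directly. These chunks are inserted into the LLM's context window as reference material.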
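The nearest-vector lookup can be sketched in plain Python. This is a brute-force illustration with toy vectors and hypothetical chunk ids; ChromaDB itself uses an approximate nearest-neighbor index rather than scanning every vector, so treat this as a model of the idea, not of its implementation.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query: list[float], chunks: dict[str, list[float]], k: int = 3) -> list[tuple[str, float]]:
    """Return the k chunk ids most similar to the query, best first."""
    scored = [(chunk_id, cosine_similarity(query, vec)) for chunk_id, vec in chunks.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy 3-dimensional vectors; all-MiniLM-L6-v2 embeddings have 384 dimensions
    store = {
        "remote-work": [0.9, 0.1, 0.0],
        "expenses": [0.0, 0.8, 0.6],
        "vacation": [0.7, 0.3, 0.1],
    }
    for chunk_id, score in top_k([1.0, 0.0, 0.0], store, k=2):
        print(f"{chunk_id}: {score:.3f}")
```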

The quality of retrieval depends on two things: the quality of the embedding model and the chunking strategy. Better embedding models (like all-mpnet-base-v2 or OpenAI's text-embedding-3-small) find more semantically relevant chunks. Smarter chunking (respecting sentence boundaries, headers, and logical sections) produces chunks that are more coherent when retrieved.
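As one illustration of sentence-aware chunking, the sketch below packs whole sentences into chunks and carries a sentence of overlap across each boundary. The function name, the regex-based sentence split, and the use of word counts as a stand-in for token counts are all simplifying assumptions; a real pipeline would count tokens with the embedding model's tokenizer.

```python
import re


def chunk_by_sentences(text: str, max_words: int = 200, overlap_sentences: int = 1) -> list[str]:
    """Greedily pack whole sentences into chunks of up to max_words words.

    A single sentence longer than max_words becomes its own oversized chunk.
    """
    # Naive sentence split: ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_words = 0
    for sentence in sentences:
        n_words = len(sentence.split())
        if current and current_words + n_words > max_words:
            chunks.append(" ".join(current))
            # Carry the last sentence(s) over so context spans the boundary
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
            current_words = sum(len(s.split()) for s in current)
            # Drop the overlap if it would push the new chunk over the limit
            if current_words + n_words > max_words:
                current, current_words = [], 0
        current.append(sentence)
        current_words += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    sample = "One two three. Four five six. Seven eight nine."
    for chunk in chunk_by_sentences(sample, max_words=6):
        print(repr(chunk))
```

Overlap trades a little index size for robustness: a fact that straddles two chunks still appears whole in at least one of them.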

When Open Source RAG Beats Paid Options

Privacy requirements. If your documents contain sensitive data, health information, legal documents, or trade secrets, sending them to OpenAI or Anthropic for indexing and querying may violate data handling policies. The open source stack keeps everything on your infrastructure.

Cost at scale. OpenAI's file search charges per retrieval. For a team doing 10,000 queries per day against internal documentation, OpenAI Assistants API costs can reach $100-500/month. A self-hosted open source RAG stack at that scale runs on a $50-100/month cloud instance with no per-query cost.
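The per-query arithmetic behind that comparison can be worked through with a few lines of code. The monthly figures are the ranges quoted above; the 30-day month and the helper name cost_per_query are assumptions, and the calculation ignores engineering and maintenance time.

```python
def cost_per_query(monthly_cost_usd: float, queries_per_day: int, days_per_month: int = 30) -> float:
    """Spread a monthly bill evenly over the month's queries."""
    return monthly_cost_usd / (queries_per_day * days_per_month)


if __name__ == "__main__":
    queries_per_day = 10_000
    # Monthly cost ranges quoted in the paragraph above (USD)
    scenarios = {
        "hosted (low)": 100.0,
        "hosted (high)": 500.0,
        "self-hosted (low)": 50.0,
        "self-hosted (high)": 100.0,
    }
    for label, monthly in scenarios.items():
        print(f"{label}: ${cost_per_query(monthly, queries_per_day):.5f} per query")
```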

Document update frequency. When your documents change frequently, you want control over re-indexing. Self-hosted ChromaDB gives you direct control. You can re-index specific documents, delete stale chunks, and update the vector store without going through a third-party API.
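One way to decide what needs re-indexing is to keep a content hash per document and compare it on each run. The sketch below only plans the work; the function names and the hash bookkeeping are hypothetical, and applying the plan (deleting stale chunks and re-embedding changed documents) would go through the vector store's own delete and add calls.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(indexed: dict[str, str], current_docs: dict[str, str]) -> dict[str, list[str]]:
    """Compare stored hashes with the current documents.

    indexed maps document id -> hash recorded at the last indexing run.
    current_docs maps document id -> current text.
    """
    current_hashes = {doc_id: content_hash(text) for doc_id, text in current_docs.items()}
    return {
        # New documents that have never been indexed
        "add": sorted(d for d in current_hashes if d not in indexed),
        # Documents whose text changed since the last run
        "update": sorted(d for d in current_hashes if d in indexed and indexed[d] != current_hashes[d]),
        # Documents that no longer exist and whose chunks are stale
        "delete": sorted(d for d in indexed if d not in current_hashes),
    }


if __name__ == "__main__":
    previous = {"a.md": content_hash("alpha"), "b.md": content_hash("beta"), "c.md": content_hash("gamma")}
    now = {"a.md": "alpha", "b.md": "beta v2", "d.md": "delta"}
    print(plan_reindex(previous, now))
```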

Offline and air-gapped environments. For government, defense, or other security-constrained deployments, the open source stack can run entirely without internet access.

Real Limitations of Open Source RAG

Embedding quality gap. OpenAI's text-embedding-3-small and text-embedding-3-large produce better embeddings than all-MiniLM-L6-v2 for most tasks. Better embeddings mean more relevant chunk retrieval, which means better answers. The quality difference is meaningful for long, complex documents or documents with nuanced semantic relationships. Approximate improvement: OpenAI embeddings reduce retrieval errors by 15-30% compared to all-MiniLM-L6-v2 on MTEB benchmarks (Muennighoff et al., MTEB leaderboard, 2024).

Setup complexity. Getting a production-quality RAG system right takes real effort. Chunking strategy, embedding model selection, retrieval top-k, re-ranking, and prompt engineering all affect quality. Paid services (OpenAI Assistants, Azure AI Search) abstract most of this. Open source requires you to understand and tune each layer.
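Tuning these layers is easier with a number to watch. A simple option, sketched here under the assumption that you have a small hand-labeled set of questions and the chunk ids that answer them, is the hit rate at k: the share of questions for which at least one relevant chunk appears in the top k results. The function name and the data layout are illustrative.

```python
def hit_rate_at_k(retrieved: dict[str, list[str]], relevant: dict[str, set[str]], k: int) -> float:
    """Fraction of queries with at least one relevant chunk among the top k results.

    retrieved maps query -> ranked list of chunk ids returned by the retriever.
    relevant maps query -> set of chunk ids a human marked as correct.
    """
    if not relevant:
        return 0.0
    hits = sum(
        1
        for query, gold in relevant.items()
        if gold & set(retrieved.get(query, [])[:k])
    )
    return hits / len(relevant)


if __name__ == "__main__":
    # Hypothetical retrieval results and labels, for illustration only
    results = {"q1": ["c3", "c1", "c9"], "q2": ["c4", "c5", "c2"]}
    labels = {"q1": {"c1"}, "q2": {"c2"}}
    for k in (1, 2, 3):
        print(f"hit rate @ {k}: {hit_rate_at_k(results, labels, k):.2f}")
```

Re-running the same measurement after changing the chunk size, the embedding model, or top-k shows whether a change actually helped retrieval.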

Latency for large document sets. ChromaDB is built as a lightweight, embedded vector database. For very large document sets (millions of chunks), it can become slow. Production deployments at scale benefit from dedicated vector databases like Qdrant or Weaviate, which are also open source but require more operational overhead.

LLM quality ceiling. The quality of the final answer is limited by the LLM you run locally. For a RAG system over technical documentation, a local 7B model may miss nuances that GPT-4o would catch. Using Ollama with Llama 3.3 70B (if you have the hardware) closes most of this gap.

Keep Reading

Ollama Complete Guide 2026 - Set up the local LLM that powers this stack
Best Local LLM in 2026 - Which model to use as the generator in your RAG pipeline
Cutting LLM API Costs by 50%+ - When to use cloud APIs and how to reduce costs when you do

