A retrieval-augmented generation (RAG) system lets an LLM answer questions about documents it was not trained on by retrieving relevant excerpts at query time. The open source stack for this is mature in 2026: Ollama for the LLM, ChromaDB for the vector store, Sentence Transformers for embeddings, and either LangChain or LlamaIndex for orchestration. The whole stack runs on a laptop, costs nothing per query, and keeps your documents on your own infrastructure.
This guide explains when you need RAG, how to build it with open source tools, and what the real limitations are compared to paid alternatives like OpenAI's file search or Azure AI Search.
What RAG Is and When You Need It
An LLM has a training data cutoff and a finite context window. If you want it to answer questions about your company's internal documentation, your codebase, or a document set that changes frequently, you have two options:
Fine-tuning: Train the model on your documents. Expensive, time-consuming, and produces a model that encodes knowledge statically. When your documents update, you fine-tune again.
RAG: At query time, retrieve the most relevant document chunks from a database and include them in the prompt context. The LLM answers using the retrieved text plus its base training. Documents stay in the retrieval database and update independently from the model.
RAG is almost always the right choice for question-answering over a document set. It is cheaper, faster to update, and more reliable than fine-tuning for factual retrieval.
When RAG is not the answer: If you need the model to learn a new style, domain vocabulary, or task pattern (not just look up facts), fine-tuning may be more appropriate.
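To make the RAG option concrete, here is a minimal sketch of the prompt-assembly step, assuming the relevant chunks have already been retrieved. The function name build_rag_prompt and the instruction wording are illustrative choices for this example, not part of any library.

```python
def build_rag_prompt(question: str, chunks: list[str]) -> str:
    """Combine retrieved chunks and the user's question into one prompt."""
    # Number each chunk so the model (and a human reader) can refer to it.
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_rag_prompt(
    "How many remote days are allowed per week?",
    [
        "Employees may work remotely up to three days per week.",
        "Remote work requests are approved by the team lead.",
    ],
)
print(prompt)
```

Updating the system's knowledge then means changing the list of chunks that gets retrieved, not retraining the model.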
The Open Source RAG Stack
Ollama — runs the LLM locally (Llama 3.3, Mistral, Qwen 2.5, etc.)
ChromaDB — open source vector database that stores your document embeddings and handles similarity search
Sentence Transformers — Python library that generates text embeddings from your documents. The all-MiniLM-L6-v2 model is the standard starting point: small (80MB), fast, and good enough for most use cases.
LangChain or LlamaIndex — orchestration frameworks that connect the pieces. LangChain has more ecosystem integrations; LlamaIndex has better built-in RAG primitives. Both work. I use LlamaIndex for RAG-specific projects.
Step 1: Install Dependencies
pip install chromadb sentence-transformers llama-index llama-index-llms-ollama llama-index-embeddings-huggingface llama-index-vector-stores-chroma
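Make sure Ollama is running (ollama serve) with your chosen model pulled. Note that llama3.3 is a 70B-parameter model and a download of roughly 40 GB; on a laptop, pull a smaller model such as llama3.2 or mistral instead and use that model name in the code that follows: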
ollama pull llama3.3
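If you would rather verify the Ollama setup from Python, the sketch below asks Ollama's local HTTP API which models have been pulled. It assumes the default address http://localhost:11434 and the /api/tags endpoint; the helper names are made up for this example.

```python
import json
import urllib.error
import urllib.request
from typing import Optional


def parse_model_names(payload: dict) -> list[str]:
    """Extract model names from the JSON body returned by /api/tags."""
    return [m["name"] for m in payload.get("models", [])]


def has_model(names: list[str], wanted: str) -> bool:
    """Loose match: True if `wanted` equals a name or its part before the tag."""
    return any(n == wanted or n.split(":")[0] == wanted for n in names)


def installed_models(base_url: str = "http://localhost:11434") -> Optional[list[str]]:
    """Return the pulled model names, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
            return parse_model_names(json.load(resp))
    except (urllib.error.URLError, OSError):
        return None
```

Calling installed_models() returns None when ollama serve is not running; otherwise pass the result to has_model(names, "llama3.3") before starting a long indexing job.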
Step 2: Index Your Documents
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, StorageContext
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore
# Initialize ChromaDB (persists to ./chroma_db on disk)
chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_or_create_collection("my_documents")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
# Load your documents from a folder
documents = SimpleDirectoryReader("./docs").load_data()
# Set up embedding model (runs locally, no API key needed)
embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
# Create and store the index. Indexing only needs the embedding model;
# the LLM is not called until query time (Step 3).
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    embed_model=embed_model,
)
print(f"Indexed {len(documents)} documents")
This reads every file in ./docs, splits them into chunks, generates embeddings with Sentence Transformers, and stores them in ChromaDB. The index persists to disk in ./chroma_db/.
Step 3: Query the System
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore
# Reconnect to the existing ChromaDB collection
chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_collection("my_documents")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
# Use the same embedding model that built the index
embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
# Set up local LLM via Ollama
llm = Ollama(model="llama3.3", request_timeout=120.0)
# Rebuild the index object on top of the stored vectors (nothing is re-embedded).
# load_index_from_storage does not work here: Step 2 persisted only the vector
# store, so there is no index store for it to load from.
index = VectorStoreIndex.from_vector_store(
    vector_store,
    embed_model=embed_model,
)
# Query. Pass the LLM explicitly; otherwise LlamaIndex falls back to its
# default LLM (OpenAI), which needs an API key.
query_engine = index.as_query_engine(llm=llm, similarity_top_k=3)
response = query_engine.query("What is our policy on remote work?")
print(response)
The query engine embeds your question, retrieves the 3 most relevant document chunks from ChromaDB, and passes them to Ollama along with your question. The LLM answers using the retrieved context.
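When an answer looks wrong, it helps to see which chunks were retrieved. The sketch below summarizes the source_nodes attached to a LlamaIndex response, one line per chunk. The helper name format_sources is hypothetical, and it assumes each chunk carries a file_name metadata entry, which SimpleDirectoryReader normally sets; chunks without one are reported as "unknown".

```python
def format_sources(response, max_chars: int = 120) -> list[str]:
    """Summarize the retrieved chunks attached to a query response."""
    lines = []
    for hit in response.source_nodes:
        # Collapse newlines so each chunk fits on one line of output.
        text = hit.node.get_content().strip().replace("\n", " ")
        name = hit.node.metadata.get("file_name", "unknown")
        score = "n/a" if hit.score is None else f"{hit.score:.3f}"
        lines.append(f"{score}  {name}  {text[:max_chars]}")
    return lines
```

After running a query, print("\n".join(format_sources(response))) shows the similarity score, source file, and opening text of each retrieved chunk.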
How Retrieval Works Under the Hood
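When you index a document, it is split into chunks (typically 512-1024 tokens each). Each chunk is converted to a numerical vector (the embedding) by the Sentence Transformers model. This vector captures the semantic meaning of the text. One caveat: all-MiniLM-L6-v2 truncates input beyond 256 tokens, so with this model any text past that point in a chunk does not contribute to its embedding. Smaller chunks or a longer-context embedding model avoid this.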
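The splitting step can be sketched in a few lines. This is a simplified stand-in for what the framework does, not its actual implementation: it counts whitespace-separated words instead of model tokens, and the overlap between neighbouring chunks is an added assumption (a common way to avoid cutting an idea in half at a boundary).

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into windows of `chunk_size` words that overlap by `overlap`."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(str(i) for i in range(10))
print(chunk_words(sample, chunk_size=4, overlap=1))
```

Each resulting string is what gets passed to the embedding model, one vector per chunk.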
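When you submit a query, the same embedding model converts your question into a vector. ChromaDB finds the stored document chunks whose vectors are closest to your query vector. Cosine similarity is the usual way to think about "closest", though ChromaDB's default distance metric is squared L2 unless you configure the collection for cosine; for unit-length (normalized) embeddings the two rank results identically. These chunks are inserted into the LLM's context window as reference material.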
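The nearest-chunk search can be illustrated with a brute-force version in plain Python. The three-dimensional vectors and chunk ids below are toy values (real embeddings from all-MiniLM-L6-v2 have 384 dimensions), and a vector database uses an approximate index rather than comparing against every stored vector as this sketch does.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 3):
    """Return the k (chunk_id, score) pairs most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), cid) for cid, v in chunk_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(cid, score) for score, cid in scored[:k]]


chunk_vecs = {
    "remote-policy": [0.9, 0.1, 0.0],
    "expense-policy": [0.0, 1.0, 0.0],
    "hybrid-schedule": [0.5, 0.5, 0.0],
}
for chunk_id, score in top_k([1.0, 0.0, 0.0], chunk_vecs, k=2):
    print(f"{chunk_id}: {score:.3f}")
```

The similarity_top_k=3 setting in Step 3 plays the role of k here.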
The quality of retrieval depends on two things: the quality of the embedding model and the chunking strategy. Better embedding models (like all-mpnet-base-v2 or OpenAI's text-embedding-3-small) find more semantically relevant chunks. Smarter chunking (respecting sentence boundaries, headers, and logical sections) produces chunks that are more coherent when retrieved.
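As an illustration of boundary-aware chunking, the sketch below packs whole sentences into chunks up to a word budget instead of cutting at a fixed offset. It is a deliberately naive version: the regular expression treats any ".", "!" or "?" followed by whitespace as a sentence end (so abbreviations will trip it), it counts words rather than tokens, and it ignores headers. The function name and the max_words default are assumptions for this example.

```python
import re


def chunk_by_sentences(text: str, max_words: int = 120) -> list[str]:
    """Pack whole sentences into chunks of at most `max_words` words."""
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk if adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One two three. Four five. Six seven eight nine. Ten."
print(chunk_by_sentences(text, max_words=5))
```

A single sentence longer than max_words is kept whole as its own oversized chunk rather than being split mid-sentence.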
When Open Source RAG Beats Paid Options
Privacy requirements. If your documents contain sensitive data, health information, legal documents, or trade secrets, sending them to OpenAI or Anthropic for indexing and querying may violate data handling policies. The open source stack keeps everything on your infrastructure.
Cost at scale. OpenAI's file search is billed by usage, so the bill grows with query volume. For a team doing 10,000 queries per day (about 300,000 per month) against internal documentation, OpenAI Assistants API costs can reach $100-500/month. A self-hosted open source RAG stack at that scale runs on a $50-100/month cloud instance with no per-query cost.
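The arithmetic behind this comparison is easy to rerun with your own numbers. The sketch below uses only the figures quoted above, plus one assumption: a 30-day month. It treats the hosted cost as purely proportional to query volume, which is a simplification.

```python
QUERIES_PER_DAY = 10_000
DAYS_PER_MONTH = 30  # assumption for the estimate

HOSTED_MONTHLY_USD = (100, 500)      # hosted range quoted above
SELF_HOSTED_MONTHLY_USD = (50, 100)  # cloud instance range quoted above

queries_per_month = QUERIES_PER_DAY * DAYS_PER_MONTH


def per_query_cost(monthly_usd: float, queries: int) -> float:
    """Average cost of one query given a monthly bill."""
    return monthly_usd / queries


def break_even_queries(fixed_monthly_usd: float, hosted_cost_per_query: float) -> float:
    """Monthly query volume above which a fixed-cost server is cheaper."""
    return fixed_monthly_usd / hosted_cost_per_query


low, high = (per_query_cost(c, queries_per_month) for c in HOSTED_MONTHLY_USD)
print(f"{queries_per_month:,} queries/month")
print(f"implied hosted cost per query: ${low:.5f} to ${high:.5f}")
# Best case for self-hosting: cheapest server, most expensive hosted rate.
print(f"break-even (best case): {break_even_queries(SELF_HOSTED_MONTHLY_USD[0], high):,.0f}")
# Worst case for self-hosting: most expensive server, cheapest hosted rate.
print(f"break-even (worst case): {break_even_queries(SELF_HOSTED_MONTHLY_USD[1], low):,.0f}")
```

With these figures, self-hosting breaks even somewhere between 30,000 and 300,000 queries per month, so a team at the quoted volume sits at or above the break-even point across the whole range.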
Document update frequency. When your documents change frequently, you want control over re-indexing. Self-hosted ChromaDB gives you direct control. You can re-index specific documents, delete stale chunks, and update the vector store without going through a third-party API.
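One way to decide what needs re-indexing is to keep a manifest of content hashes from the last indexing run and compare it with the current folder. The sketch below does only that bookkeeping, using the standard library; the function names and the manifest format are assumptions for this example, and it does not touch the vector store itself.

```python
import hashlib
from pathlib import Path


def file_hashes(folder: str) -> dict[str, str]:
    """Map each file's path (relative to `folder`) to the SHA-256 of its contents."""
    root = Path(folder)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_manifest(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Classify files as added, changed, or removed since the last index run."""
    return {
        "added": sorted(k for k in new if k not in old),
        "changed": sorted(k for k in new if k in old and old[k] != new[k]),
        "removed": sorted(k for k in old if k not in new),
    }
```

For files reported as changed or removed, delete their old chunks before re-inserting; with ChromaDB one option is collection.delete with a where filter on whichever metadata field identifies the source file in your collection.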
Offline and air-gapped environments. For government, defense, or other security-constrained deployments, the open source stack can run entirely without internet access.
Real Limitations of Open Source RAG
Embedding quality gap. OpenAI's text-embedding-3-small and text-embedding-3-large produce better embeddings than all-MiniLM-L6-v2 for most tasks. Better embeddings mean more relevant chunks are retrieved, which means better answers. The difference matters most for long, complex documents or documents with nuanced semantic relationships. As a rough figure, OpenAI embeddings reduce retrieval errors by 15-30% compared to all-MiniLM-L6-v2 on MTEB benchmarks (Muennighoff et al., MTEB leaderboard, 2024).
Setup complexity. Getting a production-quality RAG system right takes real effort. Chunking strategy, embedding model selection, retrieval top-k, re-ranking, and prompt engineering all affect quality. Paid services (OpenAI Assistants, Azure AI Search) abstract most of this. Open source requires you to understand and tune each layer.
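Tuning these layers is easier with a number to watch. A simple one is hit rate at k: for a small set of test questions where you know which chunk should be retrieved, what fraction have that chunk in the top k results? The sketch below computes it from plain dictionaries; the question and chunk ids are invented, and in practice the results would come from your own retriever.

```python
def hit_rate_at_k(results: dict[str, list[str]], expected: dict[str, str], k: int) -> float:
    """Fraction of questions whose expected chunk id appears in the top-k results."""
    if not expected:
        return 0.0
    hits = sum(
        1 for question, chunk_id in expected.items()
        if chunk_id in results.get(question, [])[:k]
    )
    return hits / len(expected)


# Ranked chunk ids returned for each test question (hypothetical data).
results = {
    "q1": ["c3", "c1", "c7"],
    "q2": ["c2", "c9", "c4"],
    "q3": ["c8", "c5", "c6"],
}
# The chunk that actually answers each question.
expected = {"q1": "c1", "q2": "c4", "q3": "c0"}

for k in (1, 2, 3):
    print(k, round(hit_rate_at_k(results, expected, k), 3))
```

Rerunning the same test set after changing the chunk size, embedding model, or top-k shows whether the change helped retrieval before you look at any generated answers.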
Latency for large document sets. ChromaDB is built for simplicity and, in the embedded mode used here, runs inside your Python process on a single machine. For very large document sets (millions of chunks), indexing and queries can become slow. Production deployments at scale benefit from dedicated vector databases like Qdrant or Weaviate, which are also open source but require more operational overhead.
LLM quality ceiling. The quality of the final answer is limited by the LLM you run locally. For a RAG system over technical documentation, a local 7B model may miss nuances that GPT-4o would catch. Using Ollama with Llama 3.3 70B (if you have the hardware) closes most of this gap.