Before RAG: The Knowledge Problem
Large language models encode knowledge in their parameters during pretraining, but that knowledge stops at a training cutoff, can surface as confidently stated but false claims (hallucinations), and cannot be updated without expensive retraining. The RAG paper (arXiv:2005.11401) by Patrick Lewis and colleagues at Facebook AI Research (2020) proposed a cleaner solution: keep factual knowledge in an external, searchable corpus (a non-parametric memory) and let the language model retrieve from it at generation time, instead of relying on its parameters alone.
The RAG Architecture
RAG combines two components:
- Retriever: Given a query, find the most relevant documents from a large corpus using dense vector search. The paper uses Dense Passage Retrieval (DPR), which encodes queries and passages with separate BERT models and finds nearest neighbors in embedding space.
- Generator: Given the query and the retrieved documents, generate the answer. The paper uses BART, a sequence-to-sequence Transformer, which conditions its generation on both the query and the evidence passages.
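The query encoder and the generator are fine-tuned jointly end-to-end, while the passage encoder and its index are kept fixed so the corpus does not have to be re-indexed during training. The retrieved document is treated as a latent variable: the loss marginalizes the generator's likelihood over the top-k retrieved documents, weighted by their retrieval probabilities.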
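To make the marginalization concrete, here is a minimal sketch of the loss for a single training example. The retrieval scores and generator log-likelihoods are made-up numbers, and `marginal_nll` is a hypothetical helper name, not something from the paper's code.

```python
import numpy as np

def marginal_nll(doc_scores, seq_log_likelihoods):
    """Return -log sum_z p(z|x) * p(y|x,z) over the retrieved documents."""
    doc_scores = np.asarray(doc_scores, dtype=float)
    seq_ll = np.asarray(seq_log_likelihoods, dtype=float)
    # Retrieval distribution p(z|x): a softmax over the retriever's scores.
    log_p_z = doc_scores - np.logaddexp.reduce(doc_scores)
    # log-sum-exp keeps the sum over documents numerically stable.
    return -np.logaddexp.reduce(log_p_z + seq_ll)

# Hypothetical values for k = 3 retrieved documents.
scores = [2.0, 1.0, 0.1]        # query-passage inner products
log_liks = [-1.2, -4.0, -6.5]   # log p(y | x, z) from the generator
loss = marginal_nll(scores, log_liks)
print(f"marginal NLL: {loss:.4f}")
```

Because the retrieval probabilities appear inside the loss, a gradient-based framework would push the query encoder toward documents under which the generator assigns the target a high likelihood.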
RAG-Sequence vs RAG-Token
The paper introduces two variants:
RAG-Sequence uses the same retrieved document for the entire answer: it scores the complete output sequence under each of the top-k documents and then marginalizes over them. This is conceptually simpler and works well for open-domain QA.
RAG-Token also retrieves once per query, but it marginalizes over the top-k documents at every generated token, so different tokens in one answer can draw on different documents. This is more flexible when an answer combines facts from several sources.
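In the paper's formulation, both variants share one retrieval distribution p(z|x) per query and differ only in where the sum over documents sits: outside the product over tokens (RAG-Sequence) or inside it (RAG-Token). The toy sketch below uses invented probabilities for two documents and a two-token answer; the function names are illustrative only.

```python
import numpy as np

def rag_sequence_prob(p_docs, token_probs):
    # Sum over documents of p(z|x) * product over tokens of p(y_i | x, z, y_<i).
    return float(np.sum(p_docs * np.prod(token_probs, axis=1)))

def rag_token_prob(p_docs, token_probs):
    # Product over tokens of the per-token mixture sum_z p(z|x) * p(y_i | x, z, y_<i).
    return float(np.prod(p_docs @ token_probs))

# Hypothetical setup: rows are documents, columns are token positions.
# Document 0 supports the first token, document 1 supports the second.
p_docs = np.array([0.5, 0.5])
token_probs = np.array([[0.9, 0.1],
                        [0.1, 0.9]])

seq = rag_sequence_prob(p_docs, token_probs)
tok = rag_token_prob(p_docs, token_probs)
print(f"RAG-Sequence: {seq:.2f}")  # 0.09
print(f"RAG-Token:    {tok:.2f}")  # 0.25
```

When no single document supports the whole answer, the per-token mixture assigns it a higher probability, which is the flexibility RAG-Token is designed for.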
Dense Passage Retrieval (DPR)
DPR is a bi-encoder: a BERT-based passage encoder and a separate BERT-based query encoder map text into a shared embedding space. Passage embeddings are computed once, ahead of time, and stored in an index. At query time, the query is encoded and the k passages with the highest inner product are retrieved using FAISS (Facebook's efficient similarity search library). This replaces BM25 sparse retrieval with learned dense representations that capture semantic similarity rather than just keyword overlap.
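The lookup step can be sketched with plain NumPy. The vectors below are hand-written stand-ins for encoder outputs (a real system would produce them with the trained BERT encoders), and the brute-force search is only illustrative; at corpus scale a FAISS index such as `IndexFlatIP` would play this role.

```python
import numpy as np

# Hypothetical 3-dimensional passage embeddings, one row per passage.
passage_embeddings = np.array([
    [0.9, 0.1, 0.0],   # passage 0
    [0.1, 0.9, 0.0],   # passage 1
    [0.0, 0.2, 0.9],   # passage 2
    [0.7, 0.3, 0.1],   # passage 3
])
# Hypothetical query embedding in the same space.
query_embedding = np.array([1.0, 0.0, 0.1])

def top_k_inner_product(query, passages, k):
    """Exact maximum inner product search by scoring every passage."""
    scores = passages @ query
    order = np.argsort(-scores)[:k]   # indices of the k highest scores
    return order, scores[order]

ids, top_scores = top_k_inner_product(query_embedding, passage_embeddings, k=2)
print(ids.tolist())  # [0, 3]
```

Scoring every passage is linear in corpus size, which is why a dedicated index matters once the corpus reaches millions of passages.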
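from transformers import RagTokenizer, RagRetriever, RagTokenForGeneration

# Load the tokenizer, retriever, and model from the same checkpoint.
tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")
# use_dummy_dataset=True loads a small sample index instead of the full
# Wikipedia index, so answers from this quick test may be unreliable.
retriever = RagRetriever.from_pretrained(
    "facebook/rag-token-nq", index_name="exact", use_dummy_dataset=True
)
model = RagTokenForGeneration.from_pretrained(
    "facebook/rag-token-nq", retriever=retriever
)

# Encode the question; generate() retrieves passages and decodes the answer.
inputs = tokenizer("What is the capital of France?", return_tensors="pt")
generated = model.generate(**inputs)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))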
Benchmark Results
On open-domain QA benchmarks such as NaturalQuestions and TriviaQA, the paper reports RAG scoring higher than closed-book parametric models (T5 with no retrieval) and competitive with or better than extractive retrieve-and-read systems. It set state-of-the-art results on several knowledge-intensive tasks in 2020, suggesting that combining retrieval with generation can be more effective than relying on parametric knowledge or retrieval alone.
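Open-domain QA results of this kind are reported as Exact Match: a prediction counts only if it equals one of the gold answers after light normalization. The sketch below follows the widely used SQuAD-style normalization (lowercasing, stripping punctuation and articles); treating that as the exact procedure behind any particular reported number is an assumption.

```python
import re
import string

def normalize_answer(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)

print(exact_match("The Paris.", ["Paris"]))           # True
print(exact_match("Lyon", ["Paris", "Paris, France"]))  # False
```

Exact Match is strict: a correct but differently phrased answer scores zero, which is worth keeping in mind when comparing a free-form generator against extractive systems.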
Evolution to Modern RAG Stacks
The original RAG paper used DPR + BART. Modern stacks replace these with:
- Retriever: OpenAI/Cohere/local embedding models + Pinecone/Weaviate/Chroma vector databases
- Generator: GPT-4, Claude, Llama
- Frameworks: LangChain, LlamaIndex
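The core insight, grounding generation in retrieved evidence, remains unchanged, although most modern stacks pass retrieved text to a frozen generator through the prompt rather than training the retriever and generator jointly.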
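A modern retrieve-then-prompt loop can be sketched without any external service. Everything here is a stand-in: `embed` is a toy bag-of-words function in place of a real embedding model, the in-memory matrix plays the role of a vector database, and the final prompt would be sent to whichever generator the stack uses.

```python
import numpy as np

DOCS = [
    "Paris is the capital of France.",
    "BART is a sequence-to-sequence Transformer.",
    "FAISS is a library for similarity search.",
]

def tokenize(text):
    return [w.strip(".,?!").lower() for w in text.split()]

# Vocabulary built from the corpus; query words outside it are ignored.
VOCAB = sorted({w for doc in DOCS for w in tokenize(doc)})

def embed(text):
    """Toy bag-of-words vector (hypothetical stand-in for an embedding model)."""
    tokens = tokenize(text)
    vec = np.array([tokens.count(w) for w in VOCAB], dtype=float)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

DOC_MATRIX = np.stack([embed(doc) for doc in DOCS])

def retrieve(query, k=2):
    """Return the k documents with the highest cosine similarity to the query."""
    scores = DOC_MATRIX @ embed(query)
    return [DOCS[i] for i in np.argsort(-scores)[:k]]

def build_prompt(query, passages):
    """Place the retrieved passages ahead of the question as numbered evidence."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the passages below.\n\n"
        f"{context}\n\nQuestion: {query}\nAnswer:"
    )

query = "What is the capital of France?"
prompt = build_prompt(query, retrieve(query, k=1))
print(prompt)
```

Swapping in a real embedding model and vector store changes `embed` and `retrieve`, while the prompt-building step stays essentially the same.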