Retrieval-Augmented Generation: The 2020 Paper That Changed How We Build LLM Apps

The RAG paper by Lewis et al. introduced the idea of combining a retrieval system with a generative model, laying the foundation for modern LLM applications that ground their responses in documents.

Mahmudul Haque Qudrati

CEO & ML Engineer

March 13, 2026

9 min read

// tags

#rag #retrieval #knowledge-grounding #dpr #bart


Before RAG: The Knowledge Problem

Large language models encode knowledge in their parameters during pretraining, but this knowledge has a cutoff date, cannot be updated without expensive retraining, and gives way to confidently stated hallucinations when the model lacks the relevant facts. The RAG paper (arXiv:2005.11401) by Patrick Lewis and colleagues at Facebook AI Research (2020) introduced a cleaner solution: keep the knowledge in an external document index that the model consults at generation time, instead of relying only on what is stored in its parameters.

The RAG Architecture

RAG combines two components:

Retriever: Given a query, find the most relevant documents from a large corpus using dense vector search. The paper uses Dense Passage Retrieval (DPR), which encodes queries and passages with separate BERT models and finds nearest neighbors in embedding space.
Generator: Given the query and the retrieved documents, generate the answer. The paper uses BART, a sequence-to-sequence Transformer, which conditions its generation on both the query and the evidence passages.

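The retriever and generator are trained jointly end-to-end: the retrieved document is treated as a latent variable, and the loss marginalizes over the top-k retrieved documents, weighted by their retrieval probabilities. In the paper, only the query encoder and the generator are fine-tuned; the passage encoder and its index stay fixed.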
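Written out in the paper's notation, the marginalization looks roughly as follows. Here η denotes the retriever parameters, θ the generator parameters, x the input, y the target, and z a retrieved document. The block shows the sequence-level form only and is meant as a reading aid, not a full restatement of the paper's derivation.

```latex
% Retriever: inner product between BERT-based passage and query encoders
p_\eta(z \mid x) \propto \exp\!\big(\mathbf{d}(z)^\top \mathbf{q}(x)\big),
\qquad \mathbf{d}(z) = \mathrm{BERT}_d(z), \quad \mathbf{q}(x) = \mathrm{BERT}_q(x)

% Marginal likelihood over the top-k retrieved documents
p(y \mid x) \approx \sum_{z \in \operatorname{top-}k\,(p_\eta(\cdot \mid x))}
    p_\eta(z \mid x)\, p_\theta(y \mid x, z)

% Training objective over input/output pairs (x_j, y_j)
\mathcal{L} = \sum_j -\log p(y_j \mid x_j)
```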

RAG-Sequence vs RAG-Token

The paper introduces two variants:

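RAG-Sequence uses the same retrieved document to generate the complete answer. It computes a sequence probability under each of the top-k documents and then marginalizes over them. This is conceptually simpler and works well for most QA tasks.

RAG-Token retrieves the same top-k documents once per query but marginalizes over them at every generation step, so each output token can draw on a different document. This is more flexible, because the generator can combine evidence from several passages within one answer. The two variants also decode differently: RAG-Token fits a standard beam search, while RAG-Sequence needs a separate beam search per document before the candidates are rescored.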
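The difference between the two variants comes down to where the sum over documents sits relative to the product over tokens. The sketch below uses made-up probabilities and hypothetical function names (`rag_sequence_prob`, `rag_token_prob`) to show both marginalizations on a two-document, two-token toy case; it does not reproduce the paper's implementation.

```python
import math


def rag_sequence_prob(doc_probs, token_probs):
    """Marginalize over documents once, at the sequence level.

    doc_probs[z]      : p(z | x) for each retrieved document z
    token_probs[z][i] : p(y_i | x, z, y_<i) for target token i under document z
    """
    return sum(
        p_z * math.prod(token_probs[z])
        for z, p_z in enumerate(doc_probs)
    )


def rag_token_prob(doc_probs, token_probs):
    """Marginalize over documents separately for every target token."""
    num_tokens = len(token_probs[0])
    per_token = [
        sum(p_z * token_probs[z][i] for z, p_z in enumerate(doc_probs))
        for i in range(num_tokens)
    ]
    return math.prod(per_token)


# Toy numbers (invented for illustration): two documents, a two-token target.
doc_probs = [0.5, 0.5]
token_probs = [
    [0.9, 0.1],  # document 0 supports the first token but not the second
    [0.1, 0.9],  # document 1 supports the second token but not the first
]

seq = rag_sequence_prob(doc_probs, token_probs)
tok = rag_token_prob(doc_probs, token_probs)
print(f"RAG-Sequence: {seq:.4f}")
print(f"RAG-Token:    {tok:.4f}")
```

In this toy case no single document explains the whole target, so the token-level marginal assigns it a higher probability than the sequence-level one.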

Dense Passage Retrieval (DPR)

DPR encodes passages and queries into a shared embedding space using two BERT encoders. Passage embeddings are computed ahead of time and indexed; at query time, the query is encoded and the k passages with the highest inner-product scores are found using FAISS (Facebook's efficient similarity search library). This replaces BM25 sparse retrieval with learned dense representations that capture semantic similarity rather than just keyword overlap.
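The nearest-neighbor step can be sketched in plain Python as an exhaustive inner-product search. The vectors below are made-up three-dimensional stand-ins for real BERT embeddings, and `top_k_passages` is a hypothetical helper rather than a DPR or FAISS function; a library such as FAISS performs the same kind of lookup efficiently over millions of passages.

```python
import heapq


def inner_product(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k_passages(query_vec, passage_vecs, k):
    """Return (index, score) pairs for the k highest-scoring passages, best first."""
    scored = (
        (inner_product(query_vec, p), i) for i, p in enumerate(passage_vecs)
    )
    best = heapq.nlargest(k, scored)
    return [(i, score) for score, i in best]


# Made-up embeddings standing in for encoded passages and an encoded query.
passage_vecs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [0.9, 0.4, 0.0]

results = top_k_passages(query_vec, passage_vecs, k=2)
for index, score in results:
    print(f"passage {index}: score {score:.2f}")
```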

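The Hugging Face transformers library ships an implementation of RAG. The example below loads the pretrained RAG-Token checkpoint. It needs the datasets and faiss packages installed, and use_dummy_dataset=True swaps the full Wikipedia index for a small sample so the example stays lightweight.

from transformers import RagTokenizer, RagRetriever, RagTokenForGeneration

# Tokenizer wraps the question-encoder and generator tokenizers.
tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")

# Retriever backed by a small sample index instead of the full Wikipedia index.
retriever = RagRetriever.from_pretrained(
    "facebook/rag-token-nq", index_name="exact", use_dummy_dataset=True
)

# RAG-Token model; it calls the retriever internally during generation.
model = RagTokenForGeneration.from_pretrained(
    "facebook/rag-token-nq", retriever=retriever
)

inputs = tokenizer("What is the capital of France?", return_tensors="pt")

# Pass only the input IDs; retrieval and marginalization happen inside generate().
generated = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True))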

Benchmark Results

RAG was evaluated on open-domain QA benchmarks including NaturalQuestions and TriviaQA, against both closed-book parametric models (T5 with no retrieval) and task-specific retrieve-and-extract architectures. The authors report state-of-the-art results on three open-domain QA tasks in 2020, evidence that combining retrieval with generation was more effective than relying on either component alone.

Evolution to Modern RAG Stacks

The original RAG paper used DPR + BART. Modern stacks replace these with:

Retriever: OpenAI/Cohere/local embedding models + Pinecone/Weaviate/Chroma vector databases
Generator: GPT-4, Claude, Llama
Frameworks: LangChain, LlamaIndex

The core insight, grounding generation in retrieved evidence, remains unchanged.
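Whatever the components, a modern stack still follows the same retrieve-then-generate shape. The sketch below is a minimal illustration under stated assumptions: `keyword_retrieve` is a toy word-overlap ranker standing in for an embedding model plus vector database, `placeholder_generate` stands in for a call to a hosted or local LLM, and all function names and the prompt wording are hypothetical rather than taken from any framework.

```python
from typing import Callable, List, Sequence


def build_grounded_prompt(question: str, passages: Sequence[str]) -> str:
    """Format retrieved passages and the question into a single prompt."""
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using only the passages below. "
        "If they do not contain the answer, say so.\n\n"
        f"Passages:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )


def answer_with_rag(
    question: str,
    retrieve: Callable[[str, int], List[str]],
    generate: Callable[[str], str],
    k: int = 3,
) -> str:
    """Retrieve k passages, build a grounded prompt, and call the generator."""
    passages = retrieve(question, k)
    prompt = build_grounded_prompt(question, passages)
    return generate(prompt)


# Toy corpus and stand-in components (assumptions for illustration only).
CORPUS = [
    "Paris is the capital of France.",
    "BART is a sequence-to-sequence Transformer.",
    "FAISS is a library for similarity search.",
]


def keyword_retrieve(question: str, k: int) -> List[str]:
    """Rank passages by word overlap; a real stack would use dense embeddings."""
    q_words = set(question.lower().split())
    ranked = sorted(
        CORPUS,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def placeholder_generate(prompt: str) -> str:
    """Stand-in for an LLM call; reports the prompt size instead of answering."""
    return f"(model output for a prompt of {len(prompt)} characters)"


print(answer_with_rag("What is the capital of France?", keyword_retrieve, placeholder_generate, k=2))
```

Keeping `retrieve` and `generate` as plain callables mirrors how the retriever and generator can be swapped independently, which is the property that let the field move from DPR and BART to today's components.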
