The Problem With Always Retrieving
Standard RAG retrieves documents for every query, even when the model already knows the answer with high confidence. This wastes compute, adds latency, and can hurt performance when retrieved passages are irrelevant or distracting. A question like "What is 2+2?" does not benefit from retrieval. Self-RAG (arXiv:2310.11511) instead trains the model to decide for itself when to retrieve and to critique its own output.
Reflection Tokens
Self-RAG adds four special token types to the model's vocabulary:
- RETRIEVE: Should retrieval happen for this query? [Yes / No]
- ISREL: Is this retrieved passage relevant to the query? [Relevant / Irrelevant]
- ISSUP: Is the generated text supported by this passage? [Fully supported / Partially supported / No support]
- ISUSE: Is this response overall useful? [5-point scale]
These tokens are generated by the model itself as part of the output, enabling dynamic decision-making without any external classifier.
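As a rough illustration of how such tokens might look in practice, the sketch below lists the four token types and separates them from the prose in a generated string. The bracketed `[TYPE=value]` spelling, the `REFLECTION_TOKENS` table, and the helper names are assumptions made for this example, not the exact token strings used by the Self-RAG release.

```python
import re

# Hypothetical spelling of the reflection tokens: "[TYPE=value]".
REFLECTION_TOKENS = {
    "Retrieve": ["Yes", "No"],
    "ISREL": ["Relevant", "Irrelevant"],
    "ISSUP": ["Fully supported", "Partially supported", "No support"],
    "ISUSE": ["1", "2", "3", "4", "5"],
}

TOKEN_PATTERN = re.compile(r"\[(Retrieve|ISREL|ISSUP|ISUSE)=([^\]]+)\]")


def vocabulary():
    """Return every reflection token string that would be added to the vocabulary."""
    return [
        f"[{kind}={value}]"
        for kind, values in REFLECTION_TOKENS.items()
        for value in values
    ]


def split_reflection_tokens(text):
    """Separate generated text into plain prose and a dict of reflection tokens.

    If a token type appears more than once, the last occurrence wins.
    """
    tokens = dict(TOKEN_PATTERN.findall(text))
    prose = " ".join(TOKEN_PATTERN.sub("", text).split())
    return prose, tokens


output = (
    "[Retrieve=Yes] Paris is the capital of France. "
    "[ISREL=Relevant] [ISSUP=Fully supported] [ISUSE=5]"
)
prose, tokens = split_reflection_tokens(output)
print(prose)
print(tokens)
```

Keeping the tokens as ordinary strings in the output is what lets a single model both answer and grade itself: downstream code only needs to parse them out before showing the response to a user.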
Training and Inference
Self-RAG involves three training steps followed by a modified inference procedure:
- Critic training: Train a separate critic model to predict each reflection token type, using GPT-4-generated annotations as training labels so the data can be collected at scale.
- Data augmentation: Use the critic to annotate a large SFT corpus. For each training example, determine whether retrieval would help, and annotate retrieved passages and outputs with ISREL/ISSUP/ISUSE tokens.
- SFT on augmented data: Fine-tune the base language model on the augmented corpus with all reflection tokens interleaved, so that it learns to emit them itself and no longer needs the critic.
- Inference: At test time, the model generates RETRIEVE tokens to invoke retrieval, evaluates passages, generates conditioned on selected passages, and outputs self-evaluation tokens alongside the response.
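The data augmentation step can be sketched as follows. This is a minimal illustration, not the paper's pipeline: `StubCritic` is a toy rule-based stand-in for a trained critic model, and the `<paragraph>` wrapper, the `[TYPE=value]` token spelling, and all function names are assumptions introduced for the example.

```python
import re


def _words(text):
    """Lowercased word set, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))


class StubCritic:
    """Toy stand-in for the trained critic (illustration only)."""

    def needs_retrieval(self, query, answer):
        # Toy rule: queries containing digits are treated as arithmetic.
        return "No" if any(ch.isdigit() for ch in query) else "Yes"

    def relevance(self, query, passage):
        return "Relevant" if _words(query) & _words(passage) else "Irrelevant"

    def support(self, answer, passage):
        return "Fully supported" if answer.lower() in passage.lower() else "No support"

    def usefulness(self, query, answer):
        return 5 if answer else 1


def augment_example(query, answer, critic, retrieve_fn):
    """Build one SFT target string with reflection tokens interleaved."""
    use = critic.usefulness(query, answer)
    if critic.needs_retrieval(query, answer) == "No":
        return f"[Retrieve=No] {answer} [ISUSE={use}]"
    passage = retrieve_fn(query)
    parts = [
        "[Retrieve=Yes]",
        f"<paragraph>{passage}</paragraph>",
        f"[ISREL={critic.relevance(query, passage)}]",
        answer,
        f"[ISSUP={critic.support(answer, passage)}]",
        f"[ISUSE={use}]",
    ]
    return " ".join(parts)


critic = StubCritic()
print(augment_example("What is 2+2?", "4", critic, lambda q: ""))
print(augment_example(
    "Who wrote Hamlet?",
    "Shakespeare",
    critic,
    lambda q: "Hamlet was written by William Shakespeare.",
))
```

The strings this produces are what the generator would be fine-tuned on, which is why the critic can be dropped once training is finished.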
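    # Pseudocode for Self-RAG inference (simplified).
    # `model` and `retriever` are placeholder objects; their methods are
    # illustrative and do not correspond to a real library API.
    def self_rag_generate(model, retriever, query, k=5):
        # First, predict whether to retrieve
        retrieve_token = model.predict_retrieve(query)
        if retrieve_token == "Yes":
            passages = retriever.retrieve(query, k=k)
            # Keep only the passages the model itself judges relevant
            relevant_passages = [
                p for p in passages
                if model.predict_isrel(query, p) == "Relevant"
            ]
        else:
            relevant_passages = []

        if relevant_passages:
            # Generate conditioned on relevant passages
            response = model.generate(query, context=relevant_passages)
            # Self-evaluate how well the passages support the response
            support = model.predict_issup(query, response, relevant_passages)
        else:
            # No retrieval, or nothing relevant was found:
            # generate without retrieved context
            response = model.generate(query)
            support = None

        # Rate overall usefulness (1-5) in both branches
        score = model.predict_isuse(query, response)
        return response, support, score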
Inference-Time Control
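Because the reflection tokens are part of the output, they can steer decoding: candidate responses can be ranked or filtered by their ISREL, ISSUP, and ISUSE values, for example in a beam search over multiple candidates. At inference time, you can set thresholds: only output responses with ISUSE >= 4, or prefer responses that are fully supported by the retrieved passages. This lets you trade off factual support against other qualities, such as fluency, without retraining.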
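A minimal sketch of this kind of thresholding and reranking is shown below. It works on hard token labels for readability, whereas a real decoder would more likely combine token probabilities. The `Candidate` class, the `SUPPORT_SCORE` mapping, and the weights `w_sup` and `w_use` are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical numeric values for the ISSUP labels.
SUPPORT_SCORE = {
    "Fully supported": 1.0,
    "Partially supported": 0.5,
    "No support": 0.0,
}


@dataclass
class Candidate:
    text: str
    issup: str   # one of the SUPPORT_SCORE keys
    isuse: int   # 1 (least useful) to 5 (most useful)


def select_response(
    candidates: List[Candidate],
    min_isuse: int = 4,
    require_full_support: bool = False,
    w_sup: float = 1.0,
    w_use: float = 1.0,
) -> Optional[Candidate]:
    """Filter candidates by threshold, then return the best by weighted score."""
    eligible = [
        c for c in candidates
        if c.isuse >= min_isuse
        and (not require_full_support or c.issup == "Fully supported")
    ]
    if not eligible:
        return None  # caller decides how to fall back

    def score(c: Candidate) -> float:
        # Rescale ISUSE from 1..5 to 0..1 so both terms share a range.
        return w_sup * SUPPORT_SCORE[c.issup] + w_use * (c.isuse - 1) / 4

    return max(eligible, key=score)


candidates = [
    Candidate("fluent but unsupported answer", "No support", 5),
    Candidate("grounded answer", "Fully supported", 4),
    Candidate("grounded but terse answer", "Fully supported", 3),
]
print(select_response(candidates).text)             # weights support and usefulness equally
print(select_response(candidates, w_sup=0.0).text)  # ignores support entirely
```

Raising `w_sup` or setting `require_full_support=True` pushes the selection toward grounded answers, while lowering it favors whatever the model rates as most useful. Both are decoding-time settings, so no retraining is involved.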
Benchmark Results
The paper evaluates Self-RAG on short-form and closed-set tasks (PopQA, TriviaQA, PubHealth, ARC-Challenge) and on long-form generation (biography generation, ASQA), and reports that it outperforms ChatGPT on several of them. For simple questions, the model can skip retrieval when it judges retrieval unnecessary. On knowledge-intensive tasks, it outperforms standard RAG baselines by retrieving only when beneficial. It also generates citations for the claims it makes, enabling verification.