Self-RAG: Teaching LLMs to Decide When to Retrieve

Self-RAG introduces reflection tokens that let the model decide whether retrieval is needed and evaluate passage relevance and citation support, outperforming standard RAG on factuality benchmarks.

Mahmudul Haque Qudrati

CEO & ML Engineer

April 2, 2026

9 min read

// tags

#self-rag #rag #retrieval #reflection-tokens #adaptive


Four-Stage Training

The Self-RAG pipeline has four stages, three for training and one at inference:

Critic training: Train a separate critic model to predict each reflection token type. The labeled examples are collected by prompting GPT-4, which lets the annotation scale beyond what manual labeling allows.
Data augmentation: Use the critic to annotate a large SFT corpus. For each training example, the critic decides whether retrieval would help, then labels the retrieved passages and the output with ISREL, ISSUP, and ISUSE tokens.
SFT on augmented data: Fine-tune the base language model on the augmented corpus with all reflection tokens interleaved.
Inference: At test time, the model generates RETRIEVE tokens to invoke retrieval, evaluates passages, generates conditioned on selected passages, and outputs self-evaluation tokens alongside the response.
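
To make the data augmentation stage concrete, here is a minimal sketch of how a critic's labels might be interleaved into a single training sequence. The bracketed token spellings, the `<paragraph>` wrapper, and the `Annotation` fields are illustrative assumptions, not necessarily the exact vocabulary used by the Self-RAG authors.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Annotation:
    """Critic labels for one training example (all fields are hypothetical)."""
    retrieve: bool                  # did the critic decide retrieval would help?
    passage: Optional[str] = None   # retrieved passage, if any
    isrel: Optional[str] = None     # e.g. "Relevant" or "Irrelevant"
    issup: Optional[str] = None     # e.g. "Fully supported"
    isuse: int = 5                  # usefulness rating, assumed 1-5


def augment(instruction: str, answer: str, ann: Annotation) -> str:
    """Interleave reflection tokens with the original instruction/answer pair."""
    parts = [instruction]
    if ann.retrieve and ann.passage is not None:
        parts.append("[Retrieve=Yes]")
        parts.append(f"<paragraph>{ann.passage}</paragraph>")
        parts.append(f"[ISREL={ann.isrel}]")
        parts.append(answer)
        parts.append(f"[ISSUP={ann.issup}]")
    else:
        # No passage, so relevance and support tokens do not apply
        parts.append("[Retrieve=No]")
        parts.append(answer)
    parts.append(f"[ISUSE={ann.isuse}]")
    return " ".join(parts)


example = augment(
    "Who wrote Dune?",
    "Frank Herbert wrote Dune.",
    Annotation(True, "Dune is a 1965 novel by Frank Herbert.",
               "Relevant", "Fully supported", 5),
)
print(example)
```

Fine-tuning on sequences shaped like this is what lets the generator emit the reflection tokens itself at test time, without calling the critic.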

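# Pseudocode for Self-RAG inference. `model` and `retriever` are placeholder
# objects; each predict_* method stands in for decoding one reflection token.
def self_rag_generate(model, retriever, query, k=5):
    relevant_passages = []
    support = None  # ISSUP only applies when passages were used

    # First, predict whether to retrieve
    if model.predict_retrieve(query) == "Yes":
        passages = retriever.retrieve(query, k=k)
        # Keep only the passages the model marks as relevant (ISREL)
        relevant_passages = [
            p for p in passages
            if model.predict_isrel(query, p) == "Relevant"
        ]

    if relevant_passages:
        # Generate conditioned on relevant passages
        response = model.generate(query, context=relevant_passages)
        # Self-evaluate how well the passages support the response (ISSUP)
        support = model.predict_issup(query, response, relevant_passages)
    else:
        # Generate without retrieval, or when nothing relevant came back
        response = model.generate(query)

    # Usefulness (ISUSE) is scored for every response
    usefulness = model.predict_isuse(query, response)

    # Return the critique tokens with the response so callers can filter on them
    return {
        "response": response,
        "passages": relevant_passages,
        "support": support,
        "usefulness": usefulness,
    }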

Inference-Time Control

The reflection-token scores enable beam search over multiple candidate responses, because each candidate carries its own ISUSE and ISSUP judgments. At inference time, you can set thresholds on them: only output responses with ISUSE >= 4 (on its 1-to-5 scale), or prefer responses with full citation support. This lets you trade off diversity against factuality without retraining.
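
The thresholding described here can be sketched as a filter followed by a ranking step. This is a minimal illustration, not the authors' decoding algorithm: the `Candidate` class, the numeric values in `SUPPORT_SCORE`, and the `support_weight` parameter are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical mapping from ISSUP labels to numeric scores (an assumption).
SUPPORT_SCORE = {
    "Fully supported": 1.0,
    "Partially supported": 0.5,
    "No support": 0.0,
}


@dataclass
class Candidate:
    text: str
    isuse: int   # usefulness rating, assumed to be on a 1-5 scale
    issup: str   # citation-support label, a key of SUPPORT_SCORE


def select(candidates: List[Candidate], min_isuse: int = 4,
           support_weight: float = 1.0) -> Optional[Candidate]:
    """Drop candidates below the ISUSE threshold, then rank the rest."""
    kept = [c for c in candidates if c.isuse >= min_isuse]
    if not kept:
        return None  # nothing met the threshold; the caller decides what to do

    def score(c: Candidate) -> float:
        # Normalise ISUSE to [0, 1] and add the weighted support score.
        return (c.isuse - 1) / 4 + support_weight * SUPPORT_SCORE[c.issup]

    return max(kept, key=score)


candidates = [
    Candidate("A", isuse=5, issup="No support"),
    Candidate("B", isuse=4, issup="Fully supported"),
    Candidate("C", isuse=3, issup="Fully supported"),
]
print(select(candidates).text)                      # B
print(select(candidates, support_weight=0.0).text)  # A
```

Raising `support_weight` favors well-cited answers, while lowering it favors answers the model rates as more useful. Both knobs change behavior at decoding time only, so no retraining is involved.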

Benchmark Results

Self-RAG outperforms ChatGPT on factuality benchmarks (TriviaQA, PopQA, ARC-Challenge, MedQA). For simple questions it answers without retrieval, because the model correctly identifies when retrieval is unnecessary. On complex knowledge-intensive tasks, it outperforms standard RAG by retrieving only when doing so is beneficial. It also generates citations for the claims it makes, which enables verification.
