The choice between context stuffing (putting all your knowledge directly in the context window) and retrieval-augmented generation (fetching relevant chunks at query time) is one of the most consequential architectural decisions in LLM applications. The right answer depends on the size, update frequency, and access pattern of your knowledge base.
Context Stuffing: What It Is and When It Works
Context stuffing means including the full knowledge base — or a large portion of it — directly in the prompt. If you have 200 pages of product documentation and every question might require any part of it, you include all 200 pages in the context.
This sounds wasteful, but it has a decisive advantage: the model can see everything at once. No retrieval step means no retrieval errors. The model can reason across sections, follow cross-references, and find relevant information even when the query does not contain the right keywords to retrieve it.
Context stuffing works well when:
- Your knowledge base fits comfortably within the context window (under ~100k tokens as a practical limit)
- Your knowledge base is relatively static (updates are infrequent)
- Queries require reasoning across multiple sections of the knowledge base
- Retrieval errors would be costly (missing a relevant section produces a wrong answer)
- Latency from retrieval is a concern
For modern models with 128k-200k context windows, a lot of knowledge bases fit. 100k tokens is roughly 75,000 words — about 300 pages of dense documentation. Many product knowledge bases, legal documents, and internal wikis are smaller than this.
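As a rough check on whether a knowledge base fits, the rule of thumb above (100k tokens for roughly 75,000 words, or about 4/3 tokens per word) can be turned into a quick estimate. This is a minimal sketch: `estimate_tokens` and `fits_in_context` are illustrative names, and a whitespace word count only approximates what a real tokenizer would report.

```python
def estimate_tokens(text: str, tokens_per_word: float = 4 / 3) -> int:
    """Rough token estimate from a whitespace word count (not a real tokenizer)."""
    return round(len(text.split()) * tokens_per_word)


def fits_in_context(documents: list[str], budget_tokens: int = 100_000) -> bool:
    """Return True if the estimated total size stays within the token budget."""
    total = sum(estimate_tokens(doc) for doc in documents)
    return total <= budget_tokens
```

For a real decision, count tokens with the tokenizer of the model you plan to use; the estimate is only good enough to tell whether you are clearly under or clearly over the budget.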
The Cost Reality of Context Stuffing
The most common objection to context stuffing is cost. Sending 300 pages in every request sounds expensive. Let's compute it concretely.
Assume a 100k-token knowledge base queried 1,000 times per day with GPT-4o at $2.50 per million input tokens:
- Daily input token cost: 100,000 tokens x 1,000 queries x $0.0000025 = $250/day
- Monthly: ~$7,500
Now compare to RAG retrieving 5 chunks of 500 tokens each:
- Daily input token cost: 2,500 tokens x 1,000 queries x $0.0000025 = $6.25/day
- Monthly: ~$188
RAG also carries infrastructure costs: embedding model API calls and a vector database (Pinecone starts at around $70/month for production, with Weaviate Cloud in a similar range). Retrieval also adds 100-500ms of latency per query.
The cost difference is real and significant at scale. But the right comparison is not cost alone — it is cost adjusted for accuracy. If RAG misses the relevant chunk 10% of the time and produces wrong answers, the cost of those errors may exceed the savings from cheaper tokens.
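The arithmetic above, including the accuracy adjustment, can be sketched in a few lines. The token prices and volumes come from the example; the $70 fixed cost is the vector database figure quoted above, and the cost of a wrong answer is a hypothetical input you would have to estimate for your own product.

```python
def monthly_token_cost(tokens_per_query: int, queries_per_day: int,
                       usd_per_million_tokens: float = 2.50, days: int = 30) -> float:
    """Input-token spend per month for a fixed prompt size and query volume."""
    return tokens_per_query * queries_per_day * days * usd_per_million_tokens / 1_000_000


def break_even_error_cost(stuffing_monthly: float, rag_monthly: float,
                          queries_per_month: int, rag_error_rate: float) -> float:
    """Cost per wrong answer (USD) at which RAG's savings are fully cancelled out."""
    return (stuffing_monthly - rag_monthly) / (queries_per_month * rag_error_rate)


stuffing = monthly_token_cost(100_000, 1_000)   # 7500.0
rag = monthly_token_cost(2_500, 1_000) + 70.0   # 187.5 tokens + 70 fixed (assumed) = 257.5
threshold = break_even_error_cost(stuffing, rag, queries_per_month=30_000, rag_error_rate=0.10)
```

Under these assumptions, RAG stays cheaper as long as a wrong answer costs less than about $2.41. If a miss costs more than that (a lost customer, a support escalation), context stuffing is the cheaper option despite the larger token bill.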
Practical rule: Use context stuffing when your knowledge base is under 100k tokens and query volume is under a few hundred per day. Switch to RAG when cost at scale becomes prohibitive or when the knowledge base outgrows the context window.
When RAG Wins
RAG is the right choice when:
The knowledge base is large: 500+ pages of documentation, a database of 10,000 support tickets, a corpus of legal cases. This does not fit in context.
The knowledge base changes frequently: If you update content daily, re-embedding only the changed documents is cheaper than rebuilding your entire context every time.
Queries are narrow: Most questions require only a small fraction of the knowledge base. Retrieving 5 relevant chunks is sufficient. The cost savings are real without significant accuracy loss.
You need source attribution: RAG naturally produces citations (the retrieved chunks). Context stuffing can cite sources but requires explicit prompting and may hallucinate section references.
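For the frequent-update case, one simple way to re-embed only what changed is to keep a content hash per document and compare it on each update. This is a sketch under assumptions: `changed_documents` and the in-memory `stored_hashes` dictionary are illustrative, and a real pipeline would persist the hashes alongside the vector index.

```python
import hashlib


def changed_documents(docs: dict[str, str], stored_hashes: dict[str, str]) -> list[str]:
    """Return IDs of documents whose content differs from the last recorded hash.

    Updates stored_hashes in place so the next call only reports new changes.
    """
    changed = []
    for doc_id, text in docs.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if stored_hashes.get(doc_id) != digest:
            changed.append(doc_id)
            stored_hashes[doc_id] = digest
    return changed
```

Only the returned IDs need to be re-chunked and re-embedded; everything else in the index can stay as it is.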
Chunking Strategy: Too Small vs Too Large
The biggest technical decision in RAG is chunking — how you divide your source documents into retrievable pieces. Bad chunking is the most common cause of RAG failures.
Too-small chunks (under 100 tokens): Individual sentences or short paragraphs. These often lack context. Retrieving "The limit is 500" tells you nothing without knowing what has a limit of 500. The model cannot answer accurately because the chunk does not contain enough information.
Too-large chunks (over 1,000 tokens): Dense sections that contain multiple topics. The embedding vector averages over too many concepts, making retrieval less precise. You pay more tokens per retrieved chunk and the model has to dig through irrelevant content to find the answer.
Practical starting point: 300-500 tokens per chunk, with 50-100 token overlap between adjacent chunks. The overlap ensures that information spanning a chunk boundary is not lost.
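A fixed-size splitter with overlap can be sketched as a sliding window. The function name and defaults here are illustrative, and the input is assumed to be already tokenized (whitespace-separated words work as a rough stand-in for model tokens).

```python
def chunk_with_overlap(tokens: list[str], chunk_size: int = 400,
                       overlap: int = 75) -> list[list[str]]:
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With the defaults, each window starts 325 tokens after the previous one, so the last 75 tokens of one chunk are repeated as the first 75 of the next.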
Semantic chunking: Instead of splitting by token count, split at natural semantic boundaries (paragraphs, section headers, logical units). This requires more preprocessing but produces better retrieval results because each chunk represents a coherent concept.
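    def chunk_by_section(text: str, max_tokens: int = 400) -> list[str]:
        """Split text at blank lines (paragraph breaks), merging paragraphs up to max_tokens.

        A single paragraph longer than max_tokens is kept whole as its own chunk.
        """
        # Drop empty paragraphs produced by runs of blank lines.
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        chunks: list[str] = []
        current_chunk: list[str] = []
        current_tokens = 0
        for para in paragraphs:
            # Rough estimate: about 1.3 tokens per whitespace-separated word.
            para_tokens = int(len(para.split()) * 1.3)
            # Close the current chunk if adding this paragraph would exceed the budget.
            if current_tokens + para_tokens > max_tokens and current_chunk:
                chunks.append("\n\n".join(current_chunk))
                current_chunk = [para]
                current_tokens = para_tokens
            else:
                current_chunk.append(para)
                current_tokens += para_tokens
        # Flush whatever is left after the last paragraph.
        if current_chunk:
            chunks.append("\n\n".join(current_chunk))
        return chunks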
The Hybrid Approach
Many production systems use both: a small, stable core knowledge base in context (100-500 tokens of key facts, system information, or user profile) plus RAG for the larger dynamic knowledge base.
System context (always included):
- User's account details and preferences (200 tokens)
- Product pricing and plan information (100 tokens)
- Key policies that affect all answers (200 tokens)
Retrieved context (per query):
- The 3-5 chunks from the documentation database most relevant to the query (roughly 1,500 tokens in total)
This hybrid gives you the predictability of context stuffing for the most important always-relevant information, combined with the scale of RAG for the larger knowledge base.
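Assembling such a prompt is mostly string concatenation. The sketch below assumes the core facts and the retrieved chunks are already available as plain strings; `build_prompt` and the section labels are illustrative choices, not a required format.

```python
def build_prompt(core_facts: list[str], retrieved_chunks: list[str], question: str) -> str:
    """Combine always-included facts with per-query retrieved chunks."""
    core = "\n".join(f"- {fact}" for fact in core_facts)
    # Number the chunks so the model can refer back to them when citing sources.
    retrieved = "\n\n".join(
        f"[chunk {i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        f"System context (always included):\n{core}\n\n"
        f"Retrieved context (for this query):\n{retrieved}\n\n"
        f"Question: {question}"
    )
```

Keeping the stable block first and unchanged between requests also makes it easier to see which part of the prompt varies per query when debugging answers.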
Retrieval Quality: The Hidden Variable
RAG accuracy depends heavily on retrieval quality. If the retriever returns the wrong chunks, even a perfect generator produces a wrong answer. Common retrieval failures:
Keyword mismatch: The user asks "Can I get a refund?" and the documentation says "cancellation and reimbursement policy." The meaning matches but the keywords do not, so a naive keyword-based retriever misses it. Fix: use dense vector retrieval (embeddings) instead of keyword search, or combine both (hybrid search).
Missing context: The relevant chunk says "refer to section 4.2 for limits," but section 4.2 is not retrieved. Fix: retrieve the adjacent chunks along with each retrieved chunk, or increase chunk overlap.
Over-retrieval: Returning too many chunks fills the context with tangentially related information. The model hedges or hallucinates because it is trying to reconcile conflicting or irrelevant chunks. Fix: retrieve fewer, higher-confidence chunks and set a minimum similarity threshold.
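The three fixes above can each be sketched in a few lines. These are illustrations under assumptions, not a specific library's API: hybrid search is shown here as reciprocal rank fusion (one common way to merge a keyword ranking with a dense ranking, with the conventional constant `k = 60`), chunks are assumed to be addressable by their position in the source document, and similarity scores are assumed to come from your vector store.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Keyword mismatch: merge several ranked ID lists (e.g. keyword + dense) into one."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def expand_with_neighbors(hits: list[int], num_chunks: int, window: int = 1) -> list[int]:
    """Missing context: add the chunks adjacent to each hit, by position in the document."""
    expanded = set()
    for idx in hits:
        expanded.update(range(max(0, idx - window), min(num_chunks, idx + window + 1)))
    return sorted(expanded)


def filter_by_similarity(scored: list[tuple[str, float]], top_k: int = 3,
                         min_similarity: float = 0.5) -> list[str]:
    """Over-retrieval: keep at most top_k chunks that clear a similarity threshold."""
    kept = [(cid, score) for cid, score in scored if score >= min_similarity]
    kept.sort(key=lambda pair: -pair[1])
    return [cid for cid, _ in kept[:top_k]]
```

The threshold and `top_k` values are placeholders; sensible settings depend on the embedding model and should be tuned against a labelled set of queries.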
Evaluating Your RAG Pipeline
Two separate evaluations are needed:
- Retrieval evaluation: Given a query, does the retriever return the chunk that contains the answer? Measure recall@k (does the correct chunk appear in the top k results?).
- End-to-end evaluation: Given a query, does the full pipeline (retrieve + generate) return the correct answer? This is the metric that matters to users, but it conflates retrieval and generation errors.
Measuring both separately lets you diagnose whether failures are in retrieval (wrong chunks returned) or generation (correct chunks returned but wrong answer generated).
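Both measurements are easy to compute once you have a small labelled set. The sketch below assumes each test query has exactly one known correct chunk ID; `recall_at_k` and `diagnose` are illustrative helper names.

```python
def recall_at_k(retrieved: list[list[str]], relevant: list[str], k: int = 5) -> float:
    """Fraction of queries whose correct chunk appears in the top k retrieved IDs."""
    if not relevant:
        return 0.0
    hits = sum(1 for results, gold in zip(retrieved, relevant) if gold in results[:k])
    return hits / len(relevant)


def diagnose(results: list[str], gold: str, answer_correct: bool, k: int = 5) -> str:
    """Attribute a single test case to retrieval, generation, or neither."""
    if answer_correct:
        return "ok"
    return "generation" if gold in results[:k] else "retrieval"
```

Counting the `diagnose` labels across the test set shows where to spend effort: a majority of "retrieval" failures points at chunking or search, while "generation" failures point at the prompt or the model.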