Basic RAG works by splitting documents into chunks, embedding them, and retrieving the top-K most similar chunks for a given query. In practice, this naive approach fails in ways that are frustrating to debug and expensive to fix after deployment. The retrieved chunks are often irrelevant, the chunking breaks context at critical boundaries, and the system has no way to recognize when the retrieval result is insufficient. Advanced RAG techniques address each of these failure modes directly.
Why Basic RAG Fails
Poor retrieval quality is the most common failure. Semantic similarity between a question and a chunk does not guarantee that the chunk contains the answer. "What are the payment terms?" might retrieve a chunk about payment methods (high semantic similarity) rather than the actual payment terms clause (lower similarity because it uses different vocabulary).
Context fragmentation happens when the answer spans multiple chunks or when a chunk starts mid-sentence because the text splitter counted characters without understanding sentence boundaries. The model receives incomplete information and either hallucinates the missing parts or gives an uncertain answer.
Naive chunking that splits on character count produces chunks that break at arbitrary points: mid-formula, mid-table, mid-numbered-list. The semantic coherence of the chunk is destroyed.
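To make the difference concrete, here is a minimal sketch contrasting a fixed-width character splitter with a greedy sentence-aware one. The function names, the regex used as a sentence boundary, and the sample text are illustrative assumptions, not a production splitter; real documents with abbreviations, tables, or formulas need a more careful boundary detector.

```python
import re


def split_by_chars(text: str, size: int) -> list[str]:
    """Naive splitter: cuts every `size` characters, ignoring sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def split_by_sentences(text: str, max_chars: int) -> list[str]:
    """Greedy splitter: packs whole sentences into chunks of at most `max_chars`.

    A single sentence longer than `max_chars` is kept whole rather than cut.
    """
    # Simplistic boundary (an assumption): whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # close the chunk at a sentence boundary
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "Payment is due in 30 days. Late fees are 2% per month. Disputes go to arbitration."
naive = split_by_chars(text, 40)      # first chunk ends mid-sentence
aware = split_by_sentences(text, 60)  # every chunk ends on a full sentence
```

With the sample text, the character splitter cuts the second sentence in half, while the sentence-aware version keeps each clause intact at the cost of uneven chunk sizes.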
No quality signal on retrieval results: basic RAG always retrieves K chunks and always passes them to the model, even when none of them are relevant. Nothing in the pipeline checks relevance, so the model is nudged to answer from whatever it was given rather than say "I did not find a relevant answer."
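One simple way to add a quality signal is a similarity floor: drop any chunk whose score falls below a threshold and let the caller abstain when nothing survives. The sketch below uses toy two-dimensional vectors; the function names and the 0.75 threshold are hypothetical, and a usable threshold has to be tuned for your embedding model and corpus.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def gated_retrieve(
    query_vec: list[float],
    chunks: list[tuple[str, list[float]]],
    top_k: int = 3,
    min_score: float = 0.75,  # hypothetical threshold; tune on an eval set
) -> list[str]:
    """Return up to top_k chunk texts scoring at least min_score (may be empty)."""
    scored = sorted(
        ((cosine(query_vec, vec), text) for text, vec in chunks),
        reverse=True,
    )
    return [text for score, text in scored[:top_k] if score >= min_score]


# Toy data: (chunk text, embedding). Real embeddings come from your model.
chunks = [
    ("Payment is due within 30 days.", [1.0, 0.0]),
    ("We accept cards and wire transfers.", [0.6, 0.8]),
    ("Office hours are 9 to 5.", [0.0, 1.0]),
]

relevant = gated_retrieve([1.0, 0.0], chunks)
nothing = gated_retrieve([-1.0, 0.0], chunks)
if not nothing:
    # The application can now abstain instead of passing weak context along.
    print("No sufficiently relevant passage found.")
```

An empty result gives the application an explicit branch for abstaining or asking a clarifying question.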
Technique 1: Hybrid Search
Hybrid search combines semantic search (embedding similarity) with keyword search (BM25 or similar). Each retrieval method has complementary strengths.
Semantic search handles paraphrase and related concepts: "compensation" matches "salary" even though the words differ. Keyword search handles exact terminology: a query for "Section 12.4(b)" finds documents containing that exact string, which semantic search may miss because the embedding treats all legal section references as similar.
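An implementation with Elasticsearch, combining a BM25 match clause and a kNN clause in a single bool query (OpenSearch supports the same pattern, but its kNN query syntax differs):

    from elasticsearch import Elasticsearch

    # Assumption: a local cluster; replace with your own endpoint and credentials.
    es = Elasticsearch("http://localhost:9200")

    def hybrid_search(query: str, index: str, top_k: int = 10) -> list[str]:
        # get_embedding is a placeholder for your embedding function; it must use
        # the same model that produced the vectors stored in the "embedding" field.
        embedding = get_embedding(query)
        response = es.search(
            index=index,
            query={
                "bool": {
                    "should": [
                        # Keyword (BM25) clause
                        {"match": {"content": {"query": query, "boost": 0.3}}},
                        # Vector (kNN) clause; needs an Elasticsearch version
                        # that accepts knn as a query clause
                        {"knn": {"field": "embedding", "query_vector": embedding, "num_candidates": 50, "boost": 0.7}},
                    ]
                }
            },
            size=top_k,
        )
        return [hit["_source"]["content"] for hit in response["hits"]["hits"]]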
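The boost weights scale how much each clause contributes to the final score. A 0.3 keyword / 0.7 semantic split is a common starting point, but BM25 scores and vector similarity scores are on different scales, so the boosts are not a true ratio; tune them on your eval set.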
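When boost weights prove hard to tune, a commonly used alternative is reciprocal rank fusion (RRF). This is a different fusion method from the boosted query above: the keyword and semantic searches run separately, and results are combined by rank position rather than raw score. The sketch below is a minimal pure-Python version; the function name and document ids are hypothetical, and k=60 is the conventional default constant.

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]],
    k: int = 60,
    top_n: int = 10,
) -> list[str]:
    """Fuse several ranked lists of document ids.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by id for determinism.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


keyword_hits = ["d3", "d1", "d7"]   # e.g. from a BM25 query
semantic_hits = ["d1", "d2", "d3"]  # e.g. from a kNN query
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
```

Documents that rank well in both lists rise to the top, and no score normalization is needed because only rank positions are used.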
Technique 2: HyDE (Hypothetical Document Embeddings)
HyDE addresses a fundamental mismatch: queries are short and conversational, while documents are long and formal. Even with the same embedding model, a question and the passage that answers it can land far apart in embedding space.
HyDE solves this by generating a hypothetical answer to the query, embedding the hypothetical answer (not the query), and using that embedding for retrieval. The hypothetical answer is in the same linguistic register as the documents.
    import anthropic

    client = anthropic.Anthropic()

    def hyde_retrieve(query: str, vector_store, top_k: int = 5):
        # Generate a hypothetical answer
        response = client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=300,
            messages=[{
                "role": "user",
                "content": f"Write a passage that would answer this question from a technical document: {query}"
            }]
        )
        hypothetical_answer = response.content[0].text
        # Embed the hypothetical answer, not the query
        hypothetical_embedding = get_embedding(hypothetical_answer)
        # Retrieve using the hypothetical answer's embedding
        return vector_store.similarity_search_by_vector(hypothetical_embedding, k=top_k)
HyDE tends to help most on domain-specific corpora where query vocabulary differs significantly from document vocabulary. It adds one LLM call per query, so measure the retrieval gain against the added latency and cost.
Technique 3: Parent-Child Chunking
Retrieve small chunks for precision, return their larger parent chunks for context. This addresses the context fragmentation problem without bloating the retrieval index with large chunks.
The index contains small chunks (200-300 tokens) for high-precision embedding similarity. Each small chunk stores a reference to its parent chunk (500-1000 tokens, or the full document section). When a small chunk is retrieved, the system returns the parent chunk to the model, providing full context around the retrieved passage.
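    from llama_index.core import StorageContext, VectorStoreIndex
    from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
    from llama_index.core.retrievers import AutoMergingRetriever
    from llama_index.core.storage.docstore import SimpleDocumentStore

    # documents: a list of LlamaIndex Document objects loaded elsewhere
    parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[2048, 512, 128])
    nodes = parser.get_nodes_from_documents(documents)
    leaf_nodes = get_leaf_nodes(nodes)
    docstore = SimpleDocumentStore()  # holds every node so parents can be looked up
    docstore.add_documents(nodes)
    storage_context = StorageContext.from_defaults(docstore=docstore)
    # Build index from leaf nodes (smallest chunks)
    leaf_index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)
    # Retriever that merges to parent when enough children are retrieved
    retriever = AutoMergingRetriever(leaf_index.as_retriever(similarity_top_k=6), storage_context)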
LlamaIndex's AutoMergingRetriever handles this automatically: if enough sibling chunks from the same parent are retrieved, it replaces them with the parent chunk.
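For readers not using LlamaIndex, the simpler variant described earlier (always return the parent of each retrieved child) can be sketched in a few lines of plain Python. Note this differs from AutoMergingRetriever, which only merges when enough siblings are retrieved. The Child class, the function name, and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Child:
    """A small indexed chunk that remembers which parent it came from."""
    text: str
    parent_id: str


def expand_to_parents(hits: list[Child], parents: dict[str, str]) -> list[str]:
    """Map retrieved child chunks to parent texts.

    Keeps the rank order of the first child seen for each parent and
    drops duplicate parents so the same section is not sent twice.
    """
    seen: set[str] = set()
    expanded: list[str] = []
    for hit in hits:
        if hit.parent_id not in seen:
            seen.add(hit.parent_id)
            expanded.append(parents[hit.parent_id])
    return expanded


parents = {
    "p1": "Section 4 (full text): payment terms, net 30, late fees...",
    "p2": "Section 9 (full text): disputes and arbitration...",
}
# Pretend these came back from a vector search over the small chunks.
hits = [Child("net 30", "p1"), Child("late fee 2%", "p1"), Child("arbitration", "p2")]
context = expand_to_parents(hits, parents)
```

Deduplicating by parent id matters in practice: several children of the same section often match the same query.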
Technique 4: Query Rewriting
The user's query may not be the best search query. A question like "how do I handle this?" contains no domain vocabulary. Query rewriting reformulates the question to improve retrieval.
    # Reuses the anthropic client created in the HyDE example above
    def rewrite_query(original_query: str) -> list[str]:
        response = client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=200,
            messages=[{
                "role": "user",
                "content": (
                    "Generate 3 different search queries that would find relevant "
                    "passages to answer this question.\n"
                    f"Question: {original_query}\n"
                    "Return exactly 3 queries, one per line, no numbering."
                ),
            }]
        )
        # One query per line; skip blank lines the model may add
        queries = response.content[0].text.strip().splitlines()
        return [q.strip() for q in queries if q.strip()]
Retrieve for each rewritten query, deduplicate results, and pass the union to the model. This is sometimes called query expansion or multi-query retrieval.
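The retrieve, deduplicate, and union step can be sketched as follows. The retrieve callable is an assumption standing in for whatever search function you already have, and interleaving by rank (so each query's best hit is kept before any query's second hit) is one reasonable merge order, not the only one.

```python
from typing import Callable


def multi_query_retrieve(
    queries: list[str],
    retrieve: Callable[[str], list[str]],
    limit: int = 10,
) -> list[str]:
    """Run retrieval for every query and merge results without duplicates.

    Results are interleaved rank by rank across the queries, then truncated
    to `limit` chunks.
    """
    result_lists = [retrieve(q) for q in queries]
    longest = max((len(results) for results in result_lists), default=0)
    merged: list[str] = []
    seen: set[str] = set()
    for rank in range(longest):
        for results in result_lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
    return merged[:limit]


# Toy stand-in for a real search backend (hypothetical data).
fake_index = {
    "q1": ["a", "b"],
    "q2": ["b", "c"],
    "q3": ["a", "d", "e"],
}
union = multi_query_retrieve(["q1", "q2", "q3"], lambda q: fake_index[q])
```

The limit keeps the union from growing the prompt in proportion to the number of rewritten queries.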
Technique 5: Re-Ranking
Retrieve a larger candidate set (top 20-50) using embedding similarity, then use a cross-encoder to re-rank and select the final top 5. Cross-encoders compute relevance by attending to both the query and the document simultaneously, which is more accurate than embedding similarity but too slow to run over the entire index.
    from sentence_transformers import CrossEncoder

    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
        pairs = [(query, candidate) for candidate in candidates]
        scores = reranker.predict(pairs)
        ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
        return [doc for doc, _ in ranked[:top_k]]
Re-ranking typically improves precision at the cost of additional latency (roughly 100-300ms for a small re-ranker model, depending on hardware and the number of candidates).
Agentic RAG: When to Retrieve and What to Retrieve
Standard RAG retrieves once per query. Agentic RAG gives the LLM control over retrieval: it decides when to retrieve, what query to use, whether the retrieved result is sufficient, and whether to retrieve again with a different query.
This handles multi-hop questions (questions that require multiple retrieval steps), questions where the initial retrieval fails, and questions where the model needs to reformulate the query based on intermediate results.
The trade-off: agentic RAG is more capable but slower and more expensive. Use it when the query complexity justifies it.
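The control flow can be sketched as a bounded loop. In a real system the sufficiency check and the reformulation step would be LLM calls; here they are passed in as plain callables and replaced by toy stubs so the loop itself is visible. All function names and the sample corpus are hypothetical.

```python
from typing import Callable


def agentic_retrieve(
    question: str,
    retrieve: Callable[[str], list[str]],
    is_sufficient: Callable[[str, list[str]], bool],
    reformulate: Callable[[str, list[str]], str],
    max_steps: int = 3,
) -> tuple[list[str], bool]:
    """Retrieve repeatedly until the evidence is judged sufficient.

    Returns (evidence, sufficient). max_steps caps cost and latency.
    """
    query = question
    evidence: list[str] = []
    for _ in range(max_steps):
        for chunk in retrieve(query):
            if chunk not in evidence:
                evidence.append(chunk)
        if is_sufficient(question, evidence):
            return evidence, True
        # Use what has been found so far to decide the next query.
        query = reformulate(question, evidence)
    return evidence, False


# Toy stubs standing in for a search backend and two LLM calls.
corpus = {
    "who founded Acme?": ["Acme was founded by Dana Lee."],
    "where was Dana Lee born?": ["Dana Lee was born in Oslo."],
}


def toy_retrieve(query: str) -> list[str]:
    return corpus.get(query, [])


def toy_is_sufficient(question: str, evidence: list[str]) -> bool:
    return any("born in" in e for e in evidence)


def toy_reformulate(question: str, evidence: list[str]) -> str:
    if any("founded by Dana Lee" in e for e in evidence):
        return "where was Dana Lee born?"
    return "who founded Acme?"


evidence, sufficient = agentic_retrieve(
    "Where was the founder of Acme born?",
    toy_retrieve, toy_is_sufficient, toy_reformulate,
)
```

The step cap is the main cost control: each extra iteration adds at least one retrieval and, in a real deployment, one or two model calls.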
When Advanced RAG Is Worth the Complexity
Implement each technique in order of impact for your specific failure mode:
- Wrong chunks retrieved: start with hybrid search and query rewriting.
- Context cut off mid-sentence: implement parent-child chunking.
- Vocabulary mismatch: add HyDE.
- Relevant chunks retrieved but ranked too low: retrieve a larger candidate set and add re-ranking.
- Multi-hop or iterative questions: implement agentic RAG.
Do not implement all techniques simultaneously. Each adds latency and complexity. Add the technique that addresses the most frequent failure mode on your eval set, measure improvement, then decide whether to add more.
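Measuring improvement needs a retrieval metric that can be computed before and after each change. A minimal sketch, assuming an eval set that maps each query to the ids of chunks known to contain the answer: hit rate at k and mean reciprocal rank (MRR). The function names, query ids, and chunk ids are hypothetical.

```python
def hit_rate_at_k(
    results: dict[str, list[str]],
    relevant: dict[str, set[str]],
    k: int = 5,
) -> float:
    """Fraction of queries with at least one relevant chunk id in the top k."""
    if not results:
        return 0.0
    hits = sum(1 for q, ranked in results.items() if relevant[q] & set(ranked[:k]))
    return hits / len(results)


def mean_reciprocal_rank(
    results: dict[str, list[str]],
    relevant: dict[str, set[str]],
) -> float:
    """Average of 1 / rank of the first relevant chunk (0 when none is found)."""
    if not results:
        return 0.0
    total = 0.0
    for q, ranked in results.items():
        for rank, chunk_id in enumerate(ranked, start=1):
            if chunk_id in relevant[q]:
                total += 1.0 / rank
                break
    return total / len(results)


# Toy eval set: which chunk ids answer each query.
relevant = {"q1": {"c1"}, "q2": {"c9"}}
# Ranked chunk ids returned by two pipeline variants.
baseline = {"q1": ["c4", "c1", "c2"], "q2": ["c3", "c5", "c6"]}
with_hybrid = {"q1": ["c1", "c4", "c2"], "q2": ["c5", "c9", "c3"]}

baseline_hit = hit_rate_at_k(baseline, relevant, k=3)
hybrid_hit = hit_rate_at_k(with_hybrid, relevant, k=3)
baseline_mrr = mean_reciprocal_rank(baseline, relevant)
hybrid_mrr = mean_reciprocal_rank(with_hybrid, relevant)
```

Running the same eval set through each pipeline variant shows whether a technique earns its added latency and complexity.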