Advanced RAG: Beyond Basic Chunk Retrieval

Basic RAG retrieves the wrong chunks and loses context across chunk boundaries. Advanced techniques including hybrid search, HyDE, re-ranking, and agentic retrieval fix these problems systematically.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 18, 2026

9 min read

// tags

#rag #retrieval #hybrid-search #hyde #re-ranking #ai-agents

Basic RAG works by splitting documents into chunks, embedding them, and retrieving the top-K most similar chunks for a given query. In practice, this naive approach fails in ways that are frustrating to debug and expensive to fix after deployment. The retrieved chunks are often irrelevant, the chunking breaks context at critical boundaries, and the system has no way to recognize when the retrieval result is insufficient. Advanced RAG techniques address each of these failure modes directly.

Why Basic RAG Fails

Poor retrieval quality is the most common failure. Semantic similarity between a question and a chunk does not guarantee that the chunk contains the answer. "What are the payment terms?" might retrieve a chunk about payment methods (high semantic similarity) rather than the actual payment terms clause (lower similarity because it uses different vocabulary).

Context fragmentation happens when the answer spans multiple chunks or when a chunk starts mid-sentence because the text splitter counted characters without understanding sentence boundaries. The model receives incomplete information and either hallucinates the missing parts or gives an uncertain answer.

Naive chunking that splits on character count produces chunks that break at arbitrary points: mid-formula, mid-table, mid-numbered-list. The semantic coherence of the chunk is destroyed.
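To make the boundary problem concrete, here is a minimal sketch that compares a character-count splitter with one that packs whole sentences. The function names and the sample text are illustrative, and the regex sentence splitter is deliberately crude (it would mis-split abbreviations such as "e.g."); a production pipeline would likely use a proper sentence tokenizer.

```python
import re

def chunk_by_chars(text: str, size: int) -> list[str]:
    """Naive splitter: cut every `size` characters, ignoring sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def chunk_by_sentences(text: str, max_chars: int) -> list[str]:
    """Pack whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` becomes its own oversized chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # close the chunk before it overflows
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

text = (
    "Payment is due within 30 days. "
    "Late payments accrue 2% interest. "
    "Disputes go to arbitration."
)
naive = chunk_by_chars(text, 40)      # first chunk ends mid-word: "... Late paym"
aware = chunk_by_sentences(text, 40)  # every chunk is one complete sentence
```

With the same 40-character budget, the naive splitter cuts the second sentence in half, while the sentence-aware version keeps each clause intact.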

No quality signal on retrieval results: basic RAG always retrieves K chunks and always passes them to the model, even when none of them are relevant. Nothing in the pipeline flags a weak retrieval, so the model is pushed to answer from irrelevant context rather than say "I did not find a relevant answer."
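One simple way to add a quality signal is a score threshold in front of the model. The sketch below assumes the retriever returns a similarity score per chunk where higher means more similar. The `Hit` type, the function names, and the 0.75 cutoff are all hypothetical; a usable threshold depends on the embedding model and has to be calibrated on an eval set.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Hit:
    text: str
    score: float  # similarity score from the retriever; higher means more similar

def filter_relevant(hits: list[Hit], min_score: float) -> list[Hit]:
    """Keep only hits at or above `min_score`, best first."""
    kept = [h for h in hits if h.score >= min_score]
    return sorted(kept, key=lambda h: h.score, reverse=True)

def build_context(hits: list[Hit], min_score: float = 0.75) -> Optional[str]:
    """Return the context string, or None when nothing clears the threshold."""
    relevant = filter_relevant(hits, min_score)
    if not relevant:
        # Caller should respond that no relevant information was found
        return None
    return "\n\n".join(h.text for h in relevant)
```

When `build_context` returns `None`, the application can skip generation entirely or instruct the model to say that nothing relevant was found.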

Technique 1: Hybrid Search

Hybrid search combines semantic search (embedding similarity) with keyword search (BM25 or similar). Each retrieval method has complementary strengths.

Semantic search handles paraphrase and related concepts: "compensation" matches "salary" even though the words differ. Keyword search handles exact terminology: a query for "Section 12.4(b)" finds documents containing that exact string, which semantic search may miss because the embedding treats all legal section references as similar.

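A sketch with Elasticsearch, combining a BM25 match clause and a kNN clause in one query. This form relies on knn being available as a query clause, which requires a recent Elasticsearch release. OpenSearch supports the same pattern, but its knn query syntax differs: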

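from elasticsearch import Elasticsearch

# elasticsearch-py 8.x requires an explicit host; this URL is an assumption
es = Elasticsearch("http://localhost:9200")

def hybrid_search(query: str, index: str, top_k: int = 10) -> list[str]:
    # get_embedding is assumed to be defined elsewhere and to return the query
    # vector as a list of floats matching the "embedding" field's dimensions
    embedding = get_embedding(query)

    response = es.search(
        index=index,
        body={
            "query": {
                "bool": {
                    "should": [
                        # BM25 keyword match on the text field
                        {"match": {"content": {"query": query, "boost": 0.3}}},
                        # Approximate kNN on the dense vector field
                        {"knn": {"field": "embedding", "query_vector": embedding, "num_candidates": 50, "boost": 0.7}}
                    ]
                }
            },
            "size": top_k
        }
    )
    return [hit["_source"]["content"] for hit in response["hits"]["hits"]]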

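The boost weights scale each clause's score before the scores are summed. A common starting point is 0.3 keyword / 0.7 semantic, but BM25 and vector similarity scores sit on different scales, so the weights are not contribution percentages. Tune them on your eval set.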
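An alternative to tuning boosts is to fuse on ranks instead of scores. The sketch below uses reciprocal rank fusion (RRF), a different technique from the boosted query above: each retriever contributes 1 / (k + rank) per document, and the sums decide the final order. The document IDs are made up, and k = 60 is the conventional default rather than a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs; each list is ordered best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents ranked highly by any retriever get a larger share
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

keyword_ranking = ["doc_b", "doc_a", "doc_c"]   # e.g. from BM25
semantic_ranking = ["doc_a", "doc_d", "doc_b"]  # e.g. from kNN
fused = reciprocal_rank_fusion([keyword_ranking, semantic_ranking])
# fused == ["doc_a", "doc_b", "doc_d", "doc_c"]
```

Documents that appear in both lists (doc_a, doc_b) rise to the top, and no score normalization is needed because only rank positions are used.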

Technique 2: HyDE (Hypothetical Document Embeddings)

HyDE addresses a fundamental mismatch: queries are short and conversational, while documents are long and formal. Even when both are embedded with the same model, a question and the passage that answers it often land far apart in the embedding space.

HyDE solves this by generating a hypothetical answer to the query, embedding the hypothetical answer (not the query), and using that embedding for retrieval. The hypothetical answer is in the same linguistic register as the documents.

import anthropic

client = anthropic.Anthropic()

def hyde_retrieve(query: str, vector_store, top_k: int = 5):
    # Generate a hypothetical answer
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": f"Write a passage that would answer this question from a technical document: {query}"
        }]
    )
    hypothetical_answer = response.content[0].text

    # Embed the hypothetical answer, not the query
    hypothetical_embedding = get_embedding(hypothetical_answer)

    # Retrieve using the hypothetical answer's embedding
    return vector_store.similarity_search_by_vector(hypothetical_embedding, k=top_k)

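HyDE tends to help most on domain-specific corpora where query vocabulary differs significantly from document vocabulary. The cost is one extra LLM call per query, and a hypothetical answer that is confidently wrong can pull retrieval toward the wrong passages, so verify the gain on your eval set.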

Technique 3: Parent-Child Chunking

Retrieve small chunks for precision, return their larger parent chunks for context. This addresses the context fragmentation problem without bloating the retrieval index with large chunks.

The index contains small chunks (200-300 tokens) for high-precision embedding similarity. Each small chunk stores a reference to its parent chunk (500-1000 tokens, or the full document section). When a small chunk is retrieved, the system returns the parent chunk to the model, providing full context around the retrieved passage.

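from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core.retrievers import AutoMergingRetriever
from llama_index.core.storage.docstore import SimpleDocumentStore

# `documents` is assumed to be a list of Document objects loaded elsewhere.
# Chunk sizes (tokens, largest to smallest) are illustrative; tune for your corpus.
parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[2048, 512, 128])
nodes = parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)

# Store every node (parents and leaves) so parents can be looked up at merge time
docstore = SimpleDocumentStore()
docstore.add_documents(nodes)
storage_context = StorageContext.from_defaults(docstore=docstore)

# Build index from leaf nodes (smallest chunks)
leaf_index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)

# Retriever that merges to parent when enough children are retrieved
retriever = AutoMergingRetriever(leaf_index.as_retriever(similarity_top_k=6), storage_context)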

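LlamaIndex's AutoMergingRetriever implements a variant of this pattern: rather than always returning the parent, it replaces retrieved sibling chunks with their parent only when enough of them come from the same parent. The simple_ratio_thresh argument controls how many siblings count as enough.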
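For the simpler scheme described at the start of this section, where every retrieved small chunk is swapped for its parent, a framework is not strictly necessary. The following is a minimal sketch under assumed names (`ChildChunk`, `expand_to_parents`, and the in-memory `parents` dict are all hypothetical). In practice the parent texts would live in a document store keyed by ID.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ChildChunk:
    text: str
    parent_id: str  # key of the larger chunk this piece was cut from

def expand_to_parents(children: list[ChildChunk], parents: dict[str, str]) -> list[str]:
    """Map retrieved child chunks to parent texts, deduplicated, in retrieval order."""
    seen: set[str] = set()
    result: list[str] = []
    for child in children:
        if child.parent_id not in seen:
            seen.add(child.parent_id)
            result.append(parents[child.parent_id])
    return result

parents = {
    "sec-1": "Section 1. Payment is due within 30 days. Late payments accrue interest.",
    "sec-2": "Section 2. Disputes are resolved by arbitration.",
}
# Pretend these came back from the vector index, best match first
retrieved = [
    ChildChunk("Late payments accrue interest.", "sec-1"),
    ChildChunk("Payment is due within 30 days.", "sec-1"),
    ChildChunk("Disputes are resolved by arbitration.", "sec-2"),
]
context = expand_to_parents(retrieved, parents)
```

Two of the three retrieved children share a parent, so the model receives two parent sections rather than three fragments.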

Technique 4: Query Rewriting

The user's query may not be the best search query. A question like "how do I handle this?" contains no domain vocabulary. Query rewriting reformulates the question to improve retrieval.

def rewrite_query(original_query: str) -> list[str]:
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"""Generate 3 different search queries that would find relevant passages to answer this question.
Question: {original_query}
Return exactly 3 queries, one per line, no numbering."""
        }]
    )
    queries = response.content[0].text.strip().split("\n")
    return [q.strip() for q in queries if q.strip()]

Retrieve for each rewritten query, deduplicate results, and pass the union to the model. This is sometimes called query expansion or multi-query retrieval.
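The retrieve-and-deduplicate step might look like the sketch below. The `retrieve` callable is a stand-in for whatever retriever is in use, and the round-robin merge order and `max_results` cap are assumptions rather than part of the technique itself. They simply keep each query's best hits near the front.

```python
from typing import Callable

def multi_query_retrieve(
    queries: list[str],
    retrieve: Callable[[str], list[str]],
    max_results: int = 10,
) -> list[str]:
    """Run each query, merge results round-robin, and drop duplicate chunks."""
    result_lists = [retrieve(q) for q in queries]
    merged: list[str] = []
    seen: set[str] = set()
    longest = max((len(r) for r in result_lists), default=0)
    # Take rank 0 from every list, then rank 1, and so on
    for rank in range(longest):
        for results in result_lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
    return merged[:max_results]

# Toy retriever backed by a dict, for illustration only
fake_index = {
    "refund policy": ["chunk A", "chunk B"],
    "how to return an item": ["chunk B", "chunk C"],
}
chunks = multi_query_retrieve(list(fake_index), lambda q: fake_index[q])
# chunks == ["chunk A", "chunk B", "chunk C"]
```

Deduplicating on exact chunk text is the simplest option; if chunks carry stable IDs, deduplicating on those is more robust.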

Technique 5: Re-Ranking

Retrieve a larger candidate set (top 20-50) using embedding similarity, then use a cross-encoder to re-rank and select the final top 5. Cross-encoders compute relevance by attending to both the query and the document simultaneously, which is more accurate than embedding similarity but too slow to run over the entire index.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    pairs = [(query, candidate) for candidate in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

Re-ranking usually improves precision at the cost of added latency: roughly 100-300ms for a small re-ranker model, depending on hardware, candidate count, and passage length.

Agentic RAG: When to Retrieve and What to Retrieve

Standard RAG retrieves once per query. Agentic RAG gives the LLM control over retrieval: it decides when to retrieve, what query to use, whether the retrieved result is sufficient, and whether to retrieve again with a different query.

This handles multi-hop questions (questions that require multiple retrieval steps), questions where the initial retrieval fails, and questions where the model needs to reformulate the query based on intermediate results.

The trade-off: agentic RAG is more capable but slower and more expensive. Use it when the query complexity justifies it.
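The control flow can be sketched as a bounded loop. Everything here is hypothetical scaffolding: `retrieve` stands in for any retriever, and `next_query` stands in for an LLM call that inspects the evidence gathered so far and returns either a follow-up search query or `None` when the evidence is sufficient. The stubs below replace both so that the loop itself is runnable.

```python
from typing import Callable, Optional

def agentic_retrieve(
    question: str,
    retrieve: Callable[[str], list[str]],
    next_query: Callable[[str, list[str]], Optional[str]],
    max_steps: int = 3,
) -> list[str]:
    """Retrieve iteratively until `next_query` returns None or the step budget runs out."""
    evidence: list[str] = []
    query: Optional[str] = question
    for _ in range(max_steps):
        if query is None:
            break  # the decision step judged the evidence sufficient
        for chunk in retrieve(query):
            if chunk not in evidence:
                evidence.append(chunk)
        query = next_query(question, evidence)
    return evidence

# Stubs for a two-hop question; a real system would call a retriever and an LLM
docs = {
    "Who founded the company that makes X?": ["X is made by Acme."],
    "Acme founder": ["Acme was founded by Jo."],
}

def stub_retrieve(query: str) -> list[str]:
    return docs.get(query, [])

def stub_next_query(question: str, evidence: list[str]) -> Optional[str]:
    if any("founded by" in e for e in evidence):
        return None  # enough evidence to answer
    return "Acme founder"  # follow-up hop derived from the first result

evidence = agentic_retrieve(
    "Who founded the company that makes X?", stub_retrieve, stub_next_query
)
```

The `max_steps` cap is what keeps cost and latency bounded: without it, a model that never judges the evidence sufficient would loop indefinitely.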

When Advanced RAG Is Worth the Complexity

Implement each technique in order of impact for your specific failure mode:

Wrong chunks retrieved: start with hybrid search and query rewriting.
Context cut off mid-sentence: implement parent-child chunking.
Vocabulary mismatch: add HyDE.
Low precision within a large candidate set: add re-ranking.
Multi-hop or iterative questions: implement agentic RAG.

Do not implement all techniques simultaneously. Each adds latency and complexity. Add the technique that addresses the most frequent failure mode on your eval set, measure improvement, then decide whether to add more.
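Measuring improvement requires a retrieval metric that can be compared before and after each change. One simple option is hit rate at k, sketched below under the assumption that the eval set maps each query to the IDs of its known-relevant chunks. The function name, the query IDs, and the chunk IDs are all illustrative.

```python
def hit_rate_at_k(
    retrieved: dict[str, list[str]],
    relevant: dict[str, set[str]],
    k: int,
) -> float:
    """Fraction of eval queries with at least one relevant chunk ID in the top k."""
    if not relevant:
        return 0.0
    hits = sum(
        1
        for query, gold in relevant.items()
        if gold & set(retrieved.get(query, [])[:k])
    )
    return hits / len(relevant)

# Gold labels: which chunk IDs answer each eval query
gold = {"q1": {"c1"}, "q2": {"c7"}}
# Ranked chunk IDs returned by two pipeline variants
baseline = {"q1": ["c3", "c1"], "q2": ["c2", "c4"]}
with_hybrid = {"q1": ["c1", "c3"], "q2": ["c4", "c7"]}

before = hit_rate_at_k(baseline, gold, k=2)     # 0.5
after = hit_rate_at_k(with_hybrid, gold, k=2)   # 1.0
```

Running the same eval set through each variant gives a like-for-like number, which makes it easier to judge whether a technique earns its added latency.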

Keep Reading

LlamaIndex for RAG: A Practical Implementation Guide — how to implement a RAG pipeline from scratch with LlamaIndex
Memory in AI Agents: Short-Term, Long-Term, and Episodic — how retrieval patterns from advanced RAG apply to agent memory systems
LangChain Complete Guide 2026 — how to implement these techniques in a LangChain pipeline

