LlamaIndex is the right tool when your primary use case is retrieval-augmented generation: loading documents, indexing them, and answering questions over them. It requires significantly less boilerplate than LangChain for this specific problem, and its API is designed around the retrieval pipeline rather than the general LLM application pattern. If your agent needs to answer questions from a document corpus, start with LlamaIndex.
How LlamaIndex Differs From LangChain
LangChain is a general-purpose LLM framework. It handles retrieval, agents, memory, and chains, but it treats retrieval as one feature among many. LlamaIndex treats retrieval as the primary feature. The result is a more opinionated API with less configuration required to get a working RAG pipeline.
A typical LangChain RAG setup has you instantiate a document loader, a text splitter, an embedding model, a vector store, a retriever, and a chain, and then wire them together yourself. LlamaIndex wraps these steps into fewer abstractions while still allowing customization at each layer.
Core Components
SimpleDirectoryReader is how you load documents. Point it at a folder and it handles PDFs, Word files, text files, Markdown, and HTML automatically:
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader("./docs").load_data()
No manual file-type handling. No separate loaders per extension. This alone saves time on projects with mixed document types.
VectorStoreIndex takes your documents, chunks them, embeds the chunks, and stores them in a vector index. By default it uses an in-memory store, which is fine for development:
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex.from_documents(documents)
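To build intuition for what "embed the chunks and store them in a vector index" means, here is a minimal conceptual sketch in plain Python. It is not how LlamaIndex implements VectorStoreIndex: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and `ToyVectorIndex` is a hypothetical name used only for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorIndex:
    """Hypothetical in-memory index: keeps (chunk, vector) pairs in a list."""

    def __init__(self, chunks: list[str]) -> None:
        self.entries = [(chunk, embed(chunk)) for chunk in chunks]

    def retrieve(self, query: str, top_k: int = 2) -> list[tuple[float, str]]:
        """Score every stored chunk against the query and keep the best top_k."""
        query_vec = embed(query)
        scored = [(cosine(query_vec, vec), chunk) for chunk, vec in self.entries]
        return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


chunks = [
    "Invoices are payable within 30 days of receipt.",
    "Either party may terminate with 60 days written notice.",
    "The supplier warrants the goods against defects for one year.",
]
toy_index = ToyVectorIndex(chunks)
top_hits = toy_index.retrieve("When are invoices payable?", top_k=1)
print(top_hits[0][1])
```

A real index swaps the word counts for dense vectors from an embedding model and the linear scan for an approximate nearest-neighbor search, but the retrieve-by-similarity idea is the same.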
For production, you swap the storage context to point at Pinecone, Chroma, Qdrant, or pgvector:
from llama_index.vector_stores.pinecone import PineconeVectorStore
from llama_index.core import StorageContext
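# pinecone_index is assumed to be an existing Pinecone index handle created with the Pinecone client.
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)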
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
The index and query interface stay the same. Switching backends only changes how the storage context is constructed.
QueryEngine is the question-answering layer. It retrieves relevant chunks and synthesizes an answer:
query_engine = index.as_query_engine()
response = query_engine.query("What are the payment terms in the contract?")
print(response.response)
The QueryEngine handles retrieval, context assembly, and the LLM call in a single method call. You can configure the number of retrieved chunks (similarity_top_k), the LLM, and the response mode.
The Response Synthesizer is what turns retrieved chunks into a coherent answer. LlamaIndex provides several modes: compact (fits as many chunks as possible into each LLM call), refine (iteratively refines an answer chunk by chunk), and tree_summarize (builds a tree of summaries for large document sets). The default compact mode works well for most use cases.
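The practical difference between refine and compact is the number of LLM calls. The sketch below illustrates the idea in plain Python under simplifying assumptions: `CountingLLM` is a stub that only counts calls, the budget is measured in characters instead of tokens, and the prompts are hypothetical. It is an illustration of the two strategies, not LlamaIndex's own implementation.

```python
from typing import Callable, Optional

LLM = Callable[[str], str]


class CountingLLM:
    """Stub LLM for illustration: counts calls instead of calling a model."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return f"answer after call {self.calls}"


def refine(llm: LLM, question: str, contexts: list[str]) -> str:
    """One LLM call per context: answer from the first, then revise with each later one."""
    answer: Optional[str] = None
    for context in contexts:
        if answer is None:
            prompt = f"Context:\n{context}\nQuestion: {question}"
        else:
            prompt = (
                f"Existing answer: {answer}\nNew context:\n{context}\n"
                f"Refine the answer to: {question}"
            )
        answer = llm(prompt)
    return answer or ""


def pack(chunks: list[str], max_chars: int) -> list[str]:
    """Greedily merge consecutive chunks while the merged text fits the budget."""
    batches: list[str] = []
    current = ""
    for chunk in chunks:
        candidate = f"{current}\n{chunk}" if current else chunk
        if current and len(candidate) > max_chars:
            batches.append(current)
            current = chunk
        else:
            current = candidate
    if current:
        batches.append(current)
    return batches


def compact(llm: LLM, question: str, chunks: list[str], max_chars: int = 120) -> str:
    """Pack chunks into as few prompts as possible, then refine across those prompts."""
    return refine(llm, question, pack(chunks, max_chars))


chunks = [
    "Refunds are issued within 14 days.",
    "Items must be unused.",
    "Shipping fees are not refunded.",
]
refine_llm, compact_llm = CountingLLM(), CountingLLM()
refine(refine_llm, "Summarize the refund policy.", chunks)
compact(compact_llm, "Summarize the refund policy.", chunks)
print("refine calls:", refine_llm.calls, "compact calls:", compact_llm.calls)
```

With three short chunks that fit into one packed prompt, refine makes one call per chunk while compact makes a single call, which is why compact is usually the cheaper choice.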
A Full Working Example
import os
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
# Configure models
Settings.llm = OpenAI(model="gpt-4o", api_key=os.environ["OPENAI_API_KEY"])
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
# Load and index documents
documents = SimpleDirectoryReader("./knowledge_base").load_data()
index = VectorStoreIndex.from_documents(documents)
# Query
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("Summarize the refund policy.")
print(response.response)
print("Sources:", [node.metadata for node in response.source_nodes])
The source nodes tell you which document chunks were used to generate the answer, which is important for citations and for debugging wrong answers.
When to Choose LlamaIndex Over LangChain
Choose LlamaIndex when:
- The core use case is document Q&A or RAG over a document corpus.
- You want a simpler API with fewer concepts to learn.
- You need built-in source attribution (which chunks were used).
- You are building a chat interface over internal documents (support knowledge base, legal docs, product manuals).
Stick with LangChain when:
- You need multi-tool agents with complex tool use patterns.
- You are building pipelines that go well beyond retrieval (data transformation, multi-model routing, complex memory).
- Your team already knows LangChain and the project is not purely RAG.
LlamaIndex and LangChain are not mutually exclusive. It is possible to use LlamaIndex's retrieval pipeline inside a LangChain agent by wrapping the query engine as a tool.
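The wrapping pattern is simple enough to sketch without either framework. Agent frameworks generally model a tool as a name, a description, and a callable that maps a string to a string. In the sketch below, `Tool`, `query_engine_as_tool`, and `StubQueryEngine` are hypothetical names for illustration; with a real LlamaIndex query engine you would pass the engine itself, assuming its response object converts to text with `str()`.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class QueryEngineLike(Protocol):
    """Anything with a query(question) method, such as a LlamaIndex query engine."""

    def query(self, question: str) -> object: ...


@dataclass
class Tool:
    """Hypothetical tool record: the name/description/callable triple agents expect."""

    name: str
    description: str
    func: Callable[[str], str]


def query_engine_as_tool(engine: QueryEngineLike, name: str, description: str) -> Tool:
    """Expose the engine's query method as a string-in, string-out tool."""
    return Tool(name=name, description=description, func=lambda q: str(engine.query(q)))


class StubQueryEngine:
    """Stand-in for a real query engine so the sketch runs without API keys."""

    def query(self, question: str) -> str:
        return f"stub answer to: {question}"


docs_tool = query_engine_as_tool(
    StubQueryEngine(),
    name="internal_docs",
    description="Answers questions about internal documents.",
)
print(docs_tool.func("What is the notice period?"))
```

The description matters more than it looks: the agent's LLM reads it to decide when to call the tool, so state plainly which documents the engine covers.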
Evaluation With TruLens and RAGAs
A RAG pipeline that retrieves wrong chunks or produces hallucinated answers is worse than no RAG at all. Evaluation is not optional.
TruLens provides RAG triad evaluation: context relevance (are the retrieved chunks relevant to the question?), groundedness (is the answer grounded in the retrieved context?), and answer relevance (does the answer actually address the question?). It integrates directly with LlamaIndex:
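from trulens.apps.llamaindex import TruLlama
# f_groundedness, f_answer_relevance, and f_context_relevance are TruLens
# feedback functions assumed to be defined earlier (not shown here).
tru_recorder = TruLlama(
    query_engine,
    app_name="contract-qa",
    feedbacks=[f_groundedness, f_answer_relevance, f_context_relevance],
)
# Queries made inside the context manager are recorded and scored.
with tru_recorder as recording:
    response = query_engine.query("What is the notice period?")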
RAGAs (Retrieval Augmented Generation Assessment) offers similar metrics. Several of them, such as faithfulness and answer relevancy, work without ground-truth labels by using LLMs to evaluate LLM output, while others, such as context recall, need reference answers. Both tools are worth running on a representative test set before deploying to production.
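Running a test set through the pipeline and flagging low scores can be sketched as follows. This is a deliberately crude illustration: `lexical_groundedness` is a hypothetical token-overlap proxy, not the LLM-judged groundedness that TruLens or RAGAs compute, and the records and the 0.5 threshold are made up for the example.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word and number tokens in a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def lexical_groundedness(answer: str, contexts: list[str]) -> float:
    """Fraction of answer tokens that also appear in the retrieved contexts."""
    answer_tokens = tokens(answer)
    if not answer_tokens:
        return 0.0
    context_tokens: set[str] = set()
    for context in contexts:
        context_tokens |= tokens(context)
    return len(answer_tokens & context_tokens) / len(answer_tokens)


# Illustrative records: each pairs a generated answer with its retrieved chunks.
records = [
    {
        "answer": "Notice period: 60 days.",
        "contexts": ["Either party may terminate with 60 days notice."],
    },
    {
        "answer": "Refunds take 5 weeks.",
        "contexts": ["Refunds are issued within 14 days."],
    },
]

scores = [lexical_groundedness(r["answer"], r["contexts"]) for r in records]
# Flag answers whose content is mostly absent from the retrieved context.
flagged = [r["answer"] for r, s in zip(records, scores) if s < 0.5]
print(scores, flagged)
```

The useful part is the loop, not the metric: keep a fixed set of questions, score every answer against its source chunks, and review whatever falls below the threshold before each release.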
Chunking Strategy Matters More Than People Think
The default chunking (1024 tokens, 20-token overlap) works for homogeneous text. For heterogeneous document sets, it often falls short: a legal contract with numbered clauses needs different chunk boundaries than a product manual with tables. A SentenceSplitter with tuned chunk_size and chunk_overlap, or the semantic splitter (SemanticSplitterNodeParser), tends to produce better chunks for structured documents. The semantic splitter uses embedding similarity between adjacent sentences to find natural break points rather than cutting at a fixed token count.
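To see why fixed-size windows struggle with structured text, here is a minimal sketch of sliding-window chunking with overlap. It assumes whitespace-separated words as a stand-in for tokens, and `chunk_with_overlap` is a hypothetical helper, not a LlamaIndex function.

```python
def chunk_with_overlap(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens with the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


# Three four-word "clauses"; a five-token window cuts across clause boundaries.
words = (
    "clause one covers payment "
    "clause two covers termination "
    "clause three covers liability"
).split()
word_chunks = chunk_with_overlap(words, chunk_size=5, overlap=1)
for chunk in word_chunks:
    print(" ".join(chunk))
```

The first two windows each end with the opening word of the next clause, so no chunk lines up cleanly with a single clause. Structure-aware splitting exists to avoid exactly this kind of boundary mismatch.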