What LlamaIndex Does
LlamaIndex is a data framework for LLM applications. Its core abstraction is the index: a structure that ingests documents, chunks and embeds them, and exposes a query interface. A retrieval-augmented generation (RAG) pipeline built from scratch needs glue code for loading, chunking, embedding, storage, retrieval, and prompt assembly, which can easily run to hundreds of lines; LlamaIndex reduces this to a few dozen lines while keeping each stage configurable.
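To make the index abstraction concrete, here is a toy, standard-library-only sketch of the three stages (chunk, embed, query). It is not how LlamaIndex is implemented: the `ToyIndex` class, the bag-of-words "embedding", and the fixed word-count chunks are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(word.strip(".,?!").lower() for word in text.split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyIndex:
    """Hypothetical miniature index: chunk -> embed -> store -> retrieve."""

    def __init__(self, documents: list[str], chunk_words: int = 8):
        self.chunks: list[str] = []
        for doc in documents:
            words = doc.split()
            # Cut each document into consecutive chunks of chunk_words words
            for start in range(0, len(words), chunk_words):
                self.chunks.append(" ".join(words[start:start + chunk_words]))
        self.vectors = [embed(chunk) for chunk in self.chunks]

    def retrieve(self, question: str, top_k: int = 1) -> list[str]:
        """Return the top_k chunks most similar to the question."""
        q = embed(question)
        ranked = sorted(
            zip(self.chunks, self.vectors),
            key=lambda pair: cosine(q, pair[1]),
            reverse=True,
        )
        return [chunk for chunk, _ in ranked[:top_k]]


docs = [
    "PagedAttention stores the KV cache in fixed-size blocks.",
    "HNSW builds a layered graph for approximate nearest-neighbor search.",
]
index = ToyIndex(docs)
top_chunks = index.retrieve("What is PagedAttention?")
print(top_chunks)
```

A real pipeline would then place the retrieved chunks into a prompt and send it to the LLM; that final step is what the query engine adds on top of retrieval.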
Installation
pip install llama-index llama-index-llms-openai llama-index-embeddings-openai llama-index-embeddings-huggingface
Five-Minute RAG From a PDF
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings
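# Configure models (the OpenAI LLM reads the OPENAI_API_KEY environment variable)
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0.1)
# "local" runs a HuggingFace embedding model on this machine;
# it needs the llama-index-embeddings-huggingface package
Settings.embed_model = "local"
# Load every supported file (PDF, Markdown, plain text, ...) under ./docs/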
documents = SimpleDirectoryReader("./docs/").load_data()
# Build index (chunks, embeds, stores in-memory)
index = VectorStoreIndex.from_documents(documents)
# Query
query_engine = index.as_query_engine()
response = query_engine.query("What is PagedAttention?")
print(response)
Document Loaders
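LlamaIndex offers 100+ loaders for common data sources. Each reader is distributed as its own package and must be installed separately:
pip install llama-index-readers-web llama-index-readers-github llama-index-readers-notion
import os
from llama_index.readers.web import SimpleWebPageReader
from llama_index.readers.github import GithubClient, GithubRepositoryReader
from llama_index.readers.notion import NotionPageReader
web_docs = SimpleWebPageReader().load_data(["https://docs.vllm.ai/"])
# The GitHub reader needs an authenticated client and a branch (or commit) to read
github_client = GithubClient(github_token=os.environ["GITHUB_TOKEN"])
github_docs = GithubRepositoryReader(github_client=github_client, owner="vllm-project", repo="vllm").load_data(branch="main")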
Node Parsers and Chunking
Control chunking strategy:
from llama_index.core.node_parser import SentenceSplitter, SemanticSplitterNodeParser
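# Fixed-size chunks (sizes are in tokens) with overlap between neighboring chunks
parser = SentenceSplitter(chunk_size=512, chunk_overlap=64)
# Semantic chunking: starts a new chunk where adjacent sentences diverge in embedding space
semantic_parser = SemanticSplitterNodeParser(embed_model=Settings.embed_model)
nodes = semantic_parser.get_nodes_from_documents(documents)
# Build an index directly from the parsed nodes
index = VectorStoreIndex(nodes)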
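The effect of a chunk size and an overlap is easiest to see on a tiny input. The sketch below is a simplification under stated assumptions: it treats whitespace-separated words as the unit, whereas `SentenceSplitter` counts tokenizer tokens and prefers to break at sentence boundaries. The helper name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(words: list[str], chunk_size: int, chunk_overlap: int) -> list[list[str]]:
    """Slide a window of chunk_size words, stepping by chunk_size - chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


words = [f"w{i}" for i in range(10)]
chunks = split_with_overlap(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last word of its predecessor, so a sentence that straddles a boundary is still partly visible in both chunks. Larger overlaps reduce the chance of cutting context in half at the cost of more chunks to embed.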
Response Synthesizer Modes
from llama_index.core import get_response_synthesizer
# compact: default, fits context into fewest LLM calls
# refine: iterative, updates answer as it reads each chunk
# tree_summarize: builds summary tree bottom-up
synthesizer = get_response_synthesizer(response_mode="tree_summarize")
query_engine = index.as_query_engine(response_synthesizer=synthesizer)
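The three modes differ mainly in how many LLM calls they make and what each call sees. The following sketch mimics those call patterns with a stub in place of the model. It is an illustration under assumptions, not LlamaIndex's code: `fake_llm`, the character budget `max_chars` (a stand-in for the context window), and the fixed `fan_in` of the tree are all hypothetical.

```python
calls: list[str] = []


def fake_llm(prompt: str) -> str:
    """Stand-in for an LLM call: records the prompt and returns a placeholder."""
    calls.append(prompt)
    return f"answer#{len(calls)}"


def refine(question: str, chunks: list[str]) -> str:
    """One call per chunk; each call sees the previous answer and may revise it."""
    answer = ""
    for chunk in chunks:
        answer = fake_llm(f"Q: {question}\nExisting answer: {answer}\nContext: {chunk}")
    return answer


def compact(question: str, chunks: list[str], max_chars: int) -> str:
    """Pack chunks into as few prompts as fit in max_chars, then refine over the packs."""
    packs, current = [], ""
    for chunk in chunks:
        if current and len(current) + len(chunk) + 1 > max_chars:
            packs.append(current)
            current = chunk
        else:
            current = f"{current}\n{chunk}" if current else chunk
    if current:
        packs.append(current)
    return refine(question, packs)


def tree_summarize(question: str, chunks: list[str], fan_in: int = 2) -> str:
    """Summarize groups of fan_in texts, then summarize the summaries, until one remains."""
    level = list(chunks)
    while True:
        level = [
            fake_llm(f"Q: {question}\nContext: " + "\n".join(level[i:i + fan_in]))
            for i in range(0, len(level), fan_in)
        ]
        if len(level) <= 1:
            return level[0] if level else ""


def count_calls(mode, *args, **kwargs) -> int:
    """Run one mode and report how many stub LLM calls it made."""
    calls.clear()
    mode(*args, **kwargs)
    return len(calls)


chunks = ["a" * 40, "b" * 40, "c" * 40, "d" * 40]
refine_calls = count_calls(refine, "q", chunks)
compact_calls = count_calls(compact, "q", chunks, max_chars=100)
tree_calls = count_calls(tree_summarize, "q", chunks)
print(refine_calls, compact_calls, tree_calls)  # prints: 4 2 3
```

With four chunks, refine makes one call per chunk, compact halves that by packing two chunks per prompt, and the tree needs two leaf summaries plus one root summary.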
Sub-Question Query Engine
Break complex questions into sub-questions answered by different indices:
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool
# vllm_index and ollama_index are assumed to be two indices built earlier,
# e.g. with VectorStoreIndex.from_documents(...)
tools = [
    QueryEngineTool.from_defaults(query_engine=vllm_index.as_query_engine(), name="vllm_docs", description="vLLM documentation"),
    QueryEngineTool.from_defaults(query_engine=ollama_index.as_query_engine(), name="ollama_docs", description="Ollama documentation"),
]
engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=tools)
response = engine.query("Compare vLLM and Ollama for production serving")
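Conceptually, the engine runs three steps: generate sub-questions, route each one to the tool it names, and combine the partial answers. The sketch below walks through that flow with stubs. Everything in it is hypothetical: the `Tool` dataclass, the rule-based `generate_sub_questions` (a real generator asks an LLM to choose tools from their descriptions), and the canned answers.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    """Hypothetical stand-in for a named query engine with a description."""
    name: str
    description: str
    query: Callable[[str], str]


def generate_sub_questions(question: str, tools: list[Tool]) -> list[tuple[str, str]]:
    """Stub question generator: emits one (tool name, sub-question) pair per tool."""
    return [(tool.name, f"{question} (answer using {tool.description})") for tool in tools]


def sub_question_query(question: str, tools: list[Tool]) -> str:
    """Route each sub-question to its tool and collect the partial answers."""
    by_name = {tool.name: tool for tool in tools}
    partial_answers = []
    for tool_name, sub_question in generate_sub_questions(question, tools):
        partial_answers.append(f"[{tool_name}] {by_name[tool_name].query(sub_question)}")
    # A real engine would pass the partial answers to an LLM for a final synthesis;
    # here they are simply joined.
    return "\n".join(partial_answers)


tools = [
    Tool("vllm_docs", "vLLM documentation", lambda q: "stub answer from vLLM docs"),
    Tool("ollama_docs", "Ollama documentation", lambda q: "stub answer from Ollama docs"),
]
answer = sub_question_query("Compare vLLM and Ollama for production serving", tools)
print(answer)
```

The tool descriptions matter in practice because they are the only information the question generator has when deciding which index should answer which sub-question.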
Streaming Responses
query_engine = index.as_query_engine(streaming=True)
streaming_response = query_engine.query("Explain HNSW indexing")
for text_chunk in streaming_response.response_gen:
    print(text_chunk, end="", flush=True)
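A common follow-up is wanting the full answer after streaming it, for logging or caching. The sketch below shows that pattern with a plain generator standing in for `response_gen`; `fake_token_stream` is a hypothetical helper, and real streams yield model tokens rather than whole words.

```python
from typing import Iterator


def fake_token_stream(text: str) -> Iterator[str]:
    """Stand-in for response_gen: yields the answer one piece at a time."""
    for word in text.split(" "):
        yield word + " "


received = []
for text_chunk in fake_token_stream("HNSW is a graph-based index"):
    received.append(text_chunk)  # keep a copy so the full answer is available afterwards
    print(text_chunk, end="", flush=True)
print()

full_answer = "".join(received).strip()
```

Because a generator can only be consumed once, collecting the pieces during the loop is the simplest way to keep both the live output and the final text.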
Metadata Filtering
from llama_index.core.vector_stores import MetadataFilter, MetadataFilters
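# Exact-match filter: only nodes whose metadata has source == "technical-docs" are retrieved.
# The "source" key must have been set in each document's metadata at ingestion time.
filters = MetadataFilters(filters=[MetadataFilter(key="source", value="technical-docs")])
query_engine = index.as_query_engine(filters=filters)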
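The key point about metadata filters is that they restrict which nodes are eligible for retrieval, independent of similarity. A minimal sketch of that behavior, using a hypothetical `Node` dataclass with precomputed similarity scores rather than any LlamaIndex class:

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for an indexed chunk with metadata and a similarity score."""
    text: str
    score: float
    metadata: dict = field(default_factory=dict)


def retrieve(nodes: list[Node], filters: dict, top_k: int = 2) -> list[Node]:
    """Keep nodes whose metadata matches every filter exactly, then rank by score."""
    matching = [
        node for node in nodes
        if all(node.metadata.get(key) == value for key, value in filters.items())
    ]
    return sorted(matching, key=lambda node: node.score, reverse=True)[:top_k]


nodes = [
    Node("blog post", 0.95, {"source": "blog"}),
    Node("API reference", 0.80, {"source": "technical-docs"}),
    Node("install guide", 0.60, {"source": "technical-docs"}),
    Node("untagged note", 0.99),
]
hits = retrieve(nodes, {"source": "technical-docs"})
print([node.text for node in hits])
```

Note that the highest-scoring node is excluded because it carries no `source` key at all: a filter on a key that was never set at ingestion time silently returns nothing for those nodes.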
Full documentation at docs.llamaindex.ai.