The OCR Problem in Document RAG
Standard PDF RAG pipelines look like: extract text with OCR → chunk → embed → retrieve. This pipeline quietly discards 30-60% of information in typical business documents:
- Tables: OCR flattens rows into prose, destroying relational structure
- Charts and graphs: Converted to alt-text or skipped entirely
- Two-column layouts: Text order scrambles when OCR reads straight across both columns instead of finishing one column before starting the next
- Formulas and equations: OCR produces garbled LaTeX or nothing
- Annotations and highlights: Invisible to text extractors
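A toy illustration of the two-column failure mode, using made-up text boxes and coordinates (everything in the snippet is hypothetical and stands in for real OCR output). An extractor that sorts boxes purely top-to-bottom interleaves the two columns, while a column-aware ordering keeps each sentence intact:

```python
# Each box is (x, y, text); x and y are the top-left corner of the text box.
# Hypothetical layout: left column at x=50, right column at x=320.
boxes = [
    (50, 100, "Revenue grew"),
    (320, 100, "Costs fell"),
    (50, 120, "12% in Q3."),
    (320, 120, "4% in Q3."),
]


def naive_order(boxes):
    """Read strictly top-to-bottom, then left-to-right (ignores columns)."""
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))
    return " ".join(text for _, _, text in ordered)


def column_aware_order(boxes, column_split_x):
    """Read the whole left column first, then the whole right column."""
    ordered = sorted(boxes, key=lambda b: (b[0] >= column_split_x, b[1], b[0]))
    return " ".join(text for _, _, text in ordered)


naive = naive_order(boxes)
fixed = column_aware_order(boxes, column_split_x=300)
print(naive)  # Revenue grew Costs fell 12% in Q3. 4% in Q3.
print(fixed)  # Revenue grew 12% in Q3. Costs fell 4% in Q3.
```

Real layouts rarely offer a clean split point like `column_split_x`, which is part of why layout analysis is fragile in practice.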
ColPali asks a different question: what if we never extract text at all?
The ColPali Approach
ColPali renders each PDF page as an image and embeds it using a vision-language model — no OCR, no text extraction. At query time, the text query is matched against page image embeddings using late interaction scoring.
Architecture:
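- PaliGemma backbone: Google's vision-language model splits each page image into 14x14-pixel patches and produces one embedding per patch. In the configuration described in the ColPali paper, the input is 448x448 pixels, which gives a 32x32 grid of 1,024 patch vectors per page, each projected to 128 dimensions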
- Late interaction (ColBERT-style): Query tokens are matched against document patch tokens with MaxSim scoring — each query token finds its best-matching page patch
- Aggregated score: Sum of MaxSim scores across query tokens becomes the relevance score
The HuggingFace ColPali model includes the full PaliGemma fine-tune.
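The late-interaction scoring described above can be sketched in a few lines. This is a minimal plain-Python version with toy 2-D vectors, meant to show the arithmetic rather than how any library implements it; the names `maxsim_score`, `page_a`, and `page_b` are made up for illustration, and real systems compute the same thing as a batched matrix product.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vectors, page_vectors):
    """Late-interaction score: for each query token vector, take its best
    dot product over all page patch vectors, then sum over query tokens."""
    return sum(
        max(dot(q, p) for p in page_vectors)
        for q in query_vectors
    )


# Toy 2-D embeddings (real embeddings are much higher-dimensional).
query = [[1.0, 0.0], [0.0, 1.0]]                 # two query tokens
page_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]    # three patches
page_b = [[0.3, 0.3], [0.4, 0.1]]                # two patches

score_a = maxsim_score(query, page_a)  # best matches: 0.9 and 0.8
score_b = maxsim_score(query, page_b)  # best matches: 0.4 and 0.3
print(score_a > score_b)  # True
```

Each query token picks its own best patch, so a page can score well when different regions of it answer different parts of the query.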
ViDoRe Benchmark
On the ViDoRe benchmark (visual document retrieval), ColPali achieves significantly higher nDCG@5 than OCR-based pipelines, particularly on document types where OCR degrades:
- Financial reports with tables: +18 nDCG@5 over BM25+OCR
- Scientific papers with figures: +24 nDCG@5
- Slide decks: +31 nDCG@5
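For readers unfamiliar with the metric, here is a minimal sketch of nDCG@5 with binary relevance labels and the usual log2 rank discount. It simplifies by assuming every relevant page appears somewhere in the ranked list that is passed in; a benchmark harness would compute the ideal ranking from the full set of relevance judgments.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked results."""
    return sum(
        rel / math.log2(rank + 2)  # rank is 0-based, so the top hit divides by log2(2) = 1
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k=5):
    """DCG of the ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance of each retrieved page, in ranked order (1 = relevant, 0 = not).
perfect = ndcg_at_k([1, 0, 0, 0, 0])  # relevant page ranked first
third = ndcg_at_k([0, 0, 1, 0, 0])    # relevant page ranked third
print(perfect, third)  # 1.0 0.5
```

Moving the only relevant page from third place to first doubles the score, which is why retrieval gains on table-heavy or figure-heavy pages show up so clearly in this metric.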
Running ColPali
import torch
from PIL import Image
from colpali_engine.models import ColPali, ColPaliProcessor

model = ColPali.from_pretrained(
    "vidore/colpali-v1.2",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
).eval()
processor = ColPaliProcessor.from_pretrained("vidore/colpali-v1.2")

# Convert PDF pages to images (e.g., with pdf2image)
page_images = [Image.open("page_1.png"), Image.open("page_2.png")]

# Embed pages
page_inputs = processor.process_images(page_images).to(model.device)
with torch.no_grad():
    page_embeddings = model(**page_inputs)

# Embed query
queries = ["What was the Q3 revenue?"]
query_inputs = processor.process_queries(queries).to(model.device)
with torch.no_grad():
    query_embeddings = model(**query_inputs)

# Score pages against query; scores has shape (num_queries, num_pages)
scores = processor.score_multi_vector(query_embeddings, page_embeddings)
best_page = scores[0].argmax().item()  # 0-based index into page_images
print(f"Most relevant page: {best_page}")
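The example above embeds every page in a single forward pass, which may not fit in GPU memory for a long PDF. One possible workaround is to embed pages in small batches. The sketch below shows only the batching logic; `batched` and `embed_in_batches` are hypothetical helpers, not part of `colpali_engine`, and a stand-in embedder is used so the snippet runs without a GPU.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(items, embed_batch, batch_size=4):
    """Apply embed_batch to each slice and collect one embedding per item, in order."""
    embeddings = []
    for batch in batched(items, batch_size):
        embeddings.extend(embed_batch(batch))
    return embeddings


# Stand-in embedder: "embeds" a page name as its length, so no model is needed.
fake_pages = ["page_1.png", "page_2.png", "page_3.png", "page_10.png", "page_11.png"]
fake_embeddings = embed_in_batches(
    fake_pages,
    lambda batch: [len(name) for name in batch],
    batch_size=2,
)
print(fake_embeddings)  # [10, 10, 10, 11, 11]
```

With the real model, `embed_batch` could wrap `processor.process_images`, run the model under `torch.no_grad()`, and return the per-page tensors moved to CPU. Whether `score_multi_vector` accepts a list of per-page tensors depends on the installed `colpali_engine` version, so check that before relying on it.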
Vespa.ai Integration
For production-scale deployment with millions of pages, Vespa.ai supports ColBERT-style multi-vector indexing natively. The ColPali team provides Vespa application packages in the repository. Vespa's tensor computations handle the MaxSim scoring at scale without loading all page embeddings into GPU memory.
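A back-of-the-envelope estimate shows why index size matters at this scale. The figures below are assumptions, not measurements: roughly 1,030 vectors per page (1,024 patches plus a few extra tokens) and 128 dimensions, as reported for the ColPali paper's configuration, stored as 2-byte bfloat16 values with no compression or index overhead.

```python
def index_size_gb(num_pages, vectors_per_page=1030, dim=128, bytes_per_value=2):
    """Raw size of a multi-vector index in GB, ignoring any index overhead."""
    return num_pages * vectors_per_page * dim * bytes_per_value / 1e9


per_page_kb = 1030 * 128 * 2 / 1e3      # one page: about 264 KB
one_million = index_size_gb(1_000_000)  # one million pages: about 264 GB
print(f"{per_page_kb:.1f} KB per page, {one_million:.1f} GB per million pages")
```

At a few hundred gigabytes per million pages, the embeddings cannot sit in GPU memory, and scoring has to run inside a serving engine that can work against memory-resident or disk-backed tensors.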
When to Use ColPali
ColPali is the right choice for annual reports, technical manuals, research papers with figures, slide decks, and any other documents where layout carries meaning. For plain-text documents (support tickets, emails, plain blog posts), a traditional hybrid of text embeddings and BM25 remains faster and cheaper.