ColPali: Visual RAG Without OCR Using Late Interaction Embeddings

ColPali treats PDF pages as images and uses vision-language embeddings with ColBERT-style late interaction scoring, bypassing the lossy OCR pipeline that breaks table, chart, and layout-heavy document retrieval.

Mahmudul Haque Qudrati

CEO & ML Engineer

March 25, 2026

7 min read

// tags

#colpali #multimodal-rag #vision #pdf #late-interaction

The OCR Problem in Document RAG

Standard PDF RAG pipelines look like this: extract text with OCR → chunk → embed → retrieve. This pipeline can quietly discard an estimated 30-60% of the information in typical business documents:

Tables: OCR flattens rows into prose, destroying relational structure
Charts and graphs: Converted to alt-text or skipped entirely
Two-column layouts: Text order scrambles when OCR reads straight across both columns instead of down each column in turn
Formulas and equations: OCR produces garbled LaTeX or nothing
Annotations and highlights: Invisible to text extractors
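
To make the layout failure concrete, here is a minimal sketch of how reading order breaks on a two-column page. The word boxes, coordinates, and the column split are hypothetical values invented for illustration; real OCR engines use more elaborate layout analysis, so treat this as a toy model rather than a description of any particular tool.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    text: str
    x: float  # left edge of the word box
    y: float  # top edge of the word box (grows downward)


# Hypothetical word boxes: left column starts at x=0, right column at x=300.
WORDS = [
    Word("Revenue", 0, 0), Word("grew", 60, 0),
    Word("Costs", 300, 0), Word("fell", 350, 0),
    Word("in", 0, 20), Word("Q3.", 20, 20),
    Word("in", 300, 20), Word("Q3.", 320, 20),
]


def naive_order(words: List[Word]) -> str:
    """Read strictly top-to-bottom, left-to-right, ignoring columns."""
    return " ".join(w.text for w in sorted(words, key=lambda w: (w.y, w.x)))


def column_aware_order(words: List[Word], column_split_x: float) -> str:
    """Read the whole left column first, then the whole right column."""
    def reading_key(w: Word):
        return (w.y, w.x)

    left = sorted((w for w in words if w.x < column_split_x), key=reading_key)
    right = sorted((w for w in words if w.x >= column_split_x), key=reading_key)
    return " ".join(w.text for w in left + right)


print(naive_order(WORDS))                # sentences from both columns interleave
print(column_aware_order(WORDS, 150.0))  # each column stays intact
```

The naive ordering merges the two sentences line by line, which is the kind of scrambled text a chunker and embedder would then receive.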

ColPali asks a different question: what if we never extract text at all?

The ColPali Approach

ColPali renders each PDF page as an image and embeds it using a vision-language model — no OCR, no text extraction. At query time, the text query is matched against page image embeddings using late interaction scoring.

Architecture:

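PaliGemma backbone: Google's vision-language model splits each page image into 14x14-pixel patches and produces one embedding per patch. At the 448x448 input resolution ColPali uses, that is a 32x32 grid of 1,024 patches, or roughly 1,030 vectors of 128 dimensions per page once a few extra text tokens are counted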
Late interaction (ColBERT-style): Query tokens are matched against document patch tokens with MaxSim scoring — each query token finds its best-matching page patch
Aggregated score: Sum of MaxSim scores across query tokens becomes the relevance score
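
The scoring rule described above can be sketched in a few lines of plain Python. This is an illustrative toy, not the library's implementation: the 2-D vectors and the names `maxsim_score`, `page_a`, and `page_b` are made up for the example, and real embeddings are much higher-dimensional.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], page_vecs: List[Vector]) -> float:
    """Sum, over query tokens, of the best dot product against any page patch."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


# Toy 2-D embeddings: two query tokens, two candidate pages.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # has a strong match for both tokens
page_b = [[1.0, 0.0], [0.6, 0.0]]              # matches only the first token

scores = {
    "page_a": maxsim_score(query, page_a),
    "page_b": maxsim_score(query, page_b),
}
best = max(scores, key=scores.get)
print(scores, best)
```

Because every query token picks its own best patch, a page is rewarded for covering all parts of the query somewhere on the page, not for one patch matching the query on average.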

The ColPali checkpoints on Hugging Face (for example vidore/colpali-v1.2, used below) are fine-tuned from PaliGemma.

ViDoRe Benchmark

On the ViDoRe benchmark (visual document retrieval), ColPali achieves significantly higher nDCG@5 than OCR-based pipelines, particularly on document types where OCR degrades:

Financial reports with tables: +18 nDCG@5 over BM25+OCR
Scientific papers with figures: +24 nDCG@5
Slide decks: +31 nDCG@5
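
For readers unfamiliar with the metric, the following is a minimal sketch of nDCG@5 with binary relevance. It assumes the input list holds the relevance of every judged result in ranked order and uses linear gain; the function names are illustrative. Benchmark tables commonly report the value multiplied by 100, so a gain of 18 points would correspond to 0.18 on the 0-1 scale computed here.

```python
import math
from typing import List


def dcg_at_k(relevances: List[float], k: int) -> float:
    """Discounted cumulative gain over the top-k ranked results."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: List[float], k: int = 5) -> float:
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# relevances[i] is 1 if the page ranked at position i is relevant, else 0.
perfect = ndcg_at_k([1, 0, 0, 0, 0])  # relevant page ranked first
third = ndcg_at_k([0, 0, 1, 0, 0])    # relevant page ranked third
missed = ndcg_at_k([0, 0, 0, 0, 0])   # no relevant page retrieved

print(perfect, third, missed)
```

The logarithmic discount is why moving the right page from rank three to rank one matters so much: the same hit is worth half as much at position three.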

Running ColPali

from colpali_engine.models import ColPali, ColPaliProcessor
from PIL import Image
import torch

model = ColPali.from_pretrained(
    "vidore/colpali-v1.2",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
).eval()  # inference only: disable dropout
processor = ColPaliProcessor.from_pretrained("vidore/colpali-v1.2")

# Convert PDF pages to images (e.g., with pdf2image)
page_images = [Image.open("page_1.png"), Image.open("page_2.png")]

# Embed pages
page_inputs = processor.process_images(page_images).to("cuda")
with torch.no_grad():
    page_embeddings = model(**page_inputs)

# Embed query
queries = ["What was the Q3 revenue?"]
query_inputs = processor.process_queries(queries).to("cuda")
with torch.no_grad():
    query_embeddings = model(**query_inputs)

# Score pages against the query; the result has shape (num_queries, num_pages)
scores = processor.score_multi_vector(query_embeddings, page_embeddings)
best_page = scores[0].argmax().item()  # 0-based index into page_images
print(f"Most relevant page index: {best_page}")
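
In a RAG setting you usually want several candidate pages rather than the single argmax. A small sketch of a top-k helper follows; `top_k_pages` and the example scores are hypothetical, and the input is assumed to be one row of the score matrix converted to a plain list, for instance with `scores[0].tolist()`.

```python
from typing import List, Tuple


def top_k_pages(page_scores: List[float], k: int = 3) -> List[Tuple[int, float]]:
    """Return (page_index, score) pairs for the k highest-scoring pages."""
    ranked = sorted(enumerate(page_scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical scores for four pages against one query.
example_scores = [11.2, 17.9, 9.4, 15.1]

for index, score in top_k_pages(example_scores, k=2):
    print(f"page index {index}: score {score:.1f}")
```

The returned indices are 0-based positions in the list of page images, so they can be mapped straight back to the rendered pages that get passed to the generator model.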

Vespa.ai Integration

For production-scale deployment with millions of pages, Vespa.ai supports ColBERT-style multi-vector indexing natively. The ColPali team provides Vespa application packages in the repository. Vespa's tensor computations handle the MaxSim scoring at scale without loading all page embeddings into GPU memory.
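
A back-of-the-envelope estimate shows why multi-vector indexes need a dedicated engine at this scale. The sketch below assumes about 1,030 vectors of 128 dimensions per page stored as 2-byte floats; those figures, and the function name, are assumptions for illustration, and the result ignores index overhead and any compression the engine applies.

```python
def index_size_bytes(num_pages: int, vectors_per_page: int, dim: int, bytes_per_value: int) -> int:
    """Raw size of a multi-vector index, ignoring overhead and compression."""
    return num_pages * vectors_per_page * dim * bytes_per_value


# Assumed figures: ~1,030 vectors x 128 dims per page, 2 bytes per value.
per_page = index_size_bytes(1, 1030, 128, 2)
one_million = index_size_bytes(1_000_000, 1030, 128, 2)

print(f"{per_page / 1024:.1f} KiB per page")
print(f"{one_million / 1e9:.1f} GB per million pages")
```

Under these assumptions a million pages take a few hundred gigabytes of raw embeddings, which is far more than a single-vector index of the same corpus and more than fits comfortably in GPU memory.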

When to Use ColPali

ColPali is the right choice for annual reports, technical manuals, research papers with figures, slide decks, and any other documents where layout carries meaning. For plain-text documents (support tickets, emails, plain blog posts), a traditional hybrid of dense embeddings and BM25 remains faster and cheaper.
