HuggingFace's Answer to GPT-4V
The HuggingFace Multimodal team (M4) released Idefics2 as a fully open alternative to closed multimodal APIs. At 8B parameters (a Mistral 7B language backbone plus a SigLIP vision encoder), it fits on a single A100 80GB and is reported to be competitive with much larger proprietary models on several benchmarks.
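As a rough sanity check on the single-GPU claim, the arithmetic below estimates the memory taken by the weights alone at different precisions. It is a back-of-the-envelope sketch: the 8e9 figure is the nominal "8B" from the text rather than an exact parameter count, and activations, the KV cache, and framework overhead are not included.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


# Nominal "8B" figure from the text; the exact parameter count is an assumption.
N_PARAMS = 8e9

for dtype_name, n_bytes in [("float32", 4), ("bfloat16", 2), ("int8", 1)]:
    print(f"{dtype_name}: ~{weight_memory_gb(N_PARAMS, n_bytes):.0f} GB")
```

At bfloat16 the weights come to roughly 16 GB, which leaves considerable headroom on an 80 GB card for long multi-image prompts.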
Architecture Highlights
Idefics2 connects a SigLIP vision encoder to a Mistral 7B language model via a learned perceiver resampler, which pools the encoder's patch features into a small, fixed number of visual tokens. Unlike models that resize every image to a fixed resolution and pad the rest, Idefics2 encodes images at their native aspect ratio. Large images can additionally be split into sub-images that are encoded independently and then concatenated. This means a 1024×512 image and a 512×1024 image are both handled without distortion.
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
import torch

# Load the processor (tokenizer + image preprocessing) and the model weights
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

image = Image.open("document.png")

# Each {"type": "image"} entry is a placeholder for one image passed to the processor
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is the total revenue shown in this table?"},
        ],
    }
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

# Inference only, so gradient tracking is disabled
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=200)

# The decoded string contains the prompt followed by the generated answer
print(processor.decode(output[0], skip_special_tokens=True))
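To make the sub-image idea from this section concrete, here is a minimal sketch of one way to split an image into four quadrants plus the full frame. It only computes crop boxes in the `(left, upper, right, lower)` form that `PIL.Image.crop` accepts. The function name `split_boxes` is hypothetical, and this illustrates the general technique rather than the exact logic of the Idefics2 processor.

```python
def split_boxes(width: int, height: int) -> list[tuple[int, int, int, int]]:
    """Return crop boxes for four quadrants followed by the full image.

    Boxes use the (left, upper, right, lower) convention of PIL.Image.crop.
    """
    mid_x, mid_y = width // 2, height // 2
    return [
        (0, 0, mid_x, mid_y),           # top-left
        (mid_x, 0, width, mid_y),       # top-right
        (0, mid_y, mid_x, height),      # bottom-left
        (mid_x, mid_y, width, height),  # bottom-right
        (0, 0, width, height),          # full image, kept for global context
    ]


boxes = split_boxes(1024, 512)
for box in boxes:
    print(box)
```

For even dimensions, each quadrant keeps the aspect ratio of the original, so a wide image yields wide crops and a tall image yields tall crops. The full-frame box is kept so the model still sees the global layout alongside the higher-detail crops.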
Interleaved Image-Text Sequences
Idefics2 accepts multiple images interleaved with text in a single prompt, a capability that many VLMs trained only on single image-caption pairs lack. This matters for tasks such as comparing two charts, answering questions about a multi-page document, or following instructions that refer to several reference images. The model is designed to reason across the full interleaved sequence rather than treating each image in isolation.
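An interleaved prompt can be assembled in the same message format as the single-image example earlier. The sketch below uses a hypothetical helper, `build_user_message`, that walks a list of text and image segments and returns both the chat message and the images in matching order. The `object()` values stand in for `PIL.Image` objects so the snippet runs without any image files.

```python
def build_user_message(segments):
    """Build one user message from a mixed list of strings and images.

    Strings become text entries; anything else is treated as an image and
    replaced by an {"type": "image"} placeholder. Images are returned in the
    same order as their placeholders.
    """
    content, images = [], []
    for segment in segments:
        if isinstance(segment, str):
            content.append({"type": "text", "text": segment})
        else:
            content.append({"type": "image"})
            images.append(segment)
    return [{"role": "user", "content": content}], images


page_1, page_2 = object(), object()  # stand-ins for PIL.Image objects
messages, images = build_user_message(
    ["Here is page one:", page_1, "and page two:", page_2,
     "Which page reports higher revenue?"]
)
print(messages[0]["content"])
```

The resulting `messages` and `images` would then be passed to `processor.apply_chat_template(...)` and `processor(text=prompt, images=images, ...)` exactly as in the single-image case. Keeping the two lists aligned is what lets each placeholder resolve to the intended image.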
The OBELICS Dataset
Idefics2's training data includes OBELICS, an open web-scale dataset of interleaved image-text documents containing roughly 115B text tokens. Unlike earlier multimodal datasets that pair single images with captions, OBELICS contains full web pages with multiple images and their surrounding text, which helps explain why Idefics2 performs well on document-level tasks.
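To illustrate what "interleaved" means at the record level, the sketch below flattens one document into an ordered sequence of image and text segments. It assumes a record layout with parallel `images` and `texts` lists, where each position holds either an image reference or a text string and `None` in the other list. That layout, the function name `interleave`, and the example URLs are assumptions for illustration and should be checked against the dataset card.

```python
def interleave(record):
    """Flatten parallel images/texts lists into one ordered sequence."""
    sequence = []
    for image_ref, text in zip(record["images"], record["texts"]):
        if image_ref is not None:
            sequence.append(("image", image_ref))
        elif text is not None:
            sequence.append(("text", text))
    return sequence


# Hypothetical record; the URLs are placeholders, not real dataset entries.
record = {
    "images": [None, "https://example.com/chart.png", None, "https://example.com/table.png"],
    "texts": ["Quarterly results overview.", None, "Revenue grew in every region.", None],
}

for kind, value in interleave(record):
    print(kind, value)
```

The point of the structure is that document order is preserved: a sentence that describes a chart sits next to that chart in the sequence, which a caption-only dataset cannot express.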
Fine-Tuning for Document Understanding
For teams needing custom document or chart extraction, Idefics2 provides a strong starting point:
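from trl import SFTTrainer
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="idefics2-finetuned",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,  # effective batch size of 8 per device
    learning_rate=2e-5,
    num_train_epochs=3,
    bf16=True,
    remove_unused_columns=False,  # keep image columns available to the data collator
)
# Vision inputs are generally not handled automatically: when passing
# training_args to SFTTrainer, a custom data collator that runs the processor
# on each batch of images and text is usually required.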
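One step a fine-tuning data collator commonly performs is building the `labels` tensor so the loss ignores padding and image placeholder tokens. The sketch below shows that masking on plain Python lists. The helper name `make_labels` and the token ids are hypothetical; in a real collator the ids would come from the processor's tokenizer, and the result would be a tensor.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def make_labels(input_ids, pad_token_id, image_token_id):
    """Copy input_ids, replacing pad and image tokens with IGNORE_INDEX."""
    masked = (pad_token_id, image_token_id)
    return [
        [IGNORE_INDEX if token in masked else token for token in row]
        for row in input_ids
    ]


# Hypothetical ids: 0 is padding, 9 is the image placeholder token.
batch = [
    [9, 9, 101, 102, 103],
    [9, 101, 102, 0, 0],
]
labels = make_labels(batch, pad_token_id=0, image_token_id=9)
print(labels)
```

Without this masking the model would also be trained to predict padding and image placeholders, which wastes capacity and can destabilize the loss on short examples.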
Comparison to LLaVA and PaliGemma
LLaVA 1.6 is available with a similar Mistral 7B backbone, but it handles high-resolution inputs by splitting images into a grid of fixed-size tiles rather than encoding them at native aspect ratio. PaliGemma is purpose-built for fine-tuning but smaller (3B). Idefics2 occupies a useful middle ground: it is large enough for strong zero-shot performance, open enough to fine-tune, and natively supports multi-image interleaved prompts.