A VLM Built for Fine-Tuning
Most vision-language models are released with zero-shot capability as the headline feature. PaliGemma takes a different stance: it is designed as a transfer learning base. Google trained it to be fine-tuned on specific vision tasks rather than to be a general-purpose assistant out of the box. This design choice shapes everything about how you should use it.
Architecture: SigLIP + Gemma 2B
PaliGemma has two components:
- SigLIP vision encoder: splits the image into patches and emits one visual token per patch, giving 256 tokens at 224×224 and 1,024 tokens at 448×448 in the higher-resolution checkpoints. SigLIP is pretrained contrastively with a pairwise sigmoid loss, which avoids the batch-wide normalization that CLIP's softmax loss requires.
- Gemma 2B language model: takes the visual tokens concatenated with the text tokens and generates the output autoregressively.
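The visual token counts follow from the patch grid. The sketch below is a quick sanity check, assuming a 14-pixel square patch (an assumption inferred from 256 tokens at 224×224, not stated above); visual_token_count is a hypothetical helper name.

```python
def visual_token_count(resolution: int, patch_size: int = 14) -> int:
    """Number of visual tokens for a square image, one token per patch.

    patch_size=14 is an assumption inferred from 224x224 -> 256 tokens.
    """
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of patch_size")
    patches_per_side = resolution // patch_size
    return patches_per_side * patches_per_side


for resolution in (224, 448):
    print(resolution, visual_token_count(resolution))
```

Quadrupling the pixel count quadruples the visual tokens, so the higher-resolution checkpoints cost noticeably more memory and compute per image.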
Total parameter count is approximately 3B, making it practical on a single A100 or even high-end consumer GPUs with quantization.
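To see why quantization matters on consumer hardware, here is a back-of-the-envelope estimate of weight memory alone. It assumes a round 3B parameters and ignores activations, the KV cache, and any optimizer state needed for fine-tuning, so real usage will be higher; weight_memory_gb is a hypothetical helper name.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for the weights only, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 3e9  # assumption: rounded parameter count

for label, bits in (("bfloat16", 16), ("int8", 8), ("4-bit", 4)):
    print(f"{label}: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```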
What Fine-Tuning Tasks Look Like
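PaliGemma was pretrained on a mixture of tasks expressed in one unified text format: a short task prefix such as "caption en" or "answer en" plus a question, with the expected output as the target text. Fine-tuning for a new task follows the same pattern. The snippet below loads the pretrained checkpoint and runs a single prompt to show the input side of that pattern: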
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor
from PIL import Image
import torch

model_id = "google/paligemma-3b-pt-224"
processor = PaliGemmaProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Convert to RGB so grayscale or RGBA files do not trip up the processor
image = Image.open("chart.png").convert("RGB")
prompt = "caption en"  # or "answer en What is the trend shown?"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
prompt_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=100)

# generate() returns the prompt followed by the continuation; decode only the new tokens
print(processor.decode(output[0][prompt_len:], skip_special_tokens=True))
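For fine-tuning, each training example pairs the same kind of prompt with a target string. The sketch below shows one possible collate function. It assumes the processor accepts a suffix argument for the target text and builds the training labels from it, as PaliGemmaProcessor does in recent transformers releases; the record keys "image", "prefix", and "suffix" and the function name are hypothetical.

```python
from typing import Any, Dict, List


def collate_for_finetuning(examples: List[Dict[str, Any]], processor: Any) -> Any:
    """Batch (image, prefix, suffix) records into model inputs.

    Assumption: `processor` behaves like PaliGemmaProcessor, where `text` is
    the task prompt and `suffix` is the target the model should learn to emit.
    """
    images = [example["image"] for example in examples]
    prefixes = [example["prefix"] for example in examples]  # e.g. "answer en What is the total?"
    suffixes = [example["suffix"] for example in examples]  # e.g. "42.10"
    return processor(
        text=prefixes,
        images=images,
        suffix=suffixes,
        return_tensors="pt",
        padding="longest",
    )
```

A function like this can be wrapped with functools.partial and passed as collate_fn to a PyTorch DataLoader, or as data_collator to a training loop that accepts one.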
Supported Fine-Tuning Tasks
PaliGemma's pretraining covered image captioning, visual question answering, object detection (bounding box coordinates emitted as text tokens), image segmentation (masks emitted as segmentation tokens), OCR, and chart understanding. Fine-tuning on a custom dataset for any of these tasks tends to need relatively little data and compute, because the model already has the relevant capability: you are adapting it rather than teaching it from scratch.
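Because detection results arrive as text, they need parsing before use. The sketch below assumes the commonly documented output format: four location tokens per box in the order y_min, x_min, y_max, x_max, each an integer bin from 0 to 1023, followed by the label, with multiple detections separated by semicolons. Verify this against the output of your own checkpoint; parse_detections is a hypothetical helper name.

```python
import re
from typing import Dict, List

# Assumed format: "<loc0256><loc0128><loc0768><loc0896> cat ; <loc...> dog"
_BOX_PATTERN = re.compile(
    r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^;<]+)"
)


def parse_detections(text: str, width: int, height: int, bins: int = 1024) -> List[Dict]:
    """Convert location tokens to pixel boxes as (x_min, y_min, x_max, y_max)."""
    detections = []
    for match in _BOX_PATTERN.finditer(text):
        # Token order is assumed to be y_min, x_min, y_max, x_max
        y_min, x_min, y_max, x_max = (int(value) / bins for value in match.groups()[:4])
        detections.append(
            {
                "label": match.group(5).strip(),
                "box": (x_min * width, y_min * height, x_max * width, y_max * height),
            }
        )
    return detections


sample = "<loc0256><loc0128><loc0768><loc0896> cat ; <loc0000><loc0000><loc0512><loc0512> dog"
for detection in parse_detections(sample, width=200, height=100):
    print(detection)
```

Keep special tokens when decoding detection output, since stripping them may remove the location tokens before the parser sees them.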
PaliGemma 2: The Updated Backbone
PaliGemma 2 replaced the original Gemma 2B language model with Gemma 2 backbones and expanded the family to three sizes: 3B, 10B, and 28B. The 10B variant performs strongly on DocVQA (document visual QA) and ChartQA after fine-tuning, without task-specific architectural changes, which makes it a reasonable starting point for enterprise document workflows.
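Switching between sizes is mostly a matter of changing the checkpoint identifier. The helper below is a sketch that assumes the Hugging Face Hub naming scheme google/paligemma2-{size}-pt-{resolution} and the resolutions 224, 448, and 896; both are assumptions to confirm on the Hub before use, and paligemma2_checkpoint is a hypothetical name.

```python
SIZES = ("3b", "10b", "28b")   # sizes named in the text above
RESOLUTIONS = (224, 448, 896)  # assumption: resolutions published for the pretrained checkpoints


def paligemma2_checkpoint(size: str, resolution: int) -> str:
    """Build a pretrained checkpoint id under the assumed naming scheme."""
    if size not in SIZES:
        raise ValueError(f"size must be one of {SIZES}")
    if resolution not in RESOLUTIONS:
        raise ValueError(f"resolution must be one of {RESOLUTIONS}")
    return f"google/paligemma2-{size}-pt-{resolution}"


print(paligemma2_checkpoint("10b", 448))
```

Document-heavy tasks such as DocVQA usually benefit from the higher resolutions, at the cost of more visual tokens per image.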
When to Use PaliGemma vs GPT-4V
GPT-4V excels at general zero-shot visual reasoning but is expensive, opaque, and rate-limited. PaliGemma is the better fit when you have labeled data for a specific task, need to run inference at scale on your own infrastructure, or require full control over the model weights for compliance reasons. For narrow tasks such as receipt parsing, form extraction, or domain-specific captioning, a PaliGemma 3B fine-tuned on enough representative examples can match or outperform GPT-4V on that task at a fraction of the cost. Measure both on a held-out sample of your own data before committing.