PaliGemma: Google's Compact Vision-Language Model for Fine-Tuning

PaliGemma combines SigLIP vision encoding with Gemma 2B language generation in a 3B model explicitly designed to be fine-tuned rather than used zero-shot.

Mahmudul Haque Qudrati

CEO & ML Engineer

March 20, 2026

7 min read

// tags

#paligemma #google #vision-language #fine-tuning #siglip


What Fine-Tuning Tasks Look Like

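PaliGemma was pretrained on a mixture of tasks using a unified text format: a short task prefix plus the image go in, and the answer comes out as text. Fine-tuning for a new task follows the same pattern. The snippet below runs inference with the pretrained checkpoint using the captioning prefix: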

from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor
from PIL import Image
import torch

# Pretrained ("pt") checkpoint at 224x224 input resolution
model_id = "google/paligemma-3b-pt-224"
processor = PaliGemmaProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("chart.png").convert("RGB")  # ensure 3-channel input
prompt = "caption en"  # or "answer en What is the trend shown?"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
prompt_len = inputs["input_ids"].shape[-1]
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=100)
# generate() returns prompt + continuation; decode only the new tokens
print(processor.decode(output[0][prompt_len:], skip_special_tokens=True))
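For fine-tuning, each training example pairs a task prefix (the prompt) with a target string the model should learn to produce. The sketch below shows one way to build those pairs for the unified text format mentioned above. The `Example` class and the helper names are hypothetical, introduced only for illustration. The prefix strings follow the conventions commonly documented for PaliGemma (`caption en`, `answer en ...`, `detect ...`), so check them against the model card for the checkpoint you use.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Example:
    """One fine-tuning pair (hypothetical helper type, not a library class)."""
    prefix: str  # task prompt given to the model alongside the image
    suffix: str  # target text the model should learn to generate


def caption_example(caption: str, lang: str = "en") -> Example:
    # Captioning: the prefix names the task and the output language.
    return Example(prefix=f"caption {lang}", suffix=caption)


def vqa_example(question: str, answer: str, lang: str = "en") -> Example:
    # Visual question answering: the question is appended to the prefix.
    return Example(prefix=f"answer {lang} {question}", suffix=answer)


def detect_prefix(labels: Sequence[str]) -> str:
    # Detection: class names are separated by " ; " (assumed convention).
    return "detect " + " ; ".join(labels)


if __name__ == "__main__":
    print(caption_example("A bar chart of quarterly revenue."))
    print(vqa_example("What is the trend shown?", "upward"))
    print(detect_prefix(["cat", "dog"]))
```

With the Hugging Face processor, the target text is typically passed separately from the prompt so that the loss is computed only on the target tokens. Treat the exact argument name as something to confirm against your installed `transformers` version.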

Supported Fine-Tuning Tasks

PaliGemma's pretraining covered image captioning, visual question answering, object detection (bounding box coordinates emitted as text tokens), image segmentation (emitted as segmentation tokens), OCR, and chart understanding. Fine-tuning on a custom dataset for any of these tasks is straightforward because the model already has the relevant capabilities: you are adapting existing behavior rather than teaching it from scratch.
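Because detection results arrive as text, they have to be parsed back into coordinates before use. The following is a minimal sketch, assuming the commonly documented output format: four `<locNNNN>` tokens per object in the order y_min, x_min, y_max, x_max, each an integer from 0 to 1023 on a grid normalized to the image size, followed by the label, with objects separated by `;`. The `Box` type and `parse_detections` function are hypothetical names, not part of any library.

```python
import re
from typing import List, NamedTuple


class Box(NamedTuple):
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


# Four location tokens followed by a label (assumed output format).
_DETECTION = re.compile(
    r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^;<]+)"
)


def parse_detections(text: str, width: int, height: int) -> List[Box]:
    """Convert detection text into pixel-space boxes for a width x height image."""
    boxes = []
    for match in _DETECTION.finditer(text):
        # Token order is y_min, x_min, y_max, x_max on a 1024-bin grid.
        y0, x0, y1, x1 = (int(match.group(i)) / 1024 for i in range(1, 5))
        boxes.append(
            Box(match.group(5).strip(), x0 * width, y0 * height, x1 * width, y1 * height)
        )
    return boxes


if __name__ == "__main__":
    sample = "<loc0256><loc0128><loc0768><loc0512> cat ; <loc0000><loc0000><loc1023><loc1023> dog"
    for box in parse_detections(sample, width=1024, height=1024):
        print(box)
```

The same mapping run in reverse (pixel box to `<loc>` tokens) is what a detection fine-tuning dataset would need for its target strings.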

PaliGemma 2: The Updated Backbone

PaliGemma 2 replaced the original Gemma 2B language model with the Gemma 2 family and comes in three sizes: 3B, 10B, and 28B. The 10B variant performs strongly on DocVQA (document visual QA) and ChartQA without task-specific engineering, which makes it a sensible starting point for enterprise document workflows.
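When choosing between the three sizes, a back-of-the-envelope estimate of weight memory helps. The sketch below multiplies the nominal parameter count by the bytes per parameter (2 for bfloat16). It is only a lower bound: the nominal sizes are rounded, and activations, the KV cache, and any optimizer state during fine-tuning come on top. The function name is hypothetical.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    # billions of parameters x bytes per parameter = gigabytes
    return params_billion * bytes_per_param


if __name__ == "__main__":
    for size in (3, 10, 28):  # nominal PaliGemma 2 sizes, in billions
        print(f"{size}B in bfloat16: about {weight_memory_gb(size):.0f} GB of weights")
```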

When to Use PaliGemma vs GPT-4V

GPT-4V excels at general zero-shot visual reasoning but is expensive at volume, closed-weight, and rate-limited. PaliGemma is the better fit when you have labeled data for a specific task, need to run inference at scale on your own infrastructure, or require full control over the model weights for compliance reasons. For narrow tasks such as receipt parsing, form extraction, or domain-specific captioning, a fine-tuned PaliGemma 3B can match or outperform GPT-4V on that task at a fraction of the cost, provided the fine-tuning data reflects the inputs seen in production.
