The Bootstrap Problem

Training a large vision-language model end-to-end is expensive because it updates both a vision encoder and an LLM at the same time. BLIP-2, from Salesforce, avoids this by keeping both components frozen and training only a small bridge module: the Q-Former.

The Q-Former (Querying Transformer) contains 32 learned query tokens that cross-attend to image features from a frozen CLIP-style ViT encoder and produce a fixed-size representation that is passed to the frozen LLM. Only the Q-Former (~188M parameters), plus a small projection layer in the second stage, is trained during the two pre-training stages.
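
To make the "fixed-size representation" concrete, here is a minimal NumPy sketch of a single cross-attention step, not the real Q-Former (which stacks many transformer layers). The 32 queries come from the text; the widths (768 for the queries, 257 x 1408 for the image features) and all weight names are assumptions chosen for illustration, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_QUERIES, QUERY_DIM = 32, 768    # 32 queries per the text; width 768 is assumed
NUM_PATCHES, IMAGE_DIM = 257, 1408  # assumed output shape of the frozen image encoder

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Trainable pieces (random stand-ins here): the queries and the projections.
queries = rng.normal(size=(NUM_QUERIES, QUERY_DIM)) * 0.02
w_q = rng.normal(size=(QUERY_DIM, QUERY_DIM)) * 0.02
w_k = rng.normal(size=(IMAGE_DIM, QUERY_DIM)) * 0.02
w_v = rng.normal(size=(IMAGE_DIM, QUERY_DIM)) * 0.02

def cross_attend(queries, image_feats):
    """Single-head cross-attention: queries read from the image features."""
    q = queries @ w_q
    k = image_feats @ w_k
    v = image_feats @ w_v
    attn = softmax(q @ k.T / np.sqrt(QUERY_DIM))  # (num_queries, num_patches)
    return attn @ v, attn

image_feats = rng.normal(size=(NUM_PATCHES, IMAGE_DIM))  # stand-in for frozen features
visual_tokens, attn = cross_attend(queries, image_feats)
print(visual_tokens.shape)  # (32, 768)
```

However many patches the image encoder emits, the output always has one row per query, which is why the LLM sees a constant number of visual tokens.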

Two-Stage Training

Stage 1 (vision-language alignment): the Q-Former learns to extract the visual information most relevant to the text through three objectives:

Image-text contrastive learning (ITC)
Image-grounded text generation (ITG)
Image-text matching (ITM)
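
As an illustration of the first objective, the following is a hedged NumPy sketch of a symmetric image-text contrastive loss. It assumes each image yields one output vector per query and each caption yields one embedding, and it takes the highest query-text similarity as the image-text score. The fixed temperature of 0.07, the shapes, and the function names are illustrative assumptions, not the authors' training code.

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def itc_loss(query_out, text_emb, temperature=0.07):
    """query_out: (B, Q, D) query outputs; text_emb: (B, D) caption embeddings."""
    q = query_out / np.linalg.norm(query_out, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    # sim_all[i, j, k]: cosine similarity of query k of image i with caption j
    sim_all = np.einsum("iqd,jd->ijq", q, t)
    # Image-text score: best-matching query, scaled by the temperature
    sim = sim_all.max(axis=-1) / temperature  # (B, B)
    idx = np.arange(sim.shape[0])
    image_to_text = -log_softmax(sim, axis=1)[idx, idx].mean()
    text_to_image = -log_softmax(sim, axis=0)[idx, idx].mean()
    return (image_to_text + text_to_image) / 2

rng = np.random.default_rng(0)
query_out = rng.normal(size=(4, 32, 64))
random_text = rng.normal(size=(4, 64))
aligned_text = query_out[:, 0, :]  # each caption matches one query of its own image

loss_random = itc_loss(query_out, random_text)
loss_aligned = itc_loss(query_out, aligned_text)
print(loss_aligned < loss_random)
```

Matched image-caption pairs sit on the diagonal of the similarity matrix, so the loss falls as the diagonal entries come to dominate their rows and columns.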

Stage 2 (generative learning): the Q-Former output is projected into the LLM's embedding space by a linear layer, and the model learns to generate text conditioned on these visual features while the LLM itself stays frozen.
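
The projection step can be sketched in a few lines of NumPy. This is a shape-level illustration only: the Q-Former width of 768, the LLM width of 2560 (the hidden size of OPT-2.7B), and the names `proj_w`, `proj_b`, and `build_llm_inputs` are assumptions, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_QUERIES, QFORMER_DIM = 32, 768  # assumed Q-Former output shape
LLM_DIM = 2560                      # assumed LLM embedding width (OPT-2.7B)

# The linear projection trained in Stage 2 (random stand-in here).
proj_w = rng.normal(size=(QFORMER_DIM, LLM_DIM)) * 0.02
proj_b = np.zeros(LLM_DIM)

def build_llm_inputs(qformer_out, text_embeds):
    """Project the query outputs and prepend them to the text embeddings."""
    visual_prompt = qformer_out @ proj_w + proj_b  # (32, LLM_DIM)
    return np.concatenate([visual_prompt, text_embeds], axis=0)

qformer_out = rng.normal(size=(NUM_QUERIES, QFORMER_DIM))
text_embeds = rng.normal(size=(5, LLM_DIM))  # embeddings of a 5-token prompt
llm_inputs = build_llm_inputs(qformer_out, text_embeds)
print(llm_inputs.shape)  # (37, 2560)
```

From the LLM's point of view, the projected queries behave like 32 extra prompt tokens placed in front of the text.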

This separation means the Stage 1 alignment can be reused across LLM backends: switching to a different LLM only requires rerunning Stage 2.

Loading in 4 Lines

from transformers import Blip2Processor, Blip2ForConditionalGeneration
import torch
from PIL import Image

# device_map="auto" requires the accelerate package.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.float16,
    device_map="auto",
)

image = Image.open("image.jpg").convert("RGB")

# Image captioning: with no text prompt, the model generates a caption.
# Move inputs to the model's device instead of hard-coding "cuda".
inputs = processor(images=image, return_tensors="pt").to(model.device, torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=50)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(caption)

# Visual Q&A: the "Question: ... Answer:" prompt steers the model to answer.
prompt = "Question: What color is the car? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=20)
answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(answer)
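
The "Question: ... Answer:" prompt used for visual Q&A can be wrapped in a small helper, which also makes it easy to carry earlier turns of a conversation into the next prompt. `build_vqa_prompt` is a hypothetical convenience function written for this article, not part of the transformers library, and the trailing period after each earlier answer is one common convention rather than a requirement.

```python
def build_vqa_prompt(question, history=()):
    """Format a question, plus optional earlier (question, answer) turns."""
    turns = [f"Question: {q} Answer: {a}." for q, a in history]
    turns.append(f"Question: {question} Answer:")
    return " ".join(turns)

prompt = build_vqa_prompt("What color is the car?")
print(prompt)
# Question: What color is the car? Answer:

follow_up = build_vqa_prompt(
    "Is it parked?",
    history=[("What color is the car?", "red")],
)
print(follow_up)
# Question: What color is the car? Answer: red. Question: Is it parked? Answer:
```

The resulting string is what you would pass as `text=` to the processor.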

Hugging Face hosts BLIP-2 checkpoints with the OPT-2.7B and FlanT5-XL backends, among others, along with a live demo you can try without any setup.

Memory With 8-Bit Quantization

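BLIP-2 with OPT-2.7B has roughly 3.8B parameters in total, so its weights take about 15GB in float32 and about 7.5GB in float16, before activations. Load it in 8-bit to leave comfortable headroom on a T4 (16GB VRAM):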

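from transformers import BitsAndBytesConfig

# 8-bit loading requires the bitsandbytes and accelerate packages.
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)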

8-bit quantization cuts the weight memory to roughly 4GB, with <2% quality degradation on VQA benchmarks.
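
A back-of-the-envelope estimate helps when judging whether a checkpoint will fit on a given GPU. The sketch below counts weight memory only; activations, the generation cache, and the CUDA context add overhead on top. The total of 3.8 billion parameters is an assumed approximation for the OPT-2.7B variant (image encoder, Q-Former, and LLM combined), not a figure from an official source.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

def weight_memory_gb(num_params, dtype):
    """Approximate size of the weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

NUM_PARAMS = 3.8e9  # assumed approximate total for BLIP-2 with OPT-2.7B

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(NUM_PARAMS, dtype):.1f} GB")
# float32: 15.2 GB
# float16: 7.6 GB
# int8: 3.8 GB
```

Measured usage will be somewhat higher than these numbers, so treat them as a lower bound.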

Comparison to LLaVA on VQA Benchmarks

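On VQAv2, BLIP-2 (FlanT5-XL backend) scores ~63.0% zero-shot vs ~78.5% for LLaVA-1.5-7B. LLaVA clearly outperforms BLIP-2 on general VQA, reflecting its larger language model and instruction tuning. The comparison is not like-for-like, though: LLaVA-1.5's training mix includes VQAv2 data, while the BLIP-2 number is zero-shot. BLIP-2's advantage is the bootstrapping architecture - it trains on image-text pairs without any visual instruction data and updates only the lightweight Q-Former, which can make it easier to adapt to new visual domains with small datasets.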

BLIP-2: Bootstrap Vision-Language Models With Frozen Image Encoders

Mahmudul Haque Qudrati
