The Bootstrap Problem
Training a large vision-language model end-to-end is compute-intensive, because both the vision encoder and the LLM are updated simultaneously. BLIP-2, from Salesforce, avoids this by keeping both components frozen and training only a small bridge module, the Q-Former.
The Q-Former (Querying Transformer) contains 32 learned query tokens that cross-attend to image features from the frozen vision encoder (a CLIP-style ViT) and produce a fixed-size representation, one output vector per query, that is passed to the frozen LLM. The Q-Former (~188M parameters) is the only substantial component trained during both pre-training stages.
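To make the "fixed-size representation" concrete, here is a minimal NumPy sketch of a single cross-attention step. It is not the actual Q-Former, which stacks many transformer layers with multi-head attention. The 32 queries come from the text above; the widths 768 and 1408, the random weights, and the function name `query_cross_attention` are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def query_cross_attention(queries, image_feats, w_q, w_k, w_v):
    """Single-head cross-attention: learned queries attend to image features.

    queries:     (num_queries, d_q)
    image_feats: (num_patches, d_img)
    returns:     (num_queries, d_q), independent of num_patches
    """
    q = queries @ w_q                        # (num_queries, d_q)
    k = image_feats @ w_k                    # (num_patches, d_q)
    v = image_feats @ w_v                    # (num_patches, d_q)
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (num_queries, num_patches)
    attn = softmax(scores, axis=-1)          # each row sums to 1 over patches
    return attn @ v

rng = np.random.default_rng(0)
NUM_QUERIES, D_Q, D_IMG = 32, 768, 1408      # widths are assumptions

queries = rng.normal(size=(NUM_QUERIES, D_Q)) * 0.02   # stands in for learned queries
w_q = rng.normal(size=(D_Q, D_Q)) * 0.02
w_k = rng.normal(size=(D_IMG, D_Q)) * 0.02
w_v = rng.normal(size=(D_IMG, D_Q)) * 0.02

# Different image resolutions give different patch counts, same output shape
for num_patches in (257, 577):
    image_feats = rng.normal(size=(num_patches, D_IMG))
    out = query_cross_attention(queries, image_feats, w_q, w_k, w_v)
    print(num_patches, "->", out.shape)
```

Whatever the number of image patches, the output has one row per query, so the LLM always receives the same number of visual tokens.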
Two-Stage Training
Stage 1 (vision-language alignment): The Q-Former learns to extract the visual information most relevant to the accompanying text through three objectives:
- Image-text contrastive learning
- Image-grounded text generation
- Image-text matching
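As an illustration of the first objective, the sketch below computes an image-text contrastive loss in NumPy. It assumes that the image-text similarity is the best match between any query output and the text embedding, and that the loss is a symmetric cross-entropy over the batch. The temperature value, the toy dimensions, and the helper names are assumptions, not the reference implementation.

```python
import numpy as np

def normalize(x):
    # L2-normalize along the last axis
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def itc_loss(query_embeds, text_embeds, temperature=0.07):
    """Image-text contrastive loss over a batch.

    query_embeds: (B, Q, D) normalized query outputs, Q per image
    text_embeds:  (B, D)    normalized text embeddings, one per caption
    """
    # sim[i, j, q]: similarity between query q of image i and text j
    sim = np.einsum("iqd,jd->ijq", query_embeds, text_embeds)
    # Keep the best-matching query for every image-text pair
    logits = sim.max(axis=-1) / temperature          # (B, B)

    def cross_entropy_diag(l):
        # Cross-entropy where the correct class for row i is column i
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

rng = np.random.default_rng(0)
B, Q, D = 4, 8, 16                                   # toy sizes
text = normalize(rng.normal(size=(B, D)))
queries = normalize(rng.normal(size=(B, Q, D)))
queries[:, 0, :] = text                              # one query per image matches its caption

aligned = itc_loss(queries, text)
shuffled = itc_loss(queries, np.roll(text, 1, axis=0))
print(f"aligned: {aligned:.3f}, shuffled: {shuffled:.3f}")
```

The loss is low when each image's queries match its own caption and high when the captions are shuffled, which is the signal that pulls matched pairs together during training.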
Stage 2 (generative learning): The Q-Former output is projected into the LLM's embedding space by a small linear layer, and the model learns to generate text conditioned on these visual features.
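A rough sketch of that projection step for a decoder-only LLM follows, assuming the projected query outputs are simply prepended to the text token embeddings. The sizes (768 for the Q-Former, 2560 for the LLM), the random weights, and the name `build_llm_inputs` are illustrative assumptions.

```python
import numpy as np

NUM_QUERIES, D_QFORMER, D_LLM = 32, 768, 2560   # assumed sizes

rng = np.random.default_rng(0)
proj_w = rng.normal(size=(D_QFORMER, D_LLM)) * 0.02   # stands in for the trained projection
proj_b = np.zeros(D_LLM)

def build_llm_inputs(query_output, text_embeds):
    """Project Q-Former outputs to the LLM width and prepend them to the text.

    query_output: (num_queries, D_QFORMER)
    text_embeds:  (num_text_tokens, D_LLM)
    returns:      (num_queries + num_text_tokens, D_LLM)
    """
    visual_prefix = query_output @ proj_w + proj_b     # (num_queries, D_LLM)
    return np.concatenate([visual_prefix, text_embeds], axis=0)

query_output = rng.normal(size=(NUM_QUERIES, D_QFORMER))
text_embeds = rng.normal(size=(7, D_LLM))              # a 7-token prompt
llm_inputs = build_llm_inputs(query_output, text_embeds)
print(llm_inputs.shape)  # (39, 2560): 32 visual tokens + 7 text tokens
```

From the LLM's point of view the visual prefix is just 32 extra "soft" tokens, which is why the LLM itself can stay frozen.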
This separation means you can swap LLM backends without retraining the vision alignment: Stage 1 is independent of the LLM, so only Stage 2 has to be rerun for a new backend.
Loading in 4 Lines
from transformers import Blip2Processor, Blip2ForConditionalGeneration
import torch
from PIL import Image
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.float16,
    device_map="auto",
)
image = Image.open("image.jpg").convert("RGB")
# Image captioning
# Move inputs to wherever device_map placed the model instead of hard-coding "cuda"
inputs = processor(images=image, return_tensors="pt").to(model.device, torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=50)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(caption)
# Visual Q&A
inputs = processor(
    images=image,
    text="Question: What color is the car? Answer:",
    return_tensors="pt",
).to(model.device, torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=20)
answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(answer)
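The "Question: ... Answer:" prompt used above is easy to get subtly wrong when built by hand, so a small helper can be useful. The sketch below is one possible version; the function name `build_vqa_prompt` and the way earlier turns are joined are assumptions rather than part of the `transformers` API.

```python
def build_vqa_prompt(question, history=()):
    """Build a BLIP-2 style VQA prompt.

    history: iterable of (question, answer) pairs from earlier turns,
             included as context before the new question.
    """
    parts = [f"Question: {q.strip()} Answer: {a.strip()}." for q, a in history]
    parts.append(f"Question: {question.strip()} Answer:")
    return " ".join(parts)

print(build_vqa_prompt("What color is the car?"))
# Question: What color is the car? Answer:

print(build_vqa_prompt("Is it parked?", history=[("What color is the car?", "red")]))
# Question: What color is the car? Answer: red. Question: Is it parked? Answer:
```

The returned string can be passed as the `text` argument of the processor call shown above.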
The Hugging Face BLIP-2 page hosts checkpoints for the OPT-2.7B and FlanT5-XL backends, along with a live demo you can try without any setup.
Memory With 8-Bit Quantization
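BLIP-2 with OPT-2.7B has close to 4B parameters in total (vision encoder, Q-Former, and LLM), so its weights take roughly 15GB in float32 and about half that in float16. Loading in 8-bit shrinks the weights further and leaves comfortable margin on a T4 (16GB VRAM):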
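from transformers import BitsAndBytesConfig

# 8-bit loading requires the bitsandbytes and accelerate packages
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)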
8-bit quantization stores each weight in one byte, roughly halving weight memory relative to float16, with <2% quality degradation on VQA benchmarks.
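A back-of-the-envelope estimate of weight memory at each precision can be computed from the parameter count alone. The 3.8B figure below is a rough assumption for the combined model, and the result covers weights only: actual VRAM use is higher because of activations, the generation cache, and runtime overhead.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

def weight_memory_gb(num_params, dtype):
    """Memory needed to hold the weights alone, in GB (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

# Rough assumption: vision encoder + Q-Former + OPT-2.7B combined
ASSUMED_PARAMS = 3.8e9

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(ASSUMED_PARAMS, dtype):.1f} GB")
```

Under this assumption the weights come to about 15.2 GB in float32, 7.6 GB in float16, and 3.8 GB in int8.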
Comparison to LLaVA on VQA Benchmarks
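On VQAv2, BLIP-2 with the FlanT5-XL backend scores about 63% zero-shot (65.0% is the figure for the larger FlanT5-XXL variant), while LLaVA-1.5-7B reaches ~78.5%. The comparison is not like-for-like, since LLaVA-1.5's instruction-tuning mix includes VQAv2 training data, but LLaVA clearly leads on general VQA, reflecting its larger language model and instruction tuning. BLIP-2's advantage is the bootstrapping architecture: it needs only image-text pairs rather than visual instruction data, and it trains only the lightweight Q-Former, which makes it easier to adapt to new visual domains with small datasets.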