What Grounding Means for Vision-Language Models
Standard VQA models answer "what is in this image?" and standard VLMs describe what they see. Kosmos-2 goes further: it links specific phrases in its output to regions of the image. Ask "where is the dog?" and the model responds with both a text description and bounding box coordinates for the region it is describing.
The GRIT Dataset
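Kosmos-2 was trained on GRIT (Grounded Image-Text pairs), a large-scale dataset of web images whose captions are annotated with alignments between noun phrases and bounding boxes. The model learns to emit box markup inline with its text output. Shown schematically, with each box already decoded into coordinates:
<p>The golden retriever</p><box>[[0.12, 0.34, 0.67, 0.89]]</box> is sitting next to <p>the red ball</p><box>[[0.71, 0.55, 0.84, 0.72]]</box>.
In the raw output, each box is a pair of discrete location tokens naming the top-left and bottom-right cells of a grid over the image, not literal floats; the Hugging Face checkpoint writes the markers as <phrase>, <object>, and <patch_index_NNNN>. Post-processing converts these into (x1, y1, x2, y2) coordinates normalized to [0, 1] relative to image dimensions, making them resolution-independent.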
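Because the boxes are normalized, they have to be scaled by the image size before drawing or cropping. A minimal sketch of that conversion follows; the helper name to_pixel_box and the 640x480 image size are assumptions made for illustration, not part of Kosmos-2 or its processor.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]


def to_pixel_box(box: Box, width: int, height: int) -> Tuple[int, int, int, int]:
    """Scale a normalized (x1, y1, x2, y2) box to integer pixel coordinates."""
    def clamp(value: float) -> float:
        # Keep slightly out-of-range values inside the image.
        return min(max(value, 0.0), 1.0)

    x1, y1, x2, y2 = box
    return (
        round(clamp(x1) * width),
        round(clamp(y1) * height),
        round(clamp(x2) * width),
        round(clamp(y2) * height),
    )


# The golden retriever box from the example, on a hypothetical 640x480 image.
print(to_pixel_box((0.12, 0.34, 0.67, 0.89), 640, 480))  # (77, 163, 429, 427)
```

The same normalized box maps onto any resolution by changing only width and height.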
Running Kosmos-2
from transformers import AutoProcessor, Kosmos2ForConditionalGeneration
from PIL import Image

# Load the checkpoint and its matching processor (tokenizer + image preprocessing).
model = Kosmos2ForConditionalGeneration.from_pretrained("microsoft/kosmos-2-patch14-224")
processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")

image = Image.open("street_scene.jpg").convert("RGB")
# The <grounding> prefix asks the model to emit location tokens with its text.
prompt = "<grounding>Describe this image in detail:"
inputs = processor(text=prompt, images=image, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0]

# Split the raw output into a clean caption and a list of grounded entities.
processed_text, entities = processor.post_process_generation(generated_text)
for entity_name, (start, end), bboxes in entities:
    # (start, end) is the entity's character span within processed_text.
    print(f"Entity: {entity_name}")
    for bbox in bboxes:
        x1, y1, x2, y2 = bbox  # normalized to [0, 1]
        print(f"  Box: ({x1:.2f}, {y1:.2f}) -> ({x2:.2f}, {y2:.2f})")
Referring Expression Comprehension
Kosmos-2 also supports the inverse task: given a natural language description of a region, locate it. A request such as "Find the person wearing a red hat" returns a bounding box, with no object detection pipeline or separate grounding model needed.
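One way to phrase such a request for the Hugging Face checkpoint is to wrap the description in <phrase> tags after the <grounding> prefix, which is the prompt pattern the model card appears to use for referring expressions. The sketch below assumes that pattern; build_referring_prompt and locate are hypothetical helper names, and locate expects an already loaded model and processor.

```python
def build_referring_prompt(description: str) -> str:
    """Wrap a region description in the assumed grounding prompt format."""
    return f"<grounding><phrase> {description.strip()}</phrase>"


def locate(model, processor, image, description, max_new_tokens=32):
    """Return normalized boxes for the first grounded entity, or an empty list."""
    prompt = build_referring_prompt(description)
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    raw_text = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
    # post_process_generation returns (clean_text, entities).
    _, entities = processor.post_process_generation(raw_text)
    if not entities:
        return []
    _name, _span, boxes = entities[0]
    return list(boxes)


print(build_referring_prompt("the person wearing a red hat"))
# <grounding><phrase> the person wearing a red hat</phrase>
```

Called as locate(model, processor, image, "the person wearing a red hat"), this would return a list of (x1, y1, x2, y2) tuples in normalized coordinates, or an empty list if the model grounded nothing.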
Phrase Grounding in Captions
When generating captions, Kosmos-2 links each noun phrase to its corresponding region. The resulting structured captions are both human-readable and machine-parseable, which makes them useful for building image search indices, generating accessibility descriptions, or populating product databases from catalog images.
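To make "machine-parseable" concrete, here is a small sketch that turns a grounded caption into a JSON record of the kind a search index might store. The record layout, the caption_to_record helper, and the image id are assumptions for illustration; the sample entities are hand-written in the (phrase, span, boxes) shape returned by post_process_generation, not real model output.

```python
import json
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]
Entity = Tuple[str, Tuple[int, int], Sequence[Box]]


def caption_to_record(image_id: str, caption: str, entities: List[Entity]) -> Dict:
    """Turn a grounded caption into a JSON-serializable record."""
    regions = []
    for phrase, (start, end), boxes in entities:
        regions.append({
            "phrase": phrase,
            "span": [start, end],  # character offsets into the caption
            "boxes": [list(box) for box in boxes],
        })
    return {"image_id": image_id, "caption": caption, "regions": regions}


# Hand-written sample data (hypothetical), not model output.
caption = "The golden retriever is sitting next to the red ball."
entities = [
    ("The golden retriever", (0, 20), [(0.12, 0.34, 0.67, 0.89)]),
    ("the red ball", (40, 52), [(0.71, 0.55, 0.84, 0.72)]),
]
record = caption_to_record("img-001", caption, entities)
print(json.dumps(record, indent=2))
```

Keeping the character span next to each box lets a downstream tool highlight the phrase in the caption and the region in the image together.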
Practical Use Cases
The grounding capability unlocks applications that pure VQA models cannot handle: natural language object detection (find a specific object without pre-defined classes), visual evidence for answers (the model shows you which region justified its answer), and structured data extraction (for example, all product prices from a receipt, with bounding boxes for validation).
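The receipt case can be sketched as a filter over grounded entities: keep the ones whose text looks like a price and whose box is well-formed, so each value can be checked against its region. This assumes the model returns each price as its own grounded entity, which is not guaranteed; the regular expression, the helper names, and the sample entities are all hypothetical.

```python
import re
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]
Entity = Tuple[str, Tuple[int, int], Sequence[Box]]

# Matches amounts such as "$4.99" or "12,50"; adjust for other formats.
PRICE_RE = re.compile(r"^\$?\d+(?:[.,]\d{2})?$")


def is_valid_box(box: Box) -> bool:
    """Check that the box lies inside the image and has positive area."""
    x1, y1, x2, y2 = box
    return 0.0 <= x1 < x2 <= 1.0 and 0.0 <= y1 < y2 <= 1.0


def extract_prices(entities: List[Entity]) -> List[Tuple[str, Box]]:
    """Return (price_text, box) pairs for price-like entities with valid boxes."""
    found = []
    for text, _span, boxes in entities:
        cleaned = text.strip()
        if not PRICE_RE.match(cleaned):
            continue
        for box in boxes:
            if is_valid_box(box):
                found.append((cleaned, box))
    return found


# Hand-written sample entities (hypothetical), not model output.
entities = [
    ("Coffee", (0, 6), [(0.10, 0.20, 0.40, 0.25)]),
    ("$4.99", (7, 12), [(0.70, 0.20, 0.90, 0.25)]),
    ("$12.50", (20, 26), [(0.70, 0.30, 0.90, 0.20)]),  # malformed: y2 < y1
]
print(extract_prices(entities))  # [('$4.99', (0.7, 0.2, 0.9, 0.25))]
```

Rejecting malformed boxes up front means any price that survives the filter can be cropped and shown to a reviewer.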
Comparison to Standard VQA Models
BLIP-2 and LLaVA give accurate text answers but cannot tell you where in the image they found the information. Kosmos-2 trades some raw question-answering accuracy for spatial grounding, which is a worthwhile trade for applications where location matters.