Kosmos-2: Grounded Image Understanding That Links Text to Image Regions

Microsoft's Kosmos-2 produces bounding box coordinates inline with its text output, connecting noun phrases in its response to specific regions of the image.

Mahmudul Haque Qudrati

CEO & ML Engineer

April 12, 2026

7 min read

// tags

#kosmos-2 #microsoft #grounding #bounding-box #referring-expression

Running Kosmos-2

from transformers import AutoProcessor, Kosmos2ForConditionalGeneration
from PIL import Image

# Load the model and its processor from the Hugging Face Hub.
model = Kosmos2ForConditionalGeneration.from_pretrained("microsoft/kosmos-2-patch14-224")
processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")

image = Image.open("street_scene.jpg")
# The <grounding> tag asks the model to emit location tokens alongside the text.
prompt = "<grounding>Describe this image in detail:"

inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=128)

generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
# Strip the location tokens from the text and collect the grounded entities.
processed_text, entities = processor.post_process_generation(generated_text)

# Each entity is (phrase, (start, end) character span in processed_text, list of boxes).
for entity_name, (start, end), bboxes in entities:
    print(f"Entity: {entity_name}")
    for bbox in bboxes:
        # Box coordinates are normalized to the 0-1 range.
        x1, y1, x2, y2 = bbox
        print(f"  Box: ({x1:.2f}, {y1:.2f}) -> ({x2:.2f}, {y2:.2f})")
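The boxes printed above appear to be normalized to the 0-1 range relative to the image width and height, so drawing or cropping usually needs a conversion to pixels first. The following is a minimal sketch of that step; `to_pixel_box` is a hypothetical helper, not part of the transformers API, and it assumes the (x1, y1, x2, y2) ordering used in the loop above.

```python
def to_pixel_box(bbox, image_width, image_height):
    """Scale a normalized (x1, y1, x2, y2) box to integer pixel coordinates.

    Assumes each coordinate is a fraction of the image width or height.
    """
    def clamp(value):
        # Keep small numeric overshoots inside the image bounds.
        return min(max(value, 0.0), 1.0)

    x1, y1, x2, y2 = bbox
    return (
        round(clamp(x1) * image_width),
        round(clamp(y1) * image_height),
        round(clamp(x2) * image_width),
        round(clamp(y2) * image_height),
    )


# Made-up box on a 640x480 image.
print(to_pixel_box((0.25, 0.5, 0.75, 1.0), 640, 480))  # (160, 240, 480, 480)
```

With a PIL image, `image.size` is a (width, height) tuple, so a call such as `to_pixel_box(bbox, *image.size)` should fit the loop in the example above.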

Referring Expression Comprehension

Kosmos-2 supports the inverse task: given a natural language description of a region, locate it. "Find the person wearing a red hat" returns a bounding box - no object detection pipeline or separate grounding model needed.
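As a rough sketch of how this might look in code: the Hugging Face model card for Kosmos-2 shows referring expression prompts wrapped in `<grounding><phrase> ... </phrase>` tags, and the example below assumes that format. `build_referring_prompt` and `locate` are hypothetical helper names, and `model` and `processor` are assumed to be the objects loaded in the Running Kosmos-2 section.

```python
def build_referring_prompt(description):
    """Wrap a region description in grounding and phrase tags (assumed format)."""
    return f"<grounding><phrase> {description.strip()}</phrase>"


def locate(description, image, model, processor, max_new_tokens=64):
    """Return the boxes the model predicts for `description`.

    Hypothetical helper: `model` and `processor` are passed in by the caller,
    so nothing is downloaded or loaded here.
    """
    prompt = build_referring_prompt(description)
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    _, entities = processor.post_process_generation(text)
    # Collect the boxes of every grounded entity in the output.
    return [bbox for _, _, bboxes in entities for bbox in bboxes]


print(build_referring_prompt("a person wearing a red hat"))
# <grounding><phrase> a person wearing a red hat</phrase>
```

A call such as `locate("a person wearing a red hat", image, model, processor)` would then return a list of normalized boxes, which may be empty if the model grounds nothing.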

Phrase Grounding in Captions

When generating captions, Kosmos-2 links the noun phrases it grounds to their corresponding regions. This creates structured captions that are both human-readable and machine-parseable - useful for building image search indices, generating accessibility descriptions, or populating product databases from catalog images.
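To make "machine-parseable" concrete, here is one possible way to flatten a grounded caption into records for a search index. This is a sketch only: the record layout, the `entities_to_records` helper, and the `image_id` field are assumptions for illustration, and the sample data is made up in the same (phrase, span, boxes) shape as the entities list from the Running Kosmos-2 example.

```python
import json


def entities_to_records(caption, entities, image_id):
    """Flatten grounded entities into one JSON-serializable record per box."""
    records = []
    for phrase, (start, end), bboxes in entities:
        for bbox in bboxes:
            records.append({
                "image_id": image_id,
                "phrase": phrase,
                "span": [start, end],  # character offsets into the caption
                "bbox": [round(v, 4) for v in bbox],
                "caption": caption,
            })
    return records


# Made-up example data, not real model output.
caption = "A dog next to a red ball."
entities = [
    ("A dog", (0, 5), [(0.1, 0.2, 0.5, 0.9)]),
    ("a red ball", (14, 24), [(0.6, 0.7, 0.8, 0.9)]),
]
records = entities_to_records(caption, entities, image_id="img-001")
print(json.dumps(records[0], indent=2))
```

Emitting one record per box rather than per phrase keeps each row self-contained, which tends to be simpler to index when a phrase such as "two dogs" maps to several regions.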

Practical Use Cases

The grounding capability unlocks applications that pure VQA models cannot handle: natural language object detection (find a specific object without pre-defined classes), visual evidence for answers (the model shows you which region justified its answer), and structured data extraction (for example, all product prices from a receipt, with bounding boxes for validation).

Comparison to Standard VQA Models

BLIP-2 and LLaVA give accurate text answers but cannot tell you where in the image they found the information. Kosmos-2 trades some raw question-answering accuracy for spatial grounding - a worthwhile trade for applications where location matters.
