Vision Comes to the Llama Family
Meta's Llama 3.2 is the first Llama release to include multimodal models. The lineup spans four sizes:
- 1B — text-only, designed for on-device mobile inference
- 3B — text-only, edge deployment with stronger reasoning
- 11B — vision + text, single GPU friendly (24GB VRAM)
- 90B — vision + text, highest capability, requires multi-GPU
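The vision models process image features through cross-attention layers added to the language backbone. Meta describes these layers as an adapter trained alongside a pre-trained image encoder, an approach that leaves the text model's own weights unchanged.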
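To make the cross-attention idea concrete, here is a toy sketch in plain Python. It is not Meta's implementation: it assumes a single head, identity projections instead of learned ones, and a hypothetical scalar `gate` on the residual connection. Text hidden states act as queries, and image features act as keys and values.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(text_states, image_feats):
    """Single-head cross-attention with identity projections (toy simplification).

    Queries come from the text hidden states; keys and values come from the
    image features. Returns one image-informed vector per text token.
    """
    d = len(image_feats[0])
    out = []
    for q in text_states:
        # Scaled dot-product score of this text token against every image patch
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in image_feats]
        weights = softmax(scores)
        # Weighted sum of the image features (the "values")
        out.append([sum(w * v[j] for w, v in zip(weights, image_feats))
                    for j in range(d)])
    return out

def fuse(text_states, image_feats, gate=0.5):
    """Residual add of the attended image signal; `gate` is a hypothetical scalar."""
    attended = cross_attention(text_states, image_feats)
    return [[t + gate * a for t, a in zip(ts, at)]
            for ts, at in zip(text_states, attended)]

text_states = [[1.0, 0.0], [0.0, 1.0]]               # 2 text tokens, dim 2
image_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 image patches, dim 2
fused = fuse(text_states, image_feats)
print(len(fused), len(fused[0]))
```

Because the image signal enters only through the added layers, setting `gate` to zero returns the text states unchanged, which is one way to see why this style of integration can leave text-only behavior intact.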
MMMU Benchmark
The Massive Multidisciplinary Multimodal Understanding (MMMU) benchmark tests vision models on college-level questions across 30 subjects requiring both image understanding and domain knowledge:
| Model | MMMU Score |
|-------|------------|
| Llama 3.2 90B Vision | 60.3% |
| Llama 3.2 11B Vision | 50.7% |
| GPT-4o mini | 59.4% |
| Claude 3 Haiku | 50.2% |
The 90B is on par with GPT-4o mini, and the 11B is on par with Claude 3 Haiku. Self-hosting removes per-token API fees, though you still pay for the hardware.
Running With Ollama
# 11B vision model — runs on RTX 4090 (24GB) or M2/M3 Max Mac
ollama pull llama3.2-vision:11b
# 90B vision model — requires 2-4x A100s
ollama pull llama3.2-vision:90b
# Text-only on-device variants
ollama pull llama3.2:1b
ollama pull llama3.2:3b
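As a rough aid for choosing among these tags, the sketch below maps a memory budget to a model. The function name and the thresholds are illustrative assumptions, not official requirements: 24GB for the 11B comes from the note above, 80GB for the 90B assumes two 40GB A100s as the low end of "2-4x A100s", and the 4GB cutoff between 1B and 3B is a guess.

```python
def pick_llama32_tag(needs_vision: bool, vram_gb: float) -> str:
    """Return an Ollama tag for Llama 3.2 given a rough memory budget.

    Thresholds are illustrative assumptions, not official requirements.
    """
    if needs_vision:
        if vram_gb >= 80:   # assumed: two 40GB A100s
            return "llama3.2-vision:90b"
        if vram_gb >= 24:   # single 24GB card, per the note above
            return "llama3.2-vision:11b"
        raise ValueError("no Llama 3.2 vision model fits this budget under these assumptions")
    # Text-only: assumed 4GB cutoff between the 1B and 3B variants
    return "llama3.2:3b" if vram_gb >= 4 else "llama3.2:1b"

tag = pick_llama32_tag(needs_vision=True, vram_gb=24)
print(f"ollama pull {tag}")  # prints: ollama pull llama3.2-vision:11b
```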
Using Vision Capabilities
import ollama

# Image analysis
response = ollama.chat(
    model="llama3.2-vision:11b",
    messages=[
        {
            "role": "user",
            "content": "What does this diagram show? Describe the data flow.",
            "images": ["path/to/architecture-diagram.png"]
        }
    ]
)
print(response["message"]["content"])
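If you would rather not depend on the Python client, the same request can go to Ollama's local REST endpoint, which expects images as base64 strings. The sketch below assumes a server on the default `http://localhost:11434`; the helper names `build_chat_payload` and `send_chat` are hypothetical, and the image bytes are a placeholder rather than a real file.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumed default local address

def build_chat_payload(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Build a non-streaming /api/chat request body with one base64-encoded image."""
    return {
        "model": model,
        "stream": False,
        "messages": [{
            "role": "user",
            "content": prompt,
            "images": [base64.b64encode(image_bytes).decode("ascii")],
        }],
    }

def send_chat(payload: dict, url: str = OLLAMA_URL) -> str:
    """POST the payload to a running Ollama server and return the reply text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

# Placeholder bytes stand in for open("diagram.png", "rb").read()
payload = build_chat_payload(
    "llama3.2-vision:11b", "Describe the data flow.", b"\x89PNG placeholder"
)
print(sorted(payload.keys()))
```

Calling `send_chat(payload)` requires a running server with the model already pulled, so it is left out of the example run.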
Document Understanding
Llama 3.2 Vision handles:
- Scanned documents: extract text, tables, and structure from page images (PDFs must be rendered to images first, since the model takes images as input)
- Charts and graphs: read data values and describe trends
- Screenshots: analyze UI, identify errors, extract information
- Photographs: describe content, identify objects, read text
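# Via Hugging Face transformers
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor
from PIL import Image

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bf16 halves memory versus the float32 default
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("invoice.png")
# The Instruct model expects its chat template, which inserts the image token
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Extract all line items and totals from this invoice."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
# Move the inputs to the model's device before generating
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0]))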
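The decoded output above is free text, so a downstream pipeline usually needs to pull structured data out of it. One common approach, sketched here, is to ask the model for JSON in the prompt and then parse the reply defensively. The helper `extract_json` is hypothetical, and the sample reply is hand-written for illustration, not real model output.

```python
import json
import re

FENCE = "`" * 3  # Markdown code fence, which models often wrap around JSON

def extract_json(reply: str):
    """Return the JSON object found in a model reply, or None if none parses."""
    # Prefer the contents of a fenced block if the reply contains one
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, reply, re.DOTALL)
    candidate = fenced.group(1) if fenced else reply
    # Take the span from the first "{" to the last "}"
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(candidate[start:end + 1])
    except json.JSONDecodeError:
        return None

# Hand-written sample reply (illustrative only)
reply = (
    "Here is the data:\n" + FENCE + "json\n"
    '{"total": 42.5, "items": [{"name": "Widget", "qty": 2}]}\n' + FENCE
)
invoice = extract_json(reply)
print(invoice["total"])  # prints: 42.5
```

Returning `None` instead of raising lets the caller decide whether to retry the prompt or fall back to manual review.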
On-Device With 1B and 3B
The 1B and 3B text-only models are optimized for mobile via ExecuTorch (PyTorch's on-device inference framework, developed by Meta). Quantized, they fit in roughly 500MB-1.5GB of device memory, which makes them practical for iOS and Android applications that need local inference without a network call.
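A quick back-of-the-envelope calculation shows where memory figures like these come from. The sketch below counts weights only and ignores the KV cache, activations, and runtime overhead; the parameter counts are approximate values assumed from the public model cards.

```python
# Approximate parameter counts (assumed from the public model cards)
PARAMS = {"llama3.2:1b": 1.23e9, "llama3.2:3b": 3.21e9}

def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

for name, n in PARAMS.items():
    for bits in (16, 8, 4):
        print(f"{name} at {bits}-bit: {weight_memory_gb(n, bits):.2f} GB")
```

Under these assumptions, 16-bit weights alone come to about 2.5GB for the 1B model, so a sub-gigabyte footprint implies roughly 4-bit quantization.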
Summary
Llama 3.2 brings competitive vision capability to the open-weights ecosystem. The 11B vision model is particularly compelling: it runs on a single GPU, is licensed for commercial use, and scores on par with Claude 3 Haiku on MMMU, while the 90B is on par with GPT-4o mini. Get the weights at Hugging Face and read the release post at Meta AI.