Llama 3.2 Vision: Meta's First Multimodal Open-Source Model

Llama 3.2 introduces vision capability to the Llama family with 11B and 90B vision models, plus 1B and 3B text-only variants for on-device deployment.

Mahmudul Haque Qudrati

CEO & ML Engineer

April 18, 2026

7 min read

// tags

#llama-3.2 #meta #vision #multimodal #open-source

Running With Ollama

# 11B vision model - runs on an RTX 4090 (24 GB) or an M2/M3 Max Mac
ollama pull llama3.2-vision:11b

# 90B vision model - requires 2-4x A100s
ollama pull llama3.2-vision:90b

# Text-only on-device variants
ollama pull llama3.2:1b
ollama pull llama3.2:3b

Using Vision Capabilities

import ollama

# Image analysis
response = ollama.chat(
    model="llama3.2-vision:11b",
    messages=[
        {
            "role": "user",
            "content": "What does this diagram show? Describe the data flow.",
            "images": ["path/to/architecture-diagram.png"]
        }
    ]
)
print(response["message"]["content"])
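A bad image path is an easy mistake with the call above, so it can help to validate the file before sending anything. The following is a minimal sketch, not part of the Ollama client: `build_vision_message` and `describe_image` are hypothetical helper names introduced here for illustration, and the sketch assumes a local Ollama server with the model already pulled.

```python
from pathlib import Path


def build_vision_message(prompt: str, image_paths: list[str]) -> dict:
    """Build one user message for ollama.chat, checking that each image file exists."""
    missing = [p for p in image_paths if not Path(p).is_file()]
    if missing:
        raise FileNotFoundError(f"Image file(s) not found: {missing}")
    return {"role": "user", "content": prompt, "images": list(image_paths)}


def describe_image(prompt: str, image_path: str, model: str = "llama3.2-vision:11b") -> str:
    """Send one image to a locally running Ollama server and return the reply text."""
    import ollama  # imported lazily so build_vision_message works without the client installed

    response = ollama.chat(
        model=model,
        messages=[build_vision_message(prompt, [image_path])],
    )
    return response["message"]["content"]
```

Calling `describe_image("Describe the data flow.", "path/to/architecture-diagram.png")` then raises a clear `FileNotFoundError` locally if the path is wrong, before any request is made.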

Document Understanding

Llama 3.2 Vision handles:

Scanned documents - extract text, tables, and structure from page images (render PDFs to images first)
Charts and graphs - read data values and describe trends
Screenshots - analyze UI, identify errors, extract information
Photographs - describe content, identify objects, read text

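# Via Hugging Face transformers (Mllama support requires transformers >= 4.45)
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor
from PIL import Image

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision so the 11B model fits on one GPU
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("invoice.png")
# The instruct model expects a chat-formatted prompt containing an image placeholder
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Extract all line items and totals from this invoice."}
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
# Move the inputs to the same device as the model before generating
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))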
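Extraction output comes back as free-form text, so downstream code usually needs to pull structured data out of it. One possible approach, sketched here under the assumption that the prompt asks the model to answer with a JSON object, is to scan the reply for the first parseable object. `extract_json_object` is a hypothetical helper, and the sample reply and its field names are made up for illustration.

```python
import json
from typing import Optional


def extract_json_object(text: str) -> Optional[dict]:
    """Return the first JSON object embedded in model output, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores trailing text
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)
    return None


# Illustrative reply; real model output will differ
reply = 'Here is the data:\n{"line_items": [{"description": "Widget", "amount": 12.5}], "total": 12.5}\nDone.'
parsed = extract_json_object(reply)
print(parsed["total"] if parsed else "no JSON found")
```

Returning `None` instead of raising keeps the caller in control of retries, which matters because vision models do not always follow formatting instructions.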

On-Device With 1B and 3B

The 1B and 3B text-only models are optimized for mobile via ExecuTorch (Meta's mobile inference framework). They fit in roughly 500 MB to 1.5 GB of device memory, which makes them practical for iOS and Android applications that need local inference without a network call.
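A back-of-the-envelope calculation shows where a memory range like this can come from. The sketch below assumes 4-bit quantized weights and nominal parameter counts of 1e9 and 3e9; the article does not state the precision, so treat both as assumptions. It counts weights only and ignores the KV cache and runtime overhead.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Nominal sizes (assumed); compare 16-bit weights against 4-bit quantization
for name, params in [("1B", 1e9), ("3B", 3e9)]:
    for bits in (16, 4):
        print(f"{name} at {bits}-bit: ~{weight_memory_gb(params, bits):.1f} GB")
```

Under these assumptions the 4-bit figures come out to about 0.5 GB and 1.5 GB, while 16-bit weights would need four times as much.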

Summary

Llama 3.2 brings competitive vision capability to the open-source ecosystem. The 11B vision model is particularly compelling: it runs on a single GPU, is licensed for commercial use, and scores 50.7% on MMMU, just ahead of Claude 3 Haiku (50.2%). Matching GPT-4o mini (60.0%) takes the 90B model, at 60.3%. Get the weights at Hugging Face and read the release post at Meta AI.

MMMU Benchmark

Model	MMMU Score
Llama 3.2 90B Vision	60.3%
Llama 3.2 11B Vision	50.7%
GPT-4o mini	60.0%
Claude 3 Haiku	50.2%
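
To see which models are actually close to each other, the gaps can be computed directly from the table. This is a small sketch that uses only the four scores listed above; `gap` is a helper name introduced here for illustration.

```python
# MMMU scores copied from the table above (percent)
mmmu = {
    "Llama 3.2 90B Vision": 60.3,
    "Llama 3.2 11B Vision": 50.7,
    "GPT-4o mini": 60.0,
    "Claude 3 Haiku": 50.2,
}


def gap(model: str, baseline: str) -> float:
    """Score difference in percentage points, rounded to one decimal place."""
    return round(mmmu[model] - mmmu[baseline], 1)


print(gap("Llama 3.2 90B Vision", "GPT-4o mini"))     # 0.3
print(gap("Llama 3.2 11B Vision", "Claude 3 Haiku"))  # 0.5
print(gap("Llama 3.2 11B Vision", "GPT-4o mini"))     # -9.3
```

On these numbers the 90B model pairs with GPT-4o mini and the 11B model pairs with Claude 3 Haiku.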
