Pixtral 12B: Mistral's First Vision-Language Model

Pixtral 12B processes images at their native resolution and aspect ratio instead of resizing them to a fixed input size, scores 52.5% on MMMU, and is available under an Apache 2.0 license.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 9, 2026

7 min read

Benchmark Results

Benchmark	Pixtral 12B	LLaVA-1.6 34B	InternVL2 8B	GPT-4o mini
MMMU	52.5%	49.9%	51.2%	60.0%
MathVista	58.0%	46.5%	54.7%	52.4%
ChartQA	81.8%	65.5%	83.3%	85.7%
DocVQA	90.1%	78.2%	91.5%	88.5%

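Pixtral 12B beats LLaVA-1.6 34B (nearly 3x larger) on all four benchmarks and posts the highest MathVista score in the table, a strong result for a 12B model. It is not ahead everywhere: InternVL2 8B edges it out on ChartQA and DocVQA, and GPT-4o mini leads on MMMU and ChartQA.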

Using the Mistral API

from mistralai import Mistral
import base64

client = Mistral(api_key="your-api-key")

# Load image as base64
with open("chart.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.complete(
    model="pixtral-12b-2409",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": f"data:image/png;base64,{image_b64}"
                },
                {
                    "type": "text",
                    "text": "Extract all data values from this bar chart as a JSON array."
                }
            ]
        }
    ]
)
print(response.choices[0].message.content)
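The example above hard-codes image/png in the data URL, which is wrong for JPEG or WebP inputs. A minimal sketch of one way to handle other formats, using only the standard library: the helper names to_data_url and image_message are hypothetical, not part of the Mistral SDK, and the message shape simply mirrors the call above.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(path: str) -> str:
    """Encode an image file as a base64 data URL with a guessed MIME type."""
    # guess_type works from the file extension, not the file contents.
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognised image type: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_message(path: str, prompt: str) -> dict:
    """Build one user message in the same shape as the example above."""
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": to_data_url(path)},
            {"type": "text", "text": prompt},
        ],
    }
```

The result of image_message("chart.jpg", "...") can be passed as the single element of the messages list in client.chat.complete.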

Self-Hosting With vLLM

pip install vllm

# Start vLLM server with Pixtral
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Pixtral-12B-2409 \
    --tokenizer_mode mistral \
    --config_format mistral \
    --load_format mistral \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.9

The vLLM server exposes an OpenAI-compatible endpoint, so existing OpenAI-client code only needs its base_url pointed at the server (http://localhost:8000/v1 by default).
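As a rough illustration of what a request to that endpoint can look like, here is a sketch that uses only the Python standard library. It assumes the server is listening on vLLM's default address and that the model is served under the name passed to --model; build_chat_request and send_chat_request are hypothetical helper names. Note that the OpenAI-style schema nests the image URL inside an object, unlike the plain string used in the Mistral SDK example.

```python
import json
import urllib.request

# Assumption: vLLM's default listen address; adjust if you set --host or --port.
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(image_url: str, prompt: str,
                       model: str = "mistralai/Pixtral-12B-2409") -> dict:
    """Build an OpenAI-style chat payload with one text part and one image part."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                # OpenAI schema: the URL (http(s) or data URL) sits inside an object.
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        "max_tokens": 512,
    }


def send_chat_request(payload: dict, base_url: str = BASE_URL) -> str:
    """POST the payload to /chat/completions and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, send_chat_request(build_chat_request(url, "Describe this chart.")) returns the model's reply as a string.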

Use Cases

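Invoice and receipt processing: Extract line items, totals, and vendor details from scanned documents. The 90.1% DocVQA score above is Pixtral's highest of the four benchmarks, which suggests document reading is one of its stronger areas.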

Technical diagram analysis: Parse architecture diagrams, circuit schematics, or flowcharts and convert them to structured descriptions.

Chart data extraction: Pull numeric values from bar charts, line graphs, and tables for automated reporting pipelines.

Screenshot understanding: Analyze UI screenshots for accessibility audits, bug reports, or automated testing.
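Extraction use cases like these usually ask the model for JSON, but replies often arrive wrapped in a Markdown code fence or surrounded by a sentence of prose. The sketch below shows one way a pipeline might recover the structured part; extract_json is a hypothetical helper, and it assumes the first parseable object or array in the reply is the one you want.

```python
import json
import re

# Built programmatically so the marker does not appear literally in this block.
FENCE = "`" * 3
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json(reply: str):
    """Return the first JSON object or array found in a model reply.

    Looks inside a fenced block first if one is present, otherwise scans
    the whole reply. Raises ValueError if nothing parses.
    """
    fenced = FENCED_BLOCK.search(reply)
    text = fenced.group(1) if fenced else reply
    decoder = json.JSONDecoder()
    for i, ch in enumerate(text):
        if ch in "[{":
            try:
                # raw_decode parses one JSON value starting at index i
                # and ignores whatever follows it.
                value, _ = decoder.raw_decode(text, i)
                return value
            except json.JSONDecodeError:
                continue
    raise ValueError("no JSON found in reply")
```

Applied to the API example earlier, this would be called as extract_json(response.choices[0].message.content). Validating the parsed values against the source document is still advisable before they feed an automated report.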

Summary

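Pixtral 12B is among the strongest Apache 2.0 licensed vision models in its size class, beating the much larger LLaVA-1.6 34B on all four benchmarks above. Its variable-resolution approach is a genuine architectural improvement over models that resize every image to a fixed input resolution. Access it via the Mistral API or download the weights from Hugging Face.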
