Phi-3 Vision: Microsoft's 4.2B Multimodal Model for Edge Devices

Phi-3 Vision packs chart understanding, document analysis, and image reasoning into 4.2 billion parameters: small enough to run on a mobile device with CoreML or ONNX, yet scoring 59.8% on MMMU.

Mahmudul Haque Qudrati

CEO & ML Engineer

April 11, 2026

7 min read

// tags

#phi-3-vision #microsoft #multimodal #edge #chart-understanding

Running Locally

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor
import torch

model_id = "microsoft/Phi-3-vision-128k-instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    _attn_implementation="eager",
)

# Load a chart image
image = Image.open("quarterly_revenue_chart.png")

messages = [
    {"role": "user", "content": "<|image_1|>\nWhat is the revenue trend shown in this chart? Provide specific numbers."},
]

prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(prompt, [image], return_tensors="pt").to("cuda")

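# Greedy decoding: do_sample=False takes the place of temperature=0.0,
# which recent transformers versions warn about or reject
output = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=False,
    eos_token_id=processor.tokenizer.eos_token_id,
)

# Drop the prompt tokens before decoding so only the generated answer is printed
output = output[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])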
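
The `<|image_1|>` placeholder in the prompt above is what ties the question to the image handed to the processor. A minimal sketch of building that content string for one or more images might look like this. `build_image_prompt` is a hypothetical helper, not part of transformers, and it assumes that additional images use consecutively numbered placeholders (`<|image_2|>` and so on) in the same order as the image list.

```python
def build_image_prompt(question: str, num_images: int = 1) -> str:
    """Prefix a question with one numbered image placeholder per image.

    Hypothetical helper for illustration; assumes placeholders are numbered
    from 1 and follow the order of the images given to the processor.
    """
    if num_images < 1:
        raise ValueError("num_images must be at least 1")
    placeholders = "".join(f"<|image_{i}|>\n" for i in range(1, num_images + 1))
    return placeholders + question


messages = [
    {"role": "user", "content": build_image_prompt("What is the revenue trend shown in this chart?")},
]
print(messages[0]["content"])
```

Keeping the placeholder logic in one place makes it harder to end up with a prompt whose placeholder count does not match the number of images passed in.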

CoreML Export for iOS

# Install coremltools, used to work with CoreML model packages
pip install coremltools

# Sanity-check the install. This only imports the library; it does not export a model.
python -c "
import coremltools as ct
print(ct.__version__)
# Microsoft provides pre-exported CoreML packages via the ONNX Runtime iOS package
# See: https://github.com/microsoft/onnxruntime
"

Microsoft maintains official ONNX and CoreML exports of the Phi-3 family. The iOS package handles tokenization, image preprocessing, and the vision encoder; you write Swift code that calls the exported model package.

Azure AI Integration

Phi-3 Vision is available as a managed endpoint on Azure AI Studio with serverless deployment (pay-per-token, no reserved capacity):

import os
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import UserMessage, ImageContentItem, ImageUrl, TextContentItem
from azure.core.credentials import AzureKeyCredential

# Endpoint URL and key are read from the environment rather than hard-coded
client = ChatCompletionsClient(
    endpoint=os.environ["AZURE_AI_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["AZURE_AI_KEY"]),
)

# One user turn that combines a text instruction with an image reference
response = client.complete(
    messages=[
        UserMessage(content=[
            TextContentItem(text="Extract all data points from this chart as a table:"),
            ImageContentItem(image_url=ImageUrl(url="https://your-storage.blob.core.windows.net/chart.png")),
        ])
    ],
    model="Phi-3-vision-128k-instruct",
)
print(response.choices[0].message.content)
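
The request above points at an image hosted in blob storage. When the chart only exists as a local file, one option is to inline it as a base64 data URL. The sketch below uses only the standard library; `image_to_data_url` is a hypothetical helper, and it assumes the endpoint accepts a data URL wherever it accepts an https URL.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL.

    Hypothetical helper for illustration; assumes the endpoint accepts
    data URLs in place of an https image URL.
    """
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Not a recognized image type: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string would go where the blob storage URL appears in the request. Base64 inflates the payload by about a third, so large images may be better served from storage.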

When to Choose Phi-3 Vision

Phi-3 Vision is a good fit when:

You need multimodal capability on a device with less than 8 GB of VRAM
Document and chart understanding is the primary use case
You need on-device inference for privacy (no data leaves the device)
You want Azure-native deployment with Microsoft's compliance certifications
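
To put the 8 GB figure in context, here is a back-of-the-envelope estimate of how much memory the weights alone need at different precisions. It is a rough sketch: the bytes-per-parameter values are the usual nominal sizes, and activations, the KV cache, and runtime overhead are deliberately left out.

```python
PARAMS = 4.2e9  # parameter count quoted in this article

# Nominal bytes per parameter for common weight formats (weights only;
# activations, KV cache, and runtime overhead are not included).
BYTES_PER_PARAM = {"float32": 4.0, "bfloat16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory needed to hold the weights, in GiB."""
    return num_params * bytes_per_param / 2**30


for dtype, size in BYTES_PER_PARAM.items():
    print(f"{dtype:>8}: {weight_memory_gib(PARAMS, size):.1f} GiB")
```

Under these assumptions, bfloat16 weights come to roughly 7.8 GiB before any working memory is counted, so staying under 8 GB in practice likely means using a quantized build.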

Benchmark Numbers

Benchmark	Phi-3 Vision 4.2B	LLaVA-1.5 7B	GPT-4V
MMMU	59.8%	36.2%	75.1%
TextVQA	70.9%	58.2%	78.0%
DocVQA	82.0%	29.0%	87.2%
ChartQA	81.4%	18.2%	78.1%
