Why Size Matters: The Edge Deployment Case
Many open multimodal models need 7B or more parameters to perform usefully on chart and document understanding. Phi-3 Vision achieves competitive results at 4.2B total parameters (3.8B language model + 0.4B CLIP vision encoder), making it viable for:
- On-device inference on iPhone (Apple Neural Engine via CoreML)
- Android deployment (ONNX Runtime with NNAPI)
- Embedded systems with limited VRAM
- Air-gapped environments where data cannot leave the device
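To make the size argument concrete, here is a back-of-the-envelope sketch of weight memory at different precisions. It assumes the 4.2B parameter count quoted above and counts only the weights; activations, the KV cache, and runtime overhead are left out, so real usage will be higher. The function name and the precision table are illustrative.

```python
# Rough weight-memory estimate; ignores activations, KV cache, and runtime overhead.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

PHI3_VISION_PARAMS = 4.2e9  # total parameter count quoted in this article


def weight_memory_gib(n_params: float, precision: str) -> float:
    """Return the approximate size of the model weights in GiB."""
    return n_params * BYTES_PER_PARAM[precision] / (1024 ** 3)


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        size = weight_memory_gib(PHI3_VISION_PARAMS, precision)
        print(f"{precision}: {size:.1f} GiB")
```

Under these assumptions the bf16 weights alone come to roughly 7.8 GiB, so devices with less than 8 GB of memory would generally need a quantized variant.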
Benchmark Numbers
| Benchmark | Phi-3 Vision 4.2B | LLaVA-1.5 7B | GPT-4V |
|---|---|---|---|
| MMMU | 59.8% | 36.2% | 75.1% |
| TextVQA | 70.9% | 58.2% | 78.0% |
| DocVQA | 82.0% | 29.0% | 87.2% |
| ChartQA | 81.4% | 18.2% | 78.1% |
The most striking result is DocVQA (document visual question answering): 82.0% vs LLaVA-1.5 7B's 29.0%. Phi-3 Vision's training data included heavy emphasis on documents and charts, which is exactly the content mix that enterprise users encounter most frequently.
On ChartQA, Phi-3 Vision (81.4%) outperforms GPT-4V (78.1%), a notable result for a model with only 4.2B parameters.
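The comparisons above are differences in percentage points. As a small sketch, the snippet below copies the scores from the benchmark table in this article and computes Phi-3 Vision's margin over a chosen baseline. The dictionary keys and the `margin` helper are illustrative names, not part of any benchmark tooling.

```python
# Scores copied from the benchmark table in this article (percent).
SCORES = {
    "MMMU": {"phi3_vision": 59.8, "llava_1_5_7b": 36.2, "gpt_4v": 75.1},
    "TextVQA": {"phi3_vision": 70.9, "llava_1_5_7b": 58.2, "gpt_4v": 78.0},
    "DocVQA": {"phi3_vision": 82.0, "llava_1_5_7b": 29.0, "gpt_4v": 87.2},
    "ChartQA": {"phi3_vision": 81.4, "llava_1_5_7b": 18.2, "gpt_4v": 78.1},
}


def margin(benchmark: str, baseline: str) -> float:
    """Phi-3 Vision's lead over `baseline` in percentage points (negative means behind)."""
    row = SCORES[benchmark]
    return round(row["phi3_vision"] - row[baseline], 1)


if __name__ == "__main__":
    for name in SCORES:
        print(name, margin(name, "llava_1_5_7b"), margin(name, "gpt_4v"))
```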
Running Locally
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor
import torch

model_id = "microsoft/Phi-3-vision-128k-instruct"

# trust_remote_code is required because the model ships custom modeling code
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    _attn_implementation="eager",  # use "flash_attention_2" if flash-attn is installed
)

# Load a chart image
image = Image.open("quarterly_revenue_chart.png")

messages = [
    {"role": "user", "content": "<|image_1|>\nWhat is the revenue trend shown in this chart? Provide specific numbers."},
]
prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(prompt, [image], return_tensors="pt").to("cuda")

# Greedy decoding for deterministic output
output = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Drop the prompt tokens so only the newly generated answer is decoded
answer_ids = output[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(answer_ids, skip_special_tokens=True)[0])
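The prompt above refers to its image through the `<|image_1|>` placeholder. The helper below is a hypothetical convenience, not part of the `transformers` API, for building the user message when a prompt carries more than one image. It assumes placeholders are numbered from 1 in the order the images are passed to the processor, and it only does string formatting.

```python
def build_user_message(question: str, n_images: int = 1) -> dict:
    """Build a chat message whose content starts with one numbered placeholder per image."""
    if n_images < 1:
        raise ValueError("n_images must be at least 1")
    # One placeholder per image, each on its own line, followed by the question.
    placeholders = "".join(f"<|image_{i}|>\n" for i in range(1, n_images + 1))
    return {"role": "user", "content": placeholders + question}


if __name__ == "__main__":
    print(build_user_message("Compare the two charts.", n_images=2))
```

The returned dictionary can stand in for the hand-written entry in `messages`, with the matching list of images passed to the processor.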
CoreML Export for iOS
# Install coremltools and confirm that it imports (this step does not export the model)
pip install coremltools
python -c "import coremltools as ct; print(ct.__version__)"
# For pre-exported model packages, see: https://github.com/microsoft/onnxruntime
Microsoft publishes official ONNX exports of the Phi-3 family, and ONNX Runtime can target Apple hardware through its CoreML execution provider. The iOS package handles tokenization, image preprocessing, and the vision encoder; you write Swift code that calls the exported model package.
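The deployment targets listed earlier pair iOS with CoreML and Android with NNAPI. When prototyping with ONNX Runtime from Python, backend selection could look like the sketch below. The provider strings are ONNX Runtime execution provider identifiers; the `PREFERRED` table and `choose_providers` function are hypothetical names, and the preference order is an assumption.

```python
# Preferred execution providers per platform, most specific first.
PREFERRED = {
    "ios": ["CoreMLExecutionProvider", "CPUExecutionProvider"],
    "android": ["NnapiExecutionProvider", "CPUExecutionProvider"],
}


def choose_providers(platform: str, available: list[str]) -> list[str]:
    """Keep the preferred providers that this ONNX Runtime build actually offers."""
    chosen = [p for p in PREFERRED[platform] if p in available]
    # Fall back to CPU if none of the preferred providers is available.
    return chosen or ["CPUExecutionProvider"]


if __name__ == "__main__":
    print(choose_providers("ios", ["CoreMLExecutionProvider", "CPUExecutionProvider"]))
```

With `onnxruntime` installed, `available` would typically come from `onnxruntime.get_available_providers()`, and the result can be passed as the `providers` argument of `onnxruntime.InferenceSession`.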
Azure AI Integration
Phi-3 Vision is available as a managed endpoint on Azure AI Studio with serverless deployment (pay-per-token, no reserved capacity):
import os
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import UserMessage, ImageContentItem, ImageUrl, TextContentItem
from azure.core.credentials import AzureKeyCredential

# Endpoint URL and key are read from the environment
client = ChatCompletionsClient(
    endpoint=os.environ["AZURE_AI_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["AZURE_AI_KEY"]),
)

# Send one user turn that combines a text instruction with an image URL
response = client.complete(
    messages=[
        UserMessage(content=[
            TextContentItem(text="Extract all data points from this chart as a table:"),
            ImageContentItem(image_url=ImageUrl(url="https://your-storage.blob.core.windows.net/chart.png")),
        ])
    ],
    model="Phi-3-vision-128k-instruct",
)
print(response.choices[0].message.content)
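The request above points at an image hosted in blob storage. When the image exists only on the local machine, one common approach is to inline it as a base64 data URL. The sketch below assumes the deployed endpoint accepts data URLs in the image `url` field, which is worth verifying for your deployment; `image_data_url` is a hypothetical helper, not part of the Azure SDK.

```python
import base64


def image_data_url(image_bytes: bytes, image_format: str = "png") -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:image/{image_format};base64,{encoded}"


if __name__ == "__main__":
    # Placeholder bytes for illustration; in practice read them from an image file.
    print(image_data_url(b"example-bytes"))
```

The resulting string would replace the `https://` URL in the image content item.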
When to Choose Phi-3 Vision
- You need multimodal capability on a device with less than 8GB VRAM
- Document and chart understanding is the primary use case
- You need on-device inference for privacy (no data leaves the device)
- You want Azure-native deployment with Microsoft's compliance certifications