Small Models With Textbook Training
Microsoft Phi-4 continues the Phi series' thesis: model quality is determined more by training data quality than by raw parameter count. Phi-4's 14 billion parameters achieve remarkable STEM reasoning by training primarily on synthetically generated "textbook-quality" problem-solution pairs rather than raw web scrapes.
MATH benchmark: 80.4% vs GPT-4o at 76.6%. A 14B model beating a frontier model on mathematical reasoning is a meaningful signal about what's possible with careful data engineering.
Architecture and Training
- 14 billion parameters — fits on a single RTX 4090 (24GB) in FP16, or a single 3090 in INT4
- 16k token context window — smaller than many competitors but sufficient for most STEM tasks
- Training data: ~9.8 trillion tokens, heavily weighted toward synthetic STEM content
- Azure AI deployment for production, HuggingFace for research
The full technical report details the synthetic data pipeline, which generates progressively harder problems across mathematics, physics, chemistry, and computer science.
Benchmark Results
| Benchmark | Phi-4 (14B) | GPT-4o | Llama 3.1 70B | Gemma 2 27B | |-----------|-------------|--------|---------------|-------------| | MATH | 80.4% | 76.6% | 73.8% | 56.0% | | MMLU | 84.8% | 88.7% | 83.6% | 75.2% | | HumanEval | 82.6% | 90.2% | 80.5% | 72.0% | | GPQA | 56.1% | 53.6% | 46.7% | 42.1% |
Note that MMLU is the one area where GPT-4o clearly leads — Phi-4's knowledge breadth is narrower than frontier models, but within STEM it outperforms them.
Running on Azure
from azure.ai.inference import ChatCompletionsClient
from azure.core.credentials import AzureKeyCredential
from azure.ai.inference.models import SystemMessage, UserMessage
client = ChatCompletionsClient(
endpoint="https://your-endpoint.services.ai.azure.com/models",
credential=AzureKeyCredential("your-key"),
)
response = client.complete(
model="Phi-4",
messages=[
SystemMessage("You are a helpful STEM tutor."),
UserMessage("Solve this differential equation: dy/dx = 2xy, y(0) = 1")
],
max_tokens=1024,
temperature=0.1,
)
print(response.choices[0].message.content)
Running Locally
# Via Ollama
ollama pull phi4
ollama run phi4 "Prove the Pythagorean theorem using similar triangles."
# Via HuggingFace transformers
pip install transformers torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4")
model = AutoModelForCausalLM.from_pretrained(
"microsoft/phi-4",
torch_dtype=torch.bfloat16,
device_map="auto"
)
inputs = tokenizer("Calculate the eigenvalues of [[2,1],[1,2]]:", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=512, temperature=0.1)
print(tokenizer.decode(output[0], skip_special_tokens=True))
When to Choose Phi-4
Phi-4 is the right choice for:
- Math tutoring applications
- Scientific computing assistance
- STEM homework help platforms
- Code generation for algorithmic problems
- Any application where MATH/reasoning performance matters more than broad knowledge
Summary
Phi-4 proves that the frontier isn't defined by model size. For STEM-focused applications, it's faster, cheaper, and more accurate than models 5-6x its size. Access it via Azure or download from HuggingFace.