When Should You Fine-Tune?
Fine-tuning an OpenAI model costs money upfront (training) and on an ongoing basis (for GPT-4o mini, inference on a fine-tuned model costs about 2x the base model's rate). Before starting, you need a clear answer to one question: what does fine-tuning give me that I can't get from a well-engineered prompt?
The cases where fine-tuning consistently wins:
- Consistent output format: if you need JSON in a very specific schema, fine-tuning can push adherence close to 100%, where prompting alone often lands around 85-90%.
- Style and tone consistency: reliably mimicking a specific writing voice usually requires fine-tuning. Few-shot examples tend to degrade when inputs drift away from the examples; a fine-tuned model is generally more robust, though not immune.
- Shorter prompts at inference: a fine-tuned model has learned behaviors that would otherwise require lengthy system prompts. This cuts latency and can cut per-call cost, provided the token savings outweigh the higher per-token price.
- Domain-specific vocabulary: handling of medical, legal, or technical terminology that the base model gets wrong can improve dramatically with even a small fine-tuning dataset.
Fine-tuning does NOT help with factual knowledge (the model doesn't learn new facts, only new behaviors) or reasoning capability.
Supported Models
As of 2026, OpenAI supports fine-tuning for: GPT-4o mini (recommended), GPT-4o (2024-08-06 and later), GPT-3.5 Turbo, and Babbage/Davinci for legacy use cases.
Dataset Format
Fine-tuning uses JSONL where each line is a complete conversation:
{"messages": [{"role": "system", "content": "You extract product names and prices from receipts. Return JSON only."}, {"role": "user", "content": "Coffee - $4.50, Bagel - $3.00"}, {"role": "assistant", "content": "{\"items\": [{\"name\": \"Coffee\", \"price\": 4.50}, {\"name\": \"Bagel\", \"price\": 3.00}]}"}]}
{"messages": [{"role": "system", "content": "You extract product names and prices from receipts. Return JSON only."}, {"role": "user", "content": "Green Tea $2.75"}, {"role": "assistant", "content": "{\"items\": [{\"name\": \"Green Tea\", \"price\": 2.75}]}"}]}
Dataset size guidelines: 50 examples to see improvement, 100-500 for solid results, 1000+ for maximum performance on complex tasks.
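Because the assistant's reply is itself JSON stored inside a JSON string, hand-written lines are easy to get wrong (the inner quotes must be escaped). The sketch below shows one way to generate and sanity-check lines before uploading. The helper names make_example and validate_line are hypothetical, and the checks assume the simple system/user/assistant chat format shown above; they are not a substitute for the validation OpenAI runs on upload.

```python
import json

# Assumption: only plain chat roles are used (no tool or function messages).
ALLOWED_ROLES = {"system", "user", "assistant"}


def make_example(system: str, user: str, assistant_payload: dict) -> str:
    """Serialize one training example as a single JSONL line."""
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
        # The payload is dumped to a string first; the outer json.dumps
        # then escapes its quotes so the whole line stays valid JSON.
        {"role": "assistant", "content": json.dumps(assistant_payload)},
    ]
    return json.dumps({"messages": messages})


def validate_line(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    problems = []
    for i, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {i}: not a JSON object")
            continue
        if message.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            problems.append(f"message {i}: content is not a string")
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message is not from the assistant")
    return problems


if __name__ == "__main__":
    line = make_example(
        "You extract product names and prices from receipts. Return JSON only.",
        "Green Tea $2.75",
        {"items": [{"name": "Green Tea", "price": 2.75}]},
    )
    print(line)
    print(validate_line(line))
```

Running validate_line over every line of the training file before uploading catches escaping mistakes locally, which is cheaper than waiting for a job to fail during file validation.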
Starting a Fine-Tuning Job
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload training data
with open("training_data.jsonl", "rb") as f:
    file = client.files.create(file=f, purpose="fine-tune")

# Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={
        "n_epochs": 3,
        "batch_size": 4,
        "learning_rate_multiplier": 2,
    },
)
print(f"Job ID: {job.id}")
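Creating the job only queues it; training runs asynchronously. A minimal polling sketch follows, assuming the client object from the snippet above. The helper name wait_for_job, the 30-second interval, and the set of terminal statuses are choices made for this illustration; the sleep function is injectable so the loop can be exercised without real delays.

```python
import time

# Assumption: these are the statuses after which a job no longer changes.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(client, job_id: str, poll_seconds: float = 30.0, sleep=time.sleep):
    """Poll a fine-tuning job until it reaches a terminal status.

    `client` is expected to expose `fine_tuning.jobs.retrieve(job_id)`,
    as the OpenAI Python client does. Returns the final job object.
    """
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)
```

Once the returned job has status "succeeded", its fine_tuned_model attribute holds the model name to pass as model= in later chat completion calls.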
Hyperparameters
- n_epochs: how many passes over the training data. Start with 3. Increase if training loss is still decreasing at the end of the run; decrease if the model starts to overfit.
- batch_size: chosen automatically by default. Increase for larger datasets (larger batches reduce gradient noise).
- learning_rate_multiplier: scales the default learning rate. Use 1-2 for most tasks; go lower (0.1-0.5) if the model catastrophically forgets.
Cost Model
Training GPT-4o mini costs $3.00 per 1M training tokens, where billed training tokens = tokens in all JSONL messages multiplied by the number of epochs. Inference on your fine-tuned model costs $0.30 per 1M input tokens and $1.20 per 1M output tokens (vs. $0.15 and $0.60 for the base model). Fine-tuning makes financial sense once the savings from shorter prompts at inference time offset the training cost plus the inference premium.
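The break-even point can be worked out directly from the prices quoted above. The sketch below is a rough estimate, not a billing calculator: it assumes training is billed per token per epoch, that output length is the same with and without fine-tuning, and it uses hypothetical token counts in the example. Prices are stored as integer cents per 1M tokens to keep the arithmetic exact.

```python
# Prices from this section, in cents per 1M tokens (GPT-4o mini).
TRAIN_PRICE = 300
BASE_INPUT, BASE_OUTPUT = 15, 60
TUNED_INPUT, TUNED_OUTPUT = 30, 120


def break_even_calls(training_tokens, n_epochs,
                     base_input_tokens, tuned_input_tokens, output_tokens):
    """Number of calls after which fine-tuning has paid for itself.

    Returns None if a fine-tuned call is not cheaper than a base call,
    in which case the training cost is never recovered.
    """
    # All amounts below are in (cents x tokens / 1M); the common factor
    # cancels in the final division, so no float rounding is involved.
    training_cost = training_tokens * n_epochs * TRAIN_PRICE
    base_call = base_input_tokens * BASE_INPUT + output_tokens * BASE_OUTPUT
    tuned_call = tuned_input_tokens * TUNED_INPUT + output_tokens * TUNED_OUTPUT
    saving_per_call = base_call - tuned_call
    if saving_per_call <= 0:
        return None
    return -(-training_cost // saving_per_call)  # ceiling division


if __name__ == "__main__":
    # Hypothetical workload: 500k-token dataset, 3 epochs, and a prompt
    # that shrinks from 2,000 to 200 input tokens with 100 output tokens.
    print(break_even_calls(500_000, 3, 2_000, 200, 100))  # 25000
    # Shrinking the prompt only from 300 to 200 tokens never pays off.
    print(break_even_calls(500_000, 3, 300, 200, 100))    # None
```

Because the fine-tuned rates are double the base rates, the prompt has to shrink by well over half before each call becomes cheaper; modest prompt reductions alone do not justify fine-tuning on cost grounds.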