What ONNX Solves
You trained a model in PyTorch. Your inference server runs a C++ service. Your mobile team needs to run it on iOS. Without a universal format, each deployment target requires a different export pipeline.
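ONNX (Open Neural Network Exchange) is an open, framework-neutral intermediate representation for ML models. Export once, then run the same file with ONNX Runtime (including its CoreML and NNAPI execution providers on iOS and Android) or import it into Intel OpenVINO or NVIDIA TensorRT. Operator coverage varies by backend, so test each target you plan to ship.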
Exporting from PyTorch
import torch
import torch.onnx

# MyModel is a placeholder for your own torch.nn.Module subclass
model = MyModel()
model.load_state_dict(torch.load("model.pt", map_location="cpu"))
model.eval()  # disable dropout and use running batch-norm statistics

dummy_input = torch.randn(1, 3, 224, 224)  # batch of 1 RGB image, 224x224

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    opset_version=17,
    input_names=["image"],
    output_names=["logits"],
    dynamic_axes={
        "image": {0: "batch_size"},  # variable batch size
        "logits": {0: "batch_size"},
    },
)

# Verify the export
import onnx

onnx_model = onnx.load("model.onnx")
onnx.checker.check_model(onnx_model)  # raises if the graph is malformed
print("ONNX model is valid")
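`onnx.checker.check_model` confirms that the graph is structurally valid, but not that the exported model computes the same numbers as the PyTorch original. A common follow-up is to run both on the same input and compare the outputs within a tolerance. The sketch below shows only the comparison step, on plain Python floats. `outputs_match`, its default tolerances, and the sample values are illustrative assumptions, not part of ONNX or PyTorch.

```python
import math
from typing import Sequence


def outputs_match(
    reference: Sequence[float],
    candidate: Sequence[float],
    rel_tol: float = 1e-3,
    abs_tol: float = 1e-5,
) -> bool:
    """Return True if two flattened output vectors agree element-wise within tolerance."""
    if len(reference) != len(candidate):
        return False
    return all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate)
    )


# Stand-in values; in practice these would be the flattened PyTorch and
# ONNX Runtime outputs for the same input tensor.
torch_logits = [2.3101, -0.5172, 0.0041]
onnx_logits = [2.3102, -0.5171, 0.0041]

print(outputs_match(torch_logits, onnx_logits))
```

With a real model, the two lists could come from the PyTorch output (computed under `torch.no_grad()` and flattened with `.flatten().tolist()`) and from the flattened ONNX Runtime result. Small differences are expected because the two runtimes use different kernels, so exact equality is the wrong check.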
Exporting from Scikit-learn
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
from sklearn.ensemble import RandomForestClassifier

# X_train and y_train are assumed to be your existing training arrays
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

# One float32 input: dynamic batch dimension (None), fixed feature count
initial_type = [("float_input", FloatTensorType([None, X_train.shape[1]]))]
onnx_model = convert_sklearn(model, initial_types=initial_type)

with open("sklearn_model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())
Exporting HuggingFace Transformers with Optimum
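# Shell: install Optimum with the ONNX Runtime extra, then export the model
pip install "optimum[onnxruntime]"
optimum-cli export onnx --model bert-base-uncased --task text-classification ./bert_onnx/

# Python: load the exported model and run it through ONNX Runtime
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = ORTModelForSequenceClassification.from_pretrained("./bert_onnx/")

# bert-base-uncased has no fine-tuned classification head; substitute a
# fine-tuned checkpoint to get meaningful predictions
inputs = tokenizer("This is great!", return_tensors="pt")
outputs = model(**inputs)  # outputs.logits holds the raw class scores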
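The call above returns raw logits rather than probabilities. Assuming the exported model has a classification head with one logit per class, a softmax turns those scores into probabilities. This sketch uses plain Python and made-up logit values. `softmax` is a helper written here for illustration, not an Optimum function.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert raw class scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a two-class model
example_logits = [-1.2, 2.4]
probs = softmax(example_logits)
predicted_class = probs.index(max(probs))

print(predicted_class, [round(p, 3) for p in probs])
```

With the model above, the input could be `outputs.logits[0].tolist()`, which gives the scores for the first (and only) sentence in the batch.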
ONNX Runtime for Fast Inference
import onnxruntime as ort
import numpy as np

# Option A: CPU inference session
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

# Option B: GPU inference (CUDA) with CPU as fallback; this replaces the session above
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Run inference
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name
result = session.run(
    [output_name],
    {input_name: np.random.randn(1, 3, 224, 224).astype(np.float32)},
)[0]
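The `providers` list is a priority order, and which providers are usable depends on the installed ONNX Runtime build. One way to make the choice explicit is to intersect a preference list with what the build reports through `onnxruntime.get_available_providers()`. The sketch below shows that selection logic on a stand-in list. `pick_providers` and the preference order are assumptions made for illustration.

```python
from typing import List, Sequence

# Preference order is an assumption for illustration; adjust for your hardware.
PREFERRED = ("CUDAExecutionProvider", "CoreMLExecutionProvider", "CPUExecutionProvider")


def pick_providers(
    available: Sequence[str], preferred: Sequence[str] = PREFERRED
) -> List[str]:
    """Keep the preferred providers that this build actually offers, in preference order."""
    chosen = [p for p in preferred if p in available]
    if not chosen:
        raise RuntimeError(
            f"None of {list(preferred)} are available: {list(available)}"
        )
    return chosen


# Stand-in for the list returned by onnxruntime.get_available_providers()
available_on_this_machine = ["CPUExecutionProvider"]

print(pick_providers(available_on_this_machine))
```

The returned list can be passed directly as the `providers` argument of `ort.InferenceSession`.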
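ONNX Runtime is often faster than eager PyTorch for CPU inference because it applies graph optimizations and kernel fusion. Speedups in the 2-5x range are commonly cited, but the gain depends on the model, the hardware, and thread settings, so benchmark your own workload.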
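Speedup figures like the one above are only meaningful when measured on your own model and hardware. A minimal timing harness, sketched here with the standard library only, runs a few untimed warm-up calls and then reports the median latency. `median_latency_ms` and the dummy workload are hypothetical names for illustration.

```python
import statistics
import time
from typing import Callable, List


def median_latency_ms(
    fn: Callable[[], object], warmup: int = 5, runs: int = 50
) -> float:
    """Call fn repeatedly and return the median wall-clock latency in milliseconds."""
    for _ in range(warmup):  # warm-up calls are not timed
        fn()
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Stand-in workload; replace with e.g. lambda: session.run(None, feed)
def dummy_workload() -> int:
    return sum(i * i for i in range(10_000))


print(f"{median_latency_ms(dummy_workload):.3f} ms")
```

Timing the PyTorch forward pass and the ONNX Runtime `session.run` call with the same harness, input shape, and thread count gives a like-for-like comparison. The median is used because it is less sensitive to occasional slow runs than the mean.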
Quantization for Smaller, Faster Models
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    "model.onnx",
    "model_quantized.onnx",
    weight_type=QuantType.QInt8,
)
# Rough expectation: ~75% smaller model, up to 2-4x faster CPU inference,
# and often <1% accuracy loss; measure all three on your own model and data
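The size figure follows from storage arithmetic: a float32 weight takes 4 bytes and an int8 weight takes 1, so quantized tensors shrink by about 75%, minus a small overhead for scales and zero points. The sketch below illustrates the idea with a simplified symmetric per-tensor scheme in plain Python. It is not ONNX Runtime's exact algorithm, and the function names and weight values are made up for illustration.

```python
from typing import List, Tuple


def quantize_symmetric_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integer codes in [-127, 127] using a single scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.42, -1.27, 0.003, 0.9, -0.65]  # made-up values standing in for float32 weights
q, scale = quantize_symmetric_int8(weights)
restored = dequantize(q, scale)

# Rounding moves each value by at most half a quantization step (scale / 2)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
bytes_fp32, bytes_int8 = 4 * len(weights), 1 * len(weights)

print(q, round(scale, 4))
print(f"max error {max_error:.5f}, size reduction {1 - bytes_int8 / bytes_fp32:.0%}")
```

The rounding error is bounded by half the scale, which is why the accuracy impact is often small when weights are not dominated by a few large outliers. A single outlier inflates the scale and costs precision for every other weight in the tensor.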
ONNX vs TorchScript vs TensorFlow SavedModel
| Feature | ONNX | TorchScript | TF SavedModel |
|---|---|---|---|
| Framework support | All major | PyTorch only | TensorFlow only |
| Deployment targets | Universal | PyTorch ecosystem | TF ecosystem |
| Mobile support | Yes (Runtime Mobile) | Yes (iOS via LibTorch) | Yes (TF Lite) |
| Quantization | Excellent | Limited | Good (TF Lite) |
| Ecosystem | Growing fast | Stable | Mature |
Resources: ONNX, ONNX Runtime, Optimum.