The Model Serving Gap
Training a model is the easy part. Serving it reliably in production requires versioning, packaging dependencies, building an API, handling concurrency, and deploying as a container. Most teams reinvent this with Flask or FastAPI wrappers that break when dependencies change.
BentoML provides a standardized way to package any ML model as a production service.
Saving a Model
import bentoml
from sklearn.ensemble import RandomForestClassifier

# Train your model (any framework).
# X_train and y_train are assumed to have been prepared earlier.
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

# Save to the BentoML model store
saved_model = bentoml.sklearn.save_model(
    "fraud_detector",
    model,
    # Mark predict as batchable along axis 0 so the runner can batch requests
    signatures={"predict": {"batchable": True, "batch_dim": 0}},
    metadata={"accuracy": 0.94, "trained_on": "2026-05-01"},
)
print(f"Model saved: {saved_model.tag}")
Defining a Service
# service.py
import bentoml
import numpy as np
from bentoml.io import NumpyNdarray, JSON

# Wrap the saved model in a runner so inference can be scheduled and batched
fraud_runner = bentoml.sklearn.get("fraud_detector:latest").to_runner()
svc = bentoml.Service("fraud_detection_service", runners=[fraud_runner])


@svc.api(input=NumpyNdarray(dtype="float32"), output=JSON())
async def predict(input_data: np.ndarray):
    prediction = await fraud_runner.predict.async_run(input_data)
    return {
        "prediction": prediction.tolist(),
        "model": "fraud_detector",
    }


@svc.api(input=JSON(), output=JSON())
async def predict_from_json(input_data: dict):
    # Feature order must match the column order used at training time
    features = np.array([[
        input_data["amount"],
        input_data["merchant_category"],
        input_data["hour_of_day"],
    ]], dtype="float32")
    prediction = await fraud_runner.predict.async_run(features)
    return {"is_fraud": bool(prediction[0])}
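The JSON endpoint above indexes the payload directly, so a missing or non-numeric field surfaces as an unhandled exception. One way to guard against that is to validate the payload before building the feature array. The sketch below is plain Python; `FEATURE_ORDER` and `features_from_payload` are hypothetical names, not part of BentoML, and the returned nested list would still need wrapping in `np.array(..., dtype="float32")` before it reaches the runner.

```python
from typing import Any, Dict, List

# Hypothetical helper, not part of BentoML. The field order is taken from
# the predict_from_json endpoint in service.py.
FEATURE_ORDER = ("amount", "merchant_category", "hour_of_day")


def features_from_payload(payload: Dict[str, Any]) -> List[List[float]]:
    """Return a 1 x n_features nested list, or raise ValueError."""
    missing = [name for name in FEATURE_ORDER if name not in payload]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")

    row = []
    for name in FEATURE_ORDER:
        value = payload[name]
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"field {name!r} must be numeric")
        row.append(float(value))
    return [row]
```

Inside the endpoint, catching the `ValueError` and returning a structured error lets clients tell a malformed request apart from a server fault.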
Adaptive Batching
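BentoML's runner automatically batches concurrent requests for any model signature saved with batchable set to True:
fraud_runner = bentoml.sklearn.get("fraud_detector:latest").to_runner()
# The runner groups requests that arrive close together, keeping the added wait within max_latency_ms.
# Batching limits live in the BentoML configuration file (loaded via the BENTOML_CONFIG
# environment variable), keyed by runner name, which defaults to the model name:
# runners:
#   fraud_detector:
#     batching:
#       enabled: true
#       max_batch_size: 100
#       max_latency_ms: 15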
With these settings, up to 100 concurrent single-item requests can be merged into one batch call, which can improve throughput by 10-50x for batch-capable models.
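To make the mechanism concrete, here is a minimal micro-batching sketch using only `asyncio`. It illustrates the general idea of collecting requests until a size or time limit is hit. It is not BentoML's actual algorithm, which adapts to observed latency, and `MicroBatcher`, `submit`, and `demo` are hypothetical names.

```python
import asyncio
from typing import Any, Callable, List, Sequence, Tuple


class MicroBatcher:
    """Illustrative only: a hypothetical class, not BentoML's implementation."""

    def __init__(
        self,
        batch_fn: Callable[[Sequence[Any]], Sequence[Any]],
        max_batch_size: int = 100,
        max_latency_ms: float = 15.0,
    ) -> None:
        self.batch_fn = batch_fn
        self.max_batch_size = max_batch_size
        self.max_latency_s = max_latency_ms / 1000.0
        self.queue: asyncio.Queue = asyncio.Queue()
        self.calls = 0  # number of batch_fn invocations, for inspection

    async def submit(self, item: Any) -> Any:
        """Enqueue one item and wait for its individual result."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((item, future))
        return await future

    async def run(self) -> None:
        """Worker loop: gather a batch, call batch_fn once, fan results out."""
        loop = asyncio.get_running_loop()
        while True:
            batch: List[Tuple[Any, asyncio.Future]] = [await self.queue.get()]
            deadline = loop.time() + self.max_latency_s
            # Keep collecting until the batch is full or the window closes
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(
                        await asyncio.wait_for(self.queue.get(), remaining)
                    )
                except asyncio.TimeoutError:
                    break
            self.calls += 1
            outputs = self.batch_fn([item for item, _ in batch])
            for (_, future), output in zip(batch, outputs):
                if not future.done():
                    future.set_result(output)


async def demo(n_requests: int = 100):
    # Stand-in "model": doubles every input in the batch
    batcher = MicroBatcher(lambda xs: [x * 2 for x in xs])
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(
        *(batcher.submit(i) for i in range(n_requests))
    )
    worker.cancel()
    return results, batcher.calls
```

Running `asyncio.run(demo())` returns the doubled inputs in submission order, with `calls` far below the number of requests because concurrent submissions share a batch. The trade-off is visible in the two limits: a larger window fills batches more often but adds waiting time to the first request in each batch.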
Building and Deploying with Docker
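# Build the Bento (packages model + service + dependencies; reads bentofile.yaml)
bentoml build
# Build the Docker image, tagging it explicitly so the run command below can find it
bentoml containerize fraud_detection_service:latest -t fraud_detection_service:latest
# Run locally
docker run -p 3000:3000 fraud_detection_service:latest
# Deploy to Kubernetes (the manifest must reference an image the cluster can pull)
kubectl apply -f k8s/deployment.yaml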
The Docker image includes Python, all dependencies, the model artifacts, and the service, so it is fully self-contained.
Auto-Generated OpenAPI
Every BentoML service exposes:
- `GET /`: service info
- `POST /predict`: your API endpoint
- `GET /docs`: Swagger UI
- `GET /metrics`: Prometheus metrics
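As a quick way to exercise an endpoint, the sketch below builds and sends a request to the JSON API using only the standard library. It assumes the service is reachable at `http://localhost:3000` (the port mapping from the `docker run` example) and that the endpoint is routed at its function name, `/predict_from_json`. `build_request` and `call_predict` are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: the container from the docker run example is listening here
BASE_URL = "http://localhost:3000"


def build_request(payload: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for the predict_from_json endpoint."""
    return urllib.request.Request(
        url=f"{base_url}/predict_from_json",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_predict(payload: dict, timeout: float = 5.0) -> dict:
    """Send the request and decode the JSON response (needs a running service)."""
    with urllib.request.urlopen(build_request(payload), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the service running, `call_predict({"amount": 120.5, "merchant_category": 7, "hour_of_day": 23})` should return a dictionary with an `is_fraud` key, matching the return value of `predict_from_json` in `service.py`.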
BentoML vs TorchServe vs Triton
| | BentoML | TorchServe | Triton |
|---|---|---|---|
| Framework support | Any Python | PyTorch only | Most frameworks |
| Setup complexity | Low | Medium | High |
| Adaptive batching | Yes | Yes | Yes |
| Multi-model | Yes | Yes | Yes |
| Best for | Python-first teams | PyTorch production | High-scale inference |
Resources: BentoML GitHub, docs.