A trained model that lives in a Jupyter notebook generates zero value. Getting a model to production is a distinct engineering challenge from training it, and the pattern you choose determines your latency, cost, throughput, and operational complexity. This guide covers the deployment patterns used in production ML systems.
Pattern 1: REST API Serving
The simplest deployment pattern: wrap your model in a REST API that accepts input, runs inference, and returns a prediction. A Python web framework (FastAPI is a common choice) handles HTTP while your model handles inference.
from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import numpy as np

app = FastAPI()
# Load the model once at startup, not on every request.
model = joblib.load("model.pkl")


class PredictRequest(BaseModel):
    features: list[float]


@app.post("/predict")
def predict(request: PredictRequest):
    # The model expects a 2D array: one row per sample.
    X = np.array(request.features).reshape(1, -1)
    prediction = model.predict(X)[0]
    # predict_proba is only available on classifiers that expose probabilities.
    probability = model.predict_proba(X)[0].tolist()
    return {"prediction": int(prediction), "probabilities": probability}
For LLM inference, the same pattern applies, but you typically stream the output (Server-Sent Events or WebSockets): tokens are generated one at a time, so streaming lets the user start reading before generation finishes.
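A minimal sketch of the Server-Sent Events side of streaming, using only the standard library. The names `sse_events` and `fake_generate`, the JSON payload shape, and the `[DONE]` sentinel are assumptions for illustration; a real service would replace `fake_generate` with the model's token generator.

```python
import json
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap each generated token in a Server-Sent Events frame."""
    for token in tokens:
        # JSON-encode the payload so a newline inside a token cannot break framing.
        payload = json.dumps({"token": token})
        yield f"data: {payload}\n\n"
    # Hypothetical end-of-stream marker; client and server must agree on it.
    yield "data: [DONE]\n\n"


def fake_generate(prompt: str) -> Iterator[str]:
    """Stand-in for a real model's token generator (assumption)."""
    for word in f"echo: {prompt}".split():
        yield word + " "


if __name__ == "__main__":
    for frame in sse_events(fake_generate("hello world")):
        print(frame, end="")
```

In FastAPI, a generator like this can be returned through `StreamingResponse` with `media_type="text/event-stream"`, so each frame is sent to the client as soon as it is produced.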
When to use: real-time predictions where a user is waiting for a response. Recommendation systems, fraud detection at checkout, content classification on upload.
Caching: add a cache layer (Redis with a reasonable TTL) in front of the model for identical inputs. The achievable hit rate depends on how often inputs repeat; where it reaches the 40-60% range, the cache substantially reduces both latency and compute cost.
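The caching idea can be sketched with an in-process dictionary standing in for Redis. Everything here is illustrative: `TTLPredictionCache` is a hypothetical class, the key is a SHA-256 hash of the JSON-encoded features, and the clock is injectable so expiry is easy to test.

```python
import hashlib
import json
import time
from typing import Any, Callable, Dict, List, Tuple


class TTLPredictionCache:
    """In-process stand-in for a Redis cache with per-entry expiry."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key_for(features: List[float]) -> str:
        # Identical feature lists produce identical JSON, hence identical keys.
        return hashlib.sha256(json.dumps(features).encode("utf-8")).hexdigest()

    def get_or_compute(self, features: List[float],
                       predict: Callable[[List[float]], Any]) -> Any:
        key = self.key_for(features)
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        # Missing or expired: run the model and store the result with a new expiry.
        self.misses += 1
        result = predict(features)
        self._store[key] = (now + self.ttl, result)
        return result
```

Tracking `hits` and `misses` makes it possible to measure the real hit rate before deciding whether the cache pays for itself.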
Scaling: REST API serving scales horizontally by adding more containers behind a load balancer. For LLMs, a single large model may not fit in one GPU's memory; in that case tensor parallelism (splitting each layer's weights across multiple GPUs) or pipeline parallelism (placing different layers on different GPUs) is required.
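Whether a model fits on one GPU is mostly arithmetic. The sketch below estimates a lower bound on the GPU count from the weights alone; `min_gpus_for_weights` and the 90% usable-memory fraction are assumptions, and the estimate ignores the KV cache and activations, which need additional memory.

```python
import math


def min_gpus_for_weights(num_params: float, bytes_per_param: int,
                         gpu_memory_gb: float,
                         usable_fraction: float = 0.9) -> int:
    """Lower bound on the GPUs needed just to hold the model weights."""
    weight_bytes = num_params * bytes_per_param
    usable_bytes = gpu_memory_gb * 1e9 * usable_fraction
    return math.ceil(weight_bytes / usable_bytes)


# 70 billion parameters at 2 bytes each (16-bit weights) on 80 GB GPUs.
print(min_gpus_for_weights(70e9, 2, 80))  # prints 2
# 8 billion parameters at 2 bytes each on a single 24 GB GPU.
print(min_gpus_for_weights(8e9, 2, 24))   # prints 1
```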
Pattern 2: Batch Inference
Batch inference processes a large dataset offline rather than serving individual predictions in real time. Run it on a schedule (nightly, hourly) to pre-compute predictions for all users or items.
Use cases: pre-computing product recommendations for all users and storing them in a database to be served at page load. Pre-scoring leads for the sales team each morning. Running a document classifier over all documents uploaded in the past day.
Batch inference is usually much cheaper than real-time serving for three reasons: you can use larger batch sizes (better GPU utilization), you can run during off-peak hours on spot instances (which cloud providers discount by up to roughly 70-90% relative to on-demand), and there is no latency constraint, so a job that takes 2 hours is fine if it runs overnight.
Implementation: a Python script with a data pipeline (Spark, Pandas, DuckDB depending on data size), a model inference loop with large batch sizes, and output written to a database or object store.
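import pandas as pd
from sqlalchemy import create_engine

from model import load_model  # project-specific loader

# Placeholders for illustration: use your own feature list and database URL.
FEATURE_COLUMNS = ["feature_1", "feature_2"]
engine = create_engine("sqlite:///predictions.db")

model = load_model("model.pt")
df = pd.read_parquet("users.parquet")

batch_size = 512
predictions = []
# Score in fixed-size chunks to bound memory use per inference call.
for i in range(0, len(df), batch_size):
    batch = df.iloc[i:i + batch_size]
    features = batch[FEATURE_COLUMNS].values
    preds = model.predict(features)
    predictions.extend(preds)

df["churn_score"] = predictions
# index=False keeps the DataFrame index out of the output table.
df[["user_id", "churn_score"]].to_sql("churn_predictions", engine, if_exists="replace", index=False)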
Pattern 3: Streaming Inference
Streaming inference processes data as it arrives, typically from a message queue. A Kafka topic receives events, a consumer reads them, runs inference, and writes results back to another topic or a database.
Use cases: content moderation on a high-volume social platform (moderate posts as they are submitted). Anomaly detection in time-series data from IoT sensors. Real-time recommendation updates as users interact with content.
Architecture: Kafka (or Pulsar, Kinesis, or Pub/Sub) as the message bus. A consumer service built with a Kafka client library for Python (such as confluent-kafka or kafka-python) pulls batches of messages, runs inference, and publishes results. The consumer scales horizontally by adding more instances to the same consumer group, up to the number of partitions in the topic.
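The read, score, publish loop can be sketched without a broker by using `queue.Queue` objects as stand-ins for the input and output topics. The function names and message shape (`id`, `text`) are assumptions; a real consumer would poll indefinitely and commit offsets only after results are published.

```python
import queue
from typing import Callable, List


def drain_batch(source: "queue.Queue[dict]", max_batch: int) -> List[dict]:
    """Pull up to max_batch messages that are already waiting, without blocking."""
    batch: List[dict] = []
    while len(batch) < max_batch:
        try:
            batch.append(source.get_nowait())
        except queue.Empty:
            break
    return batch


def consume(source: "queue.Queue[dict]", sink: "queue.Queue[dict]",
            score_batch: Callable[[List[dict]], List[float]],
            max_batch: int = 32) -> int:
    """Read, score, and publish until the source is empty; return the count processed."""
    processed = 0
    while True:
        batch = drain_batch(source, max_batch)
        if not batch:
            return processed
        # One model call per batch amortizes per-call overhead.
        scores = score_batch(batch)
        for message, score in zip(batch, scores):
            sink.put({"id": message["id"], "score": score})
        processed += len(batch)
```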
The key operational challenge is managing consumer lag. If inference is slower than the event ingestion rate, the backlog grows without bound and results arrive later and later. Monitor consumer lag and scale consumers before it builds up.
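Consumer lag and the consumer count needed to keep up are both simple arithmetic. In this sketch the function names, the `headroom` factor, and the offset dictionaries (keyed by partition number) are assumptions; in practice the offsets come from the broker or a monitoring tool.

```python
import math
from typing import Dict


def total_lag(end_offsets: Dict[int, int], committed: Dict[int, int]) -> int:
    """Sum over partitions of (newest offset minus the group's committed offset)."""
    return sum(max(end_offsets[p] - committed.get(p, 0), 0) for p in end_offsets)


def consumers_needed(ingest_per_sec: float, per_consumer_per_sec: float,
                     num_partitions: int, headroom: float = 1.2) -> int:
    """Smallest consumer count that keeps up with ingestion, capped by partitions."""
    wanted = math.ceil(ingest_per_sec * headroom / per_consumer_per_sec)
    # Consumers beyond the partition count would sit idle in a consumer group.
    return min(max(wanted, 1), num_partitions)
```

If `consumers_needed` hits the partition cap, adding instances no longer helps; the remaining options are more partitions or faster inference per consumer.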
Pattern 4: Edge Deployment
Edge deployment runs the model on the device where the data is generated — a phone, a laptop, an embedded sensor — rather than sending data to a server.
Advantages: no network latency, works offline, data stays on device (privacy), no per-inference cloud compute cost.
Disadvantages: smaller models only (constrained memory and compute), updates require pushing new model to devices, harder to debug and monitor.
Frameworks for edge deployment:
TensorFlow Lite: converts TensorFlow/Keras models to a compact FlatBuffer format (.tflite) optimized for inference on mobile and embedded devices. Hardware acceleration is available on iOS (Core ML delegate) and Android (GPU delegate, NNAPI).
ONNX: a common model format supported by most major frameworks. Convert from PyTorch (torch.onnx.export), TensorFlow (tf2onnx), or scikit-learn (skl2onnx). Run inference with ONNX Runtime, which runs on CPU by default and can use GPUs and other accelerators through execution providers.
Core ML: Apple's native ML inference framework for iOS and macOS. Convert models with coremltools. Core ML schedules work across the CPU, GPU, and Apple Neural Engine, which makes inference efficient on Apple Silicon.
MediaPipe: Google's framework for ML pipelines on mobile and edge. It ships pre-built solutions for face detection, hand tracking, pose estimation, and object detection that run on-device.
Quantization is almost always applied before edge deployment: it reduces model weights from FP32 to INT8 or even INT4. This cuts model size by roughly 4x (INT8) to 8x (INT4), typically with a small accuracy loss (often under 2% for INT8, though INT4 can cost more and results vary by model), making large models fit in constrained memory.
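To make the size and accuracy trade-off concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It illustrates the idea only: the function names are hypothetical, and production toolchains use per-channel scales, calibration data, and packed storage.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


def size_ratio(bits_before: int, bits_after: int) -> float:
    """Compression factor from storing each weight in fewer bits."""
    return bits_before / bits_after


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Rounding error per weight is at most half a quantization step.
worst_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, worst_error <= scale / 2 + 1e-9)
print(size_ratio(32, 8), size_ratio(32, 4))  # prints 4.0 8.0
```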
Pattern 5: Model Serving Frameworks
As model serving complexity grows beyond a single FastAPI endpoint, purpose-built serving frameworks become valuable.
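vLLM is a widely used engine for serving LLMs at scale. It implements PagedAttention, a KV-cache memory management scheme that stores the cache in fixed-size blocks, which reduces memory waste and substantially increases throughput for concurrent requests. Compared with a naive implementation, a vLLM server can handle many times more concurrent requests on the same hardware; gains of 10-20x are possible, but the actual figure depends on model, hardware, and workload. Llama 3 70B in 16-bit precision needs about 140 GB for weights alone, so serving it requires multiple GPUs or quantization rather than a single 80 GB A100.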
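TorchServe is the model serving framework from the PyTorch project. It handles model versioning, multi-model serving, and metrics out of the box, and its versioned endpoints can be used for A/B testing. It is well suited to production deployments of PyTorch models, though the project is now in limited maintenance, so check its status before adopting it for new systems.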
NVIDIA Triton Inference Server supports multiple frameworks (TensorFlow, PyTorch, ONNX, TensorRT) on the same server. It maximizes GPU utilization through dynamic batching — batching together requests that arrive close in time. Ideal for organizations with diverse model portfolios on NVIDIA hardware.
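The idea behind dynamic batching can be sketched in a few lines: wait for the first request, then keep collecting until the batch is full or a short wait budget expires. This is an illustration of the technique, not Triton's implementation; `collect_batch` and its parameters are hypothetical, and in Triton the behavior is set in the model configuration rather than written by hand.

```python
import queue
import time
from typing import Any, List


def collect_batch(pending: "queue.Queue[Any]", max_batch_size: int,
                  max_wait_s: float) -> List[Any]:
    """Block for one request, then gather more until full or out of time."""
    batch = [pending.get()]  # wait for at least one request
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(pending.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```

The two parameters express the trade-off: a larger `max_batch_size` improves GPU utilization, while a larger `max_wait_s` adds up to that much latency to the first request in each batch.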
Ollama is a lightweight local LLM server that pulls models from a registry and serves them via a local REST API. Excellent for development, internal tools, and low-volume production use cases.
Canary Deployments for ML
ML model updates carry risk that pure software updates do not — a new model version might perform better on average but worse on a specific user segment, and this might not be apparent until it has been running for days. Canary deployments mitigate this risk.
A canary deployment routes a small percentage of traffic (5%, 10%) to the new model version while the rest continues to hit the current version. Monitor both versions in parallel:
- Prediction distribution (has the output distribution shifted?)
- Latency (is the new model slower?)
- Business metrics (conversion rate, click-through rate, downstream outcomes)
- Error rate (prediction failures, timeouts)
If all metrics look good after 24-48 hours, gradually increase the canary percentage. If problems appear, roll back immediately. Because most traffic still goes to the old model, rollback is only a routing change: set the canary percentage to zero.
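The promote-or-roll-back decision can be written as an explicit gate over the monitored metrics. This is a sketch under stated assumptions: `VersionStats`, `canary_decision`, and all three thresholds are hypothetical and should be replaced with values that suit your traffic and business metrics.

```python
from dataclasses import dataclass


@dataclass
class VersionStats:
    error_rate: float        # fraction of failed or timed-out predictions
    p95_latency_ms: float    # 95th percentile latency
    conversion_rate: float   # example business metric


def canary_decision(stable: VersionStats, canary: VersionStats,
                    max_error_increase: float = 0.005,
                    max_latency_ratio: float = 1.2,
                    max_conversion_drop: float = 0.02) -> str:
    """Return "rollback" if any guardrail is violated, otherwise "promote"."""
    if canary.error_rate > stable.error_rate + max_error_increase:
        return "rollback"
    if canary.p95_latency_ms > stable.p95_latency_ms * max_latency_ratio:
        return "rollback"
    # Relative drop: 0.02 means the canary may convert at most 2% worse.
    if canary.conversion_rate < stable.conversion_rate * (1 - max_conversion_drop):
        return "rollback"
    return "promote"
```

A gate like this does not cover per-segment regressions on its own; running it separately for each important user segment is one way to extend it.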
Implement canary routing at the load balancer or API gateway level. Most modern infrastructure tools (Kubernetes with Istio or Argo Rollouts, AWS App Mesh, NGINX with split_clients) support percentage-based traffic routing natively.
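If you need to split traffic in application code instead, a deterministic hash of a stable key gives each user a consistent assignment. This is a minimal sketch: the `route` function, the salt string, and the 100-bucket scheme are assumptions for illustration.

```python
import hashlib


def route(user_id: str, canary_percent: int,
          salt: str = "model-v2-rollout") -> str:
    """Assign a user to "canary" or "stable" based on a hash bucket in [0, 100)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "canary" if bucket < canary_percent else "stable"


if __name__ == "__main__":
    users = [f"user-{n}" for n in range(10_000)]
    share = sum(route(u, 10) == "canary" for u in users) / len(users)
    print(f"canary share at 10%: {share:.3f}")
```

Because a user's bucket never changes, raising the percentage only moves additional users onto the canary, and setting it to zero sends everyone back to the stable version.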