What Is Modal?
Modal is a cloud platform that makes GPU-backed Python functions as easy to deploy as regular functions. You add a decorator, push your code, and Modal handles provisioning, scaling, and teardown. Cold starts are under 5 seconds for cached images — dramatically faster than Lambda or Cloud Run for GPU workloads.
The Core Pattern
import modal
app = modal.App("llm-inference")
# Define a container image with your dependencies
image = modal.Image.debian_slim().pip_install(
"torch", "transformers", "accelerate"
)
# This function runs on an A100 in the cloud
@app.function(gpu="A100", image=image, timeout=300)
def generate(prompt: str) -> str:
from transformers import pipeline
pipe = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")
return pipe(prompt, max_new_tokens=200)[0]["generated_text"]
@app.local_entrypoint()
def main():
result = generate.remote("Explain quantum entanglement simply:")
print(result)
Run with: modal run inference.py
Persistent Volumes for Model Weights
Downloading a 7B model on every cold start would take minutes. Modal Volumes solve this:
volume = modal.Volume.from_name("model-weights", create_if_missing=True)
@app.function(gpu="A100", volumes={"/models": volume})
def download_model():
from huggingface_hub import snapshot_download
snapshot_download("mistralai/Mistral-7B-Instruct-v0.2", local_dir="/models/mistral-7b")
volume.commit()
Run the download once. Subsequent function calls mount the volume and skip the download entirely.
Web Endpoints
Turn any function into an HTTP API with @modal.web_endpoint:
@app.function(gpu="T4", image=image)
@modal.web_endpoint(method="POST")
def inference_api(item: dict) -> dict:
result = generate_text(item["prompt"])
return {"output": result}
Modal gives you a stable HTTPS URL. Scales to zero when idle, scales up automatically under load.
Scheduled Jobs
@app.function(schedule=modal.Cron("0 8 * * *"))
def daily_report():
# runs every day at 8 AM UTC
generate_and_send_report()
Modal vs AWS Lambda
Lambda tops out at 15-minute execution time, has no native GPU support, and cold starts on large packages (PyTorch) can take 30+ seconds. Modal was built specifically for ML workloads: GPU support is first-class, cold starts with cached images are under 5 seconds, and execution time limits are much more generous (1 hour by default). For anything involving torch or model inference, Modal is significantly less painful.
Pricing
Modal charges per second of GPU time: T4 at $0.000164/second, A10G at $0.000306/second, A100 (40GB) at $0.000900/second. No charges when functions aren't running. The free tier includes $30/month of compute.