Modal: Run GPU Python Functions as Serverless Jobs in 30 Seconds

Modal lets you decorate any Python function to run on cloud GPUs with sub-5-second cold starts, persistent model caching, and OpenAI-compatible web endpoints.

Mahmudul Haque Qudrati

CEO & ML Engineer

April 9, 2026

8 min read

// tags

#modal #serverless-gpu #python #deployment #inference


Persistent Volumes for Model Weights

Downloading a 7B model on every cold start would take minutes. Modal Volumes solve this:

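# Assumes `import modal` and `app = modal.App(...)` from the earlier setup.
# Create (or reuse) a named Volume that persists across container starts.
volume = modal.Volume.from_name("model-weights", create_if_missing=True)

# Mount the Volume at /models. The function's image must have huggingface_hub installed.
@app.function(gpu="A100", volumes={"/models": volume})
def download_model():
    from huggingface_hub import snapshot_download
    # Write the model snapshot into the mounted Volume.
    snapshot_download("mistralai/Mistral-7B-Instruct-v0.2", local_dir="/models/mistral-7b")
    # Persist the written files so other containers can see them.
    volume.commit()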

Run the download once. Subsequent function calls mount the volume and skip the download entirely.
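
One way to make the download step safe to re-run is to guard it with a check on the target directory. The sketch below is plain Python with no Modal calls. `ensure_weights` and `fetch` are hypothetical names, and inside a Modal function the directory would be a path under the mounted volume such as /models/mistral-7b.

```python
from pathlib import Path
from typing import Callable


def ensure_weights(local_dir: str, fetch: Callable[[str], None]) -> bool:
    """Call fetch(local_dir) only if local_dir is missing or empty.

    Returns True when a download was triggered, False when it was skipped.
    """
    target = Path(local_dir)
    # A non-empty directory is treated as "already downloaded".
    if target.is_dir() and any(target.iterdir()):
        return False
    target.mkdir(parents=True, exist_ok=True)
    fetch(local_dir)
    return True
```

In the download function above, `fetch` would wrap the `snapshot_download` call. A non-empty directory is only a rough signal, so a partially written snapshot would still count as present under this check.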

Web Endpoints

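Turn any function into an HTTP API with @modal.fastapi_endpoint (named @modal.web_endpoint in older Modal releases):

# Assumes `image` includes FastAPI and `generate_text` is defined elsewhere in the app.
@app.function(gpu="T4", image=image)
@modal.fastapi_endpoint(method="POST")
def inference_api(item: dict) -> dict:
    # `item` is the parsed JSON request body, e.g. {"prompt": "..."}.
    result = generate_text(item["prompt"])
    return {"output": result}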

Modal gives you a stable HTTPS URL. The endpoint scales to zero when idle and scales up automatically under load.
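
A client for this endpoint could look roughly like the sketch below, which uses only the standard library. The URL is a placeholder (Modal prints the real one when the app is served or deployed), and `build_request` and `call_endpoint` are hypothetical helper names. The request body mirrors the {"prompt": ...} shape that inference_api reads.

```python
import json
import urllib.request

# Hypothetical URL: replace with the one Modal prints for your endpoint.
ENDPOINT_URL = "https://example--inference-api.modal.run"


def build_request(prompt: str, url: str = ENDPOINT_URL) -> urllib.request.Request:
    """Build a JSON POST request with the {"prompt": ...} body the endpoint expects."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_endpoint(prompt: str, url: str = ENDPOINT_URL, timeout: float = 60.0) -> str:
    """Send the request and return the "output" field of the JSON response."""
    with urllib.request.urlopen(build_request(prompt, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["output"]
```

A generous timeout is worth keeping on the client side, since the first request after an idle period has to wait for a container to start.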

Scheduled Jobs

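Pass a schedule to @app.function to run a function on a timer. Schedules take effect once the app is deployed with modal deploy:

@app.function(schedule=modal.Cron("0 8 * * *"))
def daily_report():
    # Runs every day at 08:00 UTC.
    generate_and_send_report()  # assumed to be defined elsewhere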
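
For readers less familiar with cron syntax, the expression above has five space-separated fields. The helper below is a small illustrative sketch (the name `split_cron` is hypothetical and not part of Modal) that labels each field of a standard five-field expression.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def split_cron(expr: str) -> dict[str, str]:
    """Map each of the five standard cron fields to its value in expr."""
    parts = expr.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


print(split_cron("0 8 * * *"))
```

Here minute is 0 and hour is 8, with wildcards for the remaining fields, which is why the job fires once a day at 08:00.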

Modal vs AWS Lambda

Lambda caps execution at 15 minutes, has no native GPU support, and can take 30+ seconds to cold start with large packages such as PyTorch. Modal was built specifically for ML workloads: GPU support is first-class, cold starts with cached images are under 5 seconds, and the per-function timeout is configurable well past Lambda's cap (300 seconds by default, up to 24 hours). For anything involving torch or model inference, Modal is significantly less painful.

Pricing

Modal charges per second of GPU time: T4 at $0.000164/second, A10G at $0.000306/second, A100 (40GB) at $0.000900/second. No charges when functions aren't running. The free tier includes $30/month of compute.
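
To put the per-second rates in more familiar terms, the sketch below converts them to hourly figures and estimates how many GPU hours the free tier covers. It uses only the numbers quoted in this section and ignores any CPU and memory charges, so treat the results as rough upper bounds rather than a bill estimate.

```python
# Per-second GPU rates (USD) as quoted in this article; check Modal's pricing page for current values.
RATES_PER_SECOND = {"T4": 0.000164, "A10G": 0.000306, "A100-40GB": 0.000900}
FREE_TIER_USD = 30.0  # monthly free compute quoted above


def hourly_cost(gpu: str) -> float:
    """Cost of one hour of continuous GPU time."""
    return RATES_PER_SECOND[gpu] * 3600


def free_tier_hours(gpu: str) -> float:
    """GPU hours per month covered by the free tier, counting GPU time only."""
    return FREE_TIER_USD / hourly_cost(gpu)


for gpu in RATES_PER_SECOND:
    print(f"{gpu}: ${hourly_cost(gpu):.2f}/hour, ~{free_tier_hours(gpu):.1f} free hours/month")
```

At these rates a T4 works out to roughly $0.59 per hour and an A100 (40GB) to $3.24 per hour.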
