OpenAI Batch API: Get 50% Off for Non-Real-Time Requests

OpenAI's Batch API cuts costs by 50% for any request that can wait up to 24 hours. If you have data labeling, nightly analysis, or content moderation workloads, you should be using it.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 17, 2026

8 min read

// tags

#openai-batch-api #llm-cost #api-optimization #gpt-4o

OpenAI's Batch API provides a 50% cost reduction on all model pricing for requests that do not need real-time responses. You submit a JSONL file of requests, OpenAI processes them within 24 hours, and you pay half the standard per-token rate. For any workload that does not require an immediate response — data labeling, bulk analysis, nightly reports, content moderation — this is the most straightforward cost reduction available.

How the Batch API Works

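The Batch API does not use separate models. It runs the same models (GPT-4o, GPT-4o-mini, text-embedding-3-small, and others) at a lower price in exchange for relaxed latency requirements. Requests still target the usual endpoints, such as /v1/chat/completions; they are submitted through a batch job instead of being sent one at a time.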

The workflow:

1. Create a JSONL file where each line is one API request
2. Upload the file to OpenAI's Files API
3. Create a batch job referencing the uploaded file
4. Poll the batch job status (or set a callback)
5. When complete, download the results JSONL file

from openai import OpenAI
import json

client = OpenAI()

# Step 1: Create request JSONL
requests = [
    {
        "custom_id": "request-1",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "system", "content": "Classify the sentiment of the following text as positive, negative, or neutral."},
                {"role": "user", "content": "The product arrived on time and works perfectly."}
            ],
            "max_tokens": 10
        }
    },
    {
        "custom_id": "request-2",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "system", "content": "Classify the sentiment of the following text as positive, negative, or neutral."},
                {"role": "user", "content": "Terrible experience, would not recommend."}
            ],
            "max_tokens": 10
        }
    }
]

# Write to JSONL file
with open("batch_requests.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Step 2: Upload the file
with open("batch_requests.jsonl", "rb") as f:
    batch_file = client.files.create(file=f, purpose="batch")

# Step 3: Create batch job
batch_job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)

print(f"Batch job created: {batch_job.id}")

Checking Status and Retrieving Results

import time

def wait_for_batch(batch_id: str, poll_interval: int = 60):
    while True:
        batch = client.batches.retrieve(batch_id)
        print(f"Status: {batch.status}, completed: {batch.request_counts.completed}/{batch.request_counts.total}")

        if batch.status == "completed":
            return batch
        elif batch.status in ["failed", "expired", "cancelled"]:
            raise Exception(f"Batch failed with status: {batch.status}")

        time.sleep(poll_interval)

# Wait for completion (in production, use a scheduled job or webhook)
completed_batch = wait_for_batch(batch_job.id)

# Download results
result_file = client.files.content(completed_batch.output_file_id)
results = [json.loads(line) for line in result_file.text.strip().split("\n")]

for result in results:
    custom_id = result["custom_id"]
    response_content = result["response"]["body"]["choices"][0]["message"]["content"]
    print(f"{custom_id}: {response_content}")
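Output lines are not guaranteed to come back in the same order as the input file, so it is safer to match each result to its request by custom_id than by position. The following is a minimal sketch, assuming the output line shape used above plus an "error" field on lines that failed. `index_results` is a hypothetical helper, not part of the OpenAI SDK, and the sample lines are hand-written for illustration.

```python
import json


def index_results(jsonl_text: str) -> tuple[dict[str, str], dict[str, object]]:
    """Split batch output lines into successes and failures, keyed by custom_id."""
    succeeded: dict[str, str] = {}
    failed: dict[str, object] = {}
    for line in jsonl_text.split("\n"):
        if not line.strip():
            continue  # skip blank lines, e.g. a trailing newline
        record = json.loads(line)
        custom_id = record["custom_id"]
        response = record.get("response")
        # Assumption: a failed line has a non-null "error" or a non-200 status code.
        if record.get("error") or not response or response.get("status_code") != 200:
            failed[custom_id] = record.get("error") or response
            continue
        succeeded[custom_id] = response["body"]["choices"][0]["message"]["content"]
    return succeeded, failed


# Two hand-written lines in the assumed output shape, deliberately out of order.
sample = "\n".join([
    json.dumps({"custom_id": "request-2", "error": None, "response": {
        "status_code": 200,
        "body": {"choices": [{"message": {"content": "negative"}}]}}}),
    json.dumps({"custom_id": "request-1", "error": {"code": "rate_limit"}, "response": None}),
])
succeeded, failed = index_results(sample)
print(succeeded)
print(failed)
```

Keeping failures in their own dictionary makes it straightforward to resubmit only those custom_ids in a follow-up batch.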

Pricing: Actual Savings

The batch pricing is 50% of the standard price. As of May 2026:

| Model | Standard Input | Batch Input | Standard Output | Batch Output |
|-------|----------------|-------------|-----------------|--------------|
| GPT-4o | $2.50/1M | $1.25/1M | $10.00/1M | $5.00/1M |
| GPT-4o-mini | $0.15/1M | $0.075/1M | $0.60/1M | $0.30/1M |
| text-embedding-3-small | $0.02/1M | $0.01/1M | n/a | n/a |

For a data labeling workload processing 100 million input tokens per month on GPT-4o-mini, the savings are $7.50/month ($15 standard vs. $7.50 batch), before counting output tokens, which are discounted at the same rate. For the same input volume on GPT-4o, savings are $125/month ($250 standard vs. $125 batch). High-volume embedding workloads get the same treatment: the batch rate cuts embedding costs in half.
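To run the same arithmetic for your own volumes, a small calculator helps. This sketch hard-codes the per-1M-token prices from the table above; the function names are illustrative, and the prices should be checked against the current pricing page before relying on the result.

```python
# USD per 1M tokens, copied from the table above (as of May 2026).
STANDARD_PRICES = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "text-embedding-3-small": {"input": 0.02, "output": 0.0},
}
BATCH_DISCOUNT = 0.5  # the batch rate is half the standard rate


def monthly_cost(model: str, input_tokens: int, output_tokens: int = 0,
                 batch: bool = False) -> float:
    """Cost in USD for the given token volumes at standard or batch rates."""
    price = STANDARD_PRICES[model]
    cost = (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000
    return cost * BATCH_DISCOUNT if batch else cost


def monthly_savings(model: str, input_tokens: int, output_tokens: int = 0) -> float:
    """Difference between the standard and batch cost for the same volume."""
    standard = monthly_cost(model, input_tokens, output_tokens)
    return standard - monthly_cost(model, input_tokens, output_tokens, batch=True)


# The two cases from the text: 100M input tokens per month, no output counted.
print(f"gpt-4o-mini: ${monthly_savings('gpt-4o-mini', 100_000_000):.2f}")  # $7.50
print(f"gpt-4o:      ${monthly_savings('gpt-4o', 100_000_000):.2f}")       # $125.00
```

Passing a non-zero output_tokens value shows how quickly output pricing dominates for generation-heavy workloads, since output tokens cost four times as much as input tokens on both chat models listed.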

Use Cases That Are Perfect for Batch API

Sentiment analysis at scale. If you process customer feedback, support tickets, or social media mentions, these can all be batched and processed overnight.

Document processing pipelines. Summarizing documents, extracting entities, classifying content — anything that processes a backlog of documents on a schedule rather than in real time.

Data augmentation for ML. Generating synthetic training data, labeling examples, generating variations — all high-volume, non-real-time workloads.

Nightly report generation. Summarizing daily activity, flagging anomalies, and creating management reports can all run overnight and be ready in the morning.

Bulk content moderation. Platforms that moderate user-generated content can batch older content that does not need immediate review.
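Most of these workloads reduce to the same first step: turning a list of records into one request per line. The sketch below generalizes the two hand-written requests from earlier. `build_requests` and `to_jsonl` are hypothetical helper names, and the sequential custom_id scheme is an assumption; in practice your own record IDs are usually a better choice, as long as they are unique within the batch.

```python
import json


def build_requests(texts, system_prompt, model="gpt-4o-mini", max_tokens=10):
    """Turn a list of texts into Batch API request dicts, one per text."""
    return [
        {
            "custom_id": f"request-{i}",  # must be unique within the batch
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": text},
                ],
                "max_tokens": max_tokens,
            },
        }
        for i, text in enumerate(texts, start=1)
    ]


def to_jsonl(requests) -> str:
    """Serialize requests as JSONL: one JSON object per line."""
    return "".join(json.dumps(req) + "\n" for req in requests)


feedback = [
    "The product arrived on time and works perfectly.",
    "Terrible experience, would not recommend.",
]
prompt = "Classify the sentiment of the following text as positive, negative, or neutral."
jsonl = to_jsonl(build_requests(feedback, prompt))
print(jsonl)
```

Because json.dumps escapes newlines inside the text, a multi-line review still ends up on a single JSONL line.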

Limits and Considerations

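Batch jobs expire after 24 hours if not completed. If OpenAI is under high load, your job might not finish within the window, though in practice this is rare. When a batch does expire, the requests that finished are still returned in the output file, and you are billed only for those.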

A single batch can contain at most 50,000 requests, and its input file can be at most 200 MB; whichever limit you hit first applies. For larger workloads, split the requests across multiple batch jobs.
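Splitting a large workload can be done while serializing, by closing the current chunk whenever the next request would break either limit. This is a sketch rather than a definitive implementation: `split_into_batches` is a hypothetical helper, and treating 200 MB as 200,000,000 bytes is a conservative assumption about how the limit is measured.

```python
import json

MAX_REQUESTS = 50_000      # per-batch request limit cited above
MAX_BYTES = 200_000_000    # assumption: 200 MB counted as decimal megabytes


def split_into_batches(requests, max_requests=MAX_REQUESTS, max_bytes=MAX_BYTES):
    """Yield lists of JSONL lines, each list staying within both limits."""
    chunk, chunk_bytes = [], 0
    for req in requests:
        line = json.dumps(req) + "\n"
        size = len(line.encode("utf-8"))
        if size > max_bytes:
            raise ValueError(f"request {req.get('custom_id')} alone exceeds the size limit")
        # Close the current chunk if adding this line would break either limit.
        if chunk and (len(chunk) >= max_requests or chunk_bytes + size > max_bytes):
            yield chunk
            chunk, chunk_bytes = [], 0
        chunk.append(line)
        chunk_bytes += size
    if chunk:
        yield chunk


# Tiny limits so the behavior is visible without generating 50,000 requests.
demo = [{"custom_id": f"request-{i}", "body": {}} for i in range(5)]
chunks = list(split_into_batches(demo, max_requests=2))
print([len(c) for c in chunks])  # [2, 2, 1]
```

Each yielded chunk can be written to its own file with `f.writelines(chunk)` and uploaded as a separate batch job.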

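Each batch targets a single endpoint, and every request in the file must use it. The usual choices are /v1/chat/completions and /v1/embeddings; /v1/responses and /v1/completions are also accepted. Other endpoints (like images or audio) are not supported.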

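Polling is the simplest way to detect completion. In production, replace the blocking loop with a scheduled job (cron, Celery beat, or similar) that checks batch status every 15 to 30 minutes. OpenAI has also introduced webhooks that can notify your server about batch events, so check the current documentation if you want to avoid polling altogether.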
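For a scheduled job, the blocking loop becomes a single status check that the scheduler calls repeatedly. A minimal sketch, assuming the status values used earlier in this article: `check_batch` is a hypothetical helper, and `fake_client` is a stand-in that only exists so the example runs without network access. In real use you would pass an `OpenAI()` client instead.

```python
from types import SimpleNamespace

TERMINAL_FAILURES = {"failed", "expired", "cancelled"}


def check_batch(client, batch_id: str) -> str:
    """Run one status check; meant to be called by a scheduler, not in a loop."""
    batch = client.batches.retrieve(batch_id)
    if batch.status == "completed":
        return "done"
    if batch.status in TERMINAL_FAILURES:
        return "failed"
    return "pending"  # any other status means the batch is still in flight


def fake_client(status: str):
    """Offline stand-in exposing only client.batches.retrieve(batch_id)."""
    def retrieve(batch_id):
        return SimpleNamespace(id=batch_id, status=status)
    return SimpleNamespace(batches=SimpleNamespace(retrieve=retrieve))


print(check_batch(fake_client("in_progress"), "batch_123"))  # pending
print(check_batch(fake_client("completed"), "batch_123"))    # done
```

The scheduler only needs to store the batch ID between runs, download results when the check returns "done", and alert when it returns "failed".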

Keep Reading

Anthropic Batch API Guide — The same approach for Claude models.
Cutting LLM API Costs: The Complete Guide — Full framework combining all cost reduction strategies.
LLM API Pricing Comparison 2026 — Current pricing across providers to identify your highest-cost workloads.
