OpenAI's Batch API offers a 50% discount on standard per-token pricing for requests that do not need real-time responses. You submit a JSONL file of requests, OpenAI processes them within a 24-hour window, and you pay half the standard rate. For any workload that can tolerate that delay (data labeling, bulk analysis, nightly reports, content moderation), this is the most straightforward cost reduction available.
How the Batch API Works
The Batch API does not use a different set of models. It serves the same models (GPT-4o, GPT-4o-mini, text-embedding-3-small, and others) at a lower price in exchange for relaxed latency requirements: requests are submitted as an asynchronous job instead of being answered in real time.
The workflow:
- Create a JSONL file where each line is one API request
- Upload the file to OpenAI's Files API
- Create a batch job referencing the uploaded file
- Poll the batch job status (or set a callback)
- When complete, download the results JSONL file
```python
from openai import OpenAI
import json

client = OpenAI()

# Step 1: Create request JSONL
requests = [
    {
        "custom_id": "request-1",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "system", "content": "Classify the sentiment of the following text as positive, negative, or neutral."},
                {"role": "user", "content": "The product arrived on time and works perfectly."}
            ],
            "max_tokens": 10
        }
    },
    {
        "custom_id": "request-2",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "system", "content": "Classify the sentiment of the following text as positive, negative, or neutral."},
                {"role": "user", "content": "Terrible experience, would not recommend."}
            ],
            "max_tokens": 10
        }
    }
]

# Write to JSONL file
with open("batch_requests.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Step 2: Upload the file
with open("batch_requests.jsonl", "rb") as f:
    batch_file = client.files.create(file=f, purpose="batch")

# Step 3: Create batch job
batch_job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)
print(f"Batch job created: {batch_job.id}")
```
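Hand-written request dicts do not scale past a few examples. The sketch below shows one way to generate the same request shape from a list of texts. The helper names `build_requests` and `to_jsonl`, the `request-N` ID scheme, and the default parameters are illustrative assumptions, not part of the OpenAI SDK.

```python
import json

# Same system prompt as the hand-written example above.
SYSTEM_PROMPT = (
    "Classify the sentiment of the following text as positive, negative, or neutral."
)


def build_requests(texts, model="gpt-4o-mini", max_tokens=10):
    """Return one batch request dict per input text (hypothetical helper)."""
    return [
        {
            "custom_id": f"request-{i}",  # must be unique within the file
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": text},
                ],
                "max_tokens": max_tokens,
            },
        }
        for i, text in enumerate(texts, start=1)
    ]


def to_jsonl(requests):
    """Serialize request dicts to JSONL: one JSON object per line."""
    return "".join(json.dumps(req) + "\n" for req in requests)
```

Writing `to_jsonl(build_requests(texts))` to a file gives an input file in the same format as `batch_requests.jsonl`.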
Checking Status and Retrieving Results
```python
import time

# Reuses `client`, `json`, and `batch_job` from the previous snippet.

def wait_for_batch(batch_id: str, poll_interval: int = 60):
    """Block until the batch reaches a terminal state, polling every poll_interval seconds."""
    while True:
        batch = client.batches.retrieve(batch_id)
        print(f"Status: {batch.status}, completed: {batch.request_counts.completed}/{batch.request_counts.total}")
        if batch.status == "completed":
            return batch
        elif batch.status in ["failed", "expired", "cancelled"]:
            raise Exception(f"Batch failed with status: {batch.status}")
        time.sleep(poll_interval)

# Wait for completion (in production, use a scheduled job or webhook)
completed_batch = wait_for_batch(batch_job.id)

# Download results
result_file = client.files.content(completed_batch.output_file_id)
results = [json.loads(line) for line in result_file.text.strip().split("\n")]

for result in results:
    custom_id = result["custom_id"]
    response_content = result["response"]["body"]["choices"][0]["message"]["content"]
    print(f"{custom_id}: {response_content}")
```
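The loop above assumes every request succeeded. Output lines are also not guaranteed to come back in input order, so results should be matched by `custom_id` rather than by position. The following is a minimal sketch of a more defensive parser. The function name `parse_batch_results` is hypothetical, and it assumes each output line carries `custom_id`, `response` (with `status_code` and `body`), and `error` fields.

```python
import json


def parse_batch_results(jsonl_text):
    """Map custom_id -> (content, error) for each line of a batch output file.

    Assumed line shape: {"custom_id": ..., "response": {"status_code": ...,
    "body": {...}}, "error": ...}.
    """
    parsed = {}
    # Split on "\n" only: str.splitlines() would also break on Unicode
    # line separators that may appear inside model output.
    for line in jsonl_text.split("\n"):
        if not line.strip():
            continue
        record = json.loads(line)
        response = record.get("response") or {}
        if record.get("error") or response.get("status_code") != 200:
            # Keep whatever failure detail is available for this request.
            detail = record.get("error") or response.get("body")
            parsed[record["custom_id"]] = (None, detail)
            continue
        content = response["body"]["choices"][0]["message"]["content"]
        parsed[record["custom_id"]] = (content, None)
    return parsed
```

Calling `parse_batch_results(result_file.text)` returns a dictionary you can join back to your source records by ID, with failed requests kept separate for a retry batch.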
Pricing: Actual Savings
The batch pricing is 50% of the standard price. As of May 2026:
| Model | Standard Input | Batch Input | Standard Output | Batch Output |
|-------|---------------|-------------|-----------------|--------------|
| GPT-4o | $2.50/1M | $1.25/1M | $10.00/1M | $5.00/1M |
| GPT-4o-mini | $0.15/1M | $0.075/1M | $0.60/1M | $0.30/1M |
| text-embedding-3-small | $0.02/1M | $0.01/1M | n/a | n/a |
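For a data labeling workload processing 100 million input tokens per month on GPT-4o-mini, the input-side savings are $7.50/month ($15 standard vs. $7.50 batch). For the same workload on GPT-4o, input-side savings are $125/month ($250 standard vs. $125 batch). Output tokens are discounted at the same 50%, so they add to the savings in proportion to how much text the model generates. For high-volume embedding workloads, the Batch API cuts embedding costs in half.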
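To estimate savings for your own token volumes, the arithmetic can be sketched as below. The prices are copied from the table above. The function names and the flat 50% discount constant are illustrative assumptions, not an official calculator.

```python
# USD per 1M tokens, standard (non-batch) pricing from the table above.
STANDARD_PRICES = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "text-embedding-3-small": {"input": 0.02, "output": 0.0},
}
BATCH_DISCOUNT = 0.5  # batch price is 50% of the standard price


def monthly_cost(model, input_tokens, output_tokens=0, batch=False):
    """Cost in USD for the given token volumes."""
    prices = STANDARD_PRICES[model]
    cost = (input_tokens / 1_000_000) * prices["input"]
    cost += (output_tokens / 1_000_000) * prices["output"]
    return cost * BATCH_DISCOUNT if batch else cost


def batch_savings(model, input_tokens, output_tokens=0):
    """Difference between standard and batch cost for the same workload."""
    standard = monthly_cost(model, input_tokens, output_tokens)
    discounted = monthly_cost(model, input_tokens, output_tokens, batch=True)
    return standard - discounted
```

For example, `batch_savings("gpt-4o-mini", 100_000_000)` reproduces the $7.50 figure for 100 million input tokens, and passing an `output_tokens` value shows how much the generated text adds.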
Use Cases That Are Perfect for Batch API
Sentiment analysis at scale. If you process customer feedback, support tickets, or social media mentions, these can all be batched and processed overnight.
Document processing pipelines. Summarizing documents, extracting entities, classifying content — anything that processes a backlog of documents on a schedule rather than in real time.
Data augmentation for ML. Generating synthetic training data, labeling examples, generating variations — all high-volume, non-real-time workloads.
Nightly report generation. Summaries of daily activity, anomaly flags, and management reports can run overnight and be ready in the morning.
Bulk content moderation. Platforms that moderate user-generated content can batch older content that does not need immediate review.
Limits and Considerations
Batch jobs expire after 24 hours if not completed. If OpenAI is under high load, your job might not complete within the window — though in practice this is rare.
A single batch job accepts at most 50,000 requests and at most 200MB of input data; whichever limit you reach first is the one that applies. For larger workloads, split the requests into multiple batch jobs.
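One way to split a large workload is to fill each batch until either limit would be exceeded. The sketch below does this greedily. The function name is hypothetical, and the byte limit is conservatively assumed to be 200,000,000 bytes, since the text gives only "200MB".

```python
import json

MAX_REQUESTS = 50_000
MAX_BYTES = 200_000_000  # assumption: conservative reading of "200MB"


def split_into_batches(requests, max_requests=MAX_REQUESTS, max_bytes=MAX_BYTES):
    """Yield lists of JSONL lines, each list within both limits."""
    current, current_bytes = [], 0
    for req in requests:
        line = json.dumps(req) + "\n"
        size = len(line.encode("utf-8"))
        if size > max_bytes:
            raise ValueError(f"request {req.get('custom_id')} exceeds the size limit")
        # Start a new batch if adding this line would break either limit.
        if current and (len(current) >= max_requests or current_bytes + size > max_bytes):
            yield current
            current, current_bytes = [], 0
        current.append(line)
        current_bytes += size
    if current:
        yield current
```

Each yielded list can be written to its own file with `"".join(lines)` and submitted as a separate batch job.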
Batch requests must use the /v1/chat/completions or /v1/embeddings endpoint. Other endpoints (like images or audio) are not supported.
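For the embeddings endpoint, the request line has the same envelope with a different `url` and `body`. A minimal sketch follows; the helper name and the `embed-N` ID scheme are illustrative assumptions.

```python
import json


def build_embedding_requests(texts, model="text-embedding-3-small"):
    """Return one batch request dict per text, targeting /v1/embeddings."""
    return [
        {
            "custom_id": f"embed-{i}",
            "method": "POST",
            "url": "/v1/embeddings",
            "body": {"model": model, "input": text},
        }
        for i, text in enumerate(texts, start=1)
    ]


# Example: serialize two requests as JSONL lines.
jsonl = "".join(
    json.dumps(r) + "\n" for r in build_embedding_requests(["first doc", "second doc"])
)
```

When creating the job for such a file, the `endpoint` argument to `client.batches.create` should be `"/v1/embeddings"` so that it matches the `url` on each line.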
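The batch creation request has no per-job callback parameter, so the simplest way to detect completion is to poll. In production, implement a scheduled job (cron, Celery beat, or similar) to check batch status every 15-30 minutes. OpenAI also offers project-level webhooks that can deliver batch lifecycle events, which removes the need to poll if you have an endpoint to receive them.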
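A scheduled job should check once and exit rather than loop and sleep. The sketch below shows that single-pass check. The function name and the `"done"`/`"failed"`/`"pending"` labels are hypothetical; `client` is assumed to be an `OpenAI()` instance, and only `client.batches.retrieve` is called.

```python
TERMINAL_FAILURES = {"failed", "expired", "cancelled"}


def check_batch_once(client, batch_id):
    """Check a batch a single time; intended to be run by a scheduler.

    Returns a (state, output_file_id) tuple where state is
    "done", "failed", or "pending".
    """
    batch = client.batches.retrieve(batch_id)
    if batch.status == "completed":
        return "done", batch.output_file_id
    if batch.status in TERMINAL_FAILURES:
        return "failed", None
    # Any other status means the job is still moving through the queue.
    return "pending", None
```

The scheduler can store pending batch IDs in a database or file, call `check_batch_once` for each on every run, and hand completed ones to the download step.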