OpenAI API Guide 2026: Models, Structured Outputs, Batch API, and Cost Optimization
Complete OpenAI API reference for 2026 - model catalog, chat completions, function calling, structured outputs with JSON schema, embeddings, rate limits, and cost management.
The OpenAI API in 2026 covers text generation, embeddings, image generation, audio transcription, and text-to-speech from a single provider. The core interface is Chat Completions. Structured outputs, function calling, the Assistants API, and the Batch API are the most significant features beyond basic text generation. This guide covers what you need to build production applications.
Model Catalog (May 2026)
| Model | Use Case | Input Price | Output Price |
| --- | --- | --- | --- |
| gpt-4o | Complex tasks, multimodal | $2.50/1M tokens | $10/1M tokens |
| gpt-4o-mini | Simple tasks, high volume | $0.15/1M tokens | $0.60/1M tokens |
| o1 | Math, complex reasoning | $15/1M tokens | $60/1M tokens |
| o3 | Advanced reasoning | higher | higher |
| text-embedding-3-small | Embeddings (fast/cheap) | $0.02/1M tokens | - |
| text-embedding-3-large | Embeddings (high quality) | $0.13/1M tokens | - |
| whisper-1 | Audio transcription | $0.006/min | - |
| dall-e-3 | Image generation | per image | - |
| tts-1 | Text to speech | $15/1M chars | - |
gpt-4o-mini is dramatically cheaper than gpt-4o ($0.15 vs $2.50 per million input tokens). For tasks that do not require gpt-4o's full capability - classification, summarization, simple extraction, chat responses to straightforward questions - gpt-4o-mini should be the default to control costs.
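To make the price gap concrete, here is a minimal sketch of a cost estimator that uses the per-million-token prices from the catalog above. The function name `estimateCostUsd` and the monthly workload figures are illustrative assumptions, not part of any OpenAI SDK.

```typescript
// Prices in USD per 1M tokens, copied from the model catalog above.
const PRICES_PER_MILLION = {
  "gpt-4o": { input: 2.5, output: 10 },
  "gpt-4o-mini": { input: 0.15, output: 0.6 },
  "o1": { input: 15, output: 60 },
} as const;

type PricedModel = keyof typeof PRICES_PER_MILLION;

// Estimate the cost in USD for a given number of input and output tokens.
function estimateCostUsd(model: PricedModel, inputTokens: number, outputTokens: number): number {
  const price = PRICES_PER_MILLION[model];
  return (inputTokens / 1_000_000) * price.input + (outputTokens / 1_000_000) * price.output;
}

// Hypothetical workload: 10M input tokens and 2M output tokens per month.
const monthlyOnGpt4o = estimateCostUsd("gpt-4o", 10_000_000, 2_000_000);
const monthlyOnMini = estimateCostUsd("gpt-4o-mini", 10_000_000, 2_000_000);
console.log(monthlyOnGpt4o.toFixed(2), monthlyOnMini.toFixed(2)); // prints "45.00 2.70"
```

For this assumed workload the same traffic costs roughly $45 on gpt-4o and under $3 on gpt-4o-mini, which is why the smaller model is a sensible default.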
Authentication and Organization
pnpm add openai

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  organization: process.env.OPENAI_ORG_ID, // optional, for org-scoped billing
});
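If you call the REST API directly instead of using the SDK, the same credentials travel as HTTP headers. The sketch below shows one way to build them; the helper name `buildOpenAIHeaders` and the placeholder key are assumptions for illustration only.

```typescript
// Build the HTTP headers for a raw REST call made without the SDK.
function buildOpenAIHeaders(apiKey: string, organizationId?: string): Record<string, string> {
  if (!apiKey) {
    // Fail early rather than sending an unauthenticated request.
    throw new Error("OPENAI_API_KEY is not set");
  }
  const headers: Record<string, string> = {
    "Content-Type": "application/json",
    Authorization: `Bearer ${apiKey}`,
  };
  if (organizationId) {
    // Optional: attribute usage to a specific organization.
    headers["OpenAI-Organization"] = organizationId;
  }
  return headers;
}

// "sk-example" is a placeholder, not a real key.
const exampleHeaders = buildOpenAIHeaders("sk-example", "org-example");
console.log(Object.keys(exampleHeaders));
```

Read the key from an environment variable or a secret manager rather than hard-coding it, as the SDK example does.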
Written by Mahmudul Haque Qudrati, CEO and ML Engineer at Pristren.
Chat Completions
const completion = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "What is the capital of France?" },
  ],
  temperature: 0.7, // range 0 to 2; lower is more deterministic, higher is more random
  max_tokens: 1024, // upper bound on generated tokens
  top_p: 1,
  frequency_penalty: 0, // reduce repetition (-2 to 2)
  presence_penalty: 0, // encourage new topics (-2 to 2)
});

// content is string | null, so handle the null case in production code
const text = completion.choices[0].message.content;
Streaming
const stream = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Write a poem about TypeScript." }],
  stream: true,
});

for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta?.content;
  if (delta) process.stdout.write(delta);
}
Streaming chunks arrive as ChatCompletionChunk objects. Each chunk has a choices array with a delta object containing the incremental content.
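When you need the full text as well as the live output, you can fold the deltas into a single string. The sketch below uses a local `ChunkLike` type that mirrors only the fields discussed above, and a fake stream in place of the SDK's stream object; both are assumptions for illustration.

```typescript
// Minimal local shape of a streamed chunk; the SDK's ChatCompletionChunk has more fields.
interface ChunkLike {
  choices: Array<{ delta?: { content?: string | null }; finish_reason?: string | null }>;
}

// Fold a stream of chunks into the full text plus the final finish reason.
async function collectStream(
  chunks: AsyncIterable<ChunkLike>,
): Promise<{ text: string; finishReason: string | null }> {
  let text = "";
  let finishReason: string | null = null;
  for await (const chunk of chunks) {
    const choice = chunk.choices[0];
    if (!choice) continue; // a chunk may carry no choices at all
    text += choice.delta?.content ?? "";
    if (choice.finish_reason) finishReason = choice.finish_reason;
  }
  return { text, finishReason };
}

// Fake stream standing in for the SDK's stream object.
async function* fakeStream(): AsyncGenerator<ChunkLike> {
  yield { choices: [{ delta: { content: "Types " } }] };
  yield { choices: [{ delta: { content: "are checked." } }] };
  yield { choices: [{ delta: {}, finish_reason: "stop" }] };
}

collectStream(fakeStream()).then((result) => console.log(result.text, result.finishReason));
```

Checking the finish reason is useful because a value of "length" means the response was cut off by max_tokens rather than ending naturally.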
Function Calling
Function calling lets the model call defined functions to get information or take actions. The model returns a structured request to call a function; you execute it and return the result.
const tools = [
  {
    type: "function" as const,
    function: {
      name: "get_current_weather",
      description: "Get the current weather in a location",
      parameters: {
        type: "object",
        properties: {
          location: {
            type: "string",
            description: "City and country, e.g. Tokyo, Japan",
          },
          unit: { type: "string", enum: ["celsius", "fahrenheit"] },
        },
        required: ["location"],
      },
    },
  },
];

const response = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "What is the weather in Tokyo?" }],
  tools,
  tool_choice: "auto", // let the model decide when to call tools
});

const message = response.choices[0].message;

if (message.tool_calls) {
  // Only the first tool call is handled here; the model may return several.
  const toolCall = message.tool_calls[0];
  // Arguments arrive as a JSON string generated by the model.
  const args = JSON.parse(toolCall.function.arguments);
  // executeWeatherFunction is your own implementation, not part of the SDK.
  const result = await executeWeatherFunction(args);

  // Continue conversation with tool result
  const finalResponse = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [
      { role: "user", content: "What is the weather in Tokyo?" },
      message, // assistant message with tool_calls
      {
        role: "tool",
        tool_call_id: toolCall.id,
        content: JSON.stringify(result),
      },
    ],
    tools,
  });
}
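The example above handles a single tool call, but an assistant message can contain several. One possible way to handle that is a small dispatcher that runs every call and returns one tool message per call. The types `ToolCallLike` and `ToolMessage`, the function `runToolCalls`, and the stub handler are all hypothetical names that mirror only the fields used above.

```typescript
// Local shapes mirroring the fields used in the example above.
interface ToolCallLike {
  id: string;
  function: { name: string; arguments: string };
}
interface ToolMessage {
  role: "tool";
  tool_call_id: string;
  content: string;
}
type ToolHandler = (args: Record<string, unknown>) => Promise<unknown>;

// Run every tool call and build one tool message per call, in the same order.
async function runToolCalls(
  toolCalls: ToolCallLike[],
  handlers: Record<string, ToolHandler>,
): Promise<ToolMessage[]> {
  return Promise.all(
    toolCalls.map(async (call): Promise<ToolMessage> => {
      let result: unknown;
      try {
        const handler = handlers[call.function.name];
        if (!handler) throw new Error(`Unknown tool: ${call.function.name}`);
        // Arguments are a model-generated JSON string and may be malformed, so parse inside the try.
        result = await handler(JSON.parse(call.function.arguments));
      } catch (err) {
        // Report the failure back to the model instead of crashing the turn.
        result = { error: err instanceof Error ? err.message : String(err) };
      }
      return { role: "tool", tool_call_id: call.id, content: JSON.stringify(result) };
    }),
  );
}

// Stub handler with made-up data; a real one would call a weather service.
const weatherHandlers: Record<string, ToolHandler> = {
  get_current_weather: async (args) => ({ location: args.location, temperature: 21, unit: "celsius" }),
};
```

The returned messages can be appended after the assistant message in the follow-up request, each matched to its call by tool_call_id.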
Structured Outputs
Structured outputs guarantee the model returns JSON that matches a JSON schema. This is stronger than asking the model to "return JSON" - the API enforces the schema at the generation level.
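Structured outputs require a model that supports them, such as gpt-4o (the 2024-08-06 snapshot or later) or gpt-4o-mini. In the Node SDK, the typed parse helper has lived under the beta namespace (client.beta.chat.completions.parse); newer SDK versions may expose it outside beta, so check the version you have installed. The parsed result is fully typed.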
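As a rough sketch of what schema enforcement looks like at the request level, the code below builds a Chat Completions request body with a `json_schema` response format. The invoice schema, the `Invoice` type, and the helper names are hypothetical; only the `response_format` shape follows the API.

```typescript
// A plain JSON Schema for the data we want back. In strict mode every property
// is listed in `required` and `additionalProperties` is false.
const invoiceSchema = {
  type: "object",
  properties: {
    vendor: { type: "string" },
    total: { type: "number" },
    currency: { type: "string", enum: ["USD", "EUR", "GBP"] },
  },
  required: ["vendor", "total", "currency"],
  additionalProperties: false,
} as const;

interface Invoice {
  vendor: string;
  total: number;
  currency: "USD" | "EUR" | "GBP";
}

// Build the request body for POST /v1/chat/completions.
function buildExtractionRequest(documentText: string) {
  return {
    model: "gpt-4o",
    messages: [
      { role: "system", content: "Extract the invoice fields." },
      { role: "user", content: documentText },
    ],
    response_format: {
      type: "json_schema",
      json_schema: { name: "invoice", strict: true, schema: invoiceSchema },
    },
  };
}

// The schema-conforming JSON arrives as a string in message.content.
function parseInvoice(content: string): Invoice {
  return JSON.parse(content) as Invoice;
}

console.log(JSON.stringify(buildExtractionRequest("Acme Corp, total 12.50 USD"), null, 2));
```

If you use Zod, the SDK's zodResponseFormat helper can generate the response_format object from a Zod schema instead of hand-writing JSON Schema. The model can also decline a request, in which case the content will not hold the JSON, so production code should check for that before parsing.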
Embeddings
Embeddings convert text into fixed-length vectors for semantic search, clustering, and similarity comparisons.
const response = await client.embeddings.create({
  model: "text-embedding-3-small",
  input: "The quick brown fox jumps over the lazy dog",
});

const vector = response.data[0].embedding; // number[]
// text-embedding-3-small: 1536 dimensions by default
// text-embedding-3-large: 3072 dimensions by default
For most RAG (retrieval augmented generation) applications, text-embedding-3-small is sufficient and significantly cheaper than text-embedding-3-large. Use text-embedding-3-large only if you have measured that it produces meaningfully better retrieval quality for your specific use case.
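Once you have vectors, retrieval comes down to ranking stored chunks by similarity to the query vector. Here is a minimal in-memory sketch using cosine similarity; the `StoredChunk` type and `topK` function are illustrative names, and the 3-dimensional vectors are toy data rather than real embeddings.

```typescript
// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface StoredChunk {
  id: string;
  embedding: number[];
}

// Return the k chunks most similar to the query, best match first.
function topK(query: number[], chunks: StoredChunk[], k: number): Array<{ id: string; score: number }> {
  return chunks
    .map((c) => ({ id: c.id, score: cosineSimilarity(query, c.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy 3-dimensional vectors; real embeddings have 1536 or 3072 dimensions.
const toyChunks: StoredChunk[] = [
  { id: "pricing", embedding: [1, 0, 0] },
  { id: "rate-limits", embedding: [0, 1, 0] },
  { id: "billing", embedding: [0.9, 0.1, 0] },
];
console.log(topK([1, 0, 0], toyChunks, 2).map((r) => r.id));
```

A linear scan like this is fine for a few thousand chunks; beyond that, a vector database with an approximate nearest-neighbor index is the usual choice.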
Assistants API vs Direct Completions
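The Assistants API manages conversation threads, file uploads, and tool calling state server-side. It is useful for applications where you want OpenAI to manage conversation history instead of maintaining it yourself. OpenAI has announced the deprecation of the Assistants API in favor of the Responses API, so check the current deprecation schedule before starting new work on it.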
Use direct Chat Completions when: you want full control over the conversation, you are integrating with your own storage, or you need lower latency (Assistants API has more overhead).
Use the Assistants API when: you want built-in file handling, you want built-in code interpreter, or you need thread management across sessions without building your own.
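Managing history yourself with direct Chat Completions mostly means deciding which past turns to resend. The sketch below keeps the system prompt and the most recent turns that fit a token budget. The `trimHistory` name and the 4-characters-per-token estimate are assumptions; a real tokenizer would give accurate counts.

```typescript
interface ChatTurn {
  role: "system" | "user" | "assistant";
  content: string;
}

// Keep the system prompt plus the most recent turns that fit a rough token budget.
// The 4-characters-per-token estimate is a crude assumption, not an exact count.
function trimHistory(history: ChatTurn[], maxTokens: number): ChatTurn[] {
  const estimate = (turn: ChatTurn): number => Math.ceil(turn.content.length / 4);
  const system = history.filter((t) => t.role === "system");
  const rest = history.filter((t) => t.role !== "system");

  let budget = maxTokens - system.reduce((sum, t) => sum + estimate(t), 0);
  const kept: ChatTurn[] = [];
  // Walk backwards so the newest turns are kept first.
  for (let i = rest.length - 1; i >= 0; i--) {
    const cost = estimate(rest[i]);
    if (cost > budget) break;
    budget -= cost;
    kept.unshift(rest[i]);
  }
  return [...system, ...kept];
}

const sampleHistory: ChatTurn[] = [
  { role: "system", content: "Be brief" },
  { role: "user", content: "a".repeat(40) },
  { role: "assistant", content: "b".repeat(40) },
  { role: "user", content: "c".repeat(40) },
];
console.log(trimHistory(sampleHistory, 25).map((t) => t.role));
```

Store the full history in your own database and pass only the trimmed view to the API, so nothing is lost when older turns fall outside the budget.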
Batch API: 50% Discount
The Batch API processes requests asynchronously at a 50% discount, with results returned within a 24-hour completion window. It is ideal for bulk processing where no user is waiting on the response.
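A batch is submitted as a JSONL file in which each line is one request with a custom_id, method, url, and body. The sketch below builds such a file as a string; the `buildBatchJsonl` helper, the sentiment prompt, and the sample reviews are hypothetical.

```typescript
// One line of the batch input file.
interface BatchLine {
  custom_id: string;
  method: "POST";
  url: "/v1/chat/completions";
  body: {
    model: string;
    messages: Array<{ role: string; content: string }>;
    max_tokens?: number;
  };
}

// Turn a list of texts into JSONL, one chat completion request per line.
function buildBatchJsonl(items: Array<{ id: string; text: string }>, model = "gpt-4o-mini"): string {
  const lines = items.map(
    (item): BatchLine => ({
      custom_id: item.id, // used to match each result back to its input
      method: "POST",
      url: "/v1/chat/completions",
      body: {
        model,
        messages: [
          { role: "system", content: "Classify the sentiment as positive, negative, or neutral." },
          { role: "user", content: item.text },
        ],
        max_tokens: 5,
      },
    }),
  );
  return lines.map((line) => JSON.stringify(line)).join("\n");
}

const batchJsonl = buildBatchJsonl([
  { id: "review-1", text: "Great product, works as described." },
  { id: "review-2", text: "Stopped working after two days." },
]);
console.log(batchJsonl);
```

You would then upload this file with purpose "batch" and create a batch that references the uploaded file ID, the endpoint, and a "24h" completion window. Match results to inputs by custom_id rather than by line order.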
Rate Limits
OpenAI uses a tier system for rate limits. New accounts start at Tier 1 with conservative limits. Spending history unlocks higher tiers with higher rate limits.
| Tier | Total Spend Required | GPT-4o RPM |
| --- | --- | --- |
| Tier 1 | $5 spent | 500 |
| Tier 2 | $50 spent | 5,000 |
| Tier 3 | $100 spent | 10,000 |
| Tier 4 | $250 spent | 10,000 |
| Tier 5 | $1,000 spent | 30,000 |
For production applications with high request volumes, plan for tier progression. You cannot apply to move up a tier - tiers unlock automatically based on cumulative spend and the time since your first payment.
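Until you reach a higher tier, requests over the limit fail with HTTP 429, so clients usually retry with backoff. The following is a minimal sketch of exponential backoff with jitter. The `withBackoff` helper and its options are hypothetical names, and it assumes the thrown error exposes a numeric `status` property.

```typescript
interface RetryOptions {
  maxAttempts: number;
  baseDelayMs: number;
  sleep?: (ms: number) => Promise<void>; // injectable for tests
}

// Treat an error as a rate-limit error if it carries HTTP status 429.
function isRateLimitError(err: unknown): boolean {
  return typeof err === "object" && err !== null && (err as { status?: number }).status === 429;
}

// Retry fn on 429 errors, waiting a random delay that grows with each attempt.
async function withBackoff<T>(fn: () => Promise<T>, opts: RetryOptions): Promise<T> {
  const sleep = opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Give up on non-rate-limit errors or when attempts are exhausted.
      if (!isRateLimitError(err) || attempt >= opts.maxAttempts) throw err;
      // Full jitter: random delay in [0, baseDelayMs * 2^(attempt - 1)).
      const cap = opts.baseDelayMs * 2 ** (attempt - 1);
      await sleep(Math.random() * cap);
    }
  }
}
```

A call might look like `withBackoff(() => client.chat.completions.create(params), { maxAttempts: 5, baseDelayMs: 500 })`. The official SDK also retries some failures on its own through its maxRetries option, so check that setting before layering your own retries on top.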
Prompt Caching
OpenAI supports automatic prompt caching for long, repeated content. Unlike Anthropic's manual cache_control markers, OpenAI caches the longest common prefix of your messages automatically. Cached tokens are charged at 50% of normal input price.
For prompt caching to activate, the cached prefix must be at least 1,024 tokens. Keep your system prompts stable across requests to benefit from caching.
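To see what caching is worth, you can compute the input cost with part of the prompt billed at the cached rate. This sketch applies the 50% rate and the 1,024-token threshold stated above; the function names and the sample token counts are assumptions for illustration.

```typescript
const MIN_CACHEABLE_PREFIX_TOKENS = 1024; // threshold stated above
const CACHED_TOKEN_RATE = 0.5; // cached tokens are billed at 50% of the input price

// Input cost in USD when cachedTokens of the promptTokens were served from cache.
function inputCostUsd(promptTokens: number, cachedTokens: number, pricePerMillion: number): number {
  if (cachedTokens < 0 || cachedTokens > promptTokens) {
    throw new RangeError("cachedTokens must be between 0 and promptTokens");
  }
  const uncachedTokens = promptTokens - cachedTokens;
  const billableTokens = uncachedTokens + cachedTokens * CACHED_TOKEN_RATE;
  return (billableTokens / 1_000_000) * pricePerMillion;
}

// A stable prefix shorter than the threshold is never cached.
function canBenefitFromCaching(stablePrefixTokens: number): boolean {
  return stablePrefixTokens >= MIN_CACHEABLE_PREFIX_TOKENS;
}

// Hypothetical request: 3,000 prompt tokens, 2,048 of them cached, at gpt-4o's $2.50/1M.
const costWithCache = inputCostUsd(3000, 2048, 2.5);
const costWithoutCache = inputCostUsd(3000, 0, 2.5);
console.log(costWithCache.toFixed(5), costWithoutCache.toFixed(5));
```

In practice this means putting the stable content (system prompt, tool definitions, shared documents) at the start of the messages and the per-request content at the end, so the shared prefix is as long as possible.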
Cost Management Strategy
Default to gpt-4o-mini for all new features; upgrade to gpt-4o only when quality is insufficient
Set max_tokens tightly - if you need 200-word responses, set max_tokens to 300, not 4096
Use the Batch API for any non-interactive bulk processing
Monitor usage per feature in your application code (log token counts with each request)
Set usage limits in the OpenAI dashboard to prevent runaway costs
Cache embeddings in your vector database rather than re-embedding the same content repeatedly
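The per-feature monitoring item in the list above can start as a small in-memory tracker fed from each response's usage object. The `UsageTracker` class and the feature names below are hypothetical; in production you would likely send these numbers to your metrics system instead.

```typescript
// The token counts reported in a chat completion's usage object.
interface UsageLike {
  prompt_tokens: number;
  completion_tokens: number;
}

interface FeatureTotals {
  requests: number;
  promptTokens: number;
  completionTokens: number;
}

// Accumulates token usage per application feature.
class UsageTracker {
  private totals = new Map<string, FeatureTotals>();

  record(feature: string, usage: UsageLike): void {
    const current = this.totals.get(feature) ?? { requests: 0, promptTokens: 0, completionTokens: 0 };
    current.requests += 1;
    current.promptTokens += usage.prompt_tokens;
    current.completionTokens += usage.completion_tokens;
    this.totals.set(feature, current);
  }

  report(): Record<string, FeatureTotals> {
    return Object.fromEntries(this.totals);
  }
}

// Made-up usage numbers for two hypothetical features.
const tracker = new UsageTracker();
tracker.record("summarize", { prompt_tokens: 1200, completion_tokens: 150 });
tracker.record("summarize", { prompt_tokens: 900, completion_tokens: 120 });
tracker.record("classify", { prompt_tokens: 80, completion_tokens: 3 });
console.log(tracker.report());
```

With totals broken down by feature, it becomes clear which features justify gpt-4o and which should move to gpt-4o-mini or the Batch API.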
Frequently Asked Questions
What is the OpenAI API Guide 2026?
The OpenAI API Guide 2026 is a comprehensive reference covering all OpenAI API features including model catalog (GPT-4o, GPT-4o-mini, o1, o3), chat completions, structured outputs, function calling, embeddings, Batch API, rate limits, and cost optimization. It provides code examples and best practices for building production applications.
How does the OpenAI API work?
The OpenAI API works by sending HTTP requests to endpoints like /v1/chat/completions with a model name and messages. The API returns generated text or structured data. You can stream responses, call functions, and enforce JSON schemas. Authentication is via API key. Pricing is per token (input and output).
What are the best practices for using the OpenAI API?
Best practices include: default to GPT-4o-mini for cost savings, set max_tokens tightly, use the Batch API for bulk processing (50% discount), enable prompt caching by keeping system prompts stable, monitor token usage per feature, and set usage limits in the dashboard. For structured outputs, use zodResponseFormat to enforce JSON schemas.
How much does the OpenAI API cost in 2026?
Pricing varies by model: GPT-4o costs $2.50/1M input tokens and $10/1M output; GPT-4o-mini costs $0.15/1M input and $0.60/1M output; o1 costs $15/1M input and $60/1M output; text-embedding-3-small costs $0.02/1M tokens. The Batch API offers a 50% discount with 24-hour turnaround.
Is the OpenAI API worth it in 2026?
Yes, the OpenAI API remains a top choice for production AI applications due to its broad model selection, reliability, and features like structured outputs and function calling. However, consider alternatives like Anthropic or open-source models if you need specific capabilities or lower costs. For most use cases, GPT-4o-mini offers excellent value.
What is the difference between GPT-4o and GPT-4o-mini?
GPT-4o is a full-sized model optimized for complex tasks and multimodal inputs, costing $2.50/1M input tokens. GPT-4o-mini is a smaller, faster, and much cheaper model ($0.15/1M input) suitable for simple tasks like classification, summarization, and straightforward chat. Use GPT-4o-mini as default and upgrade only when quality is insufficient.
How do structured outputs work in the OpenAI API?
Structured outputs allow you to define a JSON schema (e.g., using Zod) and the API guarantees the response matches that schema. Use the SDK's parse helper with response_format built by zodResponseFormat. This is more reliable than asking the model to return JSON. Requires a model that supports structured outputs, such as gpt-4o (the 2024-08-06 snapshot or later) or gpt-4o-mini.
What is the Batch API and how do I use it?
The Batch API lets you submit asynchronous requests at a 50% discount with a 24-hour completion window. You create a JSONL file with custom_id, method, url, and body, upload it as a file, then create a batch. Ideal for bulk processing like summarization or classification. Use GPT-4o-mini for maximum cost savings.