The OpenAI API in 2026 covers text generation, embeddings, image generation, audio transcription, and text-to-speech from a single provider. The core interface is Chat Completions. Structured outputs, function calling, the Assistants API, and the Batch API are the most significant features beyond basic text generation. This guide covers what you need to build production applications.
Model Catalog (May 2026)
| Model | Use Case | Input Price | Output Price |
|---|---|---|---|
| gpt-4o | Complex tasks, multimodal | $2.50/1M tokens | $10/1M tokens |
| gpt-4o-mini | Simple tasks, high volume | $0.15/1M tokens | $0.60/1M tokens |
| o1 | Math, complex reasoning | $15/1M tokens | $60/1M tokens |
| o3 | Advanced reasoning | higher | higher |
| text-embedding-3-small | Embeddings (fast/cheap) | $0.02/1M tokens | n/a |
| text-embedding-3-large | Embeddings (high quality) | $0.13/1M tokens | n/a |
| whisper-1 | Audio transcription | $0.006/min | n/a |
| dall-e-3 | Image generation | per image | n/a |
| tts-1 | Text to speech | $15/1M chars | n/a |
gpt-4o-mini is dramatically cheaper than gpt-4o ($0.15 vs $2.50 per million input tokens). For tasks that do not require gpt-4o's full capability — classification, summarization, simple extraction, chat responses to straightforward questions — gpt-4o-mini should be the default to control costs.
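To make the price gap concrete, here is a rough cost sketch using the per-million-token prices from the table above. The helper name `estimateCostUsd` and the example workload (one million requests at 500 input and 100 output tokens each) are assumptions for illustration, not measurements.

```typescript
// Per-million-token prices in USD, copied from the catalog table above.
const PRICES_PER_MILLION = {
  "gpt-4o": { input: 2.5, output: 10 },
  "gpt-4o-mini": { input: 0.15, output: 0.6 },
} as const;

type PricedModel = keyof typeof PRICES_PER_MILLION;

// Hypothetical helper: estimated USD cost of a single request.
function estimateCostUsd(
  model: PricedModel,
  inputTokens: number,
  outputTokens: number,
): number {
  const price = PRICES_PER_MILLION[model];
  return (inputTokens * price.input + outputTokens * price.output) / 1000000;
}

// Assumed workload: 1,000,000 requests, 500 input and 100 output tokens each.
const requestCount = 1000000;
const gpt4oTotal = estimateCostUsd("gpt-4o", 500, 100) * requestCount;
const miniTotal = estimateCostUsd("gpt-4o-mini", 500, 100) * requestCount;

console.log(`gpt-4o: $${gpt4oTotal.toFixed(2)}`);
console.log(`gpt-4o-mini: $${miniTotal.toFixed(2)}`);
```

Under these assumptions the same workload costs roughly $2,250 on gpt-4o and $135 on gpt-4o-mini, which is why starting on the smaller model and upgrading only where quality falls short tends to pay off.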
Authentication and Organization
pnpm add openai
import OpenAI from "openai";
const client = new OpenAI({
apiKey: process.env.OPENAI_API_KEY,
organization: process.env.OPENAI_ORG_ID, // optional, for org-scoped billing
});
Chat Completions
const completion = await client.chat.completions.create({
model: "gpt-4o",
messages: [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "What is the capital of France?" },
],
temperature: 0.7, // 0 = near-deterministic; higher = more random (range 0 to 2)
max_tokens: 1024,
top_p: 1,
frequency_penalty: 0, // reduce repetition (-2 to 2)
presence_penalty: 0, // encourage new topics (-2 to 2)
});
const text = completion.choices[0].message.content;
Streaming
const stream = await client.chat.completions.create({
model: "gpt-4o",
messages: [{ role: "user", content: "Write a poem about TypeScript." }],
stream: true,
});
for await (const chunk of stream) {
const delta = chunk.choices[0]?.delta?.content;
if (delta) process.stdout.write(delta);
}
Streaming chunks arrive as ChatCompletionChunk objects. Each chunk has a choices array with a delta object containing the incremental content.
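If you also need the complete text once the stream ends (for logging or storage), the deltas can be folded into a running state. The sketch below is one way to do that; `ChunkLike`, `StreamState`, and `applyChunk` are hypothetical names, and the interface covers only the chunk fields used here rather than the SDK's full chunk type.

```typescript
// Minimal structural type covering only the chunk fields used below.
interface ChunkLike {
  choices: Array<{
    delta?: { content?: string | null };
    finish_reason?: string | null;
  }>;
}

interface StreamState {
  text: string;
  finishReason: string | null;
}

// Hypothetical reducer: fold one streamed chunk into the accumulated state.
function applyChunk(state: StreamState, chunk: ChunkLike): StreamState {
  const choice = chunk.choices[0];
  // Some chunks carry no choices (for example a trailing usage-only chunk).
  if (!choice) return state;
  return {
    text: state.text + (choice.delta?.content ?? ""),
    finishReason: choice.finish_reason ?? state.finishReason,
  };
}

// Usage inside the streaming loop shown earlier:
//   let state: StreamState = { text: "", finishReason: null };
//   for await (const chunk of stream) state = applyChunk(state, chunk);
```

A final `finishReason` of `"length"` indicates the response was cut off by the token limit, which is worth checking before storing the text.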
Function Calling
Function calling lets the model call defined functions to get information or take actions. The model returns a structured request to call a function; you execute it and return the result.
const tools = [
{
type: "function" as const,
function: {
name: "get_current_weather",
description: "Get the current weather in a location",
parameters: {
type: "object",
properties: {
location: {
type: "string",
description: "City and country, e.g. Tokyo, Japan",
},
unit: { type: "string", enum: ["celsius", "fahrenheit"] },
},
required: ["location"],
},
},
},
];
const response = await client.chat.completions.create({
model: "gpt-4o",
messages: [{ role: "user", content: "What is the weather in Tokyo?" }],
tools,
tool_choice: "auto", // let the model decide when to call tools
});
const message = response.choices[0].message;
if (message.tool_calls) {
const toolCall = message.tool_calls[0];
const args = JSON.parse(toolCall.function.arguments);
const result = await executeWeatherFunction(args);
// Continue conversation with tool result
const finalResponse = await client.chat.completions.create({
model: "gpt-4o",
messages: [
{ role: "user", content: "What is the weather in Tokyo?" },
message, // assistant message with tool_calls
{
role: "tool",
tool_call_id: toolCall.id,
content: JSON.stringify(result),
},
],
tools,
});
}
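The example above handles only the first entry in `tool_calls`. The model can request several calls in one turn, and each one needs its own `tool` message in the follow-up request. A minimal sketch of that loop follows; `runToolCalls`, `ToolHandler`, and the handler registry are hypothetical, and handlers are synchronous here only to keep the example short (real handlers will usually be async).

```typescript
// Structural types covering only the fields used below.
interface ToolCallLike {
  id: string;
  function: { name: string; arguments: string };
}

interface ToolMessage {
  role: "tool";
  tool_call_id: string;
  content: string;
}

// Hypothetical handler signature: parsed arguments in, JSON-serializable result out.
type ToolHandler = (args: Record<string, unknown>) => unknown;

// Produce one tool message per requested call. Errors are returned to the
// model as JSON instead of being thrown, so the conversation can continue.
function runToolCalls(
  toolCalls: ToolCallLike[],
  handlers: Record<string, ToolHandler>,
): ToolMessage[] {
  return toolCalls.map((call): ToolMessage => {
    let content: string;
    try {
      const handler = handlers[call.function.name];
      if (!handler) throw new Error(`Unknown tool: ${call.function.name}`);
      // Arguments arrive as a JSON string and may be malformed.
      const args = JSON.parse(call.function.arguments) as Record<string, unknown>;
      content = JSON.stringify(handler(args) ?? null);
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      content = JSON.stringify({ error: reason });
    }
    return { role: "tool", tool_call_id: call.id, content };
  });
}
```

The returned messages can be spread into the `messages` array directly after the assistant message that contains `tool_calls`.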
Structured Outputs
Structured outputs guarantee the model returns JSON that matches a JSON schema. This is stronger than asking the model to "return JSON" — the API enforces the schema at the generation level.
import { zodResponseFormat } from "openai/helpers/zod";
import { z } from "zod";
const TaskSchema = z.object({
title: z.string(),
priority: z.enum(["high", "medium", "low"]),
estimatedMinutes: z.number(),
tags: z.array(z.string()),
});
const completion = await client.beta.chat.completions.parse({
model: "gpt-4o",
messages: [
{ role: "user", content: "Extract task: review the quarterly report" },
],
response_format: zodResponseFormat(TaskSchema, "task"),
});
const task = completion.choices[0].message.parsed;
// task is typed as z.infer<typeof TaskSchema>
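Structured outputs are supported on gpt-4o (snapshot gpt-4o-2024-08-06 and later) and on gpt-4o-mini. The parse helper shown here sits under the SDK's beta namespace in the SDK version this example targets, and the parsed result is fully typed. If the model refuses the request, parsed is null and message.refusal carries the reason, so check for that before using the result.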
Embeddings
Embeddings convert text into fixed-length vectors for semantic search, clustering, and similarity comparisons.
const response = await client.embeddings.create({
model: "text-embedding-3-small",
input: "The quick brown fox jumps over the lazy dog",
});
const vector = response.data[0].embedding; // float[]
// text-embedding-3-small: 1536 dimensions
// text-embedding-3-large: 3072 dimensions
For most RAG (retrieval augmented generation) applications, text-embedding-3-small is sufficient and significantly cheaper than text-embedding-3-large. Use text-embedding-3-large only if you have measured that it produces meaningfully better retrieval quality for your specific use case.
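To show how the vectors are used for retrieval, here is a small in-memory similarity search. It is a sketch under the assumption that you already hold the document vectors in memory; `cosineSimilarity`, `topK`, and `EmbeddedDoc` are illustrative names, and a production system would normally delegate this step to a vector database.

```typescript
interface EmbeddedDoc {
  id: string;
  vector: number[];
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same number of dimensions");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k documents most similar to the query vector, best first.
function topK(query: number[], docs: EmbeddedDoc[], k: number): EmbeddedDoc[] {
  return docs
    .map((doc) => ({ doc, score: cosineSimilarity(query, doc.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((entry) => entry.doc);
}
```

Vectors from different embedding models are not comparable, so the query and the documents must be embedded with the same model.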
Assistants API vs Direct Completions
The Assistants API manages conversation threads, file uploads, and tool calling state server-side. It is useful for applications where you want OpenAI to manage conversation history instead of maintaining it yourself.
Use direct Chat Completions when: you want full control over the conversation, you are integrating with your own storage, or you need lower latency (Assistants API has more overhead).
Use the Assistants API when: you want built-in file handling, you want built-in code interpreter, or you need thread management across sessions without building your own.
Batch API: 50% Discount
The Batch API processes requests asynchronously at a 50% discount, with results returned within a 24-hour completion window. It is ideal for bulk processing where no user is waiting on the response.
// Create JSONL file with batch requests
const requests = documents.map((doc, i) => ({
custom_id: `item-${i}`,
method: "POST",
url: "/v1/chat/completions",
body: {
model: "gpt-4o-mini",
messages: [{ role: "user", content: `Summarize: ${doc}` }],
max_tokens: 500,
},
}));
// Upload as file
const file = await client.files.create({
file: new File(
[requests.map((r) => JSON.stringify(r)).join("\n")], // one JSON object per line (JSONL)
"batch.jsonl",
{ type: "application/json" }
),
purpose: "batch",
});
// Create batch
const batch = await client.batches.create({
input_file_id: file.id,
endpoint: "/v1/chat/completions",
completion_window: "24h",
});
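Once the batch reports a `completed` status (checked with `client.batches.retrieve(batch.id)`), its `output_file_id` points to a JSONL file with one result per line, matched to your requests by `custom_id`. The sketch below parses that file's text into a lookup map. The `BatchOutputLine` interface is an assumption that covers only the fields read here, and `parseBatchOutput` is a hypothetical helper; verify the line shape against the current Batch API documentation before relying on it.

```typescript
// Assumed shape of one output line; only the fields read below are listed.
interface BatchOutputLine {
  custom_id: string;
  response: {
    status_code: number;
    body: { choices?: Array<{ message: { content: string | null } }> };
  } | null;
  error: { code: string; message: string } | null;
}

type BatchResult =
  | { ok: true; text: string }
  | { ok: false; reason: string };

// Parse the output file text into a map keyed by custom_id.
function parseBatchOutput(jsonl: string): Map<string, BatchResult> {
  const results = new Map<string, BatchResult>();
  for (const rawLine of jsonl.split("\n")) {
    const line = rawLine.trim();
    if (line === "") continue; // skip blank and trailing lines
    const entry = JSON.parse(line) as BatchOutputLine;
    if (entry.error) {
      const reason = `${entry.error.code}: ${entry.error.message}`;
      results.set(entry.custom_id, { ok: false, reason });
    } else if (!entry.response || entry.response.status_code !== 200) {
      const status = entry.response ? entry.response.status_code : "none";
      results.set(entry.custom_id, { ok: false, reason: `HTTP status ${status}` });
    } else {
      const text = entry.response.body.choices?.[0]?.message.content ?? "";
      results.set(entry.custom_id, { ok: true, text });
    }
  }
  return results;
}

// Usage, assuming the batch has completed:
//   const done = await client.batches.retrieve(batch.id);
//   const fileResponse = await client.files.content(done.output_file_id!);
//   const results = parseBatchOutput(await fileResponse.text());
```

Output lines are not guaranteed to arrive in input order, which is why the map is keyed by `custom_id` rather than by position.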
Rate Limits and Tier Progression
OpenAI uses a tier system for rate limits. New accounts start at Tier 1 with conservative limits. Spending history unlocks higher tiers with higher rate limits.
| Tier | Cumulative Spend Required | GPT-4o RPM |
|---|---|---|
| Tier 1 | $5 spent | 500 |
| Tier 2 | $50 spent | 5,000 |
| Tier 3 | $100 spent | 10,000 |
| Tier 4 | $250 spent | 10,000 |
| Tier 5 | $1,000 spent | 30,000 |
For production applications with high request volumes, plan for tier progression. You cannot apply to increase tiers — they unlock automatically based on spend.
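Until a higher tier unlocks, requests that exceed the limit fail with HTTP 429. The SDK retries some failures on its own (configurable through the client's `maxRetries` option), but a wrapper like the one below gives you explicit control. This is a sketch of exponential backoff with full jitter, not the SDK's internal algorithm; `backoffDelayMs` and `withRetry` are hypothetical helpers, and reading a numeric `status` property off the thrown error is an assumption about the error shape.

```typescript
// Delay before retry number `attempt` (0-based): a random value between 0 and
// min(capMs, baseMs * 2^attempt). `rand` is injectable to make tests deterministic.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 500,
  capMs: number = 30000,
  rand: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(rand() * ceiling);
}

// Retry `fn` on rate-limit (429) and server (5xx) errors, up to maxAttempts calls.
async function withRetry<T>(fn: () => Promise<T>, maxAttempts: number = 5): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Assumption: API errors expose the HTTP status as a numeric `status` field.
      const status = (err as { status?: number } | null)?.status;
      const retryable = status === 429 || (status !== undefined && status >= 500);
      if (!retryable || attempt >= maxAttempts - 1) throw err;
      const delay = backoffDelayMs(attempt);
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
  }
}

// Usage:
//   const completion = await withRetry(() =>
//     client.chat.completions.create({ model: "gpt-4o-mini", messages }),
//   );
```

Jitter matters when many workers hit the limit at once: without it they all retry at the same moment and collide again.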
Prompt Caching
OpenAI supports automatic prompt caching for long, repeated content. Unlike Anthropic's manual cache_control markers, OpenAI caches the longest common prefix of your messages automatically. Cached tokens are charged at 50% of normal input price.
For prompt caching to activate, the cached prefix must be at least 1,024 tokens. Keep your system prompts stable across requests to benefit from caching.
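Because caching matches on the prefix, the practical rule is to put static content first and per-request content last, then check how much of each prompt was actually served from cache. The sketch below illustrates both steps. `buildMessages` and `cachedFraction` are hypothetical helpers, and `UsageLike` lists only the usage fields read here (`prompt_tokens` and `prompt_tokens_details.cached_tokens`).

```typescript
// Only the usage fields read below.
interface UsageLike {
  prompt_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number } | null;
}

// Static instructions first, variable user input last, so that consecutive
// requests share the longest possible prefix.
function buildMessages(staticSystemPrompt: string, userInput: string) {
  return [
    { role: "system" as const, content: staticSystemPrompt },
    { role: "user" as const, content: userInput },
  ];
}

// Fraction of prompt tokens served from cache, between 0 and 1.
function cachedFraction(usage: UsageLike): number {
  if (usage.prompt_tokens === 0) return 0;
  const cached = usage.prompt_tokens_details?.cached_tokens ?? 0;
  return cached / usage.prompt_tokens;
}

// Usage:
//   const completion = await client.chat.completions.create({
//     model: "gpt-4o",
//     messages: buildMessages(LONG_STATIC_PROMPT, userQuestion),
//   });
//   if (completion.usage) console.log(cachedFraction(completion.usage));
```

Avoid interpolating timestamps, request IDs, or user names into the system prompt: any change near the start of the prompt invalidates the shared prefix.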
Cost Management Strategy
- Default to gpt-4o-mini for all new features; upgrade to gpt-4o only when quality is insufficient
- Set max_tokens tightly: if you need 200-word responses, set max_tokens to 300, not 4096
- Use the Batch API for any non-interactive bulk processing
- Monitor usage per feature in your application code (log token counts with each request)
- Set usage limits in the OpenAI dashboard to prevent runaway costs
- Cache embeddings in your vector database rather than re-embedding the same content repeatedly
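Per-feature monitoring from the list above can start as a small in-process tally. The sketch below aggregates token counts by feature name; `UsageTracker` and its method names are hypothetical, and it reads only the `prompt_tokens` and `completion_tokens` fields of the response's `usage` object. In production you would likely forward these numbers to your metrics system rather than keep them in memory.

```typescript
// Only the usage fields read below.
interface TokenUsage {
  prompt_tokens: number;
  completion_tokens: number;
}

interface FeatureTotals {
  requests: number;
  promptTokens: number;
  completionTokens: number;
}

// Hypothetical in-memory tally of token usage per application feature.
class UsageTracker {
  private totals = new Map<string, FeatureTotals>();

  // Add one request's usage to the running totals for `feature`.
  record(feature: string, usage: TokenUsage): void {
    const current = this.totals.get(feature) ?? {
      requests: 0,
      promptTokens: 0,
      completionTokens: 0,
    };
    this.totals.set(feature, {
      requests: current.requests + 1,
      promptTokens: current.promptTokens + usage.prompt_tokens,
      completionTokens: current.completionTokens + usage.completion_tokens,
    });
  }

  // Snapshot of the totals as a plain object, suitable for logging.
  report(): Record<string, FeatureTotals> {
    const snapshot: Record<string, FeatureTotals> = {};
    this.totals.forEach((value, key) => {
      snapshot[key] = { ...value };
    });
    return snapshot;
  }
}

// Usage:
//   const tracker = new UsageTracker();
//   const completion = await client.chat.completions.create({ ... });
//   if (completion.usage) tracker.record("summarize", completion.usage);
```

Tagging usage by feature makes it clear which parts of the application justify gpt-4o and which could move to gpt-4o-mini or the Batch API.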