OpenAI API Guide 2026: Models, Structured Outputs, Batch API, and Cost Optimization

Complete OpenAI API reference for 2026 — model catalog, chat completions, function calling, structured outputs with JSON schema, embeddings, rate limits, and cost management.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 18, 2026

10 min read

// tags

#openai#gpt-4o#api#structured-outputs#embeddings

The OpenAI API in 2026 covers text generation, embeddings, image generation, audio transcription, and text-to-speech from a single provider. The core interface is Chat Completions. Structured outputs, function calling, the Assistants API, and the Batch API are the most significant features beyond basic text generation. This guide covers what you need to build production applications.

Model Catalog (May 2026)

| Model | Use Case | Input | Output |
|---|---|---|---|
| gpt-4o | Complex tasks, multimodal | $2.50/1M tokens | $10/1M tokens |
| gpt-4o-mini | Simple tasks, high volume | $0.15/1M tokens | $0.60/1M tokens |
| o1 | Math, complex reasoning | $15/1M tokens | $60/1M tokens |
| o3 | Advanced reasoning | higher | higher |
| text-embedding-3-small | Embeddings (fast/cheap) | $0.02/1M tokens | — |
| text-embedding-3-large | Embeddings (high quality) | $0.13/1M tokens | — |
| whisper-1 | Audio transcription | $0.006/min | — |
| dall-e-3 | Image generation | per image | — |
| tts-1 | Text to speech | $15/1M chars | — |

gpt-4o-mini is dramatically cheaper than gpt-4o ($0.15 vs $2.50 per million input tokens). For tasks that do not require gpt-4o's full capability — classification, summarization, simple extraction, chat responses to straightforward questions — gpt-4o-mini should be the default to control costs.
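
To make the price gap concrete, here is a minimal sketch that estimates the cost of a workload from the per-million-token prices in the table above. The helper name `estimateCostUsd` and the example workload are hypothetical, and the prices are only as current as the table.

```typescript
// Prices in USD per 1M tokens, copied from the model catalog table above.
type PricedModel = "gpt-4o" | "gpt-4o-mini" | "o1";

const PRICE_PER_MILLION: Record<PricedModel, { input: number; output: number }> = {
  "gpt-4o": { input: 2.5, output: 10 },
  "gpt-4o-mini": { input: 0.15, output: 0.6 },
  o1: { input: 15, output: 60 },
};

// Hypothetical helper: estimated cost in USD for a given token volume.
function estimateCostUsd(
  model: PricedModel,
  inputTokens: number,
  outputTokens: number
): number {
  const price = PRICE_PER_MILLION[model];
  return (
    (inputTokens / 1_000_000) * price.input +
    (outputTokens / 1_000_000) * price.output
  );
}

// Example workload (an assumption): 10,000 requests,
// each with roughly 800 input tokens and 200 output tokens.
const requestCount = 10_000;
const costLarge = estimateCostUsd("gpt-4o", requestCount * 800, requestCount * 200);
const costMini = estimateCostUsd("gpt-4o-mini", requestCount * 800, requestCount * 200);

console.log(`gpt-4o: $${costLarge.toFixed(2)}, gpt-4o-mini: $${costMini.toFixed(2)}`);
```

For this workload the estimate comes to about $40 on gpt-4o versus about $2.40 on gpt-4o-mini.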

Authentication and Organization

pnpm add openai

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  organization: process.env.OPENAI_ORG_ID,  // optional, for org-scoped billing
});

Chat Completions

const completion = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "What is the capital of France?" },
  ],
  temperature: 0.7,       // 0 = near-deterministic, higher = more random (range 0 to 2)
  max_tokens: 1024,
  top_p: 1,
  frequency_penalty: 0,   // reduce repetition (-2 to 2)
  presence_penalty: 0,    // encourage new topics (-2 to 2)
});

const text = completion.choices[0].message.content;

Streaming

const stream = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Write a poem about TypeScript." }],
  stream: true,
});

for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta?.content;
  if (delta) process.stdout.write(delta);
}

Streaming chunks arrive as ChatCompletionChunk objects. Each chunk has a choices array with a delta object containing the incremental content.
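
In practice you usually want the full text as well as the incremental pieces. The sketch below accumulates deltas and records the finish reason. `ChunkLike` and `StreamAccumulator` are hypothetical names, and the type covers only the chunk fields used here; the simulated chunks stand in for a real stream.

```typescript
// Minimal structural type covering only the chunk fields used below.
interface ChunkLike {
  choices: Array<{
    delta?: { content?: string | null };
    finish_reason?: string | null;
  }>;
}

class StreamAccumulator {
  text = "";
  finishReason: string | null = null;

  // Appends one chunk's delta and returns the new fragment (possibly empty).
  push(chunk: ChunkLike): string {
    const choice = chunk.choices[0];
    // Some chunks carry no choices at all (for example a trailing usage chunk).
    if (!choice) return "";
    if (choice.finish_reason) this.finishReason = choice.finish_reason;
    const piece = choice.delta?.content ?? "";
    this.text += piece;
    return piece;
  }
}

// Simulated chunks, shaped like the ones a stream yields.
const simulatedChunks: ChunkLike[] = [
  { choices: [{ delta: { content: "Hello" } }] },
  { choices: [{ delta: { content: ", world" } }] },
  { choices: [{ delta: {}, finish_reason: "stop" }] },
  { choices: [] },
];

const accumulator = new StreamAccumulator();
for (const chunk of simulatedChunks) accumulator.push(chunk);

console.log(accumulator.text, accumulator.finishReason);
```

With a real stream, the loop body becomes `process.stdout.write(accumulator.push(chunk))` inside the `for await`. A finish reason of `"length"` means the response was cut off by the token limit.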

Function Calling

Function calling lets the model call defined functions to get information or take actions. The model returns a structured request to call a function; you execute it and return the result.

const tools = [
  {
    type: "function" as const,
    function: {
      name: "get_current_weather",
      description: "Get the current weather in a location",
      parameters: {
        type: "object",
        properties: {
          location: {
            type: "string",
            description: "City and country, e.g. Tokyo, Japan",
          },
          unit: { type: "string", enum: ["celsius", "fahrenheit"] },
        },
        required: ["location"],
      },
    },
  },
];

const response = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "What is the weather in Tokyo?" }],
  tools,
  tool_choice: "auto",  // let the model decide when to call tools
});

const message = response.choices[0].message;
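if (message.tool_calls) {
  // Only the first tool call is handled here; if the model returns several,
  // each one needs its own "tool" message in the follow-up request.
  const toolCall = message.tool_calls[0];
  const args = JSON.parse(toolCall.function.arguments);
  // executeWeatherFunction is your own implementation (not shown)
  const result = await executeWeatherFunction(args);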

  // Continue conversation with tool result
  const finalResponse = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [
      { role: "user", content: "What is the weather in Tokyo?" },
      message,  // assistant message with tool_calls
      {
        role: "tool",
        tool_call_id: toolCall.id,
        content: JSON.stringify(result),
      },
    ],
    tools,
  });
}
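
Once an application has more than one tool, the execute-and-reply step is easier to manage as a lookup from tool name to handler. The following is a minimal sketch: `dispatchToolCall`, `ToolCallLike`, and the stub handler are hypothetical. Handlers are synchronous here to keep the example short; real ones would usually be async.

```typescript
// Structural types covering only the fields used here.
interface ToolCallLike {
  id: string;
  function: { name: string; arguments: string };
}

interface ToolMessage {
  role: "tool";
  tool_call_id: string;
  content: string;
}

type ToolHandler = (args: Record<string, unknown>) => unknown;

// Runs the matching handler and wraps the result as a "tool" message.
// Failures are returned to the model as JSON error payloads instead of thrown.
function dispatchToolCall(
  call: ToolCallLike,
  handlers: Map<string, ToolHandler>
): ToolMessage {
  const reply = (payload: unknown): ToolMessage => ({
    role: "tool",
    tool_call_id: call.id,
    content: JSON.stringify(payload ?? null),
  });

  const handler = handlers.get(call.function.name);
  if (!handler) return reply({ error: `unknown tool: ${call.function.name}` });

  let args: unknown;
  try {
    args = JSON.parse(call.function.arguments);
  } catch {
    return reply({ error: "arguments were not valid JSON" });
  }
  if (typeof args !== "object" || args === null || Array.isArray(args)) {
    return reply({ error: "arguments must be a JSON object" });
  }
  return reply(handler(args as Record<string, unknown>));
}

// Stub handler returning made-up data, for illustration only.
const toolHandlers = new Map<string, ToolHandler>([
  [
    "get_current_weather",
    (args) => ({
      location: args["location"],
      temperature: 21,
      unit: args["unit"] ?? "celsius",
    }),
  ],
]);

const toolReply = dispatchToolCall(
  {
    id: "call_1",
    function: { name: "get_current_weather", arguments: '{"location":"Tokyo, Japan"}' },
  },
  toolHandlers
);
console.log(toolReply);
```

Mapping every entry of `message.tool_calls` through a dispatcher like this yields one tool message per call, which is what the follow-up request expects.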

Structured Outputs

Structured outputs guarantee the model returns JSON that matches a JSON schema. This is stronger than asking the model to "return JSON" — the API enforces the schema at the generation level.

import { zodResponseFormat } from "openai/helpers/zod";
import { z } from "zod";

const TaskSchema = z.object({
  title: z.string(),
  priority: z.enum(["high", "medium", "low"]),
  estimatedMinutes: z.number(),
  tags: z.array(z.string()),
});

const completion = await client.beta.chat.completions.parse({
  model: "gpt-4o",
  messages: [
    { role: "user", content: "Extract task: review the quarterly report" },
  ],
  response_format: zodResponseFormat(TaskSchema, "task"),
});

const task = completion.choices[0].message.parsed;
// task is typed as z.infer<typeof TaskSchema>

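Structured outputs are supported on gpt-4o (the gpt-4o-2024-08-06 snapshot and later) and on gpt-4o-mini; older gpt-4o snapshots support only JSON mode. The parse helper shown here lives under the SDK's beta namespace. The parsed result is typed from the Zod schema and is null when the model refuses, so check it before use.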
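
If you are not using Zod, the same schema can be written as a plain JSON schema `response_format`. The sketch below mirrors `TaskSchema` and adds a small lint. It assumes, as far as I understand strict mode, that every property must be listed in `required` and `additionalProperties` must be `false`. `strictSchemaProblems` is a hypothetical helper, not part of the SDK.

```typescript
// Only the parts of an object schema that this sketch inspects.
interface ObjectSchema {
  type: "object";
  properties: Record<string, unknown>;
  required: string[];
  additionalProperties: boolean;
}

// Plain JSON schema equivalent of the Zod TaskSchema above.
const taskJsonSchema: ObjectSchema = {
  type: "object",
  properties: {
    title: { type: "string" },
    priority: { type: "string", enum: ["high", "medium", "low"] },
    estimatedMinutes: { type: "number" },
    tags: { type: "array", items: { type: "string" } },
  },
  required: ["title", "priority", "estimatedMinutes", "tags"],
  additionalProperties: false,
};

// Value to pass as response_format in a chat completion request.
const taskResponseFormat = {
  type: "json_schema" as const,
  json_schema: { name: "task", strict: true, schema: taskJsonSchema },
};

// Hypothetical lint: lists ways the top-level schema breaks the assumed strict-mode rules.
function strictSchemaProblems(schema: ObjectSchema): string[] {
  const problems: string[] = [];
  for (const key of Object.keys(schema.properties)) {
    if (schema.required.indexOf(key) === -1) {
      problems.push(`"${key}" is not listed in required`);
    }
  }
  if (schema.additionalProperties !== false) {
    problems.push("additionalProperties must be false");
  }
  return problems;
}

console.log(strictSchemaProblems(taskJsonSchema));
```

The lint only checks the top level; nested object schemas would need the same treatment.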

Embeddings

Embeddings convert text into fixed-length vectors for semantic search, clustering, and similarity comparisons.

const response = await client.embeddings.create({
  model: "text-embedding-3-small",
  input: "The quick brown fox jumps over the lazy dog",
});

const vector = response.data[0].embedding;  // number[]
// text-embedding-3-small: 1536 dimensions by default
// text-embedding-3-large: 3072 dimensions by default

For most RAG (retrieval augmented generation) applications, text-embedding-3-small is sufficient and significantly cheaper than text-embedding-3-large. Use text-embedding-3-large only if you have measured that it produces meaningfully better retrieval quality for your specific use case.
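
Semantic search over embeddings usually comes down to ranking stored vectors by cosine similarity to a query vector. Here is a minimal in-memory sketch; `cosineSimilarity` and `topK` are hypothetical helpers, and the tiny 3-dimensional vectors stand in for real embeddings. A vector database does the same thing at scale.

```typescript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vector: define similarity as 0
  return dot / Math.sqrt(normA * normB);
}

interface EmbeddedDoc {
  id: string;
  vector: number[];
}

// Returns the k documents most similar to the query, best match first.
function topK(query: number[], docs: EmbeddedDoc[], k: number): EmbeddedDoc[] {
  return docs
    .map((doc) => ({ doc, score: cosineSimilarity(query, doc.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((entry) => entry.doc);
}

// Toy vectors for illustration; real embeddings have 1536 or 3072 dimensions.
const corpus: EmbeddedDoc[] = [
  { id: "cats", vector: [1, 0, 0] },
  { id: "dogs", vector: [0.9, 0.1, 0] },
  { id: "taxes", vector: [0, 0, 1] },
];

console.log(topK([1, 0, 0], corpus, 2).map((d) => d.id));
```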

Assistants API vs Direct Completions

The Assistants API manages conversation threads, file uploads, and tool calling state server-side. It is useful for applications where you want OpenAI to manage conversation history instead of maintaining it yourself.

Use direct Chat Completions when: you want full control over the conversation, you are integrating with your own storage, or you need lower latency (Assistants API has more overhead).

Use the Assistants API when: you want built-in file handling, you want built-in code interpreter, or you need thread management across sessions without building your own.

Batch API: 50% Discount

The Batch API processes requests asynchronously at a 50% discount, with results returned within a 24-hour completion window. It is ideal for bulk processing that does not need an immediate response.

// Create JSONL file with batch requests
const requests = documents.map((doc, i) => ({
  custom_id: `item-${i}`,
  method: "POST",
  url: "/v1/chat/completions",
  body: {
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: `Summarize: ${doc}` }],
    max_tokens: 500,
  },
}));

// Upload as file
const file = await client.files.create({
  file: new File(
    // JSONL: one JSON object per line, joined with newline characters
    [requests.map((r) => JSON.stringify(r)).join("\n")],
    "batch.jsonl",
    { type: "application/json" }
  ),
  purpose: "batch",
});

// Create batch
const batch = await client.batches.create({
  input_file_id: file.id,
  endpoint: "/v1/chat/completions",
  completion_window: "24h",
});
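
The snippet above stops at creating the batch. Once the batch completes, the results arrive as another JSONL file, and output lines are not guaranteed to be in input order, so they should be matched on `custom_id`. This sketch parses such a file. The `BatchOutputLine` shape reflects my understanding of the documented output format and should be treated as an assumption; `parseBatchOutput` is a hypothetical helper.

```typescript
// Assumed shape of one line in the batch output file (only the fields used here).
interface BatchOutputLine {
  custom_id: string;
  response: {
    status_code: number;
    body: { choices: Array<{ message: { content: string | null } }> };
  } | null;
  error: { code: string; message: string } | null;
}

// Maps custom_id to the generated text, skipping failed or empty results.
function parseBatchOutput(jsonl: string): Map<string, string> {
  const results = new Map<string, string>();
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue;
    const row = JSON.parse(line) as BatchOutputLine;
    const response = row.response;
    if (response === null || response.status_code !== 200) continue;
    const content = response.body.choices[0]?.message.content;
    if (typeof content === "string") results.set(row.custom_id, content);
  }
  return results;
}

// Simulated output file: out of order, with one failed request.
const sampleOutput = [
  '{"custom_id":"item-1","response":{"status_code":200,"body":{"choices":[{"message":{"content":"Summary B"}}]}},"error":null}',
  '{"custom_id":"item-2","response":null,"error":{"code":"server_error","message":"failed"}}',
  '{"custom_id":"item-0","response":{"status_code":200,"body":{"choices":[{"message":{"content":"Summary A"}}]}},"error":null}',
].join("\n");

const summaries = parseBatchOutput(sampleOutput);
console.log(summaries.get("item-0"), summaries.get("item-1"), summaries.has("item-2"));
```

To get the real file, poll `client.batches.retrieve(batch.id)` until its status is `completed`, then download the file referenced by `output_file_id` with `client.files.content(...)` and pass its text to the parser.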

Rate Limits and Tier Progression

OpenAI uses a tier system for rate limits. New accounts start at Tier 1 with conservative limits. Spending history unlocks higher tiers with higher rate limits.

| Tier | Cumulative Spend Required | GPT-4o RPM |
|---|---|---|
| Tier 1 | $5 spent | 500 |
| Tier 2 | $50 spent | 5,000 |
| Tier 3 | $100 spent | 10,000 |
| Tier 4 | $250 spent | 10,000 |
| Tier 5 | $1,000 spent | 30,000 |

For production applications with high request volumes, plan for tier progression. You cannot apply to increase tiers — they unlock automatically based on spend.
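
Until a higher tier unlocks, requests that exceed the limit fail with HTTP 429, and the usual response is to retry with exponential backoff and jitter. The SDK client already retries some failures itself (see its `maxRetries` option). The sketch below shows the underlying technique for cases where you want your own policy. `backoffDelayMs` and `withRetry` are hypothetical helpers, and reading `status` off the thrown error is an assumption about the error shape.

```typescript
// "Full jitter" backoff: a random delay between 0 and min(maxMs, baseMs * 2^attempt).
function backoffDelayMs(
  attempt: number,
  baseMs: number = 500,
  maxMs: number = 30_000,
  random: () => number = Math.random
): number {
  const ceiling = Math.min(maxMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

// Retries fn on rate-limit errors (status 429), up to maxAttempts tries in total.
async function withRetry<T>(fn: () => Promise<T>, maxAttempts: number = 5): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Assumption: the thrown error exposes the HTTP status as `status`.
      const status = (err as { status?: number }).status;
      if (status !== 429 || attempt >= maxAttempts - 1) throw err;
      const delay = backoffDelayMs(attempt);
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
  }
}

// With a fixed "random" value of 0.5 the delays are deterministic:
console.log([0, 1, 2, 3].map((n) => backoffDelayMs(n, 500, 30_000, () => 0.5)));
```

A call site would look like `await withRetry(() => client.chat.completions.create({ ... }))`. Jitter matters because many workers retrying on the same schedule would hit the limit again at the same moment.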

Prompt Caching

OpenAI supports automatic prompt caching for long, repeated content. Unlike Anthropic's manual cache_control markers, OpenAI caches the longest common prefix of your messages automatically. Cached tokens are charged at 50% of normal input price.

For prompt caching to activate, the cached prefix must be at least 1,024 tokens. Keep your system prompts stable across requests to benefit from caching.
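
To check whether caching is actually kicking in, compare the cached token count reported in each response's usage against the total prompt tokens. This sketch assumes the count is exposed as `usage.prompt_tokens_details.cached_tokens` and applies the 50% rate stated above. `inputCostUsd` and `cacheHitRatio` are hypothetical helpers.

```typescript
// Only the usage fields this sketch reads.
interface UsageLike {
  prompt_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number };
}

// Input cost in USD, charging cached tokens at a discounted rate.
function inputCostUsd(
  usage: UsageLike,
  pricePerMillion: number,
  cachedDiscount: number = 0.5
): number {
  const cached = usage.prompt_tokens_details?.cached_tokens ?? 0;
  const uncached = usage.prompt_tokens - cached;
  const cachedPrice = pricePerMillion * (1 - cachedDiscount);
  return (uncached * pricePerMillion + cached * cachedPrice) / 1_000_000;
}

// Fraction of prompt tokens served from cache (0 when there are no prompt tokens).
function cacheHitRatio(usage: UsageLike): number {
  if (usage.prompt_tokens === 0) return 0;
  return (usage.prompt_tokens_details?.cached_tokens ?? 0) / usage.prompt_tokens;
}

// Example usage object (made-up numbers): 2,000 prompt tokens, 1,024 of them cached.
const exampleUsage: UsageLike = {
  prompt_tokens: 2000,
  prompt_tokens_details: { cached_tokens: 1024 },
};

console.log(inputCostUsd(exampleUsage, 2.5), cacheHitRatio(exampleUsage));
```

A hit ratio that stays at zero for a long system prompt usually means something near the start of the prompt changes between requests, such as a timestamp or user ID placed before the stable instructions.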

Cost Management Strategy

Default to gpt-4o-mini for all new features; upgrade to gpt-4o only when quality is insufficient
Set max_tokens tightly — if you need 200-word responses, set max_tokens to 300, not 4096
Use the Batch API for any non-interactive bulk processing
Monitor usage per feature in your application code (log token counts with each request)
Set usage limits in the OpenAI dashboard to prevent runaway costs
Cache embeddings in your vector database rather than re-embedding the same content repeatedly
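
Two of the points above, tight `max_tokens` values and per-feature usage monitoring, can be sketched in a few lines. `UsageTracker` and `maxTokensForWords` are hypothetical helpers. The 1.5 tokens-per-word ratio is a rough assumption with headroom for English text, chosen to match the 200-word to 300-token example above.

```typescript
interface TokenUsage {
  prompt_tokens: number;
  completion_tokens: number;
}

interface FeatureTotals extends TokenUsage {
  feature: string;
  requests: number;
}

// Accumulates token counts per application feature, in memory.
class UsageTracker {
  private totals = new Map<string, FeatureTotals>();

  // Call once per API response, passing the response's usage object.
  record(feature: string, usage: TokenUsage): void {
    const current = this.totals.get(feature) ?? {
      feature,
      requests: 0,
      prompt_tokens: 0,
      completion_tokens: 0,
    };
    current.requests += 1;
    current.prompt_tokens += usage.prompt_tokens;
    current.completion_tokens += usage.completion_tokens;
    this.totals.set(feature, current);
  }

  // Features sorted by total tokens, largest first.
  report(): FeatureTotals[] {
    return Array.from(this.totals.values()).sort(
      (a, b) =>
        b.prompt_tokens + b.completion_tokens - (a.prompt_tokens + a.completion_tokens)
    );
  }
}

// Rough max_tokens budget for a target response length in words.
function maxTokensForWords(words: number, tokensPerWord: number = 1.5): number {
  return Math.ceil(words * tokensPerWord);
}

const tracker = new UsageTracker();
tracker.record("summarize", { prompt_tokens: 100, completion_tokens: 50 });
tracker.record("summarize", { prompt_tokens: 200, completion_tokens: 25 });
tracker.record("classify", { prompt_tokens: 10, completion_tokens: 1 });

console.log(tracker.report(), maxTokensForWords(200));
```

In production the totals would go to your metrics or logging system rather than an in-memory map, but the per-feature key is the part that makes cost regressions traceable.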

Keep Reading

Vercel AI SDK Guide — higher-level wrapper for Next.js that simplifies OpenAI integration
LLM API Pricing Comparison 2026 — comparing OpenAI pricing against Anthropic and others
o1 and o3 Reasoning Models Guide — when to use o1 vs gpt-4o
