Anthropic API Guide: Claude Integration From Authentication to Prompt Caching

Complete guide to the Anthropic API - authentication, message format, streaming, tool use, prompt caching (cached input tokens billed at 10% of the base price), batch processing, and production error handling.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 18, 2026

10 min read

#anthropic #claude #api #prompt-caching #tool-use

Basic Message Format

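import Anthropic from "@anthropic-ai/sdk";

// The client reads the API key from the ANTHROPIC_API_KEY environment variable by default
const client = new Anthropic();

const response = await client.messages.create({
  model: "claude-3-5-sonnet-20241022",
  max_tokens: 1024,
  system: "You are a helpful assistant.",
  messages: [
    { role: "user", content: "What is the capital of France?" }
  ],
});

// Narrow the first content block to a text block before reading .text
const text = response.content[0].type === "text"
  ? response.content[0].text
  : "";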

Key parameters:

model: which Claude model to use
max_tokens: the maximum number of tokens to generate (required)
system: system prompt (optional; a top-level parameter rather than a message role)
messages: array of alternating user/assistant turns

The content array in the response can contain multiple content blocks. For text-only responses there will be one block with type: "text".
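Because the content array can hold several blocks (for example text followed by a tool_use block), reading only content[0] can miss text. The sketch below collects every text block in order; the types are simplified local stand-ins for the SDK's response types, an assumption made so the example runs on its own.

```typescript
// Simplified local stand-ins for response content blocks (assumption: the
// SDK's real types carry more fields and more block types).
type TextBlock = { type: "text"; text: string };
type ToolUseBlock = { type: "tool_use"; id: string; name: string; input: unknown };
type ContentBlock = TextBlock | ToolUseBlock;

// Join the text of every text block in order, skipping all other block types.
function extractText(blocks: ContentBlock[], separator = ""): string {
  return blocks
    .filter((block): block is TextBlock => block.type === "text")
    .map((block) => block.text)
    .join(separator);
}

// A hand-written content array shaped like a response that ends in a tool call
const sampleContent: ContentBlock[] = [
  { type: "text", text: "The capital of France is Paris. " },
  { type: "text", text: "Let me check the weather there." },
  { type: "tool_use", id: "toolu_01", name: "get_weather", input: { location: "Paris" } },
];

console.log(extractText(sampleContent));
```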

Multi-Turn Conversations

const messages = [
  { role: "user", content: "What is React?" },
  { role: "assistant", content: "React is a JavaScript library..." },
  { role: "user", content: "How does it differ from Vue?" },
];

const response = await client.messages.create({
  model: "claude-3-5-sonnet-20241022",
  max_tokens: 1024,
  messages,
});

You maintain conversation history by including all previous turns in the messages array. There is no server-side session - you send the full history with each request.
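Since the full history goes out with every request, input tokens (and cost) grow with each turn. One common mitigation, sketched here as an assumption rather than anything the API does for you, is to keep only the most recent messages. The hypothetical trimHistory helper caps the history and drops a leading assistant message so the trimmed array still opens with a user turn.

```typescript
// Simplified local stand-in for a text-only message (assumption: the SDK's
// MessageParam also accepts arrays of content blocks).
type Role = "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

// Keep at most maxMessages of the most recent messages, then drop any leading
// assistant message so the trimmed history still starts with a user turn.
function trimHistory(messages: ChatMessage[], maxMessages: number): ChatMessage[] {
  if (maxMessages <= 0) return []; // slice(-0) would otherwise return everything
  let trimmed = messages.slice(-maxMessages);
  while (trimmed.length > 0 && trimmed[0].role !== "user") {
    trimmed = trimmed.slice(1);
  }
  return trimmed;
}

const chatLog: ChatMessage[] = [
  { role: "user", content: "What is React?" },
  { role: "assistant", content: "React is a JavaScript library..." },
  { role: "user", content: "How does it differ from Vue?" },
  { role: "assistant", content: "Vue uses templates by default..." },
  { role: "user", content: "Which should I learn first?" },
];

// Cap at 4 messages before building the next request's messages array
const recentTurns = trimHistory(chatLog, 4);
console.log(recentTurns.map((m) => m.role).join(","));
```

Trimming discards older context, so the model no longer sees those turns. If the history contains tool calls, trim on turn boundaries so a tool_result block is never kept without the tool_use block it answers.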

Streaming Responses

const stream = client.messages.stream({
  model: "claude-3-5-sonnet-20241022",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Write a haiku about TypeScript." }],
});

for await (const event of stream) {
  if (
    event.type === "content_block_delta" &&
    event.delta.type === "text_delta"
  ) {
    process.stdout.write(event.delta.text);
  }
}

const finalMessage = await stream.finalMessage();

Streaming returns events as text is generated. The content_block_delta events with text_delta deltas contain the streaming text. finalMessage() waits for completion and returns the complete response object.
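finalMessage() assembles the complete response for you, but if you process raw events yourself you need to accumulate deltas per content block. Text arrives as text_delta fragments and tool arguments as input_json_delta fragments carrying partial_json, each tagged with the block's index. The sketch below folds such events into per-index buffers; the event types are simplified local stand-ins (an assumption), not the SDK's full MessageStreamEvent union.

```typescript
// Simplified local stand-ins for the stream events handled here (assumption:
// the SDK's event union has more variants and fields).
type StreamDelta =
  | { type: "text_delta"; text: string }
  | { type: "input_json_delta"; partial_json: string };

type MinimalStreamEvent =
  | { type: "content_block_delta"; index: number; delta: StreamDelta }
  | { type: "message_stop" };

// Accumulated text and raw tool-input JSON, keyed by content block index.
interface StreamState {
  text: Map<number, string>;
  toolJson: Map<number, string>;
}

// Reducer: append each delta to the buffer for its block; ignore other events.
function applyStreamEvent(state: StreamState, event: MinimalStreamEvent): StreamState {
  if (event.type !== "content_block_delta") return state;
  const { index, delta } = event;
  if (delta.type === "text_delta") {
    state.text.set(index, (state.text.get(index) ?? "") + delta.text);
  } else {
    state.toolJson.set(index, (state.toolJson.get(index) ?? "") + delta.partial_json);
  }
  return state;
}

const streamState: StreamState = { text: new Map(), toolJson: new Map() };
const sampleEvents: MinimalStreamEvent[] = [
  { type: "content_block_delta", index: 0, delta: { type: "text_delta", text: "Types hold " } },
  { type: "content_block_delta", index: 0, delta: { type: "text_delta", text: "the line" } },
  { type: "content_block_delta", index: 1, delta: { type: "input_json_delta", partial_json: '{"location":' } },
  { type: "content_block_delta", index: 1, delta: { type: "input_json_delta", partial_json: '"Tokyo"}' } },
  { type: "message_stop" },
];

sampleEvents.reduce(applyStreamEvent, streamState);
console.log(streamState.text.get(0), streamState.toolJson.get(1));
```

The accumulated tool JSON is only guaranteed to parse once its content block has finished, so parse it after that block's content_block_stop event rather than on every delta.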

Tool Use (Function Calling)

Tool use in the Anthropic API requires handling a multi-turn pattern manually. The model returns a tool_use content block; you execute the tool and send the result back in a user message containing a tool_result block.

const tools = [
  {
    name: "get_weather",
    description: "Get the current weather for a location",
    input_schema: {
      type: "object" as const,
      properties: {
        location: { type: "string", description: "City name" },
      },
      required: ["location"],
    },
  },
];

// First request
const response = await client.messages.create({
  model: "claude-3-5-sonnet-20241022",
  max_tokens: 1024,
  tools,
  messages: [{ role: "user", content: "What is the weather in Tokyo?" }],
});

// Check if model wants to use a tool
if (response.stop_reason === "tool_use") {
  const toolUseBlock = response.content.find((b) => b.type === "tool_use");
  if (toolUseBlock && toolUseBlock.type === "tool_use") {
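    // executeWeatherTool is a hypothetical helper you implement yourself; the SDK types input as unknown
    const toolResult = await executeWeatherTool(toolUseBlock.input as { location: string });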

    // Second request with tool result
    const finalResponse = await client.messages.create({
      model: "claude-3-5-sonnet-20241022",
      max_tokens: 1024,
      tools,
      messages: [
        { role: "user", content: "What is the weather in Tokyo?" },
        { role: "assistant", content: response.content },
        {
          role: "user",
          content: [
            {
              type: "tool_result",
              tool_use_id: toolUseBlock.id,
              content: JSON.stringify(toolResult),
            },
          ],
        },
      ],
    });
  }
}
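The example above handles a single tool_use block, but the model can request several tools in one turn, and each tool_use id needs a matching tool_result in the next user message. A sketch of a dispatcher for that case, assuming a hypothetical registry of synchronous handlers keyed by tool name and simplified local block types:

```typescript
// Simplified local stand-ins for assistant content blocks and tool results
// (assumption: the SDK's real types have more fields).
type TextBlock = { type: "text"; text: string };
type ToolUseBlock = { type: "tool_use"; id: string; name: string; input: unknown };
type AssistantBlock = TextBlock | ToolUseBlock;

type ToolResultBlock = {
  type: "tool_result";
  tool_use_id: string;
  content: string;
  is_error?: boolean;
};

// Hypothetical handler signature for your own tool implementations.
type ToolHandler = (input: unknown) => unknown;

// Run every tool_use block from one assistant turn and build matching
// tool_result blocks; unknown tools and thrown errors are flagged with is_error.
function runTools(blocks: AssistantBlock[], handlers: Record<string, ToolHandler>): ToolResultBlock[] {
  const results: ToolResultBlock[] = [];
  for (const block of blocks) {
    if (block.type !== "tool_use") continue;
    // Own-property check so names like "toString" do not hit Object.prototype
    const handler = Object.prototype.hasOwnProperty.call(handlers, block.name)
      ? handlers[block.name]
      : undefined;
    if (handler === undefined) {
      results.push({ type: "tool_result", tool_use_id: block.id, content: `Unknown tool: ${block.name}`, is_error: true });
      continue;
    }
    try {
      const output = handler(block.input);
      results.push({ type: "tool_result", tool_use_id: block.id, content: JSON.stringify(output ?? null) });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      results.push({ type: "tool_result", tool_use_id: block.id, content: message, is_error: true });
    }
  }
  return results;
}

// One assistant turn requesting two tools, only one of which is registered
const assistantTurn: AssistantBlock[] = [
  { type: "text", text: "I'll look up the weather and local time." },
  { type: "tool_use", id: "toolu_a", name: "get_weather", input: { location: "Tokyo" } },
  { type: "tool_use", id: "toolu_b", name: "get_time", input: { location: "Tokyo" } },
];

const toolResults = runTools(assistantTurn, {
  get_weather: (input) => ({ location: (input as { location: string }).location, tempC: 21 }),
});
console.log(toolResults);
```

Send all returned blocks together as the content of the next user message, then repeat the request until stop_reason is no longer "tool_use". Setting is_error: true tells the model the call failed so it can adjust; real handlers are usually async, in which case you would await each one.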

Prompt Caching: 90% Discount on Repeated Content

Prompt caching is the most impactful cost optimization available on the Anthropic API. When you mark a content block with cache_control: { type: "ephemeral" }, Anthropic caches the prompt prefix up to and including that block for 5 minutes, and each cache hit refreshes that lifetime. Subsequent requests that reuse the identical prefix are charged 10% of the normal input token price for the cached tokens.

This pays off when a large, stable prefix (detailed instructions, tool definitions, or reference documentation loaded into the system prompt) is reused across many requests.

const response = await client.messages.create({
  model: "claude-3-5-sonnet-20241022",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      // longDocumentationString and userQuestion stand in for your own data
      text: "You are an expert coding assistant. Here is the full API documentation:\n\n" +
        longDocumentationString,  // potentially thousands of tokens
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [{ role: "user", content: userQuestion }],
});

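The first request writes the system prompt to the cache, and cache writes are billed at 125% of the base input price. Later requests that reuse the same prefix while the cache is alive are billed at 10% of the base price for those tokens. The prefix must be at least 1,024 tokens on Sonnet and Opus models (2,048 on Haiku) to be cached at all. For applications where the system prompt is large and requests are frequent, this can reduce input costs by 60-80%.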

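Cache activity appears in the response usage object: cache_creation_input_tokens counts tokens written to the cache, cache_read_input_tokens counts tokens served from it, and input_tokens counts only the uncached remainder.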
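To see where a 60-80% figure can come from, the sketch below prices input tokens from those three usage fields, assuming 5-minute cache writes billed at 1.25 times and cache reads at 0.1 times the base input price. The scenario is hypothetical: 10 requests that share a 10,000-token cached system prompt plus a 200-token question, at an assumed base price of $3 per million input tokens (check current pricing for your model).

```typescript
// The three usage fields returned on a Messages API response.
interface CacheUsage {
  input_tokens: number;                // uncached input tokens
  cache_creation_input_tokens: number; // tokens written to the cache
  cache_read_input_tokens: number;     // tokens served from the cache
}

// Multipliers relative to the base input price for the 5-minute cache.
const CACHE_WRITE_MULTIPLIER = 1.25;
const CACHE_READ_MULTIPLIER = 0.1;

// Input cost in dollars for one request, given a base price per million tokens.
function inputCostUSD(usage: CacheUsage, basePricePerMTok: number): number {
  const weightedTokens =
    usage.input_tokens +
    usage.cache_creation_input_tokens * CACHE_WRITE_MULTIPLIER +
    usage.cache_read_input_tokens * CACHE_READ_MULTIPLIER;
  return (weightedTokens * basePricePerMTok) / 1e6;
}

const BASE_PRICE_PER_MTOK = 3; // assumed price; check current pricing
const REQUESTS = 10;

// Without caching: every request pays full price for prompt plus question.
const uncachedTotal =
  REQUESTS *
  inputCostUSD({ input_tokens: 10200, cache_creation_input_tokens: 0, cache_read_input_tokens: 0 }, BASE_PRICE_PER_MTOK);

// With caching: the first request writes the cache, the other nine read it.
const cachedTotal =
  inputCostUSD({ input_tokens: 200, cache_creation_input_tokens: 10000, cache_read_input_tokens: 0 }, BASE_PRICE_PER_MTOK) +
  (REQUESTS - 1) *
    inputCostUSD({ input_tokens: 200, cache_creation_input_tokens: 0, cache_read_input_tokens: 10000 }, BASE_PRICE_PER_MTOK);

const savings = 1 - cachedTotal / uncachedTotal;
console.log(`uncached $${uncachedTotal.toFixed(4)}, cached $${cachedTotal.toFixed(4)}, saving ${(savings * 100).toFixed(1)}%`);
```

With these numbers the cached run costs about $0.07 in input tokens versus about $0.31 uncached, roughly 77% less. Output tokens are billed the same either way, so the saving on the total bill is smaller when responses are long.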

Token Counting

Count tokens before sending to avoid exceeding limits or to budget requests:

const { input_tokens } = await client.messages.countTokens({
  model: "claude-3-5-sonnet-20241022",
  system: "You are a helpful assistant.",
  messages: [{ role: "user", content: "Hello" }],
});

console.log(`This request will use ${input_tokens} input tokens`);
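A count is most useful when checked against a budget. A conservative rule of thumb is to require that the counted input tokens plus the requested max_tokens fit within the model's context window. The sketch assumes a 200,000-token window, which applies to the Claude 3 and 3.5 models in this guide; checkBudget is a hypothetical helper, not an SDK function.

```typescript
// Assumed context window for the Claude 3 and 3.5 models in this guide.
const CONTEXT_WINDOW_TOKENS = 200000;

interface BudgetCheck {
  fits: boolean;
  headroom: number; // tokens left after the input and the max_tokens reservation
}

// Conservative check: reserve the full max_tokens on top of the counted input.
function checkBudget(
  inputTokens: number,
  maxTokens: number,
  contextWindow: number = CONTEXT_WINDOW_TOKENS,
): BudgetCheck {
  const headroom = contextWindow - inputTokens - maxTokens;
  return { fits: headroom >= 0, headroom };
}

console.log(checkBudget(150000, 4096));
console.log(checkBudget(198000, 4096));
```

In practice you would pass the input_tokens value returned by countTokens, for example checkBudget(input_tokens, 1024), and trim or split the prompt when fits is false.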

Batch Messages API: 50% Discount

The Message Batches API processes requests asynchronously at a 50% price discount. Most batches finish in well under a day, and any request not processed within 24 hours expires. Use it for offline tasks where an immediate response is not required: document processing, bulk analysis, dataset generation.

const batch = await client.messages.batches.create({
  requests: documents.map((doc, i) => ({
    custom_id: `doc-${i}`,
    params: {
      model: "claude-3-5-sonnet-20241022",
      max_tokens: 1024,
      messages: [{ role: "user", content: `Summarize: ${doc}` }],
    },
  })),
});

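// Poll until processing has ended, sleeping only while work remains
let batchStatus = await client.messages.batches.retrieve(batch.id);
while (batchStatus.processing_status !== "ended") {
  await new Promise((r) => setTimeout(r, 30000));  // check every 30s
  batchStatus = await client.messages.batches.retrieve(batch.id);
}

// Read the results; they may arrive in a different order than the requests
for await (const entry of await client.messages.batches.results(batch.id)) {
  if (entry.result.type === "succeeded") {
    console.log(entry.custom_id, entry.result.message.content);
  } else {
    console.error(entry.custom_id, entry.result.type);  // errored, canceled, or expired
  }
}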
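Because results are not guaranteed to come back in submission order, match them through custom_id. The sketch below maps results back to document indexes using the doc-<index> scheme from the batch above and collects failures for resubmission. The result type is a simplified local stand-in (an assumption; the SDK's result objects carry the full Message and structured error details).

```typescript
// Simplified local stand-ins for batch result entries (assumption: the SDK's
// objects include the full Message and structured error details).
type TextBlock = { type: "text"; text: string };

type BatchResultEntry = {
  custom_id: string;
  result:
    | { type: "succeeded"; message: { content: TextBlock[] } }
    | { type: "errored" | "canceled" | "expired" };
};

interface BatchOutcome {
  summaries: Map<number, string>; // document index -> summary text
  failed: number[];               // document indexes to resubmit, ascending
}

// Map results back to documents via custom_id ("doc-<index>").
function collectBatchResults(entries: BatchResultEntry[]): BatchOutcome {
  const summaries = new Map<number, string>();
  const failed: number[] = [];
  for (const entry of entries) {
    const match = /^doc-(\d+)$/.exec(entry.custom_id);
    if (!match) continue; // not an id this code created
    const index = Number(match[1]);
    const result = entry.result;
    if (result.type === "succeeded") {
      summaries.set(index, result.message.content.map((block) => block.text).join(""));
    } else {
      failed.push(index);
    }
  }
  failed.sort((a, b) => a - b);
  return { summaries, failed };
}

const batchEntries: BatchResultEntry[] = [
  { custom_id: "doc-2", result: { type: "succeeded", message: { content: [{ type: "text", text: "Summary of doc 2" }] } } },
  { custom_id: "doc-0", result: { type: "expired" } },
  { custom_id: "doc-1", result: { type: "succeeded", message: { content: [{ type: "text", text: "Summary of doc 1" }] } } },
];

const outcome = collectBatchResults(batchEntries);
console.log(outcome.summaries.get(1), outcome.failed);
```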

Error Handling

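import Anthropic from "@anthropic-ai/sdk";

// The SDK already retries connection errors, 408, 409, 429, and 5xx responses
// (including 529 overloaded) with exponential backoff; maxRetries defaults to 2.
const client = new Anthropic({ maxRetries: 3 });

try {
  const response = await client.messages.create({ /* ... */ });
} catch (error) {
  if (error instanceof Anthropic.APIError) {
    switch (error.status) {
      case 400:
        // Validation error - malformed request or invalid parameters; retrying will not help
        console.error("Bad request:", error.message);
        break;
      case 401:
        // Authentication error - check the API key
        console.error("Authentication failed:", error.message);
        break;
      case 429:
        // Still rate limited after the SDK's retries - queue the work or reduce concurrency
        console.warn("Rate limited:", error.message);
        break;
      case 529:
        // API overloaded even after retries - try again later
        console.warn("API overloaded:", error.message);
        break;
      default:
        // Other errors, including 500 server errors that persisted through retries
        console.error(`API error ${error.status}:`, error.message);
    }
  } else {
    throw error;  // not an error raised by the SDK
  }
}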
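If you disable the SDK's built-in retries (maxRetries: 0) or wrap calls in your own retry loop, the delay should grow exponentially and be randomized so many clients do not retry in lockstep. A sketch of exponential backoff with full jitter that prefers a numeric retry-after header when one is present; backoffDelayMs and its default base and cap are assumptions, not SDK values.

```typescript
// Delay before retry number `attempt` (0 = first retry), in milliseconds.
// A numeric retry-after header (seconds) wins; otherwise use exponential
// backoff with full jitter, capped at maxMs.
function backoffDelayMs(
  attempt: number,
  retryAfterHeader?: string | null,
  baseMs: number = 1000,
  maxMs: number = 60000,
  random: () => number = Math.random, // injectable so tests are deterministic
): number {
  if (retryAfterHeader) {
    const seconds = Number(retryAfterHeader);
    if (Number.isFinite(seconds) && seconds >= 0) return seconds * 1000;
  }
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

console.log(backoffDelayMs(2), backoffDelayMs(0, "7"));
```

retry-after can also be sent as an HTTP date; this sketch ignores that form and falls back to the computed backoff.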

Cost Optimization Strategy

Use Haiku for high-volume simple tasks (classification, short Q&A, extraction)
Use Sonnet for complex tasks (coding, analysis, long-form generation)
Enable prompt caching when a stable prompt prefix is at least 1,024 tokens (2,048 on Haiku) and requests reuse it within the cache lifetime
Use the Batch API for non-time-sensitive bulk processing
Set max_tokens as tightly as practical to avoid paying for tokens you do not need
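The model-choice guidelines and the max_tokens advice above can be encoded as a small routing table so each call site does not pick a model ad hoc. The task categories and max_tokens caps below are hypothetical placeholders; the model identifiers are the ones listed in the model table in this guide.

```typescript
// Hypothetical task categories; adapt them to your own workload.
type TaskKind = "classification" | "short_qa" | "extraction" | "coding" | "analysis" | "long_form";

interface RequestPlan {
  model: string;
  max_tokens: number;
}

// Model identifiers from the model table in this guide.
const HAIKU_MODEL = "claude-3-haiku-20240307";
const SONNET_MODEL = "claude-3-5-sonnet-20241022";

// Illustrative output caps; tune them against real outputs from your prompts.
const PLANS: Record<TaskKind, RequestPlan> = {
  classification: { model: HAIKU_MODEL, max_tokens: 16 },
  short_qa: { model: HAIKU_MODEL, max_tokens: 256 },
  extraction: { model: HAIKU_MODEL, max_tokens: 512 },
  coding: { model: SONNET_MODEL, max_tokens: 4096 },
  analysis: { model: SONNET_MODEL, max_tokens: 2048 },
  long_form: { model: SONNET_MODEL, max_tokens: 4096 },
};

function planRequest(task: TaskKind): RequestPlan {
  // Return a copy so callers can adjust it without mutating the shared table
  return { ...PLANS[task] };
}

console.log(planRequest("extraction"));
```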

Keep Reading

Vercel AI SDK Guide - higher-level wrapper that simplifies Anthropic API integration
Cutting LLM API Costs - full cost optimization guide including caching and batching
Claude 3.5 Sonnet Review - when to use Claude vs other models

Pristren builds AI-powered software for teams. Zlyqor is our all-in-one workspace - chat, projects, time tracking, AI meeting summaries, and invoicing - in one tool. Try it free.

Model Names

| Model | Identifier | Use case |
|---|---|---|
| Claude 3.5 Sonnet | `claude-3-5-sonnet-20241022` | Complex tasks, coding, analysis |
| Claude 3 Haiku | `claude-3-haiku-20240307` | Fast, cheap, simple tasks |
| Claude 3 Opus | `claude-3-opus-20240229` | Maximum capability (slower, more expensive) |