Together AI: Run 200+ Open Models via OpenAI-Compatible API

Together AI provides serverless inference for 200+ open-source models including Llama 3.1 405B at $3.50/1M tokens, with fine-tuning, batch jobs, and an OpenAI-compatible SDK.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 15, 2026

7 min read



Getting Started

pip install together

import os

from together import Together

# Read the key from the environment instead of hard-coding it in source
client = Together(api_key=os.environ["TOGETHER_API_KEY"])

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the CAP theorem in simple terms."}
    ],
    max_tokens=1024,
    temperature=0.7,
)
print(response.choices[0].message.content)

OpenAI SDK Drop-In Replacement

Together AI's API is OpenAI-compatible, so the OpenAI Python SDK works unchanged: point base_url at Together AI and pass your Together API key.

from openai import OpenAI

client = OpenAI(
    api_key="your-together-api-key",
    base_url="https://api.together.xyz/v1"
)

# Now use any Together model with the familiar OpenAI interface
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-72B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Translate to French: Hello, how are you?"}]
)
print(response.choices[0].message.content)

For teams migrating from OpenAI, this keeps code changes minimal: swap the base URL, the API key, and the model names, and the rest of the calling code stays the same.
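Since only the endpoint, key, and model name differ between providers, one way to keep a codebase provider-agnostic is to hold those three values in a small settings table. The sketch below assumes the conventional `OPENAI_API_KEY` and `TOGETHER_API_KEY` environment variables; the `PROVIDERS` table and the `client_settings` helper are hypothetical names, and the model IDs are only examples.

```python
import os

# Hypothetical provider table: only these three values change between providers.
PROVIDERS = {
    "openai": {
        "base_url": "https://api.openai.com/v1",
        "key_env": "OPENAI_API_KEY",
        "model": "gpt-4o-mini",
    },
    "together": {
        "base_url": "https://api.together.xyz/v1",
        "key_env": "TOGETHER_API_KEY",
        "model": "Qwen/Qwen2.5-72B-Instruct-Turbo",
    },
}


def client_settings(provider: str) -> dict:
    """Return the base URL, API key, and default model for one provider."""
    cfg = PROVIDERS[provider]
    api_key = os.environ.get(cfg["key_env"])
    if not api_key:
        # Fail early with a clear message instead of sending an unauthenticated request
        raise RuntimeError(f"Set {cfg['key_env']} before using {provider!r}")
    return {"base_url": cfg["base_url"], "api_key": api_key, "model": cfg["model"]}
```

With the `openai` package installed, `s = client_settings("together")` followed by `OpenAI(api_key=s["api_key"], base_url=s["base_url"])` yields a client whose `chat.completions.create(model=s["model"], ...)` calls go to Together AI; switching providers becomes a one-word change.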

Fine-Tuning API

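from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

# Upload training data (JSONL, one example per line); the SDK takes a file path
file_response = client.files.upload(file="training_data.jsonl")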

# Start fine-tuning job
ft_job = client.fine_tuning.create(
    training_file=file_response.id,
    model="meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",
    n_epochs=3,
    learning_rate=1e-5,
)
print(f"Fine-tuning job: {ft_job.id}")

For chat models, training data uses the same conversational JSONL format as OpenAI: one JSON object per line, each holding a messages array. Fine-tuned models are private to your account and available once training completes.
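A minimal sketch of producing such a file, assuming each training example starts out as a list of (role, content) turns. The `write_chat_jsonl` helper and the sample conversation are illustrative and not part of the Together SDK; the helper writes one `{"messages": [...]}` object per line and rejects roles other than `system`, `user`, and `assistant`.

```python
import json

VALID_ROLES = {"system", "user", "assistant"}


def write_chat_jsonl(path, conversations):
    """Write one {"messages": [...]} object per line; return the number of lines written."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for turns in conversations:
            messages = []
            for role, content in turns:
                if role not in VALID_ROLES:
                    raise ValueError(f"unexpected role: {role!r}")
                messages.append({"role": role, "content": content})
            # One complete JSON object per line, as JSONL requires
            f.write(json.dumps({"messages": messages}, ensure_ascii=False) + "\n")
            count += 1
    return count


# Hypothetical sample data: a single three-turn conversation
examples = [
    [
        ("system", "You are a helpful assistant."),
        ("user", "What is the capital of France?"),
        ("assistant", "Paris."),
    ],
]
```

Calling `write_chat_jsonl("training_data.jsonl", examples)` produces a file in the shape the upload step expects; validating roles locally catches malformed rows before a training job is queued.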

Batch Inference

For offline workloads (nightly processing, dataset annotation, bulk translation), batch jobs are cheaper than real-time requests and don't count against rate limits:

# uploaded_file: the result of uploading your JSONL request file (not defined in this snippet)
batch = client.batches.create(
    input_file_id=uploaded_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(f"Batch {batch.id} queued")
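Before calling `batches.create`, the request file has to be built and uploaded. A minimal sketch, assuming each line carries a unique `custom_id` plus a request `body`, as in Together AI's batch documentation (OpenAI's own Batch API additionally expects `method` and `url` fields per line, so check the exact fields for the endpoint you target). The helper name, prompts, and `max_tokens` value are hypothetical.

```python
import json


def build_batch_requests(prompts, model, max_tokens=256):
    """Return JSONL text with one chat-completion request per prompt."""
    lines = []
    for i, prompt in enumerate(prompts):
        request = {
            # custom_id lets you match each result back to its input
            "custom_id": f"request-{i}",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": max_tokens,
            },
        }
        lines.append(json.dumps(request))
    return "\n".join(lines) + "\n"


jsonl = build_batch_requests(
    ["Translate to French: Good morning.", "Translate to French: Thank you."],
    model="Qwen/Qwen2.5-72B-Instruct-Turbo",
)
```

Write `jsonl` to a file, upload it, and pass the resulting file ID as `input_file_id`. Results come back keyed by `custom_id`, so results can be joined to inputs even if they return out of order.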

Pricing Comparison

Model	Together AI	Groq	Fireworks
Llama 3.1 405B	$3.50/1M	N/A	$3.00/1M
Llama 3.1 70B	$0.88/1M	$0.59/1M	$0.90/1M
Llama 3.1 8B	$0.18/1M	$0.05/1M	$0.20/1M
Qwen 2.5 72B	$1.20/1M	N/A	$0.90/1M

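By these list prices, Groq is the cheapest of the three for Llama 3.1 8B and 70B, while Fireworks undercuts Together AI on Llama 3.1 405B and Qwen 2.5 72B. Groq does not list those two larger models, so for them the choice narrows to Together AI or Fireworks.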
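To see what the per-token gaps mean in dollars, multiply the token count by the per-million price. A minimal sketch using the Llama 3.1 8B list prices from the table and an assumed volume of 500M tokens per month; it treats each price as a single blended rate, which is a simplification if a provider bills input and output tokens differently.

```python
# Llama 3.1 8B list prices from the table, in USD per 1M tokens
PRICE_PER_M = {"Together AI": 0.18, "Groq": 0.05, "Fireworks": 0.20}


def monthly_cost(tokens: int, price_per_million: float) -> float:
    """Cost in USD for `tokens` tokens at a flat per-million-token rate."""
    return tokens / 1_000_000 * price_per_million


TOKENS_PER_MONTH = 500_000_000  # assumed workload, not a measured figure

# Print providers from cheapest to most expensive
for provider, price in sorted(PRICE_PER_M.items(), key=lambda kv: kv[1]):
    print(f"{provider:12s} ${monthly_cost(TOKENS_PER_MONTH, price):,.2f}")
```

At that volume the table implies about $25 (Groq), $90 (Together AI), and $100 (Fireworks) per month, so the price gap only becomes material at high, sustained throughput.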

FlashAttention and Performance

Together AI's infrastructure uses FlashAttention and continuous batching by default, so you get optimized throughput without any configuration. Together AI's full model list shows the available models, their context sizes, and pricing.

Summary

Together AI is the most complete open-model inference platform: 200+ models, OpenAI compatibility, fine-tuning, batch processing, and dedicated endpoints. For teams building on open-source models, it eliminates the need to manage GPU infrastructure. Start at together.ai and explore the full API at docs.together.ai.
