The Open-Model Inference Cloud
Together AI serves 200+ open-source models, including Llama, Mistral, Qwen, Gemma, and DeepSeek, behind a single OpenAI-compatible API. Instead of managing GPU infrastructure or maintaining a separate integration per model, you switch models by changing the model name in the request.
This is particularly useful for teams that want to:
- Compare multiple models on the same task without re-engineering
- Access large models (405B+) that require multi-node GPU clusters
- Fine-tune open models without managing training infrastructure
- Run batch inference jobs at lower cost than real-time endpoints
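To make the first point concrete, here is a minimal sketch of a comparison loop. It assumes a caller-supplied helper, `complete(model, prompt)`, which is hypothetical and not part of any SDK; the model IDs are the ones used later in this article.

```python
from typing import Callable, Dict, List

# Model IDs taken from the examples later in this article.
CANDIDATE_MODELS: List[str] = [
    "meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",
    "Qwen/Qwen2.5-72B-Instruct-Turbo",
]


def compare_models(
    complete: Callable[[str, str], str],
    models: List[str],
    prompt: str,
) -> Dict[str, str]:
    """Send the same prompt to each model and collect the replies.

    `complete(model, prompt)` is a hypothetical helper supplied by the caller;
    only the model name changes between calls.
    """
    results: Dict[str, str] = {}
    for model in models:
        try:
            results[model] = complete(model, prompt)
        except Exception as exc:  # keep going if one model is unavailable
            results[model] = f"<error: {exc}>"
    return results
```

In practice `complete` would wrap a chat completion call and return the text of the first choice; passing it in as a function keeps the loop independent of any particular client.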
Serverless vs Dedicated Endpoints
Serverless: Pay per token, no provisioning. Cold starts possible on less popular models. Best for development and variable workloads.
Dedicated: Reserve GPU capacity for consistent latency. Required for SLA-sensitive production. Priced per GPU-hour.
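One rough way to choose between the two is a break-even estimate: the hourly token volume at which per-token billing costs as much as a reserved GPU. The sketch below uses hypothetical prices (not Together AI quotes) and ignores throughput limits and multi-GPU models, so treat the result as a first approximation only.

```python
def breakeven_tokens_per_hour(gpu_hour_usd: float, usd_per_million_tokens: float) -> float:
    """Tokens per hour at which a dedicated GPU costs the same as serverless."""
    if gpu_hour_usd <= 0 or usd_per_million_tokens <= 0:
        raise ValueError("prices must be positive")
    return gpu_hour_usd / usd_per_million_tokens * 1_000_000


# Hypothetical prices, for illustration only.
GPU_HOUR_USD = 4.00          # assumed dedicated price per GPU-hour
SERVERLESS_USD_PER_M = 0.88  # assumed serverless price per 1M tokens

threshold = breakeven_tokens_per_hour(GPU_HOUR_USD, SERVERLESS_USD_PER_M)
print(f"Break-even: about {threshold:,.0f} tokens per hour")
```

Below that sustained volume, serverless is the cheaper option on price alone; above it, a dedicated endpoint starts to pay for itself, provided a single GPU can actually serve that throughput.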
Getting Started
pip install together
from together import Together

client = Together(api_key="your-together-api-key")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the CAP theorem in simple terms."},
    ],
    max_tokens=1024,
    temperature=0.7,
)
print(response.choices[0].message.content)
OpenAI SDK Drop-In Replacement
Together AI is compatible with the OpenAI Python SDK. Point the client at Together by changing base_url and api_key:
from openai import OpenAI

client = OpenAI(
    api_key="your-together-api-key",
    base_url="https://api.together.xyz/v1",
)

# Now use any Together model with the familiar OpenAI interface
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-72B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Translate to French: Hello, how are you?"}],
)
For teams migrating from OpenAI, the request and response handling code stays the same. The only changes are the base URL, the API key, and the model names.
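Because the difference comes down to configuration, one option is to keep the provider settings in a small lookup table. The following is a sketch, not an official pattern: the environment variable names are assumptions, and `client_kwargs` is a hypothetical helper. The Together base URL is the one shown above.

```python
import os
from typing import Dict

# Together's base URL is the one used above; OpenAI's is its public default.
# The environment variable names are assumptions for this sketch.
PROVIDERS: Dict[str, Dict[str, str]] = {
    "together": {
        "base_url": "https://api.together.xyz/v1",
        "api_key_env": "TOGETHER_API_KEY",
    },
    "openai": {
        "base_url": "https://api.openai.com/v1",
        "api_key_env": "OPENAI_API_KEY",
    },
}


def client_kwargs(provider: str) -> Dict[str, str]:
    """Return keyword arguments for `OpenAI(**kwargs)` for the given provider."""
    try:
        cfg = PROVIDERS[provider]
    except KeyError:
        raise ValueError(f"unknown provider: {provider!r}") from None
    api_key = os.environ.get(cfg["api_key_env"])
    if not api_key:
        raise RuntimeError(f"set {cfg['api_key_env']} before creating the client")
    return {"api_key": api_key, "base_url": cfg["base_url"]}
```

With this in place, `OpenAI(**client_kwargs("together"))` builds a Together-backed client, and switching back to OpenAI is a one-word change. Reading keys from the environment also keeps credentials out of source code.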
Fine-Tuning API
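from together import Together

# Fine-tuning uses the Together SDK client, not the OpenAI client shown above
client = Together(api_key="your-together-api-key")

# Upload training data (files.upload takes a file path rather than an open file handle)
file_response = client.files.upload(file="training_data.jsonl")

# Start fine-tuning job
ft_job = client.fine_tuning.create(
    training_file=file_response.id,
    model="meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",
    n_epochs=3,
    learning_rate=1e-5,
)
print(f"Fine-tuning job: {ft_job.id}")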
The training data in this example is JSONL in OpenAI's chat format: one JSON object per line, each containing a messages array. Fine-tuned models are private and available immediately after training completes.
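As a sketch of what producing such a file might look like, the helper below writes one JSON object with a `messages` array per line and does some basic validation first. The function names and the set of allowed roles are illustrative assumptions, not part of the Together SDK.

```python
import json
from typing import Dict, List

Message = Dict[str, str]
ALLOWED_ROLES = {"system", "user", "assistant"}  # assumed role set for this sketch


def to_jsonl_line(messages: List[Message]) -> str:
    """Serialize one training example as a single JSONL line."""
    if not messages:
        raise ValueError("an example needs at least one message")
    for msg in messages:
        if msg.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"unexpected role: {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str):
            raise ValueError("each message needs string content")
    return json.dumps({"messages": messages}, ensure_ascii=False)


def write_training_file(path: str, examples: List[List[Message]]) -> int:
    """Write examples to `path`, one JSON object per line; return the count."""
    with open(path, "w", encoding="utf-8") as f:
        for messages in examples:
            f.write(to_jsonl_line(messages) + "\n")
    return len(examples)
```

Validating locally before uploading catches malformed examples early, which is cheaper than discovering them after a training job has been queued.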
Batch Inference
For offline workloads (nightly processing, dataset annotation, bulk translation), batch jobs are cheaper than real-time and don't count against rate limits:
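# uploaded_file is assumed to be the result of an earlier upload of the JSONL batch input (not shown)
batch = client.batches.create(
    input_file_id=uploaded_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)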
print(f"Batch {batch.id} queued")
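The input file for a batch job is itself a JSONL file with one request per line. The sketch below builds those lines using the field names from the OpenAI Batch API input format (`custom_id`, `method`, `url`, `body`). That Together AI accepts exactly this shape is an assumption here, so check the batch documentation before relying on it. `build_batch_lines` is a hypothetical helper.

```python
import json
from typing import Dict, Iterable, List


def build_batch_lines(model: str, prompts: Iterable[str]) -> List[str]:
    """Build one JSONL request line per prompt.

    Field names follow the OpenAI Batch API input format (an assumption here);
    `custom_id` lets each result be matched back to its request.
    """
    lines: List[str] = []
    for i, prompt in enumerate(prompts):
        request: Dict[str, object] = {
            "custom_id": f"request-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        lines.append(json.dumps(request))
    return lines
```

Writing `"\n".join(lines)` to a file gives the input to upload. Since batch results may not come back in submission order, a stable `custom_id` per request is what makes it possible to join outputs back to inputs.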
Pricing Comparison
| Model | Together AI (per 1M tokens) | Groq (per 1M tokens) | Fireworks (per 1M tokens) |
|-------|-----------------------------|----------------------|---------------------------|
| Llama 3.1 405B | $3.50 | N/A | $3.00 |
| Llama 3.1 70B | $0.88 | $0.59 | $0.90 |
| Llama 3.1 8B | $0.18 | $0.05 | $0.20 |
| Qwen 2.5 72B | $1.20 | N/A | $0.90 |
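On these numbers, Groq is the cheapest option for Llama 3.1 8B and 70B. Groq lists neither the 405B model nor Qwen 2.5 72B, so for those the choice is between Together AI and Fireworks. Fireworks lists the lower price for both, so Together AI's case there rests on availability rather than price.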
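Per-million prices are easier to compare once they are multiplied out to a monthly volume. Here is a small sketch using the prices from the table above; the function names are illustrative, and the prices should be refreshed from each provider before making a decision.

```python
from typing import Dict, Optional

# USD per 1M tokens, copied from the table above; None means not offered.
PRICES: Dict[str, Dict[str, Optional[float]]] = {
    "Llama 3.1 405B": {"Together AI": 3.50, "Groq": None, "Fireworks": 3.00},
    "Llama 3.1 70B": {"Together AI": 0.88, "Groq": 0.59, "Fireworks": 0.90},
    "Llama 3.1 8B": {"Together AI": 0.18, "Groq": 0.05, "Fireworks": 0.20},
    "Qwen 2.5 72B": {"Together AI": 1.20, "Groq": None, "Fireworks": 0.90},
}


def monthly_cost(model: str, tokens_per_month: int) -> Dict[str, float]:
    """Return the USD cost per provider for a monthly token volume."""
    millions = tokens_per_month / 1_000_000
    return {
        provider: round(price * millions, 2)
        for provider, price in PRICES[model].items()
        if price is not None
    }


def cheapest(model: str) -> str:
    """Return the provider with the lowest listed price for `model`."""
    offered = {p: v for p, v in PRICES[model].items() if v is not None}
    return min(offered, key=lambda p: offered[p])
```

For example, `monthly_cost("Llama 3.1 70B", 500_000_000)` gives the cost of 500M tokens per month at each provider that offers the model. Price is only one input; latency, rate limits, and availability are not captured here.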
FlashAttention and Performance
Together AI's infrastructure uses FlashAttention and continuous batching by default, so you get optimized throughput without any configuration on your side. The full model list shows available models, context sizes, and pricing.
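To give some intuition for why continuous batching helps, here is a toy simulation. It is a simplified model for illustration, not Together AI's actual scheduler: each request needs a fixed, positive number of decode steps, and the server has a fixed number of slots.

```python
from typing import List


def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Decode steps when each batch runs until its longest request finishes."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        total += max(lengths[start:start + batch_size])
    return total


def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Decode steps when a freed slot is refilled from the queue immediately."""
    queue = list(lengths)
    active: List[int] = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or active:
        # Fill any free slots before the next decode step.
        while queue and len(active) < batch_size:
            active.append(queue.pop(0))
        steps += 1
        # Each in-flight request emits one token; finished ones leave.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps
```

With one long request and several short ones, for instance output lengths `[8, 2, 2, 2]` and two slots, the static version holds a slot idle while the long request finishes, whereas the continuous version hands that slot to the next queued request. When all requests have the same length, the two approaches take the same number of steps.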
Summary
Together AI is one of the more complete open-model inference platforms, combining 200+ models, OpenAI compatibility, fine-tuning, batch processing, and dedicated endpoints. For teams building on open-source models, it removes the need to manage GPU infrastructure. Start at together.ai and explore the full API at docs.together.ai.