Why Together AI for DeepSeek R1?
DeepSeek offers its own API for R1 at $0.55/1M input tokens — cheaper than Together AI's $3/1M. But there are practical reasons to prefer Together:
- Reliability: DeepSeek's API has experienced outages and rate limiting during peak demand. Together AI runs on dedicated US-based infrastructure with SLA guarantees.
- Latency: Together AI has lower time to first token (TTFT) for US-based users, since requests avoid cross-Pacific routing.
- Compliance: data processed via Together AI stays in the US, which is relevant for regulated industries.
- No geoblocking: some enterprise networks block Chinese IP ranges.
For teams where the gap between $3 and $0.55 per million input tokens is a meaningful budget concern, calling DeepSeek's API directly or self-hosting R1 makes sense. For most product teams, the reliability premium is worth it.
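To make the price gap concrete, the sketch below estimates monthly input-token spend at the two rates quoted above. The 200M-token monthly volume is an illustrative assumption, not a measurement, and output-token pricing is ignored.

```python
# Per-1M input-token prices quoted in this article (USD)
PRICE_PER_M = {"together": 3.00, "deepseek": 0.55}


def monthly_input_cost(tokens: int, price_per_million: float) -> float:
    """Return the USD cost of `tokens` input tokens at the given rate."""
    return tokens / 1_000_000 * price_per_million


# Hypothetical workload: 200M input tokens per month (an assumption)
tokens = 200_000_000
together_cost = monthly_input_cost(tokens, PRICE_PER_M["together"])
deepseek_cost = monthly_input_cost(tokens, PRICE_PER_M["deepseek"])

print(f"Together AI: ${together_cost:,.2f}")
print(f"DeepSeek:    ${deepseek_cost:,.2f}")
print(f"Premium:     ${together_cost - deepseek_cost:,.2f}")
```

At that assumed volume the input bill is about $600 on Together AI versus about $110 on DeepSeek, so the premium is roughly $490 per month. Substitute your own token counts to see where the difference starts to matter.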
Setting Up Together AI
import os

from together import Together

# Reads the API key from the TOGETHER_API_KEY environment variable
client = Together(api_key=os.environ["TOGETHER_API_KEY"])

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[
        {
            "role": "user",
            "content": "Solve this step by step: If 5 workers can build a wall in 8 days, how many days would 10 workers take to build 3 such walls?",
        }
    ],
    stream=True,  # yield tokens as they are generated
)

# Print each streamed token as it arrives; skip chunks that carry no text
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
Streaming Thinking Tokens
DeepSeek R1's "thinking" process, where it reasons through a problem before giving a final answer, can be streamed in real time. The code below assumes the reasoning arrives inline in the content stream, wrapped in <think>...</think> tags. This lets you show users a "thinking..." indicator while the model works, then display the final answer:
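thinking_buffer = []
answer_buffer = []
in_thinking = False

stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue
    content = chunk.choices[0].delta.content or ""
    # Assumes each tag arrives whole within a single chunk
    if "<think>" in content:
        in_thinking = True
        # Keep any reasoning text that follows the opening tag
        content = content.split("<think>", 1)[1]
    if "</think>" in content:
        # Text before the closing tag is reasoning; text after it is answer
        thinking, content = content.split("</think>", 1)
        thinking_buffer.append(thinking)
        print(f"[thinking] {thinking}", end="", flush=True)
        in_thinking = False
    if in_thinking:
        thinking_buffer.append(content)
        print(f"[thinking] {content}", end="", flush=True)
    else:
        answer_buffer.append(content)
        print(content, end="", flush=True)

print("\n\nFinal answer:", "".join(answer_buffer))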
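The loop above checks each chunk for a complete tag, so a tag that happens to be split across two chunks (for example `<thi` followed by `nk>`) would be missed. The following is a minimal sketch of a splitter that tolerates such splits by holding back any trailing text that could be the start of a tag. The names `split_stream` and `_partial_tag_len` are hypothetical helpers introduced here, not part of the Together SDK, and the sample chunks are made up for illustration.

```python
from typing import Iterable, Iterator

OPEN_TAG, CLOSE_TAG = "<think>", "</think>"


def _partial_tag_len(text: str, tag: str) -> int:
    """Length of the longest suffix of `text` that is a proper prefix of `tag`."""
    for n in range(min(len(text), len(tag) - 1), 0, -1):
        if text.endswith(tag[:n]):
            return n
    return 0


def split_stream(chunks: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield ("thinking" | "answer", text) pieces from streamed content."""
    buffer = ""
    in_thinking = False
    for chunk in chunks:
        buffer += chunk
        while True:
            tag = CLOSE_TAG if in_thinking else OPEN_TAG
            kind = "thinking" if in_thinking else "answer"
            before, found, after = buffer.partition(tag)
            if found:
                if before:
                    yield kind, before
                buffer = after
                in_thinking = not in_thinking
                continue
            # No complete tag: emit everything except a possible partial tag
            keep = _partial_tag_len(buffer, tag)
            ready = buffer[: len(buffer) - keep]
            if ready:
                yield kind, ready
            buffer = buffer[len(buffer) - keep:]
            break
    if buffer:
        # Flush whatever is left once the stream ends
        yield ("thinking" if in_thinking else "answer"), buffer


# Made-up chunks with both tags split across chunk boundaries
pieces = list(split_stream(["<thi", "nk>step 1", " step 2</th", "ink>The answer", " is 4."]))
thinking = "".join(text for kind, text in pieces if kind == "thinking")
answer = "".join(text for kind, text in pieces if kind == "answer")
```

To use it with a live stream, pass a generator that yields `chunk.choices[0].delta.content or ""` for each chunk, and route each `("thinking", text)` or `("answer", text)` piece to the appropriate part of your UI.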
Parallel Requests for Batch Processing
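Together AI's serverless endpoints accept concurrent requests, so batch reasoning tasks can be fanned out with the async client rather than run one at a time. Account-level rate limits still apply, so keep the batch size within your tier's limits: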
import asyncio
import os

from together import AsyncTogether


async def reason(client: AsyncTogether, problem: str) -> str:
    # Send one problem to R1 and return the full (non-streamed) reply
    response = await client.chat.completions.create(
        model="deepseek-ai/DeepSeek-R1",
        messages=[{"role": "user", "content": problem}],
    )
    return response.choices[0].message.content


async def batch_reason(problems: list[str]) -> list[str]:
    client = AsyncTogether(api_key=os.environ["TOGETHER_API_KEY"])
    # gather() runs the requests concurrently and preserves input order
    tasks = [reason(client, p) for p in problems]
    return await asyncio.gather(*tasks)


results = asyncio.run(batch_reason(["Problem 1...", "Problem 2...", "Problem 3..."]))
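`asyncio.gather` as used above starts every request at once, which can be too aggressive for large batches. One common way to cap the number of in-flight requests is an `asyncio.Semaphore`. The sketch below is a generic helper under that assumption. `gather_bounded` is a hypothetical name, the default limit of 8 is arbitrary, and `fake_reason` is a stand-in that makes no network call.

```python
from __future__ import annotations

import asyncio
from typing import Awaitable, Callable


async def gather_bounded(
    worker: Callable[[str], Awaitable[str]],
    items: list[str],
    max_concurrency: int = 8,
) -> list[str | BaseException]:
    """Run `worker` over `items` with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(item: str) -> str:
        # Wait for a free slot before starting the request
        async with semaphore:
            return await worker(item)

    # return_exceptions=True keeps one failed request from discarding the rest;
    # failures come back as exception objects in the matching position
    return await asyncio.gather(
        *(run_one(item) for item in items), return_exceptions=True
    )


async def fake_reason(problem: str) -> str:
    # Stand-in for a real API call; sleeps briefly instead of hitting the network
    await asyncio.sleep(0.01)
    return problem.upper()


bounded_results = asyncio.run(
    gather_bounded(fake_reason, ["a", "b", "c"], max_concurrency=2)
)
```

With the `reason` coroutine from this section, the worker could be `functools.partial(reason, client)`. Results keep the order of the input list, so a failed item can be retried by index.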
Distilled R1 Variants for Lower Cost
Together AI also hosts the distilled R1 variants, which are much smaller but retain much of the reasoning quality:
| Model | Size | Cost (input) | vs R1 671B |
|---|---|---|---|
| DeepSeek-R1 | 671B | $3.00/1M | Baseline |
| DeepSeek-R1-Distill-Llama-70B | 70B | $0.88/1M | ~90% quality |
| DeepSeek-R1-Distill-Qwen-32B | 32B | $0.27/1M | ~85% quality |
| DeepSeek-R1-Distill-Qwen-7B | 7B | $0.20/1M | ~75% quality |
For most math and coding reasoning tasks, the 70B distill performs within 10% of the full 671B model at less than one-third the price. Start with the 70B distill and only escalate to the full model for problems that require it.
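The "start small, escalate when needed" approach can be sketched as a small routing function. This is one possible shape, not a prescribed method. The model identifiers are assumed to follow the names in the table above (check the model catalog for the exact strings), and `ask` and `is_acceptable` are hypothetical callables you would supply: `ask` wraps your chat-completion call, and `is_acceptable` encodes whatever check fits your task, such as a unit test passing or an answer parsing correctly.

```python
from typing import Callable

# Assumed identifiers, based on the model names in the table above
DISTILL_MODEL = "deepseek-ai/DeepSeek-R1-Distill-Llama-70B"
FULL_MODEL = "deepseek-ai/DeepSeek-R1"


def solve_with_escalation(
    problem: str,
    ask: Callable[[str, str], str],
    is_acceptable: Callable[[str], bool],
) -> tuple[str, str]:
    """Try the distilled model first; fall back to the full model if needed.

    Returns (model_used, answer).
    """
    answer = ask(DISTILL_MODEL, problem)
    if is_acceptable(answer):
        return DISTILL_MODEL, answer
    # The cheap attempt failed the check, so pay for the full model
    return FULL_MODEL, ask(FULL_MODEL, problem)


def fake_ask(model: str, problem: str) -> str:
    # Stand-in for a real API call: the distill "fails" by returning nothing
    return "" if model == DISTILL_MODEL else "42"


model_used, answer = solve_with_escalation(
    "What is 6 * 7?", fake_ask, lambda a: bool(a.strip())
)
```

The design trades a second request on hard problems for cheaper calls on easy ones, so it pays off only when most inputs pass the check on the first attempt.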