Running DeepSeek R1 via Together AI: Fastest Hosted Reasoning Model API

Together AI hosts DeepSeek R1 671B on serverless infrastructure with streaming thinking tokens, an OpenAI-compatible SDK, and sub-5-second time to first token (TTFT), at $3 per 1M input tokens with no cold starts.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 9, 2026

8 min read

// tags

#deepseek-r1 #together-ai #reasoning #api #cost


Streaming Thinking Tokens

DeepSeek R1 reasons through a problem before giving its final answer, and this "thinking" output can be streamed in real time. The example below assumes the reasoning arrives inline in the content stream, wrapped in <think>...</think> tags, so you can show users a "thinking..." indicator while the model works and then display the final answer:

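import os

from together import Together

# Reads the API key from the TOGETHER_API_KEY environment variable.
client = Together(api_key=os.environ["TOGETHER_API_KEY"])

thinking_buffer = []
answer_buffer = []
in_thinking = False

stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
    stream=True,
)

for chunk in stream:
    # Skip chunks that carry no choices (for example, a trailing usage chunk).
    if not chunk.choices:
        continue
    content = chunk.choices[0].delta.content or ""

    # Assumes each tag arrives whole in one chunk. Text that shares a chunk
    # with a tag is kept rather than dropped.
    if "<think>" in content:
        in_thinking = True
        content = content.replace("<think>", "")
    if "</think>" in content:
        reasoning, _, content = content.partition("</think>")
        thinking_buffer.append(reasoning)
        in_thinking = False

    if not content:
        continue
    if in_thinking:
        thinking_buffer.append(content)
        print(f"[thinking] {content}", end="", flush=True)
    else:
        answer_buffer.append(content)
        print(content, end="", flush=True)

print("\n\nFinal answer:", "".join(answer_buffer))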
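The loop above checks each chunk for a whole tag, so a tag that happens to be split across two chunks would be missed. Whether that occurs in practice depends on how the stream is chunked. As a minimal sketch, assuming the reasoning is delimited by <think> and </think> in the text stream, the hypothetical helper split_thinking below holds back any trailing text that could be the start of a tag. It takes plain strings, so it is independent of any SDK.

```python
from typing import Iterable, Iterator, Tuple

OPEN_TAG, CLOSE_TAG = "<think>", "</think>"


def split_thinking(chunks: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield ("thinking" | "answer", text) pairs from streamed text chunks.

    Hypothetical helper (not part of any SDK): it buffers just enough text
    to recognise a tag that is split across two chunks.
    """
    buffer = ""
    in_thinking = False
    for chunk in chunks:
        buffer += chunk
        # Consume every complete tag currently in the buffer.
        while True:
            tag = CLOSE_TAG if in_thinking else OPEN_TAG
            index = buffer.find(tag)
            if index == -1:
                break
            if index > 0:
                yield ("thinking" if in_thinking else "answer", buffer[:index])
            buffer = buffer[index + len(tag):]
            in_thinking = not in_thinking
        # Hold back any suffix that could be the start of the next tag.
        tag = CLOSE_TAG if in_thinking else OPEN_TAG
        keep = 0
        for size in range(min(len(tag) - 1, len(buffer)), 0, -1):
            if buffer.endswith(tag[:size]):
                keep = size
                break
        emit = buffer[:len(buffer) - keep]
        if emit:
            yield ("thinking" if in_thinking else "answer", emit)
        buffer = buffer[len(buffer) - keep:]
    # Flush whatever is left once the stream ends.
    if buffer:
        yield ("thinking" if in_thinking else "answer", buffer)
```

To use it with the streaming call, pass a generator that yields chunk.choices[0].delta.content or "" for each chunk, then route "thinking" pieces to the indicator and "answer" pieces to the display.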

Parallel Requests for Batch Processing

Together AI's serverless infrastructure accepts parallel requests without throughput degrading as concurrency rises. For batch reasoning tasks, send the requests concurrently with the async client:

import asyncio
import os

from together import AsyncTogether

async def reason(client: AsyncTogether, problem: str) -> str:
    """Send one problem to DeepSeek R1 and return the full response text."""
    response = await client.chat.completions.create(
        model="deepseek-ai/DeepSeek-R1",
        messages=[{"role": "user", "content": problem}],
    )
    return response.choices[0].message.content

async def batch_reason(problems: list[str]) -> list[str]:
    # One shared client for all requests. gather() runs them concurrently
    # and returns the results in the same order as `problems`.
    client = AsyncTogether(api_key=os.environ["TOGETHER_API_KEY"])
    tasks = [reason(client, p) for p in problems]
    return await asyncio.gather(*tasks)

results = asyncio.run(batch_reason(["Problem 1...", "Problem 2...", "Problem 3..."]))
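For large batches, launching every request at once may still run into account-level limits or exhaust local connections. One common pattern, shown here as a sketch rather than anything Together AI prescribes, is to cap the number of in-flight requests with asyncio.Semaphore. The names bounded_gather, worker, and max_concurrency are hypothetical, and the default of 8 is an arbitrary placeholder.

```python
import asyncio
from typing import Awaitable, Callable, List


async def bounded_gather(
    problems: List[str],
    worker: Callable[[str], Awaitable[str]],
    max_concurrency: int = 8,  # arbitrary placeholder, tune for your account
) -> List[str]:
    """Run worker(problem) for every problem, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(problem: str) -> str:
        # Each task waits here until one of the slots is free.
        async with semaphore:
            return await worker(problem)

    # gather() preserves input order in its results.
    return await asyncio.gather(*(run_one(p) for p in problems))
```

With the reason function from the example above, the worker could be lambda p: reason(client, p).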

Distilled R1 Variants for Lower Cost

Together AI also hosts the distilled R1 variants, which are far smaller and cheaper while retaining most of the reasoning quality:

Model	Size	Cost (input)	vs R1 671B
DeepSeek-R1	671B	$3.00/1M	Baseline
DeepSeek-R1-Distill-Llama-70B	70B	$0.88/1M	~90% quality
DeepSeek-R1-Distill-Qwen-32B	32B	$0.27/1M	~85% quality
DeepSeek-R1-Distill-Qwen-7B	7B	$0.20/1M	~75% quality
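
To compare these prices for a concrete workload, the arithmetic can be sketched as below. The figures are copied from the table and cover input tokens only; output-token pricing is not listed here, so treat the result as a partial estimate. The function name input_cost_usd is hypothetical.

```python
# Input prices in USD per 1M tokens, copied from the table above.
INPUT_PRICE_PER_MILLION = {
    "DeepSeek-R1": 3.00,
    "DeepSeek-R1-Distill-Llama-70B": 0.88,
    "DeepSeek-R1-Distill-Qwen-32B": 0.27,
    "DeepSeek-R1-Distill-Qwen-7B": 0.20,
}


def input_cost_usd(model: str, input_tokens: int) -> float:
    """Estimated input-token cost in USD (output tokens are not included)."""
    return INPUT_PRICE_PER_MILLION[model] * input_tokens / 1_000_000


if __name__ == "__main__":
    # Compare every model on the same hypothetical 50M-token workload.
    for name in INPUT_PRICE_PER_MILLION:
        print(f"{name}: ${input_cost_usd(name, 50_000_000):.2f}")
```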

For most math and coding reasoning tasks, the 70B distill performs within 10% of the full 671B model at less than one-third the price. Start with the 70B distill and only escalate to the full model for problems that require it.
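
The "start small, escalate when needed" approach can be sketched as a small routing function. This is one possible shape, not a prescribed method: ask and is_acceptable are hypothetical callables you supply (for example, a wrapper around the chat completions call and a check such as running unit tests on generated code), and the distill model ID is an assumption to verify against Together AI's model list.

```python
from typing import Callable, Tuple

# Model IDs: the full-model ID appears in the examples above; the distill ID
# is an assumption and should be checked against the provider's model list.
DISTILL_MODEL = "deepseek-ai/DeepSeek-R1-Distill-Llama-70B"
FULL_MODEL = "deepseek-ai/DeepSeek-R1"


def solve_with_escalation(
    problem: str,
    ask: Callable[[str, str], str],
    is_acceptable: Callable[[str], bool],
) -> Tuple[str, str]:
    """Try the 70B distill first; fall back to the full model if the check fails.

    ask(model, problem) returns the model's answer text.
    Returns (model_used, answer).
    """
    answer = ask(DISTILL_MODEL, problem)
    if is_acceptable(answer):
        return DISTILL_MODEL, answer
    # The cheaper model's answer did not pass the check, so pay for the full model.
    return FULL_MODEL, ask(FULL_MODEL, problem)
```

The quality of the routing depends entirely on is_acceptable; a weak check will either escalate too often or let poor answers through.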
