Streaming LLM responses fundamentally transforms the user experience of AI interfaces. Instead of staring at a loading spinner for 10-30 seconds, users see text appear token by token within milliseconds of the model starting to generate. This is not just a cosmetic improvement: perceived performance improves dramatically, and users are far more likely to stay engaged with a streaming interface than with one that makes them wait for the complete response.
Why Streaming Matters
A GPT-4o response to a moderately complex prompt might take 15-25 seconds to fully generate. If you wait for the complete response before displaying anything, users experience a long, opaque wait with no feedback. With streaming, users see the first token within 200-500ms of sending their message. Even if the full response takes 20 seconds, the interface feels fast because feedback is immediate.
This is the same principle that makes progressive image loading (blur-up) feel faster than waiting for a full image load. The actual time is the same, but the perceived experience is fundamentally different.
Beyond perception, streaming enables more natural conversational patterns. Users can start reading the beginning of a response while the end is still generating, and they can interrupt a response that is clearly going in the wrong direction.
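The gap between time to first token and total generation time can be measured directly. The sketch below is illustrative only: measureStreamLatency, simulatedModel, and the injected fake clock are hypothetical names, and the timings are made up so the example runs deterministically without calling a real model.

```typescript
// Sketch: measure time to first token vs. total generation time.
// `now` is injected so the example is deterministic; in an application it
// could be something like () => performance.now().
async function measureStreamLatency(
  tokens: AsyncIterable<string>,
  now: () => number,
): Promise<{ firstTokenMs: number | null; totalMs: number; tokenCount: number }> {
  const start = now();
  let firstTokenMs: number | null = null;
  let tokenCount = 0;
  for await (const token of tokens) {
    if (firstTokenMs === null) firstTokenMs = now() - start;
    if (token.length > 0) tokenCount += 1;
  }
  return { firstTokenMs, totalMs: now() - start, tokenCount };
}

// Simulated model: advances a fake clock instead of really waiting.
let fakeClockMs = 0;
async function* simulatedModel(): AsyncGenerator<string> {
  fakeClockMs += 300; // delay before the first token
  yield "Hello";
  for (const token of [",", " world"]) {
    fakeClockMs += 50; // delay between later tokens
    yield token;
  }
}

const latencyPromise = measureStreamLatency(simulatedModel(), () => fakeClockMs);
// Resolves to firstTokenMs = 300, totalMs = 400, tokenCount = 3.
latencyPromise.then((result) => console.log(result));
```

Logging both numbers side by side makes it easy to see that the user-facing wait is governed by the first figure, not the second.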
How It Works Technically
LLM APIs that support streaming typically use Server-Sent Events (SSE), a lightweight standard built on plain HTTP in which the server sends a series of text events over a single long-lived connection.
When you send a request with stream: true, the connection stays open and the server sends chunks like:
data: {"choices":[{"delta":{"content":"The"},"index":0}]}
data: {"choices":[{"delta":{"content":" weather"},"index":0}]}
data: {"choices":[{"delta":{"content":" in"},"index":0}]}
data: [DONE]
Each chunk contains a small delta (usually 1-5 tokens). Your client accumulates these deltas and updates the UI as they arrive.
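If you are not using an SDK, the accumulation step can be sketched roughly as follows. This is a minimal illustration that assumes the OpenAI-style chunk shape shown above; createSseParser is a hypothetical helper, not part of any library. It buffers partial lines because a network read can end in the middle of an event.

```typescript
// Shape of the chunks shown above (only the fields this sketch reads).
interface ChatChunk {
  choices: { delta: { content?: string }; index: number }[];
}

// Hypothetical helper: returns a function that accepts raw text as it arrives
// and reports the content deltas found in the complete lines seen so far.
function createSseParser(): (raw: string) => { deltas: string[]; done: boolean } {
  let buffer = "";
  let done = false;
  return (raw: string) => {
    buffer += raw;
    const lines = buffer.split("\n");
    // The last element is an unfinished line (or ""); keep it for the next call.
    buffer = lines.pop() ?? "";
    const deltas: string[] = [];
    for (const line of lines) {
      const trimmed = line.trim();
      // Ignore blank separators, non-data fields, and anything after [DONE].
      if (done || !trimmed.startsWith("data:")) continue;
      const payload = trimmed.slice("data:".length).trim();
      if (payload === "[DONE]") {
        done = true;
        continue;
      }
      const chunk = JSON.parse(payload) as ChatChunk;
      const content = chunk.choices[0]?.delta?.content;
      if (content) deltas.push(content);
    }
    return { deltas, done };
  };
}

// Usage: the first read deliberately ends in the middle of an event.
const parse = createSseParser();
let fullText = "";
let finished = false;
const reads = [
  'data: {"choices":[{"delta":{"content":"The"},"index":0}]}\n\ndata: {"choi',
  'ces":[{"delta":{"content":" weather"},"index":0}]}\n\n',
  "data: [DONE]\n\n",
];
for (const read of reads) {
  const { deltas, done } = parse(read);
  fullText += deltas.join(""); // this is where a UI would re-render
  finished = done;
}
console.log(fullText); // prints "The weather"
```

In practice the SDKs shown in the following sections do this parsing for you; the sketch is mainly useful for understanding what they handle.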
OpenAI Streaming
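import OpenAI from "openai";

// Reads OPENAI_API_KEY from the environment by default.
const openai = new OpenAI();

const stream = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Explain streaming in one paragraph" }],
  stream: true,
});

let fullText = "";
for await (const chunk of stream) {
  // Some chunks carry no text (for example the final one), so default to "".
  const delta = chunk.choices[0]?.delta?.content ?? "";
  fullText += delta;
  process.stdout.write(delta); // or update your UI state
}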
Anthropic Streaming
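import Anthropic from "@anthropic-ai/sdk";

// Reads ANTHROPIC_API_KEY from the environment by default.
const anthropic = new Anthropic();

const stream = anthropic.messages.stream({
  model: "claude-3-5-sonnet-20241022",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Explain streaming in one paragraph" }],
});

for await (const event of stream) {
  // Only text deltas carry output text; other event types mark message and block boundaries.
  if (
    event.type === "content_block_delta" &&
    event.delta.type === "text_delta"
  ) {
    process.stdout.write(event.delta.text);
  }
}

// Resolves once the stream has ended, with the fully assembled message.
const finalMessage = await stream.finalMessage();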
Implementation With Vercel AI SDK
For Next.js applications, the Vercel AI SDK is the most practical approach. It handles the SSE complexity, provides React hooks, and works with multiple providers through a unified interface.
// app/api/chat/route.ts
import { streamText } from "ai";
import { openai } from "@ai-sdk/openai";

export async function POST(req: Request) {
  const { messages } = await req.json();

  const result = streamText({
    model: openai("gpt-4o"),
    messages,
  });

  return result.toDataStreamResponse();
}
// components/Chat.tsx
"use client"; // useChat is a React hook, so this must be a Client Component

import { useChat } from "ai/react";

export function Chat() {
  const { messages, input, handleInputChange, handleSubmit, isLoading } =
    useChat();

  return (
    <div>
      {messages.map((m) => (
        <div key={m.id}>
          {m.role}: {m.content}
        </div>
      ))}
      <form onSubmit={handleSubmit}>
        <input value={input} onChange={handleInputChange} />
        <button type="submit" disabled={isLoading}>
          Send
        </button>
      </form>
    </div>
  );
}
The useChat hook handles streaming automatically. The messages array updates in real time as tokens arrive.
Handling Streaming Errors Gracefully
Streaming errors are different from standard API errors because they can occur mid-stream (after you have already started displaying content to the user).
Best practices:
- Display what you have received so far when an error occurs
- Show a clear error indicator without wiping the partial response
- Offer a retry option without losing conversation context
- Implement exponential backoff on reconnection
try {
  for await (const chunk of stream) {
    // accumulate and display
  }
} catch (error) {
  // The stream was interrupted. Show partial content and error UI.
  setError("Response interrupted. The partial answer is shown above.");
}
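The first and last of the practices listed above can be sketched in a provider-agnostic way. Everything here is illustrative: consumeKeepingPartial, backoffDelayMs, and flakyStream are hypothetical names, and the base delay and cap are arbitrary defaults rather than recommended values.

```typescript
// Result of consuming a stream: whatever text arrived, plus the error if any.
interface StreamOutcome {
  text: string; // everything received before completion or failure
  error: Error | null; // null when the stream finished normally
}

// Accumulates deltas and never discards what was already received.
async function consumeKeepingPartial(
  chunks: AsyncIterable<string>,
  onDelta: (delta: string) => void,
): Promise<StreamOutcome> {
  let text = "";
  try {
    for await (const delta of chunks) {
      text += delta;
      onDelta(delta); // e.g. append to UI state
    }
    return { text, error: null };
  } catch (err) {
    // Mid-stream failure: hand back the partial text together with the error.
    return { text, error: err instanceof Error ? err : new Error(String(err)) };
  }
}

// Delay before retry attempt n (0-based): baseMs * 2^n, capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 8000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Simulated stream that fails after two chunks.
async function* flakyStream(): AsyncGenerator<string> {
  yield "Streaming ";
  yield "keeps ";
  throw new Error("connection reset");
}

const outcomePromise = consumeKeepingPartial(flakyStream(), () => {});
outcomePromise.then((outcome) => {
  console.log(outcome.text); // prints "Streaming keeps "
  console.log(outcome.error?.message); // prints "connection reset"
});
```

Returning the partial text alongside the error, instead of throwing, lets the caller decide whether to show a retry button, and backoffDelayMs(0), backoffDelayMs(1), backoffDelayMs(2) give 500, 1000, and 2000 ms between attempts.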
Cancellation
Users should be able to stop a generation mid-stream. This requires an AbortController:
const controller = new AbortController();

const stream = await openai.chat.completions.create(
  {
    model: "gpt-4o",
    messages,
    stream: true,
  },
  { signal: controller.signal }
);

// When user clicks "Stop":
controller.abort();
The Vercel AI SDK's useChat hook exposes a stop() function that handles this automatically.
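To see how cancellation interacts with partial output, here is a small self-contained sketch. It uses a simulated token source (abortableTokens, a hypothetical stand-in for an SDK stream) so it runs without network access. Real clients differ in how they surface an abort: some end the iteration quietly and others throw an abort error, so wrapping the loop in a try/catch is prudent.

```typescript
// Simulated token source that stops producing once the signal is aborted.
async function* abortableTokens(
  tokens: string[],
  signal: AbortSignal,
): AsyncGenerator<string> {
  for (const token of tokens) {
    if (signal.aborted) return; // cancelled: produce nothing further
    yield token;
  }
}

// Reads tokens and cancels after `stopAfter` of them have been shown.
async function readUntilStopped(stopAfter: number): Promise<string> {
  const controller = new AbortController();
  const source = abortableTokens(["One", " two", " three", " four"], controller.signal);
  let shown = "";
  let count = 0;
  for await (const token of source) {
    shown += token;
    count += 1;
    if (count === stopAfter) controller.abort(); // simulates the user clicking "Stop"
  }
  return shown; // the partial text stays available after cancellation
}

const stoppedPromise = readUntilStopped(2);
stoppedPromise.then((shown) => console.log(shown)); // prints "One two"
```

The point of the sketch is that cancelling stops new tokens from arriving but does not discard what has already been displayed.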
When NOT to Stream
Streaming is not always the right choice:
Batch processing: if you are processing thousands of documents server-side and storing results, streaming provides no value. Collect the full response and store it.
API-to-API calls: when one backend service calls another and needs the complete result before proceeding, streaming adds complexity without benefit.
Short responses: for responses under 100 tokens (yes/no questions, classification outputs, short code snippets), the whole response finishes generating so soon after the first token that streaming adds no perceptible benefit, only implementation complexity.
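When you need the final token count for billing/logging: with streaming, you need to count tokens yourself or wait for the final chunk that includes usage statistics. With OpenAI's Chat Completions API, that usage chunk is only sent when the request sets stream_options: { include_usage: true }.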
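Reading usage from the final chunk might look like the sketch below. It assumes an OpenAI-style stream in which the last chunk has an empty choices array and a usage object; the chunks are simulated, and collectWithUsage and simulatedChunks are hypothetical names. Other providers report usage differently.

```typescript
interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
  total_tokens: number;
}

// Only the fields this sketch reads from each chunk.
interface UsageChunk {
  choices: { delta: { content?: string } }[];
  usage?: Usage | null;
}

// Accumulates text and remembers the usage block if one arrives.
async function collectWithUsage(
  chunks: AsyncIterable<UsageChunk>,
): Promise<{ text: string; usage: Usage | null }> {
  let text = "";
  let usage: Usage | null = null;
  for await (const chunk of chunks) {
    // The usage chunk has no choices, so the optional chaining yields "".
    text += chunk.choices[0]?.delta?.content ?? "";
    if (chunk.usage) usage = chunk.usage;
  }
  return { text, usage };
}

// Simulated stream: two content chunks, then a usage-only chunk.
async function* simulatedChunks(): AsyncGenerator<UsageChunk> {
  yield { choices: [{ delta: { content: "Yes" } }] };
  yield { choices: [{ delta: { content: "." } }] };
  yield {
    choices: [],
    usage: { prompt_tokens: 12, completion_tokens: 2, total_tokens: 14 },
  };
}

const usagePromise = collectWithUsage(simulatedChunks());
usagePromise.then(({ text, usage }) => {
  console.log(text); // prints "Yes."
  console.log(usage?.total_tokens); // prints 14
});
```

If the stream is interrupted before the last chunk, usage stays null, so billing or logging code should treat a missing value as a case to handle rather than as zero.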