Streaming LLM Responses: How to Build Real-Time AI Interfaces
Streaming makes AI interfaces feel dramatically more responsive by showing users tokens as they generate rather than making them wait for a complete response.
Streaming LLM responses fundamentally changes the user experience of AI interfaces. Instead of staring at a loading spinner for 10-30 seconds, users see text appear token by token within milliseconds of the model starting to generate. This is not just a cosmetic improvement. Perceived performance improves dramatically, and users are far more likely to stay engaged with an interface that streams than with one that makes them wait for the complete response.
Why Streaming Matters
A GPT-4o response to a moderately complex prompt might take 15-25 seconds to fully generate. If you wait for the complete response before displaying anything, users experience a long, opaque wait with no feedback. With streaming, users see the first token within 200-500ms of sending their message. Even if the full response takes 20 seconds, the interface feels fast because feedback is immediate.
This is the same principle that makes progressive image loading (blur-up) feel faster than waiting for a full image load. The actual time is the same, but the perceived experience is fundamentally different.
Beyond perception, streaming enables more natural conversational patterns. Users can start reading the beginning of a response while the end is still generating, and they can interrupt a response that is clearly going in the wrong direction.
How It Works Technically
LLM APIs that support streaming use Server-Sent Events (SSE), a lightweight HTTP protocol where the server sends a series of text events over a persistent connection.
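To make the wire format concrete, here is a minimal sketch of parsing an OpenAI-style SSE stream by hand. It assumes events are separated by a blank line, use "\n" line endings, carry one JSON payload per data: line, and end with a data: [DONE] sentinel. The parseSseBuffer helper and the ParsedSse type are hypothetical names for illustration. In practice the official SDKs do this parsing for you.

```typescript
// What a parse pass yields: text deltas, whether the stream ended, and leftover bytes.
interface ParsedSse {
  deltas: string[];
  done: boolean;
  rest: string; // incomplete trailing event, to be prepended to the next network read
}

// Hypothetical helper: parse every complete SSE event found in `buffer`.
// Assumes "\n" line endings and a single data: line per event.
function parseSseBuffer(buffer: string): ParsedSse {
  const deltas: string[] = [];
  let done = false;

  // Events are separated by a blank line; the last piece may be incomplete.
  const events = buffer.split("\n\n");
  const rest = events.pop() ?? "";

  for (const event of events) {
    for (const line of event.split("\n")) {
      if (!line.startsWith("data:")) continue; // skip comments and other fields
      const data = line.slice(5).trim();
      if (data === "[DONE]") {
        done = true; // end-of-stream sentinel, not JSON
        continue;
      }
      const payload = JSON.parse(data);
      const delta = payload.choices?.[0]?.delta?.content;
      if (typeof delta === "string") deltas.push(delta);
    }
  }
  return { deltas, done, rest };
}

// Example input: two content chunks, the end sentinel, and a half-received event.
const wire =
  'data: {"choices":[{"delta":{"content":"Hel"}}]}\n\n' +
  'data: {"choices":[{"delta":{"content":"lo"}}]}\n\n' +
  "data: [DONE]\n\n" +
  'data: {"cho';

const parsed = parseSseBuffer(wire);
console.log(parsed.deltas.join(""), parsed.done);
```

Returning the unparsed remainder matters because network reads can split an event at any byte, so a real reader would keep rest and prepend it to the next read.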
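When you send a request with stream: true, the connection stays open and the server sends the response as a series of small chunks, each carrying the next piece of text. The official SDKs expose those chunks as an async iterator.
OpenAI Streaming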
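import OpenAI from "openai";

// Reads OPENAI_API_KEY from the environment by default.
const openai = new OpenAI();

const stream = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Explain streaming in one paragraph" }],
  stream: true,
});

let fullText = "";
for await (const chunk of stream) {
  // Each chunk carries a small delta; content can be absent on some chunks.
  const delta = chunk.choices[0]?.delta?.content ?? "";
  fullText += delta;
  process.stdout.write(delta); // or update your UI state
}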
Anthropic Streaming
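import Anthropic from "@anthropic-ai/sdk";

// Reads ANTHROPIC_API_KEY from the environment by default.
const anthropic = new Anthropic();

const stream = anthropic.messages.stream({
  model: "claude-3-5-sonnet-20241022",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Explain streaming in one paragraph" }],
});

for await (const event of stream) {
  // Only text deltas carry displayable content; other event types are metadata.
  if (
    event.type === "content_block_delta" &&
    event.delta.type === "text_delta"
  ) {
    process.stdout.write(event.delta.text);
  }
}

// Resolves with the complete assembled message once the stream has ended.
const finalMessage = await stream.finalMessage();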
Implementation With Vercel AI SDK
For Next.js applications, the Vercel AI SDK is the most practical approach. It handles the SSE complexity, provides React hooks, and works with multiple providers through a unified interface.
// app/api/chat/route.ts
import { streamText } from "ai";
import { openai } from "@ai-sdk/openai";

export async function POST(req: Request) {
  const { messages } = await req.json();

  const result = streamText({
    model: openai("gpt-4o"),
    messages,
  });

  return result.toDataStreamResponse();
}
On the client, the SDK's useChat hook consumes this route and handles streaming automatically: its messages array updates in real time as tokens arrive.
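To illustrate what "updates in real time" means for UI state, here is a minimal sketch of the kind of immutable update a chat hook performs for each incoming delta. This is not the SDK's implementation. The ChatMessage type and the appendDelta function are hypothetical names used only for illustration.

```typescript
// Hypothetical message shape, loosely modeled on a typical chat UI.
interface ChatMessage {
  id: string;
  role: "user" | "assistant";
  content: string;
}

// Return a NEW array in which `delta` is appended to the trailing assistant
// message, or a new assistant message is started if there is none yet.
// Producing a new array (rather than mutating) is what lets a UI framework
// such as React notice the change and re-render.
function appendDelta(
  messages: readonly ChatMessage[],
  delta: string,
): ChatMessage[] {
  const last = messages[messages.length - 1];
  if (last && last.role === "assistant") {
    return [
      ...messages.slice(0, -1),
      { ...last, content: last.content + delta },
    ];
  }
  return [
    ...messages,
    { id: `assistant-${messages.length}`, role: "assistant", content: delta },
  ];
}

// Simulate two deltas arriving after the user's message.
const initial: ChatMessage[] = [
  { id: "user-0", role: "user", content: "Explain streaming" },
];
const afterFirst = appendDelta(initial, "Hel");
const afterSecond = appendDelta(afterFirst, "lo");
console.log(afterSecond[1]?.content);
```

In a React component, each call would typically run inside a state setter such as setMessages((prev) => appendDelta(prev, delta)), so every delta triggers one re-render of the growing message.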
Handling Streaming Errors Gracefully
Streaming errors are different from standard API errors because they can occur mid-stream (after you have already started displaying content to the user).
Best practices:
Display what you have received so far when an error occurs
Show a clear error indicator without wiping the partial response
Offer a retry option without losing conversation context
Implement exponential backoff on reconnection
try {
  for await (const chunk of stream) {
    // accumulate and display
  }
} catch (error) {
  // The stream was interrupted. Show partial content and error UI.
  setError("Response interrupted. The partial answer is shown above.");
}
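The practices above can be sketched without any SDK. The example below assumes the stream has already been reduced to an async iterable of text deltas. The names collectStream, StreamResult, and backoffDelayMs are hypothetical, and the 500 ms base and 8 second cap are arbitrary illustration values.

```typescript
// Whatever text arrived, plus the error if the stream broke partway through.
interface StreamResult {
  text: string;
  error?: unknown;
}

// Consume text deltas, keeping the partial text when the stream fails
// instead of discarding it.
async function collectStream(
  deltas: AsyncIterable<string>,
  onDelta: (delta: string) => void = () => {},
): Promise<StreamResult> {
  let text = "";
  try {
    for await (const delta of deltas) {
      text += delta;
      onDelta(delta); // e.g. push the delta into UI state
    }
    return { text };
  } catch (error) {
    // Mid-stream failure: return what we have alongside the error.
    return { text, error };
  }
}

// Exponential backoff: baseMs * 2^attempt, capped at maxMs.
// `attempt` is zero-based (0 for the first retry).
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 8000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// A stand-in stream that fails after two deltas, to exercise the error path.
async function* brokenStream(): AsyncGenerator<string> {
  yield "Hel";
  yield "lo";
  throw new Error("connection reset");
}

collectStream(brokenStream()).then((result) => {
  console.log(result.text, result.error instanceof Error);
});
```

A retry handler would wait backoffDelayMs(attempt) before resending the same conversation, while the UI keeps showing result.text with an error indicator. Adding random jitter to the delay is a common refinement that this sketch leaves out.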
Cancellation
Users should be able to stop a generation mid-stream. This requires an AbortController:
const controller = new AbortController();

const stream = await openai.chat.completions.create(
  {
    model: "gpt-4o",
    messages,
    stream: true,
  },
  { signal: controller.signal }
);

// When user clicks "Stop":
controller.abort();
The Vercel AI SDK's useChat hook exposes a stop() function that handles this automatically.
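To show how the abort signal interacts with the read loop, here is a small self-contained sketch that uses a fake token stream in place of a real model. The fakeTokenStream and readUntilStopped functions are hypothetical. Note that this fake stream simply ends when aborted, whereas a real client may end the iteration or throw an abort error, depending on the SDK.

```typescript
// Hypothetical stand-in for a model stream: yields one token per timer tick
// and stops producing as soon as the signal is aborted.
async function* fakeTokenStream(
  tokens: string[],
  signal: AbortSignal,
): AsyncGenerator<string> {
  for (const token of tokens) {
    if (signal.aborted) return; // stop generating once cancelled
    yield token;
    await new Promise((resolve) => setTimeout(resolve, 0));
  }
}

// Read tokens and abort after `stopAfter` of them, simulating a "Stop" click.
async function readUntilStopped(
  tokens: string[],
  stopAfter: number,
): Promise<string> {
  const controller = new AbortController();
  let text = "";
  let seen = 0;
  for await (const token of fakeTokenStream(tokens, controller.signal)) {
    text += token;
    seen += 1;
    if (seen === stopAfter) controller.abort();
  }
  return text; // the partial text stays available after cancellation
}

readUntilStopped(["a", "b", "c", "d"], 2).then((text) => console.log(text));
```

The text accumulated before the abort is still returned, which matches the error-handling advice: a cancelled response should leave the partial answer on screen.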
When NOT to Stream
Streaming is not always the right choice:
Batch processing: if you are processing thousands of documents server-side and storing results, streaming provides no value. Collect the full response and store it.
API-to-API calls: when one backend service calls another and needs the complete result before proceeding, streaming adds complexity without benefit.
Short responses: for responses under 100 tokens (yes/no questions, classification outputs, short code snippets), the full response arrives quickly enough that streaming adds no perceptible benefit, only implementation complexity.
When you need the final token count for billing/logging: with streaming, you need to count tokens yourself or wait for the final chunk that includes usage statistics.
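For the last point, here is a minimal sketch of picking usage statistics out of a stream. It assumes the OpenAI Chat Completions behavior where a request sent with stream_options: { include_usage: true } ends with one extra chunk whose choices array is empty and whose usage field is populated. The Chunk and Usage types are trimmed-down stand-ins defined locally, the summarizeChunks function is a hypothetical name, and the sample numbers are made up.

```typescript
// Trimmed-down stand-ins for the fields this sketch reads.
interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
  total_tokens: number;
}

interface Chunk {
  choices: { delta: { content?: string | null } }[];
  usage?: Usage | null;
}

// Accumulate the text and remember the usage block if one shows up.
function summarizeChunks(chunks: Iterable<Chunk>): {
  text: string;
  usage: Usage | null;
} {
  let text = "";
  let usage: Usage | null = null;
  for (const chunk of chunks) {
    // The usage-only chunk has no choices, so guard the index access.
    text += chunk.choices[0]?.delta?.content ?? "";
    if (chunk.usage) usage = chunk.usage;
  }
  return { text, usage };
}

// Sample data shaped like a short stream followed by the usage chunk.
const sampleChunks: Chunk[] = [
  { choices: [{ delta: { content: "Hi" } }], usage: null },
  { choices: [{ delta: { content: " there" } }], usage: null },
  {
    choices: [],
    usage: { prompt_tokens: 9, completion_tokens: 2, total_tokens: 11 },
  },
];

const summary = summarizeChunks(sampleChunks);
console.log(summary.text, summary.usage?.total_tokens);
```

If the stream is interrupted before the last chunk, usage stays null, so logging code should treat a missing count as a normal case rather than an error.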
Written by
Mahmudul Haque Qudrati
CEO & ML Engineer
CEO and ML Engineer at Pristren. Builds AI-powered software for teams and writes about machine learning, LLMs, developer tools, and practical AI applications.