John R Cottam

Stream AI Responses in Production

Batch responses make AI feel slow. Streaming fixes the perception problem. Here's the pattern I use with Hono and the Vercel AI SDK.


AI responses are slow. Not unusable slow — but slow enough that a loading spinner feels like a lie. The model is thinking, tokens are generating, and your user is staring at nothing.

Streaming fixes the perception problem. Instead of waiting for the full response, tokens arrive as they’re generated. The interface feels alive. Users read while the model writes.

Here’s how I wire it up with Hono and the Vercel AI SDK.

The setup

bun add ai @ai-sdk/anthropic hono

The route

import { Hono } from "hono";
import { streamText } from "ai";
import { anthropic } from "@ai-sdk/anthropic";

const app = new Hono();

app.post("/api/chat", async (c) => {
  const { messages } = await c.req.json();

  const result = streamText({
    model: anthropic("claude-3-5-sonnet-20241022"),
    messages,
    maxTokens: 1024,
  });

  return result.toDataStreamResponse();
});

export default app;

That’s the server. toDataStreamResponse() handles the SSE headers, chunking, and error framing — you don’t write any of that yourself.

The client

import { useChat } from "@ai-sdk/react";

export function Chat() {
  const { messages, input, handleInputChange, handleSubmit } = useChat({
    api: "/api/chat",
  });

  return (
    <div>
      {messages.map((m) => (
        <div key={m.id}>
          <strong>{m.role}:</strong> {m.content}
        </div>
      ))}
      <form onSubmit={handleSubmit}>
        <input value={input} onChange={handleInputChange} />
        <button type="submit">Send</button>
      </form>
    </div>
  );
}

useChat manages the stream lifecycle — connecting, buffering chunks, updating state. The component just renders what’s there.
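If you're curious what that lifecycle looks like without the hook, here's a rough sketch of the core loop: read the response body chunk by chunk and append decoded text as it arrives. The stream below is synthetic — in the real flow it comes from the `fetch("/api/chat")` response body, and useChat adds message parsing and state on top.

```typescript
// Read a streamed body chunk by chunk, appending text as it arrives.
async function readStream(stream: ReadableStream<Uint8Array>): Promise<string> {
  const reader = stream.getReader();
  const decoder = new TextDecoder();
  let text = "";
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters intact across chunk boundaries
    text += decoder.decode(value, { stream: true });
  }
  return text + decoder.decode(); // flush any trailing bytes
}

// Simulate three token chunks arriving over the wire.
const fake = new ReadableStream<Uint8Array>({
  start(controller) {
    const encoder = new TextEncoder();
    for (const token of ["Hello", ", ", "world"]) {
      controller.enqueue(encoder.encode(token));
    }
    controller.close();
  },
});

readStream(fake).then((text) => console.log(text)); // logs "Hello, world"
```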

What actually matters in production

Timeouts. Streaming connections stay open longer than typical API calls. Set maxDuration on your serverless function — Vercel defaults to 10 seconds, which isn’t enough for longer generations. I use 60.
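On Vercel's Node runtime, one way to do that is a `maxDuration` export from the function file — a sketch, assuming your plan allows 60 seconds (check Vercel's function configuration docs for your plan's ceiling):

```typescript
// api/chat.ts — Vercel reads this export to extend the function's time limit.
// 60 is an assumption; set it to whatever your longest generation needs.
export const maxDuration = 60;
```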

Error handling. The stream can fail mid-response. streamText surfaces errors through the stream itself, so toDataStreamResponse() forwards them correctly. On the client, useChat exposes an error state — handle it.

Cancellation. If the client disconnects, the model keeps generating unless you tell it to stop. Use the abortSignal from the request to cancel:

app.post("/api/chat", async (c) => {
  const { messages } = await c.req.json();

  const result = streamText({
    model: anthropic("claude-3-5-sonnet-20241022"),
    messages,
    abortSignal: c.req.raw.signal,
  });

  return result.toDataStreamResponse();
});

Now if the user closes the tab, the generation stops. Tokens cost money — don’t waste them.
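A minimal sketch of the cancellation chain, with the generation stubbed out: `c.req.raw.signal` is an AbortSignal tied to the incoming request, and when the client disconnects it fires — which is what streamText listens for to cancel the provider call.

```typescript
// Stub of a generation that honors an AbortSignal, the way streamText does.
function startGeneration(signal: AbortSignal): { stopped: () => boolean } {
  let stopped = false;
  signal.addEventListener("abort", () => {
    stopped = true; // streamText does the real work: cancelling the provider call
  });
  return { stopped: () => stopped };
}

const controller = new AbortController();
const generation = startGeneration(controller.signal);

console.log(generation.stopped()); // false — still generating
controller.abort(); // what happens when the user closes the tab
console.log(generation.stopped()); // true — no more wasted tokens
```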

Why this stack

The AI SDK’s streamText is provider-agnostic. Swap anthropic(...) for openai(...) or google(...) and nothing else changes. That’s the right abstraction. Hono gives you the lightweight runtime that works on Cloudflare Workers, Vercel Edge, Bun — wherever you deploy.

Streaming isn’t an optimization. It’s a fundamental part of what makes AI interfaces feel good. Build it in from the start.

