Building streaming AI chat with Server-Sent Events

6 min read | 28 April 2026 | Neural Summary

Our AI chat feature lets users ask questions about their conversations and get answers with citations. The responses stream token by token with a blinking cursor, citations render as interactive badges, and users can stop generation mid-response.

This post covers the decisions we made while building it and the problems that shaped the final architecture.

Why SSE instead of WebSocket

We already use WebSocket for real-time transcription progress. The obvious choice for streaming chat would be to reuse that same connection.

We chose Server-Sent Events instead, and it turned out to be the right call for three reasons.

The data flow is unidirectional. The user sends a question (one HTTP request), and the server streams back a response (many tokens). There is no bidirectional communication during streaming. SSE is designed for exactly this pattern.

SSE works with standard HTTP infrastructure. It is a regular HTTP response with Content-Type: text/event-stream. Load balancers, proxies, and CDNs understand it natively. WebSocket requires a protocol upgrade that some intermediaries handle poorly — a lesson we learned from the polling fallback work on transcription progress.

It is simpler to implement and maintain. An SSE endpoint is a regular HTTP route that keeps the response open and writes events. No handshake, no protocol upgrade, no connection state management.
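
As a rough sketch of how little is involved (an Express-style handler; the route name and event payload here are illustrative, not our exact implementation):

```typescript
import express from "express";

const app = express();

// An SSE endpoint is just a long-lived HTTP response with the right
// content type. No handshake, no protocol upgrade.
app.post("/chat/stream", (req, res) => {
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("Connection", "keep-alive");
  res.flushHeaders();

  // Each SSE event is an "event:" line, a "data:" line, and a blank line.
  res.write(`event: token\ndata: ${JSON.stringify({ text: "Hello" })}\n\n`);

  // ...keep writing events as tokens arrive, then close the stream.
  res.end();
});

app.listen(3000);
```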

The trade-off: SSE is server-to-client only. But for chat, the client sends a question via a normal POST request and receives the response via SSE. The constraint matches the use case.

What this gave us: A streaming implementation that works reliably across network environments without the infrastructure issues we had already encountered with WebSocket. The simplicity also made features like stop generation straightforward to implement.

Designing the event protocol

We needed a protocol that could handle the full lifecycle of a chat response: session setup, pre-processing status, token streaming, citation delivery, completion, and errors.

We settled on six typed events:

session — Sent first. Contains the session ID for multi-turn context.

status — Sent before tokens begin. Shows messages like "Searching across 12 conversations..." while the system indexes and searches. Without this, users see a blank loading state for 2-5 seconds on large folders and wonder if anything is happening.

token — The streaming content. One event per token fragment. The majority of the stream.

citations — Sent after all tokens. Contains structured citation data for rendering interactive badges.

done — Signals completion. Includes processing time for analytics.

error — Sent if something goes wrong. The client displays the error and stops the loading state.
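
In TypeScript terms, the protocol looks roughly like this (the field names are illustrative, not our exact wire format):

```typescript
// Illustrative shapes for the six event types.
interface Citation {
  timestamp: string; // e.g. "12:45"
  speaker: string;   // e.g. "Sarah Johnson"
}

type ChatEvent =
  | { type: "session"; sessionId: string }
  | { type: "status"; message: string } // "Searching across 12 conversations..."
  | { type: "token"; text: string }
  | { type: "citations"; citations: Citation[] }
  | { type: "done"; processingMs: number }
  | { type: "error"; message: string };
```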

Since the browser's native EventSource API does not support POST requests or custom headers, we use the fetch API with manual stream parsing. The key subtlety: decoder.decode(value, { stream: true }) tells the decoder that more data is coming, preventing multi-byte characters that span chunk boundaries from being decoded incorrectly.
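
A minimal sketch of that parsing loop (the endpoint and event handling are illustrative):

```typescript
// Parse an SSE stream by hand over fetch, since EventSource cannot
// send a POST body. Endpoint and shapes are illustrative.
async function streamChat(question: string, onEvent: (e: unknown) => void) {
  const response = await fetch("/chat/stream", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ question }),
  });

  const reader = response.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    // { stream: true } holds a partial multi-byte character in the
    // decoder's internal state instead of emitting a replacement char.
    buffer += decoder.decode(value, { stream: true });

    // SSE events are separated by a blank line.
    const events = buffer.split("\n\n");
    buffer = events.pop() ?? ""; // keep any trailing partial event

    for (const raw of events) {
      const dataLine = raw.split("\n").find((l) => l.startsWith("data: "));
      if (dataLine) onEvent(JSON.parse(dataLine.slice(6)));
    }
  }
}
```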

Solving the citation flicker problem

This was the hardest problem to solve well, and the one where the implementation decision had the biggest UX impact.

Citations in our system look like [12:45, Sarah Johnson]. They stream as individual tokens:

token: "["
token: "12"
token: ":45"
token: ", Sarah"
token: " Johnson"
token: "]"

If we render each token immediately, the user sees [12:45, Sarah as plain text for a fraction of a second before the closing bracket arrives and it becomes a styled citation badge. This flicker made the streaming experience feel broken.

Our solution is a buffering system:

  1. When a [ character arrives, start buffering instead of rendering.
  2. Accumulate tokens into the buffer.
  3. If the accumulated text matches the citation pattern, render it as a citation badge.
  4. If it exceeds a reasonable length without matching, flush the buffer as plain text.

The buffer timeout matters. If the LLM pauses mid-citation, we do not want to flush prematurely. We use a generous timeout, since the LLM's token generation is fast enough that a half-second pause within a citation is rare.
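
A sketch of the buffer, with the pattern, length cap, and timeout as stated assumptions rather than our exact values:

```typescript
// Hypothetical rendering hooks for this sketch.
declare function renderText(text: string): void;          // append plain text
declare function renderCitationBadge(text: string): void; // render styled badge

const CITATION_PATTERN = /^\[\d{1,2}:\d{2}, .+\]$/; // e.g. [12:45, Sarah Johnson]
const MAX_BUFFER_LENGTH = 60;  // longer than any plausible citation (assumption)
const FLUSH_TIMEOUT_MS = 1000; // generous: mid-citation pauses this long are rare

let citationBuffer = "";
let flushTimer: ReturnType<typeof setTimeout> | undefined;

function flushAsPlainText() {
  renderText(citationBuffer);
  citationBuffer = "";
}

function onToken(text: string) {
  if (citationBuffer === "" && text.includes("[")) {
    // Render everything before the bracket; buffer from the bracket on.
    const i = text.indexOf("[");
    renderText(text.slice(0, i));
    citationBuffer = text.slice(i);
  } else if (citationBuffer !== "") {
    citationBuffer += text;
  } else {
    renderText(text);
    return;
  }

  clearTimeout(flushTimer);
  if (CITATION_PATTERN.test(citationBuffer)) {
    renderCitationBadge(citationBuffer); // complete citation: styled badge
    citationBuffer = "";
  } else if (citationBuffer.length > MAX_BUFFER_LENGTH) {
    flushAsPlainText(); // too long to be a citation: give up
  } else {
    flushTimer = setTimeout(flushAsPlainText, FLUSH_TIMEOUT_MS); // stalled stream
  }
}
```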

What this changed: The streaming experience went from visually glitchy to seamless. Users see citations appear as complete badges, never as half-formed text. The buffer adds imperceptible latency while eliminating all visual artifacts.

Moving chat history from client to server

In V1, we tracked conversation history client-side, truncating to 6 items. This was fragile. History disappeared on page refresh, could not sync across devices, and limited the system's ability to provide consistent multi-turn context.

In V2, we moved history to the database. Each chat session persists its messages server-side. When the user asks a follow-up question, the server loads the 10 most recent messages (5 question-answer pairs) and includes them in the LLM prompt. The history survives page reloads, device switches, and app restarts.

System: You are answering questions about these conversations...
[Relevant passages from vector search]

Previous messages:
User: Which prospects mentioned budget constraints?
Assistant: Based on the conversations, three prospects...

Current question:
User: Of those, which ones also expressed urgency?
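
A sketch of the assembly step (the message shape and loader are hypothetical stand-ins for our actual schema and data layer):

```typescript
interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

// Hypothetical persistence helper.
declare function loadRecentMessages(
  sessionId: string,
  limit: number
): Promise<ChatMessage[]>;

async function buildPrompt(
  sessionId: string,
  question: string,
  passages: string[]
): Promise<string> {
  // 10 most recent messages = 5 question-answer pairs.
  const history = await loadRecentMessages(sessionId, 10);

  const historyBlock = history
    .map((m) => `${m.role === "user" ? "User" : "Assistant"}: ${m.content}`)
    .join("\n");

  return [
    "System: You are answering questions about these conversations...",
    ...passages, // relevant passages from vector search
    "",
    "Previous messages:",
    historyBlock,
    "",
    "Current question:",
    `User: ${question}`,
  ].join("\n");
}
```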

What this changed: Follow-up questions became reliable. Users can ask "tell me more about the second one" and get a coherent answer, even after refreshing the page or switching devices. The server-side history also opened the door to features like session management and chat export.

Stop generation without wasted API costs

Users can stop generation mid-response. This matters for long answers: if the first paragraph answers the question, the user should not have to wait for three more paragraphs.

Implementation requires coordination between client and server. On the client, clicking stop cancels the stream reader and aborts the fetch request. On the server, the closed connection triggers an abort of the LLM call. Without the server-side abort, the LLM would continue generating tokens that nobody receives, wasting API costs.

The partially generated response is preserved. Whatever tokens arrived before the stop are saved as the assistant's message, keeping the conversation history coherent for follow-up questions.
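
On the client, passing an AbortController's signal to fetch handles the cancellation. The server side, in an Express-style sketch (callLLM and saveAssistantMessage are hypothetical stand-ins for the provider client and the persistence layer):

```typescript
import express from "express";

// Hypothetical stand-ins: a streaming LLM client that accepts an abort
// signal, and a helper that persists the assistant's message.
declare function callLLM(
  prompt: string,
  opts: { signal: AbortSignal }
): AsyncIterable<string>;
declare function saveAssistantMessage(
  sessionId: string,
  content: string
): Promise<void>;

const app = express();

app.post("/chat/stream", async (req, res) => {
  res.setHeader("Content-Type", "text/event-stream");
  const llmAbort = new AbortController();

  // When the client disconnects (stop button, closed tab), abort the
  // upstream LLM call so we stop paying for tokens nobody receives.
  req.on("close", () => llmAbort.abort());

  let answer = "";
  try {
    for await (const token of callLLM("...", { signal: llmAbort.signal })) {
      answer += token;
      res.write(`event: token\ndata: ${JSON.stringify({ text: token })}\n\n`);
    }
  } catch (err) {
    if (!llmAbort.signal.aborted) throw err; // real error, not a user stop
  } finally {
    // Preserve whatever was generated before the stop, keeping the
    // multi-turn history coherent for follow-up questions.
    await saveAssistantMessage("session-id", answer);
  }
  res.end();
});
```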

What this changed: Users feel in control of the AI. Long-winded responses are not a patience tax. And we avoid paying for tokens that the user explicitly indicated they do not need.

What these decisions taught us

Choose the transport that matches the data flow. We already had WebSocket infrastructure, but SSE was the better fit for unidirectional streaming. Using the right tool for the pattern, rather than reusing existing infrastructure, resulted in a simpler and more reliable implementation.

Buffer rendering when the display format differs from the stream format. Citations are the clearest example, but the principle applies whenever streamed content needs to be assembled before it can be rendered correctly. The buffer adds minimal latency while eliminating visual artifacts.

Server-side state is worth the storage cost for anything users expect to persist. Client-side history felt like a reasonable shortcut until users started expecting their chat context to survive a page refresh. Moving state to the server was a small migration with outsized reliability gains.

Show what the system is doing, not just that it is working. The status message ("Searching across 12 conversations...") does more for perceived latency than an equivalent reduction in actual latency would. Any operation over one second needs visible, specific feedback.
