
Streaming LLM responses in Spring AI for a better user experience

Dev showed the support assistant to the product manager. It worked well, but there was a pause — sometimes two to three seconds — before the answer appeared. “It feels slow,” said the PM. “Can you make it show the response as it types, like ChatGPT?”

The model was not actually slow. It was generating the full response at normal speed. The problem was that Dev’s endpoint waited for the complete response before sending anything back. Streaming fixes that.


Why streaming matters

An LLM generates text one token at a time. With a non-streaming call, your server waits until the last token is generated, assembles the full text, and sends it in one response. The user waits with nothing to look at.

With streaming, each token (or small batch of tokens) is sent to the client as soon as it is generated. The response appears to “type itself” in real time. For a 300-word response that takes 3 seconds to generate, streaming makes the experience feel immediate because the first words appear almost as soon as generation starts, not after the full three seconds.

The total generation time is identical. The perceived latency is dramatically lower.

Important: Streaming is a UX concern, not a performance optimization. The model generates at the same speed either way. The difference is when the client first sees output — immediately with streaming vs after the full response without it.

.stream() in Spring AI

ChatClient has a .stream() alternative to .call(). It returns a StreamResponseSpec rather than a CallResponseSpec.

// Non-streaming — blocks until complete
String complete = chatClient.prompt()
        .user("Explain dependency injection in Java.")
        .call()
        .content();

// Streaming — returns a Flux that emits tokens as they arrive
Flux<String> tokens = chatClient.prompt()
        .user("Explain dependency injection in Java.")
        .stream()
        .content();

stream().content() returns a Flux<String> where each element is a partial token or small chunk of text. You subscribe to it to consume the tokens.
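
For a quick check outside a web endpoint, you can subscribe to the Flux yourself and print each chunk as it arrives. A minimal sketch, reusing the chatClient from above; the blocking call at the end is only there so the demo waits for the stream to finish:

Flux<String> tokens = chatClient.prompt()
        .user("Explain dependency injection in Java.")
        .stream()
        .content();

// Print each chunk the moment it is emitted
tokens.doOnNext(System.out::print)
        .blockLast(); // fine in a demo or test, never on a reactive request thread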

Wiring streaming to a Server-Sent Events endpoint

Server-Sent Events (SSE) is the standard browser protocol for server-to-client streaming over HTTP. Spring WebFlux and Spring MVC both support SSE endpoints. Spring AI’s Flux<String> maps directly to SSE.

@RestController
@RequestMapping("/api/support")
class SupportStreamController {

    private final ChatClient chatClient;

    SupportStreamController(ChatClient chatClient) {
        this.chatClient = chatClient;
    }

    @GetMapping(value = "/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
    Flux<String> stream(@RequestParam String question) {
        return chatClient.prompt()
                .user(question)
                .stream()
                .content();
    }
}

Three things to notice:

  1. The return type is Flux<String> — Spring automatically serializes each emitted string as an SSE event.
  2. produces = MediaType.TEXT_EVENT_STREAM_VALUE tells the client to expect SSE.
  3. The endpoint uses @GetMapping because browsers open SSE connections via GET.

Test it from the terminal:

curl -N "http://localhost:8080/api/support/stream?question=What+is+your+return+policy"

You will see tokens arrive one by one in the terminal output.

Tip: Use @GetMapping with a @RequestParam for simple streaming endpoints. If you need to send a complex request body (conversation history, session ID, etc.), use @PostMapping and add produces = TEXT_EVENT_STREAM_VALUE — the SSE protocol works over POST too, though some clients do not support it natively.
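
A sketch of what that POST variant could look like. The SupportStreamRequest record and its field names are illustrative, chosen to match the fetch example further down:

record SupportStreamRequest(String question, String sessionId) {}

@PostMapping(value = "/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
Flux<String> streamWithBody(@RequestBody SupportStreamRequest request) {
    // request.sessionId() could be used to load conversation history before prompting
    return chatClient.prompt()
            .user(request.question())
            .stream()
            .content();
}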

Consuming the stream in JavaScript

On the client side, the browser’s built-in EventSource API connects to SSE endpoints:

const source = new EventSource(
  `/api/support/stream?question=${encodeURIComponent(userQuestion)}`
);

let responseText = '';

source.onmessage = (event) => {
  responseText += event.data;
  displayElement.textContent = responseText;
};

source.onerror = () => {
  source.close();
};

Each event.data value is one chunk from the Flux<String>. Appending chunks as they arrive produces the “typing” effect.

For POST-based streaming (when you need to send a request body), use the fetch API with streaming body reads instead:

const response = await fetch('/api/support/stream', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ question: userQuestion, sessionId: sessionId })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
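  // Note: if the endpoint produces text/event-stream, each chunk includes SSE
  // framing ("data:" prefixes and blank lines) that should be stripped before display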
  displayElement.textContent += decoder.decode(value);
}

Streaming with the full ChatResponse

If you need metadata from a streaming response — token counts, finish reason — use stream().chatResponse() instead of stream().content():

Flux<ChatResponse> responses = chatClient.prompt()
        .user("Summarize this product review: " + reviewText)
        .stream()
        .chatResponse();

// Process each chunk, log the last one for token counts
responses.subscribe(chunk -> {
    String text = chunk.getResult().getOutput().getText();
    if (text != null) {
        process(text);
    }
    // The last chunk contains usage metadata
    if (chunk.getMetadata().getUsage() != null) {
        log.info("Total tokens: {}", chunk.getMetadata().getUsage().getTotalTokens());
    }
});

In practice, token counts are only available on the final chunk. If you only need the text, stick with stream().content().
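
If you want both behaviours without checking every chunk for usage, one option is to process the text as chunks pass through and keep only the final chunk for its metadata. A sketch using standard Reactor operators on the same responses Flux:

responses
        .doOnNext(chunk -> {
            String text = chunk.getResult().getOutput().getText();
            if (text != null) {
                process(text);
            }
        })
        .last() // Mono holding the final chunk, which carries the usage metadata
        .subscribe(last -> log.info("Total tokens: {}",
                last.getMetadata().getUsage().getTotalTokens()));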

Caution: Flux is cold — nothing happens until someone subscribes. Returning a Flux from a Spring WebFlux controller automatically subscribes when the HTTP response is opened. But if you call .stream() inside a non-reactive method and do not subscribe, the LLM call never fires. Always verify your stream is actually consumed.
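
If you do need the streamed text from a non-reactive code path (a script or a scheduled job, for example), one way to guarantee the stream is consumed is to collect and block. A sketch, assuming the same chatClient and a question variable; acceptable off the request path, never inside a reactive handler:

// Joins every streamed chunk into one string; blocks the calling thread until the stream completes
String full = chatClient.prompt()
        .user(question)
        .stream()
        .content()
        .collect(Collectors.joining()) // java.util.stream.Collectors
        .block();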

When not to stream

Streaming improves perceived latency in interactive interfaces. It adds complexity without benefit in other contexts:

| Context | Stream? | Reason |
| --- | --- | --- |
| Chat UI / assistant interface | Yes | Users see immediate response |
| Document generation shown to user | Yes | Long responses benefit most |
| Background processing jobs | No | No user waiting; completion matters, not first-token speed |
| API-to-API calls | No | Caller wants the complete response |
| Structured output (.entity()) | No | Parsing requires the complete JSON |
| Classification tasks | No | Short responses; streaming adds no perceived benefit |
| Unit tests | No | Complicates assertions; use .call() in tests |

The complete streaming support endpoint

Combining the setup from the previous posts with streaming:

@RestController
@RequestMapping("/api/support")
class SupportController {

    private final ChatClient chatClient;

    SupportController(ChatClient chatClient) {
        this.chatClient = chatClient;
    }

    // Non-streaming for API consumers
    @PostMapping("/chat")
    SupportResponse chat(@RequestBody SupportRequest request) {
        String answer = chatClient.prompt()
                .user(request.question())
                .call()
                .content();
        return new SupportResponse(answer);
    }

    // Streaming for browser chat interfaces
    @GetMapping(value = "/chat/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
    Flux<String> chatStream(@RequestParam String question) {
        return chatClient.prompt()
                .user(question)
                .stream()
                .content();
    }
}

record SupportRequest(String question) {}
record SupportResponse(String answer) {}

Expose both. API consumers use the POST endpoint for a clean JSON response. Browser clients use the GET stream endpoint for the “typing” experience.
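
To exercise both from the terminal, assuming the application runs on localhost:8080 as in the earlier curl example:

# Non-streaming JSON response
curl -s -X POST "http://localhost:8080/api/support/chat" \
  -H "Content-Type: application/json" \
  -d '{"question":"What is your return policy?"}'

# Streaming SSE response
curl -N "http://localhost:8080/api/support/chat/stream?question=What+is+your+return+policy"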

Note: With Module 2 complete, you have a working AI-powered chat endpoint with externalized prompts, structured output capability, and streaming support. The assistant answers questions, but only from its training data — it knows nothing about your actual products. That changes in Module 3 with embeddings and semantic search.
