JSBlogs

Why LLMs forget everything — and what you must do about it

A customer started a conversation with the TechGadgets support assistant. “I ordered some headphones last week.” Then: “Are they covered by the warranty?” The assistant replied: “Could you clarify what you’re asking about?”

It had forgotten the headphones. It had forgotten everything said before this message.

This is not a bug in Dev’s implementation. It is the fundamental nature of LLMs.


LLMs are stateless by design

Every API call to an LLM is completely independent. The model processes the prompt you send, generates a response, and discards all state. The next call starts from zero.

This is intentional. Statelessness makes LLMs horizontally scalable — any request can go to any server, no session affinity required. It makes them safe to cache, retry, and rate-limit. The trade-off is that they have no built-in memory of previous interactions.

When a user says “are they covered by the warranty?”, the LLM only sees that sentence — unless you include the prior conversation in the current request.

Important: The LLM does not have a session. There is no server-side conversation state. The only "memory" an LLM has is what you put in the prompt. If you want the model to know what was said before, you must send it again — every single time.

The context window is both the solution and the constraint

The solution to statelessness is simple: include all previous messages in every request. If the conversation has 5 turns, send all 5 turns in the 6th request. The model sees the full history and can reference any of it.
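To make this concrete, here is a minimal plain-Java sketch of what "include all previous messages in every request" means. The `Message` record and `buildRequest` method are invented for illustration — they are not Spring AI or provider APIs:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: Message and buildRequest are invented names.
// The point is that each request payload contains the entire history
// plus the new user message — the model never remembers on its own.
public class FullHistoryDemo {
    public record Message(String role, String content) {}

    // The request sent to the model is the whole history, re-sent every turn.
    public static List<Message> buildRequest(List<Message> history, String userInput) {
        List<Message> request = new ArrayList<>(history);
        request.add(new Message("user", userInput));
        return request;
    }

    public static void main(String[] args) {
        List<Message> history = List.of(
                new Message("user", "I ordered some headphones last week."),
                new Message("assistant", "Do you have the order number?"));
        // Without the two earlier messages, "they" would be meaningless to the model.
        List<Message> request = buildRequest(history, "Are they covered by the warranty?");
        System.out.println(request.size()); // 3
    }
}
```

The request grows by two messages per turn (user input plus assistant reply), which is exactly why this approach eventually hits the limits described next.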

This works — until the context window fills up.

The context window is the maximum amount of text (measured in tokens) that a model can process in a single call. Every model has a hard limit:

| Model | Context window |
| --- | --- |
| GPT-4o-mini | 128,000 tokens (~96,000 words) |
| GPT-4o | 128,000 tokens |
| Claude Sonnet | 200,000 tokens |
| Llama 3.1 (8B) | 128,000 tokens |
| Gemini 1.5 Pro | 1,000,000 tokens |

A 128K token context window sounds enormous — and for short conversations, it is. But in a support chatbot used continuously, a conversation can accumulate thousands of tokens over dozens of turns. Add retrieved RAG context (which also consumes window space), and the context window fills faster than expected.

Two problems arise when the context window fills:

1. **The API returns an error.** You tried to send more tokens than the model accepts. The request fails hard.

2. **Costs spike silently.** Most models charge per token — both input and output. A conversation with 50K tokens of history sent on every turn costs 50K tokens of input per request, even if the new message is 20 words. Costs scale with history length, not just message length.

Caution: Naively appending all messages to every request is dangerous in long-running applications. A session that runs for an hour with one message per minute can easily exceed 50,000 tokens. Monitor token usage with ChatResponse.getMetadata().getUsage() and build a management strategy before hitting production.
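A rough pre-flight budget check can catch the hard-failure case before the API does. This is a hypothetical helper, not a Spring AI API: it uses the common approximation that English text averages about 4 characters per token, whereas a real application should read exact counts from the provider's usage metadata as noted above:

```java
import java.util.List;

// Hypothetical helper, not a Spring AI API. Uses the rough heuristic that
// English text averages ~4 characters per token; real applications should
// read exact counts from the provider's usage metadata instead.
public class TokenBudget {
    static final int CONTEXT_LIMIT = 128_000; // e.g. GPT-4o

    static int estimateTokens(String text) {
        return Math.max(1, text.length() / 4); // crude approximation
    }

    // Does the history, plus room reserved for the reply, fit the window?
    static boolean fitsInWindow(List<String> messages, int reservedForOutput) {
        int total = reservedForOutput;
        for (String m : messages) {
            total += estimateTokens(m);
        }
        return total <= CONTEXT_LIMIT;
    }

    public static void main(String[] args) {
        List<String> history = List.of(
                "I ordered some headphones last week.",
                "Do you have the order number?");
        System.out.println(fitsInWindow(history, 4_000)); // true
    }
}
```

When the check fails, that is the signal to apply one of the memory strategies below rather than sending the request and hoping.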

The three memory strategies

There is no universally correct memory strategy — each trades off between recall, cost, and complexity:

| Strategy | How it works | Best for | Drawback |
| --- | --- | --- | --- |
| Full history | Send all previous messages every turn | Short sessions, critical recall | Expensive and slow for long sessions |
| Windowed memory | Keep only the last N messages | Most chatbots | May lose early context |
| Summarisation | Summarise older history, keep recent turns in full | Long sessions | Extra LLM call to summarise |

Spring AI supports all three through its ChatMemory interface. The next posts show each in code.
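As a taste of the windowed strategy, here is a plain-Java sketch — not Spring AI's implementation; the `Message` record and `window` method are invented names. One detail worth noting: the system prompt should survive trimming, so it is pinned separately from the sliding window:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of windowed memory: keep the system prompt (if any)
// plus only the last N messages. Message and window are invented names,
// not Spring AI APIs.
public class WindowedMemory {
    public record Message(String role, String content) {}

    public static List<Message> window(List<Message> history, int lastN) {
        if (history.isEmpty()) return history;
        boolean hasSystem = history.get(0).role().equals("system");
        // Never let the window slide over the pinned system message.
        int start = Math.max(hasSystem ? 1 : 0, history.size() - lastN);
        List<Message> kept = new ArrayList<>();
        if (hasSystem) kept.add(history.get(0));
        kept.addAll(history.subList(start, history.size()));
        return kept;
    }

    public static void main(String[] args) {
        List<Message> history = List.of(
                new Message("system", "You are a support assistant."),
                new Message("user", "Turn 1"), new Message("assistant", "Reply 1"),
                new Message("user", "Turn 2"), new Message("assistant", "Reply 2"),
                new Message("user", "Turn 3"));
        // Keeps the system prompt plus the 2 most recent messages.
        System.out.println(window(history, 2).size()); // 3
    }
}
```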

The anatomy of a conversation in LLM API terms

When you send a conversation to an LLM, you send a list of messages. Each message has a role:

- **system** — instructions that define the assistant's behaviour
- **user** — what the person typed
- **assistant** — the model's earlier replies

A three-turn conversation looks like this when sent to the API:

[SYSTEM]    You are a support assistant for TechGadgets...
[USER]      I ordered some headphones last week.
[ASSISTANT] I'd be happy to help with your headphone order. Do you have the order number?
[USER]      Yes, it's TG-9821.
[ASSISTANT] I found order TG-9821. The ProX Wireless Headphones are estimated to arrive Friday.
[USER]      Are they covered by the warranty?

The model reads the full list, understands “they” refers to the ProX Wireless Headphones from order TG-9821, and answers accordingly.

Your application must build this list and send it with every new user message. That is what Spring AI’s MessageChatMemoryAdvisor does.

Why memory is separate from RAG

A common question: can’t the vector store hold conversation history, and can’t RAG retrieve it?

Memory and RAG serve different purposes:

| | Conversation memory | RAG |
| --- | --- | --- |
| Content | What was said in this conversation | Your knowledge base documents |
| Access pattern | All recent turns, in order | Top-K semantically similar chunks |
| Purpose | Continuity across turns | Grounding in factual knowledge |
| Storage | Ordered list of messages | Unordered vector embeddings |

Conversation memory must preserve order — you cannot retrieve “the last 5 messages” from a vector store meaningfully, because similarity search returns by relevance, not by recency.

Use RAG to ground answers in your knowledge base. Use chat memory to maintain continuity within a conversation. In a fully-featured assistant, you use both.

Tip: When the context window is tight, prioritise recent messages over old ones. The last 2–3 turns usually contain the most relevant context for the current question. Earlier turns can be summarised or dropped entirely without losing much conversational coherence.
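The tip above can be sketched as a compaction step. Everything here is illustrative: `compact` and `summarise` are invented names, and the stub summariser stands in for the extra LLM call that the strategies table mentions as summarisation's drawback:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: keep the most recent turns verbatim and collapse
// everything older into a single summary message. The summarise() stub
// stands in for a real LLM call.
public class SummarisingMemory {
    public record Message(String role, String content) {}

    // Stub: a real implementation would ask the model to summarise these turns.
    public static String summarise(List<Message> older) {
        return "Summary of " + older.size() + " earlier messages";
    }

    public static List<Message> compact(List<Message> history, int keepRecent) {
        if (history.size() <= keepRecent) {
            return history; // nothing old enough to summarise yet
        }
        List<Message> older = history.subList(0, history.size() - keepRecent);
        List<Message> recent = history.subList(history.size() - keepRecent, history.size());
        List<Message> result = new ArrayList<>();
        result.add(new Message("system", summarise(older)));
        result.addAll(recent);
        return result;
    }
}
```

The history shrinks to a fixed size (one summary plus `keepRecent` turns) no matter how long the session runs, at the cost of losing detail from the summarised portion.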

What Spring AI provides

Spring AI models memory through two abstractions:

ChatMemory — stores and retrieves messages for a conversation. Implementations include InMemoryChatMemory (a simple in-process store, lost on restart) and JdbcChatMemory (backed by a relational database).

MessageChatMemoryAdvisor — an advisor that injects stored conversation history into every ChatClient call, and saves the new messages (user input + model response) back to ChatMemory after each turn.

The next post wires InMemoryChatMemory into the support assistant. The post after that replaces it with JdbcChatMemory for production persistence.

Note: Memory in Spring AI is keyed by a **conversation ID**. Each user session gets a unique ID, and all messages in that session are stored and retrieved under that ID. This is how multiple concurrent users can have independent conversations through the same application instance.
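A toy illustration of that keying scheme — plain Java with invented names; Spring AI's ChatMemory interface plays this role in a real application:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy sketch of memory keyed by conversation ID (all names invented here).
// Each ID maps to its own ordered message list, so concurrent users never
// see each other's history.
public class ConversationStore {
    public record Message(String role, String content) {}

    private final Map<String, List<Message>> store = new ConcurrentHashMap<>();

    public void add(String conversationId, Message message) {
        store.computeIfAbsent(conversationId,
                id -> Collections.synchronizedList(new ArrayList<>())).add(message);
    }

    public List<Message> get(String conversationId) {
        return List.copyOf(store.getOrDefault(conversationId, List.of()));
    }
}
```

Two sessions with different IDs read and write completely independent histories through the same store instance — which is all "multiple concurrent users" requires.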
