Level 3 · RAG & Agents
8 min

Memory Systems

Models forget every conversation. Memory systems let them remember.

LLMs are stateless. Each API call is independent. To build assistants that "remember" a user across sessions, you need an explicit memory layer. There are three useful kinds.

Short-term: the context window

The current conversation, replayed every turn. Costs add up — a 50-turn chat replays 49 turns per call.

Optimisations: Prompt caching (Anthropic, OpenAI, Gemini, DeepSeek) caches the prefix at 10% cost on subsequent calls. Sliding window drops oldest turns past a budget. Summarisation replaces old turns with an LLM summary when window fills.

Long-term: facts about the user

Persistent state across sessions. User prefers metric units. User's name is Rohit.

Simple implementation: after each conversation, ask the LLM to "summarise persistent facts about the user." Store keyed by user_id. On every new conversation, inject the user's accumulated facts into the system prompt. Services like Mem0, Letta, and Zep sell this as a service.

Episodic: searchable conversation history

Sometimes you want "what did I tell you about my project last Tuesday?" That's RAG over the user's own chat history — embed every conversation, search when needed. Most powerful, most expensive.

Memory failure modes

  1. Memory pollution — bad facts accumulate. Always include timestamps and let users edit memory.
  2. Memory leak — inject everything into context, drown the model. Curate aggressively.
  3. Memory conflict — old fact contradicts new behaviour. Establish recency rules.

When to actually use memory

Most "memory" use cases are over-engineered. Before adding a memory layer ask: Does the user need persistence across sessions, or just within? Will users actively notice the absence? Can you ask once and store in a normal DB? The bar should be high. Most production assistants do fine with a structured user profile + session context.

Knowledge Check

Score 70% or higher to mark this chapter complete.

Q1.Main cost issue with naive multi-turn chat?

Q2.How does prompt caching reduce cost?

Q3.Most common pitfall in long-term user memory?

0 / 3 answered

LLMAtlas — The Open Ecosystem Workspace for LLMs