Memory Systems
Models forget every conversation. Memory systems let them remember.
LLMs are stateless. Each API call is independent. To build assistants that "remember" a user across sessions, you need an explicit memory layer. There are three useful kinds.
Short-term: the context window
The current conversation, replayed every turn. Costs add up — a 50-turn chat replays 49 turns per call.
Optimisations: Prompt caching (Anthropic, OpenAI, Gemini, DeepSeek) caches the prefix at 10% cost on subsequent calls. Sliding window drops oldest turns past a budget. Summarisation replaces old turns with an LLM summary when window fills.
Long-term: facts about the user
Persistent state across sessions. User prefers metric units. User's name is Rohit.
Simple implementation: after each conversation, ask the LLM to "summarise persistent facts about the user." Store keyed by user_id. On every new conversation, inject the user's accumulated facts into the system prompt. Services like Mem0, Letta, and Zep sell this as a service.
Episodic: searchable conversation history
Sometimes you want "what did I tell you about my project last Tuesday?" That's RAG over the user's own chat history — embed every conversation, search when needed. Most powerful, most expensive.
Memory failure modes
- Memory pollution — bad facts accumulate. Always include timestamps and let users edit memory.
- Memory leak — inject everything into context, drown the model. Curate aggressively.
- Memory conflict — old fact contradicts new behaviour. Establish recency rules.
When to actually use memory
Most "memory" use cases are over-engineered. Before adding a memory layer ask: Does the user need persistence across sessions, or just within? Will users actively notice the absence? Can you ask once and store in a normal DB? The bar should be high. Most production assistants do fine with a structured user profile + session context.