Chapter 32·Intermediate·10 min read
AI Agent Memory: How Agents Remember Across Steps
How AI agents remember — short-term context, long-term memory, and the role of vector stores. A plain-English guide to why agents forget, and the memory architectures that fix it.
June 30, 2026
In the planning chapter we glossed over something crucial: to plan across many steps, an agent has to remember what it already did. But there's a catch that trips up many people new to agents: the model itself remembers nothing. Every memory an agent appears to have is something your code is storing and re-feeding. Understanding that one fact explains every memory architecture in this chapter.
The model is stateless
When you call an LLM, it reads the text you send, predicts a response, and forgets everything. Call it again and it has no idea the first call happened. As we noted back in the agent loop chapter, on each pass through the loop the model re-reads the whole accumulated history from scratch.
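To make the statelessness concrete, here is a minimal sketch. `fake_model` is a hypothetical stand-in for a real LLM call (it is not any library's API); the only property that matters is that its output depends solely on the prompt it receives. The `history` list and the `ask` helper are likewise illustrative names.

```python
def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: the reply depends only on the prompt."""
    knows_fact = "favorite color is blue" in prompt
    was_asked = "What is my favorite color?" in prompt
    return "blue" if knows_fact and was_asked else "unknown"


# Two independent calls: the second call cannot see the first.
fake_model("My favorite color is blue.")
forgetful = fake_model("What is my favorite color?")

# The same two messages, but our code stores the history and re-feeds it.
history: list[str] = []


def ask(message: str) -> str:
    history.append(f"user: {message}")
    reply = fake_model("\n".join(history))  # the whole history, every call
    history.append(f"agent: {reply}")
    return reply


ask("My favorite color is blue.")
remembered = ask("What is my favorite color?")

print(forgetful)   # unknown
print(remembered)  # blue
```

The "memory" lives entirely in `history`, which is ordinary application state. The model function never changes between calls.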
So building agent memory is really answering one question, over and over: what does the model need to see right now to make a good next decision? Two kinds of memory answer it at different timescales.
Short-term memory: the running context
Short-term memory is the history of the current task — the goal, the actions taken, the results observed — held in the context window. It's fast and exact: the model sees the literal record of what just happened.
Its limit is size. The context window is finite (and every token costs money and slows things down), so short-term memory can't grow forever. A long-running agent will eventually overflow it.
When the running history approaches the limit, the agent has to compress it — typically by summarizing older steps into a short recap and dropping the verbose detail. The gist survives; the transcript doesn't. This is "forgetting on purpose," and doing it well is an art: summarize too aggressively and you lose the fact you needed three steps later.
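One way to sketch this compaction step, under loud assumptions: `count_tokens` approximates a token as a whitespace-separated word, and `summarize` is a placeholder for what would normally be a model call that writes the recap. Both are hypothetical helpers, not part of any real library.

```python
def count_tokens(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())


def summarize(steps: list[str]) -> str:
    # Placeholder: a real agent would ask the model to write this recap.
    return f"[recap of {len(steps)} earlier steps]"


def compact(history: list[str], max_tokens: int, keep_recent: int = 2) -> list[str]:
    """Replace older steps with a recap once the history exceeds max_tokens."""
    total = sum(count_tokens(step) for step in history)
    if total <= max_tokens or len(history) <= keep_recent:
        return history  # still fits, or nothing old enough to fold away
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(older)] + recent


steps = [
    "searched docs for refund policy",
    "opened the billing FAQ page",
    "found the 30 day rule",
    "drafted a reply to customer",
    "checked the order date field",
]
compacted = compact(steps, max_tokens=10)
print(compacted)
```

The most recent steps stay verbatim because they are the most likely to matter for the next decision; how many to keep is a tuning choice. This sketch does not guarantee that the result fits the budget, so a fuller version would re-check the size after compacting.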
Long-term memory: knowledge that outlives the run
Long-term memory is everything an agent should remember beyond a single run: a user's preferences, facts it learned last week, the outcome of a past task. This can't live in the context window — it's far too big — so it's stored externally and pulled in only when relevant.
The standard tool is a vector store. Each memory is converted into an embedding (a list of numbers capturing its meaning), and at decision time the agent searches for memories whose meaning is closest to the current situation.
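The store-then-search mechanics can be sketched in a few lines. Everything here is a toy: `embed` uses plain word counts, which capture word overlap rather than meaning, whereas a real system would call a trained embedding model. `ToyVectorStore` is a hypothetical class written for this example, not a real vector database.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy embedding: word counts. Real embeddings come from a trained model.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    def __init__(self) -> None:
        self.items: list[tuple[str, Counter]] = []

    def add(self, text: str) -> None:
        # Embed once at write time and keep the vector next to the text.
        self.items.append((text, embed(text)))

    def search(self, query: str, k: int = 1) -> list[str]:
        # Embed the query, then rank stored memories by similarity to it.
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore()
store.add("user prefers vegetarian food")
store.add("the deploy script lives in tools")
store.add("user timezone is UTC")

top = store.search("what food does the user like", k=1)
print(top)  # ['user prefers vegetarian food']
```

The shape is the part that carries over to real systems: embed at write time, embed the query at read time, and return the nearest few. A production store would also use an index rather than sorting every item on each search.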
If embeddings and vector search are new to you, don't worry — they're the entire subject of the next guide on RAG, and the mechanism is identical. For now the idea is enough: store much, retrieve little, inject only what's relevant.
Retrieval: the bridge between the two
You can't put all of long-term memory into the prompt, so memory is always paired with retrieval. Before each important decision, the agent searches its long-term store for the handful of memories most relevant to the current goal and injects only those into short-term context.
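As a rough sketch of that bridge, the function below picks the top few memories for a goal and places only those in the prompt. The `relevance` score is a stand-in (a count of shared words) for real vector similarity, and the prompt layout and all function names are assumptions made for illustration.

```python
def relevance(query: str, memory: str) -> int:
    # Stand-in score: number of shared words. A real agent would use
    # embedding similarity from its vector store instead.
    return len(set(query.lower().split()) & set(memory.lower().split()))


def retrieve(query: str, memories: list[str], k: int) -> list[str]:
    ranked = sorted(memories, key=lambda m: relevance(query, m), reverse=True)
    return ranked[:k]


def build_prompt(goal: str, history: list[str], memories: list[str], k: int = 1) -> str:
    """Inject only the k most relevant long-term memories into the context."""
    lines = ["Relevant memories:"]
    lines += [f"- {m}" for m in retrieve(goal, memories, k)]
    lines.append("History:")
    lines += history
    lines.append(f"Goal: {goal}")
    return "\n".join(lines)


long_term = [
    "last invoice was paid in March",
    "user prefers window seats on flights",
    "user is allergic to peanuts",
]
prompt = build_prompt(
    goal="book flights with good seats for the user",
    history=["agent: searched for flights to Lisbon"],
    memories=long_term,
    k=1,
)
print(prompt)
```

With `k=1`, only the seating preference reaches the model; the invoice and allergy memories stay in the store, costing nothing until a goal makes them relevant.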
| | Short-term memory | Long-term memory |
|---|---|---|
| Lives in | The context window | An external store (often a vector DB) |
| Scope | The current run | Across runs, indefinitely |
| Access | Always present | Retrieved on demand |
| Limit | Window size | Effectively unlimited |
| Cost | Tokens every call | Storage + a search per use |
This is the same money-saving logic from the tokens chapter: context is precious, so spend it only on what changes the next decision.
The context budget is everything
Every design choice in agent memory comes back to one constraint: the context window is a fixed budget, and everything competes for it — the system instructions, the tools, the running history, and the retrieved memories. More of one means less of another.
That's why good agents are aggressive editors of their own context. They summarize old steps, retrieve only the top few memories, and drop anything that won't influence what they do next. An agent that remembers everything isn't smart — it's expensive, slow, and easily distracted by irrelevant detail.
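The budgeting idea can be sketched as a simple allocation. The assumptions are deliberate simplifications: tokens are counted as whitespace-separated words, system instructions and tool descriptions are treated as fixed costs, memories arrive already sorted by relevance, and memories are placed before history. The priority order and the function name `fit_to_budget` are illustrative choices, not a standard.

```python
def count_tokens(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())


def fit_to_budget(system: str, tools: str, memories: list[str],
                  history: list[str], budget: int, max_memories: int = 1):
    """Spend a fixed token budget: fixed costs first, then memories, then history."""
    remaining = budget - count_tokens(system) - count_tokens(tools)
    if remaining < 0:
        raise ValueError("system prompt and tools alone exceed the budget")

    kept_memories = []
    for memory in memories[:max_memories]:  # assumed most relevant first
        cost = count_tokens(memory)
        if cost > remaining:
            break
        kept_memories.append(memory)
        remaining -= cost

    kept_history = []
    for step in reversed(history):  # walk from newest to oldest
        cost = count_tokens(step)
        if cost > remaining:
            break  # older steps are dropped (or summarized elsewhere)
        kept_history.append(step)
        remaining -= cost
    kept_history.reverse()  # restore chronological order

    return kept_memories, kept_history, remaining


kept_memories, kept_history, left = fit_to_budget(
    system="you are a helpful agent",
    tools="search run_code",
    memories=["user likes tea", "user lives in Oslo"],
    history=[
        "step one searched the web",
        "step two read a page",
        "step three wrote notes",
    ],
    budget=20,
)
print(kept_memories, kept_history, left)
```

With a budget of 20, the fixed costs take 7 tokens, one memory takes 3, and the two newest steps take 9, so the oldest step no longer fits. Raising the budget or shrinking any other part changes what survives, which is the trade-off the section describes.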
Recap
- The model is stateless — every agent memory is context your code stores and re-feeds.
- Short-term memory is the current run's history in the context window: exact, but size-limited.
- Long-term memory stores knowledge beyond one run, usually as embeddings in a vector store.
- Retrieval bridges them: search long-term memory and inject only the most relevant pieces.
- It all comes down to the context budget — keep what helps the next decision, compress or drop the rest.
Memory lets an agent know things. But to do things, it needs to reach outside itself — to search, run code, and call APIs. That's the role of tools. Continue to AI Agent Tool Calling.