AI Agent Memory: How Agents Remember Across Steps

How AI agents remember — short-term context, long-term memory, and the role of vector stores. A plain-English guide to why agents forget, and the memory architectures that fix it.

In the planning chapter we glossed over something crucial: to plan across many steps, an agent has to remember what it already did. But there's a catch that trips up everyone new to agents — the model itself remembers nothing. Every memory an agent appears to have is something your code is storing and re-feeding. Understanding that one fact explains every memory architecture in this chapter.

The model is stateless

When you call an LLM, it reads the text you send, predicts a response, and forgets everything. Call it again and it has no idea the first call happened. As we noted back in the agent loop, each pass through the loop the model re-reads the whole accumulated history from scratch.

So building agent memory is really answering one question, over and over: what does the model need to see right now to make a good next decision? Two kinds of memory answer it at different timescales.

Short-term memory: the running context

Short-term memory is the history of the current task — the goal, the actions taken, the results observed — held in the context window. It's fast and exact: the model sees the literal record of what just happened.

Its limit is size. The context window is finite (and every token costs money and slows things down), so short-term memory can't grow forever. A long-running agent will eventually overflow it.

System + goal

0.8K

After 5 steps

After 20 steps

18K

Window limit

32K

A single agent run accumulates context fast

When the running history approaches the limit, the agent has to compress it — typically by summarizing older steps into a short recap and dropping the verbose detail. The gist survives; the transcript doesn't. This is "forgetting on purpose," and doing it well is an art: summarize too aggressively and you lose the fact you needed three steps later.

Long-term memory: knowledge that outlives the run

Long-term memory is everything an agent should remember beyond a single run: a user's preferences, facts it learned last week, the outcome of a past task. This can't live in the context window — it's far too big — so it's stored externally and pulled in only when relevant.

The standard tool is a vector store. Each memory is converted into an embedding (a list of numbers capturing its meaning), and at decision time the agent searches for memories whose meaning is closest to the current situation.

New fact

Embed it

Store in vector DB

Later: search by meaning

Inject top matches

Long-term memory: store as embeddings, retrieve by relevance

If embeddings and vector search are new to you, don't worry — they're the entire subject of the next guide on RAG, and the mechanism is identical. For now the idea is enough: store much, retrieve little, inject only what's relevant.

Retrieval: the bridge between the two

You can't put all of long-term memory into the prompt, so memory is always paired with retrieval. Before each important decision, the agent searches its long-term store for the handful of memories most relevant to the current goal and injects only those into short-term context.

	Short-term memory	Long-term memory
Lives in	The context window	An external store (often a vector DB)
Scope	The current run	Across runs, indefinitely
Access	Always present	Retrieved on demand
Limit	Window size	Effectively unlimited
Cost	Tokens every call	Storage + a search per use

This is the same money-saving logic from the tokens chapter: context is precious, so spend it only on what changes the next decision.

The context budget is everything

Every design choice in agent memory comes back to one constraint: the context window is a fixed budget, and everything competes for it — the system instructions, the tools, the running history, and the retrieved memories. More of one means less of another.

That's why good agents are aggressive editors of their own context. They summarize old steps, retrieve only the top few memories, and drop anything that won't influence what they do next. An agent that remembers everything isn't smart — it's expensive, slow, and easily distracted by irrelevant detail.

Recap

The model is stateless — every agent memory is context your code stores and re-feeds.
Short-term memory is the current run's history in the context window: exact, but size-limited.
Long-term memory stores knowledge beyond one run, usually as embeddings in a vector store.
Retrieval bridges them: search long-term memory and inject only the most relevant pieces.
It all comes down to the context budget — keep what helps the next decision, compress or drop the rest.

Memory lets an agent know things. But to do things, it needs to reach outside itself — to search, run code, and call APIs. That's the role of tools. Continue to AI Agent Tool Calling.