Expedition 05 · Beginner · 12 min read
Tokens, Context Windows, and Why AI Sometimes Forgets
Why does AI lose track of long conversations, miscount letters, or 'forget' what you said earlier? It comes down to tokens and the context window. A clear, no-math explanation of how AI reads — and why it has a memory limit.
June 14, 2026
You're deep in a long chat with an AI. It was tracking everything beautifully — then suddenly it forgets a detail you gave it ten messages ago, or contradicts itself. Or you ask it to count the letters in "strawberry" and it gets it wrong. What's going on?
These aren't random failures. They come from two of the most important concepts in how AI works: tokens and the context window. Get these and a whole category of "why did it do that?" moments becomes obvious.
AI doesn't read letters — it reads tokens
We touched on tokens in chapter one. Now let's take them seriously, because they explain a surprising amount.
Before a model sees your text, the text is broken into tokens — chunks that are often a whole word, but sometimes a fragment. This step is called tokenization. Common words are usually a single token; rarer or longer words get split into pieces.
The model only ever works with these tokens. It does not see the individual letters inside them. That single fact explains several of AI's most famous oddities:
- Counting letters. "How many R's in strawberry?" is hard because the model sees the word as a couple of tokens, not a string of letters. It's being asked about something it can't directly perceive.
- Spelling backwards or by letter. Same reason — letter-level operations are awkward for a token-level reader.
- Different languages cost different amounts. A language that tokenizes into more pieces uses up more tokens, and so more of the model's capacity and your budget, for the same meaning.
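To make the letter-versus-token gap concrete, here is a minimal sketch of a toy tokenizer. The vocabulary, the token IDs, and the greedy longest-match rule are all assumptions made up for illustration; real tokenizers learn much larger vocabularies and may split words differently.

```python
# Toy vocabulary: token text -> token ID. Invented for illustration only.
TOY_VOCAB = {"straw": 101, "berry": 102, "the": 103, " ": 104}


def toy_tokenize(text: str) -> list[int]:
    """Split text into token IDs using greedy longest-match on TOY_VOCAB."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest remaining piece first, then shrink it.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids


word = "strawberry"
print(word.count("r"))     # letter-level view: 3
print(toy_tokenize(word))  # token-level view: [101, 102]
```

The second line of output is closer to what the model receives: two numbers, with no letters inside them to count.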
Tokens are also the unit of everything
Tokens aren't just how the model reads — they're the unit the whole system is measured and billed in:
| Thing | Measured in tokens |
|---|---|
| How much you can send | Input tokens |
| How much it can reply | Output tokens |
| What you pay | Price per 1,000 (or 1M) tokens |
| How much it can "remember" | Context window size |
A rough rule of thumb in English: 1 token ≈ ¾ of a word, or about 4 characters. So 1,000 tokens is roughly 750 words. Useful for estimating cost and whether your text will fit.
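The rule of thumb is easy to turn into a quick estimator. This is a rough sketch for English text only; the function names and the price used in the example are hypothetical, and a real tokenizer will give different counts.

```python
def estimate_tokens(text: str) -> int:
    """Rough English-only estimate: about 4 characters per token."""
    return round(len(text) / 4)


def estimate_tokens_from_words(word_count: int) -> int:
    """Rough estimate from a word count: 1 token is about 3/4 of a word."""
    return round(word_count * 4 / 3)


def estimate_cost(tokens: int, price_per_million: float) -> float:
    """Cost in whatever currency price_per_million is quoted in."""
    return tokens / 1_000_000 * price_per_million


print(estimate_tokens_from_words(750))  # 1000
print(estimate_cost(1000, 2.00))        # 0.002 (at a made-up price of 2.00 per 1M tokens)
```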
The context window: the model's working memory
Here's the crucial limit. A model can only "see" a fixed number of tokens at once. This is its context window — everything it can take into account for a single response: your prompt, the conversation so far, any documents you pasted, and the answer it's generating.
Think of it as the model's working memory, or the size of its desk. Everything it needs to reason about has to fit on the desk at the same time. Anything that doesn't fit, it cannot use.
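The desk analogy comes down to simple addition: everything on the desk has to sum to no more than the window. A minimal sketch, with a hypothetical function name and made-up numbers:

```python
def window_usage(prompt_tokens: int, history_tokens: int,
                 document_tokens: int, reply_budget: int,
                 window_size: int) -> tuple[bool, int]:
    """Return (fits, tokens_left). A negative tokens_left means overflow."""
    used = prompt_tokens + history_tokens + document_tokens + reply_budget
    return used <= window_size, window_size - used


# Hypothetical numbers for an 8,000-token window.
fits, left = window_usage(200, 3000, 4000, 1000, window_size=8000)
print(fits, left)  # False -200
```

Note that the space reserved for the reply counts too: the answer being generated shares the same window as the input.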
Context windows have grown enormously, which is why modern models can read whole documents:
| Window size | Roughly equals | Era / use |
|---|---|---|
| ~4K tokens | A few pages | Early ChatGPT |
| ~32K tokens | A long report (about 24,000 words) | Mid-generation models |
| ~128K tokens | A full-length novel or small codebase (about 96,000 words) | Common today |
| ~1M tokens | Several books or a large codebase (about 750,000 words) | Frontier long-context models |
Why AI "forgets" in long conversations
Now the forgetting makes sense. In a long chat, the conversation keeps growing, but the context window is fixed. When the conversation gets longer than the window, something has to give, and the usual rule is that the oldest text falls out first to make room for the new.
Once messages have scrolled out of the window, the model isn't ignoring them or being lazy: it genuinely cannot see them anymore. From its perspective, that part of the conversation never happened. That's why it can suddenly "forget" your name, contradict an earlier decision, or re-ask something you already answered.
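The oldest-first rule can be sketched in a few lines. This is an illustration, not how any particular product does it: the word-count token counter is a stand-in, and real apps may trim or summarize differently.

```python
def count_tokens(message: str) -> int:
    """Stand-in counter: one token per whitespace-separated word."""
    return len(message.split())


def trim_to_window(messages: list[str], window_size: int) -> list[str]:
    """Drop the oldest messages until the rest fit in the window."""
    kept = list(messages)
    while kept and sum(count_tokens(m) for m in kept) > window_size:
        kept.pop(0)  # the oldest message falls out first
    return kept


chat = ["My name is Ada", "I like green tea", "What should I drink today"]
print(trim_to_window(chat, window_size=9))
# ['I like green tea', 'What should I drink today']
```

The message containing the name is the first to go, so a model shown only the trimmed list has no way to know it.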
A bigger surprise: models have no memory at all
Here's something many people get wrong. By default, a model has no memory between requests whatsoever. Each time it generates a response, it sees only the tokens in front of it — and then it's done. It doesn't store the conversation anywhere.
So how does a chat feel continuous? The app does the work. Every time you send a new message, it quietly re-sends the entire conversation so far back to the model as part of the prompt.
The "memory" you experience is really the app refilling the desk each turn. This is also why:
- Conversations get slower and pricier as they grow — there's more to re-send and re-read every time.
- Once the history exceeds the window, the oldest parts get trimmed from the re-sent prompt, and the forgetting begins.
- "Memory" features in AI products are a separate system that saves notes about you and re-injects them — not the model truly remembering.
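The re-sending trick can be sketched with a stand-in for the model. Everything here is hypothetical (`stateless_model`, `ChatApp`, and counting messages instead of tokens); the point is only that the history lives in the app and the amount re-sent grows every turn.

```python
def stateless_model(prompt: list[str]) -> str:
    """Hypothetical stand-in for a model: it sees only the prompt it is given."""
    return f"(reply after reading {len(prompt)} messages)"


class ChatApp:
    """The app, not the model, keeps the conversation history."""

    def __init__(self) -> None:
        self.history: list[str] = []
        self.messages_sent_total = 0  # running count of messages re-sent

    def send(self, user_message: str) -> str:
        self.history.append(user_message)
        # Re-send the entire conversation so far on every turn.
        self.messages_sent_total += len(self.history)
        reply = stateless_model(list(self.history))
        self.history.append(reply)
        return reply


app = ChatApp()
app.send("Hi, I'm Ada")
print(app.send("What's my name?"))  # (reply after reading 3 messages)
print(app.messages_sent_total)      # 4
```

On turn one the app sends 1 message, on turn two it sends 3, on turn three it would send 5. The running total grows much faster than the conversation itself, which is where the extra cost and latency come from.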
Bigger windows aren't a free win
If forgetting comes from a small window, why not just make windows huge? Some are. But there are real trade-offs:
- Cost and speed. Processing more tokens takes more computation. A million-token prompt is slow and expensive every single turn.
- "Lost in the middle." Even within a large window, models often attend best to the start and end of the input and can overlook details buried in the middle of very long text.
- More isn't more relevant. Stuffing the window with marginally-related text can dilute the model's focus rather than help it.
| | Small window | Large window |
|---|---|---|
| Reads long documents | No | Yes |
| Cost per turn | Lower | Higher |
| Speed | Faster | Slower |
| Risk of missing buried details | Lower | Higher |
Working with the window
Once you see the context window as a finite resource, good habits follow naturally:
- Front-load what matters. Put key instructions and facts early; don't bury them.
- Trim the irrelevant. More text isn't more help — it's more noise and more cost.
- Restate key facts in long sessions. If something from early on is critical, repeat it so it stays in the window.
- Start fresh when the topic changes. A clean conversation beats one dragging a huge, stale history.
- Mind the total, not just your latest message — the whole conversation shares the window.
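If you are building on top of a model yourself, the "front-load" and "restate key facts" habits can be automated. The sketch below assumes a hypothetical `build_prompt` helper and a stand-in word-count token counter: key facts are always placed first, and the remaining space is filled with the most recent messages.

```python
def count_tokens(text: str) -> int:
    """Stand-in counter: one token per whitespace-separated word."""
    return len(text.split())


def build_prompt(key_facts: list[str], history: list[str],
                 window_size: int) -> list[str]:
    """Keep key facts up front, then add as many recent messages as fit."""
    budget = window_size - sum(count_tokens(f) for f in key_facts)
    recent: list[str] = []
    for message in reversed(history):  # walk from newest to oldest
        cost = count_tokens(message)
        if cost > budget:
            break  # this message and everything older is left out
        recent.append(message)
        budget -= cost
    return list(key_facts) + list(reversed(recent))


facts = ["User name: Ada"]
history = ["I like green tea", "Tell me a joke", "Another one please"]
print(build_prompt(facts, history, window_size=10))
# ['User name: Ada', 'Tell me a joke', 'Another one please']
```

The oldest message still drops out, but the pinned fact survives no matter how long the conversation gets.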
Recap
- AI reads in tokens (word-ish chunks), never individual letters — which is why it miscounts letters and struggles with spelling tricks.
- The context window is the model's fixed working memory; text outside it is invisible.
- AI "forgets" because old messages fall out of the window as the conversation grows.
- Models have no built-in memory — apps fake it by re-sending the conversation each turn.
- Bigger windows help but cost more, run slower, and can still miss buried details.
- Manage the window: be concise, relevant, and explicit.
We've now covered how AI works, where it came from, what's inside it, and two of its core limitations. For the finale, let's be honest about the rest: The honest limits — what generative AI is still bad at.