Tokens, Context Windows, and Why AI Sometimes Forgets

Why does AI lose track of long conversations, miscount letters, or 'forget' what you said earlier? It comes down to tokens and the context window. A clear, no-math explanation of how AI reads — and why it has a memory limit.

You're deep in a long chat with an AI. It was tracking everything beautifully — then suddenly it forgets a detail you gave it ten messages ago, or contradicts itself. Or you ask it to count the letters in "strawberry" and it gets it wrong. What's going on?

These aren't random failures. They come from two of the most important concepts in how AI works: tokens and the context window. Get these and a whole category of "why did it do that?" moments becomes obvious.

AI doesn't read letters — it reads tokens

We touched on tokens in chapter one. Now let's take them seriously, because they explain a surprising amount.

Before a model sees your text, the text is broken into tokens — chunks that are often a whole word, but sometimes a fragment. This step is called tokenization. Common words are usually a single token; rarer or longer words get split into pieces.

Tokenization makes unbelievable words into pieces

Tokenization: common words are one token, rarer words split into pieces

The model only ever works with these tokens. It does not see the individual letters inside them. That single fact explains several of AI's most famous oddities:

Counting letters. "How many R's in strawberry?" is hard because the model sees the word as a couple of tokens, not a string of letters. It's being asked about something it can't directly perceive.
Spelling backwards or by letter. Same reason — letter-level operations are awkward for a token-level reader.
Different languages cost differently. Languages that tokenize into more pieces use up more of the model's capacity for the same meaning.

Tokens are also the unit of everything

Tokens aren't just how the model reads — they're the unit the whole system is measured and billed in:

Thing	Measured in tokens
How much you can send	Input tokens
How much it can reply	Output tokens
What you pay	Price per 1,000 (or 1M) tokens
How much it can "remember"	Context window size

A rough rule of thumb in English: 1 token ≈ ¾ of a word, or about 4 characters. So 1,000 tokens is roughly 750 words. Useful for estimating cost and whether your text will fit.

The context window: the model's working memory

Here's the crucial limit. A model can only "see" a fixed number of tokens at once. This is its context window — everything it can take into account for a single response: your prompt, the conversation so far, any documents you pasted, and the answer it's generating.

Think of it as the model's working memory, or the size of its desk. Everything it needs to reason about has to fit on the desk at the same time. Anything that doesn't fit, it cannot use.

Context windows have grown enormously, which is why modern models can read whole documents:

Window size	Roughly equals	Era / use
~4K tokens	A few pages	Early ChatGPT
~32K tokens	A short report	Mid-generation models
~128K tokens	A long book chapter / small codebase	Common today
~1M tokens	A whole book or large codebase	Frontier long-context models

Why AI "forgets" in long conversations

Now the forgetting makes sense. In a long chat, the conversation keeps growing, but the context window is fixed. When the conversation gets longer than the window, something has to give — and the usual rule is oldest text falls out first to make room for new text.

msg 1msg 2msg 3msg 4msg 5msg 6msg 7

In a long chat, the earliest messages slide out of the window and become invisible

The dimmed messages here have scrolled out of the window. The model isn't ignoring them or being lazy — it genuinely cannot see them anymore. From its perspective, that part of the conversation never happened. That's why it can suddenly "forget" your name, contradict an earlier decision, or re-ask something you already answered.

A bigger surprise: models have no memory at all

Here's something many people get wrong. By default, a model has no memory between requests whatsoever. Each time it generates a response, it sees only the tokens in front of it — and then it's done. It doesn't store the conversation anywhere.

So how does a chat feel continuous? The app does the work. Every time you send a new message, it quietly re-sends the entire conversation so far back to the model as part of the prompt.

You send message

App prepends past messages

Model reads it all fresh

Model replies

The illusion of memory: the app re-sends the whole conversation each turn

The "memory" you experience is really the app refilling the desk each turn. This is also why:

Conversations get slower and pricier as they grow — there's more to re-send and re-read every time.
Once the history exceeds the window, the oldest parts get trimmed from the re-sent prompt, and the forgetting begins.
"Memory" features in AI products are a separate system that saves notes about you and re-injects them — not the model truly remembering.

Bigger windows aren't a free win

If forgetting comes from a small window, why not just make windows huge? Some are. But there are real trade-offs:

Cost and speed. Processing more tokens takes more computation. A million-token prompt is slow and expensive every single turn.
"Lost in the middle." Even within a large window, models often attend best to the start and end of the input and can overlook details buried in the middle of very long text.
More isn't more relevant. Stuffing the window with marginally-related text can dilute the model's focus rather than help it.

	Small window	Large window
Reads long documents	No	Yes
Cost per turn	Lower	Higher
Speed	Faster	Slower
Risk of missing buried details	Lower	Higher

Working with the window

Once you see the context window as a finite resource, good habits follow naturally:

Front-load what matters. Put key instructions and facts early; don't bury them.
Trim the irrelevant. More text isn't more help — it's more noise and more cost.
Restate key facts in long sessions. If something from early on is critical, repeat it so it stays in the window.
Start fresh when the topic changes. A clean conversation beats one dragging a huge, stale history.
Mind the total, not just your latest message — the whole conversation shares the window.

Recap

AI reads in tokens (word-ish chunks), never individual letters — which is why it miscounts letters and struggles with spelling tricks.
The context window is the model's fixed working memory; text outside it is invisible.
AI "forgets" because old messages fall out of the window as the conversation grows.
Models have no built-in memory — apps fake it by re-sending the conversation each turn.
Bigger windows help but cost more, run slower, and can still miss buried details.
Manage the window: be concise, relevant, and explicit.

We've now covered how AI works, where it came from, what's inside it, and two of its core limitations. For the finale, let's be honest about the rest: The honest limits — what generative AI is still bad at.