Chapter 16·Beginner·11 min read
The Context Window: How Much an LLM Can Actually See
What is the context window in an LLM? A clear explanation of the model's working memory — why it's a fixed token limit, why bigger windows cost more, the 'lost in the middle' problem, and how to work within the limit.
June 29, 2026
We've built the model: tokens in, embeddings, a Transformer doing the work. Now a practical question with huge consequences: how much text can the model actually consider at once? The answer is the context window, and understanding it explains a whole category of "why did it do that?" moments.
The Generative AI guide covers this from a beginner angle in tokens and context windows. Here we focus on the mechanics and trade-offs for the LLM-mechanics reader.
The model's working memory
The context window is the maximum number of tokens an LLM can take into account for a single response. Everything has to fit inside it at once:
- your prompt and instructions,
- the conversation so far,
- any documents or data you pasted in,
- and the response the model is generating.
Think of it as the size of the model's desk. Everything it needs to reason about must lie on the desk simultaneously. Anything that doesn't fit, it cannot use.
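As a rough illustration of the desk analogy, the sketch below adds up the pieces that share one window and checks whether they fit. The class name `WindowBudget`, its fields, and all the token counts are hypothetical, chosen only to make the arithmetic visible.

```python
from dataclasses import dataclass


@dataclass
class WindowBudget:
    """Token counts for everything that shares one context window (illustrative)."""
    instructions: int
    history: int
    documents: int
    reply_reserve: int  # space held back for the response being generated

    def total(self) -> int:
        # Every component draws on the same budget.
        return self.instructions + self.history + self.documents + self.reply_reserve

    def fits(self, window_size: int) -> bool:
        return self.total() <= window_size


# Made-up numbers: a short prompt, some chat history, one large pasted document.
budget = WindowBudget(instructions=500, history=6_000, documents=120_000, reply_reserve=2_000)

print(budget.total())        # 128500
print(budget.fits(128_000))  # False: 500 tokens over
print(budget.fits(200_000))  # True
```

Note that the pasted document alone would fit in a 128K window; it is the sum of all four parts that overflows.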
Why there's a limit at all
The limit comes straight from the Transformer. Recall that attention lets every token look at every other token. That's powerful, but the cost grows quadratically: double the tokens and the attention work roughly quadruples. On top of that, every token in the window has to be held in fast memory while the model runs.
So the window isn't an arbitrary cap — it's a budget set by compute and memory. That's also why expanding it is genuinely hard engineering, not a config flag.
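To see the "double the tokens, quadruple the work" claim in numbers, here is a minimal sketch that counts token-to-token comparisons as the square of the input length. The function name `attention_pairs` is made up for this example, and real implementations use various optimizations, so treat this as the shape of the growth rather than an actual cost model.

```python
def attention_pairs(num_tokens: int) -> int:
    """Count token-to-token comparisons if every token looks at every token."""
    return num_tokens * num_tokens


for n in (4_000, 8_000, 16_000):
    print(f"{n:>6} tokens -> {attention_pairs(n):>11,} comparisons")

# Doubling the input multiplies the comparison count by four.
ratio = attention_pairs(8_000) / attention_pairs(4_000)
print(ratio)  # 4.0
```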
Windows have grown dramatically as that engineering improved:
| Window size | Roughly equals | Era / use |
|---|---|---|
| ~4K tokens | A few pages | Early ChatGPT |
| ~32K tokens | A short report | Mid-generation models |
| ~128K tokens | A book chapter / small codebase | Common today |
| ~1M tokens | A whole book or large codebase | Frontier long-context models |
Why long chats "forget"
The window is fixed, but a conversation keeps growing. When the chat grows longer than the window, something has to give, and the usual rule is that the oldest text falls out first.
Once a message has scrolled out of context, the model isn't being lazy or ignoring you; it genuinely cannot see that message anymore. That's why it can suddenly forget your name, contradict an earlier decision, or re-ask something you already answered.
And remember from the Generative AI guide: the model has no memory between requests. A chat feels continuous only because the app re-sends the whole conversation each turn. The window is what caps how much of that history survives.
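One way a chat app might apply the "oldest falls out first" rule is sketched below. This is an assumption about a typical approach, not a description of any particular product; the function `trim_to_window` and the per-message token counts are invented for illustration.

```python
def trim_to_window(messages, window_size, reply_reserve=0):
    """Keep the newest messages that fit in the window; drop the oldest first.

    messages: list of (text, token_count) tuples, ordered oldest to newest.
    """
    budget = window_size - reply_reserve
    kept = []
    used = 0
    # Walk backwards from the newest message and stop at the first one that won't fit.
    for text, tokens in reversed(messages):
        if used + tokens > budget:
            break
        kept.append((text, tokens))
        used += tokens
    kept.reverse()  # restore oldest-to-newest order
    return kept


# Illustrative token counts, not real tokenizer output.
chat = [
    ("My name is Priya.", 6),
    ("Let's plan a trip to Lisbon.", 8),
    ("What's the weather like in May?", 8),
    ("And what should I pack?", 6),
]

visible = trim_to_window(chat, window_size=24)
for text, _ in visible:
    print(text)
# Let's plan a trip to Lisbon.
# What's the weather like in May?
# And what should I pack?
```

With a 24-token window, the first message no longer fits, so the model would never see the user's name on this turn.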
Bigger windows aren't a free win
If forgetting comes from a small window, why not make every window enormous? Some are. But there are real trade-offs:
- Cost and speed. You pay to process every token in the window on every turn. A million-token prompt is slow and expensive each time, not just once.
- "Lost in the middle." Even inside a large window, models tend to attend best to the start and end of the input and can overlook details buried in the middle of very long text.
- More isn't more relevant. Padding the window with marginally-related text can dilute the model's focus instead of helping.
| | Small window | Large window |
|---|---|---|
| Reads long documents | No | Yes |
| Cost per turn | Lower | Higher |
| Speed | Faster | Slower |
| Risk of missing buried details | Lower | Higher |
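The "every turn, not just once" point is easy to underestimate, so here is a small sketch of the arithmetic. It assumes the app re-sends the full history on each turn and ignores both reply tokens and any caching a provider might offer; the function name and turn sizes are hypothetical.

```python
def tokens_processed_per_turn(turn_sizes):
    """Input tokens read on each turn when the whole history is re-sent.

    turn_sizes: number of new tokens added to the conversation on each turn.
    """
    totals = []
    running = 0
    for size in turn_sizes:
        running += size         # history grows by this turn's new text
        totals.append(running)  # and the model re-reads all of it
    return totals


turns = [1_000, 1_000, 1_000, 1_000]
per_turn = tokens_processed_per_turn(turns)

print(per_turn)       # [1000, 2000, 3000, 4000]
print(sum(turns))     # 4000 tokens of new text written
print(sum(per_turn))  # 10000 tokens processed across the four turns
```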
Working with the window
Once you see the window as a finite, billable resource, good habits follow:
- Front-load what matters. Put key instructions and facts early; don't bury them.
- Trim the irrelevant. More text is more noise and more cost, not more help.
- Restate critical facts in long sessions so they stay inside the window.
- Start fresh when the topic changes rather than dragging a huge stale history.
- Mind the total, not just your latest message — the whole conversation shares the window.
These same instincts underpin good prompt engineering, where managing the window deliberately is half the craft.
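Several of these habits can be combined in code. The sketch below puts key facts first on every turn and then fills what is left of the budget with the most recent history. `build_prompt`, its parameters, and the token counts are all hypothetical; a real application would measure tokens with the model's own tokenizer.

```python
def build_prompt(key_facts, history, window_size, reply_reserve):
    """Front-load key facts, then add the newest history that still fits.

    key_facts, history: lists of (text, token_count) tuples; history is oldest first.
    """
    budget = window_size - reply_reserve
    used = sum(tokens for _, tokens in key_facts)
    if used > budget:
        raise ValueError("key facts alone exceed the available window")

    recent = []
    for text, tokens in reversed(history):  # newest first
        if used + tokens > budget:
            break
        recent.append(text)
        used += tokens
    recent.reverse()  # back to chronological order

    return [text for text, _ in key_facts] + recent


# Illustrative token counts, not real tokenizer output.
facts = [("User's name is Priya.", 6)]
history = [
    ("My name is Priya.", 6),
    ("Let's plan a trip to Lisbon.", 8),
    ("What's the weather like in May?", 8),
    ("And what should I pack?", 6),
]

prompt = build_prompt(facts, history, window_size=30, reply_reserve=6)
for line in prompt:
    print(line)
# User's name is Priya.
# What's the weather like in May?
# And what should I pack?
```

The original introduction has dropped out of the history, but the restated fact keeps the name inside the window.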
Recap
- The context window is the model's fixed working memory, measured in tokens — prompt, history, documents, and reply all share it.
- It's a hard boundary: text outside it is invisible, not merely deprioritised.
- The limit comes from the quadratic cost of attention plus memory. When a chat outgrows the window, the oldest text drops out, which is why the model "forgets."
- Bigger windows cost more and run slower, and can still miss details "lost in the middle."
- Manage the window deliberately: front-load, trim, restate, and reset.
We've mentioned that attention is what lets tokens see each other — and that it drives both the Transformer's power and the window's cost. It's time to actually explain it. Continue to Attention: how an LLM decides what matters.