The Context Window: How Much an LLM Can Actually See

What is the context window in an LLM? A clear explanation of the model's working memory — why it's a fixed token limit, why bigger windows cost more, the 'lost in the middle' problem, and how to work within the limit.

We've built the model: tokens in, embeddings, a Transformer doing the work. Now a practical question with huge consequences: how much text can the model actually consider at once? The answer is the context window, and understanding it explains a whole category of "why did it do that?" moments.

The Generative AI guide covers this from a beginner angle in its section on tokens and context windows. Here we focus on the mechanics and trade-offs, for readers who want to know how the limit actually arises.

The model's working memory

The context window is the maximum number of tokens an LLM can take into account for a single response. Everything has to fit inside it at once:

your prompt and instructions,
the conversation so far,
any documents or data you pasted in,
and the response the model is generating.

Think of it as the size of the model's desk. Everything it needs to reason about must lie on the desk simultaneously. Anything that doesn't fit, it cannot use.
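To make the "everything shares one desk" idea concrete, here is a minimal sketch of a token budget. The `WindowBudget` class and the numbers are hypothetical, chosen only for illustration; a real application would get its counts from the model's own tokenizer.

```python
from dataclasses import dataclass


@dataclass
class WindowBudget:
    """Token counts for everything that shares one context window (hypothetical helper)."""
    instructions: int
    history: int
    documents: int
    reserved_for_reply: int

    def total(self) -> int:
        # Prompt, history, documents, and the reply all draw on the same budget.
        return self.instructions + self.history + self.documents + self.reserved_for_reply

    def fits(self, window_size: int) -> bool:
        return self.total() <= window_size

    def overflow(self, window_size: int) -> int:
        # How many tokens would have to be cut before the request fits.
        return max(0, self.total() - window_size)


# Illustrative numbers only.
budget = WindowBudget(instructions=500, history=6_000, documents=120_000, reserved_for_reply=4_000)
print(budget.total())            # 130500
print(budget.fits(128_000))      # False
print(budget.overflow(128_000))  # 2500
```

Note that the pasted document alone would fit in a 128K window here; it is the sum of all four parts that overflows.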

Why there's a limit at all

The limit comes straight from the Transformer. Recall that attention lets every token look at every other token. That's powerful, but the cost grows quadratically: double the tokens and the attention work roughly quadruples. On top of that, every token in the window has to be held in fast memory while the model runs.

So the window isn't an arbitrary cap — it's a budget set by compute and memory. That's also why expanding it is genuinely hard engineering, not a config flag.
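The quadratic growth is easy to check with a little arithmetic. The sketch below simply counts token-to-token comparisons under the simplifying assumption that every token attends to every token; real implementations use masking and other optimisations, so treat the absolute numbers as illustrative and the scaling as the point. `attention_pairs` is a hypothetical helper name.

```python
def attention_pairs(num_tokens: int) -> int:
    """Token-to-token comparisons if every token looks at every token."""
    return num_tokens * num_tokens


for n in (4_000, 8_000, 128_000):
    print(f"{n:>7} tokens -> {attention_pairs(n):,} comparisons")

# Doubling the window quadruples the work.
print(attention_pairs(8_000) // attention_pairs(4_000))    # 4

# A 32x larger window means about 1024x the attention work.
print(attention_pairs(128_000) // attention_pairs(4_000))  # 1024
```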

Windows have grown dramatically as that engineering improved:

Window size	Roughly equals	Era / use
~4K tokens	A few pages	Early ChatGPT
~32K tokens	A short report	Mid-generation models
~128K tokens	A book chapter / small codebase	Common today
~1M tokens	A whole book or large codebase	Frontier long-context models

Why long chats "forget"

The window is fixed, but a conversation keeps growing. When the chat grows longer than the window, something has to give, and the usual rule is that the oldest text falls out first.

Figure: messages 1 to 7 of a long chat. The earliest messages slide out of the window and become invisible.

Once messages have scrolled out of context, the model isn't being lazy or ignoring you; it genuinely cannot see them anymore. That's why it can suddenly forget your name, contradict an earlier decision, or re-ask something you already answered.

And remember from the Generative AI guide: the model has no memory between requests. A chat feels continuous only because the app re-sends the whole conversation each turn. The window is what caps how much of that history survives.
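Here is one way an app might apply the oldest-first rule each time it re-sends the conversation. This is a sketch under stated assumptions, not any particular product's behaviour: `trim_history` is a hypothetical helper, and `count_tokens` is a crude word count standing in for a real tokenizer.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def trim_history(messages: list[str], window_size: int) -> list[str]:
    """Keep the most recent messages whose combined token count fits the window."""
    kept: list[str] = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > window_size:
            break  # this message and everything older falls out
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


chat = [
    "My name is Priya",               # 4 "tokens"
    "I prefer short answers",         # 4
    "What is a context window",       # 5
    "Explain attention in one line",  # 5
]

visible = trim_history(chat, window_size=12)
print(visible)  # ['What is a context window', 'Explain attention in one line']
```

With a 12-token window the first two messages no longer reach the model, so the name and the stated preference are simply not there to be remembered.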

Bigger windows aren't a free win

If forgetting comes from a small window, why not make every window enormous? Some are. But there are real trade-offs:

Cost and speed. You pay to process every token in the window on every turn. A million-token prompt is slow and expensive each time, not just once.
"Lost in the middle." Even inside a large window, models tend to attend best to the start and end of the input and can overlook details buried in the middle of very long text.
More isn't more relevant. Padding the window with marginally related text can dilute the model's focus instead of helping.

Figure (illustrative): recall of a fact by its position in a very long input is high near the start, lower in the middle, and high near the end.

Aspect	Small window	Large window
Reads long documents	No	Yes
Cost per turn	Lower	Higher
Speed	Faster	Slower
Risk of missing buried details	Lower	Higher
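The cost-per-turn row is worth a quick calculation, because the whole history is re-sent on every turn. The sketch below assumes each turn adds the same number of tokens, nothing is trimmed, and there is no caching; `tokens_processed_per_turn` is a hypothetical helper and the figures are illustrative, not any provider's pricing.

```python
def tokens_processed_per_turn(new_tokens_per_turn: int, turns: int) -> list[int]:
    """Input tokens processed on each turn when the full history is re-sent."""
    return [new_tokens_per_turn * turn for turn in range(1, turns + 1)]


per_turn = tokens_processed_per_turn(new_tokens_per_turn=500, turns=10)
total_processed = sum(per_turn)

print(per_turn[-1])     # 5000: the tenth turn alone carries the whole history
print(total_processed)  # 27500: total processed across all ten turns
```

Only 5,000 tokens of conversation were ever written, yet 27,500 were processed, and that gap widens as the chat gets longer.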

Working with the window

Once you see the window as a finite, billable resource, good habits follow:

Front-load what matters. Put key instructions and facts early; don't bury them.
Trim the irrelevant. More text is more noise and more cost, not more help.
Restate critical facts in long sessions so they stay inside the window.
Start fresh when the topic changes rather than dragging a huge stale history.
Mind the total, not just your latest message — the whole conversation shares the window.
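Two of these habits, front-loading and restating, can be sketched as a small prompt-assembly function. The layout below is an assumption for illustration only: `build_prompt` is a hypothetical helper, and placing key facts at both the start and the end is one plausible response to the start-and-end attention pattern described above, not a guaranteed fix.

```python
def build_prompt(
    instructions: str,
    key_facts: list[str],
    context_chunks: list[str],
    question: str,
) -> str:
    """Assemble a prompt with key material first and restated near the end."""
    facts = "\n".join(f"- {fact}" for fact in key_facts)

    sections = [instructions]                 # front-load what matters
    if key_facts:
        sections.append("Key facts:\n" + facts)
    sections.extend(context_chunks)           # bulky supporting text in the middle
    if key_facts:
        sections.append("Reminder of key facts:\n" + facts)  # restate near the end
    sections.append(question)
    return "\n\n".join(sections)


prompt = build_prompt(
    instructions="Answer using only the report below.",
    key_facts=["The budget is 10k", "The deadline is Friday"],
    context_chunks=["(report section 1)", "(report section 2)"],
    question="Can we afford the extra feature?",
)
print(prompt)
```

Trimming still comes first: pass in only the chunks that are relevant, since every chunk is billed and adds noise.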

These same instincts underpin good prompt engineering, where managing the window deliberately is half the craft.

Recap

The context window is the model's fixed working memory, measured in tokens — prompt, history, documents, and reply all share it.
It's a hard boundary: text outside it is invisible, not merely deprioritised.
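The limit comes from the quadratic cost of attention plus the memory needed to hold every token. Because the window is fixed, the oldest text drops out when a chat outgrows it, which is why the model "forgets."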
Bigger windows cost more and run slower, and can still miss details "lost in the middle."
Manage the window deliberately: front-load, trim, restate, and reset.

We've mentioned that attention is what lets tokens see each other — and that it drives both the Transformer's power and the window's cost. It's time to actually explain it. Continue to Attention: how an LLM decides what matters.