Attention Explained: How an LLM Decides What Matters

What is attention in a Transformer? A no-math explanation of the self-attention mechanism behind every LLM — how tokens decide which other tokens to focus on, what queries, keys, and values mean, and why multi-head attention exists.

We keep promising to explain attention — the mechanism the famous 2017 paper named in its title, "Attention Is All You Need." It's the beating heart of the Transformer and the reason LLMs can handle context, ambiguity, and long-range connections. Let's finally unpack it, with zero math.

The problem attention solves

Meaning depends on context. Consider:

"The trophy didn't fit in the suitcase because it was too big."

What is "it"? The trophy. Now change one word:

"The trophy didn't fit in the suitcase because it was too small."

Now "it" is the suitcase. The word "it" didn't change; the context did. To understand the sentence, the model has to let "it" look around, weigh the other words, and figure out which one it refers to.

That "look around and weigh the other words" is exactly what attention does.

Query, key, value: a search analogy

Here's the mechanism without equations. For each token, the model creates three things:

Component	Plain meaning	Analogy
Query	"What am I looking for?"	Your search box text
Key	"What do I offer / what am I about?"	A web page's title/tags
Value	"The actual information I'll share"	The page's content

Every token broadcasts a key (advertising what it's about) and holds a value (the information it can contribute). Then each token forms a query describing what it needs, and compares its query against all the keys. Where a query and a key match strongly, that token pulls in a large share of the matching token's value.

[Figure: Attention as search. A token's query is matched against all keys to gather values: the token forms a query, compares it to every key, takes a larger share of value from strong matches, and is updated with context.]

So when "it" forms its query ("what noun am I standing in for?"), that query matches strongly against "trophy" (or "suitcase", depending on the rest of the sentence) and "it" absorbs that meaning. The pronoun gets resolved not by a rule, but by learned matching.
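If you'd rather read code than equations, here is a minimal sketch of that matching step for a single token. It assumes the usual scaled dot-product formulation: a match score is the dot product of a query and a key, and a softmax turns the scores into shares that sum to 1. The vectors are hypothetical, hand-picked numbers chosen so that "it" favours "trophy"; a real model learns them during training.

```python
import math

def softmax(scores):
    """Turn raw match scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Blend the values, weighted by how well each key matches the query."""
    scale = math.sqrt(len(query))
    # Dot product answers: "how well does this key match my query?"
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Each output number is a weighted average of the corresponding value numbers.
    dims = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dims)]
    return weights, output

# Hypothetical, hand-picked vectors (assumption: a trained model would learn these).
tokens = ["trophy", "suitcase", "big"]
keys = [[2.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query_it = [2.0, 0.0]  # the query formed by "it"

weights, output = attend(query_it, keys, values)
for token, weight in zip(tokens, weights):
    print(f"{token}: {weight:.2f}")
```

The token whose key lines up best with the query gets the largest weight, so its value dominates the blend that "it" takes away.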

"Self"-attention

In an LLM, tokens attend to other tokens in the same sequence. That's why it's called self-attention — the sequence is attending to itself, rather than to some separate input.

[Figure: in "The trophy didn't fit because it was big", "it" attends most strongly to "trophy"; that's how the reference is resolved.]

Every token does this simultaneously, in parallel, which is precisely the property that makes the Transformer fast and scalable, as we saw in the Transformer chapter. It's also why the context window has a cost: every token is compared against every other token, so doubling the sequence length roughly quadruples the number of comparisons. That is where the quadratic expense comes from.
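To make the cost concrete, here is a small sketch that builds the full grid of match scores for a toy sequence and counts the comparisons at a few sequence lengths. The function names and the tiny vectors are made up for illustration; the only point is that the grid has one row per query and one column per key.

```python
def score_matrix(queries, keys):
    """Compare every query with every key: one row per query, one column per key."""
    return [[sum(q * k for q, k in zip(query, key)) for key in keys]
            for query in queries]

def comparison_count(num_tokens):
    """Number of query-key comparisons for a sequence of this length."""
    return num_tokens * num_tokens

# Hypothetical 2-number vectors for a 4-token sequence (a real model learns these).
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]

scores = score_matrix(queries, keys)
print(len(scores), "rows x", len(scores[0]), "columns")

for n in (1_000, 2_000, 4_000):
    print(f"{n} tokens -> {comparison_count(n):,} comparisons")
```

No row depends on any other row, which is why all tokens can be processed at the same time. The grid itself still grows with the square of the sequence length.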

Why "multi-head" attention

Relationships in language come in many flavours at once. In one sentence you might care about:

grammar (which verb goes with which subject),
reference (what "it" points to),
topic (what the sentence is broadly about).

A single attention operation would have to cram all of that into one pattern. So Transformers run multiple attention operations in parallel — called heads — and each head is free to learn a different kind of relationship. Their results are then combined.

[Figure (illustrative): different attention heads specialise in different relationships, for example grammar, reference, topic, and position.]

This is multi-head attention. Think of it as several readers examining the same sentence, each tracking a different dimension of meaning, then pooling their notes. It's far richer than any single pattern could be.
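Here is a rough sketch of the "several readers" idea. It makes a simplifying assumption: each head just looks at its own slice of the token vectors, whereas a real Transformer gives every head its own learned projections and mixes the pooled results with one more learned layer. The helper names and numbers are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw match scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head attention: blend values by query-key match strength."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    return [sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))]

def split_heads(vector, num_heads):
    """Cut one vector into equal slices, one per head (a simplification)."""
    size = len(vector) // num_heads
    return [vector[h * size:(h + 1) * size] for h in range(num_heads)]

def multi_head_attend(query, keys, values, num_heads):
    """Run attention once per head, then concatenate the heads' results."""
    q_parts = split_heads(query, num_heads)
    k_parts = [split_heads(k, num_heads) for k in keys]
    v_parts = [split_heads(v, num_heads) for v in values]
    combined = []
    for h in range(num_heads):
        head_out = attend(q_parts[h],
                          [k[h] for k in k_parts],
                          [v[h] for v in v_parts])
        combined.extend(head_out)  # pool this head's notes with the others
    return combined

# Hypothetical vectors: head 0 is built to favour token A, head 1 to favour token B.
query = [2.0, 0.0, 0.0, 2.0]
keys = [[2.0, 0.0, 0.0, 0.0],   # token A
        [0.0, 0.0, 0.0, 2.0]]   # token B
values = [[1.0, 0.0, 1.0, 0.0],  # token A
          [0.0, 1.0, 0.0, 1.0]]  # token B

combined = multi_head_attend(query, keys, values, num_heads=2)
print([round(x, 2) for x in combined])
```

In this toy setup the first half of the result leans on token A and the second half on token B: two heads, two different attention patterns, one pooled vector.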

How attention builds understanding

Recall that a Transformer stacks many layers. Attention happens in every layer, so context-sharing compounds. As a rough picture:

Early layers use attention to resolve local things — word forms, nearby grammar.
Middle layers connect words across the sentence — references, relationships.
Late layers integrate the whole passage — topic, intent, what should come next.

By the top of the stack, each token's embedding has been reshaped by everything relevant around it. The static "draft" meaning we mentioned in the embeddings chapter has become a fully contextual meaning. That's how "bank" by a river ends up meaning something different from "bank" near "loan": attention pulled in the disambiguating neighbours.
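The compounding effect can be sketched with a heavily simplified stack. The assumptions are loud ones: each token's vector serves as its own query, key, and value, each layer simply adds the attention blend to the token's current vector, and the feed-forward and normalisation steps of a real Transformer are left out. The two-number embeddings are hypothetical, with the first number standing for "water-ness" and the second for "money-ness".

```python
import math

def softmax(scores):
    """Turn raw match scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention_layer(vectors):
    """One simplified layer: every token adds a weighted blend of all tokens."""
    scale = math.sqrt(len(vectors[0]))
    updated = []
    for query in vectors:
        # Assumption: the token vector itself acts as query, key, and value.
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in vectors]
        weights = softmax(scores)
        blend = [sum(w * v[d] for w, v in zip(weights, vectors))
                 for d in range(len(query))]
        # Keep the token's current meaning and add the context on top.
        updated.append([x + b for x, b in zip(query, blend)])
    return updated

def run_stack(vectors, num_layers):
    """Apply the simplified layer several times in a row."""
    for _ in range(num_layers):
        vectors = self_attention_layer(vectors)
    return vectors

# Hypothetical draft embeddings: [water-ness, money-ness].
bank = [1.0, 1.0]   # ambiguous on its own
river = [2.0, 0.0]
loan = [0.0, 2.0]

bank_by_river = run_stack([bank, river], num_layers=3)[0]
bank_near_loan = run_stack([bank, loan], num_layers=3)[0]
print("bank by river:", [round(x, 2) for x in bank_by_river])
print("bank near loan:", [round(x, 2) for x in bank_near_loan])
```

The same draft vector for "bank" ends up leaning towards water in one context and towards money in the other, purely because of which neighbour it attended to.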

Recap

Attention lets each token decide how much to focus on every other token and pull in their information.
It's how the model resolves ambiguity like what a pronoun refers to — by context, not rules.
The mechanism uses queries (what I want), keys (what I offer), and values (the info shared); strong query–key matches share more value.
In LLMs it's self-attention — tokens attend to others in the same sequence, in parallel.
Multi-head attention runs several attention patterns at once, each learning a different kind of relationship.
Repeated across every layer, attention turns draft embeddings into fully contextual meaning.

We now understand the machine completely: tokens, embeddings, a Transformer, attention. But where do the billions of parameters that make it work actually come from? That's training. Continue to How LLMs are trained: pretraining and RLHF.