Chapter 21·Beginner·11 min read
Tokens in LLMs: How AI Reads, Counts, and Bills Text
What is a token in an LLM? A clear explanation of tokenization — how large language models split text into tokens, why they can't count letters, why tokens decide cost and context limits, and how to estimate token counts.
June 29, 2026
In the previous chapter we said an LLM predicts the next token. So before we go further, we need to answer: what's a token? It's one of those small ideas that, once it clicks, explains a surprising number of an LLM's quirks — its cost, its limits, and why it sometimes fails at things that look trivial.
We introduced tokens briefly in the Generative AI guide's chapter on tokens and context windows. Here we go deeper and focus on the mechanics.
The model doesn't read words or letters
You read this sentence as letters grouped into words. An LLM does neither. Before it sees your text, a step called tokenization chops the text into tokens — chunks that are often a whole common word, but frequently a word-fragment.
Common words like " the" or " is" are usually a single token (many tokenizers fold the leading space into the token). Longer or rarer words get split: tokenization might become token + ization. The model then works only with these tokens and has no direct access to the letters inside them.
That single fact is the source of several famous oddities.
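To make the splitting concrete, here is a minimal sketch of a greedy longest-match splitter. The vocabulary `TOY_VOCAB` and the function `toy_tokenize` are hypothetical and exist only for illustration; production tokenizers use a learned vocabulary and their own matching rules, so real splits will differ.

```python
# Hypothetical toy vocabulary, chosen only to illustrate the idea.
TOY_VOCAB = {" the", " is", "token", "ization", "straw", "berry"}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match split; unknown characters become one-character tokens."""
    tokens = []
    max_len = max(len(entry) for entry in vocab)
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab or size == 1:
                tokens.append(piece)
                i += size
                break
    return tokens


print(toy_tokenize("tokenization"))     # ['token', 'ization']
print(toy_tokenize(" the strawberry"))  # [' the', ' ', 'straw', 'berry']
```

Note that the output is a list of chunks. Nothing downstream of this step gets the original character sequence unless it reassembles the chunks.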
Why LLMs can't count letters or spell backwards
Ask a model "how many R's are in strawberry?" and it often stumbles. Why? Because it sees something like straw + berry (two tokens), not the ten letters you see. You're asking it to report on information it can't directly perceive.
The same reason explains why models struggle to spell words backwards, count specific characters, or do letter-by-letter wordplay reliably. These aren't bugs in reasoning — they're a consequence of the model's unit of perception being the token.
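A small sketch of the gap between what you see and what the model receives. The token IDs below are made up for illustration (real tokenizers assign different numbers and may split the word differently), and `count_letter_in_ids` is a hypothetical helper.

```python
# Hypothetical token IDs; real tokenizers use different values.
TOY_IDS = {"straw": 1001, "berry": 1002}
ID_TO_TEXT = {token_id: text for text, token_id in TOY_IDS.items()}


def count_letter_in_ids(token_ids, letter):
    """Count a letter by first mapping IDs back to text."""
    decoded = "".join(ID_TO_TEXT[token_id] for token_id in token_ids)
    return decoded.count(letter)


word = "strawberry"
model_input = [TOY_IDS["straw"], TOY_IDS["berry"]]

print(word.count("r"))                        # 3, trivial with the raw string
print(model_input)                            # [1001, 1002], no letters in sight
print(count_letter_in_ids(model_input, "r"))  # 3, but only after decoding
```

Ordinary code can decode the IDs because it has the lookup table. The model is handed only the IDs, so any knowledge of the spelling has to come from what it learned about those tokens during training.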
Tokens are the unit of everything
Tokens aren't just how the model reads. They're how the entire system is measured, limited, and billed.
| Thing | Measured in tokens |
|---|---|
| How much you can send | Input tokens |
| How much it can reply | Output tokens |
| What you pay an API | Price per 1,000 (or per 1M) tokens |
| How much it can "see" at once | Context window size (a token count) |
| How fast it responds | Tokens generated per second |
This is why "tokens" appears on every pricing page and in every model spec. When a model advertises a "200K context window," that's 200,000 tokens — not words, not characters.
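As a rough sketch of how these measurements combine, the snippet below estimates the cost of one request and checks whether it fits a context window. The prices are placeholders, not any provider's actual rates, and the check assumes the context window has to hold both the input and the reply, which is common but worth confirming for the model you use.

```python
def estimate_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost in dollars, given prices quoted per 1M tokens."""
    return (input_tokens / 1_000_000 * input_price_per_m
            + output_tokens / 1_000_000 * output_price_per_m)


def fits_context(input_tokens, max_output_tokens, context_window):
    """Assumes input and output share one context window."""
    return input_tokens + max_output_tokens <= context_window


# Placeholder prices: $3 per 1M input tokens, $15 per 1M output tokens.
cost = estimate_cost(12_000, 800, input_price_per_m=3.0, output_price_per_m=15.0)
print(f"${cost:.4f}")                         # $0.0480
print(fits_context(12_000, 800, 200_000))     # True
print(fits_context(199_500, 800, 200_000))    # False
```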
Estimating token counts
A common rule of thumb for English:
- 1 token ≈ ¾ of a word, or about 4 characters.
- So 1,000 tokens ≈ 750 words, and a page of about 375 words is ~500 tokens.
These are estimates, not promises — exact counts depend on the specific tokenizer a model uses. But for budgeting cost and checking whether your text fits, they're close enough.
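The rule of thumb above can be sketched as a tiny estimator. The function name `estimate_tokens` is hypothetical, and the two numbers it returns are ballpark figures for English text only. For exact counts you would run the tokenizer that belongs to your model.

```python
import math


def estimate_tokens(text):
    """Return two rough estimates: (from characters, from words)."""
    by_chars = math.ceil(len(text) / 4)              # ~4 characters per token
    by_words = math.ceil(len(text.split()) * 4 / 3)  # ~3/4 of a word per token
    return by_chars, by_words


sample = "The quick brown fox jumps over the lazy dog."
print(estimate_tokens(sample))  # (11, 12)
```

The two estimates rarely agree exactly. When budgeting against a hard limit, taking the larger one leaves a little headroom.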
Why some text costs more than others
Not all text tokenizes equally. The tokenizer's vocabulary was learned from data that is typically English-heavy, so it has efficient single tokens for common English and clumsier multi-token representations for most other text.
| Text type | Token efficiency | Why |
|---|---|---|
| Common English prose | Best | Frequent words are single tokens |
| Code | Worse | Symbols, indentation, and identifiers fragment |
| Numbers | Worse | Long numbers split into digit-chunks |
| Non-English / rare languages | Often worst | Underrepresented in the vocabulary |
The practical effect: the same meaning can cost noticeably more tokens in one language or format than another. If you're paying per token or fighting a context limit, this matters.
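One way to build intuition, assuming a byte-level tokenizer (a common design, but not universal): such tokenizers start from UTF-8 bytes and then merge frequent sequences. Scripts outside the basic Latin range need more bytes per character, so they start with more raw units to merge. The sketch below counts bytes, not tokens. It is only a proxy, and real token counts depend on the vocabulary.

```python
def utf8_units(text):
    """Return (characters, UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))


for sample in ["Hello", "Привет", "こんにちは"]:
    chars, nbytes = utf8_units(sample)
    print(f"{sample!r}: {chars} characters, {nbytes} bytes")
# 'Hello': 5 characters, 5 bytes
# 'Привет': 6 characters, 12 bytes
# 'こんにちは': 5 characters, 15 bytes
```

If the vocabulary contains few merged tokens for a script, the token count stays closer to the byte count, which is one reason the same greeting can cost several times more in one language than another.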
How tokenization actually happens
You don't need the internals, but a short version helps: tokenizers are built by scanning huge amounts of text and learning a fixed vocabulary of the most useful chunks, commonly with a technique called byte-pair encoding (BPE). Frequent sequences become their own token; rare ones get assembled from smaller pieces. The vocabulary (often 50K–100K+ tokens) is fixed once the tokenizer is built, before the model itself is trained.
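For readers who want to see the idea in motion, here is a minimal sketch of the merge-learning loop behind byte-pair encoding. It works on characters rather than bytes, uses a four-word corpus, and breaks ties by first occurrence. These are simplifications for illustration, and `learn_bpe_merges` is a hypothetical name, not a library function.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn merge rules from a list of words (toy byte-pair encoding)."""
    # Start with every word as a tuple of single characters, with its frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for left, right in zip(symbols, symbols[1:]):
                pair_counts[(left, right)] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)  # ties: first pair seen wins
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus


merges, corpus = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)        # [('l', 'o'), ('lo', 'w')]
print(list(corpus))  # [('low',), ('low', 'e', 'r'), ('low', 'e', 's', 't')]
```

After two merges the frequent sequence "low" has become a single symbol, while the rarer endings are still spelled out in smaller pieces. Run enough merges over enough text and you get a vocabulary of the kind described above.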
Each token then gets converted into a number (an ID), and from there into a list of numbers that carries its meaning — which is the subject of the next chapter.
Recap
- A token is the model's unit of text — usually a word, often a word-fragment. The model never sees the letters inside.
- That's why LLMs miscount letters and fumble spelling tricks — they're token-level, not character-level, readers.
- Tokens are the universal unit: context limits, pricing, and speed are all counted in tokens.
- Rule of thumb: 1 token ≈ ¾ word (~4 characters); 1,000 tokens ≈ 750 words.
- Code, numbers, and non-English text cost more tokens for the same meaning.
We've turned text into tokens. But a token ID like 4923 means nothing on its own. Next we see how the model turns tokens into meaning using embeddings. Continue to Embeddings: how LLMs turn words into meaning.