Code Safari

Chapter 21 · Beginner · 11 min read

Tokens in LLMs: How AI Reads, Counts, and Bills Text

What is a token in an LLM? A clear explanation of tokenization — how large language models split text into tokens, why they can't count letters, why tokens decide cost and context limits, and how to estimate token counts.

June 29, 2026

In the previous chapter we said an LLM predicts the next token. So before we go further, we need to answer: what's a token? It's one of those small ideas that, once it clicks, explains a surprising number of an LLM's quirks — its cost, its limits, and why it sometimes fails at things that look trivial.

We introduced tokens briefly in the Generative AI guide's chapter on tokens and context windows. Here we go deeper and focus on the mechanics.

The model doesn't read words or letters

You read this sentence as letters grouped into words. An LLM does neither. Before it sees your text, a step called tokenization chops the text into tokens — chunks that are often a whole common word, but frequently a word-fragment.

Figure: Tokenization is the first unavoidable step. Common words are one token; rarer words split into pieces.

Common words like " the" or " is" are usually a single token. Longer or rarer words get split: "tokenization" might become "token" + "ization". The model then works only with these tokens; it has no direct access to the letters inside them.
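
To make the splitting concrete, here is a minimal sketch of a tokenizer that greedily matches the longest known chunk. The vocabulary `TOY_VOCAB` and the function `toy_tokenize` are invented for illustration, and real tokenizers use a learned vocabulary and different matching rules, so treat this as a picture of the idea rather than how any particular model does it.

```python
# Toy vocabulary, invented for this example. Real vocabularies are learned
# from data and hold tens of thousands of entries.
TOY_VOCAB = {" the", " is", "token", "ization"}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Split text by greedy longest match against vocab.

    Any character that starts no known chunk becomes a one-character token,
    so every input can still be tokenized.
    """
    max_len = max(len(entry) for entry in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink until something matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab or length == 1:
                tokens.append(piece)
                i += length
                break
    return tokens


print(toy_tokenize("tokenization is the"))
```

Note that the leading space is part of tokens like " the". That is a common convention, and it is why joining the tokens gives back the original text exactly.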

That single fact is the source of several famous oddities.

Why LLMs can't count letters or spell backwards

Ask a model "how many R's are in strawberry?" and it often stumbles. Why? Because it sees something like straw + berry (two tokens), not the ten letters you see. You're asking it to report on information it can't directly perceive.

The same reason explains why models struggle to spell words backwards, count specific characters, or do letter-by-letter wordplay reliably. These aren't bugs in reasoning — they're a consequence of the model's unit of perception being the token.
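
A small sketch of the two views side by side. The character-level part is ordinary Python; the token split and the ID numbers are hypothetical, chosen only to show what the model's input looks like.

```python
def count_letter(word, letter):
    """Character-level view: trivial when you can see the letters."""
    return word.lower().count(letter.lower())


def spell_backwards(word):
    """Also trivial at the character level."""
    return word[::-1]


word = "strawberry"
print(count_letter(word, "r"))
print(spell_backwards(word))

# Token-level view. The split and the IDs below are made up for illustration;
# a real tokenizer may split the word differently.
hypothetical_ids = {"straw": 1001, "berry": 1002}
model_input = [hypothetical_ids[piece] for piece in ["straw", "berry"]]
print(model_input)  # two integers; nothing in them spells out the letters
```

The program answers instantly because it operates on characters. The model receives only the integer list, so any letter-level fact has to be recalled from training rather than read off the input.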

Tokens are the unit of everything

Tokens aren't just how the model reads. They're how the entire system is measured, limited, and billed.

Thing                            Measured in tokens
-------------------------------  -----------------------------------
How much you can send            Input tokens
How much it can reply            Output tokens
What you pay an API              Price per 1,000 (or per 1M) tokens
How much it can "see" at once    Context window size (a token count)
How fast it responds             Tokens generated per second

This is why "tokens" appears on every pricing page and in every model spec. When a model advertises a "200K context window," that's 200,000 tokens — not words, not characters.
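
Here is a rough sketch of the arithmetic behind those numbers. The prices are placeholders, not any provider's real rates, and the fit check assumes the input and the reply share one context window, which is common but worth confirming for the model you use.

```python
def request_cost(input_tokens, output_tokens,
                 input_price_per_million, output_price_per_million):
    """Cost of one request, given separate per-million-token prices."""
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000


def fits_in_context(input_tokens, max_output_tokens, context_window):
    """Assumes input and output tokens count against the same window."""
    return input_tokens + max_output_tokens <= context_window


# Placeholder prices: 3.00 per 1M input tokens, 15.00 per 1M output tokens.
print(request_cost(2_000, 500, 3.00, 15.00))
print(fits_in_context(150_000, 4_000, 200_000))
print(fits_in_context(190_000, 20_000, 200_000))
```

Output tokens are often priced higher than input tokens, which is why the two are kept as separate parameters here.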

Estimating token counts

A common rule of thumb for English:

  • 1 token ≈ ¾ of a word, or about 4 characters.
  • So 1,000 tokens ≈ 750 words, and a typical page of prose is ~500 tokens.

Roughly how many tokens different inputs consume:

Input            Approx. tokens
---------------  --------------
A tweet          ~40
This page        ~1.5K
A short report   ~4K
A novel chapter  ~8K

These are estimates, not promises — exact counts depend on the specific tokenizer a model uses. But for budgeting cost and checking whether your text fits, they're close enough.
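
The rule of thumb is easy to turn into a quick estimator. This sketch applies the two ratios directly; the function names are our own, and the results are ballpark figures for English prose, not a substitute for running the model's actual tokenizer.

```python
import math


def estimate_tokens_from_chars(text):
    """About 4 characters per token."""
    return math.ceil(len(text) / 4)


def estimate_tokens_from_words(text):
    """About 3/4 of a word per token, so tokens = words * 4 / 3."""
    word_count = len(text.split())
    return math.ceil(word_count * 4 / 3)


sample = "Tokens are how the entire system is measured, limited, and billed."
print(estimate_tokens_from_chars(sample))
print(estimate_tokens_from_words(sample))
```

The two estimates rarely agree exactly. If you are close to a limit, take the larger one and leave some headroom.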

Why some text costs more than others

Not all text tokenizes equally. The model's vocabulary was learned from data, so it has efficient single tokens for common English and clumsier multi-token representations for everything else.

Text type                     Token efficiency  Why
----------------------------  ----------------  ---------------------------------------------
Common English prose          Best              Frequent words are single tokens
Code                          Worse             Symbols, indentation, and identifiers fragment
Numbers                       Worse             Long numbers split into digit-chunks
Non-English / rare languages  Often worst       Underrepresented in the vocabulary

The practical effect: the same meaning can cost noticeably more tokens in one language or format than another. If you're paying per token or fighting a context limit, this matters.
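
You can get a feel for part of this effect without a tokenizer at all. Many modern tokenizers are byte-level: they start from the UTF-8 bytes of the text and merge frequent sequences from there. Assuming that design, the byte count is roughly the worst-case token count, reached when nothing merges. The sketch below only counts bytes; it does not measure real token counts.

```python
def utf8_profile(text):
    """Return (characters, UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))


for greeting in ["hello", "привет", "こんにちは"]:
    chars, byte_count = utf8_profile(greeting)
    print(f"{greeting}: {chars} characters, {byte_count} bytes")
```

English letters take one byte each, Cyrillic letters two, and Japanese kana three. Text in scripts that are both multi-byte and rare in the training data gets fewer merges on top of more bytes, which is how the same greeting can end up several times more expensive.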

How tokenization actually happens

You don't need the internals, but a short version helps: tokenizers are built by scanning huge amounts of text and learning a fixed vocabulary of the most useful chunks, commonly with a technique called byte-pair encoding. Frequent sequences become their own token; rare ones get assembled from smaller pieces. The vocabulary (often 50K–100K+ tokens) is fixed once the tokenizer has been trained, before the model itself is trained on it.
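
For the curious, here is a minimal sketch of the byte-pair encoding idea: start from single characters, then repeatedly merge the most frequent adjacent pair into a new chunk. The tiny word-count corpus is made up, and production tokenizers add details this leaves out (byte-level input, pre-splitting rules, special tokens).

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    # On a tie, max() keeps the pair that was counted first.
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Rewrite every word so each occurrence of pair becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus_counts, num_merges):
    """Learn up to num_merges merges from a {word: count} corpus."""
    words = {tuple(word): freq for word, freq in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words


# Made-up corpus: each word with the number of times it appears.
toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, words = train_bpe(toy_corpus, 2)
print(merges)
print(list(words))
```

With this corpus the first two merges build "es" and then "est", because that ending appears in the two most frequent long words. Scale the same loop up to billions of words and tens of thousands of merges and you get the kind of vocabulary described above.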

Raw text → Split into tokens → Map to token IDs → Into the model
Figure: Your text becomes tokens, then numbers the model can process.

Each token then gets converted into a number (an ID), and from there into a list of numbers that carries its meaning — which is the subject of the next chapter.
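
The token-to-ID step is just a lookup table. This sketch builds one from a handful of example tokens and shows the round trip; the tokens and the ID numbering are illustrative, since a real tokenizer ships with its own fixed table.

```python
def build_vocab(tokens):
    """Assign each distinct token the next unused integer ID."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Tokens to IDs: the form the model actually receives."""
    return [vocab[token] for token in tokens]


def decode(ids, vocab):
    """IDs back to text, by reversing the lookup."""
    id_to_token = {token_id: token for token, token_id in vocab.items()}
    return "".join(id_to_token[token_id] for token_id in ids)


example_tokens = ["token", "ization", " is", " the", " token"]
vocab = build_vocab(example_tokens)
ids = encode(example_tokens, vocab)
print(ids)
print(decode(ids, vocab))
```

Notice that "token" and " token" get different IDs. To the lookup table they are unrelated entries, which is one more reminder that the model works with chunks, not letters.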

Recap

  • A token is the model's unit of text — usually a word, often a word-fragment. The model never sees the letters inside.
  • That's why LLMs miscount letters and fumble spelling tricks — they're token-level, not character-level, readers.
  • Tokens are the universal unit: context limits, pricing, and speed are all counted in tokens.
  • Rule of thumb: 1 token ≈ ¾ word (~4 characters); 1,000 tokens ≈ 750 words.
  • Code, numbers, and non-English text cost more tokens for the same meaning.

We've turned text into tokens. But a token ID like 4923 means nothing on its own. Next we see how the model turns tokens into meaning using embeddings. Continue to Embeddings: how LLMs turn words into meaning.
