Tokens in LLMs: How AI Reads, Counts, and Bills Text

What is a token in an LLM? A clear explanation of tokenization — how large language models split text into tokens, why they can't count letters, why tokens decide cost and context limits, and how to estimate token counts.

In the previous chapter we said an LLM predicts the next token. So before we go further, we need to answer: what's a token? It's one of those small ideas that, once it clicks, explains a surprising number of an LLM's quirks — its cost, its limits, and why it sometimes fails at things that look trivial.

We introduced tokens briefly in the Generative AI guide's chapter on tokens and context windows. Here we go deeper and focus on the mechanics.

The model doesn't read words or letters

You read this sentence as letters grouped into words. An LLM does neither. Before it sees your text, a step called tokenization chops the text into tokens — chunks that are often a whole common word, but frequently a word-fragment.

Figure: Tokenization is the first unavoidable step. Common words are one token; rarer words split into pieces.

Common words like " the" or " is" are usually a single token. Longer or rarer words get split: tokenization might become token + ization. The model then works only with these tokens; it has no direct access to the letters inside them.

That single fact is the source of several famous oddities.
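To make the splitting concrete, here is a minimal sketch of a greedy longest-match tokenizer. The vocabulary `TOY_VOCAB` and the function `toy_tokenize` are invented for illustration; production tokenizers learn their vocabulary from data and typically apply learned merge rules rather than a plain longest-match scan, so treat this as a picture of the idea rather than how any particular model does it.

```python
# Toy vocabulary, invented for illustration. Real vocabularies are learned
# from data and hold tens of thousands of entries.
TOY_VOCAB = {" the", " is", " a", " model", "token", "ization"}


def toy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match split; unknown characters become their own token."""
    tokens = []
    max_len = max(len(piece) for piece in vocab)
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab or length == 1:
                tokens.append(piece)
                i += length
                break
    return tokens


print(toy_tokenize("tokenization", TOY_VOCAB))  # ['token', 'ization']
print(toy_tokenize(" the model", TOY_VOCAB))    # [' the', ' model']
```

Note that the common words carry their leading space inside the token, which is why the examples in the text are written as " the" and " is".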

Why LLMs can't count letters or spell backwards

Ask a model "how many R's are in strawberry?" and it often stumbles. Why? Because it sees something like straw + berry (two tokens), not the ten letters you see. You're asking it to report on information it can't directly perceive.

The same reason explains why models struggle to spell words backwards, count specific characters, or do letter-by-letter wordplay reliably. These aren't bugs in reasoning — they're a consequence of the model's unit of perception being the token.
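The gap between what you see and what the model receives can be shown in a few lines. This is a sketch only: the split into "straw" and "berry" follows the example above, and the IDs in `TOY_TOKEN_IDS` are made up, not taken from any real tokenizer.

```python
# Hypothetical token IDs, invented for illustration.
TOY_TOKEN_IDS = {"straw": 1001, "berry": 1002}


def model_view(tokens: list[str]) -> list[int]:
    """Return what the model actually receives: integer IDs, not letters."""
    return [TOY_TOKEN_IDS[t] for t in tokens]


word = "strawberry"
tokens = ["straw", "berry"]

# Ordinary code works on characters, so these tasks are trivial.
print(word.count("r"))  # 3
print(word[::-1])       # yrrebwarts

# The model's input carries no letters at all.
print(model_view(tokens))  # [1001, 1002]

# Reversing at the token level does not reverse the spelling.
print("".join(reversed(tokens)))  # berrystraw
```

Nothing in `[1001, 1002]` says how many R's are inside, which is the information the question asks for.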

Tokens are the unit of everything

Tokens aren't just how the model reads. They're how the entire system is measured, limited, and billed.

What's measured	Token-based unit
How much you can send	Input tokens
How much it can reply	Output tokens
What you pay an API	Price per 1,000 (or per 1M) tokens
How much it can "see" at once	Context window size (a token count)
How fast it responds	Tokens generated per second

This is why "tokens" appears on every pricing page and in every model spec. When a model advertises a "200K context window," that's 200,000 tokens — not words, not characters.
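Because billing and limits are both token counts, budgeting comes down to simple arithmetic. The sketch below uses placeholder prices, not any provider's real rates, and it assumes that input and output share a single context window, which is common but worth checking against the model's documentation. The function names are hypothetical.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Cost of one request when prices are quoted per 1M tokens."""
    total = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    )
    return total / 1_000_000


def fits_in_context(input_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """Assumes the prompt and the reply share one context window."""
    return input_tokens + max_output_tokens <= context_window


# Placeholder prices (per 1M tokens), chosen only to make the math easy to follow.
cost = estimate_cost(12_000, 800, input_price_per_million=3.00, output_price_per_million=15.00)
print(f"${cost:.4f}")  # $0.0480

print(fits_in_context(12_000, 800, context_window=200_000))      # True
print(fits_in_context(190_000, 20_000, context_window=200_000))  # False
```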

Estimating token counts

A common rule of thumb for English:

1 token ≈ ¾ of a word, or about 4 characters.
So 1,000 tokens ≈ 750 words, and a typical page of prose is ~500 tokens.

Roughly how many tokens different inputs consume:

Input	Approximate tokens
A tweet	~40
This page	~1.5K
A short report	~4K
A novel chapter	~8K

These are estimates, not promises — exact counts depend on the specific tokenizer a model uses. But for budgeting cost and checking whether your text fits, they're close enough.
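The rule of thumb translates directly into a quick estimator. This is a rough sketch for English text only; `estimate_tokens` is a hypothetical helper, and taking the larger of the two estimates is a cautious choice made here, not something the rule itself prescribes. For exact counts you would run the model's own tokenizer.

```python
import math


def estimate_tokens(text: str) -> int:
    """Rough English-only estimate from the 4-characters and 3/4-word rules."""
    by_chars = len(text) / 4             # about 4 characters per token
    by_words = len(text.split()) / 0.75  # about 3/4 of a word per token
    # Take the larger figure so budgets err on the safe side.
    return math.ceil(max(by_chars, by_words))


print(estimate_tokens("The quick brown fox jumps over the lazy dog."))  # 12

# 750 four-letter words land on the 1,000-token figure quoted above.
print(estimate_tokens(" ".join(["word"] * 750)))  # 1000
```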

Why some text costs more than others

Not all text tokenizes equally. The model's vocabulary was learned from data, so it has efficient single tokens for common English and clumsier multi-token representations for everything else.

Text type	Token efficiency	Why
Common English prose	Best	Frequent words are single tokens
Code	Worse	Symbols, indentation, and identifiers fragment
Numbers	Worse	Long numbers split into digit-chunks
Non-English / rare languages	Often worst	Underrepresented in the vocabulary

The practical effect: the same meaning can cost noticeably more tokens in one language or format than another. If you're paying per token or fighting a context limit, this matters.
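A crude stand-in tokenizer is enough to see the pattern in the table. The regular expression below is an assumption made for illustration: it keeps each English word (with its leading space) as one piece and breaks digits and symbols into single characters. Real tokenizers are less extreme, since they do have tokens for digit groups and common code fragments, but the direction of the effect is the same.

```python
import re

# Stand-in rule, invented for illustration: words stay whole, while every
# digit, symbol, or stray whitespace character becomes its own piece.
PIECE = re.compile(r" ?[A-Za-z]+|\d|\S|\s")


def crude_tokenize(text: str) -> list[str]:
    return PIECE.findall(text)


samples = {
    "prose": "the cat sat on the mat",
    "code": "x[i]+=1",
    "number": "31415926",
}

for label, text in samples.items():
    n = len(crude_tokenize(text))
    print(f"{label:8} {n:2d} tokens, {len(text) / n:.2f} characters per token")
```

Fewer characters per token means more tokens, and therefore more cost and more context used, for the same amount of text.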

How tokenization actually happens

You don't need the internals, but a short version helps: tokenizers are built by scanning huge amounts of text and learning a fixed vocabulary of the most useful chunks. One widely used technique for this is byte-pair encoding (BPE). Frequent sequences become their own token; rare ones get assembled from smaller pieces. The vocabulary (often 50K–100K+ tokens) is fixed once the tokenizer is built, before the model itself is trained.
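The core loop of byte-pair encoding is short enough to sketch: count adjacent pairs, merge the most frequent pair into a new symbol, and repeat. The version below is a simplified illustration on a tiny made-up word list. It works on characters rather than bytes and skips the pre-processing real tokenizers do, and the function names are hypothetical.

```python
from collections import Counter


def most_frequent_pair(words: dict[tuple[str, ...], int]):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words: dict[tuple[str, ...], int], pair: tuple[str, str]):
    """Replace every occurrence of the pair with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(word_counts: dict[str, int], num_merges: int):
    """Return the learned merges and the corpus as it looks after merging."""
    words = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:  # every word is already a single symbol
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words


# Made-up corpus: word -> number of occurrences.
merges, words = train_bpe({"token": 4, "tokens": 3, "taken": 1}, num_merges=3)
print(merges)
print(words)
```

With only a few merges, frequent fragments fuse into larger symbols while rarer words stay in pieces, which is the behavior the paragraph above describes.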

Figure: Raw text → Split into tokens → Map to token IDs → Into the model. Your text becomes tokens, then numbers the model can process.

Each token then gets converted into a number (an ID), and from there into a list of numbers that carries its meaning — which is the subject of the next chapter.

Recap

A token is the model's unit of text — usually a word, often a word-fragment. The model never sees the letters inside.
That's why LLMs miscount letters and fumble spelling tricks — they're token-level, not character-level, readers.
Tokens are the universal unit: context limits, pricing, and speed are all counted in tokens.
Rule of thumb: 1 token ≈ ¾ word (~4 characters); 1,000 tokens ≈ 750 words.
Code, numbers, and non-English text cost more tokens for the same meaning.

We've turned text into tokens. But a token ID like 4923 means nothing on its own. Next we see how the model turns tokens into meaning using embeddings. Continue to Embeddings: how LLMs turn words into meaning.