Chapter 19·Intermediate·11 min read
LLM Inference Explained: How a Model Generates Text
What is inference in an LLM? A plain-English guide to how a large language model generates text — next-token prediction, sampling, temperature, why responses stream word by word, and why the same prompt can give different answers.
June 29, 2026
You type a prompt, hit send, and words stream back. That moment — the model actually running to produce a response — is called inference. We've spent the guide building and training the model; this chapter is about watching it work. Understanding inference explains why responses stream, why the same prompt can give different answers, and what that "temperature" setting actually does.
Training vs inference
A quick but important distinction:
| | Training | Inference |
|---|---|---|
| What happens | Parameters are tuned | Parameters are used (frozen) |
| When | Once, ahead of time | Every time you send a prompt |
| Cost shape | Enormous, one-off | Smaller, but paid on every request |
| Analogy | Studying for the exam | Sitting the exam |
During training, the model's parameters change. During inference, they're frozen — the model just uses what it learned. Every chat message you send triggers an inference run.
Generation is a loop
Here's the heart of it. The model doesn't compose a whole answer at once. It generates one token at a time, in a loop:
1. Read everything so far (your prompt + tokens generated already).
2. Predict the next token.
3. Append it to the text.
4. Go back to step 1, now with that new token included.
This is exactly why responses stream out word by word in the interface — you're watching the loop run in real time. It's also why the model can sometimes "talk itself" into a good answer or a bad one: each token it generates becomes part of the input for the next, so early tokens steer everything that follows.
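The loop can be sketched in a few lines of Python. This is a toy under stated assumptions: `NEXT_TOKEN`, `predict_next`, and `generate` are hypothetical names, and the lookup table stands in for a real model, which would condition on the whole context rather than only the last token.

```python
# Toy stand-in for a model: maps the most recent token to the next one.
# A real LLM looks at the entire context; this table is purely illustrative.
NEXT_TOKEN = {
    "<start>": "the",
    "the": "cat",
    "cat": "sat",
    "sat": "<end>",
}


def predict_next(tokens):
    """Hypothetical predictor: returns the next token given the text so far."""
    return NEXT_TOKEN.get(tokens[-1], "<end>")


def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next(tokens)  # read everything so far, predict
        if next_token == "<end>":          # stop when the model signals the end
            break
        tokens.append(next_token)          # append, then loop with longer context
    return tokens


result = generate(["<start>"])
print(result)
```

Each pass through the `for` loop corresponds to one token appearing in the interface.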
Each step is a probability distribution
When the model predicts "the next token," it doesn't output a single word. It outputs a probability for every possible token in its vocabulary — a ranked list of candidates.
Something then has to choose one token from this distribution. That choosing step is called sampling, and how it's done shapes the entire character of the output.
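As a rough illustration, models typically produce a raw score (a logit) per token, and a softmax turns those scores into probabilities that sum to 1. The four-token vocabulary and the scores below are made up for the example, and `softmax` is a helper written here rather than a library call.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - highest) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


# Made-up scores for the token after "The cat sat on the"
logits = {"mat": 2.0, "sofa": 1.0, "roof": 0.5, "moon": -1.0}
probs = softmax(logits)

# The "ranked list of candidates": most likely token first
ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
for token, prob in ranked:
    print(f"{token:>5}: {prob:.3f}")
```

A real vocabulary has tens of thousands of entries or more, but the shape of the output is the same: one probability per token.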
Sampling and temperature
The simplest strategy is to always take the most likely token (called greedy decoding). It's predictable but often dull and repetitive. So instead, models usually sample: they pick each token at random, weighted by its probability, which introduces controlled randomness.
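The difference between the two strategies can be sketched as follows. The probabilities are invented for illustration, `greedy` and `sample` are hypothetical helper names, and the fixed seed is there only so the sketch behaves the same each time it runs.

```python
import random

# Made-up distribution over four candidate tokens
probs = {"mat": 0.6, "sofa": 0.25, "roof": 0.1, "moon": 0.05}


def greedy(probs):
    """Always return the single most likely token."""
    return max(probs, key=probs.get)


def sample(probs, rng):
    """Pick one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed, an assumption made for repeatability
greedy_picks = [greedy(probs) for _ in range(5)]
sampled_picks = [sample(probs, rng) for _ in range(1000)]
counts = {token: sampled_picks.count(token) for token in probs}

print(greedy_picks)  # the same token every time
print(counts)        # roughly follows the probabilities
```

Greedy never surprises you. Sampling mostly picks the favorite but occasionally takes a less likely token.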
The main dial controlling this is temperature:
| Temperature | Effect | Good for |
|---|---|---|
| Low (≈0–0.3) | Picks safe, high-probability tokens | Facts, code, classification, consistency |
| Medium (≈0.7) | Balanced | General conversation |
| High (≈1.0+) | Flattens the distribution, so less likely tokens get chosen more often | Brainstorming, creative writing |
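A common way to implement temperature is to divide each raw score by the temperature before converting scores to probabilities. The sketch below assumes that approach. The scores are made up, and `apply_temperature` is a hypothetical helper, not any particular library's API.

```python
import math


def apply_temperature(logits, temperature):
    """Scale raw scores by 1/temperature, then softmax. Requires temperature > 0."""
    scaled = {tok: score / temperature for tok, score in logits.items()}
    highest = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - highest) for tok, score in scaled.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


logits = {"mat": 2.0, "sofa": 1.0, "roof": 0.5, "moon": -1.0}  # made-up scores

low = apply_temperature(logits, 0.2)
high = apply_temperature(logits, 1.5)

for name, dist in (("low (0.2)", low), ("high (1.5)", high)):
    summary = ", ".join(f"{tok}={prob:.3f}" for tok, prob in dist.items())
    print(f"{name}: {summary}")
```

At the low setting almost all of the probability lands on the top token. At the high setting the gap between candidates narrows, so sampling wanders further from the favorite.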
A related setting, top-p (nucleus sampling), restricts the choice to the smallest set of top tokens that together cover, say, 90% of the probability — cutting off the long tail of unlikely options. Temperature and top-p are often tuned together.
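A minimal sketch of the top-p cutoff, using invented probabilities and a hypothetical helper named `top_p_filter`: sort tokens by probability, keep adding them until the running total reaches `p`, then renormalize what is left.

```python
def top_p_filter(probs, p=0.9):
    """Keep the smallest set of top tokens whose probabilities sum to at least p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = {}
    cumulative = 0.0
    for token, prob in ranked:
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break  # everything after this point is the discarded tail
    total = sum(kept.values())
    return {token: prob / total for token, prob in kept.items()}


probs = {"mat": 0.6, "sofa": 0.25, "roof": 0.1, "moon": 0.05}  # made-up values
nucleus = top_p_filter(probs, p=0.9)
print(nucleus)
```

Here the top three tokens cover 95% of the probability, so `moon` is dropped and can never be sampled at this step.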
Why the same prompt gives different answers
This trips up newcomers: ask the same question twice and you may get two different responses. Now you know why — sampling involves randomness. Unless temperature is effectively zero, the model is rolling weighted dice at each step, so different runs take different paths.
This is a feature for creativity and a nuisance for reliability. When you need reproducible output (tests, classification, structured data), turn the temperature down. When you want variety, turn it up. Temperature also matters in prompt evaluation, where you have to account for this run-to-run variation.
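The "weighted dice" can be simulated directly. In this sketch each seed stands in for one run of the model, which is an assumption made for illustration: real services draw fresh randomness on every request, and whether they let you fix a seed varies. The distribution is made up, and `sample_sequence` is a hypothetical helper.

```python
import random

probs = {"mat": 0.6, "sofa": 0.25, "roof": 0.1, "moon": 0.05}  # made-up values


def sample_sequence(seed, length=8):
    """Roll the weighted dice `length` times using a seeded random generator."""
    rng = random.Random(seed)
    tokens = list(probs)
    weights = list(probs.values())
    return [rng.choices(tokens, weights=weights)[0] for _ in range(length)]


run_a = sample_sequence(seed=1)
run_b = sample_sequence(seed=1)  # same dice rolls as run_a
run_c = sample_sequence(seed=2)  # different dice rolls

print(run_a)
print(run_b)
print(run_c)
```

Identical randomness gives identical output, while a different sequence of rolls is free to take a different path.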
Why inference costs what it does
Every generated token is a full pass through the model. So:
- Longer outputs are slower and pricier — each token is real computation.
- Longer inputs cost too: the model processes your whole prompt before generating, and every new token has to take the growing context into account (caching avoids redoing most of that work, but the cost is real).
- Speed is measured in tokens per second — the rate the loop runs.
This is the practical reason to be concise: you're paying, in time and money, for every token in and every token out.
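A back-of-the-envelope estimate makes the arithmetic concrete. Every number below is a placeholder chosen for easy math, not a real price or speed, and the split into separate input and output prices is an assumption about how a provider might bill. `estimate_request` is a hypothetical helper.

```python
def estimate_request(input_tokens, output_tokens,
                     price_in_per_1k, price_out_per_1k, tokens_per_second):
    """Return (cost, seconds spent generating) for one request."""
    cost = (input_tokens / 1000 * price_in_per_1k
            + output_tokens / 1000 * price_out_per_1k)
    # Ignores the time spent processing the prompt before the first token.
    generation_seconds = output_tokens / tokens_per_second
    return cost, generation_seconds


# Placeholder figures: 2,000 tokens in, 500 tokens out, 50 tokens per second
cost, seconds = estimate_request(
    input_tokens=2000,
    output_tokens=500,
    price_in_per_1k=0.001,
    price_out_per_1k=0.002,
    tokens_per_second=50,
)
print(f"cost: {cost:.4f}, generation time: {seconds:.1f}s")
```

Doubling the output length doubles both the output portion of the cost and the time spent generating.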
Recap
- Inference is the model running on frozen parameters to generate a response — the exam, not the studying.
- It generates one token at a time in a loop, which is why responses stream word by word.
- Each step produces a probability distribution over all tokens; a sampling step chooses one.
- Temperature controls focus vs variety: low for consistent/factual, high for creative.
- Sampling randomness is why the same prompt can give different answers — lower the temperature for reproducibility.
- Inference costs per token, in and out, so length directly drives speed and price.
We've now followed an LLM from raw text all the way to a streamed response. Before we finish, we have to be honest about what this remarkable machine still can't do. Continue to LLM limitations: what large language models can't do.