LLM Inference Explained: How a Model Generates Text

What is inference in an LLM? A plain-English guide to how a large language model generates text — next-token prediction, sampling, temperature, why responses stream word by word, and why the same prompt can give different answers.

You type a prompt, hit send, and words stream back. That moment — the model actually running to produce a response — is called inference. We've spent the guide building and training the model; this chapter is about watching it work. Understanding inference explains why responses stream, why the same prompt can give different answers, and what that "temperature" setting actually does.

Training vs inference

A quick but important distinction:

Aspect	Training	Inference
What happens	Parameters are tuned	Parameters are used (frozen)
When	Once, ahead of time	Every time you send a prompt
Cost shape	Enormous, one-off	Smaller, but paid on every request
Analogy	Studying for the exam	Sitting the exam

During training, the model's parameters change. During inference, they're frozen — the model just uses what it learned. Every chat message you send triggers an inference run.

Generation is a loop

Here's the heart of it. The model doesn't compose a whole answer at once. It generates one token at a time, in a loop:

1. Read everything so far (your prompt + tokens generated already).
2. Predict the next token.
3. Append it to the text.
4. Go back to step 1, now with that new token included.

[Diagram: Read text so far → Predict next token → Append it → Repeat until done. Caption: Inference is a loop: predict a token, append it, predict again.]

This is exactly why responses stream out piece by piece in the interface: you're watching the loop run in real time, one token (a word or word fragment) at a time. It's also why the model can sometimes "talk itself" into a good answer or a bad one: each token it generates becomes part of the input for the next, so early tokens steer everything that follows.
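The loop can be sketched in a few lines of Python. This is a toy illustration, not a real model: `toy_next_token` and its lookup table are hypothetical stand-ins, whereas a real LLM would run a neural network over the whole sequence at each step.

```python
# Toy sketch of the generation loop. The lookup table stands in for a real
# model and is purely illustrative.
NEXT_TOKEN = {
    "The": "sky",
    "sky": "is",
    "is": "blue",
    "blue": "<end>",
}

def toy_next_token(tokens):
    """Hypothetical 'model': predicts the next token from the last one."""
    return NEXT_TOKEN.get(tokens[-1], "<end>")

def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        nxt = toy_next_token(tokens)   # read everything so far, predict the next token
        if nxt == "<end>":             # the "model" signals that it is done
            break
        tokens.append(nxt)             # append it, then loop with the longer text
    return tokens

print(generate(["The"]))  # ['The', 'sky', 'is', 'blue']
```

The `max_new_tokens` cap mirrors the length limit that real systems place on the loop so it cannot run forever.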

Each step is a probability distribution

When the model predicts "the next token," it doesn't output a single word. It outputs a probability for every possible token in its vocabulary — a ranked list of candidates.

[Chart: After 'The sky is', the model's probabilities for the next token. Bars: blue 62%; clear 14%; falling; grey; (others) 12%.]

Something then has to choose one token from this distribution. That choosing step is called sampling, and how it's done shapes the entire character of the output.
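Under the hood, a model typically emits a raw score (a logit) for each token, and a softmax function turns those scores into probabilities that sum to 1. A minimal sketch, assuming made-up scores for four tokens (the numbers are illustrative only and are not real model output):

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical scores for the token after "The sky is"
logits = {"blue": 3.0, "clear": 1.5, "falling": 0.5, "grey": 0.2}
probs = softmax(logits)

# Print the candidates from most to least likely
for tok, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{tok:8s} {p:.2f}")
```

A real vocabulary has tens of thousands of entries or more, but the calculation has the same shape.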

Sampling and temperature

The simplest strategy is "always take the most likely token" (called greedy decoding). It's predictable but often dull and repetitive. So instead, models usually sample: they pick a token at random, with each token's chance of being chosen equal to its probability. This introduces controlled randomness.
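The difference between the two strategies is easy to see in code. The sketch below uses Python's standard `random.choices` for the weighted draw; the probabilities and the function names `greedy` and `sample` are illustrative assumptions, not any particular library's API.

```python
import random

# Illustrative probabilities, not real model output
probs = {"blue": 0.6, "clear": 0.2, "falling": 0.1, "grey": 0.1}

def greedy(probs):
    """Always return the single most likely token."""
    return max(probs, key=probs.get)

def sample(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = list(probs.values())
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random()
print(greedy(probs))                            # always 'blue'
print([sample(probs, rng) for _ in range(8)])   # varies with the random draws
```

Run it a few times: the greedy line never changes, while the sampled list does.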

The main dial controlling this is temperature:

Temperature	Effect	Good for
Low (≈0–0.3)	Picks safe, high-probability tokens	Facts, code, classification, consistency
Medium (≈0.7)	Balanced	General conversation
High (≈1.0+)	Keeps the model's raw odds at 1.0 and flattens them above it; bolder choices	Brainstorming, creative writing
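One common way to apply temperature, sketched here under the assumption that raw scores (logits) are available, is to divide each score by the temperature before the softmax. Values below 1 sharpen the distribution, values above 1 flatten it, and 1.0 leaves it unchanged. Division by zero is impossible, so this sketch assumes that a temperature of 0 means greedy selection; the scores and the function name `apply_temperature` are made up for illustration.

```python
import math

def apply_temperature(logits, temperature):
    """Return token probabilities after temperature scaling."""
    if temperature <= 0:
        # Assumption: treat 0 as greedy, with all probability on the top token
        best = max(logits, key=logits.get)
        return {tok: 1.0 if tok == best else 0.0 for tok in logits}
    scaled = {tok: v / temperature for tok, v in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

logits = {"blue": 3.0, "clear": 1.5, "falling": 0.5, "grey": 0.2}  # made-up scores
for t in (0.2, 0.7, 1.5):
    probs = apply_temperature(logits, t)
    print(t, {tok: round(p, 3) for tok, p in probs.items()})
```

At the low setting nearly all the probability lands on the top token; at the high setting the alternatives get a meaningful share.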

A related setting, top-p (nucleus sampling), restricts the choice to the smallest set of top tokens that together cover, say, 90% of the probability — cutting off the long tail of unlikely options. Temperature and top-p are often tuned together.
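A minimal sketch of the top-p idea follows. It sorts tokens by probability, keeps them until the running total reaches the threshold, and renormalises what is left. The function name `top_p_filter` and the probabilities are illustrative assumptions, and real implementations differ in details such as tie handling.

```python
def top_p_filter(probs, p=0.9):
    """Keep the smallest set of top tokens whose probabilities reach p,
    then renormalise so the kept probabilities sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:   # enough probability mass covered: drop the tail
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

# Illustrative probabilities, not real model output
probs = {"blue": 0.5, "clear": 0.25, "falling": 0.125, "grey": 0.0625, "green": 0.0625}
print(top_p_filter(probs, p=0.9))   # the last token in the ranking is cut off
```

Sampling then proceeds as usual, but only among the tokens that survived the cut.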

Why the same prompt gives different answers

This trips up newcomers: ask the same question twice and you may get two different responses. Now you know why — sampling involves randomness. Unless temperature is effectively zero, the model is rolling weighted dice at each step, so different runs take different paths.

This is a feature for creativity and a nuisance for reliability. When you need reproducible output (tests, classification, structured data), turn the temperature down; this narrows the variation, though some systems can still show small differences between runs. When you want variety, turn it up. Temperature is also a major lever in prompt evaluation, where you have to account for this run-to-run variation.
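The effect is easy to reproduce with a toy generator. In the sketch below, the `SCORES` table is a hypothetical stand-in for a model, and temperature 0 is assumed to mean greedy selection. It completes the same prompt 20 times at each setting and counts how many different completions come back.

```python
import math
import random

# Hypothetical next-token scores, keyed by the previous token (toy data)
SCORES = {
    "sky": {"is": 2.0, "was": 1.0},
    "is": {"blue": 2.0, "clear": 1.2, "grey": 0.8},
    "was": {"blue": 1.5, "dark": 1.4},
}

def pick(scores, temperature, rng):
    if temperature == 0:
        return max(scores, key=scores.get)   # greedy: no randomness involved
    peak = max(scores.values())
    weights = [math.exp((s - peak) / temperature) for s in scores.values()]
    return rng.choices(list(scores), weights=weights, k=1)[0]

def complete(start, temperature, rng):
    tokens = [start]
    while tokens[-1] in SCORES:              # stop at a token with no continuation
        tokens.append(pick(SCORES[tokens[-1]], temperature, rng))
    return " ".join(tokens)

rng = random.Random()                        # unseeded, so each run of the script differs
for t in (0, 1.0):
    outputs = {complete("sky", t, rng) for _ in range(20)}
    print(f"temperature={t}: {len(outputs)} distinct completion(s)")
```

At temperature 0 every run follows the same path. Passing a fixed seed, as in `random.Random(42)`, also makes the sampled runs repeatable in this toy, which is the idea behind the seed option that some tools offer.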

Why inference costs what it does

Every generated token is a full pass through the model. So:

Longer outputs are slower and pricier — each token is real computation.
Longer inputs cost too: the model processes your whole prompt and conversation history before generating, and each new token is computed against the growing context (caching avoids redoing most of that work, but the cost is real).
Speed is measured in tokens per second — the rate the loop runs.

This is the practical reason to be concise: you're paying, in time and money, for every token in and every token out.
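A back-of-the-envelope estimate makes the point concrete. The sketch below assumes a pricing scheme with separate per-million-token rates for input and output; the rates, token counts, and speed are hypothetical placeholders, not any provider's real figures.

```python
def estimate_cost(input_tokens, output_tokens, price_in_per_million, price_out_per_million):
    """Estimated price of one request, given per-million-token prices."""
    return (input_tokens * price_in_per_million
            + output_tokens * price_out_per_million) / 1_000_000

def estimate_seconds(output_tokens, tokens_per_second):
    """Rough generation time: output length divided by the loop's speed."""
    return output_tokens / tokens_per_second

# Hypothetical request: 2,000 tokens in, 500 tokens out
print(estimate_cost(2_000, 500, price_in_per_million=1.0, price_out_per_million=4.0))  # 0.004
print(estimate_seconds(500, tokens_per_second=50))                                     # 10.0
```

Doubling the output length doubles both the output cost and the wait, which is why trimming unnecessary length pays off on every request.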

Recap

Inference is the model running on frozen parameters to generate a response — the exam, not the studying.
It generates one token at a time in a loop, which is why responses stream out token by token.
Each step produces a probability distribution over all tokens; a sampling step chooses one.
Temperature controls focus vs variety: low for consistent/factual, high for creative.
Sampling randomness is why the same prompt can give different answers — lower the temperature for reproducibility.
Inference costs per token, in and out, so length directly drives speed and price.

We've now followed an LLM from raw text all the way to a streamed response. Before we finish, we have to be honest about what this remarkable machine still can't do. Continue to LLM limitations: what large language models can't do.