How Diffusion Models Work: From Pure Noise to a Picture, Step by Step

How does a diffusion model actually generate an image? A plain-English walkthrough of the noising-and-denoising trick — how the model trains by destroying images, why generation runs the film backwards, and what the 'steps' setting really does.

The previous chapter left you with a suspicious claim: modern image generators work by removing noise from random static until a picture appears. This chapter makes that claim make sense — because once you see the training trick, the whole thing becomes almost obvious.

Start with the film running forwards

Forget generation for a moment. Take a real photo of a corgi and do something destructive to it: add a small amount of random noise. It's now a slightly grainy corgi. Add more noise. Grainier. Keep going — a few hundred rounds of this — and the corgi is gone entirely. What's left is indistinguishable from television static.

That's the forward process: image → noise. It's trivial, mechanical, and requires no intelligence at all. Anyone can destroy a photo.
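To see how mechanical the destruction is, here is a minimal sketch in Python with NumPy. It makes assumptions the text above doesn't state: the noise is Gaussian, each round slightly shrinks the image before mixing noise in, and `beta` (the dose per round) is an illustrative value. The function name `add_noise_step` and the gradient "photo" are stand-ins, not part of any real library.

```python
import numpy as np

def add_noise_step(image, beta, rng):
    """One forward round: slightly shrink the image and mix in fresh Gaussian noise."""
    noise = rng.standard_normal(image.shape)
    return np.sqrt(1.0 - beta) * image + np.sqrt(beta) * noise

rng = np.random.default_rng(0)
original = np.linspace(-1.0, 1.0, 64 * 64).reshape(64, 64)  # stand-in "photo": a smooth gradient

# One round: a slightly grainy version that still closely tracks the original.
one_round = add_noise_step(original, beta=0.03, rng=rng)

# A few hundred rounds: almost nothing of the original survives.
noisy = original
for _ in range(300):
    noisy = add_noise_step(noisy, beta=0.03, rng=rng)

def correlation(a, b):
    """Pearson correlation between two images, pixel by pixel."""
    return np.corrcoef(a.ravel(), b.ravel())[0, 1]

print("after 1 round:   ", correlation(original, one_round))
print("after 300 rounds:", correlation(original, noisy))
```

The correlation with the original starts close to 1 and ends close to 0, which is the numeric version of "the corgi is gone entirely."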

The insight behind diffusion models is that this boring destruction creates a perfect training curriculum for the reverse.

The training trick

Here's what training actually looks like, millions of times over:

1. Take a real image from the training set.
2. Add a known, randomly chosen amount of noise to it.
3. Show the model the noisy result (plus its caption).
4. Ask: "what noise was added?"
5. Compare the model's guess to the noise you actually added (you know it, because you added it) and adjust the model's parameters to make the guess better.

[Figure: Real image + caption → Add a known dose of noise → Model predicts the noise → Correct it against the truth. Caption: One training example: destroy an image, then ask the model to identify the destruction.]
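One round of that loop can be sketched in a few lines of NumPy. Treat everything specific here as an assumption made for illustration: the 1,000 noise levels, the linear `betas` schedule, the mean-squared-error loss, and above all the "model," which is a single matrix of weights standing in for a real neural network. The caption and the noise level, which a real model would also receive as inputs, are left out to keep the sketch short.

```python
import numpy as np

T = 1000                                  # assumed number of noise levels
betas = np.linspace(1e-4, 0.02, T)        # assumed linear noise schedule
alpha_bars = np.cumprod(1.0 - betas)      # share of the original signal left at each level

def make_training_example(image, rng):
    """Pick a random noise level and add that known dose of noise to a real image."""
    t = int(rng.integers(0, T))
    noise = rng.standard_normal(image.shape)
    noisy = np.sqrt(alpha_bars[t]) * image + np.sqrt(1.0 - alpha_bars[t]) * noise
    return noisy, t, noise                # `noise` is the free answer key

def training_step(weights, image, rng, lr=0.1):
    """Guess the noise, compare against the answer key, nudge the weights."""
    noisy, t, noise = make_training_example(image, rng)
    x = noisy.ravel()
    residual = weights @ x - noise.ravel()            # the model's guess minus the truth
    loss_before = np.mean(residual ** 2)
    grad = (2.0 / x.size) * np.outer(residual, x)     # gradient of the mean squared error
    new_weights = weights - lr * grad
    loss_after = np.mean((new_weights @ x - noise.ravel()) ** 2)
    return new_weights, loss_before, loss_after

rng = np.random.default_rng(0)
image = np.linspace(-1.0, 1.0, 64).reshape(8, 8)      # stand-in "photo"
weights = np.zeros((64, 64))                          # untrained: always guesses "no noise"
weights, loss_before, loss_after = training_step(weights, image, rng)
print(loss_before, "->", loss_after)
```

The loss on that example drops after the update. Real training repeats this with a far larger model, real images, and fresh noise every time.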

Notice how clean this is. Every training example comes with a perfect answer key, the exact noise you added, manufactured for free. Nobody has to label the noise by hand, and there is no adversarial duel of the kind GANs rely on. Just billions of rounds of "spot the noise," at every noise level from nearly clean to nearly pure static.

And here's the sleight of hand: to get good at spotting noise, the model is forced to learn what images look like. In a heavily noised picture captioned "a corgi on a beach," the only way to say which speckles are noise is to know what corgis, sand, and sunlight should look like. The noise-prediction task smuggles in a complete education about the visual world — the same way next-token prediction smuggles world knowledge into LLMs.

Now run the film backwards

Generation is the reverse process, and after training it's almost anticlimactic:

1. Fill a canvas with pure random noise: no image underneath, just static.
2. Ask the model: "what part of this is noise?" (conditioned on your prompt; more on that in chapter 4).
3. Subtract a fraction of its answer. The canvas now looks microscopically less like static and microscopically more like an image.
4. Repeat, typically 20–50 times, each pass removing more noise.
5. Stop when no noise budget remains: the canvas is now a coherent picture.
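The loop above can be sketched in NumPy as follows. Two loud assumptions. First, `predict_noise` is not a trained network: it is a toy stand-in that gives the exact answer for a "training set" containing one known image, which keeps the example runnable. Second, "subtract a fraction of its answer" is implemented here as one particular deterministic update (in the style of DDIM samplers); real tools offer many variants. The noise schedule is also illustrative.

```python
import numpy as np

T = 1000                                  # assumed number of noise levels
betas = np.linspace(1e-4, 0.02, T)        # assumed linear noise schedule
alpha_bars = np.cumprod(1.0 - betas)

target = np.linspace(-1.0, 1.0, 64).reshape(8, 8)   # the only "image" this toy model knows

def predict_noise(canvas, t):
    """Toy stand-in for a trained model: the exact noise if the clean image is `target`."""
    return (canvas - np.sqrt(alpha_bars[t]) * target) / np.sqrt(1.0 - alpha_bars[t])

def generate(seed, steps=30):
    rng = np.random.default_rng(seed)
    canvas = rng.standard_normal(target.shape)                  # pure static
    timesteps = np.linspace(T - 1, 0, steps).round().astype(int)
    for i, t in enumerate(timesteps):
        eps = predict_noise(canvas, t)                          # "what part of this is noise?"
        # The clean image implied by the model's answer at this noise level.
        clean_guess = (canvas - np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alpha_bars[t])
        if i + 1 < len(timesteps):
            # Re-noise the guess to the next, lower noise level: a little less static.
            t_next = timesteps[i + 1]
            canvas = (np.sqrt(alpha_bars[t_next]) * clean_guess
                      + np.sqrt(1.0 - alpha_bars[t_next]) * eps)
        else:
            canvas = clean_guess                                # no noise budget left
    return canvas

picture = generate(seed=1, steps=30)
print(np.abs(picture - target).max())     # tiny: the static has become the image
```

Because this toy "model" knows exactly one image, every seed ends up at that same image. A real model has learned a vast range of plausible images, so the same loop lands somewhere different depending on where it starts.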

The beautiful weirdness: in step 2, the model is hallucinating in the most productive possible way. There is no corgi under the static, but the model, trained to see images under noise, behaves as if there is one, and its systematic "mistake" assembles a corgi over a few dozen steps. The model dreams an image into the noise, a nudge at a time.

Where does the image "come from"?

If the trained model is the same on every run, and the prompt is just a sentence, what decides that this corgi has these ears at this angle on this beach?

The starting noise. That initial canvas of static, generated from a number called the seed, is the block of marble. The prompt tells the model what to carve; the seed determines which specific statue emerges. Different seeds land in different valleys of the space of all plausible corgi-beach images.

This is why:

Regenerating gives variations, not repeats — new seed, new marble.
Fixing the seed gives reproducibility: same seed + same prompt + same settings (on the same model and software) = the same image, which is how communities share exact recipes.
Small prompt tweaks with a fixed seed produce eerily controlled edits — you're re-carving the same block with slightly different instructions.
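The seed behaviour is easy to check for yourself. This is a minimal sketch using NumPy's random generator; `starting_canvas` is a hypothetical helper written for this example, and real tools may build their starting noise differently (for instance on the GPU), so take it as an illustration of the principle rather than of any specific tool.

```python
import numpy as np

def starting_canvas(seed, shape=(64, 64)):
    """The starting static for a run, fully determined by the seed."""
    return np.random.default_rng(seed).standard_normal(shape)

same_a = starting_canvas(seed=42)
same_b = starting_canvas(seed=42)
different = starting_canvas(seed=43)

print(np.array_equal(same_a, same_b))      # identical marble block
print(np.array_equal(same_a, different))   # a different block
```

Same seed, same static, pixel for pixel. Change the seed by one and the entire canvas changes, which is why regenerating gives a new image rather than a near-copy.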

What the "steps" setting really does

Most tools expose a steps (or "quality") control. Now you know exactly what it is: how many denoising passes the model makes on the way from static to picture.

Steps	What you get
~10	Fast, soft, often smudgy — the noise wasn't fully resolved
~20–30	The sweet spot for most modern models
~50	Marginal gains on fine detail
100+	Mostly a slower version of 50

Quality climbs steeply early and then flattens — each extra step refines less than the one before. Modern samplers (the schedulers that decide how big each denoising bite is) are largely about getting to a great image in fewer steps; some distilled models now produce strong results in 4–8.
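One way to picture the trade-off: fewer steps means each step has to cover a bigger jump in noise level. The sketch below assumes a model trained with 1,000 noise levels and a sampler that visits evenly spaced levels. Both are assumptions for illustration; `sampling_timesteps` is a hypothetical helper, and real samplers often space their steps differently.

```python
import numpy as np

def sampling_timesteps(steps, train_levels=1000):
    """Pick `steps` evenly spaced noise levels, from pure static down to clean."""
    return np.linspace(train_levels - 1, 0, steps).round().astype(int)

def biggest_bite(steps):
    """Largest jump in noise level that any single denoising pass has to cover."""
    return int(np.max(-np.diff(sampling_timesteps(steps))))

for steps in (10, 25, 50):
    print(steps, "steps -> biggest bite:", biggest_bite(steps), "noise levels")
```

With 10 steps the schedule is 999, 888, 777, and so on down to 0, so every pass covers 111 noise levels at once. At 50 steps the bites are roughly a fifth of that size, which leaves less room for error per pass.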

Recap

Training is "spot the noise": add a known dose of noise to a real image, ask the model to predict it, correct against the free, perfect answer key.
That task forces the model to learn what images look like — you can't separate noise from corgi without knowing corgis.
Generation runs the film backwards: start from pure static and subtract predicted noise over ~20–50 steps, letting the model dream structure into randomness.
The seed is the marble block — it's why outputs vary, why fixed seeds reproduce exactly, and where each specific image "comes from."
The steps setting = number of denoising passes; quality saturates quickly, so more isn't always better.

One puzzle remains: running dozens of denoising passes over millions of pixels should be brutally slow, yet Stable Diffusion runs on a gaming laptop. The trick is that the denoising doesn't happen on pixels at all. Continue to Latent diffusion, explained.