Chapter 86 · Beginner · 10 min read
What Is AI Image Generation? How Text Becomes a Picture
What is AI image generation, really? A plain-English explanation of how tools like Midjourney, DALL·E, and Stable Diffusion turn a text prompt into a picture — what a diffusion model is, how we got here from GANs, and what's actually happening when you hit generate.
July 15, 2026
Type "a corgi astronaut, oil painting, dramatic lighting" into Midjourney and thirty seconds later you're looking at a painting that has never existed before — brushstrokes, fur, helmet reflections and all. It feels like magic, or theft, or both.
It's neither. This guide explains, chapter by chapter and with no math, how AI image and video generation actually works. We start with the shape of the whole thing: what these systems are, and what's really happening when you hit generate.
The simplest accurate definition
An AI image generator is a model that has learned what images look like — statistically, at every scale, from "grass is usually green" to "eyes come in pairs" to "oil paintings have visible brushwork" — and can use that knowledge to produce new images that fit a description.
The dominant technique is called a diffusion model, and its core move is wonderfully strange:
- Start with a canvas of pure random noise — television static.
- Remove a little bit of the noise, in the direction that makes the result look slightly more like a plausible image matching your prompt.
- Repeat, a few dozen times.
- What's left when the noise runs out is your picture.
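The four steps above can be sketched as a toy loop. This is a minimal illustration, not how any real product is implemented: the "image" is just five numbers, and the `target` list stands in for what a trained network would infer from your prompt (a real model never gets to peek at the answer). The names `toy_denoise` and `distance` are made up for this sketch.

```python
import random

def toy_denoise(target, steps=30, seed=0):
    """Start from static and remove a little noise per step.

    Assumption: `target` plays the role of the trained network's knowledge
    of what the prompt should look like. Real models predict the noise
    without ever seeing the finished picture.
    """
    rng = random.Random(seed)
    # Step 1: a canvas of pure random noise.
    canvas = [rng.gauss(0.0, 1.0) for _ in target]
    history = [canvas]
    for step in range(steps):
        # Step 2: estimate the noise still sitting on the canvas...
        predicted_noise = [c - t for c, t in zip(canvas, target)]
        # ...and remove only a fraction of it, so the picture emerges gradually.
        fraction = 1.0 / (steps - step)
        canvas = [c - fraction * n for c, n in zip(canvas, predicted_noise)]
        # Step 3: repeat.
        history.append(canvas)
    # Step 4: the last entry is what's left when the noise runs out.
    return history

def distance(a, b):
    """Straight-line distance between two tiny 'images'."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

target = [0.0, 0.2, 0.9, 0.2, 0.0]   # a 5-pixel stand-in for "the picture"
history = toy_denoise(target)
for step in (0, 10, 20, 30):
    print(step, round(distance(history[step], target), 3))
```

Run it and the printed distance shrinks at every checkpoint until it reaches zero: the canvas drifts from static to picture in small nudges rather than one jump.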
If that raises more questions than it answers — how does removing noise create a corgi? — good. That's exactly what the next chapter unpacks. For now, the important part is what this process is not.
What it's not: a collage machine
The most common misconception about AI art is that the model searches its training images and stitches pieces together. It doesn't — it can't, because the training images aren't in the model.
Just as an LLM stores no database of facts, an image model stores no library of pictures. Training compressed billions of image-caption pairs into a few billion numerical parameters — patterns, not pixels. When you generate, the model reconstructs "corgi-ness" and "oil-painting-ness" from those patterns, applied to a fresh canvas of noise.
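A back-of-envelope calculation makes the "patterns, not pixels" point concrete. Every number below is an illustrative round figure chosen for easy arithmetic, not a measurement of any specific model or dataset.

```python
# Illustrative round numbers only (assumptions, not figures for a real model).
training_images = 2_000_000_000   # "billions of image-caption pairs"
parameters = 2_000_000_000        # "a few billion numerical parameters"
bytes_per_parameter = 2           # assumes 16-bit weights
avg_image_bytes = 100_000         # assumes roughly 100 KB per training image

model_bytes = parameters * bytes_per_parameter
dataset_bytes = training_images * avg_image_bytes
bytes_per_image = model_bytes / training_images

print(f"model size:   {model_bytes / 1e9:.0f} GB")     # 4 GB
print(f"dataset size: {dataset_bytes / 1e12:.0f} TB")  # 200 TB
print(f"model capacity per training image: {bytes_per_image:.0f} bytes")  # 2 bytes
```

Under these assumptions the model has about two bytes of capacity per training image, nowhere near enough to file away a library of pictures. What fits in that space is shared structure: what fur, brushwork, and helmets tend to look like.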
This doesn't settle the copyright debate — the model did learn from those images, and we take that question seriously in the limits chapter — but it does settle the mechanism: there is no cut, and no paste.
How we got here
Image generation didn't start with diffusion. The short history:
- 2014: GANs
Generative Adversarial Networks: two networks — a forger and a detective — train against each other until the forger wins. First neural nets to produce convincing faces.
- 2020–21: Diffusion breaks through
The noise-removal approach produces stunning images (2020), then formally beats GANs on image quality (2021) — while being far more stable to train.
- 2021: DALL·E
OpenAI connects text to image generation at scale: describe it, get it.
- 2022: The explosion
DALL·E 2, Midjourney, and the open-source Stable Diffusion arrive within months of each other. AI art goes mainstream.
- 2024+: Video joins in
Sora-class models extend diffusion into time — generating coherent video clips from text.
The reason diffusion displaced GANs so quickly comes down to two words: stability and steerability. GAN training was a knife-edge duel between two networks that frequently collapsed. Diffusion training is a single, boring, reliable objective — predict the noise — that scales beautifully. And, crucially, each of diffusion's many small steps is an opportunity to inject guidance from a text prompt. GANs generated in one opaque leap; diffusion generates in fifty steerable nudges.
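Both halves of that argument fit in a few lines. The sketch below is hedged throughout: the function names are hypothetical, the square-root blend follows the standard way diffusion papers mix signal and noise, and the steering formula is classifier-free guidance, one widely used technique that this chapter does not name. Treat it as an illustration of the idea, not as any particular product's code.

```python
import math
import random

def add_noise(clean, noise, keep):
    """Blend a clean image with noise; `keep` (0 to 1) is how much signal survives."""
    return [math.sqrt(keep) * x + math.sqrt(1.0 - keep) * n
            for x, n in zip(clean, noise)]

def noise_prediction_loss(predicted_noise, true_noise):
    """The 'single, boring' objective: mean squared error on the noise guess."""
    return sum((p - t) ** 2
               for p, t in zip(predicted_noise, true_noise)) / len(true_noise)

def guided_noise(noise_without_prompt, noise_with_prompt, scale):
    """Classifier-free-guidance-style mix (an assumption, see lead-in):
    move the noise estimate away from the prompt-free guess, toward the prompt."""
    return [u + scale * (c - u)
            for u, c in zip(noise_without_prompt, noise_with_prompt)]

rng = random.Random(0)
clean = [0.0, 0.2, 0.9, 0.2, 0.0]
noise = [rng.gauss(0.0, 1.0) for _ in clean]
# During training the network would be shown `noisy` (plus the caption)
# and asked to guess `noise`.
noisy = add_noise(clean, noise, keep=0.5)

# A perfect guess scores 0; a lazy "there is no noise" guess scores worse.
perfect = noise_prediction_loss(noise, noise)
lazy = noise_prediction_loss([0.0] * len(noise), noise)
print(perfect, lazy > perfect)   # 0.0 True

# At generation time, scale 1.0 simply follows the prompt-aware guess;
# larger values push harder toward the prompt.
print(guided_noise([0.0, 1.0], [1.0, 1.0], scale=2.0))   # [2.0, 1.0]
```

The training side has one network and one number to drive down, with no second network to duel. The steering side is a small adjustment that can be applied at every one of the fifty steps, which is exactly the hook a one-leap GAN lacks.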
The cast of characters
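Most of the tools you've heard of are variations on text-conditioned diffusion. One exception: OpenAI describes its newer GPT image generation as autoregressive rather than diffusion-based.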
| Tool | Run by | Known for |
|---|---|---|
| Midjourney | Midjourney | Opinionated, painterly "house style"; strong aesthetics out of the box |
| DALL·E / GPT image | OpenAI | Instruction-following and in-chat editing |
| Stable Diffusion / SDXL | Stability AI (open weights) | Runs locally; endless community fine-tunes and control tools |
| Imagen / Veo | Google | Photorealism and, with Veo, high-end video |
| Flux | Black Forest Labs | Open-weight successor lineage to Stable Diffusion, strong realism |
The differences you feel between them — Midjourney's drama, DALL·E's obedience — come mostly from training data, fine-tuning taste, and the invisible prompt processing each product does, not from fundamentally different machinery.
One idea, two media
Here's the framing that makes this whole guide hang together. You already know from how LLMs work that a language model is a next-token predictor: learn the statistics of text, then sample plausible continuations.
An image model is the same philosophy on a different canvas:
| | LLM | Diffusion model |
|---|---|---|
| Learns | Patterns of text | Patterns of images |
| Generates by | Predicting the next token, repeatedly | Predicting the noise to remove, repeatedly |
| Steered by | Your prompt as context | Your prompt as guidance at every step |
| Fails by | Hallucinating plausible-but-wrong facts | Rendering plausible-but-wrong details (six fingers, garbled text) |
Both systems optimise for plausible, not true — the image-model equivalent of hallucination is a hand with six anatomically confident fingers.
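For the LLM column, the "predict the next token, repeatedly" loop can be sketched with a toy word table. The table `NEXT_WORDS` and the function `generate_text` are invented for this illustration; a real language model learns billions of parameters instead of a six-entry lookup, but the shape of the loop is the same.

```python
import random

# Hypothetical toy "language model": for each word, the words that may follow it.
NEXT_WORDS = {
    "a": ["corgi", "painting"],
    "corgi": ["astronaut", "in"],
    "astronaut": ["in"],
    "painting": ["of"],
    "of": ["a"],
    "in": ["a"],
}

def generate_text(prompt, steps, seed=0):
    """Extend `prompt` by sampling one plausible next word at a time."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(steps):
        # Look only at the current context, pick a plausible continuation, append it.
        tokens.append(rng.choice(NEXT_WORDS[tokens[-1]]))
    return tokens

words = generate_text(["a"], steps=6)
print(" ".join(words))
```

Every sentence this produces is locally plausible, and nothing in the loop checks whether it is true. Swap "append a word" for "remove a little noise" and you have the diffusion column of the table.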
Recap
- AI image generation is noise sculpted by a prompt: start from static, denoise step by step toward an image that matches the text.
- It is not collage — training images aren't stored in the model; every image is generated fresh from learned patterns.
- Diffusion replaced GANs because it trains stably and offers dozens of steerable steps instead of one opaque leap.
- Midjourney, DALL·E, Stable Diffusion, Imagen, and Flux are variations of the same text-conditioned diffusion recipe, differing in data and taste.
- Conceptually it's the LLM idea on a different canvas: learn the data's statistics, then sample — plausible over true, every time.
Next, the part that sounds impossible until you see it: how removing noise can create a picture. Continue to How diffusion models work.