Chapter 89 · Intermediate · 10 min read
How the Prompt Steers the Image: CLIP, Cross-Attention & Guidance
How does a text prompt actually control an AI image? A plain-English tour of conditioning — how CLIP-style encoders turn words into meaning the model can use, how cross-attention lets every image region consult the prompt, and what the guidance scale (CFG) slider really does.
July 18, 2026
Across the last two chapters, the prompt has hovered offstage — "the model denoises, steered by your text." This chapter drags the steering machinery into the light. It has three parts: how words become something a vision model can use, how they're injected into every denoising step, and how the strength of their influence is controlled.
Part 1: Words become vectors
A diffusion model doesn't read. Before your prompt touches the canvas, a separate text encoder converts it into embeddings: lists of numbers positioned in a space where similar meanings sit close together, much as in LLMs.
The classic choice is a CLIP-style encoder, and how CLIP was trained explains a lot about how prompting behaves. CLIP learned from hundreds of millions of image–caption pairs, with one objective: put each image and its true caption close together in a shared space, and mismatched pairs far apart. After training, "a corgi on a beach" the sentence and a corgi-on-a-beach photo land in the same neighbourhood of meaning.
That shared space is the hinge of the whole system: text and images become comparable objects. The prompt enters the diffusion model not as words but as a set of meaning-coordinates, one vector per token, that the model was trained to use as its guide while denoising.
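To make the shared-space idea concrete, here is a toy sketch in Python. The three-number vectors are hand-picked stand-ins rather than real CLIP outputs, and `cosine_similarity` and `best_caption` are hypothetical helper names chosen for this illustration. Real encoders produce much longer vectors, but the comparison works the same way: a matching image and caption point in nearly the same direction.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy embeddings (assumed values, for illustration only).
caption_embeddings = {
    "a corgi on a beach": [0.9, 0.8, 0.1],
    "a city street at night": [0.1, 0.2, 0.9],
}
# Pretend this vector came from encoding a corgi-on-a-beach photo.
image_embedding = [0.8, 0.9, 0.2]

def best_caption(image_vec, captions):
    """Return the caption whose embedding lies closest to the image embedding."""
    return max(captions, key=lambda text: cosine_similarity(image_vec, captions[text]))

print(best_caption(image_embedding, caption_embeddings))
```

Training pushes true pairs toward a similarity near 1 and mismatched pairs toward lower values, which is why the nearest caption here is the corgi one.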
Part 2: Cross-attention — the bridge to pixels
So the prompt is now a row of embedding vectors. How do they push the canvas around?
Through cross-attention — the same attention mechanism from Transformers, pointed across media. In ordinary self-attention, parts of a text look at each other. In cross-attention, parts of the image look at the prompt:
- At every denoising step, every region of the (latent) canvas gets to query the prompt's tokens: "anything relevant to what I should become?"
- The region resolving into sky pulls hardest from "sunset"; the region resolving into a dog pulls from "corgi"; brushwork everywhere pulls from "oil painting."
Two things fall out of this design. First, spatial control: words land in the right places because each place chose its own words. Second, repetition of influence: the consultation happens at every one of the 20–50 steps, so the prompt shapes the earliest compositional decisions and the final texture polish alike — this is the "fifty steerable nudges" advantage over GANs from chapter 1.
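The consultation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the two-number queries, keys, and values are made up, and in a real model all three come from learned projections of the image latents and the prompt embeddings. The function name `cross_attention` is a label for this sketch, not any library's API.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(region_queries, token_keys, token_values):
    """Each image region (query) mixes the prompt tokens' values by relevance."""
    dim = len(token_keys[0])
    outputs, all_weights = [], []
    for query in region_queries:
        # Scaled dot product between this region and every prompt token.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in token_keys]
        weights = softmax(scores)
        # Weighted average of the token values.
        mixed = [sum(w * value[i] for w, value in zip(weights, token_values))
                 for i in range(len(token_values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

tokens = ["sunset", "corgi"]
token_keys = [[4.0, 0.0], [0.0, 4.0]]    # assumed toy keys, one per token
token_values = [[1.0, 0.0], [0.0, 1.0]]  # assumed toy values, one per token
region_queries = [[2.0, 0.0],            # a region resolving into sky
                  [0.0, 2.0]]            # a region resolving into a dog

_, weights = cross_attention(region_queries, token_keys, token_values)
for name, row in zip(["sky region", "dog region"], weights):
    print(name, "attends most to", tokens[row.index(max(row))])
```

The queries come from the image and the keys and values come from the prompt, which is the only difference from self-attention, where all three come from the same sequence.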
Part 3: The guidance scale — how loudly the prompt shouts
Knowing the direction the prompt wants isn't the whole story; there's also how hard to push. That's the guidance scale — the CFG slider in most tools, for classifier-free guidance — and the trick behind it is delightfully blunt.
At each step, the model actually denoises twice:
- Once with your prompt — "which way toward a corgi on a beach?"
- Once with no prompt at all — "which way toward any plausible image?"
Subtract the two and you get a pure arrow: the direction the prompt specifically is asking for. Classifier-free guidance exaggerates that arrow — moving further along it than the honest prediction suggests. The guidance scale is the exaggeration factor:
| Guidance scale | Behaviour |
|---|---|
| Low (1–4) | Loose interpretation — natural, varied, sometimes off-brief |
| Medium (5–9) | The usual sweet spot — faithful and clean |
| High (10–15) | Very literal — colours oversaturate, contrast crunches |
| Very high (20+) | "Deep-fried": burnt colours, halos, distorted anatomy |
That deep-fried look at high CFG isn't the model trying harder and failing — it's the exaggeration pushing the image past plausibility in the prompt's direction, like turning a photo's saturation slider far beyond 100%.
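The arithmetic behind the slider is short enough to write out. The sketch below uses the commonly cited form of classifier-free guidance: start at the no-prompt prediction and move `guidance_scale` times along the arrow toward the prompted one. The two-number predictions are made-up stand-ins for a real model's output, `apply_guidance` is a hypothetical name, and individual tools may apply variations on this formula.

```python
def apply_guidance(uncond_pred, cond_pred, guidance_scale):
    """Move guidance_scale times along the arrow from the no-prompt
    prediction toward the prompted prediction."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond_pred, cond_pred)]

uncond_pred = [0.0, 1.0]  # assumed toy prediction with no prompt
cond_pred = [1.0, 1.0]    # assumed toy prediction with the prompt

for scale in (1.0, 7.5, 20.0):
    print(scale, apply_guidance(uncond_pred, cond_pred, scale))
```

At a scale of 1 the result is simply the prompted prediction with no exaggeration. Larger values stretch only the component where the two predictions differ, which is why very high settings overshoot in the prompt's direction while leaving everything else untouched.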
Negative prompts: guidance in reverse
Once you see the two-pass trick, negative prompts stop being mysterious. Instead of comparing your prompt against no prompt, the tool compares it against the negative prompt — so every step moves toward "corgi on a beach" and away from "blurry, extra limbs, watermark." It's the same subtraction with the baseline swapped. Nothing is filtered afterward; the avoidance is baked into every denoising nudge.
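A small Python sketch shows the baseline swap. The vectors are invented for illustration: the first number stands for "corgi on a beach" content and the second for "blurry, watermark" content. `guided_prediction` is a hypothetical helper, and real tools work on full noise predictions rather than two numbers.

```python
def guided_prediction(baseline_pred, prompt_pred, guidance_scale):
    """Start at the baseline and move guidance_scale times toward the prompt."""
    return [b + guidance_scale * (p - b) for b, p in zip(baseline_pred, prompt_pred)]

prompt_pred = [1.0, 0.0]    # assumed: prediction with the prompt
empty_pred = [0.0, 0.0]     # assumed: prediction with no prompt
negative_pred = [0.0, 1.0]  # assumed: prediction with the negative prompt

plain = guided_prediction(empty_pred, prompt_pred, 5.0)
steered = guided_prediction(negative_pred, prompt_pred, 5.0)

print("empty baseline:   ", plain)
print("negative baseline:", steered)
```

Both results push equally hard toward the prompt in the first coordinate. Only the negative baseline drives the second coordinate below zero, so the step actively moves away from what the negative prompt describes.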
The full pipeline, assembled
All four chapters so far, in one line:
That's the complete machine. Everything a tool exposes — steps, seed, guidance, negative prompt — now maps to a specific part you understand.
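As a rough illustration of how those exposed settings could map onto code, here is a toy loop in Python. Everything in it is a stand-in invented for this sketch: `encode_prompt` is a lookup table rather than a text encoder, `predict_noise` is a one-line formula rather than a neural network, the step size of 0.2 is arbitrary, and the latent has two numbers instead of thousands. Only the wiring is the point.

```python
import random

def encode_prompt(prompt):
    """Hypothetical stand-in for a text encoder: prompt -> toy 2-D vector."""
    toy_table = {
        "": [0.0, 0.0],
        "corgi on a beach": [1.0, 0.5],
        "blurry, watermark": [0.0, 1.0],
    }
    return toy_table[prompt]

def predict_noise(latent, conditioning):
    """Hypothetical stand-in for the denoiser: the 'noise' is whatever
    separates the latent from the point the conditioning describes."""
    return [x - c for x, c in zip(latent, conditioning)]

def generate(prompt, negative_prompt="", steps=30, guidance_scale=1.5, seed=0):
    rng = random.Random(seed)                         # seed fixes the starting noise
    latent = [rng.gauss(0.0, 1.0) for _ in range(2)]
    cond = encode_prompt(prompt)
    baseline = encode_prompt(negative_prompt)         # empty prompt by default
    for _ in range(steps):                            # steps = number of nudges
        noise_cond = predict_noise(latent, cond)
        noise_base = predict_noise(latent, baseline)
        # Guidance: exaggerate the difference between the two predictions.
        guided = [b + guidance_scale * (c - b)
                  for b, c in zip(noise_base, noise_cond)]
        # Remove a fraction of the guided noise (0.2 is an arbitrary toy step size).
        latent = [x - 0.2 * g for x, g in zip(latent, guided)]
    return latent

print(generate("corgi on a beach"))
print(generate("corgi on a beach", negative_prompt="blurry, watermark"))
```

In this toy, the latent settles near `guidance_scale` times the prompt vector when the baseline is empty, a small-scale picture of guidance overshooting the prompt's own target. The same seed always gives the same result, and adding the negative prompt pulls the second coordinate down.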
Recap
- A text encoder (CLIP-style) turns the prompt into embeddings in a space where text and images are comparable — the model receives meaning, not words.
- Cross-attention lets every image region consult the prompt's tokens at every denoising step — that's how "sunset" ends up in the sky and "corgi" in the dog.
- Classifier-free guidance denoises with and without the prompt and exaggerates the difference; the CFG slider sets the exaggeration — literal-but-fried at high values.
- Negative prompts are the same subtraction with the baseline replaced — steering away is guidance in reverse.
- Many prompt misreadings (swapped attributes, bleeding adjectives) are often traced to the text encoder rather than the diffusion loop.
You now know what every slider actually does — which makes you dangerous. Time to put it to work. Continue to Prompting AI image generators.