How the Prompt Steers the Image: CLIP, Cross-Attention & Guidance

How does a text prompt actually control an AI image? A plain-English tour of conditioning — how CLIP-style encoders turn words into meaning the model can use, how cross-attention lets every image region consult the prompt, and what the guidance scale (CFG) slider really does.

Across the last two chapters, the prompt has hovered offstage — "the model denoises, steered by your text." This chapter drags the steering machinery into the light. It has three parts: how words become something a vision model can use, how they're injected into every denoising step, and how the strength of their influence is controlled.

Part 1: Words become vectors

A diffusion model doesn't read. Before your prompt touches the canvas, a separate text encoder converts it into embeddings: lists of numbers positioned in a space where similar meanings sit close together, much as in LLMs.

The classic choice is a CLIP-style encoder, and how CLIP was trained explains a lot about how prompting behaves. CLIP learned from hundreds of millions of image–caption pairs, with one objective: put each image and its true caption close together in a shared space, and mismatched pairs far apart. After training, "a corgi on a beach" the sentence and a corgi-on-a-beach photo land in the same neighbourhood of meaning.

That shared space is the hinge of the whole system: text and images become comparable objects. The prompt enters the diffusion model not as words but as a set of meaning-coordinates the model was trained to move toward.
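To make the training objective concrete, here is a minimal sketch of a CLIP-style contrastive loss in plain Python. It is an illustration under stated assumptions, not CLIP's actual code: the three-number vectors are hand-made stand-ins for real embeddings, the names `cosine` and `contrastive_loss` are invented for this example, and the fixed `temperature` is an arbitrary choice (real systems typically learn it).

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def contrastive_loss(image_vecs, text_vecs, temperature=0.1):
    """Symmetric cross-entropy over the image-text similarity matrix.

    image_vecs[i] and text_vecs[i] are assumed to be a true pair.
    The loss is low when true pairs are more similar than mismatched ones.
    """
    n = len(image_vecs)
    sims = [[cosine(img, txt) / temperature for txt in text_vecs]
            for img in image_vecs]

    def cross_entropy(rows):
        # Average of -log softmax probability of the matching (diagonal) entry.
        total = 0.0
        for i, row in enumerate(rows):
            log_norm = math.log(sum(math.exp(s) for s in row))
            total += log_norm - row[i]
        return total / n

    cols = [list(col) for col in zip(*sims)]  # transpose: caption-to-image view
    return 0.5 * (cross_entropy(sims) + cross_entropy(cols))

# Toy, hand-made embeddings (assumptions for illustration, not real CLIP outputs).
images = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9]]          # "corgi photo", "city photo"
captions_ok = [[1.0, 0.0, 0.1], [0.1, 0.1, 1.0]]     # captions in matching order
captions_swapped = [captions_ok[1], captions_ok[0]]  # captions deliberately mismatched

loss_aligned = contrastive_loss(images, captions_ok)
loss_swapped = contrastive_loss(images, captions_swapped)
print("aligned pairs:", loss_aligned)
print("swapped pairs:", loss_swapped)
```

Training pushes the encoders toward whatever makes this kind of loss small, which is what pulls a caption and its photo into the same neighbourhood.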

Part 2: Cross-attention — the bridge to pixels

So the prompt is now a row of embedding vectors. How do they push the canvas around?

Through cross-attention — the same attention mechanism from Transformers, pointed across media. In ordinary self-attention, parts of a text look at each other. In cross-attention, parts of the image look at the prompt:

At every denoising step, every region of the (latent) canvas gets to query the prompt's tokens: "anything relevant to what I should become?"
The region resolving into sky pulls hardest from "sunset"; the region resolving into a dog pulls from "corgi"; brushwork everywhere pulls from "oil painting."

Prompt → embeddings (CLIP) → each image region queries the tokens → relevant words weight each region → denoising nudge, locally steered

Inside every single denoising step, the canvas consults the prompt via cross-attention.
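The "each region queries the tokens" idea can be sketched in a few lines of plain Python. This is a toy version of scaled dot-product attention, assuming hand-picked two-number vectors; a real model would derive the queries from the latent canvas and the keys and values from the prompt embeddings through learned projections, which are left out here. The function and variable names are made up for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(region_queries, token_keys, token_values):
    """Each image region (query) attends over the prompt tokens (keys/values)."""
    d = len(token_keys[0])
    outputs, weights = [], []
    for q in region_queries:
        # How relevant is each prompt token to this region?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in token_keys]
        w = softmax(scores)
        # Blend the token values using those relevance weights.
        mixed = [sum(wj * v[c] for wj, v in zip(w, token_values))
                 for c in range(len(token_values[0]))]
        outputs.append(mixed)
        weights.append(w)
    return outputs, weights

# Toy vectors (assumptions for illustration only).
tokens = ["sunset", "corgi"]
token_keys = [[4.0, 0.0], [0.0, 4.0]]
token_values = [[1.0, 0.0], [0.0, 1.0]]
region_queries = [[2.0, 0.0],   # a region resolving into sky
                  [0.0, 2.0]]   # a region resolving into a dog

outputs, weights = cross_attention(region_queries, token_keys, token_values)
for name, w in zip(["sky region", "dog region"], weights):
    print(name, {t: round(x, 3) for t, x in zip(tokens, w)})
```

In this toy setup the sky region puts nearly all its weight on "sunset" and the dog region on "corgi", which is the mechanism by which each place picks its own words.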

Two things fall out of this design. First, spatial control: words land in the right places because each place chose its own words. Second, repetition of influence: the consultation happens at every one of the 20–50 steps, so the prompt shapes the earliest compositional decisions and the final texture polish alike — this is the "fifty steerable nudges" advantage over GANs from chapter 1.

Part 3: The guidance scale — how loudly the prompt shouts

Knowing the direction the prompt wants isn't the whole story; there's also how hard to push. That's the guidance scale, usually labelled CFG (for classifier-free guidance) in image tools, and the trick behind it is delightfully blunt.

At each step, the model actually denoises twice:

Once with your prompt — "which way toward a corgi on a beach?"
Once with no prompt at all — "which way toward any plausible image?"

Subtract the two and you get a pure arrow: the direction the prompt specifically is asking for. Classifier-free guidance exaggerates that arrow — moving further along it than the honest prediction suggests. The guidance scale is the exaggeration factor:

Guidance scale	Behaviour
Low (1–4)	Loose interpretation — natural, varied, sometimes off-brief
Medium (5–9)	The usual sweet spot — faithful and clean
High (10–15)	Very literal — colours oversaturate, contrast crunches
Very high (20+)	"Deep-fried": burnt colours, halos, distorted anatomy

That deep-fried look at high CFG isn't the model trying harder and failing — it's the exaggeration pushing the image past plausibility in the prompt's direction, like turning a photo's saturation slider far beyond 100%.
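The arithmetic behind the slider fits in one line. Below is a small sketch in plain Python, with three-number lists standing in for the model's two noise predictions; the numbers and the name `apply_guidance` are invented for illustration. The formula is the standard one: start from the no-prompt prediction and move `guidance_scale` times along the arrow toward the prompted one.

```python
def apply_guidance(pred_uncond, pred_cond, guidance_scale):
    """Classifier-free guidance on two predictions of the same shape.

    guided = uncond + guidance_scale * (cond - uncond)
    """
    return [u + guidance_scale * (c - u) for u, c in zip(pred_uncond, pred_cond)]

# Made-up predictions for a single step (assumptions, not model output).
pred_uncond = [0.10, -0.20, 0.05]   # "which way toward any plausible image?"
pred_cond = [0.30, -0.10, 0.00]     # "which way toward a corgi on a beach?"

for scale in (0.0, 1.0, 7.5, 20.0):
    print(scale, apply_guidance(pred_uncond, pred_cond, scale))
```

A scale of 0 ignores the prompt, a scale of 1 returns the prompted prediction unchanged, and anything above 1 overshoots it. That overshoot is the exaggeration the table describes.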

Negative prompts: guidance in reverse

Once you see the two-pass trick, negative prompts stop being mysterious. Instead of comparing your prompt against no prompt, the tool compares it against the negative prompt — so every step moves toward "corgi on a beach" and away from "blurry, extra limbs, watermark." It's the same subtraction with the baseline swapped. Nothing is filtered afterward; the avoidance is baked into every denoising nudge.
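Here is the baseline swap as a short Python sketch. The two-number lists are invented stand-ins for noise predictions, and `guided_prediction` is a hypothetical helper name; the only point is that the negative prompt takes the place the empty prompt used to occupy in the subtraction.

```python
def guided_prediction(pred_baseline, pred_prompt, guidance_scale):
    """Move from the baseline toward the prompt, exaggerated by guidance_scale."""
    return [b + guidance_scale * (p - b) for b, p in zip(pred_baseline, pred_prompt)]

# Made-up predictions for one denoising step (assumptions for illustration).
pred_prompt = [0.30, -0.10]     # "corgi on a beach"
pred_empty = [0.10, -0.20]      # no prompt at all
pred_negative = [0.25, 0.40]    # "blurry, extra limbs, watermark"

plain = guided_prediction(pred_empty, pred_prompt, 7.5)
with_negative = guided_prediction(pred_negative, pred_prompt, 7.5)

print("baseline = empty prompt:   ", plain)
print("baseline = negative prompt:", with_negative)
```

With the negative prompt as the baseline, the second component is pushed well past the prompt's own value and away from the negative one, which is the "toward this, away from that" behaviour in miniature.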

The full pipeline, assembled

All four chapters so far, in one line:

Prompt → CLIP embeddings → noise in latent space → denoise ×N (cross-attention + guidance) → VAE decodes → image

Text-to-image, end to end.

That's the complete machine. Everything a tool exposes — steps, seed, guidance, negative prompt — now maps to a specific part you understand.
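As a rough map of where each knob plugs in, here is a deliberately toy pipeline in plain Python. Every function in it is a hypothetical stand-in: `encode_prompt` is not a real text encoder, `predict_noise` is not a real denoiser, and `decode` is not a real VAE. Only the shape of the loop is meant to mirror the description above.

```python
import random

def encode_prompt(prompt, dim=4):
    """Toy stand-in for a text encoder: a repeatable vector per prompt."""
    rng = random.Random(prompt)
    return [rng.uniform(-1, 1) for _ in range(dim)]

def predict_noise(latent, embedding):
    """Toy stand-in for the denoiser: whatever separates latent from embedding."""
    return [x - e for x, e in zip(latent, embedding)]

def decode(latent):
    """Toy stand-in for the VAE decoder: returns the latent unchanged."""
    return list(latent)

def generate(prompt, negative_prompt="", steps=30, seed=0, guidance_scale=7.5):
    cond = encode_prompt(prompt)                # prompt -> embeddings
    baseline = encode_prompt(negative_prompt)   # empty string plays "no prompt"
    rng = random.Random(seed)                   # the seed fixes the starting noise
    latent = [rng.gauss(0, 1) for _ in range(len(cond))]
    for _ in range(steps):                      # steps = number of denoising nudges
        noise_cond = predict_noise(latent, cond)
        noise_base = predict_noise(latent, baseline)
        # Guidance: exaggerate the difference between the two predictions.
        guided = [b + guidance_scale * (c - b)
                  for b, c in zip(noise_base, noise_cond)]
        latent = [x - g / steps for x, g in zip(latent, guided)]
    return decode(latent)                       # latent -> "image"

print(generate("a corgi on a beach", negative_prompt="blurry", seed=42))
```

Same prompt and same seed give the same result; change the seed and the starting noise changes, so the output does too.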

Recap

A text encoder (CLIP-style) turns the prompt into embeddings in a space where text and images are comparable — the model receives meaning, not words.
Cross-attention lets every image region consult the prompt's tokens at every denoising step — that's how "sunset" ends up in the sky and "corgi" in the dog.
Classifier-free guidance denoises with and without the prompt and exaggerates the difference; the CFG slider sets the exaggeration — literal-but-fried at high values.
Negative prompts are the same subtraction with the baseline replaced — steering away is guidance in reverse.
Much prompt-misreading (swapped attributes, bleeding adjectives) is commonly traced to the text encoder rather than the diffusion loop.

You now know what every slider actually does — which makes you dangerous. Time to put it to work. Continue to Prompting AI image generators.