How the Prompt Steers the Image: CLIP, Cross-Attention & Guidance

Chapter 89·Intermediate

01 / 05

Step one

Words become vectors before they touch pixels.

A text encoder (CLIP-style) turns your prompt into embeddings — the same trick LLMs use. The diffusion model never reads words; it reads meaning-coordinates.