Chapter 89·Intermediate
How the Prompt Steers the Image: CLIP, Cross-Attention & Guidance
01 / 05
Step one
Words become vectors before they touch pixels.
A text encoder (CLIP-style) turns your prompt into embeddings — the same trick LLMs use. The diffusion model never reads words; it reads meaning-coordinates.