Code Safari

Chapter 89·Intermediate

How the Prompt Steers the Image: CLIP, Cross-Attention & Guidance

01 / 05

Step one

Words become vectors before they touch pixels.

A text encoder (CLIP-style) turns your prompt into embeddings — the same trick LLMs use. The diffusion model never reads words; it reads meaning-coordinates.

How the Prompt Steers the Image: CLIP, Cross-Attention & Guidance | Code Safari