How AI Video Generation Works: Sora, Veo, and the Hard Problem of Time

How do Sora-class models generate video from text? A plain-English explanation of video diffusion — spacetime latents, why temporal consistency is the hard part, the 'world model' debate, and why physics glitches happen.

Everything in this guide so far generates one frozen moment. Video asks for something categorically harder: hundreds of moments that agree with each other. The corgi must stay the same corgi while it runs; shadows must move with the sun that casts them.

This chapter explains how Sora-class systems (OpenAI's Sora, Google's Veo, Runway, Kling and friends) extend diffusion into time — and why the results are simultaneously astonishing and prone to physics-defying glitches.

The naive approach fails instantly

The obvious idea: generate frame 1 with an image model, then generate frame 2 "similar to frame 1," and so on. It fails for a reason you can predict from chapter 2: every generation is a fresh sample from a space of plausible images. Chain them and tiny disagreements compound — the corgi's markings drift, the background wobbles, and within a second you have a fever dream, not a video.
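The compounding is easy to see in a toy model. The sketch below is an illustration under stated assumptions, not how any real system works: a single number stands in for a whole frame, each new frame copies the previous one plus a small independent error, and the names `chained_drift` and `mean_drift_at`, the noise size, and the 24 fps figure are all made up for the example.

```python
import random

def chained_drift(num_frames, step_noise, seed):
    """Toy frame-by-frame generator: each frame copies the previous one
    plus a small independent error. One number stands in for a whole
    frame (say, the position of the corgi's markings)."""
    rng = random.Random(seed)
    value = 0.0                                # frame 1 is the reference
    drift = [0.0]
    for _ in range(num_frames - 1):
        value += rng.gauss(0.0, step_noise)    # tiny fresh disagreement
        drift.append(abs(value))               # distance from frame 1
    return drift

def mean_drift_at(frame_index, num_frames=240, step_noise=0.01, trials=500):
    """Average distance from frame 1 at a given frame, over many runs."""
    total = sum(chained_drift(num_frames, step_noise, seed)[frame_index]
                for seed in range(trials))
    return total / trials

early = mean_drift_at(1)     # right after the first hand-off
late = mean_drift_at(239)    # ten seconds in, at an assumed 24 fps
print(f"mean drift at frame 2:   {early:.4f}")
print(f"mean drift at frame 240: {late:.4f}")
```

No single hand-off is bad; the errors simply never get a chance to cancel, so the distance from the first frame keeps growing with clip length.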

Frame-by-frame generation treats consistency as an afterthought. The fix is to make it structural.

The real approach: denoise a block of spacetime

Modern video models generate the whole clip as one object. Instead of a 2D canvas of noise, the model starts with a 3D block — width × height × time — and denoises the entire thing together over the usual 20–50 steps.

Figure: noise block (W × H × T) → denoise the whole volume, prompt-guided → every frame negotiated with its neighbours → decode to a coherent clip. Video diffusion: one denoising loop over a spacetime volume, not a chain of images.

Because every denoising step sees all frames at once, consistency is enforced during generation, not patched afterwards. The corgi in frame 80 and the corgi in frame 1 are resolved from the same block, by the same passes, under the same prompt — they agree because they were never separate.
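As a rough sketch of what "one loop over the whole block" means, the toy below runs a single denoising loop over a `block[t][y][x]` volume. Everything here is an assumption for illustration: `toy_predict_clean` is a hypothetical stand-in for the trained, prompt-conditioned network (it just averages each pixel over time), and the update rule is a simple deterministic blend toward that guess, not the sampler of any particular model.

```python
import random

def make_noise_block(width, height, frames, seed=0):
    """Pure Gaussian noise, indexed as block[t][y][x]."""
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 1.0) for _ in range(width)]
             for _ in range(height)]
            for _ in range(frames)]

def toy_predict_clean(block):
    """Hypothetical stand-in for the learned denoiser: guesses the clean
    clip as the per-pixel average over time. A real model is a trained
    network; this only mimics 'looks at all frames at once'."""
    frames = len(block)
    height, width = len(block[0]), len(block[0][0])
    mean = [[sum(block[t][y][x] for t in range(frames)) / frames
             for x in range(width)] for y in range(height)]
    return [mean for _ in range(frames)]   # same guess reused for every frame

def denoise(block, steps=30):
    """One loop; every step updates every frame of the block together."""
    for step in range(steps):
        guess = toy_predict_clean(block)
        remaining = steps - step
        block = [[[v + (g - v) / remaining
                   for v, g in zip(row, guess_row)]
                  for row, guess_row in zip(frame, guess_frame)]
                 for frame, guess_frame in zip(block, guess)]
    return block

def max_frame_gap(block):
    """Largest per-pixel disagreement between the first and last frame."""
    return max(abs(a - b)
               for r1, r2 in zip(block[0], block[-1])
               for a, b in zip(r1, r2))

noise = make_noise_block(width=8, height=6, frames=16)
clip = denoise(noise, steps=30)
print("frame gap before:", round(max_frame_gap(noise), 3))
print("frame gap after: ", round(max_frame_gap(clip), 3))
```

Because the stand-in only knows how to make frames agree, the toy collapses to a still image. A trained model learns motion as well as agreement, but the structure is the same: one loop, every frame updated together.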

Everything you learned about the still-image pipeline carries over:

Latent space, extended: clips are compressed by a video autoencoder across space and time before any diffusion happens, which is latent diffusion with one more axis. Raw video is so large that this is not just an optimisation; it is the only way the problem fits in memory at all.
Text conditioning, unchanged: the prompt becomes embeddings, and cross-attention steers every region of every frame.
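Some back-of-the-envelope arithmetic shows why the compression step is not optional. The numbers below are assumptions chosen for illustration, not the published settings of any system: a 10-second 1080p clip at 24 fps, and a hypothetical autoencoder that shrinks each spatial axis 8x, shrinks time 4x, and keeps 16 latent channels.

```python
def value_count(width, height, frames, channels):
    """How many numbers it takes to store a clip of this shape."""
    return width * height * frames * channels

# Assumed clip: 10 s of 1080p at 24 fps, RGB.
raw = value_count(1920, 1080, frames=240, channels=3)

# Assumed (hypothetical) autoencoder: 8x smaller per spatial axis,
# 4x fewer time steps, 16 latent channels.
latent = value_count(1920 // 8, 1080 // 8, frames=240 // 4, channels=16)

print(f"raw values:    {raw:,}")
print(f"latent values: {latent:,}")
print(f"reduction:     {raw / latent:.0f}x")
```

Under these assumptions the model denoises roughly 31 million numbers instead of roughly 1.5 billion, and it has to touch all of them on every one of the denoising steps.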

Patches: video becomes tokens

The second big idea in Sora-class systems is architectural. Chop the compressed spacetime block into small patches — little cubes of space-and-time — and treat them exactly the way an LLM treats tokens: a sequence processed by a Transformer, every patch attending to every other.
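A minimal sketch of the chopping step, assuming the compressed block is stored as `block[t][y][x]` with one number per position (real latents have many channels). The function name `patchify`, the helper `make_block`, and the patch size of 2 × 4 × 4 are illustrative choices, not taken from any published model.

```python
def patchify(block, pt, ph, pw):
    """Cut block[t][y][x] into non-overlapping pt x ph x pw cubes and
    flatten each cube into one token (a flat list of numbers)."""
    frames, height, width = len(block), len(block[0]), len(block[0][0])
    assert frames % pt == 0 and height % ph == 0 and width % pw == 0
    tokens = []
    for t0 in range(0, frames, pt):
        for y0 in range(0, height, ph):
            for x0 in range(0, width, pw):
                tokens.append([block[t][y][x]
                               for t in range(t0, t0 + pt)
                               for y in range(y0, y0 + ph)
                               for x in range(x0, x0 + pw)])
    return tokens

def make_block(frames, height, width):
    """Each value encodes its own (t, y, x) so patches are easy to inspect."""
    return [[[t * 10000 + y * 100 + x for x in range(width)]
             for y in range(height)] for t in range(frames)]

tokens = patchify(make_block(frames=8, height=16, width=16), pt=2, ph=4, pw=4)
print(len(tokens), "tokens, each of length", len(tokens[0]))
print("first token spans positions", tokens[0][0], "to", tokens[0][-1])
```

Each token covers a small region of the picture for a short stretch of time, so the sequence the Transformer sees already mixes "where" and "when".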

This "diffusion Transformer" recipe has two consequences worth understanding:

Long-range coherence comes from attention. A patch at second 4 can attend directly to a patch at second 1 — that's the mechanism by which an object that exits the frame can come back looking the same. It's attention doing for visual continuity what it does for narrative continuity in text.
Flexibility comes free. A sequence of patches doesn't care about resolution, aspect ratio, or duration — vary the number of patches and the same model handles portrait clips, wide shots, and stills. (An image is just a video with one frame.)
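Both consequences show up in a few lines of plain attention. This is a toy under loud assumptions: the two-number "appearance" vectors for `corgi` and `grass` are invented, and the learned projections and positional information that a real Transformer adds are left out. Only the scaled dot-product attention itself is standard.

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention over a token sequence of any length.
    Returns the mixed outputs and the attention weights."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        top = max(scores)                        # subtract max for stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical appearance features for one patch position over time.
corgi, grass = [4.0, 0.0], [0.0, 4.0]
tokens = [corgi] + [grass] * 6 + [corgi]   # corgi leaves the frame, then returns

_, weights = attention(tokens, tokens, tokens)
last = weights[-1]
print("last token -> first token:  ", round(last[0], 3))
print("last token -> its neighbour:", round(last[-2], 3))

# Same function, shorter sequence: fewer patches, no code changes.
_, short_weights = attention(tokens[:3], tokens[:3], tokens[:3])
print("rows for the shorter clip:", len(short_weights))
```

The returning corgi token puts its weight on the matching token seven positions back rather than on its immediate neighbour: distance in the sequence costs nothing. The function also never asks how long the sequence is, which is the sense in which flexibility comes free.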

The "world model" debate

Here's where engineering ends and one of AI's liveliest arguments begins.

To denoise spacetime well, a model must get an enormous amount of implicit physics right: unsupported objects fall, liquids settle flat, shadows track light sources, occluded things usually still exist. And generated video does honour these regularities — surprisingly often.

One camp argues this means video models are learning world models: internal, predictive representations of how reality behaves, and that scaling them is a road toward AI that understands physical environments — with obvious stakes for robotics and simulation.

The skeptical camp points at the failure modes: a glass that shatters before it's dropped, a person walking through a table, liquid pouring upward when the clip runs long. Their reading: the model learned what physics looks like on camera, not physics — correlation over causation, rendered beautifully.

The honest current answer is "somewhere in between, and it's genuinely unresolved." What's not disputed is the mechanism behind the glitches.

Why the glitches happen

There is no physics engine and no object registry inside a video model. Nothing tracks "glass #1, position, velocity." There are only statistical patterns pushing every patch toward local plausibility — each moment looking right given its neighbours.

That objective allows failure modes a simulator could never produce:

Glitch	The statistical reason
Objects teleport or duplicate	Nothing enforces object permanence; attention usually preserves it, sometimes doesn't
Impossible physics on long clips	Small frame-to-frame implausibilities accumulate as the clip gets longer: the video cousin of drift
Hands passing through objects	Contact physics is rare and hard to see in training data, so it's weakly learned
Text and logos morphing mid-clip	The same compression casualties as still images, now asked to persist over time
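For contrast, here is roughly what the missing "object registry" looks like in a simulator. This is a deliberately tiny sketch, not a real physics engine: the `Body` class, its fields, the `step` function, and the 24 fps timestep are all assumptions made for the example.

```python
from dataclasses import dataclass

GRAVITY = -9.81  # metres per second squared

@dataclass
class Body:
    name: str
    height: float            # metres above the floor
    velocity: float = 0.0
    supported: bool = True
    shattered: bool = False

def step(body, dt):
    """Advance one explicitly tracked object by dt seconds."""
    if body.supported or body.shattered:
        return                    # resting or already broken: nothing changes
    body.velocity += GRAVITY * dt
    body.height += body.velocity * dt
    if body.height <= 0.0:
        body.height = 0.0
        body.velocity = 0.0
        body.shattered = True     # can only happen after an actual fall

glass = Body("glass #1", height=1.0)
for _ in range(24):               # one second on the table, assumed 24 fps
    step(glass, 1 / 24)
intact_on_table = not glass.shattered and glass.height == 1.0

glass.supported = False           # knocked off the edge
frames_to_floor = 0
while not glass.shattered:
    step(glass, 1 / 24)
    frames_to_floor += 1
print("intact while supported:", intact_on_table)
print("frames until it shatters:", frames_to_floor)
```

In this code a glass cannot shatter before it falls, because `shattered` is only ever set inside the fall branch. A video model has no such record to consult; it has only patches that look right next to their neighbours.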

This is exactly the plausible-over-true trade you've met twice before — hallucination in text, six-fingered hands in images — now wearing a lab coat and violating conservation of matter.

Recap

Frame-by-frame generation fails because independent samples drift; real video models denoise the entire clip as one spacetime block, making consistency structural.
The pipeline is the still-image machinery plus one dimension: spatio-temporal latents, then diffusion, steered by cross-attention.
Sora-class systems cut the block into spacetime patches processed by a Transformer — attention across space and time is where object persistence comes from, and patch counts make resolution/duration flexible.
The world-model debate — implicit physics understanding vs. very good statistical mimicry — is real and unresolved; the glitches (teleporting objects, impossible liquids) come from optimising local plausibility with no physics engine underneath.

One chapter remains: the honest accounting — artifacts, copyright, deepfakes, and where the hard limits actually are. Continue to The limits of AI image generation.