Building Generative Image & Video models at Scale - Sander Dieleman (Veo and Nano Banana)

| Podcasts | April 21, 2026 | 1.48K views | 40:46

TL;DR

Sander Dieleman from Google DeepMind explains the technical foundations of training large-scale generative image and video models like Veo, emphasizing that meticulous data curation and learned latent representations are as critical as the diffusion architecture itself. He details how diffusion models reverse a noise corruption process through iterative refinement rather than single-step prediction.

🎯 Data Curation & Latent Representation (3 insights)

Data quality investments trump architecture tweaks

Time spent improving data curation often yields better results than hyperparameter tuning or optimizer adjustments, requiring researchers to abandon the academic tradition of using predefined benchmark datasets.

Learned compressors preserve structure unlike traditional codecs

Autoencoders create latent representations that maintain topological grid structure and semantic content, whereas standard codecs like JPEG or H.265 obscure the data structure necessary for effective generative modeling.

Latent compression reduces memory by two orders of magnitude

Compressing high-resolution video into learned latent spaces reduces tensor sizes by roughly 100x, making it feasible to train on 1080p video sequences that would otherwise require several gigabytes of memory per example.
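The memory savings can be checked with back-of-the-envelope arithmetic. The sketch below uses hypothetical but typical figures (a 5-second 1080p clip, 8x spatial and 4x temporal downsampling, 8 latent channels); these are illustrative assumptions, not Veo's actual autoencoder configuration.

```python
# Illustrative sizes: a 5-second 1080p clip at 24 fps, float16 RGB frames.
frames, height, width, channels = 5 * 24, 1080, 1920, 3
bytes_per_value = 2  # float16

raw_bytes = frames * height * width * channels * bytes_per_value
print(f"raw video tensor: {raw_bytes / 1e9:.2f} GB")  # ~1.49 GB

# Assumed autoencoder: 8x spatial and 4x temporal downsampling,
# expanding to 8 latent channels (a ballpark choice for illustration).
lat_f, lat_h, lat_w, lat_c = frames // 4, height // 8, width // 8, 8
latent_bytes = lat_f * lat_h * lat_w * lat_c * bytes_per_value
print(f"latent tensor: {latent_bytes / 1e6:.1f} MB")  # ~15.6 MB

print(f"compression: {raw_bytes / latent_bytes:.0f}x")  # ~96x, i.e. ~100x
```

With these factors the element count drops by 8 × 8 × 4 = 256x while channels grow from 3 to 8, netting roughly the two-orders-of-magnitude reduction mentioned above.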

🌊 Diffusion Model Mechanics (3 insights)

Reverse corruption process enables iterative generation

Diffusion models learn to reverse a gradual noise-addition process that destroys image structure, allowing generation to start from pure noise and iteratively refine toward coherent outputs.
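The forward (noising) half of this process can be sketched in a few lines. This is a minimal illustration using a simple linear signal-retention schedule; real models choose their noise schedules much more carefully.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(x0, t, num_steps=1000):
    """Corrupt clean data x0 to noise level t (variance-preserving form).

    Assumes a simple linear alpha-bar schedule purely for illustration.
    """
    alpha_bar = 1.0 - t / num_steps           # fraction of signal kept
    eps = rng.standard_normal(x0.shape)       # fresh Gaussian noise
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

x0 = rng.standard_normal((8, 8))              # stand-in for an image
x_mid = add_noise(x0, t=500)                  # half-corrupted
x_end = add_noise(x0, t=1000)                 # pure noise: structure destroyed
```

At t = 1000 the signal coefficient is zero, so all image structure is gone; generation runs this corruption in reverse, starting from exactly such pure noise.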

Single-step predictions converge toward probabilistic averages

When asked to predict a clean image from a noisy input in a single step, an MSE-trained model outputs the average of all plausible source images. The resulting intermediate predictions are blurry: they indicate a direction to move in rather than a final result.
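The averaging effect follows from the fact that the MSE-optimal denoiser is the posterior mean. A toy example makes this concrete: suppose the "dataset" contains only two clean signals, +1 and -1, observed under Gaussian noise (this setup and the `optimal_denoiser` helper are illustrative, not anything from the talk).

```python
import numpy as np

def optimal_denoiser(y, sigma):
    """Posterior mean E[x | y] for a two-point prior x in {+1, -1}
    under Gaussian noise of scale sigma. This is the MSE-optimal output."""
    # Bayes rule gives log p(+1|y) - log p(-1|y) = 2*y / sigma^2,
    # and the posterior mean reduces to tanh(y / sigma^2).
    return np.tanh(y / sigma**2)

# Heavy noise: the observation barely identifies the source, so the
# optimal prediction sits near the average of +1 and -1, i.e. near 0.
print(optimal_denoiser(0.5, sigma=3.0))   # ~0.06: the "blurry" average
# Light noise: the observation resolves the ambiguity.
print(optimal_denoiser(0.9, sigma=0.1))   # ~1.0: a confident clean value
```

For images, "near the average of all candidates" manifests as blur, which is why the single-step prediction is treated as a direction rather than a finished sample.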

Sampling requires conservative steps with noise reinjection

Effective generation involves taking a small step toward the predicted clean image and then adding fresh random noise, a technique that prevents error accumulation from imperfect neural network predictions.
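The step-then-reinject loop can be sketched as follows. This is a schematic refinement loop, not a tuned sampler: the step sizes, noise scale, and the stand-in `denoise` function (a trained network in practice) are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sampling_step(x, denoise, step_size=0.1, noise_scale=0.05):
    """One conservative refinement step (illustrative only).

    Move a small fraction of the way toward the model's clean prediction,
    then reinject fresh noise so prediction errors don't accumulate.
    """
    x_clean_est = denoise(x)                             # model's blurry guess
    x = x + step_size * (x_clean_est - x)                # small step toward it
    x = x + noise_scale * rng.standard_normal(x.shape)   # fresh noise
    return x

# Stand-in denoiser that pulls values toward zero; a real sampler would
# call the trained diffusion network here.
denoise = lambda x: 0.5 * x

x = rng.standard_normal((8, 8))   # start from pure noise
for _ in range(50):               # iterative refinement toward a sample
    x = sampling_step(x, denoise)
```

Taking many small steps with noise reinjection, rather than trusting one large jump, is what lets the sampler correct for the imperfect single-step predictions described above.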

Bottom Line

For large-scale generative media projects, prioritize investments in data curation and learned latent compression before refining model architecture, as these foundational choices determine training feasibility and output quality more than incremental algorithmic improvements.
