Building Generative Image & Video models at Scale - Sander Dieleman (Veo and Nano Banana)

Podcasts | April 21, 2026 | 3.96 thousand views | 40:46

TL;DR

Sander Dieleman from Google DeepMind explains the technical foundations of training large-scale generative image and video models like Veo, emphasizing that meticulous data curation and learned latent representations are as critical as the diffusion architecture itself. He details how diffusion models reverse a noise corruption process through iterative refinement rather than single-step prediction.

🎯 Data Curation & Latent Representation (3 insights)

Data quality investments trump architecture tweaks

Time spent improving data curation often yields better results than hyperparameter tuning or optimizer adjustments. Taking advantage of this means abandoning the academic tradition of training on predefined benchmark datasets.

Learned compressors preserve structure unlike traditional codecs

Autoencoders create latent representations that maintain topological grid structure and semantic content, whereas standard codecs like JPEG or H.265 obscure the data structure necessary for effective generative modeling.

Latent compression reduces memory by two orders of magnitude

Compressing high-resolution video into learned latent spaces reduces tensor sizes by roughly 100x, making it feasible to train on 1080p video sequences that would otherwise require several gigabytes of memory per example.
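
The arithmetic behind this claim can be sketched in a few lines. The clip length, frame rate, float32 storage, and the latent downsampling factors (8x spatial, 4x temporal, 8 latent channels) are illustrative assumptions, not figures from the talk; they were chosen only because they land near the quoted 100x.

```python
def tensor_bytes(frames, height, width, channels, bytes_per_value=4):
    """Size in bytes of a dense video tensor (float32 by default)."""
    return frames * height * width * channels * bytes_per_value

# Assumed clip: 5 seconds of 1080p RGB video at 24 fps.
pixel_bytes = tensor_bytes(frames=120, height=1080, width=1920, channels=3)

# Assumed latent grid: 4x fewer frames, 8x smaller in each spatial
# dimension, 8 latent channels. These factors are hypothetical.
latent_bytes = tensor_bytes(frames=120 // 4, height=1080 // 8, width=1920 // 8, channels=8)

ratio = pixel_bytes / latent_bytes
print(f"pixels: {pixel_bytes / 1e9:.2f} GB")
print(f"latents: {latent_bytes / 1e6:.1f} MB")
print(f"reduction: {ratio:.0f}x")
```

Under these assumptions the raw clip is close to 3 GB, while the latent tensor is a few tens of megabytes. The latent tensor is still a frames-by-height-by-width grid, which is the structure the previous insight says a learned compressor preserves.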

🌊 Diffusion Model Mechanics (3 insights)

Reverse corruption process enables iterative generation

Diffusion models learn to reverse a gradual noise-addition process that destroys image structure, allowing generation to start from pure noise and iteratively refine toward coherent outputs.
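
A minimal sketch of the corruption side of this process, on a short list of numbers standing in for pixels. The cosine and sine mixing weights are one common choice and an assumption here; the summary does not say which noise schedule Veo uses, and the function name `corrupt` is hypothetical.

```python
import math
import random

def corrupt(x0, t, rng):
    """Mix clean values x0 with Gaussian noise.

    t = 0 keeps the signal intact; t = 1 leaves (almost) only noise.
    The cos/sin weights are an assumed schedule, not one from the talk.
    """
    signal = math.cos(0.5 * math.pi * t)
    noise_level = math.sin(0.5 * math.pi * t)
    return [signal * v + noise_level * rng.gauss(0.0, 1.0) for v in x0]

rng = random.Random(0)
clean = [1.0, -2.0, 0.5]
for t in (0.0, 0.5, 1.0):
    print(t, corrupt(clean, t, rng))
```

Training would show a network many such corrupted examples at random values of `t` and ask it to recover the clean signal; generation then runs the process in the opposite direction, starting at `t = 1`.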

Single-step predictions converge toward probabilistic averages

When predicting a clean image from a noisy input in a single step, the model outputs the average of all clean images that could have produced that input. The resulting blurry prediction indicates a direction to move in rather than a final result.
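
The averaging effect can be seen exactly in a toy case. Assume the only possible "images" are the two numbers -1 and +1, equally likely, and that the noisy input is the clean value plus Gaussian noise of standard deviation `sigma`. For this setup the prediction that minimizes squared error has a closed form, tanh(x_t / sigma^2); the setup and the name `posterior_mean` are illustrative assumptions, not material from the talk.

```python
import math

def posterior_mean(x_t, sigma):
    """Best squared-error guess of x0 in {-1, +1} given x_t = x0 + sigma * noise."""
    return math.tanh(x_t / sigma**2)

# Heavy noise: the input barely distinguishes the two candidates, so the
# prediction sits near 0, their average (the 1-D analogue of a blurry image).
print(posterior_mean(0.3, sigma=5.0))

# Light noise: the input is far more consistent with +1, so the
# prediction commits to it.
print(posterior_mean(0.3, sigma=0.1))
```

The heavy-noise output is not a valid sample (0 is not one of the two possible values), but its sign and size still say which way to move.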

Sampling requires conservative steps with noise reinjection

Effective generation involves taking a small step toward the predicted clean image and then adding fresh random noise, a technique that keeps errors from imperfect neural network predictions from accumulating.
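
A minimal sketch of such a sampling loop, on toy data whose only clean values are -1 and +1. The exact tanh denoiser stands in for a trained network, and the geometric noise schedule, step count, and function names are assumptions for illustration. The update shown is a standard ancestral-style step for additive Gaussian noise, not necessarily the sampler used for Veo.

```python
import math
import random

def denoise(x, sigma):
    # Exact best squared-error prediction for toy data x0 in {-1, +1};
    # a stand-in for a trained neural network.
    return math.tanh(x / sigma**2)

def sample(rng, steps=50, sigma_max=10.0, sigma_min=0.01):
    # Assumed geometric schedule from sigma_max down to sigma_min, then 0.
    sigmas = [sigma_max * (sigma_min / sigma_max) ** (i / (steps - 1))
              for i in range(steps)]
    sigmas.append(0.0)
    x = rng.gauss(0.0, sigma_max)  # start from (almost) pure noise
    for sigma, sigma_next in zip(sigmas, sigmas[1:]):
        x0_hat = denoise(x, sigma)
        keep = sigma_next**2 / sigma**2
        # Move only part of the way toward the predicted clean value...
        x = x0_hat + keep * (x - x0_hat)
        # ...then reinject fresh noise sized so the total noise level
        # after the step matches sigma_next.
        x += math.sqrt(sigma_next**2 * (1.0 - keep)) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
print([round(sample(rng), 3) for _ in range(8)])
```

Each iteration discards part of the old noise and replaces it with new noise, so a slightly wrong prediction at one step is partly washed out before the next prediction is made.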

Bottom Line

For large-scale generative media projects, prioritize investments in data curation and learned latent compression before refining model architecture, as these foundational choices determine training feasibility and output quality more than incremental algorithmic improvements.
