Building Generative Image & Video models at Scale - Sander Dieleman (Veo and Nano Banana)

AI Engineer

Podcasts | April 21, 2026 | 4.16 thousand views | 40:46

TL;DR

Sander Dieleman from Google DeepMind explains the technical foundations of training large-scale generative image and video models like Veo, emphasizing that meticulous data curation and learned latent representations are as critical as the diffusion architecture itself. He details how diffusion models reverse a noise corruption process through iterative refinement rather than single-step prediction.

🎯 Data Curation & Latent Representation

Data quality investments trump architecture tweaks

Time spent improving data curation often yields better results than hyperparameter tuning or optimizer adjustments. Taking advantage of this means researchers have to abandon the academic tradition of treating a predefined benchmark dataset as fixed.

Learned compressors preserve structure unlike traditional codecs

Autoencoders create latent representations that maintain topological grid structure and semantic content, whereas standard codecs like JPEG or H.265 obscure the data structure necessary for effective generative modeling.

Latent compression reduces memory by two orders of magnitude

Compressing high-resolution video into learned latent spaces reduces tensor sizes by roughly 100x, making it feasible to train on 1080p video sequences that would otherwise require several gigabytes of memory per example.
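As a rough back-of-the-envelope check of the two-orders-of-magnitude figure, the sketch below counts tensor elements for one clip before and after compression. The clip length (10 seconds at 24 fps), float32 storage, and the autoencoder's downsampling factors (8x per spatial axis, 4x in time, 8 latent channels) are illustrative assumptions, not values taken from the talk.

```python
def tensor_elements(frames, height, width, channels):
    """Number of scalar values in a video tensor of the given shape."""
    return frames * height * width * channels

BYTES_PER_FLOAT32 = 4

# Assumed raw clip: 10 s of 1080p RGB video at 24 fps.
raw_elements = tensor_elements(frames=240, height=1080, width=1920, channels=3)

# Assumed (hypothetical) autoencoder: 4x temporal and 8x8 spatial
# downsampling, with 8 latent channels per grid position.
latent_elements = tensor_elements(
    frames=240 // 4, height=1080 // 8, width=1920 // 8, channels=8
)

raw_gb = raw_elements * BYTES_PER_FLOAT32 / 1e9
latent_gb = latent_elements * BYTES_PER_FLOAT32 / 1e9
ratio = raw_elements / latent_elements

print(f"raw clip:    {raw_gb:.2f} GB")     # 5.97 GB
print(f"latent clip: {latent_gb:.3f} GB")  # 0.062 GB
print(f"reduction:   {ratio:.0f}x")        # 96x
```

Under these assumptions the raw clip lands in the "several gigabytes" range and the latent is about 96 times smaller, which is consistent with the roughly 100x figure above. Different downsampling factors would shift the ratio accordingly.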

🌊 Diffusion Model Mechanics

Reverse corruption process enables iterative generation

Diffusion models learn to reverse a gradual noise-addition process that destroys image structure, allowing generation to start from pure noise and iteratively refine toward coherent outputs.
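The corruption side of this process is simple enough to sketch in a few lines. The example below is a minimal illustration, assuming additive Gaussian noise whose standard deviation `sigma` sets the noise level; the summary does not say which noise schedule the models in the talk use, and the 16-element step edge standing in for an image is hypothetical.

```python
import random

def corrupt(x0, sigma, rng):
    """Add Gaussian noise with standard deviation sigma to every element of x0."""
    return [x + sigma * rng.gauss(0.0, 1.0) for x in x0]

def mean_abs_error(a, b):
    """Average absolute difference between two equal-length sequences."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)

# Hypothetical 1-D "image": a clean step edge.
clean = [0.0] * 8 + [1.0] * 8

# Increasing sigma progressively drowns out the edge.
for sigma in (0.0, 0.1, 1.0, 10.0):
    noisy = corrupt(clean, sigma, random.Random(0))
    err = mean_abs_error(noisy, clean)
    print(f"sigma={sigma:5.1f}  mean |noisy - clean| = {err:.3f}")
```

At `sigma = 0` the signal is untouched. Once `sigma` is much larger than the signal's own range, the output is effectively pure noise, which is the state generation starts from.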

Single-step predictions converge toward probabilistic averages

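When asked to predict a clean image from a noisy input, the model outputs an average over all the source images that could have produced that input, weighted by how plausible each one is. At high noise levels many images are plausible, so the intermediate predictions are blurry. They should be read as directional guidance rather than as final results.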
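The averaging effect can be reproduced exactly in a toy setting. The sketch below assumes a hypothetical dataset of just two equally likely scalar "images", -1 and +1, corrupted by additive Gaussian noise of standard deviation `sigma`. It computes the prediction that minimizes squared error, which is the posterior mean; a trained network would only approximate this.

```python
import math

def posterior_mean(x_t, sigma, data):
    """Squared-error-optimal guess of the clean value behind x_t.

    Each candidate in `data` is weighted by the Gaussian likelihood of
    having produced x_t at noise level sigma (uniform prior assumed).
    """
    log_w = [-((x_t - x) ** 2) / (2.0 * sigma ** 2) for x in data]
    top = max(log_w)  # subtract the max before exp() to avoid underflow
    w = [math.exp(lw - top) for lw in log_w]
    return sum(wi * x for wi, x in zip(w, data)) / sum(w)

data = [-1.0, 1.0]  # hypothetical two-image dataset

# High noise: both sources are about equally plausible, so the
# prediction sits between them and matches neither.
print(posterior_mean(0.5, 5.0, data))

# Low noise: only one source is plausible, so the prediction commits.
print(posterior_mean(0.9, 0.1, data))
```

For this two-point dataset the posterior mean has the closed form `tanh(x_t / sigma**2)`, which makes the behavior easy to check: near zero when the noise dominates, near plus or minus one when it does not. With real images the in-between value shows up as blur.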

Sampling requires conservative steps with noise reinjection

Effective generation takes a small step from the current noisy sample toward the predicted clean image, then adds a smaller amount of fresh random noise, and repeats. Keeping the steps small and reinjecting noise prevents errors from imperfect neural network predictions from accumulating.
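A minimal sketch of such a sampling loop is shown below, under several assumptions that are not from the talk: scalar data with two possible values, additive Gaussian noise, a geometric noise schedule, and the exact posterior mean standing in for the trained network. The update rule is the standard ancestral (DDPM-style) step for this noise model, used here as one concrete instance of "step toward the prediction, then add fresh noise".

```python
import math
import random

DATA = [-1.0, 1.0]  # hypothetical two-point "dataset"

def denoise(x_t, sigma):
    """Stand-in for the trained network: exact posterior mean of the clean value."""
    log_w = [-((x_t - x) ** 2) / (2.0 * sigma ** 2) for x in DATA]
    top = max(log_w)
    w = [math.exp(lw - top) for lw in log_w]
    return sum(wi * x for wi, x in zip(w, DATA)) / sum(w)

def sample(rng, sigma_max=10.0, sigma_min=0.01, steps=200):
    """Generate one sample by iterative refinement from (almost) pure noise."""
    # Geometric schedule: noise level shrinks by a constant factor per step.
    sigmas = [
        sigma_max * (sigma_min / sigma_max) ** (i / (steps - 1))
        for i in range(steps)
    ]
    x = sigma_max * rng.gauss(0.0, 1.0)
    for s_now, s_next in zip(sigmas[:-1], sigmas[1:]):
        x0_hat = denoise(x, s_now)
        keep = (s_next / s_now) ** 2
        # Conservative step: move only part of the way to the prediction.
        x = x + (1.0 - keep) * (x0_hat - x)
        # Reinject fresh noise so the sample matches the next noise level.
        x += math.sqrt(keep * (s_now ** 2 - s_next ** 2)) * rng.gauss(0.0, 1.0)
    # Final clean-up prediction at the lowest noise level.
    return denoise(x, sigmas[-1])

rng = random.Random(0)
samples = [sample(rng) for _ in range(64)]
print(sorted(set(round(s) for s in samples)))
```

Each sample should end up at one of the two data values rather than at their average, even though every individual prediction early in the loop is close to that average. Jumping straight to the first prediction would instead return a value near zero, the "blurry" answer.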

Bottom Line

For large-scale generative media projects, prioritize investments in data curation and learned latent compression before refining model architecture, as these foundational choices determine training feasibility and output quality more than incremental algorithmic improvements.
