Building Generative Image & Video models at Scale - Sander Dieleman (Veo and Nano Banana)
TL;DR
Sander Dieleman from Google DeepMind explains the technical foundations of training large-scale generative image and video models like Veo, emphasizing that meticulous data curation and learned latent representations are as critical as the diffusion architecture itself. He details how diffusion models reverse a noise corruption process through iterative refinement rather than single-step prediction.
🎯 Data Curation & Latent Representation 3 insights
Data quality investments trump architecture tweaks
Time spent improving data curation often yields better results than hyperparameter tuning or optimizer adjustments. Taking advantage of this means abandoning the academic tradition of treating a predefined benchmark dataset as fixed.
Learned compressors preserve structure unlike traditional codecs
Autoencoders learn latent representations that keep the spatial grid layout and semantic content of the input, whereas standard codecs such as JPEG or H.265 produce compressed bitstreams that obscure the structure generative models rely on.
Latent compression reduces memory by two orders of magnitude
Compressing high-resolution video into learned latent spaces reduces tensor sizes by roughly 100x, making it feasible to train on 1080p video sequences that would otherwise require several gigabytes of memory per example.
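The arithmetic behind these figures can be sketched as follows. Every specific number here is an illustrative assumption and not a value from the talk: a 10-second clip at 24 fps, float32 storage, and a hypothetical autoencoder that downsamples 8x spatially and 4x temporally into 8 latent channels.

```python
def tensor_bytes(frames, height, width, channels, bytes_per_value=4):
    """Size in bytes of a dense video tensor (float32 by default)."""
    return frames * height * width * channels * bytes_per_value

# Assumed clip: 240 frames (10 s at 24 fps) of 1080p RGB video.
pixel_bytes = tensor_bytes(frames=240, height=1080, width=1920, channels=3)

# Assumed (hypothetical) latent shape: 4x fewer frames, 8x smaller in each
# spatial dimension, 8 latent channels instead of 3 color channels.
latent_bytes = tensor_bytes(frames=240 // 4, height=1080 // 8,
                            width=1920 // 8, channels=8)

ratio = pixel_bytes / latent_bytes

print(f"pixels:  {pixel_bytes / 1e9:.2f} GB")
print(f"latents: {latent_bytes / 1e6:.1f} MB")
print(f"reduction: {ratio:.0f}x")
```

Under these assumptions the pixel tensor is close to 6 GB per example and the latent tensor is about 62 MB, a reduction of 96x. Different downsampling factors or channel counts would shift the ratio, but it stays in the same order of magnitude for typical choices.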
🌊 Diffusion Model Mechanics 3 insights
Reverse corruption process enables iterative generation
Diffusion models learn to reverse a gradual noise-addition process that destroys image structure, allowing generation to start from pure noise and iteratively refine toward coherent outputs.
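A minimal sketch of the corruption side of this process, assuming a simple linear blend between the clean signal and Gaussian noise. The talk summary does not specify the noise schedule, so the `corrupt` function and the `t` parameterization are illustrative choices only.

```python
import numpy as np

def corrupt(x0, t, rng):
    """Blend a clean signal with Gaussian noise.

    t = 0 returns the clean signal, t = 1 returns pure noise.
    The linear blend is an assumed schedule, chosen for simplicity.
    """
    noise = rng.standard_normal(x0.shape)
    return (1.0 - t) * x0 + t * noise

rng = np.random.default_rng(0)

# Toy "image": a smooth 32x32 gradient with values in [-1, 1].
x0 = np.linspace(-1.0, 1.0, 32 * 32).reshape(32, 32)

levels = [0.0, 0.25, 0.5, 0.75, 1.0]
correlations = [
    float(np.corrcoef(x0.ravel(), corrupt(x0, t, rng).ravel())[0, 1])
    for t in levels
]

for t, c in zip(levels, correlations):
    print(f"t={t:.2f}  correlation with clean image: {c:.2f}")
```

The correlation with the clean image falls from 1 toward 0 as `t` grows, which is the gradual destruction of structure that the model is trained to undo.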
Single-step predictions converge toward probabilistic averages
When asked to predict the clean image from a noisy input in a single step, the model outputs roughly the average of all clean images that could have produced that input. The resulting intermediate predictions are blurry, so they are best read as a direction to move in rather than as a finished result.
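The averaging effect can be illustrated with a toy case that can be solved exactly. The sketch assumes the clean data is one of two equally likely four-pixel "images", additive Gaussian noise, and a denoiser that minimizes squared error (so its ideal output is the posterior mean). None of these specifics come from the talk, and `posterior_mean` is a hypothetical helper, not a library function.

```python
import numpy as np

def posterior_mean(x_noisy, candidates, sigma):
    """Squared-error-optimal clean prediction when the clean data is one of
    `candidates` (equally likely) and the noise is Gaussian with std sigma."""
    sq_dist = np.sum((candidates - x_noisy) ** 2, axis=1)
    logits = -sq_dist / (2.0 * sigma ** 2)
    weights = np.exp(logits - logits.max())  # subtract max for stability
    weights /= weights.sum()
    return weights @ candidates, weights

# Two possible clean "images" with four pixels each.
candidates = np.array([[1.0, 0.0, 0.0, 1.0],
                       [0.0, 1.0, 1.0, 0.0]])

# A fixed observation lying close to the first candidate.
x_noisy = np.array([1.1, -0.1, 0.05, 0.95])

# Interpret the same observation under a high and a low noise level.
high_pred, high_w = posterior_mean(x_noisy, candidates, sigma=10.0)
low_pred, low_w = posterior_mean(x_noisy, candidates, sigma=0.1)

print("high noise:", np.round(high_pred, 3), "weights", np.round(high_w, 3))
print("low noise: ", np.round(low_pred, 3), "weights", np.round(low_w, 3))
```

At the high noise level the observation says almost nothing about which image it came from, so the prediction sits near 0.5 in every pixel, the blurry average of both candidates. At the low noise level the prediction commits to the first candidate.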
Sampling requires conservative steps with noise reinjection
Effective generation takes a small step toward the predicted clean image and then adds fresh random noise, a technique that keeps errors from imperfect neural network predictions from accumulating.
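One way to sketch such a loop is an Euler-ancestral-style sampler, a standard technique swapped in here for illustration and not necessarily the one described in the talk. To keep the example self-contained, the neural network is replaced by an exact denoiser for a two-point toy dataset. The noise schedule and all names (`ideal_denoiser`, `sample`) are assumptions.

```python
import numpy as np

def ideal_denoiser(x, sigma, data):
    """Stand-in for a trained network: the exact squared-error-optimal
    clean prediction when the data is a small set of equally likely points."""
    sq_dist = np.sum((data - x) ** 2, axis=1)
    logits = -sq_dist / (2.0 * sigma ** 2)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ data

def sample(data, sigmas, rng):
    """Euler-ancestral-style sampling over a decreasing noise schedule."""
    x = sigmas[0] * rng.standard_normal(data.shape[1])  # start from pure noise
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        x0_hat = ideal_denoiser(x, sigma, data)
        # Split the target noise level into a deterministic part (sigma_down)
        # and a part supplied by fresh noise (sigma_up).
        sigma_up = min(sigma_next,
                       np.sqrt(sigma_next**2 * (sigma**2 - sigma_next**2) / sigma**2))
        sigma_down = np.sqrt(sigma_next**2 - sigma_up**2)
        # Small step toward the predicted clean point.
        direction = (x - x0_hat) / sigma
        x = x + direction * (sigma_down - sigma)
        # Reinject fresh noise.
        x = x + sigma_up * rng.standard_normal(x.shape)
    return x

data = np.array([[-1.0, -1.0], [1.0, 1.0]])              # toy dataset
sigmas = np.append(np.geomspace(10.0, 0.01, 50), 0.0)    # assumed schedule
rng = np.random.default_rng(0)
samples = np.array([sample(data, sigmas, rng) for _ in range(20)])

print(np.round(samples, 3))
```

Each run starts from noise much wider than the data and ends on one of the two data points. With a learned, imperfect denoiser the same loop applies; the number of steps and the amount of reinjected noise then become tuning choices.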
Bottom Line
For large-scale generative media projects, prioritize investments in data curation and learned latent compression before refining model architecture, as these foundational choices determine training feasibility and output quality more than incremental algorithmic improvements.