Building Generative Image & Video Models at Scale - Sander Dieleman (Veo and Nano Banana)
TL;DR
Sander Dieleman from Google DeepMind explains the technical foundations of training large-scale generative image and video models like Veo, emphasizing that meticulous data curation and learned latent representations are as critical as the diffusion architecture itself. He details how diffusion models reverse a noise corruption process through iterative refinement rather than single-step prediction.
🎯 Data Curation & Latent Representation
Data quality investments trump architecture tweaks
Time spent improving data curation often yields better results than hyperparameter tuning or optimizer adjustments, requiring researchers to abandon the academic tradition of using predefined benchmark datasets.
Learned compressors preserve structure unlike traditional codecs
Autoencoders create latent representations that maintain topological grid structure and semantic content, whereas standard codecs like JPEG or H.265 obscure the data structure necessary for effective generative modeling.
Latent compression reduces memory by two orders of magnitude
Compressing high-resolution video into learned latent spaces reduces tensor sizes by roughly 100x, making it feasible to train on 1080p video sequences that would otherwise require several gigabytes of memory per example.
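The memory arithmetic behind that claim can be sketched with a quick back-of-the-envelope calculation. All of the numbers below (clip length, downsampling factors, latent channel count) are illustrative assumptions in the ballpark of typical video autoencoders, not Veo's actual configuration:

```python
# Back-of-the-envelope comparison: raw 1080p video tensor vs. a learned
# latent representation. Illustrative assumptions, not Veo's configuration.

def tensor_bytes(frames, height, width, channels, bytes_per_value=4):
    """Memory for a float32 video tensor of shape (frames, H, W, C)."""
    return frames * height * width * channels * bytes_per_value

# 5 seconds of 1080p RGB video at 24 fps, stored as float32.
raw = tensor_bytes(frames=120, height=1080, width=1920, channels=3)

# Hypothetical autoencoder: 8x spatial and 4x temporal downsampling,
# with 8 latent channels.
latent = tensor_bytes(frames=120 // 4, height=1080 // 8, width=1920 // 8,
                      channels=8)

print(f"raw:         {raw / 1e9:.2f} GB")      # ~2.99 GB
print(f"latent:      {latent / 1e6:.1f} MB")   # ~31.1 MB
print(f"compression: {raw / latent:.0f}x")     # ~96x
```

With these assumed factors the reduction comes out to roughly 96x, i.e. about two orders of magnitude, which is what makes training on long 1080p sequences tractable.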
🌊 Diffusion Model Mechanics
Reverse corruption process enables iterative generation
Diffusion models learn to reverse a gradual noise-addition process that destroys image structure, allowing generation to start from pure noise and iteratively refine toward coherent outputs.
Single-step predictions converge toward probabilistic averages
When predicting a clean image from noisy input, the model outputs the average of all plausible source images, so intermediate predictions look blurry; they serve as directional guidance for the next step rather than as final results.
Sampling requires conservative steps with noise reinjection
Effective generation involves taking small steps toward the predicted clean direction followed by adding fresh random noise, a technique that prevents error accumulation from imperfect neural network predictions.
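The three mechanics above can be illustrated with a toy 1-D sampler: each step the model predicts a clean estimate (the average of plausible sources), the sampler takes a conservative step toward it, and fresh noise is re-injected. The "denoiser" here is an oracle for a point-mass toy distribution standing in for a trained network, and the schedule values are arbitrary illustrative choices:

```python
import numpy as np

# Toy 1-D ancestral (DDPM-style) sampler. The oracle denoiser stands in
# for a trained neural network; all clean data equals 2.0, so the optimal
# x0-prediction (the average of all possible sources) is exactly 2.0.

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # gradual noise-addition schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # cumulative signal fraction per step

def denoise(x_t, t):
    # Oracle x0-prediction for point-mass data: always the data average.
    return 2.0

x = rng.standard_normal()            # generation starts from pure noise
for t in reversed(range(T)):
    x0_hat = denoise(x, t)           # (blurry) clean estimate
    ab = alpha_bars[t]
    ab_prev = alpha_bars[t - 1] if t > 0 else 1.0
    # Posterior mean: a small, conservative step from x toward x0_hat.
    coef_x0 = np.sqrt(ab_prev) * betas[t] / (1.0 - ab)
    coef_xt = np.sqrt(alphas[t]) * (1.0 - ab_prev) / (1.0 - ab)
    mean = coef_x0 * x0_hat + coef_xt * x
    # Re-inject fresh noise (except at the final step), which keeps
    # imperfect predictions from accumulating into systematic errors.
    sigma = np.sqrt(betas[t] * (1.0 - ab_prev) / (1.0 - ab))
    x = (mean + sigma * rng.standard_normal()) if t > 0 else mean

print(f"final sample: {x:.3f}")
```

Because the oracle denoiser is exact, the sample lands on the clean data value; with a learned network the same loop applies, just with noisier predictions, which is precisely why the small steps and noise re-injection matter.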
Bottom Line
For large-scale generative media projects, prioritize investments in data curation and learned latent compression before refining model architecture, as these foundational choices determine training feasibility and output quality more than incremental algorithmic improvements.