Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 6 - Model Training

Stanford Online

| Podcasts | May 19, 2026 | 11.2 Thousand views | 1:40:58

TL;DR

This lecture covers practical training of text-to-image diffusion models, detailing the evolution from UNet to Diffusion Transformer architectures, the three-stage training pipeline (pre-training, post-training, and tuning), and critical optimizations including flow matching loss functions, logit-normal time step sampling, and resolution-aware noise scheduling.

🏗️ Architecture Evolution & Training Pipeline 3 insights

From UNet to Diffusion Transformers

UNet architectures dominated through 2022, relying on convolutional inductive biases to build local-to-global feature hierarchies, but they struggle to relate distant spatial regions directly. Diffusion Transformers use self-attention so that any image patch can attend to any other patch, which addresses long-range consistency failures such as mirror reflections.

Three-Stage Production Pipeline

Developing a production-ready model starts with compute-intensive pre-training for general generation, followed by post-training for quality refinement and optional domain-specific tuning. A final distillation step then reduces the number of iterative sampling steps to cut inference latency.

Multimodal Conditioning Advances

Modern MMDiT architectures treat text as a standalone input modality rather than injecting it through adaptive layer normalization, which applies the same modulation to every patch embedding. Avoiding that blanket modulation enables more precise generative control.

🎯 Loss Functions & Flow Matching 2 insights

Three Theoretical Perspectives

Training objectives derive from DDPM noise prediction, score matching for distribution gradients, and flow matching for velocity fields, with flow matching now serving as the industry default due to its direct regression approach and training stability.

Flow Matching Implementation

The standard loss minimizes the L2 distance between the predicted and target vector fields. Each training step samples a time step, a training example, and Gaussian noise, and the model learns the optimal transport path from latent noise to clean images.
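As a rough sketch, the objective described above can be written out as follows. The symbols are assumptions introduced here for illustration, not notation taken from the lecture: x_0 is Gaussian noise, x_1 is a training image (or its latent), v_θ is the model, and the straight-line interpolation between noise and data is one common choice of path. Other conventions reverse the direction of t.

```latex
x_t = (1 - t)\,x_0 + t\,x_1,
\qquad x_0 \sim \mathcal{N}(0, I),\quad x_1 \sim p_{\text{data}},\quad t \in [0, 1]

\mathcal{L}_{\text{FM}}(\theta)
  = \mathbb{E}_{t,\,x_0,\,x_1}
    \left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|_2^2
```

Under this assumed path the regression target x_1 - x_0 is simply the time derivative of x_t, which is why the loss reduces to a direct regression.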

⏱️ Time Step Sampling Optimizations 2 insights

Logit-Normal Distribution Sampling

Practitioners replace uniform time step sampling with logit-normal distributions to emphasize middle noise levels (t≈0.5) where structural decisions are hardest, while de-emphasizing trivial early steps (predicting dataset means from pure noise) and late steps (minor detail refinement).
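A minimal sketch of logit-normal sampling, assuming the usual construction: draw a Gaussian value and pass it through a sigmoid. The function name `sample_logit_normal` and the defaults `mean=0.0`, `std=1.0` are illustrative choices, not values stated in the lecture.

```python
import math
import random


def sample_logit_normal(n, mean=0.0, std=1.0, seed=None):
    """Draw n time steps in (0, 1) from a logit-normal distribution.

    Each sample is sigmoid(u) with u ~ Normal(mean, std). With mean=0 the
    density peaks at t=0.5 and falls off toward t=0 and t=1.
    The default mean/std are assumptions for illustration only.
    """
    rng = random.Random(seed)
    return [1.0 / (1.0 + math.exp(-rng.gauss(mean, std))) for _ in range(n)]


if __name__ == "__main__":
    ts = sample_logit_normal(10_000, seed=0)
    # Fraction of samples in the middle half of the range; uniform
    # sampling would put about 50% of samples here.
    middle = sum(0.25 <= t <= 0.75 for t in ts) / len(ts)
    print(f"fraction of samples in [0.25, 0.75]: {middle:.3f}")
```

Shifting `mean` away from zero moves the emphasis toward one end of the schedule, and a larger `std` flattens the distribution toward the endpoints.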

Resolution-Aware Noise Shifting

Low-resolution images appear noisier than high-resolution images at identical noise levels, because neighboring pixels in a high-resolution image are strongly correlated and the signal survives the same per-pixel noise more easily. Time step rescaling functions therefore adjust the noise schedule based on image resolution to keep the perceived difficulty consistent.
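One commonly used form of such a rescaling function is sketched below. It is an assumption that the lecture uses this exact form. The sketch also assumes t is a noise level in [0, 1] with t=1 meaning pure noise, and the name `shift_timestep` and its parameters are hypothetical.

```python
import math


def shift_timestep(t, base_pixels, target_pixels):
    """Map a noise level t at a base resolution to a target resolution.

    Assumes t in [0, 1] with t=1 meaning pure noise. The shift factor is
    sqrt(target_pixels / base_pixels), so higher resolutions are pushed
    toward larger noise levels. The endpoints t=0 and t=1 are unchanged.
    """
    alpha = math.sqrt(target_pixels / base_pixels)
    return alpha * t / (1.0 + (alpha - 1.0) * t)


if __name__ == "__main__":
    base = 256 * 256
    for side in (256, 512, 1024):
        shifted = shift_timestep(0.5, base, side * side)
        print(f"{side}x{side}: t=0.5 -> {shifted:.3f}")
```

With a 256x256 base, the midpoint t=0.5 maps to 0.8 at 1024x1024, so the higher-resolution run spends more of its schedule at high noise.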

Bottom Line

Replace uniform time step sampling with logit-normal distributions emphasizing middle noise levels (t≈0.5) and implement resolution-aware noise shifting to maximize training efficiency and model performance in diffusion model development.
