Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 6 - Model Training

| Podcasts | May 19, 2026 | 4.52 Thousand views | 1:40:58

TL;DR

This lecture covers practical training of text-to-image diffusion models, detailing the evolution from UNet to Diffusion Transformer architectures, the three-stage training pipeline (pre-training, post-training, and tuning), and critical optimizations including flow matching loss functions, logit-normal time step sampling, and resolution-aware noise scheduling.

🏗️ Architecture Evolution & Training Pipeline 3 insights

From UNet to Diffusion Transformers

UNet architectures dominated through 2022, using convolutional inductive biases to build local-to-global feature hierarchies, but they struggled to relate distant spatial regions directly. Diffusion Transformers use self-attention so that any image patch can attend to any other patch, which addresses failure cases such as rendering consistent mirror reflections.

Three-Stage Production Pipeline

Developing production-ready models involves compute-intensive pre-training for general generation, post-training for quality refinement, and optional domain-specific tuning. A final distillation step then reduces the number of iterative sampling steps to cut inference latency.

Multimodal Conditioning Advances

Modern MMDiT architectures treat text as a standalone input modality rather than injecting it uniformly via adaptive layer normalization, avoiding blanket modulation across all patch embeddings and enabling more precise generative control.
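The contrast can be sketched in NumPy. This is a simplified illustration, not the lecture's code: all function names and weight matrices are hypothetical, and the joint-attention sketch shares one set of projection weights across both modalities for brevity, whereas MMDiT-style blocks typically keep separate weights per modality.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token over its feature dimension (no learned affine).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def adaln_modulate(patches, cond, w_scale, w_shift):
    """adaLN-style conditioning: one pooled text vector, same scale and
    shift applied to every patch. patches: (n_patches, d), cond: (d_c,)."""
    scale = cond @ w_scale  # (d,), identical for all patches
    shift = cond @ w_shift  # (d,), identical for all patches
    return layer_norm(patches) * (1.0 + scale) + shift

def joint_attention(patches, text_tokens, wq, wk, wv):
    """Joint attention over text and image tokens (single head).
    Assumption: shared projections for both modalities, for brevity."""
    n_text = text_tokens.shape[0]
    seq = np.concatenate([text_tokens, patches], axis=0)
    q, k, v = seq @ wq, seq @ wk, seq @ wv
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits = logits - logits.max(axis=-1, keepdims=True)  # stable softmax
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    out = weights @ v
    # Return updated image tokens and each patch's attention to the text tokens.
    return out[n_text:], weights[n_text:, :n_text]
```

In the first function every patch receives the same modulation, while in the second each patch gets its own attention weights over the individual text tokens.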

🎯 Loss Functions & Flow Matching 2 insights

Three Theoretical Perspectives

Training objectives derive from DDPM noise prediction, score matching for distribution gradients, and flow matching for velocity fields, with flow matching now serving as the industry default due to its direct regression approach and training stability.
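In commonly used notation (an assumption here; the lecture's own symbols may differ), the three objectives can be written side by side, where x_t is a noised sample, ε is Gaussian noise, λ(t) is a weighting function, and u_t is the conditional target velocity:

```latex
\begin{aligned}
\mathcal{L}_{\text{DDPM}}  &= \mathbb{E}_{t,\,x_0,\,\epsilon}\,
  \bigl\| \epsilon_\theta(x_t, t) - \epsilon \bigr\|^2 \\
\mathcal{L}_{\text{score}} &= \mathbb{E}_{t,\,x_0,\,\epsilon}\,\lambda(t)\,
  \bigl\| s_\theta(x_t, t) - \nabla_{x_t} \log p_t(x_t \mid x_0) \bigr\|^2 \\
\mathcal{L}_{\text{FM}}    &= \mathbb{E}_{t,\,x_0,\,\epsilon}\,
  \bigl\| v_\theta(x_t, t) - u_t(x_t \mid x_0) \bigr\|^2
\end{aligned}
```

All three are regressions onto a target computed from the sampled triple (t, x_0, ε); they differ in which quantity the network is asked to predict.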

Flow Matching Implementation

The standard loss minimizes the L2 distance between the predicted and target vector fields. Each training step samples a time step, a training example, and Gaussian noise, and the model learns the optimal transport path from latent noise to clean images.
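A minimal NumPy sketch of one loss evaluation follows. It assumes a straight-line interpolation path with t=0 as clean data and t=1 as pure noise (conventions vary, and the lecture may use the opposite direction); `model` is a hypothetical callable standing in for the velocity network.

```python
import numpy as np

def flow_matching_loss(model, x0, rng):
    """Monte Carlo estimate of the flow matching loss for one batch.

    Assumed convention: x_t = (1 - t) * x0 + t * eps, so the target
    velocity d x_t / dt is eps - x0.
    """
    batch = x0.shape[0]
    t = rng.uniform(0.0, 1.0, size=batch)            # one time step per sample
    eps = rng.standard_normal(x0.shape)              # Gaussian noise
    t_b = t.reshape(batch, *([1] * (x0.ndim - 1)))   # broadcast over non-batch axes
    x_t = (1.0 - t_b) * x0 + t_b * eps               # point on the straight path
    target = eps - x0                                # target velocity
    pred = model(x_t, t)                             # hypothetical network call
    return np.mean((pred - target) ** 2)             # mean squared (L2) error
```

Uniform sampling of `t` is used here only as the baseline; the time step distribution is a separate design choice.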

⏱️ Time Step Sampling Optimizations 2 insights

Logit-Normal Distribution Sampling

Practitioners replace uniform time step sampling with logit-normal distributions to emphasize middle noise levels (t≈0.5) where structural decisions are hardest, while de-emphasizing trivial early steps (predicting dataset means from pure noise) and late steps (minor detail refinement).
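A logit-normal sample is a Gaussian draw passed through a sigmoid. The sketch below shows this; the function name and the default `mean` and `std` values are illustrative assumptions, not settings taken from the lecture.

```python
import numpy as np

def sample_logit_normal(rng, size, mean=0.0, std=1.0):
    """Draw time steps in (0, 1) from a logit-normal distribution.

    With mean=0 the density peaks at t=0.5; a larger std spreads mass
    back toward the endpoints.
    """
    z = rng.normal(loc=mean, scale=std, size=size)  # Gaussian in logit space
    return 1.0 / (1.0 + np.exp(-z))                 # sigmoid maps to (0, 1)
```

With the defaults, noticeably more than half of the samples fall in [0.25, 0.75], compared with exactly half under uniform sampling.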

Resolution-Aware Noise Shifting

Low-resolution images appear noisier than high-resolution images at identical noise levels due to spatial pixel correlation, necessitating time step rescaling functions that adjust noise schedules based on image resolution to maintain consistent perceived difficulty.
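One published form of such a rescaling, used for rectified-flow models, maps a time step t at a base resolution to a shifted time step at the target resolution. Whether the lecture uses exactly this function is an assumption; the sketch also assumes t=1 is pure noise, and the function and parameter names are hypothetical.

```python
import numpy as np

def shift_timestep(t, target_pixels, base_pixels):
    """Shift time steps toward higher noise for larger images.

    alpha = sqrt(target_pixels / base_pixels);
    t_shifted = alpha * t / (1 + (alpha - 1) * t).
    The endpoints t=0 and t=1 are left unchanged.
    """
    alpha = np.sqrt(target_pixels / base_pixels)
    t = np.asarray(t, dtype=float)
    return alpha * t / (1.0 + (alpha - 1.0) * t)
```

For example, moving from 256×256 to 1024×1024 gives alpha = 4, so t = 0.5 maps to 0.8: the higher-resolution image receives more noise to match the perceived difficulty of the lower-resolution one.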

Bottom Line

Replace uniform time step sampling with logit-normal distributions emphasizing middle noise levels (t≈0.5) and implement resolution-aware noise shifting to maximize training efficiency and model performance in diffusion model development.
