Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 8 - Trending Topics
TL;DR
This final lecture traces the evolution of generative modeling from discrete-time diffusion to continuous-time flow matching, emphasizing that by 2026 flow matching, specifically its rectified flow variants, has become the industry default for efficient image generation.
🔄 Core Generation Paradigms 3 insights
Diffusion predicts noise to reverse corruption
Diffusion models learn a reverse process by minimizing an L2 loss between the Gaussian noise added to an image and the model's prediction of that noise; the objective is derived from an evidence lower bound (ELBO) on the data likelihood.
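A minimal sketch of the noise-prediction objective described above, in plain Python. The names `forward_noise`, `noise_prediction_loss`, and the toy value `alpha_bar=0.7` are illustrative assumptions, and the "model" is replaced by two stand-in predictors rather than a trained network.

```python
import math
import random

def forward_noise(x0, alpha_bar, rng):
    """Corrupt x0: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    x_t = [a * x + b * e for x, e in zip(x0, eps)]
    return x_t, eps

def noise_prediction_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise (the L2 loss)."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)

rng = random.Random(0)
x0 = [0.5, -1.0, 0.25]  # toy "image" with three pixels (assumed values)
x_t, eps = forward_noise(x0, alpha_bar=0.7, rng=rng)

# Stand-in predictors: an oracle that knows eps, and one that always says 0.
perfect = noise_prediction_loss(eps, eps)
zero_guess = noise_prediction_loss([0.0] * len(eps), eps)
```

The oracle reaches zero loss, and knowing the noise is enough to recover `x0` from `x_t`, which is why predicting noise amounts to learning to denoise.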
Score matching estimates data distribution gradients
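Score-based methods estimate the score, the gradient of the log probability density with respect to the data, and follow it from noise to data using Langevin dynamics. The normalizing constant does not depend on the data, so it drops out of the gradient and never has to be computed. For Gaussian noising, the score equals the negative of the added noise divided by the noise standard deviation, which links score estimation to noise prediction.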
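The relation between score and noise can be checked numerically in one dimension. This is a sketch assuming a Gaussian perturbation x_t = alpha * x0 + sigma * eps; the function names and numeric values are illustrative.

```python
import math

def log_gaussian(x, mean, sigma):
    """Log density of N(mean, sigma^2) at x."""
    return -0.5 * ((x - mean) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def conditional_score(x_t, mean, sigma):
    """Analytic gradient of log N(x_t; mean, sigma^2) with respect to x_t."""
    return -(x_t - mean) / sigma ** 2

x0, alpha, sigma, eps = 1.5, 0.8, 0.6, -0.4  # assumed toy values
mean = alpha * x0
x_t = mean + sigma * eps

score = conditional_score(x_t, mean, sigma)

# Central finite difference of the log density as an independent check.
h = 1e-5
finite_diff = (log_gaussian(x_t + h, mean, sigma)
               - log_gaussian(x_t - h, mean, sigma)) / (2 * h)

noise_form = -eps / sigma  # "negative noise divided by a coefficient"
```

All three quantities agree up to floating-point error, so a network that predicts `eps` gives the score after dividing by `-sigma`.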
Flow matching reframes generation as mass transport
Flow matching treats generation as moving probability mass via vector fields (velocities) from a prior to a target distribution, governed by the continuity equation and numerically solved as an ODE.
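As a rough illustration of solving the generation ODE numerically, the sketch below pushes samples from a standard normal prior to a Gaussian target with Euler steps. The closed-form `velocity` field stands in for a learned network, and the target parameters are assumptions chosen for the example.

```python
import random
import statistics

TARGET_MEAN, TARGET_STD = 3.0, 0.5  # hypothetical target distribution

def velocity(x, t):
    """Closed-form field moving N(0, 1) to N(TARGET_MEAN, TARGET_STD^2)."""
    scale = 1.0 - t + t * TARGET_STD
    x_start = (x - t * TARGET_MEAN) / scale  # where this particle began at t=0
    return (TARGET_STD - 1.0) * x_start + TARGET_MEAN

def sample(x, num_steps=20):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity(x, k * dt)
    return x

rng = random.Random(0)
prior = [rng.gauss(0.0, 1.0) for _ in range(5000)]
generated = [sample(x) for x in prior]

gen_mean = statistics.mean(generated)
gen_std = statistics.pstdev(generated)
```

The generated samples have mean and standard deviation close to the target values, showing probability mass carried from the prior to the target by the vector field.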
📈 Continuous Formulations 3 insights
SDEs unify discrete approaches
Stochastic differential equations generalize discrete noising steps into continuous-time forward processes with drift and diffusion terms; DDPM corresponds to the variance-preserving formulation, and noise-conditional score networks correspond to the variance-exploding formulation.
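For reference, the forward processes mentioned above are commonly written as follows. The symbols f, g, w, \beta(t), and \sigma(t) follow the standard score-SDE notation and are assumptions here, since the summary does not fix a notation.

```latex
% General forward SDE: drift f, diffusion coefficient g, Wiener process w
\mathrm{d}x = f(x, t)\,\mathrm{d}t + g(t)\,\mathrm{d}w

% Variance-preserving (continuous-time limit of DDPM), noise schedule \beta(t)
\mathrm{d}x = -\tfrac{1}{2}\,\beta(t)\,x\,\mathrm{d}t + \sqrt{\beta(t)}\,\mathrm{d}w

% Variance-exploding (continuous-time limit of score-network noising), noise scale \sigma(t)
\mathrm{d}x = \sqrt{\frac{\mathrm{d}\,[\sigma^{2}(t)]}{\mathrm{d}t}}\,\mathrm{d}w
```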
Reverse processes require score estimation
The reverse-time SDE depends on the score function, meaning models must estimate this quantity to transform noise back into clean images through either stochastic or deterministic trajectories.
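In the standard score-SDE notation (assumed here: forward drift f, diffusion coefficient g, marginal density p_t, reverse-time Wiener process \bar{w}), the stochastic and deterministic reverse trajectories take the form:

```latex
% Reverse-time SDE (stochastic trajectories), integrated from t = T down to t = 0
\mathrm{d}x = \bigl[ f(x, t) - g(t)^{2}\,\nabla_x \log p_t(x) \bigr]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{w}

% Probability flow ODE (deterministic trajectories with the same marginals p_t)
\mathrm{d}x = \bigl[ f(x, t) - \tfrac{1}{2}\,g(t)^{2}\,\nabla_x \log p_t(x) \bigr]\,\mathrm{d}t
```

In both cases the only unknown is the score \nabla_x \log p_t(x), which is the quantity the model must estimate.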
Rectified flow enables faster inference
Rectified flow variants straighten probability paths between distributions, allowing high-quality image generation with significantly fewer numerical integration steps than traditional curved diffusion trajectories.
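A toy comparison of why straight paths need fewer steps. For illustration it integrates the per-pair (conditional) velocity between one fixed noise sample `x0` and one data sample `x1`, not a learned marginal field, and the curved trigonometric path is an assumed stand-in for a diffusion-style trajectory.

```python
import math

def euler_integrate(velocity, x_start, num_steps):
    """Euler integration of dx/dt = velocity(t) from t=0 to t=1."""
    x, dt = x_start, 1.0 / num_steps
    for k in range(num_steps):
        x += dt * velocity(k * dt)
    return x

x0, x1 = 1.0, 2.0  # one noise sample and one data sample (assumed values)

def straight(t):
    """Velocity of the linear path x_t = (1 - t) * x0 + t * x1."""
    return x1 - x0

def curved(t):
    """Velocity of the path x_t = cos(pi t / 2) * x0 + sin(pi t / 2) * x1."""
    angle = math.pi * t / 2
    return (math.pi / 2) * (-math.sin(angle) * x0 + math.cos(angle) * x1)

err_straight_1 = abs(euler_integrate(straight, x0, 1) - x1)
err_curved_1 = abs(euler_integrate(curved, x0, 1) - x1)
err_curved_100 = abs(euler_integrate(curved, x0, 100) - x1)
```

The straight path lands on `x1` in a single step because its velocity is constant, while the curved path misses badly with one step and needs many steps to get close.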
🎯 Latent Representation and Control 3 insights
VAEs compress and structure latent spaces
Variational autoencoders reduce high-dimensional pixel redundancy into compact latent representations by combining reconstruction loss with KL divergence regularization toward a prior distribution.
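A sketch of the two-term VAE objective, assuming a diagonal Gaussian encoder, a standard normal prior, and squared-error reconstruction. The function names and the `beta` weight are illustrative and not taken from the lecture.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def vae_loss(x, x_recon, mu, log_var, beta=1.0):
    """Squared-error reconstruction term plus beta-weighted KL regularizer."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_recon))
    return recon + beta * kl_to_standard_normal(mu, log_var)

# An encoder output that already matches the prior pays no KL penalty.
kl_at_prior = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])

# Shifting the latent mean by 1 with unit variance costs 0.5.
kl_shifted = kl_to_standard_normal([1.0], [0.0])

example_loss = vae_loss(x=[1.0, 2.0], x_recon=[1.0, 2.5], mu=[1.0], log_var=[0.0])
```

The reconstruction term rewards faithful decoding, and the KL term pulls the latent codes toward the prior so the latent space stays well structured.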
Classifier-free guidance strengthens alignment
Classifier-free guidance combines the conditional and unconditional predictions of a single model at inference time, extrapolating past the conditional prediction when the guidance scale exceeds 1, to strengthen alignment between generated images and text prompts without training a separate classifier.
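One common way to write the guidance combination is sketched below. The scale convention (1 recovers the conditional prediction, 0 the unconditional one) is an assumption, since some implementations parameterize the scale differently, and `guided_prediction` is a hypothetical helper name.

```python
def guided_prediction(pred_uncond, pred_cond, guidance_scale):
    """Move from the unconditional prediction toward (and past) the conditional one."""
    return [u + guidance_scale * (c - u)
            for u, c in zip(pred_uncond, pred_cond)]

pred_uncond = [0.0, 1.0]  # toy model outputs without the prompt (assumed)
pred_cond = [1.0, 3.0]    # toy model outputs with the prompt (assumed)

no_guidance = guided_prediction(pred_uncond, pred_cond, 0.0)
plain_conditional = guided_prediction(pred_uncond, pred_cond, 1.0)
strong_guidance = guided_prediction(pred_uncond, pred_cond, 2.0)
```

With a scale of 2 the result is `[2.0, 5.0]`, which lies beyond the conditional prediction along the direction that separates it from the unconditional one.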
Multi-modal encoders bridge vision and language
Architectures like CLIP use contrastive learning to align image and text representations in shared spaces, enabling effective conditioning for text-to-image generation systems.
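A small sketch of a CLIP-style symmetric contrastive loss over a batch of paired embeddings. The two-dimensional embeddings are made up for illustration, the temperature value is an assumption, and real systems compute this over large batches with learned encoders.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def cross_entropy_diagonal(logits):
    """Average cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: image-to-text plus text-to-image."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image view
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

images = [[1.0, 0.0], [0.0, 1.0]]
aligned = clip_style_loss(images, [[0.9, 0.1], [0.1, 0.9]])  # matching pairs
swapped = clip_style_loss(images, [[0.1, 0.9], [0.9, 0.1]])  # mismatched pairs
```

The loss is near zero when each image is closest to its own caption and large when the pairs are swapped, which is the pressure that aligns the two modalities in a shared space.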
Bottom Line
Master flow matching with rectified flow paths, as it has become the dominant paradigm for image generation by 2026, offering superior inference efficiency compared to traditional diffusion or score-based methods.