Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 2 - Score matching
TL;DR
This lecture introduces score matching as an alternative to DDPM for generative modeling. Instead of predicting noise directly, the model estimates the gradient of the log probability density (the 'score'), and Langevin dynamics uses that estimate to move samples from noise toward the data distribution.
📐 The Score Function Fundamentals
Score defined as gradient of log probability
The score is defined as ∇ₓ log p(x), representing the direction of steepest ascent toward regions of higher probability density in the data space.
Log probability eliminates normalizing constants
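Unlike ∇ₓ p(x), which requires the intractable normalizing constant Z, the gradient of log p(x) does not depend on Z: writing p(x) = p̃(x)/Z with p̃ the unnormalized density, log p(x) = log p̃(x) - log Z, and ∇ₓ log Z = 0 because Z is constant in x.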
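The cancellation can be written out in a few lines. This is a short derivation sketch; the symbol p̃ for the unnormalized density is notation assumed here, not taken from the lecture.

```latex
\begin{aligned}
p(x) &= \frac{\tilde{p}(x)}{Z}, \qquad Z = \int \tilde{p}(x)\,dx, \\
\nabla_x p(x) &= \frac{1}{Z}\,\nabla_x \tilde{p}(x)
  && \text{(still depends on } Z\text{)}, \\
\nabla_x \log p(x) &= \nabla_x \log \tilde{p}(x) - \nabla_x \log Z
  = \nabla_x \log \tilde{p}(x)
  && \text{(} Z \text{ is constant in } x\text{)}.
\end{aligned}
```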
Numerical stability in low-density regions
Because ∇ log p = ∇p/p, the score rescales the gradient by the density itself. In sparse regions of the data space, where p(x) and ∇p(x) are both vanishingly small, the score therefore stays at a usable magnitude and can be computed in log space without underflow.
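The low-density behavior is easy to see on a 1D standard Gaussian, where both the density gradient and the score have closed forms. The following is a minimal sketch; the function names and the evaluation points are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Normalized 1D Gaussian density p(x)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def gaussian_pdf_grad(x, mu=0.0, sigma=1.0):
    """Gradient of the density: dp/dx = -(x - mu) / sigma^2 * p(x)."""
    return -(x - mu) / sigma**2 * gaussian_pdf(x, mu, sigma)

def gaussian_score(x, mu=0.0, sigma=1.0):
    """Score d/dx log p(x) = -(x - mu) / sigma^2, computed without evaluating p(x)."""
    return -(x - mu) / sigma**2

# One point near the mode, one in the tail, one far out in the tail.
xs = np.array([0.5, 5.0, 40.0])
density_grads = gaussian_pdf_grad(xs)
scores = gaussian_score(xs)

for x, g, s in zip(xs, density_grads, scores):
    print(f"x={x:5.1f}  grad p={g: .3e}  score={s: .1f}")
```

At x = 40 the density underflows to zero in float64, so the density gradient carries no directional information, while the score still points back toward the mean with magnitude 40.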
🎯 Sampling via Langevin Dynamics
Following scores leads to high-density regions
Starting from random noise, iteratively following the score direction moves samples toward regions of higher probability under the data distribution.
Stochastic sampling ensures diversity
Langevin sampling adds a stochastic noise term to the score-following process, preventing mode collapse and ensuring exploration of the full distribution rather than just its highest-density points.
MCMC method with theoretical guarantees
Langevin dynamics is a Markov Chain Monte Carlo method. The Fokker-Planck equation shows that its stationary distribution is the target distribution, so with an accurate score, a small step size, and enough steps, the samples converge to the true data distribution.
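The sampling procedure described in this section can be sketched with the standard unadjusted Langevin update, x ← x + ε ∇ₓ log p(x) + √(2ε) z with z ~ N(0, I). The toy target (an equal-weight mixture of two 1D Gaussians), the step size, the number of steps, and all function names below are assumptions chosen for illustration; the lecture's exact parameterization may differ.

```python
import numpy as np

MEANS = np.array([-2.0, 2.0])  # assumed toy target: two modes
STD = 0.5                      # shared standard deviation of both modes

def mixture_score(x):
    """Exact score of the equal-weight two-component 1D Gaussian mixture."""
    # Per-component log-density up to a shared constant, shape (n, 2).
    logits = -0.5 * ((x[:, None] - MEANS[None, :]) / STD) ** 2
    logits -= logits.max(axis=1, keepdims=True)   # stabilize the softmax
    resp = np.exp(logits)
    resp /= resp.sum(axis=1, keepdims=True)       # component responsibilities
    comp_scores = -(x[:, None] - MEANS[None, :]) / STD**2
    return (resp * comp_scores).sum(axis=1)

def langevin_sample(score_fn, x0, step_size=1e-2, n_steps=2000, rng=None):
    """Unadjusted Langevin dynamics: a score step plus Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    x = x0.copy()
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + step_size * score_fn(x) + np.sqrt(2.0 * step_size) * noise
    return x

rng = np.random.default_rng(0)
x_init = 3.0 * rng.standard_normal(5000)          # start from broad noise
samples = langevin_sample(mixture_score, x_init, rng=rng)

print("fraction in right mode:", np.mean(samples > 0))
print("spread within modes:   ", np.std(np.abs(samples)))
```

Dropping the noise term would turn this into plain gradient ascent, and every chain would collapse onto one of the two mode centers; the noise is what keeps the within-mode spread close to the target's standard deviation.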
🧮 Denoising Score Matching Training
Unknown true score requires approximation
Since the true data distribution p_data is unknown, the score cannot be computed directly, necessitating methods to estimate it without explicit knowledge of the density.
Gaussian perturbations provide tractable targets
Adding Gaussian noise to a data point x defines a perturbation kernel q_σ(x̃|x) = N(x̃; x, σ²I) whose score with respect to x̃ is known analytically, -(x̃ - x)/σ², which provides a regression target for supervised learning.
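The target follows directly from the Gaussian log-density. A short derivation sketch, assuming isotropic noise with variance σ² in d dimensions:

```latex
\begin{aligned}
q_\sigma(\tilde{x} \mid x) &= \mathcal{N}(\tilde{x};\, x,\, \sigma^2 I), \\
\log q_\sigma(\tilde{x} \mid x) &= -\frac{\lVert \tilde{x} - x \rVert^2}{2\sigma^2}
  - \frac{d}{2}\log(2\pi\sigma^2), \\
\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) &= -\frac{\tilde{x} - x}{\sigma^2}.
\end{aligned}
```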
L2 loss on score predictions
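The model s_θ is trained to minimize the squared error between its prediction s_θ(x̃) and the known score of the perturbation kernel, -(x̃ - x)/σ². The minimizer of this loss is the score of the noised data distribution, so the model effectively learns a denoising direction.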
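The training objective can be checked on a case where the answer is known in closed form. In the sketch below, a linear model s(x̃) = a·x̃ + b stands in for the neural network s_θ, and the data are 1D Gaussian; both choices, along with the variable names and constants, are assumptions made so that the L2 minimizer can be compared against the exact score of the noised distribution N(μ, τ² + σ²).

```python
import numpy as np

rng = np.random.default_rng(0)
mu, tau, sigma = 1.0, 2.0, 0.5   # data mean, data std, noise std (assumed values)
n = 200_000

x = mu + tau * rng.standard_normal(n)          # clean data samples
x_tilde = x + sigma * rng.standard_normal(n)   # Gaussian-perturbed samples
target = -(x_tilde - x) / sigma**2             # score of the perturbation kernel

def dsm_loss(a, b):
    """Denoising score matching loss for the linear model s(x_tilde) = a * x_tilde + b."""
    pred = a * x_tilde + b
    return np.mean((pred - target) ** 2)

# For a linear model, the L2 minimizer is an ordinary least-squares fit.
features = np.stack([x_tilde, np.ones(n)], axis=1)
(a_hat, b_hat), *_ = np.linalg.lstsq(features, target, rcond=None)

# Exact score of the noised marginal N(mu, tau^2 + sigma^2) is
# -(x_tilde - mu) / (tau^2 + sigma^2), i.e. slope a_true and intercept b_true.
marginal_var = tau**2 + sigma**2
a_true, b_true = -1.0 / marginal_var, mu / marginal_var

print(f"fitted slope     {a_hat:.4f}  vs exact {a_true:.4f}")
print(f"fitted intercept {b_hat:.4f}  vs exact {b_true:.4f}")
print(f"loss at fit {dsm_loss(a_hat, b_hat):.4f}  vs loss at zero model {dsm_loss(0.0, 0.0):.4f}")
```

Although each individual target depends on the clean sample x, the fitted model depends only on x̃ and recovers the score of the noised marginal, which is what makes the loss usable without access to the true density.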
Bottom Line
Train models to estimate the score function (gradient of log probability) of progressively noised data using denoising score matching, then use Langevin dynamics to sample from noise toward the data distribution without computing intractable normalizing constants.