Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 2 - Score matching
TL;DR
This lecture introduces score matching as an alternative to DDPM for generative modeling, where instead of predicting noise directly, models estimate the gradient of log probability density (the 'score') to guide sampling from noise toward data distributions using Langevin dynamics.
📐 The Score Function Fundamentals
Score defined as gradient of log probability
The score is defined as ∇ₓ log p(x), representing the direction of steepest ascent toward regions of higher probability density in the data space.
Log probability eliminates normalizing constants
Unlike ∇ₓp(x), which requires knowing the intractable normalizing constant Z, the gradient of log p(x) eliminates Z entirely: Z does not depend on x, so ∇ₓ log Z = 0.
Numerical stability in low-density regions
Because ∇ₓ log p(x) = ∇ₓp(x)/p(x), the score normalizes the raw density gradient by the density itself: in sparse regions where both ∇ₓp(x) and p(x) are tiny, the ratio stays well scaled, and working with log-densities avoids the floating-point underflow that evaluating p(x) directly would cause.
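As a sanity check, the identity ∇ₓ log p(x) = ∇ₓp(x)/p(x) can be verified numerically for a 1-D Gaussian, whose score has the closed form -(x - μ)/σ². A minimal sketch (the function names are illustrative, not from the lecture):

```python
import numpy as np

def gaussian_log_density(x, mu=0.0, sigma=1.0):
    # log p(x) for N(mu, sigma^2); the normalizing constant is an
    # additive term here, so it vanishes under differentiation.
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

def gaussian_score(x, mu=0.0, sigma=1.0):
    # Analytic score: d/dx log p(x) = -(x - mu) / sigma^2.
    return -(x - mu) / sigma ** 2

# Central finite difference of log p(x) should match the analytic score.
x, eps = 1.7, 1e-5
numeric = (gaussian_log_density(x + eps) - gaussian_log_density(x - eps)) / (2 * eps)
print(abs(numeric - gaussian_score(x)) < 1e-6)  # True
```

Note that the constant -log(σ√(2π)) plays the role of log Z: it is present in the log-density but contributes nothing to the score.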
🎯 Sampling via Langevin Dynamics
Following scores leads to high-density regions
Starting from random noise, iteratively following the score direction moves samples toward regions of higher probability under the data distribution.
Stochastic sampling ensures diversity
Langevin sampling adds a stochastic noise term to the score-following process, preventing mode collapse and ensuring exploration of the full distribution rather than just its highest-density points.
MCMC method with theoretical guarantees
Langevin dynamics is a Markov chain Monte Carlo method; the Fokker-Planck equation shows that its stationary distribution is the target distribution, so the chain converges to the true data distribution when the score is known accurately (and the step size is taken appropriately small).
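The update rule described above, x ← x + ε·∇ₓ log p(x) + √(2ε)·z with z ~ N(0, I), can be sketched as follows; the example uses the known analytic score of a Gaussian as a stand-in for a learned score model (a hypothetical setup, not the lecture's code):

```python
import numpy as np

def langevin_sample(score, x0, step=0.01, n_steps=2000, rng=None):
    # Unadjusted Langevin dynamics:
    #   x_{t+1} = x_t + step * score(x_t) + sqrt(2 * step) * z,  z ~ N(0, I).
    # The injected Gaussian noise keeps the chain exploring the full
    # distribution instead of collapsing onto its modes.
    rng = np.random.default_rng(rng)
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x + step * score(x) + np.sqrt(2 * step) * rng.standard_normal(x.shape)
    return x

# Sample from N(3, 1) using its analytic score, -(x - 3).
samples = langevin_sample(lambda x: -(x - 3.0), x0=np.zeros(5000), rng=0)
print(samples.mean(), samples.std())  # ≈ 3 and ≈ 1
```

Starting all 5000 chains at 0 (far from the mode) and still recovering mean ≈ 3 illustrates the "noise toward data" behavior; dropping the √(2ε) noise term would instead drive every chain to the single point x = 3.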
🧮 Denoising Score Matching Training
Unknown true score requires approximation
Since the true data distribution p_data is unknown, the score cannot be computed directly, necessitating methods to estimate it without explicit knowledge of the density.
Gaussian perturbations provide tractable targets
Adding Gaussian noise to data creates a perturbed distribution q_σ(x̃|x) with an analytically known score equal to -(x̃ - x)/σ², enabling supervised learning.
L2 loss on score predictions
The model s_θ is trained to minimize the squared error between its predicted score and the true score of the noised data, effectively learning a denoising score function.
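The training objective above can be sketched as a regression onto the analytic score of the Gaussian perturbation kernel. A minimal NumPy sketch, assuming 1-D standard-normal data so the ideal score model is known in closed form (the names `dsm_loss` and the lambda models are illustrative stand-ins for s_θ):

```python
import numpy as np

def dsm_loss(score_model, x, sigma, rng=None):
    # Denoising score matching: perturb each data point with Gaussian noise,
    #   x_tilde = x + sigma * eps,  eps ~ N(0, I),
    # then regress the model's score on the analytic score of the
    # perturbation kernel, -(x_tilde - x) / sigma^2 (equivalently -eps / sigma).
    rng = np.random.default_rng(rng)
    eps = rng.standard_normal(x.shape)
    x_tilde = x + sigma * eps
    target = -(x_tilde - x) / sigma ** 2
    pred = score_model(x_tilde, sigma)
    return np.mean((pred - target) ** 2)

# For standard-normal data, the perturbed marginal is N(0, 1 + sigma^2), so the
# ideal score model is -x / (1 + sigma^2); it should beat a trivial baseline.
x = np.random.default_rng(0).standard_normal(10_000)
loss_oracle = dsm_loss(lambda xt, s: -xt / (1.0 + s ** 2), x, sigma=5.0, rng=1)
loss_zero = dsm_loss(lambda xt, s: np.zeros_like(xt), x, sigma=5.0, rng=1)
print(loss_oracle < loss_zero)  # True
```

Note that even the ideal model does not drive this loss to zero: the per-sample target -eps/σ is noisy, so the minimum of the DSM objective carries an irreducible variance term, and only the minimizer, not the minimum value, matches the true score.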
Bottom Line
Train models to estimate the score function (gradient of log probability) of progressively noised data using denoising score matching, then use Langevin dynamics to sample from noise toward the data distribution without computing intractable normalizing constants.
The speaker argues that to solve persistent human problems in HCI, designers must move beyond building better tools and instead critically reimagine entire socio-technical ecosystems. Through examples in event planning, crowdsourcing, social connection, and education, he demonstrates how redesigning human practices—what he terms "critical technical practice"—can unlock values that pure technological advancement has failed to address.