Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 2 - Score matching

| Podcasts | April 14, 2026 | 16.9 Thousand views | 1:48:48

TL;DR

This lecture introduces score matching as an alternative to DDPM for generative modeling. Instead of predicting noise directly, the model estimates the gradient of the log probability density (the 'score'), and Langevin dynamics uses that estimate to move samples from noise toward the data distribution.

📐 The Score Function Fundamentals 3 insights

Score defined as gradient of log probability

The score is defined as ∇ₓ log p(x), representing the direction of steepest ascent toward regions of higher probability density in the data space.
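As a small illustration of the definition, the sketch below computes the score of a one-dimensional Gaussian in closed form and checks it against a finite difference of the log-density. The Gaussian target and the function names are assumptions chosen for the example, not something taken from the lecture.

```python
import numpy as np

def log_density(x, mu=0.0, sigma=1.0):
    """Log of the N(mu, sigma^2) density."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

def score(x, mu=0.0, sigma=1.0):
    """Analytic score of N(mu, sigma^2): d/dx log p(x) = -(x - mu) / sigma^2."""
    return -(x - mu) / sigma ** 2

def numerical_score(x, h=1e-5, **kw):
    """Central finite difference of the log-density, used only as a check."""
    return (log_density(x + h, **kw) - log_density(x - h, **kw)) / (2.0 * h)

xs = np.array([-2.0, 0.0, 3.0])
analytic = score(xs, mu=1.0, sigma=2.0)
numeric = numerical_score(xs, mu=1.0, sigma=2.0)
```

The score is positive to the left of the mean and negative to the right, so it always points back toward the high-density region around mu.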

Log probability eliminates normalizing constants

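Unlike ∇p(x), which requires knowing the intractable normalizing constant Z, the gradient of log p(x) does not depend on Z. Writing p(x) as an unnormalized density divided by Z turns the division into a subtraction of log Z, and ∇ₓ log Z = 0 because Z does not depend on x.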
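The cancellation can be written out in one line. The symbol \tilde{p} for the unnormalized density is notation assumed here for the derivation.

```latex
p(x) = \frac{\tilde{p}(x)}{Z}, \qquad Z = \int \tilde{p}(x)\,dx
\quad\Longrightarrow\quad
\nabla_x \log p(x)
  = \nabla_x \log \tilde{p}(x) - \underbrace{\nabla_x \log Z}_{=\,0}
  = \nabla_x \log \tilde{p}(x).
```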

Numerical stability in low-density regions

Because ∇ log p = ∇p/p, the score rescales the gradient by the probability density. It therefore stays at a usable magnitude in sparse regions of the data space, where p(x) and ∇p(x) are both vanishingly small.
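A quick numerical check of this point, assuming a standard normal density as the toy example (the names below are hypothetical): deep in the tail the raw gradient is almost zero, while the rescaled quantity remains of order 10.

```python
import numpy as np

def density(x):
    """Standard normal density p(x)."""
    return np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)

def grad_density(x):
    """Raw gradient of the density: dp/dx = -x * p(x)."""
    return -x * density(x)

x_far = 10.0                    # a point deep in the low-density tail
p = density(x_far)              # vanishingly small
grad_p = grad_density(x_far)    # equally small, so a very weak signal
score_far = grad_p / p          # rescaled by the density: close to -10
```

In practice one would usually differentiate the log-density directly rather than form the ratio of two tiny numbers; the ratio is shown here only to mirror the identity ∇ log p = ∇p/p.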

🎯 Sampling via Langevin Dynamics 3 insights

Following scores leads to high-density regions

Starting from random noise, iteratively following the score direction moves samples toward regions of higher probability under the data distribution.

Stochastic sampling ensures diversity

Langevin sampling adds a Gaussian noise term to each score-following step. Without it, samples would collapse onto local maxima of the density; with it, they explore the full distribution rather than just its highest-density points.

MCMC method with theoretical guarantees

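Langevin dynamics is a Markov chain Monte Carlo method. The Fokker-Planck equation shows that its stationary distribution is the target density, so the chain converges to the true data distribution as the step size shrinks and the number of steps grows, provided the score is known accurately.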
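The sampling procedure described in this section can be sketched as follows. This is a minimal example under stated assumptions: the target is a toy Gaussian whose exact score stands in for a learned model, and the update uses the common discretization x ← x + (ε/2)·score(x) + √ε·z. The step size, step count, and function names are illustrative choices, not values from the lecture.

```python
import numpy as np

def target_score(x, mu=3.0, sigma=0.5):
    """Exact score of the toy target N(mu, sigma^2); stands in for a learned s_theta."""
    return -(x - mu) / sigma ** 2

def langevin_sample(score_fn, x0, step_size=0.01, n_steps=1000, rng=None):
    """Unadjusted Langevin dynamics:
    x <- x + (step_size / 2) * score(x) + sqrt(step_size) * z,  z ~ N(0, I)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)          # fresh Gaussian noise each step
        x = x + 0.5 * step_size * score_fn(x) + np.sqrt(step_size) * z
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal(5000)                    # 5000 independent chains started from noise
samples = langevin_sample(target_score, x0, rng=rng)
```

With these settings the sample mean and standard deviation land close to 3.0 and 0.5. Dropping the noise term would instead drive every chain to the mode at 3.0, which is the loss of diversity described above. The finite step size introduces a small bias, which is why the guarantee is stated in the small-step limit.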

🧮 Denoising Score Matching Training 3 insights

Unknown true score requires approximation

Since the true data distribution p_data is unknown, the score cannot be computed directly, necessitating methods to estimate it without explicit knowledge of the density.

Gaussian perturbations provide tractable targets

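Adding Gaussian noise with standard deviation σ to a data point x defines a perturbation kernel q_σ(x̃|x) whose score with respect to x̃ is known analytically: ∇_x̃ log q_σ(x̃|x) = -(x̃ - x)/σ². This gives a regression target and turns score estimation into supervised learning.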
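The target follows directly from the Gaussian log-density. The reparameterization x̃ = x + σε in the last step is a standard identity added here for context; it shows the target is a rescaled version of the added noise.

```latex
q_\sigma(\tilde{x} \mid x) = \mathcal{N}(\tilde{x};\, x,\, \sigma^2 I)
\;\Longrightarrow\;
\log q_\sigma(\tilde{x} \mid x) = -\frac{\lVert \tilde{x} - x \rVert^2}{2\sigma^2} + \text{const},
\qquad
\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) = -\frac{\tilde{x} - x}{\sigma^2}
= -\frac{\varepsilon}{\sigma}
\quad \text{for } \tilde{x} = x + \sigma\varepsilon,\ \varepsilon \sim \mathcal{N}(0, I).
```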

L2 loss on score predictions

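The model s_θ is trained to minimize the squared error between its prediction at the noised sample x̃ and the score of the Gaussian perturbation kernel, -(x̃ - x)/σ². In expectation, the minimizer is the score of the noise-perturbed data distribution, so the model learns a denoising score function.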
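The training objective can be sketched on a toy problem. The assumptions are loud ones: the "data" are samples from N(0, 1), there is a single noise level σ, and the score model is linear, s_θ(x̃) = a·x̃ + b, fitted by least squares instead of gradient descent. None of these choices come from the lecture; they make the result checkable, because the perturbed distribution is N(0, 1 + σ²) with score -x̃/(1 + σ²).

```python
import numpy as np

def dsm_loss(score_fn, x, sigma, rng):
    """Monte Carlo estimate of the denoising score matching loss at one noise level."""
    x_tilde = x + sigma * rng.standard_normal(x.shape)   # perturb the data
    target = -(x_tilde - x) / sigma ** 2                 # score of the Gaussian kernel
    return np.mean((score_fn(x_tilde) - target) ** 2)

rng = np.random.default_rng(0)
sigma = 0.5
x = rng.standard_normal(100_000)                         # toy data drawn from N(0, 1)

# Fit the linear model a * x_tilde + b to the DSM targets by least squares.
x_tilde = x + sigma * rng.standard_normal(x.shape)
target = -(x_tilde - x) / sigma ** 2
design = np.stack([x_tilde, np.ones_like(x_tilde)], axis=1)
(a, b), *_ = np.linalg.lstsq(design, target, rcond=None)

fitted_loss = dsm_loss(lambda z: a * z + b, x, sigma, rng)
zero_loss = dsm_loss(lambda z: np.zeros_like(z), x, sigma, rng)
```

The fitted slope comes out close to -1/(1 + σ²) = -0.8 and the intercept close to 0, even though the regression targets only ever used the kernel score. The loss does not go to zero at the optimum, since the targets are noisy, but it is lower than for the trivial zero predictor.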

Bottom Line

Train models to estimate the score function (gradient of log probability) of progressively noised data using denoising score matching, then use Langevin dynamics to sample from noise toward the data distribution without computing intractable normalizing constants.
