Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 5 - Architectures

| Podcasts | May 11, 2026 | 10.7 Thousand views | 1:46:26

TL;DR

This lecture moves from theoretical foundations to practical architecture design for diffusion models. It explains how U-Net structures combine convolutional inductive biases, hierarchical downsampling for global context, and skip connections that preserve local detail, all while keeping the output the same shape as the input, as iterative denoising requires.

🎯 Generation Model Requirements 3 insights

Three essential inputs

The generation model takes three inputs: a noisy latent representation x_t, a timestep t indicating the noise level, and a condition c (text or image) that guides generation toward specific outputs.

Velocity prediction output

The model predicts a velocity vector field v (or, under an equivalent parameterization, the noise or the score). Because the iterative denoising update x_{t+dt} = x_t + v · dt adds v to x_t elementwise, the output must have exactly the same dimensions as the input.
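
The shape constraint is easiest to see in a sampling loop. The following is a minimal sketch of fixed-step Euler integration of the update rule above, not the lecture's implementation. The names `euler_sample` and `toy_velocity` are hypothetical, the "network" is a hand-written constant velocity rather than a trained model, and time is assumed to run from 0 to 1 in equal steps (the lecture's convention may differ).

```python
def euler_sample(x, velocity_fn, num_steps):
    """Integrate x_{t+dt} = x_t + v * dt with fixed-size steps (assumed t in [0, 1])."""
    dt = 1.0 / num_steps
    t = 0.0
    for _ in range(num_steps):
        v = velocity_fn(x, t)
        # The update is elementwise, so v must have exactly as many entries as x.
        assert len(v) == len(x), "velocity must match input dimensions"
        x = [xi + vi * dt for xi, vi in zip(x, v)]
        t += dt
    return x


# Toy stand-in for a trained model (an assumption for illustration):
# a constant velocity pointing from the start point to a fixed target.
start = [0.0, 0.0, 0.0, 0.0]
target = [1.0, -2.0, 0.5, 3.0]


def toy_velocity(x, t):
    return [ti - si for ti, si in zip(target, start)]


result = euler_sample(start, toy_velocity, num_steps=10)
```

With a constant velocity, ten steps of size 0.1 carry `start` onto `target` up to floating-point error. A model whose output had a different number of entries than `x` would fail the assertion on the first step, which is the dimensional requirement stated above.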

Dual-scale understanding

Effective architectures must simultaneously capture global image structure for coherence and fine local details for crispness while remaining scalable to high resolutions.

🔍 Convolutional Inductive Biases 3 insights

Human-like scanning bias

Convolutions impose an inductive bias: the same small learnable filter is slid across every spatial position, so it extracts local visual features such as edges and textures wherever they occur, loosely analogous to how human vision scans a scene.
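
A small sketch of the scanning operation, in pure Python so that every multiply-add is visible. `conv2d_valid` is a hypothetical helper, the filter is hand-picked rather than learned, and (as in most deep learning frameworks) the kernel is applied without flipping, which is strictly a cross-correlation.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of lists) with stride 1 and no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Each output value depends only on a kh x kw neighbourhood.
            row.append(sum(
                image[i + a][j + b] * kernel[a][b]
                for a in range(kh)
                for b in range(kw)
            ))
        out.append(row)
    return out


# 4x4 image: dark left half, bright right half.
image = [[0, 0, 1, 1] for _ in range(4)]
# Horizontal difference filter: responds where brightness increases left to right.
kernel = [[-1, 1]]

response = conv2d_valid(image, kernel)
# Every row of `response` is [0, 1, 0]: the filter fires only at the vertical edge.
```

The same two weights are reused at every position, which is why the filter detects the edge regardless of where it sits in the image.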

Receptive field limitations

A standard convolution has a limited receptive field: each unit in an early layer sees only nearby pixels. Stacking stride-1 convolutions widens the field by only a few pixels per layer, so capturing global context this way would require a prohibitively deep network.

Hierarchical downsampling solution

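Pooling reduces the spatial dimensions, so every layer after a pooling step covers a proportionally larger region of the original image. With repeated 2x pooling the receptive field grows roughly geometrically with depth rather than linearly, letting deeper layers capture global image structure efficiently.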
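
To make the growth concrete, the sketch below applies the standard receptive-field recurrence for a chain of layers (each layer adds `(kernel - 1) * jump` pixels, where `jump` is the product of the strides so far). The layer configurations are illustrative assumptions, not taken from the lecture.

```python
def receptive_field(layers):
    """Receptive field, in input pixels, of a chain of (kernel_size, stride) layers."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # widen by the kernel extent at the current spacing
        jump *= stride             # downsampling spreads later kernels further apart
    return rf


conv = (3, 1)  # 3x3 convolution, stride 1
pool = (2, 2)  # 2x2 pooling, stride 2

plain = [conv, conv, conv, conv]
pooled = [conv, pool, conv, pool, conv, pool, conv]

rf_plain = receptive_field(plain)    # 9
rf_pooled = receptive_field(pooled)  # 38
```

Both stacks contain four 3x3 convolutions, but interleaving three pooling steps raises the receptive field from 9 to 38 pixels per side.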

🏗️ The U-Net Architecture 3 insights

Encoder-decoder structure

The U-Net uses an encoder path (convolutions and pooling) to compress the image into a bottleneck representation that carries global context, followed by a decoder path that uses transposed convolutions to restore the original spatial dimensions.

Skip connections preserve detail

Direct concatenation of encoder feature maps to corresponding decoder layers transports local details that are lost during downsampling, enabling the generation of crisp, high-fidelity outputs.
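
The following shape trace sketches how the encoder, bottleneck, decoder, and skip concatenations fit together. It only tracks `(channels, height, width)` tuples and runs no real layers. The function name `unet_shapes`, the base width of 64 channels, channel doubling per level, and 2x pooling are all assumptions for illustration, and the input height and width must be divisible by `2 ** depth`.

```python
def unet_shapes(channels, height, width, base=64, depth=3):
    """Trace (C, H, W) through a toy U-Net and return the shapes in order."""
    shapes = [(channels, height, width)]  # network input
    skips = []
    c, h, w = base, height, width

    # Encoder: conv block (spatial size kept, padding assumed), then 2x pooling.
    for _ in range(depth):
        shapes.append((c, h, w))
        skips.append((c, h, w))  # saved for the matching decoder level
        h, w = h // 2, w // 2
        c *= 2

    shapes.append((c, h, w))  # bottleneck: smallest resolution, most channels

    # Decoder: 2x upsampling, then concatenate the saved encoder features.
    for skip_c, skip_h, skip_w in reversed(skips):
        h, w = h * 2, w * 2
        c //= 2
        assert (h, w) == (skip_h, skip_w), "skip and decoder maps must align"
        shapes.append((c + skip_c, h, w))  # channel count after concatenation
        # A conv block would then reduce this back to c channels.

    shapes.append((channels, h, w))  # output projection back to input channels
    return shapes


shapes = unet_shapes(channels=4, height=32, width=32)
```

For a 4x32x32 input this yields a 512x4x4 bottleneck, decoder maps of 512x8x8, 256x16x16, and 128x32x32 after concatenation, and a final 4x32x32 output that matches the input shape. Concatenation doubles the channel count at each decoder level because the encoder features are carried across unchanged rather than being re-derived from the bottleneck.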

Distinction from autoencoders

Unlike VAEs that compress to latent spaces for reconstruction, diffusion U-Nets predict denoising directions (velocity/noise) and must maintain strict input-output dimensional consistency for the iterative sampling process.

Bottom Line

Diffusion models rely on U-Net architectures that balance global context acquisition through hierarchical downsampling with local detail preservation via skip connections, ensuring dimensional consistency for iterative denoising updates.
