Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 5 - Architectures
TL;DR
This lecture transitions from theoretical foundations to practical architecture design for diffusion models. It explains how U-Net structures combine convolutional inductive biases, hierarchical downsampling for global context, and skip connections that preserve local detail, all while satisfying the strict dimensional requirements of iterative denoising.
🎯 Generation Model Requirements 3 insights
Three essential inputs
The generation model takes three inputs: a noisy latent representation x_T, a timestep t indicating the noise level, and a condition c (text or an image) that guides generation toward the desired output.
Velocity prediction output
The model predicts a velocity vector field (or, equivalently, noise or a score) that must have the same dimensions as its input so that the iterative denoising update x_{t+dt} = x_t + v·dt is well defined.
Dual-scale understanding
Effective architectures must simultaneously capture global image structure for coherence and fine local details for crispness while remaining scalable to high resolutions.
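The three inputs and the dimension constraint can be seen in a minimal sketch of the sampling loop. The `model(x, t, c)` function below is a hypothetical stand-in (not the lecture's network) that simply pushes values toward zero; the point is that its output must have the same shape as x for the Euler-style update to apply.

```python
import random

def model(x, t, c):
    # Stand-in velocity predictor: returns a vector with the SAME shape as x.
    # A real network would actually condition on the timestep t and condition c.
    return [-xi for xi in x]  # toy velocity pointing toward the origin

def sample(x_T, c, num_steps=10):
    """Integrate x_{t+dt} = x_t + v * dt over num_steps denoising steps."""
    x, dt = list(x_T), 1.0 / num_steps
    for step in range(num_steps):
        t = 1.0 - step * dt              # timestep input: current noise level
        v = model(x, t, c)
        assert len(v) == len(x)          # output must match input dimensions
        x = [xi + vi * dt for xi, vi in zip(x, v)]
    return x

x_T = [random.gauss(0.0, 1.0) for _ in range(8)]   # noisy latent input
x_0 = sample(x_T, c="a photo of a cat")            # condition input
```

Because each step adds v·dt elementwise, any shape mismatch between prediction and input breaks the loop immediately, which is why the architecture must preserve dimensions end to end.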
🔍 Convolutional Inductive Biases 3 insights
Human-like scanning bias
Convolution operations impose an inductive bias where learnable filters scan across spatial dimensions, extracting local visual features like edges and textures similar to human visual processing.
Receptive field limitations
Standard convolutions have limited receptive fields where early layers see only nearby pixels, preventing global context understanding without prohibitively deep stacking.
Hierarchical downsampling solution
Pooling operations reduce spatial dimensions while exponentially expanding the receptive field, allowing deeper layers to understand global image structure efficiently.
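The effect of pooling on the receptive field can be made concrete with standard receptive-field arithmetic (a common analysis, not taken from the lecture): each k×k conv adds (k − 1) · jump pixels, and every 2× downsample doubles `jump`, so later layers grow the field geometrically rather than linearly.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, applied in order."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump   # each layer widens the field by (k-1)*jump
        jump *= s              # strided layers double the step between pixels
    return rf

# Eight plain 3x3 convs: the receptive field grows only linearly.
plain = [(3, 1)] * 8
# Same number of 3x3 convs, but with a 2x pool after every two of them.
pooled = [(3, 1), (3, 1), (2, 2)] * 4

print(receptive_field(plain))    # -> 17
print(receptive_field(pooled))   # -> 76
```

With the same conv budget, interleaving pooling more than quadruples the receptive field, which is why hierarchical downsampling reaches global context without prohibitively deep stacks.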
🏗️ The U-Net Architecture 3 insights
Encoder-decoder structure
The U-Net employs an encoder path (convolutions and pooling) to compress the image into a bottleneck representation with global context, followed by a decoder path using transpose convolutions to restore original dimensions.
Skip connections preserve detail
Direct concatenation of encoder feature maps to corresponding decoder layers transports local details that are lost during downsampling, enabling the generation of crisp, high-fidelity outputs.
Distinction from autoencoders
Unlike VAEs that compress to latent spaces for reconstruction, diffusion U-Nets predict denoising directions (velocity/noise) and must maintain strict input-output dimensional consistency for the iterative sampling process.
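The encoder-decoder symmetry, skip concatenation, and input-output dimension match can be checked with a shape-only walkthrough of a toy U-Net (an illustration under assumed channel widths, not the lecture's exact network):

```python
def unet_shapes(c, h, w, depth=3):
    """Trace (channels, height, width) through a toy U-Net."""
    shapes, skips, ch = [("input", (c, h, w))], [], 64
    # Encoder: conv block sets channels, then a 2x downsample.
    for i in range(depth):
        skips.append((ch, h, w))            # feature map saved BEFORE pooling
        h, w, ch = h // 2, w // 2, ch * 2
        shapes.append((f"down{i}", (ch, h, w)))
    # Decoder: 2x upsample, then concatenate the matching encoder skip.
    for i in reversed(range(depth)):
        h, w, ch = h * 2, w * 2, ch // 2
        skip_ch, sh, sw = skips.pop()
        assert (sh, sw) == (h, w)           # skip must match decoder resolution
        shapes.append((f"up{i}", (ch + skip_ch, h, w)))  # concat along channels
    shapes.append(("output", (c, h, w)))    # final conv maps back to c channels
    return shapes

for name, shape in unet_shapes(3, 256, 256):
    print(name, shape)
```

The trace ends at the same (3, 256, 256) shape it started from, making explicit the dimensional consistency that separates a diffusion U-Net from a compressing autoencoder.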
Bottom Line
Diffusion models rely on U-Net architectures that balance global context acquisition through hierarchical downsampling with local detail preservation via skip connections, ensuring dimensional consistency for iterative denoising updates.