Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 6: Kernels, Triton, XLA
TL;DR
This lecture bridges the gap between GPU programming abstractions and hardware realities, explaining how thread hierarchies, memory systems, and hardware constraints like warps and bank conflicts determine kernel performance for deep learning workloads.
💾 GPU Memory Hierarchy
HBM scales while on-chip memory remains constant
Recent NVIDIA GPUs such as the B200 still have roughly 100-200 SMs, each with about 256KB of shared memory, while High Bandwidth Memory (HBM) capacity keeps growing significantly. The gap between the small, fast on-chip memory and the large, slow global memory therefore keeps widening.
Registers and shared memory minimize HBM access
Fast on-chip memory (registers and L1/shared memory) is local to each SM and lets threads communicate and compute without expensive round-trips to high-latency HBM. Those round-trips are the primary bottleneck that kernel optimization targets.
⚙️ Programming Model vs. Hardware
Thread blocks map to SMs for shared memory access
The programming model presents a clean hierarchy of threads, thread blocks, and grids. In hardware, each thread block is scheduled onto a single Streaming Multiprocessor (SM), so its threads can communicate through fast shared memory rather than slow HBM.
Warps execute in lockstep causing divergence penalties
Threads are grouped into 32-thread warps that execute the same instruction at the same time. When threads in a warp take different branches (control divergence), the hardware runs each branch path in turn with the non-participating threads masked off, which severely reduces throughput.
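The cost of divergence can be illustrated with a toy Python model. This is only a sketch: it assumes a single if/else branch, and the helper name `passes_per_warp` is made up for illustration. It counts how many branch paths each warp has to run in turn.

```python
WARP_SIZE = 32  # threads per warp, as stated in the lecture summary


def passes_per_warp(branch_taken):
    """For each warp, count how many branch paths must be run in turn.

    branch_taken[i] is True if thread i takes the `if` side and
    False if it takes the `else` side (hypothetical single branch).
    """
    passes = []
    for start in range(0, len(branch_taken), WARP_SIZE):
        warp = branch_taken[start:start + WARP_SIZE]
        passes.append(len(set(warp)))  # 1 = uniform warp, 2 = divergent warp
    return passes


num_threads = 64
# Branching on thread parity splits every warp down the middle.
by_parity = passes_per_warp([tid % 2 == 0 for tid in range(num_threads)])
# Branching on the warp index keeps each warp uniform.
by_warp = passes_per_warp(
    [(tid // WARP_SIZE) % 2 == 0 for tid in range(num_threads)]
)
print(by_parity)  # [2, 2]
print(by_warp)    # [1, 1]
```

In this model, both launches do the same total work, but the parity-based branch makes every warp run both paths, while the warp-aligned branch lets each warp run only one.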
Warp scheduling hides latency via zero-cost switching
SMs maintain multiple resident warps and switch between them instantly when one stalls waiting for HBM, allowing the hardware to hide memory latency by keeping compute units busy with other warps.
⚡ Performance Optimization Constraints
Register usage limits thread occupancy
Each thread can use at most 255 registers, and all resident threads on an SM share one fixed-size register file. The more registers each thread uses, the fewer threads can be resident at once (lower occupancy), though having fewer threads each do more work (thread coarsening) can sometimes improve efficiency.
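The register/occupancy trade-off comes down to simple arithmetic, sketched here in Python. The register file size (65,536 32-bit registers per SM) and the limit of 2,048 resident threads per SM are assumptions typical of recent NVIDIA data-center GPUs, not figures from the lecture. The model also ignores register allocation granularity and the other occupancy limits, such as shared memory and block count.

```python
WARP_SIZE = 32


def max_resident_threads(regs_per_thread,
                         regs_per_sm=65536,        # assumed register file size
                         max_threads_per_sm=2048):  # assumed hardware thread limit
    """Upper bound on resident threads per SM from register usage alone."""
    if not 1 <= regs_per_thread <= 255:
        raise ValueError("a thread can use between 1 and 255 registers")
    # Registers are consumed a whole warp at a time.
    warps_by_registers = regs_per_sm // (regs_per_thread * WARP_SIZE)
    return min(warps_by_registers * WARP_SIZE, max_threads_per_sm)


def occupancy(regs_per_thread, max_threads_per_sm=2048):
    """Fraction of the assumed thread limit that register usage allows."""
    threads = max_resident_threads(
        regs_per_thread, max_threads_per_sm=max_threads_per_sm
    )
    return threads / max_threads_per_sm


for regs in (32, 64, 128, 255):
    print(regs, max_resident_threads(regs), occupancy(regs))
# 32 2048 1.0
# 64 1024 0.5
# 128 512 0.25
# 255 256 0.125
```

Under these assumptions, doubling the registers per thread halves the number of threads that can be resident once the register file becomes the limiting resource.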
Shared memory bank conflicts serialize access
Shared memory is divided into 32 banks. When multiple threads in a warp access different addresses that fall in the same bank, the hardware must serialize those accesses (threads reading the same address are served by a single broadcast). The resulting delays are mitigated with techniques such as swizzling.
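A small Python model shows how bank conflicts arise and how swizzling removes them. It is a sketch that assumes 4-byte-wide banks and a 32x32 tile of 4-byte elements. The function names and the XOR-based index mapping are illustrative choices, not the lecture's code.

```python
from collections import defaultdict

NUM_BANKS = 32
BANK_WIDTH = 4  # bytes per bank word (assumed)
TILE = 32       # hypothetical 32x32 tile of 4-byte elements


def bank_conflict_degree(byte_addresses):
    """Number of serialized passes for one warp-wide shared memory access."""
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        word = addr // BANK_WIDTH
        # Same word in the same bank is a broadcast, so track distinct words.
        words_per_bank[word % NUM_BANKS].add(word)
    return max((len(words) for words in words_per_bank.values()), default=0)


def naive_index(row, col):
    """Row-major element index within the tile."""
    return row * TILE + col


def swizzled_index(row, col):
    """XOR the column with the row so a column spreads across all banks."""
    return row * TILE + (col ^ row)


def column_read(col, index_fn):
    """Byte addresses touched when thread `row` reads element (row, col)."""
    return [BANK_WIDTH * index_fn(row, col) for row in range(TILE)]


print(bank_conflict_degree(column_read(5, naive_index)))     # 32
print(bank_conflict_degree(column_read(5, swizzled_index)))  # 1
```

In the naive layout every element of a column lands in the same bank, so the read takes 32 passes. With the XOR mapping, each row's element lands in a different bank and the read completes in one pass.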
Uncoalesced global memory wastes bandwidth
When threads in a warp access scattered HBM locations rather than contiguous cache lines (128 bytes), the hardware executes multiple memory transactions instead of one, drastically reducing effective bandwidth utilization.
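The effect of the access pattern on transaction count can be estimated with a simplified Python model. It assumes that each distinct 128-byte line touched by a warp costs one transaction, which glosses over finer-grained sectoring on real hardware. The function name is hypothetical.

```python
LINE_SIZE = 128  # bytes per cache line, as stated in the lecture summary


def memory_transactions(byte_addresses, access_size=4):
    """Count the distinct 128-byte lines touched by one warp-wide access."""
    lines = set()
    for addr in byte_addresses:
        first = addr // LINE_SIZE
        last = (addr + access_size - 1) // LINE_SIZE
        lines.update(range(first, last + 1))
    return len(lines)


coalesced = [4 * tid for tid in range(32)]       # contiguous 4-byte elements
misaligned = [4 + 4 * tid for tid in range(32)]  # contiguous, offset by one element
strided = [128 * tid for tid in range(32)]       # one element per cache line

print(memory_transactions(coalesced))   # 1
print(memory_transactions(misaligned))  # 2
print(memory_transactions(strided))     # 32
```

In the strided case the warp uses 128 bytes but fetches 32 lines of 128 bytes each, so only 1/32 of the transferred data is useful under this model.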
Block quantization causes SM underutilization
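Since a thread block cannot be split across SMs, launching a block count that is not a multiple of the SM count leaves some SMs idle during the final execution wave. For example, with 160 blocks on 148 SMs, the first wave fills every SM, but the second wave has only 12 blocks, leaving 136 SMs idle.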
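The wave arithmetic can be checked with a few lines of Python. This is a sketch that assumes every block takes the same time to run and, by default, that one block is resident per SM. `wave_stats` and its `blocks_per_sm` parameter are illustrative names.

```python
import math


def wave_stats(num_blocks, num_sms, blocks_per_sm=1):
    """Return (waves, blocks in the partial final wave, average utilization)."""
    slots = num_sms * blocks_per_sm       # blocks that can run concurrently
    waves = math.ceil(num_blocks / slots)
    tail = num_blocks % slots             # 0 means the last wave is full
    utilization = num_blocks / (waves * slots)
    return waves, tail, utilization


waves, tail, utilization = wave_stats(160, 148)
print(waves, tail, round(utilization, 3))  # 2 12 0.541

print(wave_stats(148, 148))  # (1, 0, 1.0)
```

Under these assumptions, 160 blocks on 148 SMs take as long as 296 blocks would, so about 46% of the available SM time is wasted.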
Bottom Line
Design GPU kernels to maximize data reuse in fast shared memory and registers, and choose a total thread block count that is a multiple of the GPU's SM count so that no SMs sit idle in the final wave.