Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 5: GPUs, TPUs
TL;DR
This lecture introduces GPU architecture for language model training, explaining the shift from serial CPU execution to parallel GPU throughput, the critical importance of memory hierarchies, and the SIMT programming model essential for efficient deep learning systems.
📉 The End of Serial Scaling
Dennard scaling collapse forced parallel architectures
CPU clock speeds stopped increasing around 2005 as Dennard scaling broke down: shrinking transistors no longer lowered power density, so higher clocks ran into power and heat limits. The industry abandoned ever-faster serial execution in favor of parallel hardware such as GPUs.
GPUs maximize throughput not latency
Unlike CPUs designed for fast serial execution with complex branching, GPUs utilize thousands of lightweight cores that accept higher individual task latency to maximize aggregate floating-point operations per second.
Super-exponential FLOP growth since 2017
Beginning with P100 and V100 chips, GPU compute capacity scaled dramatically through hardware innovations including tensor cores, structured sparsity, and reduced precision formats like FP8.
🧩 GPU Hardware Architecture
Streaming Multiprocessors are independent compute units
Modern GPUs like the A100 contain 108 Streaming Multiprocessors (SMs; the full GA100 die has 128, of which 108 are enabled), each acting as a discrete core with internal streaming processors that execute threads in parallel and maintain dedicated access to fast local memory.
Memory distance creates up to 20x latency penalties
Global memory sits off-chip, outside the Streaming Multiprocessors, with approximately 400-cycle latency, while L1 cache and shared memory live inside each SM with only 20-30 cycle latency, a gap of roughly 13x to 20x.
Shared memory enables fast thread cooperation
Located within each SM, shared memory allows threads within a block to communicate and reuse data rapidly, though it costs hundreds of times more per byte than global DRAM and consumes significantly more power.
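The payoff of data reuse can be sketched with a small counting model. The plain-Python code below is an illustration, not GPU code: `tiled_matmul` and `naive_matmul_loads` are hypothetical helper names, and the "shared memory" is just a Python list holding one staged tile. It assumes square matrices whose size is a multiple of the tile size, and it counts only reads of the input matrices.

```python
def naive_matmul_loads(n):
    """Global reads if every multiply-add fetches both operands from global memory."""
    return 2 * n ** 3


def tiled_matmul(a, b, tile):
    """Multiply square matrices a @ b tile by tile.

    Returns (result, global_loads), where global_loads counts the input
    elements copied into the staged tiles (the stand-in for shared memory).
    """
    n = len(a)
    assert n % tile == 0, "sketch assumes n is a multiple of tile"
    c = [[0.0] * n for _ in range(n)]
    global_loads = 0
    for bi in range(0, n, tile):
        for bj in range(0, n, tile):
            for bk in range(0, n, tile):
                # Stage one tile of A and one tile of B: these are the only
                # reads from "global memory" in this model.
                a_tile = [a[bi + i][bk:bk + tile] for i in range(tile)]
                b_tile = [b[bk + k][bj:bj + tile] for k in range(tile)]
                global_loads += 2 * tile * tile
                # All arithmetic below reuses the staged tiles.
                for i in range(tile):
                    for j in range(tile):
                        acc = 0.0
                        for k in range(tile):
                            acc += a_tile[i][k] * b_tile[k][j]
                        c[bi + i][bj + j] += acc
    return c, global_loads
```

In this model the tiled version performs 2·n³/tile global reads instead of 2·n³, so an 8x8 product with 4x4 tiles reads 256 elements rather than 1024. Larger tiles reuse more data, bounded by how much fits in shared memory.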
⚙️ Programming Model and Execution
SIMT architecture requires lockstep warp execution
GPUs execute threads in 32-thread warps following Single Instruction Multiple Thread principles, meaning all threads in a warp must execute identical instructions simultaneously on different data inputs.
Blocks guarantee SM residency for memory sharing
Thread blocks are scheduling units guaranteed to execute on a single Streaming Multiprocessor, so all threads in a block can share and reuse data through that SM's fast shared memory. Several blocks may be resident on one SM at once, each with its own shared memory allocation.
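The block and warp hierarchy amounts to simple index arithmetic. The sketch below mirrors, in plain Python, the usual CUDA relation `global = blockIdx.x * blockDim.x + threadIdx.x` for a one-dimensional launch. `locate_thread` and `blocks_needed` are hypothetical helper names used only for illustration.

```python
WARP_SIZE = 32  # threads per warp, as stated in the summary


def locate_thread(global_idx, block_dim):
    """Split a 1-D global thread index into block, thread-in-block, warp, and lane."""
    block_idx, thread_idx = divmod(global_idx, block_dim)
    warp_idx, lane = divmod(thread_idx, WARP_SIZE)
    return {"block": block_idx, "thread": thread_idx, "warp": warp_idx, "lane": lane}


def blocks_needed(n_elements, block_dim):
    """Number of blocks to launch so every element gets a thread (ceiling division)."""
    return (n_elements + block_dim - 1) // block_dim
```

For example, with 256 threads per block, global thread 300 is thread 44 of block 1, which places it in warp 1 at lane 12, and covering 1000 elements takes 4 blocks.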
Warps serve as the hardware scheduling unit
The GPU scheduler dispatches and manages execution in groups of 32 threads called warps rather than individual threads, reducing scheduling overhead but creating divergence penalties when threads take different code paths.
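The divergence penalty can be illustrated with a toy model of one warp. This is a simplified sketch, not how any real scheduler is implemented: it assumes each branch path that at least one lane takes costs one lockstep pass, with the other lanes masked off. `run_branch` is a hypothetical name.

```python
WARP_SIZE = 32


def run_branch(values, predicate, then_fn, else_fn):
    """Apply an if/else to one warp in lockstep; return (outputs, passes).

    Each path taken by at least one lane costs one pass over the whole warp.
    Lanes whose mask does not match the current path keep their old value.
    """
    assert len(values) == WARP_SIZE
    mask = [predicate(v) for v in values]
    out = list(values)
    passes = 0
    if any(mask):  # at least one lane takes the "then" path
        passes += 1
        out = [then_fn(v) if m else o for v, m, o in zip(values, mask, out)]
    if not all(mask):  # at least one lane takes the "else" path
        passes += 1
        out = [else_fn(v) if not m else o for v, m, o in zip(values, mask, out)]
    return out, passes
```

With the values 0 through 31, branching on `v % 2 == 0` splits the warp and takes two passes, while branching on `v < 64` keeps every lane on the same path and takes one.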
🚀 Systems-Aware Optimization
Hardware knowledge enables efficient model design
Understanding GPU execution models is essential for architecture design because efficient scaling requires maximizing resource utilization through hardware-aware algorithm choices rather than just theoretical compute counts.
Throughput varies non-linearly with matrix dimensions
GPU performance on matrix multiplication exhibits complex patterns where specific matrix sizes achieve dramatically higher throughput than others due to intricate interactions between memory hierarchies and compute units.
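One commonly cited contributor to such patterns is wave quantization: output tiles are handed to SMs in waves, and a final, partly filled wave leaves SMs idle. The sketch below is a back-of-the-envelope model under stated assumptions, not a description of any particular kernel. The 256x128 tile shape, the 108-SM count, and the one-tile-per-SM-per-wave rule are all assumptions, and `wave_utilization` is a hypothetical name.

```python
import math


def wave_utilization(m, n, tile_m=256, tile_n=128, num_sms=108):
    """Estimate SM utilization for an m x n output matrix.

    Assumes each SM computes one output tile per wave (a simplification).
    Returns (tiles, waves, fraction of SM slots doing useful work).
    """
    tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    waves = math.ceil(tiles / num_sms)
    return tiles, waves, tiles / (waves * num_sms)
```

Under these assumptions a 1792x1792 output needs 98 tiles and fits in one wave at about 91% utilization, while a 2048x1792 output needs 112 tiles, spills into a second wave, and drops to about 52%.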
Flash Attention demonstrates hardware optimization
The lecture presents Flash Attention as a synthesis of GPU techniques including tiling and careful memory management, demonstrating how deep hardware knowledge enables algorithmic breakthroughs in transformer training and inference.
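The tiling idea can be sketched for a single query vector using an online softmax, which processes keys and values one block at a time and never materializes the full score vector. This is a plain-Python illustration of the numerical trick only, not the Flash Attention kernel. The function names are hypothetical, and the usual 1/sqrt(d) score scaling is omitted for brevity.

```python
import math


def attention_reference(q, keys, values):
    """Standard softmax attention for one query, computed over all keys at once."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def attention_tiled(q, keys, values, block=2):
    """Same result, computed block by block with a running max and running sum."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim  # unnormalized weighted sum of values seen so far
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new max (0.0 on the first block).
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Both functions return the same output up to floating-point rounding. The tiled version only ever holds one block of keys and values plus a few running scalars, which is what lets the working set stay in fast on-chip memory.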
Bottom Line
Maximize language model training efficiency by minimizing slow global memory accesses and maximizing data reuse within the fast but limited shared memory and registers of GPU Streaming Multiprocessors.