Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 5: GPUs, TPUs
TL;DR
This lecture introduces GPU architecture for language model training, explaining the shift from serial CPU execution to parallel GPU throughput, the critical importance of memory hierarchies, and the SIMT programming model essential for efficient deep learning systems.
📉 The End of Serial Scaling
Dennard scaling collapse forced parallel architectures
CPU clock speeds stopped increasing around 2005 due to fundamental physical limits on transistor scaling, forcing the industry to abandon faster serial execution in favor of horizontal GPU parallelism.
GPUs maximize throughput not latency
Unlike CPUs, which are designed for fast serial execution with complex branch prediction, GPUs use thousands of lightweight cores that accept higher per-task latency in exchange for much greater aggregate floating-point throughput.
Super-exponential FLOP growth since 2017
Beginning with P100 and V100 chips, GPU compute capacity scaled dramatically through hardware innovations including tensor cores, structured sparsity, and reduced precision formats like FP8.
🧩 GPU Hardware Architecture
Streaming Multiprocessors are independent compute units
Modern GPUs contain on the order of a hundred Streaming Multiprocessors (108 on the A100), each acting as a discrete core whose internal streaming processors execute threads in parallel with dedicated access to fast local memory.
Memory distance creates 20x latency penalties
Global memory sits physically far from the compute die, with roughly 400 cycles of access latency, while L1 cache and shared memory live inside each Streaming Multiprocessor and respond in about 20-30 cycles.
Shared memory enables fast thread cooperation
Located within each SM, shared memory allows threads within a block to communicate and reuse data rapidly, though it costs hundreds of times more per byte than global DRAM and consumes significantly more power.
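The latency gap above can be put into numbers with a back-of-envelope model. The sketch below uses the approximate cycle counts quoted in this section (illustrative figures, not measurements of any specific chip) to show why reusing a value from shared memory many times per global load is the core optimization strategy:

```python
# Back-of-envelope: latency penalty of global memory vs. on-chip memory,
# using the approximate cycle counts quoted above (illustrative, not measured).
GLOBAL_MEM_CYCLES = 400        # global (HBM/DRAM) access latency
SHARED_MEM_CYCLES = (20, 30)   # L1 / shared-memory access latency range

penalties = [GLOBAL_MEM_CYCLES / c for c in SHARED_MEM_CYCLES]
print(f"Going off-chip costs {penalties[1]:.0f}x-{penalties[0]:.0f}x more cycles")

def effective_cycles_per_access(reuse_k, global_c=400, shared_c=25):
    """Average cycles per access if each value is loaded once from global
    memory and then read reuse_k more times from shared memory."""
    return (global_c + reuse_k * shared_c) / (reuse_k + 1)

# More on-chip reuse amortizes the one expensive global load:
for k in (0, 3, 15):
    print(f"reuse={k:2d}: {effective_cycles_per_access(k):6.1f} cycles/access")
```

With zero reuse every access pays the full ~400 cycles; reusing each loaded tile element 15 times drops the average below 50 cycles, which is the arithmetic behind tiled kernels.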
⚙️ Programming Model and Execution
SIMT architecture requires lockstep warp execution
GPUs execute threads in 32-thread warps following Single Instruction Multiple Thread principles, meaning all threads in a warp must execute identical instructions simultaneously on different data inputs.
Blocks guarantee SM residency for memory sharing
Thread blocks are scheduling units guaranteed to execute on a single Streaming Multiprocessor, granting them exclusive access to that SM's fast shared memory pool for inter-thread data reuse.
Warps serve as the hardware scheduling unit
The GPU scheduler dispatches and manages execution in groups of 32 threads called warps rather than individual threads, reducing scheduling overhead but creating divergence penalties when threads take different code paths.
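The divergence penalty described above can be sketched with a toy cost model (an illustration of the SIMT rule, not a hardware simulator): when threads in a warp disagree on a branch, the warp executes each taken path in turn with the non-participating threads masked off, so a divergent warp pays for both paths:

```python
# Toy model of SIMT warp divergence. Within a 32-thread warp, threads taking
# different branches are serialized: the warp runs each taken path in turn,
# masking off the threads that did not take it.
WARP_SIZE = 32

def warp_cycles(branch_mask, then_cost, else_cost):
    """Cycles for one warp to execute an if/else, where branch_mask[i] says
    whether thread i takes the 'then' path. A divergent warp pays both paths."""
    takes_then = any(branch_mask)        # at least one thread takes 'then'
    takes_else = not all(branch_mask)    # at least one thread takes 'else'
    return then_cost * takes_then + else_cost * takes_else

uniform = [True] * WARP_SIZE                        # all threads agree
divergent = [i % 2 == 0 for i in range(WARP_SIZE)]  # threads disagree

print(warp_cycles(uniform, 100, 40))    # only the 'then' path runs
print(warp_cycles(divergent, 100, 40))  # both paths run serially
```

A uniform warp costs 100 cycles here while the divergent one costs 140, which is why branching on thread index inside a warp is discouraged.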
🚀 Systems-Aware Optimization
Hardware knowledge enables efficient model design
Understanding GPU execution models is essential for architecture design because efficient scaling requires maximizing resource utilization through hardware-aware algorithm choices rather than just theoretical compute counts.
Throughput varies non-linearly with matrix dimensions
GPU performance on matrix multiplication exhibits complex patterns where specific matrix sizes achieve dramatically higher throughput than others due to intricate interactions between memory hierarchies and compute units.
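One concrete mechanism behind these non-linear patterns is tile quantization: if the hardware computes output in fixed-size tiles, a dimension that is not a multiple of the tile size wastes part of the last tile. The sketch below models this with an illustrative 128x128 tile (the value is an assumption for demonstration, not a spec for any particular chip):

```python
import math

# Sketch of "tile quantization": if the GPU computes output in fixed
# TILE x TILE blocks, a matrix dimension that isn't a multiple of TILE
# wastes part of the last tile of work. TILE=128 is illustrative.
TILE = 128

def tile_efficiency(m, n):
    """Fraction of launched tile work that produces useful output elements."""
    tiles_m = math.ceil(m / TILE)
    tiles_n = math.ceil(n / TILE)
    launched = tiles_m * tiles_n * TILE * TILE  # elements of work dispatched
    return (m * n) / launched                   # elements actually needed

for m, n in [(4096, 4096), (4097, 4097), (4096, 1000)]:
    print(f"{m}x{n}: {tile_efficiency(m, n):.1%} of tile work is useful")
```

Going from 4096 to 4097 in both dimensions drops the modeled utilization from 100% to about 94%, which is the flavor of cliff that real matmul benchmarks exhibit at certain sizes.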
Flash Attention demonstrates hardware optimization
The lecture presents Flash Attention as a synthesis of these GPU techniques, combining tiling with careful memory management, and shows how deep hardware knowledge enables algorithmic breakthroughs in computing transformer attention.
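The core trick can be sketched in a few lines of NumPy: process the keys and values in tiles while maintaining a running ("online") softmax, so the full N x N score matrix never has to sit in slow global memory. This is a minimal single-head illustration without scaling or masking, not the real fused kernel:

```python
import numpy as np

def tiled_attention(Q, K, V, tile=64):
    """Flash-Attention-style tiled softmax(Q K^T) V with an online softmax.
    Only one (n, tile) block of scores is materialized at a time."""
    n, d = Q.shape
    out = np.zeros((n, d))
    running_max = np.full(n, -np.inf)   # running max of scores per query
    running_sum = np.zeros(n)           # running softmax denominator
    for start in range(0, K.shape[0], tile):
        Kt, Vt = K[start:start + tile], V[start:start + tile]
        scores = Q @ Kt.T                       # one (n, tile) score block
        new_max = np.maximum(running_max, scores.max(axis=1))
        # Rescale previously accumulated output/denominator to the new max
        correction = np.exp(running_max - new_max)
        p = np.exp(scores - new_max[:, None])   # unnormalized block weights
        out = out * correction[:, None] + p @ Vt
        running_sum = running_sum * correction + p.sum(axis=1)
        running_max = new_max
    return out / running_sum[:, None]

# Check against straightforward (memory-hungry) attention:
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 16)) for _ in range(3))
scores = Q @ K.T
naive = np.exp(scores - scores.max(axis=1, keepdims=True))
naive = (naive / naive.sum(axis=1, keepdims=True)) @ V
print(np.allclose(tiled_attention(Q, K, V), naive))
```

The rescaling step is what lets each tile be processed once and discarded: whenever a new tile raises the running max, previously accumulated results are multiplied by a correction factor so the final normalization is exact.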
Bottom Line
Maximize language model training efficiency by minimizing slow global memory accesses and maximizing data reuse within the fast but limited shared memory and registers of GPU Streaming Multiprocessors.