Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 5: GPUs, TPUs

| Podcasts | April 20, 2026 | 2.78K views | 1:18:39

TL;DR

This lecture introduces GPU architecture for language model training, explaining the shift from serial CPU execution to parallel GPU throughput, the critical importance of memory hierarchies, and the SIMT programming model essential for efficient deep learning systems.

📉 The End of Serial Scaling 3 insights

Dennard scaling collapse forced parallel architectures

CPU clock speeds stopped increasing around 2005 because shrinking transistors no longer reduced power density (the end of Dennard scaling), forcing the industry to abandon ever-faster serial execution in favor of massive GPU parallelism.

GPUs maximize throughput not latency

Unlike CPUs, which are optimized for fast serial execution and complex, branch-heavy control flow, GPUs use thousands of lightweight cores that accept higher per-task latency in exchange for far greater aggregate floating-point throughput.

Super-exponential FLOP growth since 2017

Beginning with P100 and V100 chips, GPU compute capacity scaled dramatically through hardware innovations including tensor cores, structured sparsity, and reduced precision formats like FP8.

🧩 GPU Hardware Architecture 3 insights

Streaming Multiprocessors are independent compute units

Modern data-center GPUs contain on the order of a hundred Streaming Multiprocessors (108 on the A100), each acting as a discrete core whose internal streaming processors execute threads in parallel with dedicated access to fast local memory.

Memory distance creates 20x latency penalties

Global memory sits off-chip, physically distant from the compute units, with roughly 400 cycles of access latency, while L1 and shared memory live inside each Streaming Multiprocessor and respond in only 20-30 cycles.
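The gap in the heading is just the ratio of those two numbers. A back-of-envelope sketch, using the round figures quoted above (illustrative values, not vendor specs):

```python
# Illustrative round numbers from the summary above, not exact hardware specs.
GLOBAL_MEM_CYCLES = 400  # off-chip global memory (DRAM) access latency
SHARED_MEM_CYCLES = 20   # on-chip shared/L1 access inside an SM

penalty = GLOBAL_MEM_CYCLES / SHARED_MEM_CYCLES
print(f"global vs. shared latency: {penalty:.0f}x")  # -> 20x
```

Every value that can be staged once in shared memory instead of re-fetched from DRAM buys back that 20x.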

Shared memory enables fast thread cooperation

Located within each SM, shared memory allows threads within a block to communicate and reuse data rapidly, though it costs hundreds of times more per byte than global DRAM and consumes significantly more power.
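The payoff of that cooperation is data reuse. A minimal counting sketch for a square matrix multiply, assuming the classic shared-memory tiling scheme (function names here are illustrative):

```python
def global_reads_naive(n):
    # Naive matmul: each of the n*n outputs streams a full row of A
    # and a full column of B from global memory -> 2 * n^3 reads.
    return 2 * n**3

def global_reads_tiled(n, t):
    # Tiled matmul: each t x t tile of A and B is loaded into shared
    # memory once per tile-pair and reused by t threads, cutting
    # global-memory traffic by a factor of t.
    assert n % t == 0
    return 2 * n**3 // t

n, t = 1024, 32
print(global_reads_naive(n) // global_reads_tiled(n, t))  # reuse factor -> 32
```

Larger tiles mean more reuse, bounded by the few hundred kilobytes of shared memory per SM.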

⚙️ Programming Model and Execution 3 insights

SIMT architecture requires lockstep warp execution

GPUs execute threads in 32-thread warps following Single Instruction Multiple Thread principles, meaning all threads in a warp must execute identical instructions simultaneously on different data inputs.
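A toy model of the SIMT idea: the instruction is issued once, and every lane in the warp applies it to its own data (a conceptual sketch in plain Python, not GPU code):

```python
WARP_SIZE = 32

def warp_execute(instruction, lane_data):
    # SIMT: one instruction, issued once, executed by all 32 lanes
    # in lockstep -- each lane on its own operand.
    assert len(lane_data) == WARP_SIZE
    return [instruction(x) for x in lane_data]

lanes = list(range(WARP_SIZE))            # each lane holds different data
out = warp_execute(lambda x: 2 * x, lanes)
print(out[:4])  # -> [0, 2, 4, 6]
```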

Blocks guarantee SM residency for memory sharing

Thread blocks are scheduling units guaranteed to execute on a single Streaming Multiprocessor, granting them exclusive access to that SM's fast shared memory pool for inter-thread data reuse.
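Because each block lands wholly on one SM, choosing a launch configuration is just covering the problem with blocks via ceiling division. A sketch (the helper name and the 256-thread default are illustrative, not from the lecture):

```python
def launch_config(n_elements, threads_per_block=256):
    # Ceil-divide so every element is covered; each block will be
    # scheduled onto exactly one SM, whose shared memory its threads share.
    n_blocks = (n_elements + threads_per_block - 1) // threads_per_block
    return n_blocks, threads_per_block

blocks, threads = launch_config(1_000_000)
print(blocks, threads)  # -> 3907 256
```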

Warps serve as the hardware scheduling unit

The GPU scheduler dispatches and manages execution in groups of 32 threads called warps rather than individual threads, reducing scheduling overhead but creating divergence penalties when threads take different code paths.
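The divergence penalty falls out of the lockstep rule: when lanes disagree on a branch, the warp issues one pass per distinct path, masking off the lanes not on that path. A toy cost model (illustrative, not a hardware simulator):

```python
WARP_SIZE = 32

def divergent_passes(branch_taken):
    # One serialized pass per distinct branch path within the warp;
    # lanes not on the current path sit idle (masked).
    return len(set(branch_taken))

uniform   = [True] * WARP_SIZE                      # all lanes agree
divergent = [i % 2 == 0 for i in range(WARP_SIZE)]  # even/odd lanes split
print(divergent_passes(uniform), divergent_passes(divergent))  # -> 1 2
```

A branch on `threadIdx.x % 2` thus roughly halves throughput, while a branch every warp agrees on costs nothing extra.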

🚀 Systems-Aware Optimization 3 insights

Hardware knowledge enables efficient model design

Understanding the GPU execution model is essential for architecture design: efficient scaling comes from actually utilizing the hardware through memory- and layout-aware algorithm choices, not just from counting theoretical FLOPs.

Throughput varies non-linearly with matrix dimensions

GPU performance on matrix multiplication exhibits complex patterns where specific matrix sizes achieve dramatically higher throughput than others due to intricate interactions between memory hierarchies and compute units.
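One concrete source of those patterns is tile quantization: work is launched in fixed-size tiles, so a dimension just past a tile boundary wastes most of a tile. A sketch assuming a hypothetical 128-wide tile:

```python
import math

def tile_utilization(n, tile=128):
    # Fraction of launched tile work that maps to real output elements.
    # A dimension one past a multiple of the tile size pays for a
    # nearly empty extra tile.
    tiles = math.ceil(n / tile)
    return n / (tiles * tile)

for n in (256, 257):
    print(n, round(tile_utilization(n), 3))  # -> 256 1.0, then 257 0.669
```

This is why padding matrix dimensions to friendly multiples can outperform the "smaller" unpadded problem.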

Flash Attention demonstrates hardware optimization

The lecture presents Flash Attention as a synthesis of these GPU techniques, including tiling and careful memory management, demonstrating how deep hardware knowledge enables algorithmic breakthroughs in transformer attention computation.
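The core trick that makes tiled attention possible is the online softmax: the softmax over a row of scores is computed block by block with a running max and a rescaled running sum, so the full row never has to sit in slow global memory. A minimal sketch of that recurrence (scalar Python, standing in for the blocked GPU computation):

```python
import math

def online_softmax(scores, block=4):
    # Stream over blocks of scores, maintaining a running max m and a
    # running sum s rescaled whenever the max changes -- the same
    # recurrence Flash Attention applies per tile.
    m, s = float("-inf"), 0.0
    for i in range(0, len(scores), block):
        blk = scores[i:i + block]
        m_new = max(m, max(blk))
        s = s * math.exp(m - m_new) + sum(math.exp(x - m_new) for x in blk)
        m = m_new
    return [math.exp(x - m) / s for x in scores]

xs = [0.1, 2.0, -1.0, 3.0, 0.5, 1.5]
m = max(xs)
ref = [math.exp(x - m) / sum(math.exp(y - m) for y in xs) for x in xs]
print(all(abs(a - b) < 1e-9 for a, b in zip(online_softmax(xs), ref)))  # -> True
```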

Bottom Line

Maximize language model training efficiency by minimizing slow global memory accesses and maximizing data reuse within the fast but limited shared memory and registers of GPU Streaming Multiprocessors.
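The bottom line can be stated as a roofline: attainable throughput is capped by either the compute peak or by memory bandwidth times arithmetic intensity (FLOPs performed per byte moved from global memory). A sketch with hypothetical A100-class round numbers:

```python
PEAK_TFLOPS = 312.0  # hypothetical tensor-core peak (A100-class, BF16)
MEM_TB_S    = 1.5    # hypothetical HBM bandwidth in TB/s

def attainable_tflops(flops_per_byte):
    # Roofline: the lesser of the compute roof and what the memory
    # system can feed at this arithmetic intensity.
    return min(PEAK_TFLOPS, flops_per_byte * MEM_TB_S)

print(attainable_tflops(10))   # -> 15.0   (memory-bound)
print(attainable_tflops(500))  # -> 312.0  (compute-bound)
```

Raising data reuse in shared memory and registers raises `flops_per_byte`, which is exactly how a kernel climbs out of the memory-bound regime.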
