Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 5: GPUs, TPUs
TL;DR
This lecture introduces GPU architecture for language model training, explaining the shift from serial CPU execution to parallel GPU throughput, the critical importance of memory hierarchies, and the SIMT programming model essential for efficient deep learning systems.
📉 The End of Serial Scaling
Dennard scaling collapse forced parallel architectures
CPU clock speeds stopped increasing around 2005 due to fundamental physical limits on transistor scaling, forcing the industry to abandon faster serial execution in favor of horizontal GPU parallelism.
GPUs maximize throughput not latency
Unlike CPUs, which are designed for fast serial execution with complex branch prediction, GPUs use thousands of lightweight cores that accept higher per-task latency in exchange for much greater aggregate floating-point throughput.
Super-exponential FLOP growth since 2017
Beginning with P100 and V100 chips, GPU compute capacity scaled dramatically through hardware innovations including tensor cores, structured sparsity, and reduced precision formats like FP8.
🧩 GPU Hardware Architecture
Streaming Multiprocessors are independent compute units
Modern GPUs contain on the order of a hundred Streaming Multiprocessors (108 on the A100), each acting as a discrete core whose internal streaming processors execute threads in parallel with dedicated access to fast local memory.
Memory distance creates 20x latency penalties
Global memory sits physically far from the compute die, with roughly 400 cycles of access latency, while L1 cache and shared memory live inside each Streaming Multiprocessor and respond in about 20-30 cycles.
Shared memory enables fast thread cooperation
Located within each SM, shared memory allows threads within a block to communicate and reuse data rapidly, though it costs hundreds of times more per byte than global DRAM and consumes significantly more power.
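The latency gap above can be put into numbers with a back-of-envelope model. The sketch below uses the approximate cycle counts quoted in this section (illustrative figures, not measurements of any specific chip) to show why reusing a value from shared memory many times per global load is the core optimization strategy:

```python
# Back-of-envelope: latency penalty of global memory vs. on-chip memory,
# using the approximate cycle counts quoted above (illustrative, not measured).
GLOBAL_MEM_CYCLES = 400        # global (HBM/DRAM) access latency
SHARED_MEM_CYCLES = (20, 30)   # L1 / shared-memory access latency range

penalties = [GLOBAL_MEM_CYCLES / c for c in SHARED_MEM_CYCLES]
print(f"Going off-chip costs {penalties[1]:.0f}x-{penalties[0]:.0f}x more cycles")

def effective_cycles_per_access(reuse_k, global_c=400, shared_c=25):
    """Average cycles per access if each value is loaded once from global
    memory and then read reuse_k more times from shared memory."""
    return (global_c + reuse_k * shared_c) / (reuse_k + 1)

# More on-chip reuse amortizes the one expensive global load:
for k in (0, 3, 15):
    print(f"reuse={k:2d}: {effective_cycles_per_access(k):6.1f} cycles/access")
```

With zero reuse every access pays the full ~400 cycles; reusing each loaded tile element 15 times drops the average below 50 cycles, which is the arithmetic behind tiled kernels.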
⚙️ Programming Model and Execution
SIMT architecture requires lockstep warp execution
GPUs execute threads in 32-thread warps following Single Instruction Multiple Thread principles, meaning all threads in a warp must execute identical instructions simultaneously on different data inputs.
Blocks guarantee SM residency for memory sharing
Thread blocks are scheduling units guaranteed to execute on a single Streaming Multiprocessor, granting them exclusive access to that SM's fast shared memory pool for inter-thread data reuse.
Warps serve as the hardware scheduling unit
The GPU scheduler dispatches and manages execution in groups of 32 threads called warps rather than individual threads, reducing scheduling overhead but creating divergence penalties when threads take different code paths.
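The divergence penalty described above can be sketched with a toy cost model (an illustration of the SIMT rule, not a hardware simulator): when threads in a warp disagree on a branch, the warp executes each taken path in turn with the non-participating threads masked off, so a divergent warp pays for both paths:

```python
# Toy model of SIMT warp divergence. Within a 32-thread warp, threads taking
# different branches are serialized: the warp runs each taken path in turn,
# masking off the threads that did not take it.
WARP_SIZE = 32

def warp_cycles(branch_mask, then_cost, else_cost):
    """Cycles for one warp to execute an if/else, where branch_mask[i] says
    whether thread i takes the 'then' path. A divergent warp pays both paths."""
    takes_then = any(branch_mask)        # at least one thread takes 'then'
    takes_else = not all(branch_mask)    # at least one thread takes 'else'
    return then_cost * takes_then + else_cost * takes_else

uniform = [True] * WARP_SIZE                        # all threads agree
divergent = [i % 2 == 0 for i in range(WARP_SIZE)]  # threads disagree

print(warp_cycles(uniform, 100, 40))    # only the 'then' path runs
print(warp_cycles(divergent, 100, 40))  # both paths run serially
```

A uniform warp costs 100 cycles here while the divergent one costs 140, which is why branching on thread index inside a warp is discouraged.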
🚀 Systems-Aware Optimization
Hardware knowledge enables efficient model design
Understanding GPU execution models is essential for architecture design because efficient scaling requires maximizing resource utilization through hardware-aware algorithm choices rather than just theoretical compute counts.
Throughput varies non-linearly with matrix dimensions
GPU performance on matrix multiplication exhibits complex patterns where specific matrix sizes achieve dramatically higher throughput than others due to intricate interactions between memory hierarchies and compute units.
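One concrete mechanism behind these non-linear patterns is tile quantization: if the hardware computes output in fixed-size tiles, a dimension that is not a multiple of the tile size wastes part of the last tile. The sketch below models this with an illustrative 128x128 tile (the value is an assumption for demonstration, not a spec for any particular chip):

```python
import math

# Sketch of "tile quantization": if the GPU computes output in fixed
# TILE x TILE blocks, a matrix dimension that isn't a multiple of TILE
# wastes part of the last tile of work. TILE=128 is illustrative.
TILE = 128

def tile_efficiency(m, n):
    """Fraction of launched tile work that produces useful output elements."""
    tiles_m = math.ceil(m / TILE)
    tiles_n = math.ceil(n / TILE)
    launched = tiles_m * tiles_n * TILE * TILE  # elements of work dispatched
    return (m * n) / launched                   # elements actually needed

for m, n in [(4096, 4096), (4097, 4097), (4096, 1000)]:
    print(f"{m}x{n}: {tile_efficiency(m, n):.1%} of tile work is useful")
```

Going from 4096 to 4097 in both dimensions drops the modeled utilization from 100% to about 94%, which is the flavor of cliff that real matmul benchmarks exhibit at certain sizes.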
Flash Attention demonstrates hardware optimization
The lecture presents Flash Attention as a synthesis of these GPU techniques, combining tiling with careful memory management, and shows how deep hardware knowledge enables algorithmic breakthroughs in computing transformer attention.
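The core trick can be sketched in a few lines of NumPy: process the keys and values in tiles while maintaining a running ("online") softmax, so the full N x N score matrix never has to sit in slow global memory. This is a minimal single-head illustration without scaling or masking, not the real fused kernel:

```python
import numpy as np

def tiled_attention(Q, K, V, tile=64):
    """Flash-Attention-style tiled softmax(Q K^T) V with an online softmax.
    Only one (n, tile) block of scores is materialized at a time."""
    n, d = Q.shape
    out = np.zeros((n, d))
    running_max = np.full(n, -np.inf)   # running max of scores per query
    running_sum = np.zeros(n)           # running softmax denominator
    for start in range(0, K.shape[0], tile):
        Kt, Vt = K[start:start + tile], V[start:start + tile]
        scores = Q @ Kt.T                       # one (n, tile) score block
        new_max = np.maximum(running_max, scores.max(axis=1))
        # Rescale previously accumulated output/denominator to the new max
        correction = np.exp(running_max - new_max)
        p = np.exp(scores - new_max[:, None])   # unnormalized block weights
        out = out * correction[:, None] + p @ Vt
        running_sum = running_sum * correction + p.sum(axis=1)
        running_max = new_max
    return out / running_sum[:, None]

# Check against straightforward (memory-hungry) attention:
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 16)) for _ in range(3))
scores = Q @ K.T
naive = np.exp(scores - scores.max(axis=1, keepdims=True))
naive = (naive / naive.sum(axis=1, keepdims=True)) @ V
print(np.allclose(tiled_attention(Q, K, V), naive))
```

The rescaling step is what lets each tile be processed once and discarded: whenever a new tile raises the running max, previously accumulated results are multiplied by a correction factor so the final normalization is exact.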
Bottom Line
Maximize language model training efficiency by minimizing slow global memory accesses and maximizing data reuse within the fast but limited shared memory and registers of GPU Streaming Multiprocessors.