Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 8: Parallelism
TL;DR
This lecture explains how to scale language model training across large clusters using 4D parallelism. It contrasts TPU and GPU networking architectures and addresses the memory bottlenecks, particularly optimizer states, that dominate training costs at scale.
🌐 Hardware Networking Architectures
TPU toroidal mesh vs GPU fat tree topologies
TPUs traditionally use a toroidal mesh connecting only nearest neighbors, making them cost-effective for predictable dense models, while GPUs employ a fat tree topology enabling flexible all-to-all communication critical for mixture-of-experts architectures.
Convergent evolution toward all-to-all connectivity
Google's announcement of TPU v6 (with Virgo network) and TPU AI chips signals a shift toward tree-like all-to-all connectivity to handle modern workloads where tokens route stochastically to different experts during inference.
Huawei's brute-force scaling approach
The Huawei Ascend 910 compensates for slower individual chips by connecting 384 chips via massive fiber optic switching, achieving scale at the cost of 4x higher power consumption compared to equivalent Nvidia systems.
💾 The Memory Bottleneck Crisis
Training requires ~16 bytes per parameter
A rule-of-thumb accounting stores roughly five parameter-sized tensors during training: the parameters, the gradients, the high-precision accumulators, and the two Adam optimizer states (first and second moments). Because some of these are held at higher precision than others, the total comes to about 16 bytes per parameter.
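The 16-byte figure can be reproduced with a small tally. This is a minimal sketch, assuming a common mixed-precision Adam setup (16-bit parameters and gradients, 32-bit accumulators and moments); the byte widths and the 7.5-billion-parameter example are illustrative assumptions, not values stated in the lecture summary.

```python
# Bytes held per parameter during training.
# ASSUMPTION: bf16 parameters/gradients and fp32 accumulators/moments.
BYTES_PER_PARAM = {
    "parameters (bf16)": 2,
    "gradients (bf16)": 2,
    "high-precision accumulator (fp32)": 4,
    "Adam first moment (fp32)": 4,
    "Adam second moment (fp32)": 4,
}


def training_bytes(num_params: int) -> int:
    """Total bytes needed to hold all five tensors for a model of this size."""
    return num_params * sum(BYTES_PER_PARAM.values())


total_per_param = sum(BYTES_PER_PARAM.values())

# Hypothetical 7.5-billion-parameter model, reported in gigabytes (1e9 bytes).
example_gb = training_bytes(7_500_000_000) / 1e9

for name, width in BYTES_PER_PARAM.items():
    print(f"{name:36s} {width} bytes")
print(f"total per parameter: {total_per_param} bytes")
print(f"7.5B-parameter model: {example_gb:.1f} GB")
```

Under these assumptions the two Adam moments alone account for 8 of the 16 bytes, and 12 of 16 if the high-precision accumulator is counted with them.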
Optimizer states dominate memory costs
Contrary to intuition, optimizer states consume the majority of memory (not model weights), making them the primary target for optimization when scaling to large models.
Naive replication causes linear memory scaling
Standard data parallelism replicates the full model, gradients, and optimizer states on every GPU. Total memory consumption across the cluster therefore grows linearly with the number of accelerators, while the memory required on each device stays the same instead of shrinking as devices are added.
⚡ Parallelism Strategies & Sharding
Data parallelism communication overhead
Naive data parallelism requires communicating approximately two times the number of parameters per batch to synchronize gradients, offering compute scaling but zero memory scaling benefits.
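One way to see where the factor of two comes from is to count traffic in a ring all-reduce, which is a reduce-scatter followed by an all-gather. This is a sketch under the assumption that gradients are synchronized with a ring all-reduce; the summary does not say which algorithm the lecture uses, and the function name is hypothetical.

```python
def ring_allreduce_elements_sent(num_params: int, num_devices: int) -> float:
    """Elements each device sends in one ring all-reduce of the gradients.

    Each of the two phases (reduce-scatter, then all-gather) takes
    num_devices - 1 steps, and every step sends one chunk of size
    num_params / num_devices.
    """
    if num_devices < 2:
        return 0.0  # a single device has nothing to synchronize
    chunk = num_params / num_devices
    return 2 * (num_devices - 1) * chunk


# The per-device traffic approaches 2x the parameter count as devices are added.
for n in (2, 8, 64, 1024):
    ratio = ring_allreduce_elements_sent(1_000_000, n) / 1_000_000
    print(f"{n:5d} devices: {ratio:.4f} x parameters sent per device")
```

The ratio is 2(N - 1)/N for N devices, so it sits just under two for any realistically sized cluster.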
Sharding optimizer states yields dramatic reductions
Distributing optimizer states across GPUs removes the largest replicated component of per-device memory. In the illustrated example, per-device memory falls from 120 units with full replication to 1.9 units once gradients and parameters are sharded as well as optimizer states.
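The 120 and 1.9 figures are consistent with a 7.5-billion-parameter model at 16 bytes per parameter spread over 64 devices. The sketch below works through that arithmetic; the model size, the device count, and the 2 + 2 + 12 byte split are assumptions chosen to match those two numbers, not values given in the summary, and the function name is hypothetical.

```python
def per_device_gb(
    num_params: int,
    num_devices: int,
    shard_optimizer: bool = False,
    shard_gradients: bool = False,
    shard_parameters: bool = False,
) -> float:
    """Per-device memory in GB (1e9 bytes) for a given sharding choice."""
    # ASSUMPTION: 2 bytes of parameters, 2 bytes of gradients, and 12 bytes of
    # optimizer state (high-precision accumulator plus two Adam moments).
    param_bytes = 2.0 * num_params
    grad_bytes = 2.0 * num_params
    optim_bytes = 12.0 * num_params

    # A sharded component is split evenly; an unsharded one is fully replicated.
    if shard_parameters:
        param_bytes /= num_devices
    if shard_gradients:
        grad_bytes /= num_devices
    if shard_optimizer:
        optim_bytes /= num_devices

    return (param_bytes + grad_bytes + optim_bytes) / 1e9


P, N = 7_500_000_000, 64  # assumed model size and device count

replicated = per_device_gb(P, N)
optimizer_only = per_device_gb(P, N, shard_optimizer=True)
optimizer_and_grads = per_device_gb(P, N, shard_optimizer=True, shard_gradients=True)
fully_sharded = per_device_gb(P, N, True, True, True)

print(f"fully replicated:           {replicated:6.1f} GB")
print(f"shard optimizer states:     {optimizer_only:6.1f} GB")
print(f"+ shard gradients:          {optimizer_and_grads:6.1f} GB")
print(f"+ shard parameters:         {fully_sharded:6.1f} GB")
```

Under this accounting, sharding optimizer states alone already removes roughly three quarters of the per-device footprint, and the 1.9 figure is reached only when all three components are sharded.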
Intra-node vs inter-node communication constraints
Fast intra-node connections permit communication-heavy operations, while slower inter-node links require strategies that minimize cross-machine data transfer, necessitating hybrid parallelism approaches.
Bottom Line
To train models at data center scale, shard optimizer states and gradients across devices rather than replicating them, selecting your parallelism strategy based on whether your hardware uses neighbor-only mesh (TPU) or all-to-all tree (GPU) networking topologies.