Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 8: Parallelism

Podcasts | April 28, 2026 | 5.42 thousand views | 1:20:11

TL;DR

This lecture details how to scale language model training across massive clusters using 4D parallelism. It contrasts TPU and GPU networking architectures and addresses the critical memory bottlenecks, particularly optimizer states, that dominate training costs at scale.

🌐 Hardware Networking Architectures 3 insights

TPU toroidal mesh vs GPU fat tree topologies

TPUs traditionally use a toroidal mesh connecting only nearest neighbors, making them cost-effective for predictable dense models, while GPUs employ a fat tree topology enabling flexible all-to-all communication critical for mixture-of-experts architectures.
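To make the nearest-neighbor constraint concrete, here is a minimal sketch of a 2D torus in plain Python. The 4x4 size and the helper names `torus_neighbors` and `torus_hops` are illustrative assumptions, not anything taken from the lecture; real TPU pods use larger, typically 3D, layouts.

```python
from itertools import product


def torus_neighbors(coord, dims):
    """Chips directly wired to `coord` on a torus (one step along each axis, with wraparound)."""
    out = set()
    for axis, size in enumerate(dims):
        for delta in (-1, 1):
            nxt = list(coord)
            nxt[axis] = (nxt[axis] + delta) % size
            out.add(tuple(nxt))
    out.discard(tuple(coord))  # only matters for axes of size 1
    return sorted(out)


def torus_hops(a, b, dims):
    """Minimum number of links between chips a and b, using wraparound where shorter."""
    return sum(min(abs(x - y), size - abs(x - y)) for x, y, size in zip(a, b, dims))


if __name__ == "__main__":
    dims = (4, 4)  # hypothetical 4x4 torus
    chips = list(product(range(dims[0]), range(dims[1])))
    print("neighbors of (0, 0):", torus_neighbors((0, 0), dims))
    worst = max(torus_hops(a, b, dims) for a in chips for b in chips)
    print("worst-case hops between any two chips:", worst)
```

Each chip talks directly to only a handful of neighbors, so traffic between arbitrary pairs has to pass through intermediate chips. That suits regular, predictable communication patterns and is less convenient when any device may need to reach any other.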

Convergent evolution toward all-to-all connectivity

Google's announcement of TPU v6 (with Virgo network) and TPU AI chips signals a shift toward tree-like all-to-all connectivity to handle modern workloads where tokens route stochastically to different experts during inference.

Huawei's brute-force scaling approach

The Huawei Ascend 910 compensates for slower individual chips by connecting 384 chips via massive fiber optic switching, achieving scale at the cost of 4x higher power consumption compared to equivalent Nvidia systems.

💾 The Memory Bottleneck Crisis 3 insights

Training requires ~16 bytes per parameter

A rule-of-thumb accounting stores roughly five values per parameter during training: the parameter itself, its gradient, a high-precision accumulator, and the two Adam optimizer states (first and second moments).
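The accounting can be written out as a small table. The byte sizes below are an assumption (2 bytes for bf16 parameters and gradients, 4 bytes for each fp32 value), chosen because they are a common mixed-precision setup and sum to the ~16 bytes quoted above; the summary itself does not give the per-item sizes.

```python
# Rule-of-thumb bytes per parameter for mixed-precision Adam training.
# Per-item sizes are assumptions: bf16 = 2 bytes, fp32 = 4 bytes.
BYTES_PER_PARAM = {
    "parameters (bf16)": 2,
    "gradients (bf16)": 2,
    "high-precision accumulators (fp32)": 4,
    "Adam first moment (fp32)": 4,
    "Adam second moment (fp32)": 4,
}


def training_bytes_per_param(breakdown=BYTES_PER_PARAM):
    """Total bytes of training state held for each model parameter."""
    return sum(breakdown.values())


def training_memory_gb(num_params, breakdown=BYTES_PER_PARAM):
    """Training-state memory in GB (1e9 bytes), ignoring activations."""
    return num_params * training_bytes_per_param(breakdown) / 1e9


if __name__ == "__main__":
    for name, nbytes in BYTES_PER_PARAM.items():
        print(f"{name:38s} {nbytes} bytes")
    print("total bytes per parameter:", training_bytes_per_param())
    print("state for a 1e9-parameter model (GB):", training_memory_gb(1e9))
```

Under these assumptions only 2 of the 16 bytes are the working copy of the weights. Activations are left out entirely, so this is a lower bound on real usage.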

Optimizer states dominate memory costs

Contrary to intuition, optimizer states consume the majority of memory (not model weights), making them the primary target for optimization when scaling to large models.

Naive replication causes linear memory scaling

Standard data parallelism replicates the full model, gradients, and optimizer states on every GPU, so aggregate memory consumption grows linearly with the number of accelerators while the memory needed on each device stays the same no matter how many are added.

Parallelism Strategies & Sharding 3 insights

Data parallelism communication overhead

Naive data parallelism requires communicating approximately two times the number of parameters per batch to synchronize gradients, offering compute scaling but zero memory scaling benefits.
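One way to see where the factor of two comes from is to simulate a ring all-reduce, which is a common way to implement gradient synchronization (an assumption here; the summary does not say which algorithm the lecture uses). The sketch below runs a reduce-scatter phase followed by an all-gather phase in plain Python and counts how many elements each worker sends. The function name `ring_all_reduce` is illustrative.

```python
def ring_all_reduce(grads):
    """Sum equal-length gradient lists across workers arranged in a ring.

    Returns (per-worker reduced copies, elements sent by each worker).
    """
    n = len(grads)
    size = len(grads[0])
    assert size % n == 0, "sketch assumes the vector splits evenly into n chunks"
    chunk = size // n
    data = [list(g) for g in grads]
    sent = [0] * n

    def span(c):
        return range(c * chunk, (c + 1) * chunk)

    def exchange(chunk_to_send, combine):
        # Snapshot payloads first so all sends in a step happen "at once".
        outgoing = []
        for r in range(n):
            c = chunk_to_send(r)
            outgoing.append((c, [data[r][i] for i in span(c)]))
        for r, (c, payload) in enumerate(outgoing):
            dst = (r + 1) % n  # each worker sends to its right-hand neighbor
            for i, v in zip(span(c), payload):
                data[dst][i] = combine(data[dst][i], v)
            sent[r] += chunk

    # Phase 1, reduce-scatter: afterwards worker r holds the full sum of chunk (r + 1) % n.
    for step in range(n - 1):
        exchange(lambda r: (r - step) % n, lambda old, new: old + new)

    # Phase 2, all-gather: pass the finished chunks around until everyone has all of them.
    for step in range(n - 1):
        exchange(lambda r: (r + 1 - step) % n, lambda old, new: new)

    return data, sent


if __name__ == "__main__":
    workers, params = 4, 8
    grads = [[w + i for i in range(params)] for w in range(workers)]
    reduced, sent = ring_all_reduce(grads)
    print("reduced gradient on worker 0:", reduced[0])
    print("elements sent per worker:", sent, "vs. parameters:", params)
```

Each worker sends 2 * (n - 1) / n times the parameter count, which approaches twice the number of parameters as the number of workers grows.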

Sharding optimizer states yields dramatic reductions

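Distributing training state across GPUs reduces per-device memory from 120 units to 1.9 units in the illustrated example. A reduction of that size (about 63x) corresponds to sharding optimizer states, gradients, and parameters together: sharding optimizer states alone captures the largest single saving, but the parameters and gradients still replicated on every device set a floor well above 1.9 units.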
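The effect of each level of sharding can be estimated with simple arithmetic. The sketch below assumes 2 bytes each for parameters and gradients and 12 bytes of optimizer-related state per parameter, and it uses a 7.5B-parameter model on 64 devices. Those two values are assumptions picked because they reproduce the 120 and 1.9 endpoints quoted above; the summary does not state the example's configuration. The function name `per_device_gb` is hypothetical.

```python
def per_device_gb(num_params, num_devices, shard_optimizer=False,
                  shard_gradients=False, shard_parameters=False,
                  param_bytes=2, grad_bytes=2, optimizer_bytes=12):
    """Per-device training-state memory in GB (1e9 bytes), ignoring activations."""

    def share(nbytes, sharded):
        # A sharded item is split evenly across devices; a replicated one is held in full.
        return nbytes / num_devices if sharded else nbytes

    per_param = (share(param_bytes, shard_parameters)
                 + share(grad_bytes, shard_gradients)
                 + share(optimizer_bytes, shard_optimizer))
    return num_params * per_param / 1e9


if __name__ == "__main__":
    n_params, n_devices = 7.5e9, 64  # assumed example configuration
    configs = [
        ("replicate everything", {}),
        ("shard optimizer states", {"shard_optimizer": True}),
        ("+ shard gradients", {"shard_optimizer": True, "shard_gradients": True}),
        ("+ shard parameters", {"shard_optimizer": True, "shard_gradients": True,
                                "shard_parameters": True}),
    ]
    for label, flags in configs:
        print(f"{label:24s} {per_device_gb(n_params, n_devices, **flags):6.1f} GB")
```

Under these assumptions, sharding only the optimizer states removes most of the memory, and the last two steps remove the replicated parameters and gradients that would otherwise dominate what remains.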

Intra-node vs inter-node communication constraints

Fast intra-node connections permit communication-heavy operations, while slower inter-node links require strategies that minimize cross-machine data transfer, necessitating hybrid parallelism approaches.
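A hybrid layout can be sketched by assigning process groups so that the chattier strategy never leaves a node. The example below assumes the communication-heavy strategy is tensor parallelism and the lighter one is data parallelism, a common pairing that the summary itself does not spell out. The node and GPU counts and the function names are hypothetical.

```python
def build_groups(num_nodes, gpus_per_node):
    """Return (tensor_parallel_groups, data_parallel_groups) as lists of global ranks.

    Global rank = node * gpus_per_node + local GPU index.
    """
    def rank(node, local):
        return node * gpus_per_node + local

    # Communication-heavy groups: all GPUs inside one node (fast links).
    tp_groups = [[rank(node, local) for local in range(gpus_per_node)]
                 for node in range(num_nodes)]
    # Lighter groups: the same local GPU index on every node (slower cross-node links).
    dp_groups = [[rank(node, local) for node in range(num_nodes)]
                 for local in range(gpus_per_node)]
    return tp_groups, dp_groups


def nodes_spanned(group, gpus_per_node):
    """Number of distinct machines that a group's ranks live on."""
    return len({r // gpus_per_node for r in group})


if __name__ == "__main__":
    tp_groups, dp_groups = build_groups(num_nodes=2, gpus_per_node=4)
    print("tensor-parallel groups:", tp_groups)
    print("data-parallel groups:  ", dp_groups)
    print("nodes per TP group:", [nodes_spanned(g, 4) for g in tp_groups])
    print("nodes per DP group:", [nodes_spanned(g, 4) for g in dp_groups])
```

Every tensor-parallel group stays on a single machine, while each data-parallel group crosses machines once per rank, which is the trade the paragraph above describes.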

Bottom Line

To train models at data center scale, shard optimizer states and gradients across devices rather than replicating them, selecting your parallelism strategy based on whether your hardware uses neighbor-only mesh (TPU) or all-to-all tree (GPU) networking topologies.
