Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 8: Parallelism

Podcasts | April 28, 2026 | 1:20:11

TL;DR

This lecture details how to scale language model training across massive clusters using multi-dimensional (up to 4D) parallelism, contrasting TPU and GPU networking architectures and examining the memory bottleneck, particularly optimizer states, that dominates per-device footprint at scale.

🌐 Hardware Networking Architectures 3 insights

TPU toroidal mesh vs GPU fat tree topologies

TPUs traditionally use a toroidal mesh connecting only nearest neighbors, making them cost-effective for predictable dense models, while GPUs employ a fat tree topology enabling flexible all-to-all communication critical for mixture-of-experts architectures.

Convergent evolution toward all-to-all connectivity

Google's announcement of the TPU v6 (with its Virgo network) and newer AI chips signals a shift toward tree-like all-to-all connectivity, needed for modern workloads in which tokens route stochastically to different experts during inference.

Huawei's brute-force scaling approach

Huawei's Ascend 910 system compensates for slower individual chips by connecting 384 of them via massive fiber-optic switching, achieving comparable aggregate scale at the cost of roughly 4x the power consumption of an equivalent Nvidia system.

💾 The Memory Bottleneck Crisis 3 insights

Training requires ~16 bytes per parameter

A rule-of-thumb accounting stores roughly five copies of the weights during training: the parameters themselves, the gradients, a high-precision master copy, and the two Adam optimizer moments (first and second), which together come to about 16 bytes per parameter.
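
As a back-of-the-envelope sketch (assuming bf16 parameters and gradients with fp32 master weights and Adam moments, one common mixed-precision setup rather than the lecture's exact accounting), the bytes per parameter add up as follows:

```python
# Rough per-parameter memory accounting for mixed-precision Adam training.
# Assumes bf16 parameters/gradients plus fp32 master weights and Adam moments;
# other configurations shift the exact total.
BYTES_BF16 = 2
BYTES_FP32 = 4

bytes_per_param = (
    BYTES_BF16    # parameters (bf16 working copy)
    + BYTES_BF16  # gradients (bf16)
    + BYTES_FP32  # high-precision master copy of the weights
    + BYTES_FP32  # Adam first moment (m)
    + BYTES_FP32  # Adam second moment (v)
)  # = 16 bytes, i.e. roughly five "copies" of the weights

n_params = 7.5e9  # illustrative model size
print(f"{bytes_per_param} bytes/param -> "
      f"{n_params * bytes_per_param / 1e9:.0f} GB of training state")
```

At 7.5B parameters this already comes to about 120 GB of state before counting activations, which is the starting point for the sharding numbers later in this summary.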

Optimizer states dominate memory costs

Contrary to intuition, optimizer states consume the majority of memory (not model weights), making them the primary target for optimization when scaling to large models.

Naive replication causes linear memory scaling

Standard data parallelism replicates the full model, gradients, and optimizer states on every GPU, so per-device memory never shrinks as accelerators are added: the cluster's total memory footprint grows linearly with the number of GPUs while the largest trainable model stays fixed.

Parallelism Strategies & Sharding 3 insights

Data parallelism communication overhead

Naive data parallelism must communicate roughly twice the parameter count in gradient data per batch (the cost of a ring all-reduce), so it scales compute across devices while providing no memory savings at all.
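
A minimal sketch of that naive pattern (assuming torch.distributed is already initialized and every rank holds a full replica of the model and optimizer; not the lecture's exact code):

```python
import torch.distributed as dist

def naive_data_parallel_step(model, optimizer, loss_fn, inputs, targets):
    """One naive data-parallel step: every rank keeps the full model, gradients,
    and optimizer states, and only the gradients are synchronized."""
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            # A ring all-reduce moves roughly 2x the gradient bytes per step.
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(world_size)  # average gradients across replicas

    optimizer.step()  # identical update everywhere -> fully replicated state
    return loss
```

Real implementations bucket parameters and overlap the all-reduce with the backward pass, but the per-step communication volume is the same order.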

Sharding optimizer states yields dramatic reductions

In the illustrated example, sharding brings per-device memory from 120 GB under full replication down to 1.9 GB once optimizer states, gradients, and parameters are all distributed; sharding the optimizer states alone already accounts for a large share of that reduction.
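
One way to see where the savings come from is a ZeRO-1-style sketch (a hypothetical illustration, not the lecture's implementation): each rank keeps Adam moments only for the parameters it owns, updates that slice, and broadcasts the refreshed values.

```python
import torch
import torch.distributed as dist

class ShardedAdamSketch:
    """Illustrative optimizer-state sharding (ZeRO stage 1 flavor): gradients are
    still all-reduced on every rank, but Adam moments exist only on the rank that
    owns each parameter, cutting the dominant memory term by the world size.
    Assumes at least as many parameter tensors as ranks."""

    def __init__(self, params, lr=1e-4):
        self.rank = dist.get_rank()
        self.world_size = dist.get_world_size()
        self.params = list(params)
        # Round-robin ownership for illustration; real systems shard flattened
        # buffers so each rank gets an equal number of elements.
        self.owned = [p for i, p in enumerate(self.params)
                      if i % self.world_size == self.rank]
        self.inner = torch.optim.Adam(self.owned, lr=lr)  # states only for owned params

    def step(self):
        for p in self.params:
            if p.grad is not None:
                dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
                p.grad.div_(self.world_size)
        self.inner.step()  # apply Adam only to this rank's shard
        # Owners broadcast their freshly updated parameters to everyone else.
        for i, p in enumerate(self.params):
            dist.broadcast(p.data, src=i % self.world_size)
```

Sharding gradients (stage 2) and the parameters themselves (stage 3) follows the same ownership idea, replacing the all-reduce with reduce-scatter and all-gather collectives.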

Intra-node vs inter-node communication constraints

Fast intra-node connections can absorb communication-heavy operations, while slower inter-node links demand strategies that minimize cross-machine traffic, which is why practical systems mix parallelism types, keeping the chatty ones inside a node and the cheaper ones across nodes.
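
A sketch of how this placement is often expressed in recent PyTorch versions (the shapes and axis names here are illustrative assumptions, not from the lecture):

```python
from torch.distributed.device_mesh import init_device_mesh

# Example: 4 nodes x 8 GPUs, launched with torchrun so ranks 0-7 live on
# node 0, 8-15 on node 1, and so on. The fast intra-node links get the
# communication-heavy axis (tensor parallelism); the slower inter-node links
# get data parallelism, which only needs one gradient reduction per step.
mesh = init_device_mesh("cuda", (4, 8), mesh_dim_names=("dp", "tp"))

dp_group = mesh.get_group("dp")  # spans nodes: gradient all-reduce
tp_group = mesh.get_group("tp")  # within one node: per-layer activation collectives
```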

Bottom Line

To train models at data center scale, shard optimizer states and gradients across devices rather than replicating them, selecting your parallelism strategy based on whether your hardware uses neighbor-only mesh (TPU) or all-to-all tree (GPU) networking topologies.
