Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 8: Parallelism
TL;DR
This lecture details how to scale language model training across massive clusters using 4D parallelism, contrasting TPU and GPU networking architectures and addressing the critical memory bottlenecks, particularly optimizer states, that dominate memory consumption when training at scale.
🌐 Hardware Networking Architectures (3 insights)
TPU toroidal mesh vs GPU fat tree topologies
TPUs traditionally use a toroidal mesh that connects only nearest neighbors, which is cost-effective for dense models with predictable communication patterns, while GPUs employ a fat-tree topology that enables the flexible all-to-all communication critical for mixture-of-experts architectures.
Convergent evolution toward all-to-all connectivity
Google's announcements of newer TPU generations, such as TPU v6 with its Virgo network, signal a shift toward tree-like all-to-all connectivity to handle modern workloads in which tokens route stochastically to different experts during inference.
Huawei's brute-force scaling approach
Huawei's Ascend 910-based system compensates for slower individual chips by connecting 384 of them through massive fiber-optic switching, reaching comparable scale at roughly 4x the power consumption of an equivalent Nvidia system.
💾 The Memory Bottleneck Crisis (3 insights)
Training requires ~16 bytes per parameter
A rule-of-thumb accounting stores roughly five parameter-sized quantities during training: low-precision parameters and gradients (about 2 bytes each) plus high-precision master weights and Adam's first and second moments (about 4 bytes each), totaling roughly 16 bytes per parameter.
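As a concrete sketch of that accounting (assuming bf16 parameters and gradients with fp32 master weights and Adam moments; the 7B-parameter model size below is purely illustrative, not from the lecture):

```python
# Rule-of-thumb memory accounting for mixed-precision Adam training.
# Assumes bf16 params/grads plus fp32 master weights and Adam moments;
# activations are excluded, and the 7B size is only an illustrative choice.

BYTES_PER_PARAM = {
    "bf16 parameters": 2,
    "bf16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}

def training_memory_gb(num_params: float) -> float:
    """Total training-state memory in GB for a model with num_params parameters."""
    bytes_per_param = sum(BYTES_PER_PARAM.values())  # = 16 bytes
    return num_params * bytes_per_param / 1e9

if __name__ == "__main__":
    n = 7e9  # hypothetical 7B-parameter model
    print(f"bytes per parameter: {sum(BYTES_PER_PARAM.values())}")          # 16
    print(f"~{training_memory_gb(n):.0f} GB of training state for 7B params")  # ~112 GB
```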
Optimizer states dominate memory costs
Contrary to intuition, optimizer states consume the majority of memory (not model weights), making them the primary target for optimization when scaling to large models.
Naive replication causes linear memory scaling
Standard data parallelism replicates the full set of parameters, gradients, and optimizer states on every GPU, so total memory across the cluster grows linearly with the number of accelerators instead of staying at a single copy's worth, and per-device memory never shrinks as devices are added.
⚡ Parallelism Strategies & Sharding (3 insights)
Data parallelism communication overhead
Naive data parallelism requires communicating approximately two times the number of parameters per batch to synchronize gradients, offering compute scaling but zero memory scaling benefits.
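A minimal sketch of that gradient synchronization with torch.distributed, assuming the process group has already been initialized; in practice, frameworks such as PyTorch's DistributedDataParallel bucket these all-reduces and overlap them with the backward pass:

```python
import torch
import torch.distributed as dist

def allreduce_gradients(model: torch.nn.Module) -> None:
    """Naive data-parallel gradient sync: average gradients across all ranks.

    Under the hood an all-reduce is roughly a reduce-scatter plus an
    all-gather, each moving about one parameter's worth of data per element,
    which is where the ~2x #parameters-per-batch communication cost comes from.
    Assumes dist.init_process_group(...) has already been called.
    """
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # sum gradients across ranks
            p.grad.div_(world_size)                        # convert the sum to a mean
```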
Sharding optimizer states yields dramatic reductions
In the illustrated example, sharding the optimizer states alone cuts per-device memory from 120 GB to roughly 31 GB; additionally sharding gradients and then parameters brings it down to about 16.6 GB and ultimately 1.9 GB per device.
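A small calculator for the per-device footprint under progressively more aggressive ZeRO-style sharding. The 7.5B-parameter, 64-device configuration is an assumption, chosen because it reproduces the 120 GB and 1.9 GB endpoints above:

```python
# Per-device memory (GB) under ZeRO-style sharding, using the 16-byte/param
# breakdown: 2 (bf16 params) + 2 (bf16 grads) + 12 (fp32 master weights + Adam moments).
# The 7.5B-param / 64-device setting is an assumed illustration, not from the lecture.

def per_device_memory_gb(num_params: float, n_devices: int, stage: int) -> float:
    params_gb = 2 * num_params / 1e9   # bf16 parameters
    grads_gb = 2 * num_params / 1e9    # bf16 gradients
    optim_gb = 12 * num_params / 1e9   # fp32 master weights + Adam first/second moments
    if stage == 0:  # full replication (naive data parallelism)
        return params_gb + grads_gb + optim_gb
    if stage == 1:  # shard optimizer states only
        return params_gb + grads_gb + optim_gb / n_devices
    if stage == 2:  # shard optimizer states + gradients
        return params_gb + (grads_gb + optim_gb) / n_devices
    if stage == 3:  # shard optimizer states + gradients + parameters
        return (params_gb + grads_gb + optim_gb) / n_devices
    raise ValueError(f"unknown stage {stage}")

for s in range(4):
    print(f"stage {s}: {per_device_memory_gb(7.5e9, 64, s):.1f} GB per device")
# stage 0: 120.0, stage 1: 31.4, stage 2: 16.6, stage 3: 1.9
```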
Intra-node vs inter-node communication constraints
Fast intra-node connections permit communication-heavy operations, while slower inter-node links require strategies that minimize cross-machine data transfer, necessitating hybrid parallelism approaches.
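One common way to express that hybrid layout is with separate intra-node and inter-node process groups, keeping communication-heavy work inside the fast NVLink domain and only the lighter synchronization on the slower cross-machine fabric. The sketch below assumes a hypothetical 2-node x 8-GPU cluster and is not taken from the lecture:

```python
import torch.distributed as dist

# Hypothetical cluster: 2 nodes x 8 GPUs = 16 ranks (sizes are illustrative).
# Communication-heavy parallelism (e.g., tensor/model parallelism) stays within
# a node; lighter data-parallel gradient sync runs across nodes.
GPUS_PER_NODE = 8
WORLD_SIZE = 16  # assumes dist.init_process_group(...) has already been called

def build_hybrid_groups(rank: int):
    """Return (intra_node_group, inter_node_group) for this rank.

    dist.new_group is a collective call: every rank must create every group
    with identical arguments, then keep only the groups it belongs to.
    """
    intra_group = inter_group = None

    # Intra-node groups: one per node, e.g. ranks [0..7] and [8..15].
    for node in range(WORLD_SIZE // GPUS_PER_NODE):
        ranks = list(range(node * GPUS_PER_NODE, (node + 1) * GPUS_PER_NODE))
        group = dist.new_group(ranks=ranks)
        if rank in ranks:
            intra_group = group

    # Inter-node groups: same local GPU index on every node, e.g. ranks [0, 8].
    for local in range(GPUS_PER_NODE):
        ranks = list(range(local, WORLD_SIZE, GPUS_PER_NODE))
        group = dist.new_group(ranks=ranks)
        if rank in ranks:
            inter_group = group

    return intra_group, inter_group
```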
Bottom Line
To train models at data center scale, shard optimizer states and gradients across devices rather than replicating them, selecting your parallelism strategy based on whether your hardware uses neighbor-only mesh (TPU) or all-to-all tree (GPU) networking topologies.