Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 8: Parallelism

Podcasts | April 28, 2026 | 1:20:11

TL;DR

This lecture details how to scale language model training across massive clusters using multi-dimensional (up to 4D) parallelism, contrasting TPU and GPU networking architectures and examining the memory bottleneck, particularly optimizer states, that dominates per-device footprint at scale.

🌐 Hardware Networking Architectures 3 insights

TPU toroidal mesh vs GPU fat tree topologies

TPUs traditionally use a toroidal mesh connecting only nearest neighbors, making them cost-effective for predictable dense models, while GPUs employ a fat tree topology enabling flexible all-to-all communication critical for mixture-of-experts architectures.

Convergent evolution toward all-to-all connectivity

Google's announcement of the TPU v6 (with its Virgo network) and newer AI chips signals a shift toward tree-like all-to-all connectivity, needed for modern workloads in which tokens route stochastically to different experts during inference.

Huawei's brute-force scaling approach

Huawei's Ascend 910 system compensates for slower individual chips by connecting 384 of them via massive fiber-optic switching, achieving comparable aggregate scale at the cost of roughly 4x the power consumption of an equivalent Nvidia system.

💾 The Memory Bottleneck Crisis 3 insights

Training requires ~16 bytes per parameter

A rule-of-thumb accounting stores roughly five copies of the weights during training: the parameters themselves, the gradients, a high-precision master copy, and the two Adam optimizer moments (first and second), which together come to about 16 bytes per parameter.
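
As a back-of-the-envelope sketch (assuming bf16 parameters and gradients with fp32 master weights and Adam moments, one common mixed-precision setup rather than the lecture's exact accounting), the bytes per parameter add up as follows:

```python
# Rough per-parameter memory accounting for mixed-precision Adam training.
# Assumes bf16 parameters/gradients plus fp32 master weights and Adam moments;
# other configurations shift the exact total.
BYTES_BF16 = 2
BYTES_FP32 = 4

bytes_per_param = (
    BYTES_BF16    # parameters (bf16 working copy)
    + BYTES_BF16  # gradients (bf16)
    + BYTES_FP32  # high-precision master copy of the weights
    + BYTES_FP32  # Adam first moment (m)
    + BYTES_FP32  # Adam second moment (v)
)  # = 16 bytes, i.e. roughly five "copies" of the weights

n_params = 7.5e9  # illustrative model size
print(f"{bytes_per_param} bytes/param -> "
      f"{n_params * bytes_per_param / 1e9:.0f} GB of training state")
```

At 7.5B parameters this already comes to about 120 GB of state before counting activations, which is the starting point for the sharding numbers later in this summary.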

Optimizer states dominate memory costs

Contrary to intuition, optimizer states consume the majority of memory (not model weights), making them the primary target for optimization when scaling to large models.

Naive replication causes linear memory scaling

Standard data parallelism replicates the full model, gradients, and optimizer states on every GPU, so per-device memory never shrinks as accelerators are added: the cluster's total memory footprint grows linearly with the number of GPUs while the largest trainable model stays fixed.

Parallelism Strategies & Sharding 3 insights

Data parallelism communication overhead

Naive data parallelism must communicate roughly twice the parameter count in gradient data per batch (the cost of a ring all-reduce), so it scales compute across devices while providing no memory savings at all.
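
A minimal sketch of that naive pattern (assuming torch.distributed is already initialized and every rank holds a full replica of the model and optimizer; not the lecture's exact code):

```python
import torch.distributed as dist

def naive_data_parallel_step(model, optimizer, loss_fn, inputs, targets):
    """One naive data-parallel step: every rank keeps the full model, gradients,
    and optimizer states, and only the gradients are synchronized."""
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            # A ring all-reduce moves roughly 2x the gradient bytes per step.
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(world_size)  # average gradients across replicas

    optimizer.step()  # identical update everywhere -> fully replicated state
    return loss
```

Real implementations bucket parameters and overlap the all-reduce with the backward pass, but the per-step communication volume is the same order.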

Sharding optimizer states yields dramatic reductions

In the illustrated example, sharding brings per-device memory from 120 GB under full replication down to 1.9 GB once optimizer states, gradients, and parameters are all distributed; sharding the optimizer states alone already accounts for a large share of that reduction.
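
One way to see where the savings come from is a ZeRO-1-style sketch (a hypothetical illustration, not the lecture's implementation): each rank keeps Adam moments only for the parameters it owns, updates that slice, and broadcasts the refreshed values.

```python
import torch
import torch.distributed as dist

class ShardedAdamSketch:
    """Illustrative optimizer-state sharding (ZeRO stage 1 flavor): gradients are
    still all-reduced on every rank, but Adam moments exist only on the rank that
    owns each parameter, cutting the dominant memory term by the world size.
    Assumes at least as many parameter tensors as ranks."""

    def __init__(self, params, lr=1e-4):
        self.rank = dist.get_rank()
        self.world_size = dist.get_world_size()
        self.params = list(params)
        # Round-robin ownership for illustration; real systems shard flattened
        # buffers so each rank gets an equal number of elements.
        self.owned = [p for i, p in enumerate(self.params)
                      if i % self.world_size == self.rank]
        self.inner = torch.optim.Adam(self.owned, lr=lr)  # states only for owned params

    def step(self):
        for p in self.params:
            if p.grad is not None:
                dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
                p.grad.div_(self.world_size)
        self.inner.step()  # apply Adam only to this rank's shard
        # Owners broadcast their freshly updated parameters to everyone else.
        for i, p in enumerate(self.params):
            dist.broadcast(p.data, src=i % self.world_size)
```

Sharding gradients (stage 2) and the parameters themselves (stage 3) follows the same ownership idea, replacing the all-reduce with reduce-scatter and all-gather collectives.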

Intra-node vs inter-node communication constraints

Fast intra-node connections can absorb communication-heavy operations, while slower inter-node links demand strategies that minimize cross-machine traffic, which is why practical systems mix parallelism types, keeping the chatty ones inside a node and the cheaper ones across nodes.
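
A sketch of how this placement is often expressed in recent PyTorch versions (the shapes and axis names here are illustrative assumptions, not from the lecture):

```python
from torch.distributed.device_mesh import init_device_mesh

# Example: 4 nodes x 8 GPUs, launched with torchrun so ranks 0-7 live on
# node 0, 8-15 on node 1, and so on. The fast intra-node links get the
# communication-heavy axis (tensor parallelism); the slower inter-node links
# get data parallelism, which only needs one gradient reduction per step.
mesh = init_device_mesh("cuda", (4, 8), mesh_dim_names=("dp", "tp"))

dp_group = mesh.get_group("dp")  # spans nodes: gradient all-reduce
tp_group = mesh.get_group("tp")  # within one node: per-layer activation collectives
```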

Bottom Line

To train models at data center scale, shard optimizer states and gradients across devices rather than replicating them, selecting your parallelism strategy based on whether your hardware uses neighbor-only mesh (TPU) or all-to-all tree (GPU) networking topologies.
