Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 7: Parallelism
TL;DR
This lecture introduces distributed training for large language models, explaining how to scale beyond single-GPU memory limits using collective communication primitives (all-reduce, all-gather, reduce-scatter) across hardware topologies ranging from NVLink-connected single nodes to InfiniBand-linked multi-node clusters.
🚀 The Multi-GPU Scaling Imperative 3 insights
Memory constraints drive distribution
Training trillion-parameter models requires multiple GPUs because model weights, activations, and optimizer states exceed single-GPU HBM capacity (e.g., 192GB on NVIDIA B200).
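As a rough illustration of the memory argument above, the sketch below estimates how many GPUs are needed just to hold the training state of a model. The figure of 16 bytes per parameter is an assumption (a common mixed-precision Adam accounting), not a number from the lecture, and activations are ignored entirely. The function name `min_gpus_for_state` is hypothetical.

```python
import math

# Assumed accounting, not from the lecture: bf16 weights (2) + bf16 gradients (2)
# + fp32 master weights (4) + two fp32 Adam moments (4 + 4) = 16 bytes per parameter.
BYTES_PER_PARAM = 16
HBM_BYTES = 192e9  # 192 GB per GPU, the B200 figure quoted above


def min_gpus_for_state(num_params, bytes_per_param=BYTES_PER_PARAM, hbm_bytes=HBM_BYTES):
    """Smallest GPU count whose combined HBM holds weights, gradients,
    and optimizer state. Activations are deliberately left out."""
    total_bytes = num_params * bytes_per_param
    return math.ceil(total_bytes / hbm_bytes)


# One trillion parameters -> 16 TB of state under the assumption above.
print(min_gpus_for_state(1e12))  # 84
```

Even under this optimistic accounting, a trillion-parameter model needs dozens of GPUs before a single activation is stored.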
Speed scaling requires trade-offs
Even when models fit on one GPU, distributing computation accelerates training, though practitioners must calculate whether communication bandwidth costs outweigh the benefits of additional compute.
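The trade-off described above can be sketched with a back-of-the-envelope model. It assumes perfect compute scaling, a ring all-reduce in which each rank transfers 2(n-1)/n times the gradient size, no overlap of communication with compute, and no latency term. All numbers and function names are hypothetical, chosen only to show the shape of the calculation.

```python
def ring_all_reduce_seconds(num_bytes, num_gpus, bandwidth_bytes_per_s):
    """Bandwidth-only cost of a ring all-reduce: each rank sends and
    receives 2 * (n - 1) / n times the buffer size. Latency is ignored."""
    if num_gpus == 1:
        return 0.0
    return 2 * (num_gpus - 1) / num_gpus * num_bytes / bandwidth_bytes_per_s


def step_seconds(compute_seconds_one_gpu, grad_bytes, num_gpus, bandwidth_bytes_per_s):
    """Data-parallel step time: compute divided evenly across GPUs, plus a
    gradient all-reduce that is assumed not to overlap with compute."""
    compute = compute_seconds_one_gpu / num_gpus
    comm = ring_all_reduce_seconds(grad_bytes, num_gpus, bandwidth_bytes_per_s)
    return compute + comm


# Hypothetical workload: 1 s of compute on one GPU, 2 GB of gradients.
baseline = step_seconds(1.0, 2e9, 1, 400e9)
for label, bandwidth in [("fast link", 400e9), ("slow link", 10e9), ("very slow link", 1e9)]:
    t = step_seconds(1.0, 2e9, 8, bandwidth)
    print(f"{label}: {t:.3f} s per step, speedup {baseline / t:.2f}x")
```

With these made-up numbers, eight GPUs on the fast link come close to an 8x speedup, while on the very slow link the step takes longer than it did on a single GPU.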
Extended memory hierarchy
The data locality challenge extends beyond the memory hierarchy inside a single GPU (registers up to HBM) to a three-tier system: a single GPU, multiple GPUs in one node connected by NVLink, and multi-node clusters connected by InfiniBand or Ethernet.
🔄 Collective Communication Primitives 4 insights
Abstracting distributed operations
Collective operations provide high-level communication templates across ranks (devices), eliminating the need to manage complex point-to-point GPU transfers manually.
All-reduce for data parallelism
This operation sums gradients across all GPUs and replicates the result to every device, serving as the foundation for basic distributed training implementations.
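The semantics of all-reduce can be simulated on one machine without any distributed runtime. In this minimal sketch, each inner list stands in for the gradient buffer on one rank; the function name `all_reduce_sum` is hypothetical and the code models only the result, not how a real library moves the data.

```python
def all_reduce_sum(per_rank):
    """per_rank[r] is the gradient vector held by rank r.
    Returns a new list in which every rank holds the elementwise sum."""
    total = [sum(values) for values in zip(*per_rank)]
    # Every rank receives its own copy of the same reduced vector.
    return [list(total) for _ in per_rank]


grads = [
    [1.0, 2.0],  # rank 0
    [3.0, 4.0],  # rank 1
    [5.0, 6.0],  # rank 2
]
reduced = all_reduce_sum(grads)
print(reduced)  # [[9.0, 12.0], [9.0, 12.0], [9.0, 12.0]]
```

After the call, every rank can apply the same optimizer update, which keeps the model replicas identical.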
Decomposed operations for efficiency
Advanced strategies like FSDP split all-reduce into a reduce-scatter followed by an all-gather. Each rank can then do work on its own shard between the two steps, which allows more efficient management of optimizer states and parameter sharding.
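The decomposition can be checked with a small single-process simulation. This sketch assumes the buffer length divides evenly by the number of ranks; the function names are hypothetical and only the end result of each collective is modeled.

```python
def reduce_scatter_sum(per_rank):
    """Rank r ends up with only the r-th chunk of the elementwise sum."""
    n = len(per_rank)
    length = len(per_rank[0])
    assert length % n == 0, "sketch assumes evenly divisible buffers"
    chunk = length // n
    total = [sum(values) for values in zip(*per_rank)]
    return [total[r * chunk:(r + 1) * chunk] for r in range(n)]


def all_gather(shards):
    """Every rank ends up with the concatenation of all shards, in rank order."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


grads = [
    [1.0, 2.0, 3.0, 4.0],      # rank 0
    [10.0, 20.0, 30.0, 40.0],  # rank 1
]
shards = reduce_scatter_sum(grads)
print(shards)  # [[11.0, 22.0], [33.0, 44.0]]

# Between the two steps, each rank could update only its own shard.
full = all_gather(shards)
print(full)  # [[11.0, 22.0, 33.0, 44.0], [11.0, 22.0, 33.0, 44.0]]
```

Running the two steps back to back gives the same result as a single all-reduce, but the intermediate state is sharded, so each rank only needs to hold its own slice at that point.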
All-to-all for dynamic routing
This general collective lets every rank send a different chunk to every other rank. It enables Mixture-of-Experts training by routing activations between ranks based on data-dependent decisions, and it behaves like a generalized matrix transpose in which the chunks may be unevenly sized.
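The transpose analogy can be made concrete with a small simulation. In this sketch, `send[src][dst]` is the list of items that rank `src` routes to rank `dst`; the token labels and the routing pattern are made up for illustration, and `all_to_all` is a hypothetical helper that models only the result of the exchange.

```python
def all_to_all(send):
    """send[src][dst] holds what rank src sends to rank dst.
    Returns recv with recv[dst][src] == send[src][dst], i.e. the
    transpose of the chunk grid. Chunks may differ in length."""
    n = len(send)
    return [[send[src][dst] for src in range(n)] for dst in range(n)]


# Hypothetical routing: rank 0 keeps one token and sends two to rank 1,
# while rank 1 sends nothing to rank 0 and keeps one token.
send = [
    [["a0"], ["a1", "a2"]],  # from rank 0
    [[], ["b0"]],            # from rank 1
]
recv = all_to_all(send)
print(recv)  # [[['a0'], []], [['a1', 'a2'], ['b0']]]
```

Rank 1 ends up with three tokens and rank 0 with one, which is the kind of load imbalance that data-dependent expert routing can produce.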
🔌 Hardware Topology and Bandwidth 3 insights
Intra-node interconnects
Modern single-node systems utilize NVLink and NVSwitch for high-bandwidth GPU communication, while PCIe represents a legacy bottleneck unsuitable for serious distributed training.
Inter-node networking
Multi-node clusters rely on InfiniBand or standard Ethernet to connect servers, creating significant latency penalties that make minimizing cross-node data transfer critical for performance.
The data movement bottleneck
Effective parallelism requires orchestrating computation to keep data close to compute, treating remote GPU memory as the slowest tier in the hierarchy and avoiding unnecessary shuffles.
Bottom Line
Master the collective communication primitives, specifically all-reduce, all-gather, and reduce-scatter: they are the building blocks for implementing efficient data, tensor, and pipeline parallelism across any GPU cluster.