Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 7: Parallelism

Podcasts | April 28, 2026 | 213 views | 1:21:03

TL;DR

This lecture introduces distributed training for large language models, explaining how to scale beyond single-GPU memory limits using collective communication primitives (all-reduce, all-gather, reduce-scatter) across hardware topologies ranging from NVLink-connected single nodes to InfiniBand-linked multi-node clusters.

🚀 The Multi-GPU Scaling Imperative 3 insights

Memory constraints drive distribution

Training trillion-parameter models requires multiple GPUs because model weights, activations, and optimizer states exceed single-GPU HBM capacity (e.g., 192GB on NVIDIA B200).
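As a rough back-of-the-envelope check (the model size and per-parameter byte counts below are illustrative assumptions, not figures from the lecture), the weight and optimizer state for mixed-precision Adam training alone can exceed a single accelerator's HBM many times over:

```python
# Illustrative memory estimate for mixed-precision Adam training.
# Per parameter: 2 B bf16 weights + 2 B bf16 grads + 4 B fp32 master weights
# + 4 B + 4 B Adam moments = 16 B, before counting any activations.
params = 70e9                        # hypothetical 70B-parameter dense model
bytes_per_param = 2 + 2 + 4 + 4 + 4
state_gb = params * bytes_per_param / 1e9
print(f"weight + optimizer state: ~{state_gb:.0f} GB")          # ~1120 GB
print(f"B200 HBM is 192 GB -> at least {state_gb / 192:.0f} GPUs for state alone")
```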

Speed scaling requires trade-offs

Even when models fit on one GPU, distributing computation accelerates training, though practitioners must calculate whether communication bandwidth costs outweigh the benefits of additional compute.
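A quick estimate of that trade-off might look like the following; the model size, GPU count, and bandwidth are assumptions for illustration, and the 2(N-1)/N factor is the per-GPU traffic of a ring all-reduce:

```python
# Illustrative compute-vs-communication check for synchronous data parallelism.
grad_bytes = 70e9 * 2                 # hypothetical 70B-parameter model, bf16 gradients
n_gpus = 8
link_bw = 900e9                       # assumed ~900 GB/s per-GPU NVLink bandwidth
ring_traffic = 2 * (n_gpus - 1) / n_gpus * grad_bytes
print(f"per-step all-reduce time: ~{ring_traffic / link_bw:.2f} s")
# Scaling pays off only if each rank's forward/backward compute takes well
# longer than this; otherwise communication dominates the step time.
```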

Extended memory hierarchy

The data-locality challenge extends beyond the familiar single-GPU register-to-HBM hierarchy to a three-tier system: a single GPU, NVLink-connected multi-GPU nodes, and InfiniBand- or Ethernet-connected multi-node clusters.

🔄 Collective Communication Primitives 4 insights

Abstracting distributed operations

Collective operations provide high-level communication templates across ranks (devices), eliminating the need to manage complex point-to-point GPU transfers manually.
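As a minimal sketch of what a "rank" looks like in practice (my own example using torch.distributed, not code from the lecture), each process binds to one GPU, joins a process group, and then issues the same collective on its local tensor:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int):
    # One process per GPU; all processes join the same process group.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Every rank contributes one value; all-gather leaves the full list on every rank.
    mine = torch.tensor([float(rank)], device="cuda")
    gathered = [torch.zeros_like(mine) for _ in range(world_size)]
    dist.all_gather(gathered, mine)
    print(f"rank {rank} sees {[t.item() for t in gathered]}")

    dist.destroy_process_group()

if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()
    mp.spawn(worker, args=(n_gpus,), nprocs=n_gpus)
```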

All-reduce for data parallelism

This operation sums gradients across all GPUs and replicates the result to every device, serving as the foundation for basic distributed training implementations.
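A sketch of that gradient-averaging step, assuming a process group has already been initialized as in the previous example (`average_gradients` is an illustrative helper name, not the lecture's code):

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module):
    """Naive synchronous data parallelism: after backward(), average grads in place."""
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            # Sum this gradient across all ranks, then divide to get the mean.
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
    # Every rank now holds identical averaged gradients and can take the same optimizer step.
```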

Decomposed operations for efficiency

Advanced strategies such as FSDP decompose all-reduce into a reduce-scatter followed by an all-gather, letting each rank update only its own shard of the parameters and optimizer state between the two halves instead of replicating everything everywhere.
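An illustrative decomposition (a sketch of the idea, not PyTorch's actual FSDP internals; it assumes an initialized process group, a recent PyTorch with `reduce_scatter_tensor`/`all_gather_into_tensor`, and a gradient length divisible by the world size):

```python
import torch
import torch.distributed as dist

def sharded_grad_sync(flat_grad: torch.Tensor) -> torch.Tensor:
    """Same result as an all-reduce, but exposes a sharded midpoint for the optimizer."""
    world_size = dist.get_world_size()
    shard = torch.empty(flat_grad.numel() // world_size,
                        dtype=flat_grad.dtype, device=flat_grad.device)

    # Step 1: reduce-scatter -- each rank receives the summed values for its shard only.
    dist.reduce_scatter_tensor(shard, flat_grad, op=dist.ReduceOp.SUM)

    # ...each rank would update just its own shard of parameters / optimizer state here...

    # Step 2: all-gather -- reassemble the fully reduced tensor on every rank.
    full = torch.empty_like(flat_grad)
    dist.all_gather_into_tensor(full, shard)
    return full
```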

All-to-all for dynamic routing

This general collective enables Mixture-of-Experts training by routing activations between ranks based on data-dependent gating decisions, effectively transposing a (source rank, destination rank) grid of token buffers whose sizes may be unbalanced.
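A minimal sketch of that dispatch with torch.distributed's `all_to_all_single` (illustrative; it assumes an initialized process group and equal-sized chunks per destination, whereas real MoE routing passes variable split sizes):

```python
import torch
import torch.distributed as dist

def dispatch_tokens(tokens_by_dest: torch.Tensor) -> torch.Tensor:
    """tokens_by_dest: [world_size, tokens_per_dest, hidden]; chunk i is bound for rank i."""
    received = torch.empty_like(tokens_by_dest)
    # Chunk i of this rank's input goes to rank i, and chunk `rank` of every other
    # rank's input arrives here: a transpose over the (source, destination) grid.
    dist.all_to_all_single(received, tokens_by_dest)
    return received  # received[i] holds the tokens that rank i routed to this rank
```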

🔌 Hardware Topology and Bandwidth 3 insights

Intra-node interconnects

Modern single-node systems use NVLink and NVSwitch for high-bandwidth GPU-to-GPU communication; routing that traffic over PCIe instead is far slower and quickly becomes the bottleneck for serious distributed training.

Inter-node networking

Multi-node clusters rely on InfiniBand or standard Ethernet to connect servers, which offers lower bandwidth and higher latency than intra-node links and makes minimizing cross-node data transfer critical for performance.

The data movement bottleneck

Effective parallelism requires orchestrating computation to keep data close to compute, treating remote GPU memory as the slowest tier in the hierarchy and avoiding unnecessary shuffles.

Bottom Line

Master collective communication primitives—specifically all-reduce, all-gather, and reduce-scatter—as they form the foundational building blocks for implementing efficient data, tensor, and pipeline parallelism across any GPU cluster.
