Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 2: PyTorch (einops)
TL;DR
This lecture covers resource accounting fundamentals for training large language models, including FLOPs calculations and memory estimation; explores numerical precision trade-offs from FP32 down to FP4; and introduces einops as a readable, named-dimension alternative to raw PyTorch tensor operations.
💻 Resource Accounting & Training Efficiency 3 insights
Training FLOPs calculation formula
Estimate total compute required using 6 × parameters × tokens: roughly 2 floating point operations per parameter per token for the forward pass and 4 for the backward pass, giving the total for a full training run.
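The rule of thumb above is simple enough to sketch directly (the model size and token count below are the lecture's running example, reused here for illustration):

```python
def training_flops(num_params: float, num_tokens: float) -> float:
    """Approximate total training FLOPs: ~2 FLOPs per parameter per token
    for the forward pass plus ~4 for the backward pass."""
    return 6 * num_params * num_tokens

# Example: a 70B-parameter model trained on 15T tokens.
flops = training_flops(70e9, 15e12)
# roughly 6.3e24 FLOPs
```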
Hardware utilization reality check
Training a 70B parameter model on 15 trillion tokens requires approximately 143 days on 1,024 H100 GPUs assuming 50% Model FLOPs Utilization (MFU).
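A back-of-the-envelope check of the 143-day figure, assuming the commonly cited H100 BF16 dense peak of ~989 TFLOP/s (the peak-throughput number is an assumption here, not stated in the summary):

```python
total_flops = 6 * 70e9 * 15e12      # ~6.3e24 FLOPs for the full training run
h100_peak = 989e12                  # H100 BF16 dense peak in FLOP/s (assumed spec)
mfu = 0.5                           # 50% Model FLOPs Utilization
n_gpus = 1024

effective_throughput = n_gpus * h100_peak * mfu   # FLOP/s actually sustained
seconds = total_flops / effective_throughput
days = seconds / 86400
# ~144 days, matching the lecture's ~143-day estimate
```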
Memory-constrained capacity limits
Eight H100s with 80GB of memory each can theoretically fit a model of approximately 53 billion parameters under AdamW, calculated from 12 bytes per parameter (weights, gradients, and two optimizer states) and ignoring activation memory, which reduces the practical limit further.
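One breakdown consistent with the 12-bytes-per-parameter figure is BF16 weights and gradients (2 bytes each) plus two FP32 AdamW moments (4 bytes each); the dtype split below is my reading of that figure, not something the summary states explicitly:

```python
bytes_per_param = {
    "weights (bf16)": 2,      # assumed dtype split; totals 12 B/param as in the lecture
    "gradients (bf16)": 2,
    "adamw_moment1 (fp32)": 4,
    "adamw_moment2 (fp32)": 4,
}
total_per_param = sum(bytes_per_param.values())   # 12 bytes

gpu_memory = 8 * 80e9                    # eight H100s, 80 GB each
max_params = gpu_memory / total_per_param
# ~53.3 billion parameters, before any activation memory
```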
🔢 Numerical Precision Trade-offs 3 insights
BF16 as the practical training standard
Brain Float 16 maintains the same dynamic range as FP32 with half the memory footprint, avoiding the underflow and overflow instability that plagues FP16 training.
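The dynamic-range claim follows directly from the bit layouts: BF16 keeps FP32's 8 exponent bits and gives up mantissa bits, while FP16 has only 5 exponent bits. A quick check from IEEE-style layouts:

```python
def max_normal(exp_bits: int, mantissa_bits: int) -> float:
    """Largest finite value of an IEEE-style float with the given layout."""
    bias = 2 ** (exp_bits - 1) - 1       # exponent bias; also the largest finite exponent
    return (2 - 2 ** -mantissa_bits) * 2.0 ** bias

fp32 = max_normal(exp_bits=8, mantissa_bits=23)   # ~3.40e38
bf16 = max_normal(exp_bits=8, mantissa_bits=7)    # ~3.39e38, same range as FP32
fp16 = max_normal(exp_bits=5, mantissa_bits=10)   # 65504: easy to overflow in training
```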
Mixed precision training strategy
Use BF16 for parameters, activations, and gradients while keeping optimizer states in FP32 for numerical stability, typically managed via PyTorch's automatic mixed precision (AMP) utilities.
Frontier quantization formats
FP8 and FP4 formats with block scaling enable extreme memory reduction, with Nemotron 3 Super demonstrating successful training at 4-bit precision using NVIDIA's Transformer Engine.
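The core idea of block scaling is to store one higher-precision scale per small block of values, so a handful of bits can still cover each block's local range. A toy sketch with a signed 4-bit integer grid (illustrative only; not NVIDIA's actual FP4 encoding):

```python
import numpy as np

def block_quantize(x: np.ndarray, block: int = 16):
    """Quantize blocks of `block` values to 4-bit ints sharing one FP32 scale each."""
    x = x.reshape(-1, block)
    scales = np.abs(x).max(axis=1, keepdims=True) / 7.0   # signed 4-bit range [-7, 7]
    scales = np.where(scales == 0, 1.0, scales)           # avoid divide-by-zero
    q = np.clip(np.round(x / scales), -7, 7).astype(np.int8)
    return q, scales

def block_dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q * scales).reshape(-1)

rng = np.random.default_rng(0)
x = rng.normal(size=64).astype(np.float32)
q, s = block_quantize(x)
x_hat = block_dequantize(q, s)
# per-block scales keep reconstruction error modest despite 4 bits per value
```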
🧮 Einops for Tensor Operations 3 insights
Named dimensions replace integer indices
Einops uses explicit dimension names instead of cryptic axis numbers like -1 or -2, eliminating transpose errors and making tensor shapes self-documenting.
Einstein summation simplified
Operations follow input-to-output notation where dimensions listed only on the input side are automatically summed, such as 'seq hidden, hidden seq2 -> seq seq2' for matrix multiplication.
Arbitrary batch dimension handling
The ellipsis (...) syntax accommodates any number of leading batch dimensions without explicit enumeration, crucial for language modeling with variable batch and sequence structures.
Bottom Line
Always perform back-of-the-envelope resource accounting before training, default to BF16 mixed precision for optimal efficiency, and use einops to write readable tensor operations that prevent shape errors.
The speaker argues that to solve persistent human problems in HCI, designers must move beyond building better tools and instead critically reimagine entire socio-technical ecosystems. Through examples in event planning, crowdsourcing, social connection, and education, he demonstrates how redesigning human practices—what he terms "critical technical practice"—can unlock values that pure technological advancement has failed to address.