Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 2: PyTorch (einops)

Stanford Online

Podcasts | April 14, 2026 | 18.6 thousand views | 1:17:25

TL;DR

This lecture covers resource accounting fundamentals for training large language models, including FLOPs calculations and memory estimation, explores numerical precision trade-offs from FP32 down to FP4, and introduces einops as a readable alternative to PyTorch tensor operations using named dimensions.

💻 Resource Accounting & Training Efficiency

Training FLOPs calculation formula

Estimate total compute required using 6 × parameters × tokens, which represents the floating point operations needed for a full training run.
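The factor of 6 is usually explained as roughly 2 FLOPs per parameter per token in the forward pass plus roughly 4 in the backward pass. A minimal sketch of that accounting follows; the function name `training_flops` is illustrative and not taken from the lecture, and the rule ignores attention and other non-matmul costs.

```python
def training_flops(num_params: float, num_tokens: float) -> float:
    """Approximate training compute with the 6 * N * D rule of thumb."""
    # Forward pass: about one multiply and one add per parameter per token.
    forward = 2 * num_params * num_tokens
    # Backward pass: gradients w.r.t. both activations and weights, about 2x forward.
    backward = 4 * num_params * num_tokens
    return forward + backward


# Example: a 70B-parameter model trained on 15 trillion tokens.
total = training_flops(70e9, 15e12)
print(f"{total:.2e} FLOPs")
```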

Hardware utilization reality check

Training a 70B-parameter model on 15 trillion tokens takes about 144 days on 1,024 H100 GPUs, assuming 50% Model FLOPs Utilization (MFU) against a dense BF16 peak of roughly 990 TFLOP/s per GPU.
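The estimate can be reproduced in a few lines. This is a sketch under stated assumptions: the per-GPU peak of 989.5 TFLOP/s (dense BF16 on an H100 SXM) is an assumed hardware figure, not a number quoted in this summary, and `training_days` is a hypothetical helper name.

```python
SECONDS_PER_DAY = 60 * 60 * 24

# Assumption: dense BF16 peak for one H100 SXM, in FLOP/s.
H100_DENSE_BF16_PEAK = 989.5e12


def training_days(num_params, num_tokens, num_gpus, peak_flops_per_gpu, mfu):
    """Wall-clock days for a run, using the 6 * N * D compute estimate."""
    total_flops = 6 * num_params * num_tokens
    # MFU scales the theoretical peak down to the throughput actually achieved.
    sustained_flops_per_sec = num_gpus * peak_flops_per_gpu * mfu
    return total_flops / sustained_flops_per_sec / SECONDS_PER_DAY


days = training_days(70e9, 15e12, 1024, H100_DENSE_BF16_PEAK, mfu=0.5)
print(f"{days:.1f} days")
```

Under these assumptions the result is a little under 144 days. Halving the MFU doubles the estimate, which is why utilization matters as much as GPU count.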

Memory-constrained capacity limits

Eight H100s with 80GB of memory each (640GB in total) can in theory hold the training state for about 53 billion parameters under AdamW, at 12 bytes per parameter for weights, gradients, and two optimizer states. Activations are not counted in this estimate.
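A short sketch of the arithmetic is below. The split of the 12 bytes (2 for BF16 weights, 2 for BF16 gradients, 4 each for two FP32 AdamW moments) is an assumption chosen to match the mixed-precision setup described in this summary; the names are illustrative.

```python
# Assumed per-parameter byte budget (not spelled out in the summary).
BYTES_PER_PARAM = {
    "weights_bf16": 2,
    "gradients_bf16": 2,
    "adamw_first_moment_fp32": 4,
    "adamw_second_moment_fp32": 4,
}


def max_trainable_params(num_gpus: int, bytes_per_gpu: int) -> int:
    """Upper bound on parameter count if memory held only the training state."""
    total_bytes = num_gpus * bytes_per_gpu
    return total_bytes // sum(BYTES_PER_PARAM.values())


limit = max_trainable_params(num_gpus=8, bytes_per_gpu=80 * 10**9)
print(f"{limit / 1e9:.1f}B parameters")
```

This is an upper bound: activations, temporary buffers, and fragmentation all reduce the real limit.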

🔢 Numerical Precision Trade-offs

BF16 as the practical training standard

Brain Float 16 maintains the same dynamic range as FP32 with half the memory footprint, avoiding the underflow and overflow instability that plagues FP16 training.
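The range claim follows from the bit layouts: BF16 keeps FP32's 8 exponent bits and gives up mantissa bits, while FP16 has only 5 exponent bits. The sketch below derives the largest finite value and smallest normal value from those layouts in plain Python; the helper name is illustrative.

```python
# (exponent bits, mantissa bits) for each IEEE-style format.
FORMATS = {"fp32": (8, 23), "bf16": (8, 7), "fp16": (5, 10)}


def float_format_stats(exponent_bits: int, mantissa_bits: int):
    """Return (largest finite value, smallest positive normal value)."""
    bias = 2 ** (exponent_bits - 1) - 1
    # The all-ones exponent code is reserved for inf/NaN, so the top usable code is 2^e - 2.
    max_exponent = (2 ** exponent_bits - 2) - bias
    largest = (2 - 2.0 ** -mantissa_bits) * 2.0 ** max_exponent
    smallest_normal = 2.0 ** (1 - bias)
    return largest, smallest_normal


for name, (e_bits, m_bits) in FORMATS.items():
    largest, smallest = float_format_stats(e_bits, m_bits)
    print(f"{name}: max={largest:.4e}  min_normal={smallest:.4e}")
```

FP16 tops out at 65504 and loses normal precision below about 6e-5, so large activations overflow and small gradients underflow. BF16 covers essentially the same range as FP32 with coarser precision.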

Mixed precision training strategy

Use BF16 for parameters, activations, and gradients while keeping optimizer states in FP32 for numerical stability, typically managed via PyTorch's automatic mixed precision (AMP) package, torch.amp.
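A minimal sketch of one common AMP setup follows, run on CPU so it needs no GPU. Note the difference from the description above: with `torch.autocast` the stored parameters stay in FP32 and are cast to BF16 per operation, so this shows the autocast variant rather than storing BF16 parameters. The tiny model and random data are placeholders.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(16, 4)  # parameters are created in FP32
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
x = torch.randn(8, 16)
target = torch.randn(8, 4)

# Inside autocast, eligible ops such as the linear layer's matmul run in BF16.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)

# Compute the loss in FP32, then take one optimizer step.
loss = nn.functional.mse_loss(out.float(), target)
loss.backward()
optimizer.step()

out_dtype = out.dtype                                        # activation dtype
param_dtype = model.weight.dtype                             # stored weight dtype
state_dtype = optimizer.state[model.weight]["exp_avg"].dtype  # AdamW moment dtype
print(out_dtype, param_dtype, state_dtype)
```

On a GPU the same pattern applies with `device_type="cuda"`. BF16 generally does not need the gradient scaling that FP16 training relies on.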

Frontier quantization formats

FP8 and FP4 formats with block scaling enable extreme memory reduction, with Nemotron 3 Super demonstrating successful training at 4-bit precision using NVIDIA's Transformer Engine.
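To make block scaling concrete, here is a toy sketch in plain Python. It is an illustration of the idea only, not NVIDIA's implementation or any production format: each block of values shares one scale factor, and each value is rounded to the nearest magnitude representable by a 4-bit E2M1 float. All function names are hypothetical.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
FP4_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(block):
    """Return (scale, codes) so that scale * code approximates each value."""
    peak = max(abs(v) for v in block)
    # Map the block's largest magnitude onto the largest FP4 level.
    scale = peak / FP4_LEVELS[-1] if peak > 0 else 1.0
    codes = []
    for v in block:
        magnitude = min(FP4_LEVELS, key=lambda level: abs(abs(v) / scale - level))
        codes.append(magnitude if v >= 0 else -magnitude)
    return scale, codes


def roundtrip(values, block_size):
    """Quantize then dequantize, using one scale per block of block_size values."""
    out = []
    for start in range(0, len(values), block_size):
        scale, codes = quantize_block(values[start:start + block_size])
        out.extend(scale * c for c in codes)
    return out


values = [0.1, -0.3, 0.6, 1.2, 0.001, 0.002, 0.003, 0.006]
per_block = roundtrip(values, block_size=4)  # one scale per 4 values
one_scale = roundtrip(values, block_size=8)  # a single scale for all 8 values
print(per_block)
print(one_scale)
```

With a single scale the four small values all round to zero, while per-block scales preserve them. That is the motivation for block scaling at very low bit widths.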

🧮 Einops for Tensor Operations

Named dimensions replace integer indices

Einops uses explicit dimension names instead of cryptic axis numbers like -1 or -2, eliminating transpose errors and making tensor shapes self-documenting.

Einstein summation simplified

Operations follow input-to-output notation where dimensions listed only on the input side are automatically summed, such as 'seq hidden, hidden seq2 -> seq seq2' for matrix multiplication.

Arbitrary batch dimension handling

The ellipsis (...) syntax accommodates any number of leading batch dimensions without explicit enumeration, crucial for language modeling with variable batch and sequence structures.

Bottom Line

Always perform back-of-the-envelope resource accounting before training, default to BF16 mixed precision for optimal efficiency, and use einops to write readable tensor operations that prevent shape errors.
