Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 11: Scaling Laws

Stanford Online

Podcasts | May 19, 2026 | 1.87 thousand views | 1:17:04

TL;DR

This lecture covers how scaling laws are applied in practice for language models, focusing on techniques from the MiniCPM paper for stabilizing training across scales: Maximal Update Parameterization (MUP), which keeps the optimal learning rate consistent across model sizes, and Warm-up Stable Decay (WSD) schedules, which let a training run be extended efficiently.

🧮 Maximal Update Parameterization (MUP) 3 insights

Stabilize learning rates across model scales

MUP keeps the optimal learning rate roughly constant (approximately 10^-2) regardless of model size. It does this by scaling embedding outputs, scaling residual branches by the inverse square root of the layer count, and scaling weight initializations by fan-out ratios.
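The three scalings can be sketched as a small helper. This is a minimal illustration, not the lecture's or the paper's code: the function name, the default constants (d_base, scale_emb, scale_depth, base_init_std), and the reading of "fan-out ratios" as the ratio of model width to a base width are all assumptions made for the example.

```python
import math


def mup_multipliers(d_model, n_layers, d_base=256, scale_emb=12.0,
                    scale_depth=1.4, base_init_std=0.1):
    """Return illustrative MUP-style scaling factors for one model size.

    All default constants are placeholders chosen for the example,
    not values stated in the lecture summary.
    """
    # Assumption: the width ratio compares this model to a small base model.
    width_ratio = d_model / d_base
    return {
        # Constant multiplier applied to the embedding output.
        "embedding_output_scale": scale_emb,
        # Each residual branch is scaled down as depth grows.
        "residual_branch_scale": scale_depth / math.sqrt(n_layers),
        # Matrix weights are initialized with a smaller std as width grows.
        "matrix_init_std": base_init_std / math.sqrt(width_ratio),
    }
```

Calling mup_multipliers(d_model=1024, n_layers=16) gives the factors for one rung of a scaling ladder. The point of the design is that only these multipliers change with size, while the tuned base learning rate stays the same.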

Implement per-parameter learning rate scaling

Unlike standard training, MUP requires setting different learning rates for specific tensor types such as embeddings, LM heads, and matrix weights rather than using a single global rate.
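One way to picture per-tensor learning rates is as parameter groups keyed by tensor type. The sketch below is hypothetical throughout: the parameter naming scheme ("embed", "lm_head"), the helper names, and the specific rule that only matrix weights have their rate divided by the width ratio are assumptions for illustration, not the recipe confirmed by the lecture.

```python
def classify_tensor(name, shape):
    """Bucket a parameter by name and rank (the naming scheme is hypothetical)."""
    if "embed" in name:
        return "embedding"
    if "lm_head" in name:
        return "lm_head"
    if len(shape) == 2:
        return "matrix"
    return "other"  # norms, biases, and anything else


def build_param_groups(named_shapes, base_lr, width_ratio):
    """Group parameter names by tensor type and attach a learning rate to each group."""
    # Assumption: only hidden matrix weights get a width-scaled learning rate.
    lr_by_kind = {
        "embedding": base_lr,
        "lm_head": base_lr,
        "matrix": base_lr / width_ratio,
        "other": base_lr,
    }
    groups = {}
    for name, shape in named_shapes.items():
        kind = classify_tensor(name, shape)
        group = groups.setdefault(
            kind, {"kind": kind, "lr": lr_by_kind[kind], "params": []}
        )
        group["params"].append(name)
    return list(groups.values())
```

The list-of-dicts result, with "params" and "lr" keys, mirrors the shape that optimizers with per-group options generally accept. In a real training script the "params" entries would hold the tensors themselves rather than their names.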

Enable efficient scaling ladders

Because the hyperparameters stay stable, researchers can tune small proxy models and extrapolate directly to target sizes across 5x gaps in parameter count, instead of brute-force tuning large models.

📊 Batch Size and Compute Optimization 2 insights

Optimal batch size scales with target loss

Following the critical batch size analysis of Kaplan et al., experiments show that the optimal batch size follows a power law in the loss target: the lower the target loss, the larger the optimal batch size.

Derive batch size from loss targets

Researchers can precisely set batch sizes using scaling curves that map specific loss targets to their corresponding optimal batch sizes, balancing compute efficiency against sample efficiency.
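A power-law relation between loss target and batch size can be fitted as a straight line in log-log space. The sketch below assumes the form B(L) = c * L^k with a negative exponent k; the function names are hypothetical, and any measurements passed in would come from your own sweeps rather than from the lecture.

```python
import math


def fit_power_law(losses, batch_sizes):
    """Least-squares fit of log(B) = log(c) + k * log(L); returns (c, k)."""
    xs = [math.log(loss) for loss in losses]
    ys = [math.log(b) for b in batch_sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Ordinary least-squares slope and intercept on the log-transformed data.
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), slope


def batch_size_for_loss(target_loss, coeff, exponent):
    """Read the fitted curve at a loss target (exponent is negative)."""
    return coeff * target_loss ** exponent
```

Given a handful of (loss, best batch size) pairs from small runs, fit_power_law returns the coefficient and exponent, and batch_size_for_loss then gives a batch size for a lower loss target. Extrapolating far beyond the fitted range should be treated with caution.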

⏱️ Warm-Up Stable Decay (WSD) Schedules 3 insights

Enable training extension without restart

The trapezoidal WSD schedule maintains a constant learning rate during a long stable phase comprising 80-90% of training, allowing researchers to rewind to any stable checkpoint and extend training rather than restarting from scratch.

Concentrate gains in rapid decay phase

The final 10-20% of training uses rapid decay to approximately 10% of the maximum learning rate, capturing significant loss improvements that match or exceed cosine schedule performance.
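The trapezoidal shape can be written as a step-to-learning-rate function. This is a minimal sketch under stated assumptions: the function name and parameters are hypothetical, the warm-up is linear, and the decay is linear down to 10% of the maximum (the summary does not specify the decay shape, and other shapes such as exponential are also used).

```python
def wsd_lr(step, total_steps, max_lr, warmup_steps,
           decay_frac=0.1, final_ratio=0.1):
    """Learning rate at `step` for a warm-up, stable, decay schedule."""
    decay_steps = round(total_steps * decay_frac)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Linear warm-up that reaches max_lr on the last warm-up step.
        return max_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable phase: constant learning rate.
        return max_lr
    # Decay phase: linear ramp from max_lr down to final_ratio * max_lr.
    progress = min((step - decay_start) / max(1, decay_steps), 1.0)
    return max_lr * (1.0 - (1.0 - final_ratio) * progress)
```

With total_steps=1000, warmup_steps=100 and the defaults, the rate is flat from step 99 through step 900 and then ramps down over the last 100 steps. Because the stable phase does not depend on total_steps, a checkpoint saved there remains valid if the run is later extended.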

Simplify Chinchilla scaling analyses

WSD avoids the quadratic cost of traditional Chinchilla IsoFLOP experiments: a single long run can serve multiple data-scaling analyses by rewinding to a stable checkpoint and re-running the decay.
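The saving can be illustrated by counting training steps for one model size evaluated at several data budgets. The sketch below is a back-of-the-envelope model with hypothetical function names: it assumes each separate run must be trained from scratch to its own budget, while the WSD approach shares one stable trunk and adds only a short decay branch per budget.

```python
def separate_run_cost(budgets):
    """Total steps when every data budget needs its own full run."""
    return sum(budgets)


def wsd_shared_run_cost(budgets, decay_frac=0.1):
    """Total steps when budgets share one stable trunk plus per-budget decays."""
    decay_steps = [round(b * decay_frac) for b in budgets]
    # The trunk only needs to reach the decay start of the longest budget.
    trunk = max(b - d for b, d in zip(budgets, decay_steps))
    # Each budget branches off a stable checkpoint and runs its own decay.
    return trunk + sum(decay_steps)
```

For evenly spaced budgets such as [10000, 20000, 30000, 40000], the separate-run total grows roughly with the square of the number of budgets, while the shared-trunk total grows roughly linearly.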

Bottom Line

Use Maximal Update Parameterization (MUP) to keep the optimal learning rate stable across model scales, and adopt Warm-up Stable Decay (WSD) schedules to extend training and run scaling law experiments without costly full retraining.
