Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 11: Scaling Laws

| Podcasts | May 19, 2026 | 1.87 Thousand views | 1:17:04

TL;DR

This lecture covers how scaling laws are applied in practice, focusing on the MiniCPM paper's techniques for keeping training stable across scales: Maximal Update Parameterization (MUP), which keeps the optimal learning rate consistent across model sizes, and Warm-up Stable Decay (WSD) schedules, which allow training to be continued efficiently.

🧮 Maximal Update Parameterization (MUP) 3 insights

Stabilize learning rates across model scales

MUP keeps the optimal learning rate roughly constant (approximately 10^-2) regardless of model size. It does this through three adjustments: applying a scale factor to the embedding outputs, scaling residual connections by the square root of the layer count, and scaling weight initializations by fan-out ratios.

Implement per-parameter learning rate scaling

Unlike standard training, MUP requires setting different learning rates for specific tensor types such as embeddings, LM heads, and matrix weights rather than using a single global rate.

Enable efficient scaling ladders

Stable hyperparameters allow researchers to train small proxy models and extrapolate directly to target sizes with 5x parameter gaps, eliminating the need to brute-force tune large models.
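
The per-tensor scaling described in this section can be sketched as a small helper that turns a model shape into scale factors. This is a minimal sketch, not the paper's implementation: the names `MupConfig` and `mup_scales` are hypothetical, the default constants are illustrative placeholders, and the specific rules (matrix learning rate divided by the width ratio, initialization standard deviation divided by its square root, residual branches divided by the square root of the layer count) are assumptions about one common way to apply MUP-style scaling.

```python
import math
from dataclasses import dataclass


@dataclass
class MupConfig:
    """Hypothetical container; all defaults are illustrative placeholders."""
    d_model: int                 # width of the model being trained
    d_base: int = 256            # width of the small proxy model that was tuned
    num_layers: int = 12
    base_lr: float = 1e-2        # learning rate tuned on the proxy model
    base_init_std: float = 0.1   # init std tuned on the proxy model
    scale_emb: float = 12.0      # constant multiplier on embedding outputs
    scale_depth: float = 1.4     # numerator of the residual-branch multiplier


def mup_scales(cfg: MupConfig) -> dict:
    """Return per-tensor-type scale factors under the assumed rules."""
    width_ratio = cfg.d_model / cfg.d_base
    return {
        # Embedding outputs get a constant multiplier.
        "embedding_output_multiplier": cfg.scale_emb,
        # Each residual branch is scaled down by sqrt(number of layers).
        "residual_branch_multiplier": cfg.scale_depth / math.sqrt(cfg.num_layers),
        # Matrix weights: init std shrinks with sqrt(width ratio).
        "matrix_init_std": cfg.base_init_std / math.sqrt(width_ratio),
        # Matrix weights: learning rate shrinks linearly with the width ratio.
        "matrix_lr": cfg.base_lr / width_ratio,
        # Embeddings keep the base learning rate.
        "embedding_lr": cfg.base_lr,
        # LM head logits are scaled down by the width ratio.
        "lm_head_output_multiplier": 1.0 / width_ratio,
    }


if __name__ == "__main__":
    for name, value in mup_scales(MupConfig(d_model=1024)).items():
        print(f"{name}: {value:.5f}")
```

In a training script, values such as `matrix_lr` and `embedding_lr` would typically be passed to the optimizer as separate parameter groups, which is what makes the learning rate per-tensor-type rather than global.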

📊 Batch Size and Compute Optimization 2 insights

Optimal batch size scales with target loss

Following Kaplan's critical batch size analysis, experiments show that the optimal batch size follows a power law in the target loss: the lower the loss you want to reach, the larger the batch size should be.

Derive batch size from loss targets

Researchers can set batch sizes from fitted scaling curves that map each loss target to its optimal batch size, balancing compute efficiency against sample efficiency.
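
As a rough illustration of how such a curve might be fitted and used, the sketch below assumes the power law has the form batch = a * loss^(-alpha) and fits it by least squares in log-log space. The function names and the synthetic data points are hypothetical; the coefficients are not taken from the lecture or the paper.

```python
import math


def fit_power_law(losses, batch_sizes):
    """Fit batch = a * loss**(-alpha) by ordinary least squares on logs.

    Returns the pair (a, alpha).
    """
    xs = [math.log(loss) for loss in losses]
    ys = [math.log(batch) for batch in batch_sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # slope in log-log space is -alpha
    intercept = mean_y - slope * mean_x    # intercept is log(a)
    return math.exp(intercept), -slope


def optimal_batch_size(target_loss, a, alpha):
    """Evaluate the fitted power law at a target loss."""
    return a * target_loss ** (-alpha)


if __name__ == "__main__":
    # Synthetic (loss, best batch size) pairs, made up for illustration only.
    observed_losses = [4.0, 3.5, 3.0, 2.5]
    observed_batches = [1000.0 / loss ** 2 for loss in observed_losses]
    a, alpha = fit_power_law(observed_losses, observed_batches)
    print(f"a={a:.1f}, alpha={alpha:.2f}")
    print(f"batch size for loss 2.0: {optimal_batch_size(2.0, a, alpha):.1f}")
```

Fitting in log-log space turns the power law into a straight line, so the exponent can be read off as the (negated) slope.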

⏱️ Warm-Up Stable Decay (WSD) Schedules 3 insights

Enable training extension without restart

The trapezoidal WSD schedule maintains a constant learning rate during a long stable phase comprising 80-90% of training, allowing researchers to rewind to any stable checkpoint and extend training rather than restarting from scratch.

Concentrate gains in rapid decay phase

The final 10-20% of training uses rapid decay to approximately 10% of the maximum learning rate, capturing significant loss improvements that match or exceed cosine schedule performance.

Simplify Chinchilla scaling analyses

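WSD avoids the quadratic number of full training runs (one per pairing of model size and token budget) that traditional Chinchilla IsoFLOP experiments require. A single long run per model size can serve multiple data scaling analyses: rewind to a stable-phase checkpoint and re-run only the decay.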
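
The schedule shape and the savings from checkpoint rewinding can be sketched as follows. This is a minimal sketch under stated assumptions: the decay here is linear (other decay shapes are possible), warm-up tokens are ignored in the cost accounting, and the function names `wsd_lr`, `tokens_cosine`, and `tokens_wsd` are hypothetical.

```python
def wsd_lr(step, total_steps, max_lr, warmup_steps,
           decay_frac=0.1, min_lr_ratio=0.1):
    """Learning rate at `step` for a warm-up / stable / decay schedule.

    Assumes linear warm-up and linear decay down to min_lr_ratio * max_lr.
    """
    decay_steps = round(total_steps * decay_frac)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Linear warm-up from max_lr / warmup_steps up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable phase: constant learning rate, so any checkpoint here
        # can be used as a branch point for a later decay.
        return max_lr
    # Decay phase: interpolate from max_lr down to the floor.
    progress = min((step - decay_start) / max(1, decay_steps), 1.0)
    return max_lr * (1.0 - (1.0 - min_lr_ratio) * progress)


def tokens_cosine(token_budgets):
    """Tokens trained when every budget needs its own full run."""
    return sum(token_budgets)


def tokens_wsd(token_budgets, decay_frac=0.1):
    """Tokens trained with one shared stable run plus a decay per budget."""
    shared_stable = max(token_budgets) * (1.0 - decay_frac)
    decay_branches = sum(budget * decay_frac for budget in token_budgets)
    return shared_stable + decay_branches


if __name__ == "__main__":
    for step in (0, 50, 500, 950, 1000):
        print(step, wsd_lr(step, total_steps=1000, max_lr=1e-2, warmup_steps=100))
    budgets = [1, 2, 4, 8]  # token budgets in arbitrary units, one model size
    print("separate full runs:", tokens_cosine(budgets))
    print("shared run + re-decay:", tokens_wsd(budgets))
```

The accounting functions cover a single model size; repeating them for each model size gives the total for a full scaling-law sweep.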

Bottom Line

Use Maximal Update Parameterization (MUP) to keep the optimal learning rate fixed across model scales, and adopt Warm-up Stable Decay (WSD) schedules so that training can be extended and scaling law experiments can be run without costly full retraining.
