Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 11: Scaling Laws
TL;DR
This lecture covers how scaling laws are applied in practice, focusing on the MiniCPM paper's techniques for stabilizing training across scales: Maximal Update Parameterization (MUP), which keeps the optimal learning rate consistent across model sizes, and Warm-up Stable Decay (WSD) schedules, which let training be extended from a checkpoint instead of restarted.
🧮 Maximal Update Parameterization (MUP) 3 insights
Stabilize learning rates across model scales
MUP keeps the optimal learning rate roughly constant (approximately 10^-2) regardless of model size. It does this by multiplying embedding outputs by a constant, scaling residual branches down by the square root of the layer count, and scaling weight initializations by the ratio of model width to a base width.
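The three scaling rules above can be sketched as a small helper. This is a minimal illustration, not the paper's code: the function name, the base width `d_base`, and the default constants (`scale_emb`, `scale_depth`, `init_std_base`) are all assumptions chosen for the example, and it assumes the relevant ratio is model width over base width.

```python
import math

def mup_scales(d_model, n_layers, d_base=256,
               init_std_base=0.1, scale_emb=12.0, scale_depth=1.4):
    """Return illustrative MUP-style multipliers (all defaults are hypothetical)."""
    width_ratio = d_model / d_base
    return {
        # Embedding outputs are multiplied by a fixed constant.
        "embedding_output_multiplier": scale_emb,
        # Each block's output is scaled down by sqrt(layer count)
        # before it is added to the residual stream.
        "residual_branch_multiplier": scale_depth / math.sqrt(n_layers),
        # Matrix init std shrinks with the square root of the width ratio.
        "matrix_init_std": init_std_base / math.sqrt(width_ratio),
    }

scales = mup_scales(d_model=1024, n_layers=16)
```

With these hypothetical defaults, a model 4x wider than the base gets half the base initialization standard deviation, while the embedding multiplier stays fixed.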
Implement per-parameter learning rate scaling
Unlike standard training, MUP requires setting different learning rates for specific tensor types such as embeddings, LM heads, and matrix weights rather than using a single global rate.
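A per-tensor learning rate assignment might look like the sketch below. The grouping rule is an assumption for illustration: hidden matrices get the base rate divided by the width ratio, while embeddings, the LM head, and non-matrix parameters keep the base rate. The exact assignment per tensor type depends on the MUP variant in use, and the parameter names are hypothetical.

```python
def mup_learning_rates(param_shapes, base_lr=1e-2, d_model=1024, d_base=256):
    """Map each parameter name to a learning rate (illustrative grouping)."""
    width_ratio = d_model / d_base
    rates = {}
    for name, shape in param_shapes.items():
        is_hidden_matrix = (
            len(shape) == 2
            and "embed" not in name
            and "lm_head" not in name
        )
        # Assumption: only hidden matrices have their rate scaled by width.
        rates[name] = base_lr / width_ratio if is_hidden_matrix else base_lr
    return rates

# Hypothetical parameter names and shapes for a small transformer.
shapes = {
    "embed.weight": (32000, 1024),
    "layers.0.attn.wq": (1024, 1024),
    "layers.0.norm.gain": (1024,),
    "lm_head.weight": (1024, 32000),
}
rates = mup_learning_rates(shapes)
```

In a training framework, the resulting mapping would typically be turned into optimizer parameter groups, one group per distinct rate.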
Enable efficient scaling ladders
Stable hyperparameters allow researchers to train small proxy models and extrapolate directly to target sizes with 5x parameter gaps, eliminating the need to brute-force tune large models.
📊 Batch Size and Compute Optimization 2 insights
Optimal batch size scales with target loss
Following Kaplan's critical batch size analysis, experiments show that the optimal batch size follows a power law in the target loss: the lower the loss you aim for, the larger the batch size should be.
Derive batch size from loss targets
Researchers can precisely set batch sizes using scaling curves that map specific loss targets to their corresponding optimal batch sizes, balancing compute efficiency against sample efficiency.
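As a rough sketch of how such a curve could be used, the code below fits batch = coeff * loss^(-exponent) by least squares in log-log space and then reads off a batch size for a new loss target. The function names and the data points are made up for illustration; they are not measurements from the lecture or the paper.

```python
import math

def fit_power_law(losses, batch_sizes):
    """Fit batch = coeff * loss**(-exponent) via linear regression on logs."""
    xs = [math.log(loss) for loss in losses]
    ys = [math.log(batch) for batch in batch_sizes]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

def predict_batch_size(target_loss, coeff, exponent):
    """Optimal batch size implied by the fitted power law."""
    return coeff * target_loss ** (-exponent)

# Synthetic (loss, optimal batch size) pairs, invented for this example.
losses = [4.0, 3.0, 2.0]
batches = [1e6 / loss ** 2 for loss in losses]
coeff, exponent = fit_power_law(losses, batches)
```

Because the exponent is positive, a lower loss target always maps to a larger predicted batch size.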
⏱️ Warm-Up Stable Decay (WSD) Schedules 3 insights
Enable training extension without restart
The trapezoidal WSD schedule maintains a constant learning rate during a long stable phase comprising 80-90% of training, allowing researchers to rewind to any stable checkpoint and extend training rather than restarting from scratch.
Concentrate gains in rapid decay phase
The final 10-20% of training uses rapid decay to approximately 10% of the maximum learning rate, capturing significant loss improvements that match or exceed cosine schedule performance.
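The trapezoidal shape described above can be written as a simple step-to-learning-rate function. This is a minimal sketch: the linear decay shape, the function name, and the default fractions are assumptions for illustration, and other decay shapes (for example exponential) are also possible.

```python
def wsd_learning_rate(step, total_steps, max_lr, warmup_steps,
                      decay_fraction=0.1, final_lr_ratio=0.1):
    """Warm-up, constant stable phase, then decay to final_lr_ratio * max_lr."""
    decay_steps = round(total_steps * decay_fraction)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Linear warm-up from near zero to max_lr.
        return max_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable phase: any checkpoint here can be resumed or decayed later.
        return max_lr
    # Decay phase (linear here by assumption).
    progress = min((step - decay_start) / max(decay_steps, 1), 1.0)
    min_lr = max_lr * final_lr_ratio
    return max_lr + (min_lr - max_lr) * progress

schedule = [wsd_learning_rate(s, 1000, 1e-2, 100) for s in range(1001)]
```

Extending training then amounts to resuming from a stable-phase checkpoint with a larger `total_steps`, so the decay simply starts later.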
Simplify Chinchilla scaling analyses
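WSD avoids the quadratic cost of traditional Chinchilla IsoFLOP experiments, where a cosine schedule tied to the total step count forces a separate from-scratch run for every data budget. With WSD, a single long run can serve multiple data scaling analyses through checkpoint rewinding and re-decay.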
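The saving can be illustrated with simple token accounting. The sketch assumes every cosine run restarts from scratch, while WSD shares one stable trunk and adds only a short decay branch per data budget. The function names, the budgets, and the 10% decay fraction are hypothetical.

```python
def cosine_total_tokens(budgets):
    """Each data budget needs its own full run from scratch."""
    return sum(budgets)

def wsd_total_tokens(budgets, decay_fraction=0.1):
    """One shared stable trunk plus one decay branch per budget."""
    stable_trunk = (1 - decay_fraction) * max(budgets)
    decay_branches = decay_fraction * sum(budgets)
    return stable_trunk + decay_branches

# Hypothetical data budgets in billions of tokens.
budgets = [10, 20, 30, 40]
cosine_cost = cosine_total_tokens(budgets)  # 100
wsd_cost = wsd_total_tokens(budgets)        # 0.9 * 40 + 0.1 * 100 = 46
```

With n evenly spaced budgets, the cosine total grows roughly with n squared, while the WSD total is dominated by the single longest run.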
Bottom Line
Implement Maximal Update Parameterization (MUP) to keep your optimal learning rate stable across model scales, and adopt Warm-up Stable Decay (WSD) schedules to enable efficient training extension and scaling law experiments without costly full retraining.