Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 9: Scaling Laws
TL;DR
This lecture introduces scaling laws as predictive power-law relationships that enable practitioners to optimize language model training on small budgets and confidently extrapolate performance to million-dollar large-scale runs, while tracing these empirical patterns back to classical machine learning theory and sample complexity research from the 1990s.
📜 Historical Foundations and Theory
Power law scaling predates modern language models
Research from Bell Labs (1993) and Hestness et al. (2017) established that model error decays polynomially with dataset size across diverse domains including speech recognition and machine translation.
Scaling laws parallel classical generalization theory
These laws function as empirical measurements of sample complexity, directly paralleling classical generalization bounds that theoretically predict performance changes as training data grows.
Data scaling advantages documented in early NLP
Banko and Brill demonstrated that increasing data often outperforms algorithmic improvements, while Kachina et al. (2012) confirmed that power laws fit machine translation BLEU score improvements.
⚙️ Engineering Strategy and Risk Mitigation
Optimize hyperparameters on small-scale experiments first
Practitioners should tune all architectural choices and hyperparameters on small, cheap experiments rather than risking millions of dollars on direct large-scale training runs.
Extrapolate small model performance to large scale
Power-law relationships appear as linear trends in log-log plots, providing robust mathematical rules to extrapolate small-model behavior to large-scale compute and parameter regimes.
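As a minimal sketch of this extrapolation idea: fit a straight line to loss versus compute in log-log space and read off the power-law exponent. All numbers below are illustrative, not measurements from the lecture.

```python
import numpy as np

# Hypothetical losses measured at small compute budgets (FLOPs),
# generated to follow a clean power law L(C) = A * C^(-alpha).
compute = np.array([1e17, 1e18, 1e19, 1e20])
loss = np.array([3.60, 3.00, 2.50, 2.08])

# A power law is a straight line in log-log space:
# log L = log A - alpha * log C.
slope, intercept = np.polyfit(np.log10(compute), np.log10(loss), 1)
alpha, A = -slope, 10.0 ** intercept

# Extrapolate four orders of magnitude past the largest measured run.
predicted_loss = A * (1e24) ** (-alpha)
print(f"alpha ~ {alpha:.3f}, predicted loss at 1e24 FLOPs ~ {predicted_loss:.2f}")
```

The robustness claim in the insight above is exactly this: because the fit is linear in log-log coordinates, two or three decades of cheap runs pin down the slope used for the expensive extrapolation.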
Scaling laws enable confident frontier model planning
This paradigm serves as an engineering "way of life" that allows confident planning of frontier model training through predictable regularities between resources and loss.
📉 Mathematical Forms and Empirical Patterns
Test loss follows power law decay patterns
When plotted against dataset size, compute, or parameters, test loss follows power laws appearing as linear decay in log-log space far from the irreducible error floor.
Different metrics show distinct scaling law forms
While pre-training loss follows power laws, downstream benchmark accuracy typically follows sigmoid curves, and capability improvements show linear trends against deployment dates.
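The sigmoid shape for downstream accuracy can be sketched as follows. The floor, ceiling, and midpoint parameters here are hypothetical stand-ins, not fitted values from the lecture.

```python
import numpy as np

def sigmoid_accuracy(log10_compute, floor=0.25, ceiling=0.95,
                     midpoint=21.0, steepness=1.5):
    """Illustrative sigmoid of benchmark accuracy vs. log10(compute).

    floor   -- chance-level accuracy before the capability appears
    ceiling -- saturation accuracy once the benchmark is solved
    """
    return floor + (ceiling - floor) / (
        1.0 + np.exp(-steepness * (log10_compute - midpoint)))

# Accuracy sits near chance at small scale, rises sharply through the
# midpoint, then saturates -- so smooth loss scaling can still look like
# a sudden capability jump on a thresholded benchmark.
for c in [18, 20, 21, 22, 24]:
    print(c, round(float(sigmoid_accuracy(c)), 3))
```

This is why loss, not benchmark accuracy, is the quantity that extrapolates cleanly with a power law.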
Power laws break down near irreducible error
These polynomial trends hold reliably when models operate far from the asymptotic entropy limit but break down as performance approaches the theoretical noise ceiling.
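The breakdown near the noise ceiling is visible in the common functional form L(D) = E + A * D^(-alpha), where E is the irreducible error. A small sketch with illustrative parameters:

```python
import numpy as np

# Hypothetical power law with an irreducible error floor E:
# L(D) = E + A * D^(-alpha). All parameter values are illustrative.
E, A, alpha = 1.7, 200.0, 0.3

D = np.logspace(6, 14, 9)           # dataset sizes (tokens), one per decade
L = E + A * D ** (-alpha)

# Local log-log slope of the raw loss: clearly negative while the
# power-law term dominates, flattening toward 0 as L approaches E.
slopes = np.diff(np.log10(L)) / np.diff(np.log10(D))
print("small-D slope:", round(float(slopes[0]), 3))
print("large-D slope:", round(float(slopes[-1]), 3))
```

In practice this means a straight-line fit in log-log space is only trustworthy while measured losses are well above the floor; near the entropy limit the curve bends and naive extrapolation overestimates the gains from more data.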
Bottom Line
Use small-scale experiments to fit power-law curves for loss versus compute, data, and parameters, then extrapolate these relationships to configure hyperparameters for large-scale training runs rather than optimizing directly at full scale.
More from Stanford Online
Stanford CS153 Frontier Systems | Anjney Midha from AMP PBC on Frontier Systems
Anjney Midha frames the current AI landscape as 'the great transition,' where industrial-scale model training meets a complete restructuring of the eight-layer infrastructure stack, while arguing that relationships and obsessions remain the ultimate asymmetric advantages for founders against entrenched incumbents.
Stanford Robotics Seminar ENGR319 | Spring 2026 | Ingredients for Long-Horizon Robot Autonomy
A researcher from Physical Intelligence argues that while robots now excel at short, dexterous tasks, true utility requires long-horizon autonomy for complex jobs like cleaning apartments or assembling server racks. The talk introduces MEM (Multiscale Embodied Memory), a system that uses compressed visual and linguistic memory to solve the latency and distribution shift problems that have historically prevented robots from tracking progress over extended time periods.
Stanford CS547 HCI Seminar | Spring 2026 | Observing the User Experience in 2026
Mike Kuniavsky and Elizabeth Goodman examine how AI has revolutionized UX research by automating traditional methods while simultaneously creating an 'authenticity crisis' through synthetic users and widespread fraud, arguing that maintaining 'ground truth' through direct human contact remains essential for valid insights and organizational influence.
Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 8: Parallelism
This lecture details how to scale language model training across massive clusters using 4D parallelism, contrasting TPU and GPU networking architectures while addressing the critical memory bottlenecks—particularly optimizer states—that dominate training costs at scale.