Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 9: Scaling Laws

| Podcasts | April 30, 2026 | 6.17 Thousand views | 1:17:57

TL;DR

This lecture introduces scaling laws as predictive power-law relationships. They let practitioners optimize language model training on small budgets and extrapolate with confidence to large-scale runs that cost millions of dollars. The lecture also traces these empirical patterns back to classical machine learning theory and sample complexity research from the 1990s.

📜 Historical Foundations and Theory 3 insights

Power law scaling predates modern language models

Research from Bell Labs (1993) and Hestness et al. (2017) established that model error decays polynomially with dataset size across diverse domains including speech recognition and machine translation.

Scaling laws parallel classical generalization theory

These laws function as empirical measurements of sample complexity, directly paralleling classical generalization bounds that theoretically predict performance changes as training data grows.
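As a textbook illustration of this parallel (not an example taken from the summary itself), consider estimating the mean of a distribution with variance \sigma^2 from n independent samples using the sample mean \hat{\mu}. The symbols n, \sigma, \mu, and \hat{\mu} are introduced here for the sketch:

```latex
\mathbb{E}\left[(\hat{\mu} - \mu)^2\right] = \frac{\sigma^2}{n}
\quad\Longrightarrow\quad
\log \mathbb{E}\left[(\hat{\mu} - \mu)^2\right] = -\log n + 2\log \sigma
```

The error is a power law in n with exponent -1, so it appears as a straight line of slope -1 on a log-log plot. Empirical scaling laws measure the analogous exponent for neural networks, where it is typically not known in advance.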

Data scaling advantages documented in early NLP

Banko and Brill (2001) demonstrated that increasing data often outperforms algorithmic improvements, while Kolachina et al. (2012) confirmed that power laws fit machine translation BLEU score improvements.

⚙️ Engineering Strategy and Risk Mitigation 3 insights

Optimize hyperparameters on small scale experiments first

Practitioners should tune all architectural choices and hyperparameters on small, cheap experiments rather than risking millions of dollars on direct large-scale training runs.

Extrapolate small model performance to large scale

Power-law relationships appear as linear trends in log-log plots, providing robust mathematical rules to extrapolate small-model behavior to large-scale compute and parameter regimes.
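A minimal sketch of this extrapolation, assuming a pure power law loss ≈ a · x^(-alpha) with no irreducible-error term. The data points, the constants, and the helper names `fit_power_law` and `predict` are hypothetical and chosen only for illustration:

```python
import numpy as np

def fit_power_law(x, loss):
    """Fit loss ~ a * x**(-alpha) by least squares on (log x, log loss)."""
    slope, intercept = np.polyfit(np.log(x), np.log(loss), deg=1)
    return float(np.exp(intercept)), float(-slope)

def predict(x, a, alpha):
    """Evaluate the fitted power law at new scales x."""
    return a * np.asarray(x, dtype=float) ** (-alpha)

# Synthetic stand-in for small-scale runs (made-up numbers, not real results).
small_sizes = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
true_a, true_alpha = 20.0, 0.1
log_noise = np.array([0.01, -0.01, 0.005, -0.005, 0.0])  # fixed perturbation
small_losses = true_a * small_sizes ** (-true_alpha) * np.exp(log_noise)

a_hat, alpha_hat = fit_power_law(small_sizes, small_losses)

# Extrapolate two orders of magnitude beyond the largest fitted point.
extrapolated = float(predict(1e10, a_hat, alpha_hat))
print(f"alpha_hat={alpha_hat:.4f}, predicted loss at 1e10: {extrapolated:.3f}")
```

Fitting in log space turns the power law into a straight line, so an ordinary degree-1 polynomial fit is enough. In practice the fit should be checked on held-out intermediate scales before it is trusted at the target scale.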

Scaling laws enable confident frontier model planning

This paradigm serves as an engineering "way of life" that allows confident planning of frontier model training through predictable regularities between resources and loss.

📉 Mathematical Forms and Empirical Patterns 3 insights

Test loss follows power law decay patterns

When plotted against dataset size, compute, or parameters, test loss follows power laws appearing as linear decay in log-log space far from the irreducible error floor.
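One common way to write this relationship, given here as an assumed parameterization rather than the lecture's own notation, uses a resource x (dataset size, compute, or parameters), an irreducible loss L_\infty, a scale constant A, and an exponent \alpha > 0:

```latex
L(x) = L_\infty + A\,x^{-\alpha}
\quad\Longrightarrow\quad
\log\bigl(L(x) - L_\infty\bigr) = \log A - \alpha \log x
```

When A x^{-\alpha} is much larger than L_\infty, the total loss L(x) itself is approximately linear in \log x on a log-log plot, with slope -\alpha.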

Different metrics show distinct scaling law forms

While pre-training loss follows power laws, downstream benchmark accuracy typically follows sigmoid curves, and capability improvements show linear trends against deployment dates.

Power laws break down near irreducible error

These power-law trends hold reliably while loss is far above the irreducible error (the entropy of the data), but the log-log line flattens as loss approaches that noise floor.
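The flattening can be seen numerically. The sketch below assumes a loss of the form floor + A · x^(-alpha); the constants `FLOOR`, `A`, and `ALPHA` and the helper `loglog_slope` are hypothetical values and names picked for illustration:

```python
import math

# Hypothetical constants: irreducible floor, scale, and exponent.
FLOOR, A, ALPHA = 1.7, 400.0, 0.3

def loss(x):
    """Assumed loss curve: irreducible floor plus a power-law term."""
    return FLOOR + A * x ** (-ALPHA)

def loglog_slope(f, x, factor=1.01):
    """Finite-difference estimate of d log f / d log x around x."""
    return (math.log(f(x * factor)) - math.log(f(x))) / math.log(factor)

slope_far = loglog_slope(loss, 1e3)    # reducible term dominates the loss
slope_near = loglog_slope(loss, 1e12)  # loss is close to the floor
slope_excess = loglog_slope(lambda x: loss(x) - FLOOR, 1e12)

print(f"slope far from floor:   {slope_far:.3f}")
print(f"slope near the floor:   {slope_near:.3f}")
print(f"slope of excess loss:   {slope_excess:.3f}")
```

Far from the floor the slope of the total loss is close to -ALPHA, while near the floor it approaches zero. Subtracting the floor restores the straight line, which is why fits near the floor usually need an explicit irreducible-error term.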

Bottom Line

Use small-scale experiments to fit power-law curves for loss versus compute, data, and parameters, then extrapolate these relationships to configure hyperparameters for large-scale training runs rather than optimizing directly at full scale.
