Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 9: Scaling Laws
TL;DR
This lecture introduces scaling laws as predictive power-law relationships that enable practitioners to optimize language model training on small budgets and confidently extrapolate performance to million-dollar large-scale runs, while tracing these empirical patterns back to classical machine learning theory and sample complexity research from the 1990s.
📜 Historical Foundations and Theory
Power law scaling predates modern language models
Research from Bell Labs (1993) and Hestness et al. (2017) established that model error decays polynomially with dataset size across diverse domains including speech recognition and machine translation.
Scaling laws parallel classical generalization theory
These laws function as empirical measurements of sample complexity, directly paralleling classical generalization bounds that theoretically predict performance changes as training data grows.
Data scaling advantages documented in early NLP
Banko and Brill demonstrated that increasing data often outperforms algorithmic improvements, while Kachina et al. (2012) confirmed that power laws fit machine translation BLEU score improvements.
⚙️ Engineering Strategy and Risk Mitigation
Optimize hyperparameters on small-scale experiments first
Practitioners should tune all architectural choices and hyperparameters on small, cheap experiments rather than risking millions of dollars on direct large-scale training runs.
Extrapolate small model performance to large scale
Power-law relationships appear as linear trends in log-log plots, providing robust mathematical rules to extrapolate small-model behavior to large-scale compute and parameter regimes.
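As a minimal sketch of this extrapolation idea: fit a straight line to loss versus compute in log-log space and read off the power-law exponent. All numbers below are illustrative, not measurements from the lecture.

```python
import numpy as np

# Hypothetical losses measured at small compute budgets (FLOPs),
# generated to follow a clean power law L(C) = A * C^(-alpha).
compute = np.array([1e17, 1e18, 1e19, 1e20])
loss = np.array([3.60, 3.00, 2.50, 2.08])

# A power law is a straight line in log-log space:
# log L = log A - alpha * log C.
slope, intercept = np.polyfit(np.log10(compute), np.log10(loss), 1)
alpha, A = -slope, 10.0 ** intercept

# Extrapolate four orders of magnitude past the largest measured run.
predicted_loss = A * (1e24) ** (-alpha)
print(f"alpha ~ {alpha:.3f}, predicted loss at 1e24 FLOPs ~ {predicted_loss:.2f}")
```

The robustness claim in the insight above is exactly this: because the fit is linear in log-log coordinates, two or three decades of cheap runs pin down the slope used for the expensive extrapolation.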
Scaling laws enable confident frontier model planning
This paradigm serves as an engineering "way of life" that allows confident planning of frontier model training through predictable regularities between resources and loss.
📉 Mathematical Forms and Empirical Patterns
Test loss follows power law decay patterns
When plotted against dataset size, compute, or parameters, test loss follows power laws appearing as linear decay in log-log space far from the irreducible error floor.
Different metrics show distinct scaling law forms
While pre-training loss follows power laws, downstream benchmark accuracy typically follows sigmoid curves, and capability improvements show linear trends against deployment dates.
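The sigmoid shape for downstream accuracy can be sketched as follows. The floor, ceiling, and midpoint parameters here are hypothetical stand-ins, not fitted values from the lecture.

```python
import numpy as np

def sigmoid_accuracy(log10_compute, floor=0.25, ceiling=0.95,
                     midpoint=21.0, steepness=1.5):
    """Illustrative sigmoid of benchmark accuracy vs. log10(compute).

    floor   -- chance-level accuracy before the capability appears
    ceiling -- saturation accuracy once the benchmark is solved
    """
    return floor + (ceiling - floor) / (
        1.0 + np.exp(-steepness * (log10_compute - midpoint)))

# Accuracy sits near chance at small scale, rises sharply through the
# midpoint, then saturates -- so smooth loss scaling can still look like
# a sudden capability jump on a thresholded benchmark.
for c in [18, 20, 21, 22, 24]:
    print(c, round(float(sigmoid_accuracy(c)), 3))
```

This is why loss, not benchmark accuracy, is the quantity that extrapolates cleanly with a power law.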
Power laws break down near irreducible error
These polynomial trends hold reliably when models operate far from the asymptotic entropy limit but break down as performance approaches the theoretical noise ceiling.
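The breakdown near the noise ceiling is visible in the common functional form L(D) = E + A * D^(-alpha), where E is the irreducible error. A small sketch with illustrative parameters:

```python
import numpy as np

# Hypothetical power law with an irreducible error floor E:
# L(D) = E + A * D^(-alpha). All parameter values are illustrative.
E, A, alpha = 1.7, 200.0, 0.3

D = np.logspace(6, 14, 9)           # dataset sizes (tokens), one per decade
L = E + A * D ** (-alpha)

# Local log-log slope of the raw loss: clearly negative while the
# power-law term dominates, flattening toward 0 as L approaches E.
slopes = np.diff(np.log10(L)) / np.diff(np.log10(D))
print("small-D slope:", round(float(slopes[0]), 3))
print("large-D slope:", round(float(slopes[-1]), 3))
```

In practice this means a straight-line fit in log-log space is only trustworthy while measured losses are well above the floor; near the entropy limit the curve bends and naive extrapolation overestimates the gains from more data.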
Bottom Line
Use small-scale experiments to fit power-law curves for loss versus compute, data, and parameters, then extrapolate these relationships to configure hyperparameters for large-scale training runs rather than optimizing directly at full scale.
More from Stanford Online
Stanford CS153 Frontier Systems | Anjney Midha from AMP PBC on Frontier Systems
Anjney Midha frames the current AI landscape as 'the great transition,' where industrial-scale model training meets a complete restructuring of the eight-layer infrastructure stack, while arguing that relationships and obsessions remain the ultimate asymmetric advantages for founders against entrenched incumbents.
Stanford Robotics Seminar ENGR319 | Spring 2026 | Ingredients for Long-Horizon Robot Autonomy
A researcher from Physical Intelligence argues that while robots now excel at short, dexterous tasks, true utility requires long-horizon autonomy for complex jobs like cleaning apartments or assembling server racks. The talk introduces MEM (Multiscale Embodied Memory), a system that uses compressed visual and linguistic memory to solve the latency and distribution shift problems that have historically prevented robots from tracking progress over extended time periods.
Stanford CS547 HCI Seminar | Spring 2026 | Observing the User Experience in 2026
Mike Kuniavsky and Elizabeth Goodman examine how AI has revolutionized UX research by automating traditional methods while simultaneously creating an 'authenticity crisis' through synthetic users and widespread fraud, arguing that maintaining 'ground truth' through direct human contact remains essential for valid insights and organizational influence.
Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 8: Parallelism
This lecture details how to scale language model training across massive clusters using 4D parallelism, contrasting TPU and GPU networking architectures while addressing the critical memory bottlenecks—particularly optimizer states—that dominate training costs at scale.