Stanford CS221 | Autumn 2025 | Lecture 17: Language Models

| Podcasts | March 09, 2026 | 3.1K views | 1:19:46

TL;DR

This lecture introduces modern language models as industrial-scale systems requiring millions of dollars and trillions of tokens to train, explaining their fundamental operation as auto-regressive next-token predictors that encode language structure through massive statistical modeling.

🏭 Industrial Scale of Modern LLMs 3 insights

Massive training datasets

Qwen 3 trains on 36 trillion tokens (~144 TB of text), equivalent to 90 billion sheets of paper stacked 9,000 km high—far exceeding the ISS orbital altitude of 400 km.
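The dataset figures above can be sanity-checked with back-of-envelope arithmetic (assuming the common rule of thumb of ~4 bytes of text per token, and ~0.1 mm per sheet of paper; both are assumptions, not numbers from the lecture):

```python
# Sanity-check the Qwen 3 dataset figures (assumes ~4 bytes/token).
tokens = 36e12                      # 36 trillion training tokens
bytes_total = tokens * 4            # ~4 bytes of UTF-8 text per token (assumption)
terabytes = bytes_total / 1e12
print(f"{terabytes:.0f} TB of text")      # ~144 TB

# Stack-of-paper comparison: ~90 billion sheets at ~0.1 mm each.
sheets = 90e9
stack_km = sheets * 0.1e-3 / 1000   # 0.1 mm per sheet, converted to km
print(f"{stack_km:.0f} km high")    # ~9,000 km, vs the ISS at ~400 km
```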

Extreme compute costs

Training Llama 3 requires ~3.9×10²⁵ FLOPs, costing approximately $42 million using H100 GPUs and taking 880,000 days on a single GPU versus 650,000 years on a MacBook.
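The single-GPU figure follows from dividing total FLOPs by sustained throughput. A minimal sketch, assuming an effective rate of ~5×10¹⁴ FLOP/s per H100 after utilization losses (the throughput value is an assumption chosen to match the quoted numbers, not a figure from the lecture):

```python
# Back-of-envelope GPU time for Llama 3 pre-training.
flops_total = 3.9e25       # total training FLOPs from the lecture
flops_per_sec = 5e14       # assumed sustained H100 throughput (FLOP/s)

seconds = flops_total / flops_per_sec
days = seconds / 86400
print(f"~{days:,.0f} single-GPU days")   # on the order of 900,000 days,
                                         # consistent with the ~880,000 quoted
```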

Space-based infrastructure

Major tech companies including Google and Nvidia are actively exploring space-based data centers and GPU deployments to meet the unsustainable computational demands of future models.

📝 Fundamental Nature of Language 3 insights

Structure and vocabulary

Language models capture the statistical structure of sequences comprising vocabulary (allowable symbols) and grammar (combination rules) across natural languages, code, and sign language.

Probabilistic worldview encoding

When completing 'investors celebrated' versus 'investors panicked' after a market crash, the model reveals implicit beliefs and statistical frequencies derived from training data rather than objective truth.

Distribution over sequences

A language model fundamentally represents a probability distribution P(sequence) over all possible character combinations, assigning higher likelihoods to grammatically and semantically coherent continuations.
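A toy illustration of "distribution over sequences": a character bigram model fit on a tiny hypothetical corpus (not the lecture's example) assigns higher probability to strings that follow familiar statistical patterns than to the same characters scrambled:

```python
from collections import Counter

# Fit a character bigram model on a tiny made-up corpus.
corpus = "the cat sat on the mat. the dog sat on the log."
pairs = Counter(zip(corpus, corpus[1:]))   # bigram counts
unigrams = Counter(corpus[:-1])            # first-character counts

def prob(seq, alpha=0.01, vocab=27):
    """P(seq) under the bigram model, with add-alpha smoothing
    so unseen bigrams get small but nonzero probability."""
    p = 1.0
    for a, b in zip(seq, seq[1:]):
        p *= (pairs[(a, b)] + alpha) / (unigrams[a] + alpha * vocab)
    return p

# The coherent string scores far higher than its reversal.
print(prob("the cat"), prob("tac eht"))
```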

⚙️ Technical Architecture 3 insights

Tensor operations

Models process batched input embeddings (B×T×D tensors) through neural networks to output probability distributions (B×T×V) over the vocabulary at each position.
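The shape flow can be sketched in NumPy with made-up sizes, standing in for the real network with a single linear layer (the dimensions B=2, T=5, D=16, V=100 are illustrative, not from the lecture):

```python
import numpy as np

B, T, D, V = 2, 5, 16, 100            # batch, sequence length, embed dim, vocab
rng = np.random.default_rng(0)

x = rng.normal(size=(B, T, D))        # batched input embeddings
W = rng.normal(size=(D, V))           # stand-in for the network: one linear layer
logits = x @ W                        # (B, T, V) scores over the vocabulary

# Softmax over the last axis turns scores into a per-position distribution.
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)

print(probs.shape)                    # (2, 5, 100)
print(probs[0, 0].sum())              # each position's distribution sums to ~1.0
```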

Auto-regressive generation

The model predicts one token, appends it to the input sequence, and repeats the process, enabling open-ended text completion through either greedy decoding or probabilistic sampling.
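The predict-append-repeat loop can be sketched generically; here a random stand-in replaces the trained model (any function mapping a prefix to a vocabulary distribution would slot in), and the tiny vocabulary size is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
V = 5  # tiny vocabulary for illustration

def next_token_probs(tokens):
    """Stand-in for a trained model: returns a distribution over V tokens."""
    logits = rng.normal(size=V)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def generate(prompt, steps, greedy=True):
    tokens = list(prompt)
    for _ in range(steps):
        p = next_token_probs(tokens)
        # Greedy decoding picks the argmax; sampling draws from p instead.
        nxt = int(np.argmax(p)) if greedy else int(rng.choice(V, p=p))
        tokens.append(nxt)            # append the prediction and repeat
    return tokens

print(generate([1, 2], steps=4))      # prompt of 2 tokens grows to 6
```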

Chain rule decomposition

Joint probabilities factor into products of conditional probabilities P(word|previous context), making next-token prediction mathematically equivalent to modeling complete sequences.
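The decomposition P(w1, w2, w3) = P(w1) · P(w2 | w1) · P(w3 | w1, w2) can be checked numerically; the conditional values below are made-up numbers for a hypothetical three-token sequence:

```python
import math

# Chain rule on a hypothetical sequence "the", "cat", "sat".
conditionals = [0.1,    # P("the")             -- made-up value
                0.04,   # P("cat" | "the")
                0.02]   # P("sat" | "the cat")

joint = math.prod(conditionals)                   # product of conditionals
log_joint = sum(math.log(p) for p in conditionals)  # sum in log space

print(joint)                                      # ~8e-05
# The log of the joint equals the sum of conditional log-probs,
# which is why training works with summed per-token log-likelihoods.
print(math.isclose(math.log(joint), log_joint))   # True
```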

🎓 Training Objectives 3 insights

Next-token prediction dominance

All major industrial models (GPT, Llama, Qwen) use next-token prediction as their primary pre-training objective, treating language modeling as multiclass classification over the vocabulary.
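Treating next-token prediction as multiclass classification means the training loss at each position is cross-entropy against the true next token. A minimal sketch with made-up logits over a four-word vocabulary:

```python
import numpy as np

V = 4
logits = np.array([2.0, 0.5, -1.0, 0.0])   # made-up scores over the vocabulary
target = 0                                  # index of the actual next token

# Softmax, then negative log-likelihood of the target class.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
loss = -np.log(probs[target])

# Loss is low here because the model already favors the right token;
# a uniform guess over V=4 would cost -log(1/4) ≈ 1.386 instead.
print(round(float(loss), 3))
```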

Masked language modeling

Alternative approaches predict missing interior tokens (e.g., filling the blank in 'import ___ as np'), though next-token prediction remains the universal standard for large-scale pre-training.

Batching for scale

Modern implementations process multiple sequences simultaneously while maintaining sequential dependencies, converting the fundamental operation into efficient high-throughput matrix computation.

Bottom Line

Language models are massive industrial artifacts that compress trillions of tokens into probability distributions, using next-token prediction to encode grammar, semantics, and implicit world knowledge through purely statistical auto-regressive training.

More from Stanford Online

Stanford CS221 | Autumn 2025 | Lecture 20: Fireside Chat, Conclusion (58:49)

Percy Liang reflects on AI's transformation from academic curiosity to global infrastructure, debunking sci-fi misconceptions about capabilities while arguing that academia's role in long-term research and critical evaluation remains essential as the job market shifts away from traditional entry-level software engineering.

Stanford CS221 | Autumn 2025 | Lecture 19: AI Supply Chains (1:14:36)

This lecture examines AI's economic impact through the lens of supply chains and organizational strategy, demonstrating why understanding compute monopolies, labor market shifts, and corporate decision-making is as critical as tracking algorithmic capabilities.

Stanford CS221 | Autumn 2025 | Lecture 18: AI & Society (1:12:10)

This lecture argues that AI developers bear unique ethical responsibility for societal outcomes, framing AI as a dual-use technology that requires active steering toward beneficial applications while preventing misuse and accidental harms through rigorous auditing and an ecosystem-aware approach.

Stanford CS221 | Autumn 2025 | Lecture 16: Logic II (1:15:47)

This lecture introduces First Order Logic as a powerful extension of propositional logic that uses objects, predicates, functions, and quantifiers to compactly represent complex relationships and generalizations without enumerating every possible instance.