Stanford CS221 | Autumn 2025 | Lecture 17: Language Models
TL;DR
This lecture introduces modern language models as industrial-scale systems that cost millions of dollars and consume trillions of tokens to train. It explains their fundamental operation: auto-regressive next-token prediction, which encodes the structure of language through massive statistical modeling.
🏭 Industrial Scale of Modern LLMs
Massive training datasets
Qwen 3 was trained on 36 trillion tokens (~144 TB of text), equivalent to roughly 90 billion sheets of paper; stacked, they would reach about 9,000 km, far above the ISS orbital altitude of ~400 km.
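The scale figures above can be sanity-checked with back-of-envelope arithmetic. This sketch assumes ~4 bytes of text per token and ~0.1 mm per sheet of paper, both typical rules of thumb rather than numbers from the lecture:

```python
# Sanity-check the dataset-scale figures (bytes/token and sheet thickness are assumptions)
tokens = 36e12                  # Qwen 3 training set: 36 trillion tokens
bytes_per_token = 4             # assumed average bytes of raw text per token
terabytes = tokens * bytes_per_token / 1e12
print(terabytes)                # ~144 TB

sheets = 90e9                   # ~90 billion sheets of paper
sheet_thickness_m = 0.1e-3      # assumed ~0.1 mm per sheet
stack_km = sheets * sheet_thickness_m / 1000
print(stack_km)                 # ~9,000 km, vs. ~400 km ISS altitude
```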
Extreme compute costs
Training Llama 3 requires roughly 3.9×10²⁵ FLOPs, costing approximately $42 million on H100 GPUs; the computation would take about 880,000 days (~2,400 years) on a single H100, or about 650,000 years on a MacBook.
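The cost estimate follows from the FLOP count. A minimal sketch, where the effective per-GPU throughput (~50% utilization of an H100) and the $/GPU-hour rate are assumptions chosen to be consistent with the lecture's figures, not quoted prices:

```python
# Back-of-envelope GPU time and cost for Llama 3 pre-training
total_flops = 3.9e25            # training compute from the lecture
eff_flops_per_sec = 5.1e14      # assumed effective H100 throughput (~50% utilization)
gpu_days = total_flops / eff_flops_per_sec / 86400
print(round(gpu_days))          # on the order of 880,000 single-GPU days

hourly_rate = 2.0               # assumed $/GPU-hour
cost = gpu_days * 24 * hourly_rate
print(f"${cost / 1e6:.0f}M")    # on the order of $42M under these assumptions
```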
Space-based infrastructure
Major tech companies including Google and Nvidia are actively exploring space-based data centers and GPU deployments to meet the unsustainable computational demands of future models.
📝 Fundamental Nature of Language
Structure and vocabulary
Language models capture the statistical structure of sequences comprising vocabulary (allowable symbols) and grammar (combination rules) across natural languages, code, and sign language.
Probabilistic worldview encoding
When completing 'investors celebrated' versus 'investors panicked' after a market crash, the model reveals implicit beliefs and statistical frequencies derived from training data rather than objective truth.
Distribution over sequences
A language model fundamentally represents a probability distribution P(sequence) over all possible character combinations, assigning higher likelihoods to grammatically and semantically coherent continuations.
⚙️ Technical Architecture
Tensor operations
Models process batched input embeddings (B×T×D matrices) through neural networks to output probability distributions (B×T×V) over the vocabulary at each position.
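The shape transformation above can be traced in a few lines. A minimal sketch with toy sizes, where a single linear projection stands in for the full network:

```python
import numpy as np

# One forward pass in shapes: (B, T, D) embeddings -> (B, T, V) distributions
B, T, D, V = 2, 5, 16, 100            # batch, sequence length, embed dim, vocab size
rng = np.random.default_rng(0)
x = rng.standard_normal((B, T, D))    # batched input embeddings
W = 0.1 * rng.standard_normal((D, V)) # toy output projection (stands in for the network)

logits = x @ W                        # scores over the vocabulary: (B, T, V)
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)   # softmax: one distribution per position

assert probs.shape == (B, T, V)
assert np.allclose(probs.sum(axis=-1), 1.0)  # each position sums to 1
```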
Auto-regressive generation
The model predicts one token, appends it to the input sequence, and repeats the process, enabling open-ended text completion through either greedy decoding or probabilistic sampling.
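The predict-append-repeat loop can be sketched with a toy stand-in for the model. Here a hypothetical bigram lookup table plays the role of the network, and decoding is greedy (always take the single prediction):

```python
# Greedy auto-regressive decoding with a toy bigram "model" (made-up table)
next_token = {"the": "cat", "cat": "sat", "sat": "down"}

def generate(prompt, steps):
    seq = list(prompt)
    for _ in range(steps):
        tok = next_token.get(seq[-1])  # predict the next token from the context
        if tok is None:
            break
        seq.append(tok)                # append it and feed the longer sequence back in
    return seq

print(generate(["the"], 3))  # → ['the', 'cat', 'sat', 'down']
```

A real model would instead emit a distribution over the whole vocabulary at each step, from which sampling picks a token; the loop structure is the same.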
Chain rule decomposition
Joint probabilities factor into products of conditional probabilities P(word|previous context), making next-token prediction mathematically equivalent to modeling complete sequences.
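The factorization is easy to see on a three-word example. All probabilities below are made up for illustration:

```python
# Chain rule on a toy sequence:
# P("the cat sat") = P(the) * P(cat | the) * P(sat | the cat)
p = {
    ("the",): 0.5,            # P(the)
    ("cat", "the"): 0.4,      # P(cat | the)
    ("sat", "the cat"): 0.7,  # P(sat | the cat)
}
joint = p[("the",)] * p[("cat", "the")] * p[("sat", "the cat")]
print(joint)  # ≈ 0.14: the joint probability of the full sequence
```

Because the joint is exactly this product of next-token conditionals, a model that predicts the next token well is, by construction, a model of complete sequences.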
🎓 Training Objectives
Next-token prediction dominance
All major industrial models (GPT, Llama, Qwen) use next-token prediction as their primary pre-training objective, treating language modeling as multiclass classification over the vocabulary.
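Viewed as multiclass classification, the training loss at each position is the cross-entropy between the model's distribution and the true next token. A minimal sketch over a toy 5-word vocabulary (all scores made up):

```python
import math

# Cross-entropy for one position: -log P(target | context)
logits = [2.0, 0.5, 0.1, -1.0, 0.0]  # model's score for each vocabulary entry
target = 0                            # index of the actual next token

m = max(logits)
log_z = m + math.log(sum(math.exp(s - m) for s in logits))  # stable log-sum-exp
loss = log_z - logits[target]         # equals -log softmax(logits)[target]
print(loss)
```

Summing this loss over every position in every training sequence gives the pre-training objective; minimizing it is exactly maximizing the likelihood of the training corpus.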
Masked language modeling
Alternative objectives predict missing interior tokens (e.g., filling the blank in 'import ___ as np'), though next-token prediction remains the universal standard for large-scale pre-training.
Batching for scale
Modern implementations process multiple sequences simultaneously while maintaining sequential dependencies, converting the fundamental operation into efficient high-throughput matrix computation.
Bottom Line
Language models are massive industrial artifacts that compress trillions of tokens into probability distributions, using next-token prediction to encode grammar, semantics, and implicit world knowledge through purely statistical auto-regressive training.