Stanford CS221 | Autumn 2025 | Lecture 17: Language Models
TL;DR
This lecture introduces modern language models as industrial-scale systems requiring millions of dollars and trillions of tokens to train, explaining their fundamental operation as auto-regressive next-token predictors that encode language structure through massive statistical modeling.
🏭 Industrial Scale of Modern LLMs 3 insights
Massive training datasets
Qwen 3 was trained on 36 trillion tokens (~144 TB of text), equivalent to 90 billion sheets of paper stacked 9,000 km high, far above the ISS orbital altitude of 400 km.
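The comparison above can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes about 4 bytes of text per token, about 1,600 bytes per printed sheet, and 0.1 mm of thickness per sheet. None of these constants is stated in the summary; they are illustrative values chosen because they reproduce its figures.

```python
# Back-of-the-envelope check of the dataset-size comparison.
# ASSUMPTIONS (not stated in the summary): bytes per token, bytes per sheet,
# and sheet thickness are illustrative values chosen to match its figures.
TOKENS = 36e12              # 36 trillion training tokens (from the summary)
BYTES_PER_TOKEN = 4         # assumed average
BYTES_PER_SHEET = 1_600     # assumed amount of text per printed sheet
SHEET_THICKNESS_M = 1e-4    # assumed 0.1 mm per sheet
ISS_ALTITUDE_KM = 400       # from the summary

total_bytes = TOKENS * BYTES_PER_TOKEN
sheets = total_bytes / BYTES_PER_SHEET
stack_km = sheets * SHEET_THICKNESS_M / 1_000

print(f"text size : {total_bytes / 1e12:.0f} TB")
print(f"sheets    : {sheets / 1e9:.0f} billion")
print(f"stack     : {stack_km:,.0f} km ({stack_km / ISS_ALTITUDE_KM:.1f}x ISS altitude)")
```

Changing any of the three assumed constants shifts the result proportionally, so the comparison is best read as an order-of-magnitude illustration.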
Extreme compute costs
Training Llama 3 requires ~3.9×10²⁵ FLOPs, costing approximately $42 million on H100 GPUs. On a single H100 the run would take about 880,000 days (roughly 2,400 years); on a MacBook it would take about 650,000 years.
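As a sanity check, the quoted figures can be combined to see what per-device throughput and rental price they imply. The sketch below only does arithmetic on the numbers in the summary; the derived values are implications of those numbers, not measured benchmarks or quoted prices.

```python
# Derive the per-device throughput and rental price implied by the summary.
# Nothing here is a measured benchmark; the outputs are only what the quoted
# figures imply when combined.
TOTAL_FLOPS = 3.9e25         # training compute (from the summary)
SINGLE_GPU_DAYS = 880_000    # time on one H100 (from the summary)
TOTAL_COST_USD = 42e6        # approximate cost (from the summary)
MACBOOK_YEARS = 650_000      # time on a MacBook (from the summary)

SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365

gpu_seconds = SINGLE_GPU_DAYS * SECONDS_PER_DAY
implied_gpu_flops_per_s = TOTAL_FLOPS / gpu_seconds

gpu_hours = SINGLE_GPU_DAYS * 24
implied_usd_per_gpu_hour = TOTAL_COST_USD / gpu_hours

single_gpu_years = SINGLE_GPU_DAYS / DAYS_PER_YEAR

macbook_seconds = MACBOOK_YEARS * DAYS_PER_YEAR * SECONDS_PER_DAY
implied_macbook_flops_per_s = TOTAL_FLOPS / macbook_seconds

print(f"implied H100 throughput   : {implied_gpu_flops_per_s:.2e} FLOP/s")
print(f"implied price             : ${implied_usd_per_gpu_hour:.2f} per GPU-hour")
print(f"single-GPU wall-clock     : {single_gpu_years:,.0f} years")
print(f"implied MacBook throughput: {implied_macbook_flops_per_s:.2e} FLOP/s")
```

The figures are mutually consistent with roughly 5×10¹⁴ sustained FLOP/s per GPU and about $2 per GPU-hour.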
Space-based infrastructure
Major tech companies including Google and Nvidia are actively exploring space-based data centers and GPU deployments to meet the unsustainable computational demands of future models.
📝 Fundamental Nature of Language 3 insights
Structure and vocabulary
Language models capture the statistical structure of sequences comprising vocabulary (allowable symbols) and grammar (combination rules) across natural languages, code, and sign language.
Probabilistic worldview encoding
Whether a model continues a sentence about a market crash with 'investors celebrated' or 'investors panicked' reveals the implicit beliefs and statistical frequencies it absorbed from training data, not objective truth.
Distribution over sequences
A language model fundamentally represents a probability distribution P(sequence) over all possible sequences of symbols from its vocabulary, assigning higher likelihoods to grammatically and semantically coherent sequences.
⚙️ Technical Architecture 3 insights
Tensor operations
Models process batched input embeddings (a B×T×D tensor: batch size × sequence length × embedding dimension) through a neural network that outputs a probability distribution over the vocabulary at each position (a B×T×V tensor, where V is the vocabulary size).
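The shape bookkeeping can be sketched in a few lines of plain Python. In the example below B, T, D, and V stand for batch size, sequence length, embedding dimension, and vocabulary size. The random embedding table and the single linear projection are hypothetical stand-ins for a real network, which would apply many layers between the two steps.

```python
import math
import random

random.seed(0)
B, T, D, V = 2, 3, 4, 5  # toy batch size, sequence length, embedding dim, vocab size

# Hypothetical parameters: an embedding table (V x D) and an output projection (D x V).
embedding = [[random.gauss(0, 1) for _ in range(D)] for _ in range(V)]
projection = [[random.gauss(0, 1) for _ in range(V)] for _ in range(D)]

def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# B x T integer token ids.
token_ids = [[random.randrange(V) for _ in range(T)] for _ in range(B)]

# B x T x D: look up one D-dimensional vector per token.
x = [[embedding[tok] for tok in seq] for seq in token_ids]

# B x T x V: project each vector to V logits, then normalize with softmax.
probs = [
    [softmax([sum(vec[d] * projection[d][v] for d in range(D)) for v in range(V)])
     for vec in seq]
    for seq in x
]

print("embeddings:", len(x), len(x[0]), len(x[0][0]))
print("probs     :", len(probs), len(probs[0]), len(probs[0][0]))
```

Each innermost list in `probs` sums to 1, one distribution over the vocabulary for every position in every sequence.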
Auto-regressive generation
The model predicts one token, appends it to the input sequence, and repeats the process, enabling open-ended text completion through either greedy decoding or probabilistic sampling.
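The predict, append, repeat loop can be sketched with a toy model. The lookup table below is hypothetical: its probabilities are invented, and it conditions only on the last token, whereas a real language model conditions on the whole prefix. The `generate` function and its parameters are likewise illustrative names.

```python
import random

# Hypothetical toy model: invented next-token probabilities keyed by the last token.
NEXT = {
    "<s>":        {"investors": 1.0},
    "investors":  {"celebrated": 0.3, "panicked": 0.7},
    "celebrated": {"</s>": 1.0},
    "panicked":   {"</s>": 1.0},
}

def next_token_probs(prefix):
    """Return a distribution over the next token given the prefix."""
    return NEXT[prefix[-1]]

def generate(prefix, greedy=True, rng=None, max_new_tokens=10):
    """Auto-regressive decoding: predict one token, append it, repeat."""
    tokens = list(prefix)
    rng = rng or random.Random(0)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        if greedy:
            # Greedy decoding: always take the most likely token.
            tok = max(probs, key=probs.get)
        else:
            # Sampling: draw a token in proportion to its probability.
            tok = rng.choices(list(probs), weights=list(probs.values()))[0]
        tokens.append(tok)
        if tok == "</s>":
            break
    return tokens

print(generate(["<s>"]))  # ['<s>', 'investors', 'panicked', '</s>']
print(generate(["<s>"], greedy=False, rng=random.Random(1)))
```

Greedy decoding is deterministic and always returns the same completion, while sampling can return either continuation depending on the random draw.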
Chain rule decomposition
Joint probabilities factor into products of conditional probabilities P(word|previous context), making next-token prediction mathematically equivalent to modeling complete sequences.
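Written out, the chain rule says P(x1, ..., xT) = P(x1) · P(x2 | x1) · ... · P(xT | x1, ..., xT-1). The sketch below checks this on a toy table of conditional probabilities. The numbers are invented, and for brevity the table conditions only on the previous token, a simplification that the chain rule itself does not require.

```python
import math

# Hypothetical conditional probabilities P(next | previous token); numbers are invented.
COND = {
    "<s>":    {"the": 0.6, "a": 0.4},
    "the":    {"market": 0.5, "crash": 0.5},
    "a":      {"market": 0.2, "crash": 0.8},
    "market": {"</s>": 1.0},
    "crash":  {"</s>": 1.0},
}

def sequence_log_prob(tokens):
    """log P(tokens) as a sum of log conditional probabilities."""
    return sum(math.log(COND[prev][cur]) for prev, cur in zip(tokens, tokens[1:]))

def all_sequences(prefix=("<s>",)):
    """Enumerate every complete sequence the toy model can produce."""
    if prefix[-1] == "</s>":
        yield prefix
        return
    for tok in COND[prefix[-1]]:
        yield from all_sequences(prefix + (tok,))

p = math.exp(sequence_log_prob(["<s>", "the", "crash", "</s>"]))
total = sum(math.exp(sequence_log_prob(seq)) for seq in all_sequences())

print(f"P(<s> the crash </s>) = {p:.2f}")
print(f"sum over all sequences = {total:.2f}")
```

Because every conditional distribution sums to 1, the product of conditionals automatically defines a valid distribution over complete sequences, which is why training a next-token predictor is enough to model whole sequences.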
🎓 Training Objectives 3 insights
Next-token prediction dominance
All major industrial models (GPT, Llama, Qwen) use next-token prediction as their primary pre-training objective, treating language modeling as multiclass classification over the vocabulary.
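Treating next-token prediction as multiclass classification means scoring the model with cross-entropy: at each position, the loss is the negative log-probability assigned to the token that actually comes next. The sketch below uses a hypothetical four-word vocabulary and invented model outputs to show how inputs and targets are formed by shifting the sequence by one position.

```python
import math

vocab = ["investors", "panicked", "celebrated", "crash"]  # hypothetical vocabulary
tokens = ["investors", "panicked", "crash"]               # hypothetical training sequence

inputs = tokens[:-1]   # what the model sees at each position
targets = tokens[1:]   # the token it should predict at that position

# Hypothetical model outputs: one distribution over vocab per input position.
predicted = [
    [0.1, 0.6, 0.2, 0.1],  # after "investors"
    [0.1, 0.1, 0.1, 0.7],  # after "investors panicked"
]

def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[target_index])

losses = [cross_entropy(p, vocab.index(t)) for p, t in zip(predicted, targets)]
mean_loss = sum(losses) / len(losses)

print(f"mean loss: {mean_loss:.4f}")
```

A model that put probability 1 on every correct next token would reach a loss of 0; spreading probability across wrong tokens raises it.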
Masked language modeling
Alternative approaches predict missing interior tokens (e.g., filling 'import ___ as numpy'), though next-token prediction remains the universal standard for large-scale pre-training.
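The difference between the two objectives shows up in how training examples are built from the same text. The sketch below is illustrative only: the token sequence, the `___` mask symbol, and both helper functions are hypothetical names, not taken from the lecture.

```python
MASK = "___"

def masked_lm_example(tokens, position):
    """Hide one interior token; the model predicts it using context on both sides."""
    inputs = list(tokens)
    target = inputs[position]
    inputs[position] = MASK
    return inputs, target

def next_token_examples(tokens):
    """Pair every prefix with the token that follows it (left context only)."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = ["import", "numpy", "as", "np"]  # hypothetical token sequence

print(masked_lm_example(tokens, 1))
for prefix, target in next_token_examples(tokens):
    print(prefix, "->", target)
```

The masked example yields one prediction with context on both sides, while the next-token formulation yields a prediction at every position using only the tokens to the left.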
Batching for scale
Modern implementations process many sequences in parallel as a batch while preserving the left-to-right dependency within each sequence, turning next-token prediction into efficient, high-throughput matrix computation.
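One common way to form a batch from sequences of different lengths is to pad them to a shared length and keep a mask marking the real tokens. The summary does not say how the lecture builds its batches, so the sketch below is an assumption: `pad_batch`, the token ids, and the reserved padding id 0 are all hypothetical.

```python
def pad_batch(sequences, pad_id=0):
    """Pad variable-length sequences to a rectangular B x T batch plus a mask."""
    max_len = max(len(seq) for seq in sequences)
    batch, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        batch.append(list(seq) + [pad_id] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return batch, mask

sequences = [[5, 8, 2], [7, 3], [4, 9, 6, 1]]  # hypothetical token ids
batch, mask = pad_batch(sequences)

for row, m in zip(batch, mask):
    print(row, m)
```

The rectangular batch can be handed to matrix operations in one call, and the mask lets the loss ignore padded positions so each sequence is still scored only on its own tokens.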
Bottom Line
Language models are massive industrial artifacts that compress trillions of tokens into probability distributions, using next-token prediction to encode grammar, semantics, and implicit world knowledge through purely statistical auto-regressive training.