Stanford CS221 | Autumn 2025 | Lecture 17: Language Models


TL;DR

This lecture introduces modern language models as industrial-scale systems requiring millions of dollars and trillions of tokens to train, explaining their fundamental operation as auto-regressive next-token predictors that encode language structure through massive statistical modeling.

🏭 Industrial Scale of Modern LLMs 3 insights

Massive training datasets

Qwen 3 trains on 36 trillion tokens (~144 TB of text), equivalent to 90 billion sheets of paper stacked 9,000 km high—far exceeding the ISS orbital altitude of 400 km.
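
A rough back-of-envelope check of those figures, assuming ~4 bytes of text per token, ~1,600 characters per printed page, and ~0.1 mm per sheet (all conversion factors here are assumptions for illustration, not quoted from the lecture):

    tokens = 36e12                    # Qwen 3 pre-training tokens
    bytes_per_token = 4               # assumed average for mostly-English text
    chars_per_page = 1_600            # assumed dense printed page
    sheet_thickness_mm = 0.1          # assumed standard paper thickness

    text_tb = tokens * bytes_per_token / 1e12          # ~144 TB of raw text
    pages = tokens * bytes_per_token / chars_per_page  # ~9e10 sheets of paper
    stack_km = pages * sheet_thickness_mm / 1e6        # ~9,000 km stack (mm -> km)

    print(f"{text_tb:.0f} TB, {pages:.1e} sheets, {stack_km:.0f} km stack")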

Extreme compute costs

Training Llama 3 requires ~3.9×10²⁵ FLOPs, costing approximately $42 million using H100 GPUs and taking 880,000 days on a single GPU versus 650,000 years on a MacBook.
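
The cost estimate follows from a simple conversion, assuming roughly 1×10¹⁵ FLOP/s of peak H100 throughput, ~50% hardware utilization, and ~$2 per GPU-hour of rental (assumed figures for illustration); this roughly reproduces the numbers above:

    total_flops = 3.9e25                  # Llama 3 training compute
    peak_flops_per_sec = 1e15             # assumed H100 BF16 peak throughput
    utilization = 0.5                     # assumed fraction of peak actually achieved
    usd_per_gpu_hour = 2.0                # assumed cloud rental price

    seconds = total_flops / (peak_flops_per_sec * utilization)
    gpu_days = seconds / 86_400                       # ~900,000 single-GPU days
    cost_usd = (seconds / 3_600) * usd_per_gpu_hour   # ~$43 million

    print(f"{gpu_days:.2e} GPU-days, ${cost_usd / 1e6:.0f}M")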

Space-based infrastructure

Major tech companies, including Google and Nvidia, are actively exploring space-based data centers and GPU deployments because the computational demands of future models are growing at an unsustainable rate.

📝 Fundamental Nature of Language 3 insights

Structure and vocabulary

Language models capture the statistical structure of sequences built from a vocabulary (the allowable symbols) and a grammar (the rules for combining them), a framing that applies equally to natural languages, code, and sign language.

Probabilistic worldview encoding

When choosing between completions such as 'investors celebrated' and 'investors panicked' after a prompt describing a market crash, the model reveals implicit beliefs and statistical frequencies absorbed from its training data rather than objective truth.

Distribution over sequences

A language model fundamentally represents a probability distribution P(sequence) over all possible sequences, assigning higher likelihood to grammatically and semantically coherent continuations.
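
A minimal illustration of the idea, using a hypothetical fragment of an explicit probability table (the sequences and probabilities are invented for illustration; real models define P implicitly over an astronomically large space):

    # Toy "language model": an explicit table of sequence probabilities.
    P = {
        "the market crashed today": 0.12,
        "the market celebrated today": 0.03,
        "today crashed market the": 0.0001,   # same words, incoherent order
    }

    best = max(P, key=P.get)
    print(best, P[best])   # coherent word order receives the most probability mass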

⚙️ Technical Architecture 3 insights

Tensor operations

Models process batched input embeddings (B×T×D tensors) through neural networks to output a probability distribution over the vocabulary at each position (a B×T×V tensor).
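
A shape-level sketch of that forward pass in PyTorch, with toy dimensions chosen arbitrarily and a single linear layer standing in for the full network:

    import torch

    B, T, D, V = 2, 8, 16, 100          # batch, sequence length, embedding dim, vocab size

    embeddings = torch.randn(B, T, D)   # input: one D-dimensional embedding per position
    to_vocab = torch.nn.Linear(D, V)    # stand-in for the whole transformer stack

    logits = to_vocab(embeddings)             # (B, T, V) unnormalized scores
    probs = torch.softmax(logits, dim=-1)     # (B, T, V) distribution at each position
    print(probs.shape, probs[0, 0].sum())     # torch.Size([2, 8, 100]), ~1.0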

Auto-regressive generation

The model predicts one token, appends it to the input sequence, and repeats the process, enabling open-ended text completion through either greedy decoding or probabilistic sampling.
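
A minimal sketch of that loop, assuming a model that maps a sequence of token IDs to per-position next-token logits (the interface is assumed for illustration); greedy decoding picks the argmax, while sampling draws from the softmax distribution:

    import torch

    def generate(model, token_ids, max_new_tokens=20, greedy=True):
        """Autoregressive decoding: predict one token, append it, repeat."""
        for _ in range(max_new_tokens):
            logits = model(token_ids)            # (T, V): one row of logits per position
            next_logits = logits[-1]             # distribution for the next token only
            if greedy:
                next_id = next_logits.argmax()
            else:
                probs = torch.softmax(next_logits, dim=-1)
                next_id = torch.multinomial(probs, num_samples=1).squeeze()
            token_ids = torch.cat([token_ids, next_id.view(1)])
        return token_ids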

Chain rule decomposition

Joint probabilities factor into products of conditional probabilities P(word|previous context), making next-token prediction mathematically equivalent to modeling complete sequences.
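
Concretely, the chain rule states P(x1, ..., xT) = P(x1) · P(x2 | x1) · ... · P(xT | x1, ..., xT-1), so the log-probability of a whole sequence is just the sum of per-token log-probabilities. A sketch, reusing the assumed model interface from the generation example:

    import torch

    def sequence_log_prob(model, token_ids):
        """Score a full sequence by summing log P(token_t | tokens before t)."""
        log_prob = 0.0
        for t in range(1, len(token_ids)):
            logits = model(token_ids[:t])                        # condition on the prefix
            log_probs = torch.log_softmax(logits[-1], dim=-1)    # next-token distribution
            log_prob += log_probs[token_ids[t]].item()           # score the actual next token
        return log_prob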

🎓 Training Objectives 3 insights

Next-token prediction dominance

All major industrial models (GPT, Llama, Qwen) use next-token prediction as their primary pre-training objective, treating language modeling as multiclass classification over the vocabulary.
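
Treated as multiclass classification, the pre-training loss is cross-entropy between the predicted distribution at each position and the token that actually came next. A minimal sketch, assuming logits shaped (B, T, V) and integer token IDs shaped (B, T):

    import torch
    import torch.nn.functional as F

    def next_token_loss(logits, token_ids):
        """Cross-entropy of position t's prediction against the token at t+1."""
        inputs = logits[:, :-1, :]       # predictions for positions 0..T-2
        targets = token_ids[:, 1:]       # the tokens that actually followed
        return F.cross_entropy(inputs.reshape(-1, inputs.size(-1)),
                               targets.reshape(-1))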

Masked language modeling

Alternative approaches predict missing interior tokens (e.g., filling the blank in 'import ___ as np'), though next-token prediction remains the universal standard for large-scale pre-training.
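
A sketch of that setup on the same example, with a hypothetical [MASK] placeholder (the masking scheme shown is assumed for illustration):

    # Masked language modeling: hide interior tokens and predict them from both sides.
    tokens = ["import", "numpy", "as", "np"]
    masked = ["import", "[MASK]", "as", "np"]   # the model must recover "numpy"

    # Next-token prediction, by contrast, only ever sees the left context:
    # P("numpy" | "import")  vs.  P("numpy" | "import", _, "as", "np")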

Batching for scale

Modern implementations process multiple sequences simultaneously while maintaining sequential dependencies, converting the fundamental operation into efficient high-throughput matrix computation.
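
A sketch of how several tokenized sequences are packed into one batch for the loss above, assuming equal lengths for simplicity (real pipelines pad or pack sequences to a fixed T):

    import torch

    sequences = [
        [5, 17, 42, 8, 99],     # hypothetical token IDs for three training examples
        [5, 3, 42, 61, 2],
        [7, 17, 19, 8, 4],
    ]

    batch = torch.tensor(sequences)      # (B, T) = (3, 5), processed in parallel
    inputs = batch[:, :-1]               # what the model conditions on
    targets = batch[:, 1:]               # the next token at every position
    print(inputs.shape, targets.shape)   # (3, 4) each: one prediction per position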

Bottom Line

Language models are massive industrial artifacts that compress trillions of tokens into probability distributions, using next-token prediction to encode grammar, semantics, and implicit world knowledge through purely statistical auto-regressive training.
