Stanford CS221 | Autumn 2025 | Lecture 17: Language Models


TL;DR

This lecture introduces modern language models as industrial-scale systems requiring millions of dollars and trillions of tokens to train, explaining their fundamental operation as auto-regressive next-token predictors that encode language structure through massive statistical modeling.

🏭 Industrial Scale of Modern LLMs 3 insights

Massive training datasets

Qwen 3 trains on 36 trillion tokens (~144 TB of text), equivalent to 90 billion sheets of paper stacked 9,000 km high—far exceeding the ISS orbital altitude of 400 km.
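
A rough back-of-envelope check of those figures, assuming ~4 bytes of text per token, ~1,600 characters per printed page, and ~0.1 mm per sheet (all conversion factors here are assumptions for illustration, not quoted from the lecture):

    tokens = 36e12                    # Qwen 3 pre-training tokens
    bytes_per_token = 4               # assumed average for mostly-English text
    chars_per_page = 1_600            # assumed dense printed page
    sheet_thickness_mm = 0.1          # assumed standard paper thickness

    text_tb = tokens * bytes_per_token / 1e12          # ~144 TB of raw text
    pages = tokens * bytes_per_token / chars_per_page  # ~9e10 sheets of paper
    stack_km = pages * sheet_thickness_mm / 1e6        # ~9,000 km stack (mm -> km)

    print(f"{text_tb:.0f} TB, {pages:.1e} sheets, {stack_km:.0f} km stack")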

Extreme compute costs

Training Llama 3 requires ~3.9×10²⁵ FLOPs, costing approximately $42 million using H100 GPUs and taking 880,000 days on a single GPU versus 650,000 years on a MacBook.
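
The cost estimate follows from a simple conversion, assuming roughly 1×10¹⁵ FLOP/s of peak H100 throughput, ~50% hardware utilization, and ~$2 per GPU-hour of rental (assumed figures for illustration); this roughly reproduces the numbers above:

    total_flops = 3.9e25                  # Llama 3 training compute
    peak_flops_per_sec = 1e15             # assumed H100 BF16 peak throughput
    utilization = 0.5                     # assumed fraction of peak actually achieved
    usd_per_gpu_hour = 2.0                # assumed cloud rental price

    seconds = total_flops / (peak_flops_per_sec * utilization)
    gpu_days = seconds / 86_400                       # ~900,000 single-GPU days
    cost_usd = (seconds / 3_600) * usd_per_gpu_hour   # ~$43 million

    print(f"{gpu_days:.2e} GPU-days, ${cost_usd / 1e6:.0f}M")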

Space-based infrastructure

Major tech companies, including Google and Nvidia, are actively exploring space-based data centers and GPU deployments because the computational demands of future models are growing at an unsustainable rate.

📝 Fundamental Nature of Language 3 insights

Structure and vocabulary

Language models capture the statistical structure of sequences built from a vocabulary (the allowable symbols) and a grammar (the rules for combining them), a framing that applies equally to natural languages, code, and sign language.

Probabilistic worldview encoding

When choosing between completions such as 'investors celebrated' and 'investors panicked' after a prompt describing a market crash, the model reveals implicit beliefs and statistical frequencies absorbed from its training data rather than objective truth.

Distribution over sequences

A language model fundamentally represents a probability distribution P(sequence) over all possible sequences, assigning higher likelihood to grammatically and semantically coherent continuations.
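
A minimal illustration of the idea, using a hypothetical fragment of an explicit probability table (the sequences and probabilities are invented for illustration; real models define P implicitly over an astronomically large space):

    # Toy "language model": an explicit table of sequence probabilities.
    P = {
        "the market crashed today": 0.12,
        "the market celebrated today": 0.03,
        "today crashed market the": 0.0001,   # same words, incoherent order
    }

    best = max(P, key=P.get)
    print(best, P[best])   # coherent word order receives the most probability mass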

⚙️ Technical Architecture 3 insights

Tensor operations

Models process batched input embeddings (B×T×D tensors) through neural networks to output a probability distribution over the vocabulary at each position (a B×T×V tensor).
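
A shape-level sketch of that forward pass in PyTorch, with toy dimensions chosen arbitrarily and a single linear layer standing in for the full network:

    import torch

    B, T, D, V = 2, 8, 16, 100          # batch, sequence length, embedding dim, vocab size

    embeddings = torch.randn(B, T, D)   # input: one D-dimensional embedding per position
    to_vocab = torch.nn.Linear(D, V)    # stand-in for the whole transformer stack

    logits = to_vocab(embeddings)             # (B, T, V) unnormalized scores
    probs = torch.softmax(logits, dim=-1)     # (B, T, V) distribution at each position
    print(probs.shape, probs[0, 0].sum())     # torch.Size([2, 8, 100]), ~1.0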

Auto-regressive generation

The model predicts one token, appends it to the input sequence, and repeats the process, enabling open-ended text completion through either greedy decoding or probabilistic sampling.
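
A minimal sketch of that loop, assuming a model that maps a sequence of token IDs to per-position next-token logits (the interface is assumed for illustration); greedy decoding picks the argmax, while sampling draws from the softmax distribution:

    import torch

    def generate(model, token_ids, max_new_tokens=20, greedy=True):
        """Autoregressive decoding: predict one token, append it, repeat."""
        for _ in range(max_new_tokens):
            logits = model(token_ids)            # (T, V): one row of logits per position
            next_logits = logits[-1]             # distribution for the next token only
            if greedy:
                next_id = next_logits.argmax()
            else:
                probs = torch.softmax(next_logits, dim=-1)
                next_id = torch.multinomial(probs, num_samples=1).squeeze()
            token_ids = torch.cat([token_ids, next_id.view(1)])
        return token_ids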

Chain rule decomposition

Joint probabilities factor into products of conditional probabilities P(word|previous context), making next-token prediction mathematically equivalent to modeling complete sequences.
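
Concretely, the chain rule states P(x1, ..., xT) = P(x1) · P(x2 | x1) · ... · P(xT | x1, ..., xT-1), so the log-probability of a whole sequence is just the sum of per-token log-probabilities. A sketch, reusing the assumed model interface from the generation example:

    import torch

    def sequence_log_prob(model, token_ids):
        """Score a full sequence by summing log P(token_t | tokens before t)."""
        log_prob = 0.0
        for t in range(1, len(token_ids)):
            logits = model(token_ids[:t])                        # condition on the prefix
            log_probs = torch.log_softmax(logits[-1], dim=-1)    # next-token distribution
            log_prob += log_probs[token_ids[t]].item()           # score the actual next token
        return log_prob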

🎓 Training Objectives 3 insights

Next-token prediction dominance

All major industrial models (GPT, Llama, Qwen) use next-token prediction as their primary pre-training objective, treating language modeling as multiclass classification over the vocabulary.
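
Treated as multiclass classification, the pre-training loss is cross-entropy between the predicted distribution at each position and the token that actually came next. A minimal sketch, assuming logits shaped (B, T, V) and integer token IDs shaped (B, T):

    import torch
    import torch.nn.functional as F

    def next_token_loss(logits, token_ids):
        """Cross-entropy of position t's prediction against the token at t+1."""
        inputs = logits[:, :-1, :]       # predictions for positions 0..T-2
        targets = token_ids[:, 1:]       # the tokens that actually followed
        return F.cross_entropy(inputs.reshape(-1, inputs.size(-1)),
                               targets.reshape(-1))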

Masked language modeling

Alternative approaches predict missing interior tokens (e.g., filling the blank in 'import ___ as np'), though next-token prediction remains the universal standard for large-scale pre-training.
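
A sketch of that setup on the same example, with a hypothetical [MASK] placeholder (the masking scheme shown is assumed for illustration):

    # Masked language modeling: hide interior tokens and predict them from both sides.
    tokens = ["import", "numpy", "as", "np"]
    masked = ["import", "[MASK]", "as", "np"]   # the model must recover "numpy"

    # Next-token prediction, by contrast, only ever sees the left context:
    # P("numpy" | "import")  vs.  P("numpy" | "import", _, "as", "np")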

Batching for scale

Modern implementations process multiple sequences simultaneously while maintaining sequential dependencies, converting the fundamental operation into efficient high-throughput matrix computation.
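
A sketch of how several tokenized sequences are packed into one batch for the loss above, assuming equal lengths for simplicity (real pipelines pad or pack sequences to a fixed T):

    import torch

    sequences = [
        [5, 17, 42, 8, 99],     # hypothetical token IDs for three training examples
        [5, 3, 42, 61, 2],
        [7, 17, 19, 8, 4],
    ]

    batch = torch.tensor(sequences)      # (B, T) = (3, 5), processed in parallel
    inputs = batch[:, :-1]               # what the model conditions on
    targets = batch[:, 1:]               # the next token at every position
    print(inputs.shape, targets.shape)   # (3, 4) each: one prediction per position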

Bottom Line

Language models are massive industrial artifacts that compress trillions of tokens into probability distributions, using next-token prediction to encode grammar, semantics, and implicit world knowledge through purely statistical auto-regressive training.
