Stanford CS25: Transformers United V6 | Overview of Transformers
TL;DR
Stanford's CS25 introductory lecture traces the evolution from hand-engineered features to Transformer architectures, explaining how self-attention mechanisms enable parallel processing and long-context modeling, while exploring how billion-parameter language models develop emergent reasoning capabilities through next-token prediction on internet-scale data.
📈 Evolution of ML Architectures (3 insights)
From hand-crafted features to raw data
Pre-2012 machine learning relied on expensive hand-labeled datasets and engineered features until deep learning enabled models to process raw data directly for end-to-end prediction.
Self-supervised learning reduces labeling costs
Modern models learn general representations by reconstructing corrupted or masked data, eliminating the need for manual annotations while creating reusable knowledge for downstream tasks.
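A minimal sketch of masked-token self-supervision (toy sentence, illustrative 30% mask rate; `[MASK]` and the token list are placeholders, not the lecture's actual setup):

```python
import random

random.seed(0)
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Randomly corrupt ~30% of tokens; the training target is to reconstruct
# the originals from the corrupted input -- no human annotation required.
masked, targets = [], {}
for i, tok in enumerate(tokens):
    if random.random() < 0.3:
        masked.append("[MASK]")
        targets[i] = tok      # remember what the model must recover
    else:
        masked.append(tok)
```

The supervision signal (`targets`) is derived entirely from the raw data itself, which is what lets these objectives scale without labeling costs.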
Next-token prediction maximizes data utility
Language modeling shifted from simple classification to predicting subsequent tokens in sequences, allowing models to leverage massive unlabeled text corpora for training.
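The shift to next-token prediction can be sketched in a few lines: every prefix of a sequence becomes a training example whose label is the token that follows it (toy word-level tokens for illustration):

```python
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Each (context, target) pair is a free training example: the context is
# a prefix of the sequence, the target is the next token.
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in pairs:
    print(context, "->", target)
```

A single sentence of n tokens yields n-1 supervised examples, which is why unlabeled text corpora become usable training data at scale.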
🧠 Transformer Mechanics (4 insights)
Self-attention uses Query-Key-Value matching
The mechanism learns token relationships by computing query-key matches to retrieve relevant values, analogous to searching a library by matching search terms against book summaries.
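A minimal NumPy sketch of scaled dot-product attention over Query, Key, and Value matrices (random matrices stand in for learned projections; shapes are illustrative):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: each query scores its match against
    every key, and the softmaxed scores weight a mixture of the values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key match scores
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V, weights                      # retrieved values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 tokens, key/query dimension 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = attention(Q, K, V)
```

In the library analogy from the lecture, `Q` is the search phrase, `K` is each book's summary, and `V` is the book's contents; the softmax row decides how much of each book to read.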
Parallel processing enables GPU acceleration
Unlike sequential RNNs and LSTMs, transformers process entire sequences simultaneously, supporting efficient training on GPUs and handling contexts of millions of tokens.
Positional encodings preserve sequence order
Since attention mechanisms inherently lack positional awareness, embeddings that encode token positions are added to maintain understanding of word ordering and sequence structure.
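One standard choice (the sinusoidal scheme from the original Transformer paper) gives each position a unique pattern of sines and cosines at geometrically spaced frequencies; a sketch:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings: even dimensions use sine, odd
    dimensions use cosine, so every position gets a distinct fingerprint
    that is added to the token embeddings."""
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]              # frequency index
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(seq_len=50, d_model=16)
```

Because the encoding is simply added to the embeddings, the attention layers can learn to use or ignore position as needed without any architectural change.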
Multi-headed attention captures diverse relationships
Multiple attention matrices operate in parallel to detect different types of connections between tokens, creating richer representations than single attention mechanisms.
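A toy sketch of the multi-head idea: each head projects the input into its own lower-dimensional Q/K/V subspace, attends independently, and the heads' outputs are concatenated (random projections stand in for learned weight matrices):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, n_heads, rng):
    """Each head gets its own Q/K/V projections of size d_model/n_heads,
    runs scaled dot-product attention, and the results are concatenated
    back to the model dimension."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    outputs = []
    for _ in range(n_heads):
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        weights = softmax(Q @ K.T / np.sqrt(d_head))
        outputs.append(weights @ V)
    return np.concatenate(outputs, axis=-1)           # (seq_len, d_model)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))      # 6 tokens, model dimension 16
out = multi_head_attention(X, n_heads=4, rng=rng)
```

Because each head attends in its own subspace, one head can track, say, syntactic agreement while another tracks coreference, yielding richer representations than a single attention map.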
⚡ Modern LLM Training and Scale (3 insights)
Scale triggers emergent abilities
When scaled to billions of parameters, transformers develop unexpected capabilities such as mathematical reasoning and complex problem-solving, despite being trained only on a next-token prediction objective.
Internet-scale pre-training drives generalization
Models trained on diverse internet text learn statistical distributions of human language, enabling zero-shot and few-shot performance on tasks never explicitly seen during training.
BabyLM explores data efficiency
Research initiatives compare how small models trained on limited, human-like interactive and multimodal data perform against traditional models trained on massive internet text corpora.
Bottom Line
Transformers have replaced RNNs by using parallelizable self-attention to process long contexts efficiently, while modern LLMs leverage massive internet-scale pre-training to develop emergent reasoning capabilities through simple next-token prediction.
More from Stanford Online
Stanford CS25: Transformers United V6 | From Representation Learning to World Modeling
This lecture explores JEPA (Joint Embedding Predictive Architecture) as an energy-based framework for world modeling that operates in latent space rather than pixels, with Hazel Nam introducing Causal JEPA—a method using object-centric slot attention and aggressive masking to teach models physical object dynamics and interactions.
Stanford Robotics Seminar ENGR319 | Spring 2026 | Mechanical Intelligence in Locomotion
This seminar introduces 'missile-scale' robotics (~1kg) as a critical gap between micro and macro robots, demonstrating that mechanical redundancy (morphological intelligence) enables reliable locomotion in unpredictable terrain without sensors by applying Shannon's information theory to legged locomotion, while biological gait-switching strategies can overcome inherent speed limitations.
Stanford Robotics Seminar ENGR319 | Spring 2026 | Robot Learning from Human Experience
This seminar presents a paradigm shift in robot learning by replacing teleoperation with direct capture of human egocentric experience using wearable sensors, demonstrating that scaling human data—combined with alignment techniques like optimal transport—enables dramatic performance gains and zero-shot task transfer to robots.
Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 5: GPUs, TPUs
This lecture introduces GPU architecture for language model training, explaining the shift from serial CPU execution to parallel GPU throughput, the critical importance of memory hierarchies, and the SIMT programming model essential for efficient deep learning systems.