Stanford CS25: Transformers United V6 I From Next-Token Prediction to Next-Generation Intelligence

Podcasts | May 11, 2026 | 895 views | 57:57

TL;DR

Shrimai Prabhumoye presents advanced LLM pre-training strategies from her work at Nvidia, demonstrating that curriculum learning (two-phase training) and front-loading reasoning data during pre-training create stronger foundations and durable performance gains that cannot be matched by increased compute in later stages.

🗂️ Data Curation & Curriculum Learning · 2 insights

Two-phase pre-training maximizes data potential

A curriculum approach where phase one emphasizes diverse, lower-quality web data and phase two focuses on high-quality sources (math, code, Wikipedia) with repeated epochs yields 17% better performance than naive training and 3.4% better than optimal blending without ordering.
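A minimal sketch of how such a two-phase blend schedule could be expressed, assuming hypothetical source names (web_crawl, wikipedia, math, code), token counts, quality scores, and re-weighting factors; none of these figures come from the talk:

```python
from dataclasses import dataclass

@dataclass
class Source:
    name: str
    tokens_available: int   # unique tokens in this source (illustrative numbers)
    quality: float          # 0..1 score, e.g. from a quality classifier

SOURCES = [
    Source("web_crawl", 8_000_000_000, 0.45),
    Source("wikipedia",    50_000_000, 0.95),
    Source("math",         30_000_000, 0.92),
    Source("code",        400_000_000, 0.85),
]

def blend_weights(sources, phase, quality_cutoff=0.8):
    """Per-source sampling weights for one curriculum phase.

    Phase 1 favours diversity: weight sources roughly by how many unique
    tokens they offer. Phase 2 favours quality: up-weight sources above the
    cutoff, which in practice means repeating them for extra epochs.
    """
    if phase == 1:
        raw = {s.name: float(s.tokens_available) for s in sources}
    else:
        raw = {s.name: s.tokens_available * (4.0 if s.quality >= quality_cutoff else 0.25)
               for s in sources}
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}

for phase in (1, 2):
    weights = blend_weights(SOURCES, phase)
    print(f"phase {phase}:", {k: round(v, 3) for k, v in weights.items()})
```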

Quality and epoch estimation prevent diminishing returns

Building quality classifiers and estimating optimal repetition rates for each data source before creating blends ensures high-value tokens are fully utilized without overfitting, addressing projections that LLMs will exhaust 95% of human-generated data by 2030.
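A hedged sketch of the epoch-estimation step, with a hypothetical per-source repetition cap (the 4-epoch limit and the token counts are illustrative assumptions, not figures from the talk):

```python
def plan_epochs(tokens_available, tokens_requested, max_epochs=4.0):
    """Decide how many passes (epochs) to make over one data source.

    If the blend requests more tokens than the source contains, the source
    is repeated, but never beyond max_epochs, so high-value data is fully
    used without being memorised. Returns (epochs, tokens_actually_served).
    """
    epochs = min(tokens_requested / tokens_available, max_epochs)
    return epochs, int(epochs * tokens_available)

# Example: the phase-2 blend asks for 200M math tokens but only 30M unique exist.
epochs, served = plan_epochs(tokens_available=30_000_000, tokens_requested=200_000_000)
print(epochs, served)  # 4.0 epochs -> 120M tokens; the shortfall is reallocated elsewhere
```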

🧠 Front-Loading Reasoning · 3 insights

Reasoning should be a foundation, not a post-hoc addition

Injecting reasoning data during pre-training creates a 'reason base' model, unlike the standard pipeline that treats reasoning as a post-training fix; this approach yields 16% better performance immediately after pre-training and 9.3% better after SFT.
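One way to picture the difference from the standard pipeline is at the data-loader level: a small share of each pre-training batch is drawn from reasoning traces rather than holding all of them back for SFT. The sketch below is illustrative only; the document names and the reasoning fraction are assumptions, not ratios reported in the talk.

```python
import random

def sample_batch(web_docs, reasoning_docs, batch_size, reasoning_frac=0.05, rng=random):
    """Draw one pre-training batch that mixes in reasoning traces from the start,
    instead of reserving all reasoning data for post-training (SFT/RL)."""
    n_reason = int(batch_size * reasoning_frac)
    batch = (rng.choices(reasoning_docs, k=n_reason)
             + rng.choices(web_docs, k=batch_size - n_reason))
    rng.shuffle(batch)
    return batch

# Toy corpora with made-up names; reasoning_frac=0.25 is purely illustrative.
web = [f"web_doc_{i}" for i in range(1_000)]
reasoning = [f"reasoning_trace_{i}" for i in range(100)]
print(sample_batch(web, reasoning, batch_size=8, reasoning_frac=0.25))
```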

High-quality pre-training data unlocks hidden post-training gains

While mixing high-quality (SHQ) and low-quality (LDQ) reasoning data showed no benefit immediately after pre-training, models with both achieved a 4.25% boost after SFT, indicating early exposure to quality reasoning primes the model for better refinement later.

Early reasoning advantages are durable and compute-efficient

Models trained with reasoning data during pre-training maintained a 19% average advantage (39% on complex math benchmarks like AIME) after full post-training, and could not be matched by doubling SFT compute or reallocating all reasoning data to post-training.

Bottom Line

To build state-of-the-art reasoning models, allocate high-quality reasoning data to pre-training rather than reserving it solely for fine-tuning, and implement a two-phase curriculum that prioritizes data diversity before intensive high-quality repetition.

More from Stanford Online

Stanford Robotics Seminar ENGR319 | Spring 2026 | Unlocking Autonomous Medical Robotics
1:02:32
Stanford Online

This seminar outlines a roadmap for autonomous surgical robotics to address critical healthcare labor shortages, proposing a physics-based approach built on four pillars—perception, modeling, planning, and control—that achieves sub-2mm precision through real-time digital twinning rather than relying on data-scarce foundation models.

about 6 hours ago · 7 points
Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 10: Inference
1:25:30
Stanford Online

Inference now dominates AI economics, with OpenAI generating 8.6 trillion tokens daily, enough to exceed the compute of a frontier model training run in under four days. Unlike training, autoregressive inference cannot be parallelized along the sequence dimension (each new token depends on the ones before it), making it fundamentally memory-bandwidth bound rather than compute bound; batch sizes under 295 on H100s fail to saturate GPU compute.

about 7 hours ago · 9 points
Stanford CS25: Transformers United V6 I The Ultra-Scale Talk: Scaling Training to Thousands of GPUs
1:01:48
Stanford Online

Nouamane Tazi from Hugging Face explains how to scale transformer training to thousands of GPUs using data parallelism strategies, from basic Distributed Data Parallel (DDP) to Fully Sharded Data Parallel (FSDP/ZeRO), emphasizing memory optimization techniques and the critical importance of overlapping communication with computation to keep GPUs fully utilized.

about 7 hours ago · 9 points