Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 16: Post-Training - RLVR
TL;DR
This lecture explains why RLHF with learned reward models runs into overoptimization, and how RLVR (Reinforcement Learning from Verifiable Rewards) lets training on verifiable tasks such as math and coding keep scaling with compute, using simpler algorithms such as GRPO.
🎯 The Shift to Verifiable Rewards 3 insights
RLHF hits an overoptimization wall
Training against learned reward models inevitably overfits to preference data, creating a hard annotation bottleneck that limits how much compute can be applied.
Verifiable rewards unlock unlimited scaling
Hard verification signals such as math correctness or code execution provide ground-truth objectives, much like the win/loss signal in AlphaGo, so RL can keep absorbing compute without overfitting to a learned reward model.
Breakthroughs in thinking models
OpenAI's recent solutions to open math problems demonstrate how RLVR enables extended reasoning chains through long-context training on verifiable objectives.
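To make "verifiable reward" concrete, here is a minimal sketch of a math-correctness check. It is an illustration under stated assumptions, not the verifier used in the lecture: the convention that the final answer appears inside \boxed{...}, and the helper names `extract_boxed` and `math_reward`, are hypothetical.

```python
import re
from fractions import Fraction


def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None.

    Assumption: the model is prompted to put its final answer in \\boxed{}.
    Nested braces are not handled in this sketch.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def math_reward(response, reference):
    """Binary reward: 1.0 if the boxed answer matches the reference, else 0.0."""
    answer = extract_boxed(response)
    if answer is None:
        return 0.0  # no parsable final answer, so no reward
    try:
        # Compare numerically so that "0.5" and "1/2" count as the same answer.
        return 1.0 if Fraction(answer) == Fraction(reference) else 0.0
    except (ValueError, ZeroDivisionError):
        # Non-numeric answers fall back to exact string comparison.
        return 1.0 if answer == reference.strip() else 0.0


if __name__ == "__main__":
    print(math_reward("Half of one is \\boxed{0.5}", "1/2"))
    print(math_reward("I think the answer is 4", "4"))
```

Because the reward comes from a fixed rule rather than a learned model, there is no reward model for the policy to overfit, although a rule this simple can still be gamed or can reject correct answers in unexpected formats.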
⚠️ PPO Implementation Challenges 3 insights
Complexity and sensitivity
PPO requires navigating dozens of implementation details and hyperparameters, with small engineering choices drastically altering optimization outcomes.
The value network burden
PPO requires training a separate value model as large as the policy itself, consuming significant memory that could otherwise support larger models or inference.
Common degenerate configurations
Many practitioners unknowingly reduce PPO to a bandit algorithm by setting gamma = lambda = 1 (the discount factor and the GAE parameter), which discards the temporal structure the algorithm was designed to capture.
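The degenerate setting can be seen in a small sketch of generalized advantage estimation (GAE), the advantage estimator commonly paired with PPO. The reward and value numbers below are made up for illustration, and the function name `gae` is hypothetical.

```python
def gae(rewards, values, gamma, lam):
    """Generalized advantage estimation over one finished episode.

    rewards[t] is the reward at step t and values[t] is the value estimate
    for the state at step t. The value after the final step is taken as 0.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


if __name__ == "__main__":
    # Illustrative numbers: a single terminal reward, as in outcome-based RL.
    rewards = [0.0, 0.0, 0.0, 1.0]
    values = [0.2, 0.4, 0.5, 0.7]
    print([round(a, 3) for a in gae(rewards, values, gamma=1.0, lam=1.0)])
    print([round(a, 3) for a in gae(rewards, values, gamma=0.99, lam=0.95)])
```

With gamma = lambda = 1 the TD residuals telescope, so every step's advantage is just the total episode reward minus that step's value estimate. No discounting or bootstrapping remains, which is the sense in which the setup behaves like a bandit with a baseline.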
🚀 GRPO: The Simpler Alternative 3 insights
Eliminating the value network
GRPO removes PPO's most complex component by estimating advantages as z-scores across groups of outputs sampled from the same prompt.
Group-based relative advantage
Instead of comparing rewards to a learned value function, GRPO computes relative performance by comparing each output against the mean and standard deviation of its peer group.
The open-source standard
Originally introduced by DeepSeek, GRPO has become the dominant algorithm for post-training on verifiable tasks due to its simplicity and reduced memory requirements.
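The group-relative advantage described above can be sketched in a few lines. This is a minimal illustration, not DeepSeek's implementation: the function name `group_advantages`, the use of the population standard deviation, and the small `eps` added for numerical stability are assumptions.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Z-score each reward against its group (all samples for one prompt).

    Assumptions: population standard deviation, and eps in the denominator
    to avoid dividing by zero when every sample gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled answers to one prompt, scored by a binary verifier.
    print([round(a, 3) for a in group_advantages([1.0, 0.0, 0.0, 1.0])])
    # If all samples agree, every advantage is zero.
    print(group_advantages([1.0, 1.0, 1.0, 1.0]))
```

No value network appears anywhere: the group mean plays the role of the baseline. One consequence visible in the second call is that a prompt whose samples all receive the same reward contributes no learning signal.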
Bottom Line
For verifiable reasoning tasks like mathematics and coding, use GRPO instead of PPO: it drops the separate value network and estimates advantages relative to a group of sampled outputs, which makes reinforcement learning simpler to scale.