Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 16: Post-Training - RLVR

Stanford Online

Podcasts | May 27, 2026 | 6.95 Thousand views | 1:15:51

TL;DR

This lecture explains why RLHF against a learned reward model runs into overoptimization, and how RLVR (Reinforcement Learning from Verifiable Rewards) lets RL compute keep scaling on verifiable tasks such as math and coding, using simpler algorithms such as GRPO.

🎯 The Shift to Verifiable Rewards

RLHF hits an overoptimization wall

Training against a learned reward model eventually overfits to the preference data it was trained on. More annotation is then needed before more compute can help, so annotation becomes the bottleneck.

Verifiable rewards unlock unlimited scaling

Hard verification signals such as math correctness or code execution provide a ground-truth objective, much like the win/loss signal in AlphaGo, so RL can keep scaling with compute without overfitting to a learned reward model.
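To make "verifiable reward" concrete, here is a minimal sketch of a math checker. It is not taken from the lecture: the `Answer:` marker convention and the function names `parse_final_answer` and `math_reward` are assumptions made for illustration, and real graders handle many more answer formats.

```python
from fractions import Fraction


def parse_final_answer(completion: str):
    """Return the text after the last 'Answer:' marker, or None if absent.

    The 'Answer:' convention is an assumption for this sketch.
    """
    marker = "Answer:"
    idx = completion.rfind(marker)
    if idx == -1:
        return None
    return completion[idx + len(marker):].strip()


def math_reward(completion: str, reference: str) -> float:
    """Binary reward: 1.0 if the final answer equals the reference, else 0.0.

    Answers are compared as exact rationals, so '0.5' and '1/2' match.
    """
    answer = parse_final_answer(completion)
    if answer is None:
        return 0.0
    try:
        return 1.0 if Fraction(answer) == Fraction(reference) else 0.0
    except (ValueError, ZeroDivisionError):
        # Unparseable or malformed answers earn no reward.
        return 0.0
```

Because the reward is computed by a rule rather than predicted by a model, there is no learned reward model for the policy to exploit, although the policy can still exploit a checker that is too lenient.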

Breakthroughs in thinking models

OpenAI's recent solutions to open math problems demonstrate how RLVR enables extended reasoning chains through long-context training on verifiable objectives.

⚠️ PPO Implementation Challenges

Complexity and sensitivity

PPO involves dozens of implementation details and hyperparameters, and small engineering choices can drastically change optimization outcomes.

The value network burden

PPO requires training a separate value model as large as the policy itself, consuming significant memory that could otherwise support larger models or inference.

Common degenerate configurations

Many practitioners unknowingly reduce PPO to a bandit algorithm by setting gamma=lambda=1, destroying the temporal structure the algorithm was designed to capture.
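The effect of gamma=lambda=1 can be checked with a small sketch of Generalized Advantage Estimation (GAE), the advantage estimator commonly paired with PPO. The function name `gae_advantages` and the toy numbers are assumptions for illustration, not code from the lecture.

```python
def gae_advantages(rewards, values, gamma, lam):
    """Generalized Advantage Estimation for a single episode.

    values[t] is the value estimate V(s_t); the value after the
    final step is taken to be 0 (the episode terminates).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        # One-step temporal-difference error.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Toy episode: reward only at the last token, as in outcome-based rewards.
rewards = [0.0, 0.0, 0.0, 1.0]
values = [0.4, 0.5, 0.7, 0.9]
adv = gae_advantages(rewards, values, gamma=1.0, lam=1.0)
```

With gamma=lambda=1 the TD errors telescope, so every step's advantage is the total remaining reward minus V(s_t), here roughly 0.6, 0.5, 0.3, 0.1. The value function acts only as a per-step baseline on a single episode-level reward, which is the bandit-like behavior described above.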

🚀 GRPO: The Simpler Alternative

Eliminating the value network

GRPO removes PPO's most complex component, the value network, by estimating each output's advantage as a z-score within a group of outputs sampled from the same prompt.

Group-based relative advantage

Instead of comparing rewards to a learned value function, GRPO computes relative performance by comparing each output against the mean and standard deviation of its peer group.
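The group-relative advantage described above can be sketched in a few lines. This is a minimal illustration, not a full GRPO implementation: the function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` stabilizer are assumptions, and implementations differ on these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Z-score each reward against its own group.

    rewards: rewards for several completions sampled from ONE prompt.
    eps: small constant (an assumption) that avoids division by zero
         when every completion in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one math prompt, scored 1.0 if correct else 0.0.
group_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(group_rewards)
```

Correct completions receive a positive advantage and incorrect ones a negative advantage. If every completion in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal.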

The open-source standard

Originally introduced by DeepSeek, GRPO has become the dominant algorithm for post-training on verifiable tasks due to its simplicity and reduced memory requirements.

Bottom Line

For verifiable reasoning tasks like mathematics and coding, use GRPO instead of PPO: it drops the separate value network and estimates advantages relative to a group of samples from the same prompt, which makes reinforcement learning simpler to scale.
