The "secret sauce" of recent AI breakthroughs: Post-training with RLVR (and RLHF) | Lex Fridman
TL;DR
Recent AI breakthroughs in reasoning models stem from Reinforcement Learning with Verifiable Rewards (RLVR), which trains models by rewarding accurate solutions to objectively checkable problems like math and coding, enabling scalable performance gains through iterative trial-and-error rather than human preference optimization.
🎯 RLVR: The New Training Paradigm
Verifiable rewards replace learned preferences
Unlike RLHF which optimizes for aggregate human preferences, RLVR uses automatically verifiable outcomes—such as math accuracy or code correctness—as direct rewards, enabling optimization at much larger scales without human labeling.
The generate-grade loop mechanism
Models generate multiple solution attempts and receive rewards based on objective accuracy, creating a reinforcement learning environment where the model learns from trial-and-error on verifiable tasks without human intervention.
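The generate-grade loop can be sketched in a few lines. This is a toy illustration, not the actual training code discussed in the episode: the regex-based answer extraction, the function names, and the binary reward are illustrative stand-ins, and real systems feed the graded attempts into a policy-gradient update such as GRPO or PPO.

```python
import re

def verify_answer(output: str, expected: str) -> float:
    """Binary verifiable reward: 1.0 if the final number in the model's
    output matches the expected answer, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", output)
    return 1.0 if numbers and numbers[-1] == expected else 0.0

def generate_grade_step(generate, problem: str, expected: str, n: int = 8):
    """One generate-grade iteration: sample n attempts, grade each with the
    verifier. In a real run the (attempt, reward) pairs would drive a
    policy-gradient update; here they are simply returned."""
    attempts = [generate(problem) for _ in range(n)]
    return [(a, verify_answer(a, expected)) for a in attempts]
```

The key property is that no human is in the loop: the reward comes entirely from the objective check, so the loop can run at whatever scale compute allows.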
Expanding beyond deterministic domains
While math and coding provide clear verification signals, researchers are extending RLVR to open-ended domains using 'rubrics' or LM-as-a-judge frameworks to evaluate more subjective scientific and reasoning problems.
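A rubric-based reward can be sketched by replacing the deterministic verifier with a grader LM. The rubric text and the `judge` callable below are hypothetical placeholders; in practice `judge` would be a call to a grading model.

```python
RUBRIC = [
    "States its final claim explicitly",
    "Supports the claim with evidence",
    "Acknowledges at least one limitation",
]

def rubric_reward(response: str, judge) -> float:
    """Fraction of rubric criteria the grader says are satisfied.
    `judge` stands in for an LM call that answers 'yes' or 'no'."""
    verdicts = [
        judge(f"Criterion: {c}\n\nResponse: {response}\n\nSatisfied? (yes/no)")
        for c in RUBRIC
    ]
    hits = sum(1.0 for v in verdicts if v.strip().lower().startswith("yes"))
    return hits / len(RUBRIC)
```

The trade-off is that the reward is only as reliable as the judge, which is why deterministic domains like math and code came first.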
🧠 Emergent Reasoning Behaviors
Spontaneous 'aha moments' and self-correction
RLVR training causes models to naturally develop step-by-step reasoning chains and recognize their own errors mid-generation, spontaneously verbalizing mistakes ('Wait, I need to reconsider') without explicit programming of this behavior.
Inference-time scaling emerges naturally
Models trained with RLVR inherently learn to use more tokens and 'think longer' to solve problems, with response lengths growing naturally during training as the model discovers that extended reasoning improves final accuracy.
Unlocking existing knowledge versus learning
Experiments show RLVR can rapidly improve accuracy (e.g., 15% to 50% in 50 training steps) by unlocking capabilities already present from pre-training rather than teaching fundamentally new mathematical knowledge.
⚡ Compute and Scaling Properties
Post-training compute rivals pre-training
Modern RLVR runs approach pre-training in total GPU hours—Grok 4 reportedly used similar compute for post-training as pre-training—though RLVR is memory-bound (due to generating long sequences) rather than compute-bound like pre-training.
RLVR follows scaling laws; RLHF plateaus
RLVR demonstrates predictable scaling where logarithmic increases in training compute yield linear performance gains, whereas RLHF quickly reaches diminishing returns since preference tuning merely averages stylistic differences rather than solving harder problems.
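The claimed log-linear relationship can be made concrete with a toy curve: performance rises by a fixed increment for every order of magnitude of compute. The `base` and `slope` constants are illustrative, not measured values from the episode.

```python
import math

def toy_scaling_curve(compute: float, base: float = 10.0, slope: float = 8.0) -> float:
    """Toy log-linear scaling: every 10x increase in training compute adds
    a fixed `slope` accuracy points on top of `base`."""
    return base + slope * math.log10(compute)

# Each order of magnitude of compute buys the same fixed gain:
points = [toy_scaling_curve(10 ** k) for k in range(1, 5)]
```

Under this picture the differences between successive points are constant, which is what "logarithmic compute, linear gains" means; an RLHF-style plateau would instead show shrinking differences.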
The three-stage training recipe
Effective reasoning models require sequential intervention: mid-training with curated reasoning traces to establish skills, RLVR for intensive trial-and-error learning on verifiable problems, and RLHF as a final polish for style and tone.
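The recipe reduces to a fixed pipeline where ordering matters. The stage functions below are stubs that only record sequencing, not real training code; the names are illustrative.

```python
STAGES_RUN = []

def mid_train(model, reasoning_traces):
    """Stage 1: supervised training on curated reasoning traces (stub)."""
    STAGES_RUN.append("mid-training")
    return model

def rlvr(model, verifiable_tasks):
    """Stage 2: reinforcement learning with verifiable rewards (stub)."""
    STAGES_RUN.append("rlvr")
    return model

def rlhf(model, preference_data):
    """Stage 3: preference tuning as a final polish for style and tone (stub)."""
    STAGES_RUN.append("rlhf")
    return model

def train_reasoning_model(base_model, traces, tasks, prefs):
    """The sequential recipe: skills first, verifiable RL second, RLHF last."""
    return rlhf(rlvr(mid_train(base_model, traces), tasks), prefs)
```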
🔬 Contamination Concerns and Future Directions
Data contamination debates cloud benchmarks
Researchers debate whether RLVR gains reflect true reasoning or memorization, citing cases like Qwen-3, where models achieve suspiciously high accuracy on specific benchmarks, suggesting exposure to similar problems during pre-training.
The path to RLVR 2.0
Future iterations will incorporate process reward models and value functions to grade intermediate reasoning steps rather than just final answers, moving beyond simple question-answer verification to optimize the quality of the thinking process itself.
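The shift from outcome rewards to process rewards amounts to scoring every intermediate step instead of checking only the final answer. A minimal sketch, with `step_scorer` as a stand-in for a learned process reward model:

```python
def process_reward(reasoning_steps, step_scorer) -> float:
    """Outcome-only RLVR grades the final answer; a process reward model
    instead scores each intermediate step. `step_scorer` stands in for a
    learned model returning a score in [0, 1] per step; here the overall
    reward is the mean step score."""
    if not reasoning_steps:
        return 0.0
    return sum(step_scorer(s) for s in reasoning_steps) / len(reasoning_steps)
```

Averaging is just one aggregation choice; a real system might also use a value function over partial traces, as the episode suggests.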
Bottom Line
Organizations should prioritize RLVR over RLHF for developing advanced reasoning capabilities, allocating significant compute resources to verifiable domains like math and code while implementing rigorous contamination checks to ensure models are truly reasoning rather than recalling memorized solutions.
More from Lex Fridman Podcast
Origin story of OpenClaw: From 1-hour prototype to 180,000 GitHub stars | Peter Steinberger
Peter Steinberger explains how a 1-hour WhatsApp-to-CLI prototype evolved into OpenClaw, the fastest-growing GitHub repository in history (175,000+ stars), by creating a self-modifying AI agent that prioritizes fun and accessibility over corporate polish.
How to code with AI agents - Advice from OpenClaw creator | Peter Steinberger and Lex Fridman
Steinberger details his evolution to an 'agentic engineering' workflow using multiple CLI-based AI agents simultaneously, arguing that mastery requires developing empathy for how agents perceive limited context while embracing imperfection and concise prompts over complex orchestration.
Timeline to AGI: When will superhuman AI be created? | Lex Fridman Podcast
The conversation contrasts the "AI 2027" report's milestone-based path to AGI (superhuman coder → researcher → ASI by 2031) with the "jagged capabilities" view, concluding that while AI will automate significant software development tasks within months, fully autonomous research and general computer use remain distant due to specification challenges and uneven capability profiles.
Advice for beginners in AI: How to learn and what to build | Lex Fridman Podcast
Aspiring AI researchers should build small language models from scratch to master fundamentals, then specialize deeply in narrow areas like RLHF or character training, while carefully weighing the trade-off between academia's intellectual freedom and frontier labs' high compensation paired with an intense '996' work culture.