The "secret sauce" of recent AI breakthroughs: Post-training with RLVR (and RLHF) | Lex Fridman

| Podcasts | February 06, 2026 | 20.8K views | 21:15

TL;DR

Recent AI breakthroughs in reasoning models stem from Reinforcement Learning with Verifiable Rewards (RLVR), which trains models by rewarding correct solutions to objectively checkable problems such as math and coding. This enables scalable performance gains through iterative trial and error rather than human preference optimization.

🎯 RLVR: The New Training Paradigm 3 insights

Verifiable rewards replace learned preferences

Unlike RLHF, which optimizes for aggregate human preferences, RLVR uses automatically verifiable outcomes (such as math accuracy or code correctness) as direct rewards, enabling optimization at much larger scale without human labeling.

The generate-grade loop mechanism

Models generate multiple solution attempts and receive rewards based on objective accuracy, creating a reinforcement learning environment where the model learns from trial-and-error on verifiable tasks without human intervention.
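
The generate-grade loop described above can be sketched in a few lines. This is a minimal illustration, not an actual training implementation: `generate` stands in for sampling from the policy model, and the exact-match grader is the simplest possible verifier.

```python
# Minimal sketch of the RLVR generate-grade loop. `generate` is a stand-in
# for sampling from the policy LM; real systems would also use the rewards
# to update the model (e.g. via a policy-gradient step).
import itertools
from typing import Callable, List

def grade(answer: str, reference: str) -> float:
    """Verifiable reward: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0

def rlvr_step(generate: Callable[[str], str],
              problem: str, reference: str, k: int = 4) -> List[float]:
    """Sample k attempts and grade each one; no human labels needed."""
    attempts = [generate(problem) for _ in range(k)]
    return [grade(a, reference) for a in attempts]

# Toy "policy" that alternates between a right and a wrong answer.
answers = itertools.cycle(["4", "5"])
rewards = rlvr_step(lambda p: next(answers), "2 + 2 = ?", "4")
```

The point of the sketch is that the reward signal comes entirely from the checker, which is why the loop scales without human annotation.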

Expanding beyond deterministic domains

While math and coding provide clear verification signals, researchers are extending RLVR to open-ended domains, using rubrics or LLM-as-a-judge frameworks to evaluate more subjective scientific and reasoning problems.
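
One way to picture the rubric approach: score a response against weighted criteria and collapse the result into a scalar reward. Everything here is illustrative; the rubric and the `judge` callable are hypothetical stand-ins for a prompted judge model.

```python
# Hedged sketch: extending RLVR rewards to open-ended tasks via a rubric
# graded by a judge. `judge` is a hypothetical callable returning a 0-1
# score per criterion; real systems would prompt an LLM for each score.
from typing import Callable, Dict

RUBRIC: Dict[str, float] = {
    "factual_accuracy": 0.5,   # criterion weights sum to 1.0
    "cites_evidence":   0.3,
    "clear_reasoning":  0.2,
}

def rubric_reward(response: str,
                  judge: Callable[[str, str], float],
                  rubric: Dict[str, float] = RUBRIC) -> float:
    """Weighted sum of per-criterion judge scores -> one scalar reward."""
    return sum(weight * judge(response, criterion)
               for criterion, weight in rubric.items())

# Stub judge that awards full marks, for illustration only.
reward = rubric_reward("some answer", lambda response, criterion: 1.0)
```

The design choice worth noting: by reducing subjective quality to a single number, the same RL machinery used for math and code can be reused unchanged.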

🧠 Emergent Reasoning Behaviors 3 insights

Spontaneous 'aha moments' and self-correction

RLVR training causes models to naturally develop step-by-step reasoning chains and recognize their own errors mid-generation, spontaneously verbalizing mistakes ('Wait, I need to reconsider') without explicit programming of this behavior.

Inference-time scaling emerges naturally

Models trained with RLVR inherently learn to use more tokens and 'think longer' to solve problems, with response lengths growing naturally during training as the model discovers that extended reasoning improves final accuracy.

Unlocking existing knowledge versus learning

Experiments show RLVR can rapidly improve accuracy (e.g., 15% to 50% in 50 training steps) by unlocking capabilities already present from pre-training rather than teaching fundamentally new mathematical knowledge.

⚙️ Compute and Scaling Properties 3 insights

Post-training compute rivals pre-training

Modern RLVR runs approach pre-training in total GPU hours; Grok 4 reportedly used as much compute for post-training as for pre-training. Unlike compute-bound pre-training, however, RLVR is memory-bound, because generation of long sequences dominates the workload.

RLVR follows scaling laws; RLHF plateaus

RLVR demonstrates predictable scaling: performance improves roughly linearly with the logarithm of training compute, so each doubling of compute buys a similar gain. RLHF, by contrast, quickly hits diminishing returns, since preference tuning mainly averages out stylistic differences rather than solving harder problems.
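
The claimed scaling behavior can be made concrete with a toy fit. The coefficients below are made up purely for illustration; the only point is that a log-linear law implies each doubling of compute yields the same increment.

```python
# Illustration of log-linear scaling: accuracy = a + b * log2(compute).
# Coefficients a and b are invented for this sketch, not measured values.
import math

def predicted_accuracy(compute: float, a: float = 0.20, b: float = 0.05) -> float:
    """Hypothetical scaling fit in log2(compute)."""
    return a + b * math.log2(compute)

# Gain from each successive doubling of compute (2 -> 4 -> 8 -> 16 -> 32).
gains = [predicted_accuracy(2 ** (i + 1)) - predicted_accuracy(2 ** i)
         for i in range(1, 5)]
# Every doubling yields the same constant increment, b.
```

This constant-gain-per-doubling shape is what distinguishes the RLVR curve from the RLHF curve described above, which flattens out instead.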

The three-stage training recipe

Effective reasoning models require three sequential stages: mid-training on curated reasoning traces to establish skills, RLVR for intensive trial-and-error learning on verifiable problems, and RLHF as a final polish for style and tone.
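
The three-stage recipe can be written down as a simple ordered pipeline. The stage names follow the summary; the objective strings are paraphrases and the `run` function is a placeholder, since real training would update model weights at each stage.

```python
# Sketch of the three-stage recipe as an ordered pipeline. Stage names
# follow the summary; everything else is illustrative scaffolding.
from dataclasses import dataclass
from typing import List

@dataclass
class Stage:
    name: str
    objective: str

RECIPE: List[Stage] = [
    Stage("mid-training", "supervised learning on curated reasoning traces"),
    Stage("RLVR", "trial-and-error RL on verifiable math/code problems"),
    Stage("RLHF", "final polish of style and tone via preference rewards"),
]

def run(model: str) -> List[str]:
    """Apply stages in order; a real pipeline would train here."""
    return [f"{model}: {stage.name} -> {stage.objective}" for stage in RECIPE]

log = run("base")
```

Ordering matters in this recipe: RLVR needs the reasoning skills seeded by mid-training, and RLHF comes last so that style tuning does not interfere with the verifiable-reward optimization.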

🔬 Contamination Concerns and Future Directions 2 insights

Data contamination debates cloud benchmarks

Researchers debate whether RLVR gains reflect true reasoning or memorization, citing cases like Qwen-3, where models achieve suspiciously high accuracy on specific benchmarks, suggesting potential exposure to similar problems during pre-training.

The path to RLVR 2.0

Future iterations will incorporate process reward models and value functions to grade intermediate reasoning steps rather than just final answers, moving beyond simple question-answer verification to optimize the quality of the thinking process itself.
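
A rough sketch of what a process-level reward might look like: grade each intermediate step and blend that with the final-answer check. The `step_scorer` callable and the blending weight are hypothetical; actual process reward models are trained networks, not hand-written functions.

```python
# Hedged sketch of a process reward: score intermediate reasoning steps,
# not just the final answer. `step_scorer` is a hypothetical stand-in for
# a trained process reward model.
from typing import Callable, List

def process_reward(steps: List[str],
                   step_scorer: Callable[[str], float],
                   final_correct: bool,
                   step_weight: float = 0.5) -> float:
    """Blend mean per-step quality with the outcome-level reward."""
    step_score = sum(step_scorer(s) for s in steps) / len(steps)
    outcome = 1.0 if final_correct else 0.0
    return step_weight * step_score + (1 - step_weight) * outcome

r = process_reward(["define x", "solve for x"],
                   step_scorer=lambda step: 1.0,
                   final_correct=True)
```

The contrast with plain RLVR is the reward's shape: instead of a single end-of-episode signal, the model gets credit (or blame) for the quality of the reasoning trace itself.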

Bottom Line

Organizations should prioritize RLVR over RLHF for developing advanced reasoning capabilities, allocating significant compute resources to verifiable domains like math and code while implementing rigorous contamination checks to ensure models are truly reasoning rather than recalling memorized solutions.

More from This Week in Startups (Jason Calacanis)

Timeline to AGI: When will superhuman AI be created? | Lex Fridman Podcast · 22:08

The conversation contrasts the "AI 2027" report's milestone-based path to AGI (superhuman coder → researcher → ASI by 2031) with the "jagged capabilities" view, concluding that while AI will automate significant software development tasks within months, fully autonomous research and general computer use remain distant due to specification challenges and uneven capability profiles.

about 2 months ago · 9 points
Advice for beginners in AI: How to learn and what to build | Lex Fridman Podcast · 30:57

Aspiring AI researchers should build small language models from scratch to master fundamentals, then specialize deeply in narrow areas like RLHF or character training, while carefully weighing the trade-offs between academia's intellectual freedom and frontier labs' high compensation but intense 996 work culture.

about 2 months ago · 10 points