The "secret sauce" of recent AI breakthroughs: Post-training with RLVR (and RLHF) | Lex Fridman
TL;DR
Recent AI breakthroughs in reasoning models stem from Reinforcement Learning with Verifiable Rewards (RLVR), which trains models by rewarding accurate solutions to objectively checkable problems like math and coding, enabling scalable performance gains through iterative trial-and-error rather than human preference optimization.
🎯 RLVR: The New Training Paradigm
Verifiable rewards replace learned preferences
Unlike RLHF which optimizes for aggregate human preferences, RLVR uses automatically verifiable outcomes—such as math accuracy or code correctness—as direct rewards, enabling optimization at much larger scales without human labeling.
The generate-grade loop mechanism
Models generate multiple solution attempts and receive rewards based on objective accuracy, creating a reinforcement learning environment where the model learns from trial-and-error on verifiable tasks without human intervention.
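The generate-grade loop can be sketched in a few lines. This is a toy illustration, not the actual training code discussed in the episode: the regex-based answer extraction, the function names, and the binary reward are illustrative stand-ins, and real systems feed the graded attempts into a policy-gradient update such as GRPO or PPO.

```python
import re

def verify_answer(output: str, expected: str) -> float:
    """Binary verifiable reward: 1.0 if the final number in the model's
    output matches the expected answer, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", output)
    return 1.0 if numbers and numbers[-1] == expected else 0.0

def generate_grade_step(generate, problem: str, expected: str, n: int = 8):
    """One generate-grade iteration: sample n attempts, grade each with the
    verifier. In a real run the (attempt, reward) pairs would drive a
    policy-gradient update; here they are simply returned."""
    attempts = [generate(problem) for _ in range(n)]
    return [(a, verify_answer(a, expected)) for a in attempts]
```

The key property is that no human is in the loop: the reward comes entirely from the objective check, so the loop can run at whatever scale compute allows.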
Expanding beyond deterministic domains
While math and coding provide clear verification signals, researchers are extending RLVR to open-ended domains using 'rubrics' or LM-as-a-judge frameworks to evaluate more subjective scientific and reasoning problems.
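A rubric-based reward can be sketched by replacing the deterministic verifier with a grader LM. The rubric text and the `judge` callable below are hypothetical placeholders; in practice `judge` would be a call to a grading model.

```python
RUBRIC = [
    "States its final claim explicitly",
    "Supports the claim with evidence",
    "Acknowledges at least one limitation",
]

def rubric_reward(response: str, judge) -> float:
    """Fraction of rubric criteria the grader says are satisfied.
    `judge` stands in for an LM call that answers 'yes' or 'no'."""
    verdicts = [
        judge(f"Criterion: {c}\n\nResponse: {response}\n\nSatisfied? (yes/no)")
        for c in RUBRIC
    ]
    hits = sum(1.0 for v in verdicts if v.strip().lower().startswith("yes"))
    return hits / len(RUBRIC)
```

The trade-off is that the reward is only as reliable as the judge, which is why deterministic domains like math and code came first.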
🧠 Emergent Reasoning Behaviors
Spontaneous 'aha moments' and self-correction
RLVR training causes models to naturally develop step-by-step reasoning chains and recognize their own errors mid-generation, spontaneously verbalizing mistakes ('Wait, I need to reconsider') without explicit programming of this behavior.
Inference-time scaling emerges naturally
Models trained with RLVR inherently learn to use more tokens and 'think longer' to solve problems, with response lengths growing naturally during training as the model discovers that extended reasoning improves final accuracy.
Unlocking existing knowledge versus learning
Experiments show RLVR can rapidly improve accuracy (e.g., 15% to 50% in 50 training steps) by unlocking capabilities already present from pre-training rather than teaching fundamentally new mathematical knowledge.
⚡ Compute and Scaling Properties
Post-training compute rivals pre-training
Modern RLVR runs approach pre-training in total GPU hours—Grok 4 reportedly used similar compute for post-training as pre-training—though RLVR is memory-bound (due to generating long sequences) rather than compute-bound like pre-training.
RLVR follows scaling laws; RLHF plateaus
RLVR demonstrates predictable scaling where logarithmic increases in training compute yield linear performance gains, whereas RLHF quickly reaches diminishing returns since preference tuning merely averages stylistic differences rather than solving harder problems.
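The claimed log-linear relationship can be made concrete with a toy curve: performance rises by a fixed increment for every order of magnitude of compute. The `base` and `slope` constants are illustrative, not measured values from the episode.

```python
import math

def toy_scaling_curve(compute: float, base: float = 10.0, slope: float = 8.0) -> float:
    """Toy log-linear scaling: every 10x increase in training compute adds
    a fixed `slope` accuracy points on top of `base`."""
    return base + slope * math.log10(compute)

# Each order of magnitude of compute buys the same fixed gain:
points = [toy_scaling_curve(10 ** k) for k in range(1, 5)]
```

Under this picture the differences between successive points are constant, which is what "logarithmic compute, linear gains" means; an RLHF-style plateau would instead show shrinking differences.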
The three-stage training recipe
Effective reasoning models require sequential intervention: mid-training with curated reasoning traces to establish skills, RLVR for intensive trial-and-error learning on verifiable problems, and RLHF as a final polish for style and tone.
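The recipe reduces to a fixed pipeline where ordering matters. The stage functions below are stubs that only record sequencing, not real training code; the names are illustrative.

```python
STAGES_RUN = []

def mid_train(model, reasoning_traces):
    """Stage 1: supervised training on curated reasoning traces (stub)."""
    STAGES_RUN.append("mid-training")
    return model

def rlvr(model, verifiable_tasks):
    """Stage 2: reinforcement learning with verifiable rewards (stub)."""
    STAGES_RUN.append("rlvr")
    return model

def rlhf(model, preference_data):
    """Stage 3: preference tuning as a final polish for style and tone (stub)."""
    STAGES_RUN.append("rlhf")
    return model

def train_reasoning_model(base_model, traces, tasks, prefs):
    """The sequential recipe: skills first, verifiable RL second, RLHF last."""
    return rlhf(rlvr(mid_train(base_model, traces), tasks), prefs)
```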
🔬 Contamination Concerns and Future Directions
Data contamination debates cloud benchmarks
Researchers debate whether RLVR gains reflect true reasoning or memorization, citing cases like Qwen-3, where models achieve suspiciously high accuracy on specific benchmarks, suggesting exposure to similar problems during pre-training.
The path to RLVR 2.0
Future iterations will incorporate process reward models and value functions to grade intermediate reasoning steps rather than just final answers, moving beyond simple question-answer verification to optimize the quality of the thinking process itself.
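The shift from outcome rewards to process rewards amounts to scoring every intermediate step instead of checking only the final answer. A minimal sketch, with `step_scorer` as a stand-in for a learned process reward model:

```python
def process_reward(reasoning_steps, step_scorer) -> float:
    """Outcome-only RLVR grades the final answer; a process reward model
    instead scores each intermediate step. `step_scorer` stands in for a
    learned model returning a score in [0, 1] per step; here the overall
    reward is the mean step score."""
    if not reasoning_steps:
        return 0.0
    return sum(step_scorer(s) for s in reasoning_steps) / len(reasoning_steps)
```

Averaging is just one aggregation choice; a real system might also use a value function over partial traces, as the episode suggests.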
Bottom Line
Organizations should prioritize RLVR over RLHF for developing advanced reasoning capabilities, allocating significant compute resources to verifiable domains like math and code while implementing rigorous contamination checks to ensure models are truly reasoning rather than recalling memorized solutions.
More from Lex Fridman Podcast
Origin story of OpenClaw: From 1-hour prototype to 180,000 GitHub stars | Peter Steinberger
Peter Steinberger explains how a 1-hour WhatsApp-to-CLI prototype evolved into OpenClaw, the fastest-growing GitHub repository in history (175,000+ stars), by creating a self-modifying AI agent that prioritizes fun and accessibility over corporate polish.
How to code with AI agents - Advice from OpenClaw creator | Peter Steinberger and Lex Fridman
Steinberger details his evolution to an 'agentic engineering' workflow using multiple CLI-based AI agents simultaneously, arguing that mastery requires developing empathy for how agents perceive limited context while embracing imperfection and concise prompts over complex orchestration.
Timeline to AGI: When will superhuman AI be created? | Lex Fridman Podcast
The conversation contrasts the "AI 2027" report's milestone-based path to AGI (superhuman coder → researcher → ASI by 2031) with the "jagged capabilities" view, concluding that while AI will automate significant software development tasks within months, fully autonomous research and general computer use remain distant due to specification challenges and uneven capability profiles.
Advice for beginners in AI: How to learn and what to build | Lex Fridman Podcast
Aspiring AI researchers should build small language models from scratch to master fundamentals, then specialize deeply in narrow areas like RLHF or character training, while carefully weighing the trade-off between academia's intellectual freedom and frontier labs' high compensation paired with an intense '996' work culture.