The RL Fine-Tuning Playbook: CoreWeave's Kyle Corbitt on GRPO, Rubrics, Environments, Reward Hacking

| Podcasts | May 01, 2026 | 3.06K views | 1:48:43

TL;DR

Kyle Corbitt argues that, unlike supervised fine-tuning (SFT), which destructively overwrites model weights and causes catastrophic forgetting, reinforcement learning (RL) minimally adjusts logits within the model's existing reasoning pathways. The result is a higher performance ceiling and lower inference cost on specific tasks, though frontier models may still dominate creative domains.

🧠 RL vs SFT: Mechanistic Differences (3 insights)

RL preserves pre-trained pathways while SFT overwrites them

RL updates only the specific tokens and log probabilities necessary to reach correct answers, working within the model's existing 'grooves,' whereas SFT indiscriminately changes entire sequences regardless of which parts were already correct.
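The contrast can be sketched with a toy gradient calculation (illustrative numbers and function names, not from the episode): SFT's cross-entropy gradient pushes up the probability of every target token uniformly, while a REINFORCE-style RL gradient is scaled by the rollout's advantage, so a neutral rollout produces no weight change at all.

```python
# Per-token log-probabilities of one sampled completion (hypothetical values).
logprobs = [-0.2, -1.5, -0.1, -3.0]
T = len(logprobs)

# SFT: the gradient of the mean negative log-likelihood w.r.t. each token's
# log-prob is a constant -1/T, so tokens the model already predicts well
# (-0.2) are rewritten as hard as tokens it gets wrong (-3.0).
sft_grads = [-1.0 / T for _ in logprobs]

# REINFORCE-style RL: the per-token gradient is -(advantage). A rollout
# whose reward matches the baseline (advantage 0) yields NO update, leaving
# the model's existing "grooves" untouched.
def rl_grads(advantage, logprobs):
    return [-advantage for _ in logprobs]

assert all(g != 0 for g in sft_grads)                 # SFT always rewrites
assert all(g == 0 for g in rl_grads(0.0, logprobs))   # RL skips neutral rollouts
```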

SFT causes catastrophic forgetting through scattered weight changes

Even with low learning rates, SFT scatters weight updates across the model and overrides pathways that may have been functioning correctly, while RL concentrates updates strictly on specific errors.

KL divergence penalties fail to solve SFT's inefficiency

While KL penalties prevent log probability drift from the base model, they cannot distinguish between tokens that needed correction versus those already acceptable, unlike RL's selective optimization.
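The limitation is visible in the common per-token KL estimator (a hedged sketch, not a specific lab's implementation): the penalty is just the log-ratio between policy and reference, so it fires on any drift from the base model, with no signal about whether the drifted token was a needed correction or a regression.

```python
# Per-token KL penalty estimate, k ~= log(pi(t) / pi_ref(t)), as used in
# RLHF-style objectives. Numbers are illustrative.
def kl_penalty(logprob_policy, logprob_ref):
    return logprob_policy - logprob_ref

policy_logprobs = [-0.2, -1.4, -0.1]   # token 2 drifted from the base model
ref_logprobs    = [-0.2, -1.5, -0.1]

penalties = [kl_penalty(p, r) for p, r in zip(policy_logprobs, ref_logprobs)]
# The penalty is nonzero wherever the policy moved, whether or not the
# move fixed an error -- it constrains drift but cannot target it.
```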

🌍 Industry Context & Strategic Implications (4 insights)

Chinese labs leverage LLM-as-judge for RL post-training

Chinese AI labs are using distillation strategies and LLM-as-judge evaluation within RL post-training loops—a development Corbitt considers more significant than supervised fine-tuning for fast-following frontier models.

Compute remains the primary moat for Western AI leadership

Despite algorithmic advances like DeepSeek's GRPO (Group Relative Policy Optimization), Chinese labs remain constrained primarily by access to compute, which Corbitt sees as the key barrier to catching up with American frontier labs.
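GRPO's core trick, in sketch form (illustrative rewards, not from the episode): sample a group of completions per prompt and normalize each reward against the group's mean and standard deviation, replacing the learned value function a PPO-style setup would need.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: reward standardized within the group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled completions for one prompt; two passed the verifier.
advs = group_advantages([1.0, 0.0, 0.0, 1.0])
# Passing completions get positive advantage, failing ones negative,
# and the advantages sum to zero within the group.
```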

Recursive self-improvement is already occurring

Corbitt argues the industry is already operating within a recursive self-improvement loop, with models being used to generate training data and evaluations for subsequent iterations.

RL environments have become a cottage industry

A specialized market of companies building reinforcement learning environments has emerged to serve frontier labs, though Corbitt views this as a limited long-term investment opportunity despite current demand.

⚙️ Practical Implementation & Deployment (3 insights)

Reward hacking is flagrant and manageable in narrow domains

While reward hacking is a valid concern, it manifests obviously in specific tasks—making it relatively easy to detect and mitigate compared to subtle alignment failures.
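In a narrow domain, hack detection can often ride alongside the main reward as cheap heuristics. A hypothetical sketch (task, function, and thresholds are all invented for illustration): an extraction task where the degenerate "copy everything" policy is trivially flagged.

```python
def reward(output: str, gold: str, source: str) -> float:
    """Reward for a toy extraction task, with guards against obvious hacks."""
    correct = float(gold in output)
    # Flagrant hacks in a narrow task are easy to spot: dumping the whole
    # source document, or padding the answer to game a recall-style check.
    hacked = output.strip() == source.strip() or len(output) > 4 * len(gold)
    return 0.0 if hacked else correct

# A genuine answer scores; the copy-the-document hack is zeroed out.
```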

LoRA adapters drive efficiency for CoreWeave customers

CoreWeave uses Low-Rank Adaptation (LoRA) adapters to provide efficient, serverless RL fine-tuning that reduces latency and inference costs dramatically compared to using frontier models.
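Why LoRA is cheap to train and serve, in back-of-the-envelope form (assumed dimensions, not CoreWeave's configuration): instead of updating a d×d weight matrix W, LoRA trains two low-rank factors A (d×r) and B (r×d) and serves W + AB, so the trainable and per-adapter storage footprint is only 2·d·r parameters.

```python
# Parameter count for one linear layer, full fine-tune vs LoRA.
d = 1024   # hypothetical hidden size
r = 8      # hypothetical LoRA rank

full_params = d * d        # update the whole weight matrix
lora_params = 2 * d * r    # train only A (d x r) and B (r x d)

# The adapter is a small fraction of the layer, which is what makes
# hot-swapping many adapters on shared base-model weights practical.
assert lora_params / full_params < 0.02
```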

RL excels on narrow tasks but may trail frontier models on creativity

For creative writing and open-ended tasks, frontier models likely still outperform RL-tuned open-source models, but RL delivers superior cost and latency profiles for well-defined, narrow applications.

Bottom Line

For narrow, well-defined tasks where latency and cost matter, investing in RL fine-tuning of open-source models likely yields better performance and efficiency than SFT or frontier APIs, provided you build robust evaluation rubrics to monitor for reward hacking.
