The RL Fine-Tuning Playbook: CoreWeave's Kyle Corbitt on GRPO, Rubrics, Environments, Reward Hacking

| Podcasts | May 01, 2026 | 3.06K views | 1:48:43

TL;DR

Kyle Corbitt argues that, unlike supervised fine-tuning (SFT), which destructively overwrites model weights and causes catastrophic forgetting, reinforcement learning (RL) minimally adjusts logits within the model's existing reasoning pathways. The result is a higher performance ceiling and lower inference cost on specific tasks, though frontier models may still dominate creative domains.

🧠 RL vs SFT: Mechanistic Differences (3 insights)

RL preserves pre-trained pathways while SFT overwrites them

RL updates only the specific tokens and log probabilities necessary to reach correct answers, working within the model's existing 'grooves,' whereas SFT indiscriminately changes entire sequences regardless of which parts were already correct.
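The contrast can be sketched with a toy gradient calculation (illustrative numbers and function names, not from the episode): SFT's cross-entropy gradient pushes up the probability of every target token uniformly, while a REINFORCE-style RL gradient is scaled by the rollout's advantage, so a neutral rollout produces no weight change at all.

```python
# Per-token log-probabilities of one sampled completion (hypothetical values).
logprobs = [-0.2, -1.5, -0.1, -3.0]
T = len(logprobs)

# SFT: the gradient of the mean negative log-likelihood w.r.t. each token's
# log-prob is a constant -1/T, so tokens the model already predicts well
# (-0.2) are rewritten as hard as tokens it gets wrong (-3.0).
sft_grads = [-1.0 / T for _ in logprobs]

# REINFORCE-style RL: the per-token gradient is -(advantage). A rollout
# whose reward matches the baseline (advantage 0) yields NO update, leaving
# the model's existing "grooves" untouched.
def rl_grads(advantage, logprobs):
    return [-advantage for _ in logprobs]

assert all(g != 0 for g in sft_grads)                 # SFT always rewrites
assert all(g == 0 for g in rl_grads(0.0, logprobs))   # RL skips neutral rollouts
```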

SFT causes catastrophic forgetting through scattered weight changes

Even with low learning rates, SFT scatters weight updates across the model and overrides pathways that may have been functioning correctly, while RL concentrates updates strictly on specific errors.

KL divergence penalties fail to solve SFT's inefficiency

While KL penalties prevent log probability drift from the base model, they cannot distinguish between tokens that needed correction versus those already acceptable, unlike RL's selective optimization.
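The limitation is visible in the common per-token KL estimator (a hedged sketch, not a specific lab's implementation): the penalty is just the log-ratio between policy and reference, so it fires on any drift from the base model, with no signal about whether the drifted token was a needed correction or a regression.

```python
# Per-token KL penalty estimate, k ~= log(pi(t) / pi_ref(t)), as used in
# RLHF-style objectives. Numbers are illustrative.
def kl_penalty(logprob_policy, logprob_ref):
    return logprob_policy - logprob_ref

policy_logprobs = [-0.2, -1.4, -0.1]   # token 2 drifted from the base model
ref_logprobs    = [-0.2, -1.5, -0.1]

penalties = [kl_penalty(p, r) for p, r in zip(policy_logprobs, ref_logprobs)]
# The penalty is nonzero wherever the policy moved, whether or not the
# move fixed an error -- it constrains drift but cannot target it.
```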

🌍 Industry Context & Strategic Implications (4 insights)

Chinese labs leverage LLM-as-judge for RL post-training

Chinese AI labs are using distillation strategies and LLM-as-judge evaluation within RL post-training loops—a development Corbitt considers more significant than supervised fine-tuning for fast-following frontier models.

Compute remains the primary moat for Western AI leadership

Despite algorithmic advances like DeepSeek's GRPO (Group Relative Policy Optimization), Chinese labs remain constrained primarily by access to compute, which Corbitt sees as the key barrier to catching up with American frontier labs.
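GRPO's core trick, in sketch form (illustrative rewards, not from the episode): sample a group of completions per prompt and normalize each reward against the group's mean and standard deviation, replacing the learned value function a PPO-style setup would need.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: reward standardized within the group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled completions for one prompt; two passed the verifier.
advs = group_advantages([1.0, 0.0, 0.0, 1.0])
# Passing completions get positive advantage, failing ones negative,
# and the advantages sum to zero within the group.
```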

Recursive self-improvement is already occurring

Corbitt argues the industry is already operating within a recursive self-improvement loop, with models being used to generate training data and evaluations for subsequent iterations.

RL environments have become a cottage industry

A specialized market of companies building reinforcement learning environments has emerged to serve frontier labs, though Corbitt views this as a limited long-term investment opportunity despite current demand.

⚙️ Practical Implementation & Deployment (3 insights)

Reward hacking is flagrant and manageable in narrow domains

While reward hacking is a valid concern, it manifests obviously in specific tasks—making it relatively easy to detect and mitigate compared to subtle alignment failures.
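In a narrow domain, hack detection can often ride alongside the main reward as cheap heuristics. A hypothetical sketch (task, function, and thresholds are all invented for illustration): an extraction task where the degenerate "copy everything" policy is trivially flagged.

```python
def reward(output: str, gold: str, source: str) -> float:
    """Reward for a toy extraction task, with guards against obvious hacks."""
    correct = float(gold in output)
    # Flagrant hacks in a narrow task are easy to spot: dumping the whole
    # source document, or padding the answer to game a recall-style check.
    hacked = output.strip() == source.strip() or len(output) > 4 * len(gold)
    return 0.0 if hacked else correct

# A genuine answer scores; the copy-the-document hack is zeroed out.
```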

LoRA adapters drive efficiency for CoreWeave customers

CoreWeave uses Low-Rank Adaptation (LoRA) adapters to provide efficient, serverless RL fine-tuning that reduces latency and inference costs dramatically compared to using frontier models.
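Why LoRA is cheap to train and serve, in back-of-the-envelope form (assumed dimensions, not CoreWeave's configuration): instead of updating a d×d weight matrix W, LoRA trains two low-rank factors A (d×r) and B (r×d) and serves W + AB, so the trainable and per-adapter storage footprint is only 2·d·r parameters.

```python
# Parameter count for one linear layer, full fine-tune vs LoRA.
d = 1024   # hypothetical hidden size
r = 8      # hypothetical LoRA rank

full_params = d * d        # update the whole weight matrix
lora_params = 2 * d * r    # train only A (d x r) and B (r x d)

# The adapter is a small fraction of the layer, which is what makes
# hot-swapping many adapters on shared base-model weights practical.
assert lora_params / full_params < 0.02
```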

RL excels on narrow tasks but may trail frontier models on creativity

For creative writing and open-ended tasks, frontier models likely still outperform RL-tuned open-source models, but RL delivers superior cost and latency profiles for well-defined, narrow applications.

Bottom Line

For narrow, well-defined tasks where latency and cost matter, investing in RL fine-tuning of open-source models likely yields better performance and efficiency than SFT or frontier APIs, provided you build robust evaluation rubrics to monitor for reward hacking.
