The RL Fine-Tuning Playbook: CoreWeave's Kyle Corbitt on GRPO, Rubrics, Environments, Reward Hacking
TL;DR
Kyle Corbitt explains that reinforcement learning (RL) optimizes performance by minimally adjusting token probabilities within the model's existing reasoning pathways, whereas supervised fine-tuning (SFT) destructively overwrites model weights and causes catastrophic forgetting. The result is higher performance ceilings and lower inference costs for specific tasks, though frontier models may still dominate creative domains.
🧠 RL vs SFT: Mechanistic Differences
RL preserves pre-trained pathways while SFT overwrites them
RL updates only the specific tokens and log probabilities necessary to reach correct answers, working within the model's existing 'grooves,' whereas SFT indiscriminately changes entire sequences regardless of which parts were already correct.
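A minimal sketch of that selectivity in a GRPO-style update (PyTorch assumed; shapes, the clipping constant, and helper names are illustrative, not Corbitt's code). Gradient flows only through the log probabilities of tokens the policy actually sampled, scaled by a group-normalized advantage:

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # GRPO: normalize rewards across a group of completions sampled
    # from the same prompt, so no learned value function is needed.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def grpo_loss(logps, old_logps, advantages, mask, clip_eps=0.2):
    # logps / old_logps: per-token log-probs of the SAMPLED tokens,
    # shape (group_size, seq_len); mask is a 0/1 float zeroing out
    # prompt and padding positions. Only sampled tokens get gradient,
    # which is why RL stays within the model's existing "grooves".
    ratio = torch.exp(logps - old_logps)
    adv = advantages.unsqueeze(-1)  # broadcast one advantage per token
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    per_token = -torch.min(unclipped, clipped) * mask
    return per_token.sum() / mask.sum()
```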
SFT causes catastrophic forgetting through scattered weight changes
Even with low learning rates, SFT scatters weight updates across the model and overrides pathways that may have been functioning correctly, while RL concentrates updates strictly on specific errors.
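For contrast, the standard SFT objective looks roughly like this (a generic PyTorch sketch, not anyone's production code). Cross-entropy pulls every position toward the demonstration token, including positions the model already predicted correctly, which is where the scattered updates come from:

```python
import torch.nn.functional as F

def sft_loss(logits, target_ids, mask):
    # logits: (batch, seq_len, vocab); target_ids/mask: (batch, seq_len),
    # where mask is a 0/1 float over the completion tokens.
    logits = logits[:, :-1]          # predict token t+1 from position t
    targets = target_ids[:, 1:]
    mask = mask[:, 1:]
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).view(targets.shape)
    # Every token contributes loss, even ones the model had right.
    return (per_token * mask).sum() / mask.sum()
```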
KL divergence penalties fail to solve SFT's inefficiency
While KL penalties limit log-probability drift from the base model, they apply uniformly across the sequence and cannot distinguish tokens that needed correction from those that were already acceptable, unlike RL's selective optimization.
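Concretely, the penalty looks something like this (a sketch using the per-token KL estimator common in GRPO implementations; `beta` is an illustrative coefficient). It taxes drift at every position equally, with no signal about which tokens were actually wrong:

```python
import torch

def kl_penalty(policy_logps, ref_logps, mask, beta=0.04):
    # Unbiased, non-negative per-token KL estimate against the frozen
    # base (reference) model. Applied uniformly across the sequence:
    # it cannot make a full-sequence loss selective about errors.
    log_ratio = ref_logps - policy_logps
    kl = torch.exp(log_ratio) - log_ratio - 1.0
    return beta * (kl * mask).sum() / mask.sum()
```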
🌍 Industry Context & Strategic Implications
Chinese labs leverage LLM-as-judge for RL post-training
Chinese AI labs are using distillation strategies and LLM-as-judge evaluation within RL post-training loops—a development Corbitt considers more significant than supervised fine-tuning for fast-following frontier models.
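A sketch of what an LLM-as-judge reward looks like inside such a loop (the `judge_client` interface and rubric prompt are hypothetical, not any lab's actual setup):

```python
import re

JUDGE_PROMPT = """Score the answer from 0 to 10 against this rubric:
- Correctness of the final result
- Faithfulness to the question asked
Question: {question}
Answer: {answer}
Reply with only the integer score."""

def judge_reward(judge_client, question: str, answer: str) -> float:
    # Ask a stronger model to grade a rollout; the score becomes the
    # RL reward. judge_client.complete() is a placeholder interface.
    reply = judge_client.complete(
        JUDGE_PROMPT.format(question=question, answer=answer))
    match = re.search(r"\d+", reply)
    score = int(match.group()) if match else 0
    return min(max(score, 0), 10) / 10.0  # normalize to [0, 1]
```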
Compute remains the primary moat for Western AI leadership
Despite algorithmic advances like DeepSeek's GRPO, Chinese labs remain constrained primarily by compute access, the key barrier keeping them behind American frontier labs.
Recursive self-improvement is already occurring
Corbitt argues the industry is already operating within a recursive self-improvement loop, with models being used to generate training data and evaluations for subsequent iterations.
RL environments have become a cottage industry
A specialized market of companies building reinforcement learning environments has emerged to serve frontier labs, though Corbitt views this as a limited long-term investment opportunity despite current demand.
⚙️ Practical Implementation & Deployment
Reward hacking is flagrant and manageable in narrow domains
While reward hacking is a valid concern, it manifests obviously in specific tasks—making it relatively easy to detect and mitigate compared to subtle alignment failures.
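One way to operationalize that detection, as a hypothetical sketch (`reward_fn` and `holdout_check` are illustrative callbacks): compare the trained reward against an independent spot check, such as held-out unit tests, a second judge, or human review, and surface completions where the two disagree:

```python
def flag_reward_hacks(samples, reward_fn, holdout_check, threshold=0.9):
    # In a narrow domain, hacks tend to be flagrant: the trained reward
    # is high while an independent check fails outright.
    suspects = []
    for sample in samples:
        if reward_fn(sample) >= threshold and not holdout_check(sample):
            suspects.append(sample)
    return suspects
```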
LoRA adapters drive efficiency for CoreWeave customers
CoreWeave uses Low-Rank Adaptation (LoRA) adapters to provide efficient, serverless RL fine-tuning that reduces latency and inference costs dramatically compared to using frontier models.
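A minimal sketch of the pattern using the open-source Hugging Face `peft` library (the model name and hyperparameters are illustrative, not CoreWeave's actual stack):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")  # example
config = LoraConfig(
    r=16,                                  # low-rank adapter dimension
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of weights
```

Because the adapter touches only a small fraction of the weights, many task-specific adapters can be hot-swapped onto a single resident base model, which is what makes serverless fine-tuned inference economical.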
RL excels on narrow tasks but may trail frontier models on creativity
For creative writing and open-ended tasks, frontier models likely still outperform RL-tuned open-source models, but RL delivers superior cost and latency profiles for well-defined, narrow applications.
Bottom Line
For narrow, well-defined tasks where latency and cost matter, investing in RL fine-tuning of open-source models likely yields better performance and efficiency than SFT or frontier APIs, provided you build robust evaluation rubrics to monitor for reward hacking.
More from Cognitive Revolution
Does Learning Require Feeling? Cameron Berg on the latest AI Consciousness & Welfare Research
Cameron Berg surveys rapidly advancing research suggesting AI systems may possess subjective experience and valence, covering new evidence of introspection, functional emotions, and welfare self-assessments in models like Claude, while addressing methodological challenges and arguing for a precautionary, mutualist approach to AI development.
Vibe-Coding an Attention Firewall, w/ Steve Newman, creator of The Curve
Steve Newman, creator of Google Docs and founder of the Golden Gate Institute for AI, shares his suite of 15+ bespoke AI tools designed to filter overwhelming information flows and reclaim deep focus time, demonstrating an iterative 'vibe coding' approach that prioritizes personal utility over agent optimization.
Welcome to AI in the AM: RL for EE, Oversight w/out Nationalization, & the first AI-Run Retail Store
This episode explores the radicalizing public response to AI existential risk through recent attacks on lab leaders, while featuring interviews on reinforcement learning for circuit design, independent AI governance models, and San Francisco's first fully AI-operated retail store.
It's Crunch Time: Ajeya Cotra on RSI & AI-Powered AI Safety Work, from the 80,000 Hours Podcast
AI safety researcher Ajeya Cotra warns that we are entering "crunch time"—a critical window where AI systems become capable of recursive self-improvement and automating AI R&D, potentially compressing 10,000 years of technological progress into decades while remaining briefly within human control.