The RL Fine-Tuning Playbook: CoreWeave's Kyle Corbitt on GRPO, Rubrics, Environments, Reward Hacking
TL;DR
Kyle Corbitt argues that supervised fine-tuning (SFT) destructively overwrites model weights and causes catastrophic forgetting, whereas reinforcement learning (RL) improves performance by minimally adjusting logits within the model's existing reasoning pathways. The result is a higher performance ceiling and lower inference cost on specific tasks, though frontier models may still dominate creative domains.
🧠 RL vs SFT: Mechanistic Differences (3 insights)
RL preserves pre-trained pathways while SFT overwrites them
RL updates only the specific tokens and log probabilities necessary to reach correct answers, working within the model's existing 'grooves,' whereas SFT indiscriminately changes entire sequences regardless of which parts were already correct.
SFT causes catastrophic forgetting through scattered weight changes
Even with low learning rates, SFT scatters weight updates across the model and overrides pathways that may have been functioning correctly, while RL concentrates updates strictly on specific errors.
KL divergence penalties fail to solve SFT's inefficiency
KL penalties limit how far log probabilities drift from the base model, but they cannot distinguish between tokens that needed correction and tokens that were already acceptable. RL's selective optimization can.
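To make the point about KL penalties concrete, here is a minimal sketch of a per-position KL penalty over toy next-token distributions. The function names, the `beta` coefficient, and the three-token vocabulary are illustrative assumptions, not anything described in the episode. The penalty only measures how far each position moved from the base model; nothing in it records whether the move fixed an error.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def sequence_kl_penalty(tuned_dists, base_dists, beta=0.1):
    """Sum the per-position KL and scale by beta (a hypothetical coefficient).

    Every position is treated identically: the penalty has no notion of
    which tokens were wrong and which were already acceptable.
    """
    return beta * sum(kl_divergence(p, q) for p, q in zip(tuned_dists, base_dists))

# Toy example: two token positions, three-token vocabulary.
base = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]
tuned = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]  # only position 1 has moved

penalty = sequence_kl_penalty(tuned, base)
print(f"KL penalty: {penalty:.4f}")
```

Position 0 contributes nothing because it is unchanged, and position 1 is charged for its shift whether or not that shift was the correction the task needed.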
🌍 Industry Context & Strategic Implications (4 insights)
Chinese labs leverage LLM-as-judge for RL post-training
Chinese AI labs are using distillation strategies and LLM-as-judge evaluation within RL post-training loops—a development Corbitt considers more significant than supervised fine-tuning for fast-following frontier models.
Compute remains the primary moat for Western AI leadership
Despite algorithmic advances like DeepSeek's GRPO, Chinese companies remain constrained primarily by compute access, the key barrier to catching up with American frontier labs.
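For readers unfamiliar with GRPO, its core idea is to score each sampled completion relative to the other completions drawn for the same prompt, rather than against a learned value function. The sketch below shows only that group-relative normalization step. The function name, the `eps` term, and the use of population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds the scores of several completions sampled for ONE prompt.
    `eps` (an assumed small constant) avoids division by zero when all
    completions receive the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four completions for one prompt, scored 0/1 by a hypothetical grader.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)  # roughly [1, -1, -1, 1]
```

When every completion in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal.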
Recursive self-improvement is already occurring
Corbitt argues the industry is already operating within a recursive self-improvement loop, with models being used to generate training data and evaluations for subsequent iterations.
RL environments have become a cottage industry
A specialized market of companies building reinforcement learning environments has emerged to serve frontier labs, though Corbitt views this as a limited long-term investment opportunity despite current demand.
⚙️ Practical Implementation & Deployment (3 insights)
Reward hacking is flagrant and manageable in narrow domains
Reward hacking is a valid concern, but in specific tasks it shows up obviously, which makes it relatively easy to detect and mitigate compared with subtle alignment failures.
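One simple way to surface flagrant reward hacking is to track an independent held-out score alongside the training reward and flag checkpoints where the two diverge. This is a hypothetical heuristic, not a method described in the episode; the function name, the `min_gap` threshold, and the example numbers are all assumptions.

```python
def flag_reward_hacking(train_rewards, heldout_scores, min_gap=0.2):
    """Return checkpoint indices where training reward has improved much
    more than the held-out score, both measured from checkpoint 0.

    `min_gap` is an assumed threshold; tune it per task.
    """
    flagged = []
    for i in range(1, len(train_rewards)):
        train_gain = train_rewards[i] - train_rewards[0]
        heldout_gain = heldout_scores[i] - heldout_scores[0]
        if train_gain - heldout_gain > min_gap:
            flagged.append(i)
    return flagged

# Made-up curves: training reward climbs while the held-out rubric stalls.
train = [0.40, 0.60, 0.80, 0.95]
heldout = [0.40, 0.55, 0.50, 0.45]
print(flag_reward_hacking(train, heldout))
```

A flagged checkpoint is a prompt to read sample outputs by hand, since the divergence alone does not show what the model is exploiting.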
LoRA adapters drive efficiency for CoreWeave customers
CoreWeave uses Low-Rank Adaptation (LoRA) adapters to provide efficient, serverless RL fine-tuning that reduces latency and inference costs dramatically compared to using frontier models.
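As background on why LoRA adapters are cheap to train and serve, the sketch below shows the standard LoRA formulation, in which a frozen weight matrix W is augmented with a low-rank update scaled by alpha / rank. It is a generic illustration in plain Python, not CoreWeave's implementation; the helper names and the example dimensions are assumptions.

```python
def lora_param_counts(d_in, d_out, rank):
    """Compare trainable parameters: full matrix vs. a rank-`rank` adapter."""
    full = d_in * d_out
    adapter = rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank
    return full, adapter

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(x, W, A, B, alpha, rank):
    """Compute y = W x + (alpha / rank) * B (A x); W stays frozen."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * d for b, d in zip(base, delta)]

full, adapter = lora_param_counts(4096, 4096, 8)
print(full, adapter)  # 16777216 vs 65536 trainable parameters

# With B initialized to zeros, the adapter leaves the base output unchanged.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B = [[0.0], [0.0]]
y = lora_forward([1.0, 2.0], W, A, B, alpha=16, rank=1)
```

Because only A and B are trained, many task-specific adapters can share one set of frozen base weights, which is one reason adapter-based serving can be inexpensive.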
RL excels on narrow tasks but may trail frontier models on creativity
For creative writing and open-ended tasks, frontier models likely still outperform RL-tuned open-source models, but RL delivers superior cost and latency profiles for well-defined, narrow applications.
Bottom Line
For narrow, well-defined tasks where latency and cost matter, investing in RL fine-tuning of open-source models likely yields better performance and efficiency than SFT or frontier APIs, provided you build robust evaluation rubrics to monitor for reward hacking.