Intelligence with Everyone: RL @ MiniMax, with Olive Song, from AIE NYC & Inference by Turing Post

Cognitive Revolution

| Podcasts | February 22, 2026 | 60.6 Thousand views | 56:39

TL;DR

MiniMax researcher Olive Song details how their M2 model, with 10B active parameters, achieves state-of-the-art coding and agentic performance through interleaved thinking patterns, systematic environment perturbations, and tight feedback loops with in-house expert developers.

🏢 Integrated Development & Expert Feedback 2 insights

Tight feedback loops between research and applications

MiniMax builds both foundation models and user-facing applications in-house, so cross-functional teams can rapidly identify and fix model weaknesses through direct deployment experience.

Expert developers serve as human reward models

In-house developers actively participate in the training cycle by defining problems, refactoring repos, and providing precise reward signals on which model behaviors are reliable and useful.

🔄 Interleaved Thinking Architecture 2 insights

Dynamic adaptation through interleaved thinking

M2 interleaves reasoning with tool execution: after each tool call the model observes the environment's feedback and re-thinks before acting again, over 10 to 100 turns, rather than reasoning once in a single pass.
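The think, act, observe cycle described above can be sketched as a simple loop. This is a toy illustration only: `run_interleaved`, `toy_policy`, and `toy_shell` are hypothetical names, and the policy is a hand-written stub standing in for the model, not MiniMax's implementation.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical step type for illustration: (thought, tool_name, tool_argument).
Step = Tuple[str, str, str]

def run_interleaved(policy: Callable[[List[str]], Step],
                    tools: Dict[str, Callable[[str], str]],
                    task: str,
                    max_turns: int = 100) -> List[str]:
    """Alternate thinking and tool calls, feeding each observation back in."""
    transcript = [f"task: {task}"]
    for _ in range(max_turns):
        # The policy sees the full transcript, including the latest observation.
        thought, tool_name, arg = policy(transcript)
        transcript.append(f"think: {thought}")
        if tool_name == "finish":
            transcript.append(f"answer: {arg}")
            break
        observation = tools[tool_name](arg)
        transcript.append(f"observe[{tool_name}]: {observation}")
    return transcript

def toy_policy(transcript: List[str]) -> Step:
    """Stub policy: reacts to the most recent line of the transcript."""
    last = transcript[-1]
    if last.startswith("task:"):
        return ("I should run the tests first.", "shell", "run tests")
    if "FAILED" in last:
        return ("The tests did not pass; apply a fix and rerun.", "shell", "fix and run tests")
    return ("The tests pass; done.", "finish", "all tests green")

def toy_shell(cmd: str) -> str:
    """Stub tool: tests fail until a command starting with 'fix' is issued."""
    return "1 passed" if cmd.startswith("fix") else "1 FAILED"

transcript = run_interleaved(toy_policy, {"shell": toy_shell}, "make tests pass")
```

The point of the sketch is that the second thought depends on the first observation, which a single up-front reasoning pass could not have seen.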

Long-horizon workflow automation

This architecture enables autonomous handling of noisy, dynamic environments and complex multi-tool workflows using Gmail, Notion, and terminals with minimal human intervention.

🛡️ Training Robustness & Infrastructure 3 insights

Perturbation pipelines enforce broad generalization

The team systematically varies training environments across tools, prompts, chat templates, and scaffolds to ensure generalization across the model's entire operational space.
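One way to picture such a perturbation pipeline is as sampling from the cross product of environment axes. The sketch below is an assumption-laden illustration: the axis names follow the summary, but every value in `AXES` and both function names are hypothetical placeholders, not MiniMax's actual configuration.

```python
import itertools
import random
from typing import Dict, List

# Hypothetical axes and values; the real perturbation space is not given in the source.
AXES: Dict[str, List[str]] = {
    "tool_schema": ["json_v1", "json_v2", "xml"],
    "system_prompt": ["terse", "verbose", "none"],
    "chat_template": ["template_a", "template_b"],
    "scaffold": ["scaffold_x", "scaffold_y"],
}

def all_configs(axes: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Enumerate every combination of axis values (the full operational space)."""
    keys = list(axes)
    return [dict(zip(keys, combo))
            for combo in itertools.product(*(axes[k] for k in keys))]

def sample_config(axes: Dict[str, List[str]], rng: random.Random) -> Dict[str, str]:
    """Draw one environment configuration, choosing each axis independently."""
    return {key: rng.choice(values) for key, values in axes.items()}

rng = random.Random(0)  # seeded so a training run's environment draws are reproducible
space = all_configs(AXES)
episode_config = sample_config(AXES, rng)
```

Sampling a fresh configuration per episode means the model rarely sees the same combination of tools, prompt, template, and scaffold twice, which discourages overfitting to any single setup.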

Combatting reward hacking with FP32 precision

To prevent the model from exploiting reward signals, the team runs reinforcement learning at FP32 precision and engages in meticulous debugging of training dynamics.
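The summary does not say which parts of training run in FP32 or why precision matters. A common reason in RL generally is that rounding error in logits shifts token log-probabilities, which the policy-gradient update depends on. The sketch below illustrates that general effect under this assumption, using only the standard library: logits are rounded to half or single precision and the resulting log-softmax is compared against a float64 reference. The logit values are arbitrary.

```python
import math
import struct
from typing import List

def round_to(fmt: str, x: float) -> float:
    """Round a Python float to the precision of a struct format ('<e' = FP16, '<f' = FP32)."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax computed in float64."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def max_logprob_error(logits: List[float], fmt: str) -> float:
    """Largest absolute log-probability error caused by rounding the logits."""
    reference = log_softmax(logits)
    rounded = log_softmax([round_to(fmt, v) for v in logits])
    return max(abs(a - b) for a, b in zip(reference, rounded))

logits = [12.3456789, 11.9876543, -3.1415926, 0.0001234]  # arbitrary example values
err_fp16 = max_logprob_error(logits, "<e")
err_fp32 = max_logprob_error(logits, "<f")
print(f"max log-prob error: FP16={err_fp16:.2e}, FP32={err_fp32:.2e}")
```

In this toy setting the FP32 error is several orders of magnitude smaller than the FP16 error; whether that difference matters in practice depends on the training setup.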

Small parameter count enables multi-agent scaling

At only 10 billion active parameters, M2 is cost-efficient enough to deploy multiple parallel copies for concurrent research, writing, and analysis tasks.
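Running several copies side by side can be sketched as a simple fan-out. In this illustration `call_model` is a hypothetical stub standing in for a request to a served model, and the role names simply mirror the tasks mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict

def call_model(role: str, prompt: str) -> str:
    """Hypothetical stub; a real version would send the prompt to a model endpoint."""
    return f"[{role}] handled: {prompt}"

def run_parallel(jobs: Dict[str, str]) -> Dict[str, str]:
    """Submit one request per role and collect the results once all finish."""
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        futures = {role: pool.submit(call_model, role, prompt)
                   for role, prompt in jobs.items()}
        return {role: future.result() for role, future in futures.items()}

results = run_parallel({
    "research": "collect sources on topic X",
    "writing": "draft the introduction",
    "analysis": "summarize the benchmark table",
})
```

Threads suit this pattern because each worker mostly waits on a network response; the cost argument in the summary is about how many such concurrent requests a deployment can afford to serve.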

Bottom Line

Build robust agentic models by implementing interleaved thinking architectures, systematically perturbing training environments to force generalization, and embedding expert developers directly into the RL feedback loop.

