Intelligence with Everyone: RL @ MiniMax, with Olive Song, from AIE NYC & Inference by Turing Post

| Podcasts | February 22, 2026 | 60.6 Thousand views | 56:39

TL;DR

MiniMax researcher Olive Song details how their M2 model, with only 10 billion active parameters, achieves state-of-the-art coding and agentic performance through interleaved thinking patterns, systematic environment perturbations, and tight feedback loops with in-house expert developers.

🏢 Integrated Development & Expert Feedback

Tight feedback loops between research and applications

MiniMax uniquely builds both foundation models and user-facing applications in-house, allowing cross-functional teams to rapidly identify and fix model weaknesses through direct deployment experience.

Expert developers serve as human reward models

In-house developers actively participate in the training cycle by defining problems, refactoring repos, and providing precise reward signals on which model behaviors are reliable and useful.

🔄 Interleaved Thinking Architecture

Dynamic adaptation through interleaved thinking

M2 interleaves reasoning with tool execution: after each tool call the model observes the environment's feedback and re-thinks before acting again, sustaining this loop across 10 to 100 turns rather than reasoning once up front.
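The think, act, observe loop described above can be sketched in a few lines. This is a minimal illustration, not MiniMax's implementation: `Step`, `run_interleaved`, `toy_policy`, and `make_toy_tools` are hypothetical names, and the scripted policy stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    thought: str                   # reasoning produced before acting
    tool: Optional[str] = None     # tool to call next, if any
    argument: str = ""             # argument passed to the tool
    answer: Optional[str] = None   # set once the policy decides it is done


def run_interleaved(policy: Callable[[List[str]], Step],
                    tools: Dict[str, Callable[[str], str]],
                    task: str,
                    max_turns: int = 100) -> Optional[str]:
    """Alternate think -> act -> observe until an answer or the turn limit."""
    history = [f"task: {task}"]
    for _ in range(max_turns):
        step = policy(history)                  # re-think using all feedback so far
        history.append(f"thought: {step.thought}")
        if step.answer is not None:
            return step.answer
        observation = tools[step.tool](step.argument)   # act in the environment
        history.append(f"observation: {observation}")   # feed the result back
    return None                                 # turn budget exhausted


def make_toy_tools() -> Dict[str, Callable[[str], str]]:
    """Hypothetical environment: tests fail until a fix has been applied."""
    state = {"fixed": False}

    def run_tests(_: str) -> str:
        return "PASS" if state["fixed"] else "FAIL"

    def apply_fix(_: str) -> str:
        state["fixed"] = True
        return "patched"

    return {"run_tests": run_tests, "apply_fix": apply_fix}


def toy_policy(history: List[str]) -> Step:
    """Scripted stand-in for a model: decides from the latest event only."""
    last = history[-1]
    if last.startswith("task:"):
        return Step("Run the tests first.", tool="run_tests")
    if "FAIL" in last:
        return Step("Tests failed, so apply a fix.", tool="apply_fix")
    if "patched" in last:
        return Step("Fix applied, so re-run the tests.", tool="run_tests")
    return Step("Tests pass.", answer="done")


result = run_interleaved(toy_policy, make_toy_tools(), "make the tests pass")
```

Here the policy only changes course because it sees the `FAIL` observation. A single-pass reasoner would have committed to a plan before that feedback existed.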

Long-horizon workflow automation

This architecture enables autonomous handling of noisy, dynamic environments and complex multi-tool workflows using Gmail, Notion, and terminals with minimal human intervention.

🛡️ Training Robustness & Infrastructure

Perturbation pipelines enforce broad generalization

The team systematically varies training environments across tools, prompts, chat templates, and scaffolds to ensure generalization across the model's entire operational space.
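One way to picture such a perturbation pipeline is as independent sampling along each axis of variation. The sketch below is an assumption-laden illustration: the axis names follow the list above, but every value in `AXES` and the functions `sample_environment` and `sample_batch` are hypothetical placeholders rather than anything described in the talk.

```python
import random
from typing import Dict, List

# Hypothetical perturbation axes; all values are illustrative placeholders.
AXES: Dict[str, List[str]] = {
    "tool_set": ["bash_only", "bash+editor", "bash+editor+browser"],
    "system_prompt": ["terse", "verbose", "none"],
    "chat_template": ["template_a", "template_b"],
    "scaffold": ["scaffold_x", "scaffold_y"],
}


def sample_environment(rng: random.Random) -> Dict[str, str]:
    """Draw one training environment by picking a value on every axis."""
    return {axis: rng.choice(options) for axis, options in AXES.items()}


def sample_batch(n: int, seed: int = 0) -> List[Dict[str, str]]:
    """Draw n perturbed environments reproducibly from a seeded generator."""
    rng = random.Random(seed)
    return [sample_environment(rng) for _ in range(n)]


batch = sample_batch(50)
```

Sampling each axis independently means the model rarely sees the same combination of tools, prompt, template, and scaffold twice, which is the property that pushes it to generalize rather than memorize one setup.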

Combatting reward hacking with FP32 precision

To prevent the model from exploiting reward signals, the team runs reinforcement learning at FP32 precision and engages in meticulous debugging of training dynamics.

Small parameter count enables multi-agent scaling

At only 10 billion active parameters, M2 is cost-efficient enough to deploy multiple parallel copies for concurrent research, writing, and analysis tasks.
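As a rough sketch of what running several copies side by side could look like, the example below fans tasks out to a thread pool. `run_agent` is a hypothetical placeholder for a call to one model instance, and `run_parallel` is an illustrative helper, not part of any MiniMax API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def run_agent(role: str, task: str) -> str:
    """Placeholder for one model copy; a real version would call an inference endpoint."""
    return f"[{role}] finished: {task}"


def run_parallel(assignments: Dict[str, str],
                 worker: Callable[[str, str], str] = run_agent) -> Dict[str, str]:
    """Run one worker per role concurrently and collect results by role."""
    with ThreadPoolExecutor(max_workers=max(1, len(assignments))) as pool:
        futures = {role: pool.submit(worker, role, task)
                   for role, task in assignments.items()}
        return {role: future.result() for role, future in futures.items()}


results = run_parallel({
    "research": "collect sources",
    "writing": "draft the summary",
    "analysis": "check the numbers",
})
```

Threads suit this pattern when each worker mostly waits on a remote inference call, so the cost per extra copy is dominated by the model's serving cost rather than by local compute.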

Bottom Line

Build robust agentic models by implementing interleaved thinking architectures, systematically perturbing training environments to force generalization, and embedding expert developers directly into the RL feedback loop.
