Intelligence with Everyone: RL @ MiniMax, with Olive Song, from AIE NYC & Inference by Turing Post
TL;DR
MiniMax researcher Olive Song details how the company's M2 model, with 10 billion active parameters, achieves state-of-the-art coding and agentic performance through interleaved thinking patterns, systematic environment perturbations, and tight feedback loops with in-house expert developers.
Integrated Development & Expert Feedback
Tight feedback loops between research and applications
MiniMax uniquely builds both foundation models and user-facing applications in-house, allowing cross-functional teams to rapidly identify and fix model weaknesses through direct deployment experience.
Expert developers serve as human reward models
In-house developers actively participate in the training cycle by defining problems, refactoring repos, and providing precise reward signals on which model behaviors are reliable and useful.
Interleaved Thinking Architecture
Dynamic adaptation through interleaved thinking
M2 interleaves reasoning with tool execution, allowing the model to observe environmental feedback and re-think before acting again across 10-100 turns rather than using single-pass reasoning.
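The think, act, observe cycle described above can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not MiniMax's implementation: `Step`, `run_interleaved`, `toy_policy`, and `run_tests` are hypothetical names, and the scripted policy stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    """One model turn: a thought plus either a tool call or a final answer."""
    thought: str
    tool: Optional[str] = None  # None means the model has finished
    arg: str = ""
    answer: str = ""


History = List[Tuple[str, str]]


def run_interleaved(policy: Callable[[History], Step],
                    tools: Dict[str, Callable[[str], str]],
                    task: str,
                    max_turns: int = 100) -> Tuple[Optional[str], History]:
    """Alternate thinking and tool use until the policy answers or turns run out."""
    history: History = [("task", task)]
    for _ in range(max_turns):
        step = policy(history)  # think, conditioned on every earlier observation
        history.append(("thought", step.thought))
        if step.tool is None:
            history.append(("answer", step.answer))
            return step.answer, history
        observation = tools[step.tool](step.arg)  # act
        history.append(("observation", observation))  # observe, then re-think
    return None, history


def toy_policy(history: History) -> Step:
    """Hypothetical scripted policy standing in for the model."""
    observations = [text for kind, text in history if kind == "observation"]
    if not observations:
        return Step(thought="Run the tests first.", tool="run_tests", arg="v1")
    if "FAILED" in observations[-1]:
        return Step(thought="Tests failed; try the patched version.",
                    tool="run_tests", arg="v2")
    return Step(thought="Tests pass; report back.",
                answer=f"done after {len(observations)} tool calls")


def run_tests(version: str) -> str:
    """Hypothetical tool: only the patched version passes."""
    return "PASSED" if version == "v2" else "FAILED: test_parse"


answer, history = run_interleaved(toy_policy, {"run_tests": run_tests},
                                  "fix the parser")
```

The point of the sketch is that each new thought is conditioned on the latest observation, so a failed tool call changes the next action instead of being ignored as it would be in a single-pass plan.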
Long-horizon workflow automation
This architecture enables autonomous handling of noisy, dynamic environments and complex multi-tool workflows using Gmail, Notion, and terminals with minimal human intervention.
Training Robustness & Infrastructure
Perturbation pipelines enforce broad generalization
The team systematically varies training environments across tools, prompts, chat templates, and scaffolds to ensure generalization across the model's entire operational space.
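One way to picture such a perturbation pipeline is as sampling over a grid of environment settings. The sketch below is an assumption-laden illustration: the four axes come from the summary, but every value in `AXES` and both helper functions are hypothetical placeholders, not MiniMax's actual configuration.

```python
import itertools
import random
from typing import Dict, List

# Axes named in the summary; the values are hypothetical placeholders.
AXES: Dict[str, List[str]] = {
    "tool_set": ["bash_only", "bash+editor", "bash+editor+browser"],
    "system_prompt": ["terse", "verbose", "none"],
    "chat_template": ["template_a", "template_b"],
    "scaffold": ["scaffold_x", "scaffold_y"],
}


def all_variants(axes: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Enumerate the full grid of environment configurations."""
    keys = list(axes)
    return [dict(zip(keys, combo))
            for combo in itertools.product(*(axes[k] for k in keys))]


def sample_variants(axes: Dict[str, List[str]], n: int,
                    seed: int = 0) -> List[Dict[str, str]]:
    """Draw n configurations, choosing each axis independently."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in axes.items()} for _ in range(n)]


grid = all_variants(AXES)          # 3 * 3 * 2 * 2 configurations
batch = sample_variants(AXES, n=8)  # one perturbed batch of training environments
```

Enumerating the grid is useful for coverage checks, while independent sampling keeps each training batch varied when the full grid is too large to run.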
Combatting reward hacking with FP32 precision
To prevent the model from exploiting reward signals, the team runs reinforcement learning at FP32 precision and engages in meticulous debugging of training dynamics.
Small parameter count enables multi-agent scaling
At only 10 billion active parameters, M2 is cost-efficient enough to deploy multiple parallel copies for concurrent research, writing, and analysis tasks.
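Running several copies side by side might look like the following fan-out. This is a rough sketch only: `run_agent` is a hypothetical stand-in for a call to a served model, and the role names mirror the research, writing, and analysis tasks mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict


def run_agent(role: str, task: str) -> str:
    """Hypothetical placeholder for one model copy working on one task."""
    return f"[{role}] finished: {task}"


def fan_out(assignments: Dict[str, str], max_workers: int = 3) -> Dict[str, str]:
    """Run one agent per role concurrently and collect results by role."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {role: pool.submit(run_agent, role, task)
                   for role, task in assignments.items()}
        return {role: future.result() for role, future in futures.items()}


results = fan_out({
    "research": "collect sources",
    "writing": "draft the summary",
    "analysis": "check the numbers",
})
```

Threads suit this pattern when each agent mostly waits on network calls to a model server; the cost argument in the text is what makes running many such calls at once practical.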
Bottom Line
Build robust agentic models by implementing interleaved thinking architectures, systematically perturbing training environments to force generalization, and embedding expert developers directly into the RL feedback loop.