Captaining IMO Gold, Deep Think, On-Policy RL, Feeling the AGI in Singapore — Yi Tay
TL;DR
Yi Tay returns to Google DeepMind Singapore to lead the Reasoning and AGI team. He explains the shift toward on-policy reinforcement learning as the dominant paradigm for model reasoning, and shares the technical story behind Gemini's IMO Gold achievement, earned with an end-to-end text approach.
🏢 Return to Google DeepMind & The Singapore Team
Rejoining GDM feels like a saved game
Tay describes returning after 1.5 years as seamless, with his LDAP and the infrastructure unchanged, like resuming a Pokémon save file, though Brain has since merged into GDM and much has evolved.
Leading Reasoning and AGI team
He leads the new Singapore team explicitly named with 'AGI' to signal the north star of developing toward artificial general intelligence, focusing on frontier research close to the model.
Transitioning to RL research
Having spent his career on architectures and pre-training, Tay has shifted focus to reinforcement learning as the primary modeling toolset for modern language models.
🎯 On-Policy RL vs. Imitation Learning
On-policy training generates its own path
Modern LM RL is on-policy: the model generates its own outputs, receives reward signals, and trains on its own trajectories, unlike SFT, which mimics outputs from other models.
Imitation has limits for true capability
While imitation learning (watching tutorials) helps initially, both humans and models must eventually transition to on-policy learning through direct environmental feedback to achieve mastery.
Montessori approach to model training
On-policy RL resembles Montessori schooling: providing a safe environment for the model to discover its own path rather than copying predetermined trajectories.
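The contrast between imitation and on-policy learning described above can be sketched with a toy example. This is a hypothetical illustration, not anything from the episode: a two-action "policy" parameterized by a single logit stands in for a language model, SFT pushes toward a teacher's action, and on-policy RL samples the model's own action and reinforces it via a REINFORCE-style update.

```python
import math
import random

def sigmoid(x):
    """Probability of choosing action 1 given the policy's logit."""
    return 1.0 / (1.0 + math.exp(-x))

def sft_step(logit, teacher_action, lr=0.5):
    """Imitation (SFT): push the policy toward the teacher's action,
    regardless of what the model itself would have sampled."""
    p = sigmoid(logit)
    grad = teacher_action - p  # gradient of the log-likelihood of the teacher label
    return logit + lr * grad

def on_policy_step(logit, reward_fn, lr=0.5):
    """On-policy RL (REINFORCE): sample the model's OWN action,
    observe a reward from the environment, and reinforce that trajectory."""
    p = sigmoid(logit)
    action = 1 if random.random() < p else 0  # model generates its own output
    reward = reward_fn(action)                # environment feedback, not a teacher label
    grad = (action - p) * reward              # REINFORCE gradient estimate
    return logit + lr * grad

random.seed(0)
reward_fn = lambda a: 1.0 if a == 1 else 0.0  # hypothetical verifier: action 1 is "correct"

logit = 0.0
for _ in range(200):
    logit = on_policy_step(logit, reward_fn)
print(f"P(correct) after on-policy training: {sigmoid(logit):.2f}")
```

The key structural difference is where the training signal comes from: `sft_step` consumes a `teacher_action` (someone else's trajectory), while `on_policy_step` only ever sees actions the policy sampled itself, which is the Montessori-style "discover your own path" framing from the episode.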
🥇 IMO Gold: The Live Competition
End-to-end text model approach
Unlike previous AlphaGeometry systems, the team pursued a pure text-in/text-out Gemini model, believing that if models cannot solve IMO problems, they cannot achieve AGI.
Real-time competition logistics
The IMO attempt happened live with team members in Australia running inference on fresh problems (P1-P6) released across different days, requiring prepared model checkpoints rather than iterative benchmarking.
One-week intensive training sprint
While the broader effort was long-term, Tay's specific contribution involved an intensive one-week period preparing the final model checkpoint used for the live competition.
⚡ Research Philosophy & Adaptation
Maintain high learning rates
When new evidence violates prior assumptions, researchers should update their beliefs by 20-50% rather than 2%, fully discarding invalidated worldviews when counter-examples emerge instead of remaining "Bayesian prisoners" of small updates.
AI crossing immersion thresholds
Capabilities like coding and image generation have crossed into practical utility: models can now parse spreadsheets from screenshots and generate matplotlib plots, moving beyond toy applications.
Bottom Line
The future of AI capability lies in on-policy reinforcement learning, where models learn from their own generated trajectories rather than imitation. Meanwhile, researchers must maintain aggressively high "learning rates" of their own, abandoning invalidated assumptions as paradigms shift.