Captaining IMO Gold, Deep Think, On-Policy RL, Feeling the AGI in Singapore — Yi Tay
TL;DR
Yi Tay returns to Google DeepMind Singapore to lead the Reasoning and AGI team. He explains the shift toward on-policy reinforcement learning as the dominant paradigm for model reasoning and shares the technical story behind Gemini's IMO Gold achievement, which used an end-to-end text approach.
🏢 Return to Google DeepMind & The Singapore Team 3 insights
Rejoining GDM feels like a saved game
Tay describes returning after 1.5 years as seamless, with his LDAP and the infrastructure unchanged, like resuming a Pokémon save file, though Brain is now part of GDM and many things have evolved.
Leading Reasoning and AGI team
He leads the new Singapore team, which carries 'AGI' in its name to signal its north star: progress toward artificial general intelligence through frontier research close to the model.
Transitioning to RL research
Having spent his career on architectures and pre-training, Tay has shifted focus to reinforcement learning as the primary modeling toolset for modern language models.
🎯 On-Policy RL vs. Imitation Learning 3 insights
On-policy training generates its own path
Modern RL for language models is on-policy: the model generates its own outputs, receives reward signals, and trains on its own trajectories. Supervised fine-tuning (SFT), by contrast, trains the model to mimic other models' outputs.
Imitation has limits for true capability
While imitation learning (watching tutorials) helps initially, both humans and models must eventually transition to on-policy learning through direct environmental feedback to achieve mastery.
Montessori approach to model training
On-policy RL resembles Montessori schooling: providing a safe environment for the model to discover its own path rather than copying predetermined trajectories.
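The contrast between imitation and on-policy learning described above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the training setup discussed in the episode: the "model" is a softmax policy over a handful of actions, the reward function is a made-up 0/1 signal, and the on-policy update is plain REINFORCE. All names (`sft_step`, `on_policy_step`, `train_on_policy`) are hypothetical.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sft_step(logits, demo_action, lr):
    """Imitation (SFT-style): raise the log-probability of an action that
    someone else chose. No reward and no sampling from our own policy."""
    probs = softmax(logits)
    # Gradient of log p(demo_action) w.r.t. the logits is one_hot - probs.
    return [
        l + lr * ((1.0 if i == demo_action else 0.0) - p)
        for i, (l, p) in enumerate(zip(logits, probs))
    ]


def on_policy_step(logits, reward_fn, lr, rng):
    """On-policy (REINFORCE-style): sample from the current policy, score the
    sample with a reward, and reinforce the policy's own choice."""
    probs = softmax(logits)
    action = rng.choices(range(len(logits)), weights=probs)[0]
    reward = reward_fn(action)
    new_logits = [
        l + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (l, p) in enumerate(zip(logits, probs))
    ]
    return new_logits, action, reward


def train_on_policy(num_actions=4, best_action=2, steps=2000, lr=0.5, seed=0):
    """Toy environment (an assumption): reward 1.0 for best_action, else 0.0."""
    rng = random.Random(seed)
    logits = [0.0] * num_actions

    def reward_fn(action):
        return 1.0 if action == best_action else 0.0

    for _ in range(steps):
        logits, _, _ = on_policy_step(logits, reward_fn, lr, rng)
    return softmax(logits)


if __name__ == "__main__":
    print("on-policy policy:", [round(p, 3) for p in train_on_policy()])
    print("after one SFT step:", [round(p, 3) for p in softmax(sft_step([0.0] * 3, 1, 0.5))])
```

The only structural difference between the two update functions is where the trained-on action comes from: `sft_step` is handed a demonstration, while `on_policy_step` samples from the policy itself and lets the reward decide how strongly to reinforce it.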
🥇 IMO Gold: The Live Competition 3 insights
End-to-end text model approach
Unlike previous AlphaGeometry systems, the team pursued a pure text-in/text-out Gemini model, believing that if models cannot solve IMO problems, they cannot achieve AGI.
Real-time competition logistics
The IMO attempt happened live with team members in Australia running inference on fresh problems (P1-P6) released across different days, requiring prepared model checkpoints rather than iterative benchmarking.
One-week intensive training sprint
While the broader effort was long-term, Tay's specific contribution involved an intensive one-week period preparing the final model checkpoint used for the live competition.
⚡ Research Philosophy & Adaptation 2 insights
Maintain high learning rates
When new evidence violates a prior assumption, researchers should update their beliefs by 20-50% rather than 2%. When counter-examples emerge, they should be willing to discard a worldview entirely instead of becoming 'Bayesian prisoners'.
AI crossing immersion thresholds
Current capabilities like coding and image generation have crossed into practical utility where models can parse spreadsheets from screenshots and generate matplotlib plots, moving beyond toy applications.
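To make the "high learning rate" advice above concrete, here is a small arithmetic sketch. It assumes, purely for illustration, that a belief is a single number that moves a fixed fraction of the remaining distance toward the evidence on each update; that framing and the function name `updates_needed` are assumptions, not something stated in the episode.

```python
def updates_needed(rate, target_fraction=0.9):
    """Count how many fixed-fraction updates it takes to close
    `target_fraction` of the gap between a belief and the evidence.

    After n updates the remaining gap is (1 - rate) ** n.
    """
    remaining = 1.0
    steps = 0
    while remaining > 1.0 - target_fraction:
        remaining *= 1.0 - rate
        steps += 1
    return steps


if __name__ == "__main__":
    for rate in (0.02, 0.2, 0.5):
        print(f"rate={rate:.2f}: {updates_needed(rate)} updates to close 90% of the gap")
```

Under this toy model, a 2% updater needs over a hundred pieces of contrary evidence to move most of the way, while a 20-50% updater needs only a handful.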
Bottom Line
The future of AI capability lies in on-policy reinforcement learning, where models learn from their own generated trajectories rather than from imitation. Researchers, in turn, must keep aggressively high 'learning rates' so they can abandon invalidated assumptions and adapt to paradigm shifts.