Captaining IMO Gold, Deep Think, On-Policy RL, Feeling the AGI in Singapore — Yi Tay

| Podcasts | January 23, 2026 | 5.55K views | 1:32:05

TL;DR

Yi Tay returns to Google DeepMind Singapore to lead the Reasoning and AGI team, explaining the shift toward on-policy reinforcement learning as the dominant paradigm for model reasoning and sharing the technical story behind Gemini's IMO Gold achievement using an end-to-end text approach.

🏢 Return to Google DeepMind & The Singapore Team

Rejoining GDM feels like a saved game

Tay describes returning after 1.5 years as seamless—LDAP and infrastructure unchanged—like resuming a Pokemon save file, though Brain is now part of GDM and many things have evolved.

Leading Reasoning and AGI team

He leads the new Singapore team, explicitly named with 'AGI' to signal its north star of artificial general intelligence; the team focuses on frontier research close to the model.

Transitioning to RL research

Having spent his career on architectures and pre-training, Tay has shifted focus to reinforcement learning as the primary modeling toolset for modern language models.

🎯 On-Policy RL vs. Imitation Learning

On-policy training generates its own path

Modern LM RL uses on-policy learning: the model generates its own outputs, receives reward signals, and trains on its own trajectories, unlike supervised fine-tuning (SFT), which mimics other models' outputs.
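The on-policy loop can be sketched in miniature. Below is a toy REINFORCE bandit, purely illustrative and nothing like Gemini's actual training stack: a softmax "policy" samples its own action, gets a reward, and updates on that self-generated trajectory.

```python
import math
import random

random.seed(0)

# Toy on-policy loop: a softmax "policy" over two actions samples its OWN
# action, scores it against a reward, and updates via REINFORCE.
# Purely illustrative -- not Gemini's training setup.
logits = [0.0, 0.0]          # trainable parameters
REWARD = {0: 0.0, 1: 1.0}    # the environment prefers action 1
LR = 0.5

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

for _ in range(200):
    probs = softmax(logits)
    # Sampling from the model's own distribution is the defining
    # on-policy step (SFT would instead train on a teacher's outputs).
    action = random.choices([0, 1], weights=probs)[0]
    baseline = sum(probs[a] * REWARD[a] for a in REWARD)
    advantage = REWARD[action] - baseline
    # REINFORCE gradient of log pi(action): one-hot minus probs
    for i in range(2):
        indicator = 1.0 if i == action else 0.0
        logits[i] += LR * advantage * (indicator - probs[i])

print(round(softmax(logits)[1], 2))  # probability of the rewarded action
```

After a few hundred updates the policy concentrates on the rewarded action, having discovered it from its own samples rather than from demonstrations.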

Imitation has limits for true capability

While imitation learning (watching tutorials) helps initially, both humans and models must eventually transition to on-policy learning through direct environmental feedback to achieve mastery.
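For contrast, imitation learning reduces to a cross-entropy loss on a teacher's trajectory. A minimal sketch, where the teacher tokens and the student's per-token probabilities are made-up illustrative numbers:

```python
import math

# Imitation learning (SFT): minimize the negative log-likelihood of a
# TEACHER's trajectory. The training signal comes entirely from someone
# else's outputs, so the student is capped at reproducing the teacher.
teacher_tokens = ["the", "proof", "follows"]

# Hypothetical probabilities the student assigns to the teacher's tokens
# (made-up numbers for illustration).
student_probs = {"the": 0.5, "proof": 0.3, "follows": 0.2}

def sft_loss(probs, tokens):
    return -sum(math.log(probs[t]) for t in tokens)

print(round(sft_loss(student_probs, teacher_tokens), 3))
```

Nothing in this objective depends on what the student itself would have generated, which is exactly the limitation the on-policy transition addresses.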

Montessori approach to model training

On-policy RL resembles Montessori schooling: providing a safe environment for the model to discover its own path rather than copying predetermined trajectories.

🥇 IMO Gold: The Live Competition

End-to-end text model approach

Unlike previous AlphaGeometry systems, the team pursued a pure text-in/text-out Gemini model, believing that if models cannot solve IMO problems, they cannot achieve AGI.

Real-time competition logistics

The IMO attempt happened live with team members in Australia running inference on fresh problems (P1-P6) released across different days, requiring prepared model checkpoints rather than iterative benchmarking.

One-week intensive training sprint

While the broader effort was long-term, Tay's specific contribution involved an intensive one-week period preparing the final model checkpoint used for the live competition.

Research Philosophy & Adaptation

Maintain high learning rates

When new evidence violates prior assumptions, researchers should update their beliefs by 20-50% rather than 2%, willing to fully invalidate a worldview when a counter-example emerges instead of remaining Bayesian prisoners of stale priors.
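Tay's "learning rate" metaphor can be made concrete with toy arithmetic. The update rule and numbers below are illustrative assumptions, not anything from the episode: at a 2% update per counter-example, a strongly held belief takes dozens of contradictions to flip; at 30%, it takes two.

```python
# Toy model of a researcher's "learning rate": shrink a belief toward the
# contradicting evidence by a fixed fraction per counter-example and count
# how many it takes to drop below 50% confidence. Illustrative only.
def counterexamples_to_flip(update_rate, belief=0.95, threshold=0.5):
    count = 0
    while belief > threshold:
        belief += update_rate * (0.0 - belief)  # evidence says the belief is false
        count += 1
    return count

print(counterexamples_to_flip(0.02), counterexamples_to_flip(0.30))
```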

AI crossing immersion thresholds

Capabilities like coding and image generation have crossed into practical utility: models can now parse spreadsheets from screenshots and generate matplotlib plots, moving beyond toy applications.

Bottom Line

The future of AI capability lies in on-policy reinforcement learning where models learn from their own generated trajectories rather than imitation, while researchers must maintain aggressively high 'learning rates' to abandon invalidated assumptions and adapt to paradigm shifts.
