Captaining IMO Gold, Deep Think, On-Policy RL, Feeling the AGI in Singapore — Yi Tay
TL;DR
Yi Tay returns to Google DeepMind Singapore to lead the Reasoning and AGI team. He explains the shift toward on-policy reinforcement learning as the dominant paradigm for model reasoning, and shares the technical story behind Gemini's IMO Gold, achieved with an end-to-end text approach.
🏢 Return to Google DeepMind & The Singapore Team
Rejoining GDM feels like a saved game
Tay describes returning after 1.5 years as seamless: his LDAP and infrastructure were unchanged, like resuming a Pokemon save file, though Brain has since merged into GDM and many things have evolved.
Leading Reasoning and AGI team
He leads the new Singapore team, explicitly named with 'AGI' to signal its north star of artificial general intelligence, focusing on frontier research close to the model.
Transitioning to RL research
Having spent his career on architectures and pre-training, Tay has shifted focus to reinforcement learning as the primary modeling toolset for modern language models.
🎯 On-Policy RL vs. Imitation Learning
On-policy training generates its own path
Modern LM RL uses on-policy learning: the model generates its own outputs, receives reward signals, and trains on its own trajectories, unlike SFT, which mimics other models' outputs.
Imitation has limits for true capability
While imitation learning, akin to watching tutorials, helps initially, both humans and models must eventually transition to on-policy learning through direct environmental feedback to achieve mastery.
Montessori approach to model training
On-policy RL resembles Montessori schooling: providing a safe environment for the model to discover its own path rather than copying predetermined trajectories.
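The contrast between on-policy learning and imitation described above can be sketched as a toy policy-gradient loop: a minimal REINFORCE example on a two-answer "bandit", where the model samples its own answer and updates on a reward signal rather than a teacher label. The names, rewards, and hyperparameters here are illustrative assumptions, not the team's actual training setup.

```python
import math
import random

random.seed(0)

# Toy "policy": logits over two candidate answers; answer 1 is correct.
logits = [0.0, 0.0]
REWARD = [0.0, 1.0]  # environment feedback per answer (illustrative)
lr = 0.5

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def sample(probs):
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

# On-policy loop: the model generates its own trajectory (a sampled answer),
# receives a reward from the environment, and reinforces its own rewarded
# behavior (REINFORCE with an expected-reward baseline). SFT would instead
# do cross-entropy against a fixed teacher answer regardless of reward.
for step in range(200):
    probs = softmax(logits)
    a = sample(probs)                  # the model's own output
    r = REWARD[a]                      # environment signal, not a teacher label
    baseline = sum(p * REWARD[i] for i, p in enumerate(probs))
    for i in range(len(logits)):
        grad = (1.0 if i == a else 0.0) - probs[i]  # d log pi(a) / d logit_i
        logits[i] += lr * (r - baseline) * grad

print(softmax(logits))  # probability mass concentrates on the correct answer
```

The key property: nothing in the loop tells the model *which* answer is right, only how its own sampled attempts scored, which is what distinguishes this from imitating another model's trajectory.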
🥇 IMO Gold: The Live Competition
End-to-end text model approach
Unlike previous AlphaGeometry systems, the team pursued a pure text-in/text-out Gemini model, believing that if models cannot solve IMO problems, they cannot achieve AGI.
Real-time competition logistics
The IMO attempt happened live, with team members in Australia running inference on the fresh problems (P1-P6) as they were released across different days, requiring pre-prepared model checkpoints rather than iterative benchmarking.
One-week intensive training sprint
While the broader effort was long-term, Tay's specific contribution involved an intensive one-week period preparing the final model checkpoint used for the live competition.
⚡ Research Philosophy & Adaptation
Maintain high learning rates
When new evidence violates prior assumptions, researchers should update their beliefs by 20-50% rather than 2%, discarding invalidated worldviews outright when counter-examples emerge instead of clinging to priors like 'Bayesian prisoners.'
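Tay's "learning rate" metaphor can be put in numbers with a toy update rule, sketched here as an exponential moving average toward the new evidence. The step counts and values are illustrative assumptions, not figures from the episode.

```python
def updated_belief(lr, truth, belief, steps):
    """Move `lr` of the remaining distance toward `truth` on each step."""
    for _ in range(steps):
        belief += lr * (truth - belief)
    return belief

# Prior belief 0.0; a paradigm shift means the "truth" is now 1.0.
slow = updated_belief(0.02, 1.0, 0.0, 10)  # "Bayesian prisoner": ~0.18 after 10 steps
fast = updated_belief(0.40, 1.0, 0.0, 10)  # high learning rate:  ~0.99 after 10 steps
print(round(slow, 2), round(fast, 2))
```

After ten rounds of counter-evidence, the 2%-updater has barely moved while the 40%-updater has essentially adopted the new worldview, which is the gap Tay is pointing at.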
AI crossing immersion thresholds
Current capabilities like coding and image generation have crossed into practical utility where models can parse spreadsheets from screenshots and generate matplotlib plots, moving beyond toy applications.
Bottom Line
The future of AI capability lies in on-policy reinforcement learning, where models learn from their own generated trajectories rather than from imitation; researchers, in turn, must keep their own 'learning rates' aggressively high, abandoning invalidated assumptions as paradigms shift.