Captaining IMO Gold, Deep Think, On-Policy RL, Feeling the AGI in Singapore — Yi Tay

Latent Space

Podcasts | January 23, 2026 | 5.55 thousand views | 1:32:05

TL;DR

Yi Tay has returned to Google DeepMind in Singapore to lead the Reasoning and AGI team. He explains why on-policy reinforcement learning has become the dominant paradigm for model reasoning and shares the technical story behind Gemini's IMO Gold result, achieved with an end-to-end text approach.

🏢 Return to Google DeepMind & The Singapore Team 3 insights

Rejoining GDM feels like a saved game

Tay describes returning after 1.5 years as seamless: his LDAP and the infrastructure were unchanged, like resuming a Pokémon save file, although Brain is now part of GDM and much else has evolved.

Leading Reasoning and AGI team

He leads the new Singapore team, whose name explicitly includes 'AGI' to signal its north star: building toward artificial general intelligence through frontier research close to the model.

Transitioning to RL research

Having spent his career on architectures and pre-training, Tay has shifted focus to reinforcement learning as the primary modeling toolset for modern language models.

🎯 On-Policy RL vs. Imitation Learning 3 insights

On-policy training generates its own path

Modern RL for language models is on-policy: the model generates its own outputs, receives a reward signal, and trains on its own trajectories. Supervised fine-tuning (SFT), by contrast, trains the model to mimic outputs produced by other models.

Imitation has limits for true capability

Imitation learning, akin to watching tutorials, helps initially, but both humans and models must eventually move to on-policy learning through direct environmental feedback to achieve mastery.

Montessori approach to model training

On-policy RL resembles Montessori schooling: providing a safe environment for the model to discover its own path rather than copying predetermined trajectories.
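
The contrast drawn above between on-policy training and imitation can be made concrete with a toy example. The sketch below is not the method used for Gemini. It assumes a three-action softmax policy, a hypothetical `reward` function standing in for a verifier, and a plain REINFORCE-style update chosen for brevity. All names and hyperparameters are illustrative.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward(action):
    # Hypothetical verifier (an assumption): only action 2 is "correct".
    return 1.0 if action == 2 else 0.0


def train_on_policy(steps=500, lr=0.5, seed=0):
    """REINFORCE-style loop: sample from the current policy, score the
    sample, and reinforce the model's own rewarded choices."""
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(3), weights=probs)[0]  # model's own output
        r = reward(action)
        for i in range(3):
            # Gradient of log pi(action) with respect to logit i.
            grad = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * r * grad
    return softmax(logits)


def train_sft(demos, lr=0.5):
    """Imitation: raise the likelihood of someone else's demonstrated
    actions, regardless of whether they would earn reward."""
    logits = [0.0, 0.0, 0.0]
    for demo in demos:
        probs = softmax(logits)
        for i in range(3):
            grad = (1.0 if i == demo else 0.0) - probs[i]
            logits[i] += lr * grad
    return softmax(logits)


on_policy_probs = train_on_policy()
# Demonstrations from a hypothetical teacher that always picks action 1.
sft_probs = train_sft(demos=[1] * 200)
```

In this toy setup the on-policy learner concentrates probability on the rewarded action because it only reinforces its own successful samples, while the SFT learner ends up favoring whatever the demonstrations contain, here the unrewarded action 1.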

🥇 IMO Gold: The Live Competition 3 insights

End-to-end text model approach

Unlike previous AlphaGeometry systems, the team pursued a pure text-in/text-out Gemini model, believing that if models cannot solve IMO problems, they cannot achieve AGI.

Real-time competition logistics

The IMO attempt ran live: team members in Australia ran inference on fresh problems (P1-P6) as they were released on separate days, so the team had to rely on model checkpoints prepared in advance rather than iterating against a benchmark.

One-week intensive training sprint

While the broader effort was long-term, Tay's specific contribution involved an intensive one-week period preparing the final model checkpoint used for the live competition.

⚡ Research Philosophy & Adaptation 2 insights

Maintain high learning rates

When new evidence violates a prior assumption, researchers should update their beliefs by 20-50% rather than 2%, and be willing to discard a worldview outright when counter-examples emerge instead of remaining prisoners of their Bayesian priors.
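
One way to see why the size of the update matters is to treat the "learning rate" literally. The sketch below is an illustration only, not something from the episode: it assumes each piece of contrary evidence closes a fixed fraction `rate` of the remaining gap between a belief and the evidence, and counts how many such updates are needed to close 90% of that gap. The function name and the 90% threshold are assumptions.

```python
def updates_to_close_gap(rate, target_remaining=0.1):
    """Count updates until the remaining belief gap is at or below target.

    Assumed model: each update shrinks the gap by a factor of (1 - rate).
    """
    gap = 1.0
    count = 0
    while gap > target_remaining:
        gap *= 1.0 - rate
        count += 1
    return count


slow = updates_to_close_gap(0.02)    # the "2%" updater
medium = updates_to_close_gap(0.20)  # lower end of the 20-50% range
fast = updates_to_close_gap(0.50)    # upper end of the 20-50% range
```

Under this assumed model the 2% updater needs 114 contrary observations to get 90% of the way to the evidence, while the 20% and 50% updaters need 11 and 4.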

AI crossing immersion thresholds

Capabilities such as coding and image generation have crossed into practical utility: models can now parse a spreadsheet from a screenshot and generate matplotlib plots, well beyond toy applications.

Bottom Line

The future of AI capability lies in on-policy reinforcement learning where models learn from their own generated trajectories rather than imitation, while researchers must maintain aggressively high 'learning rates' to abandon invalidated assumptions and adapt to paradigm shifts.
