I lead AGI safety at Google DeepMind – here's the view from the inside | Rohin Shah

Podcasts | June 02, 2026 | 5.62 thousand views | 2:48:27

TL;DR

Rohin Shah, Head of AGI Safety at Google DeepMind, argues that catastrophic misalignment is unlikely by default given current training methods, and warns that rigid safety commitments are counterproductive because rapidly evolving research may turn today's best practices into tomorrow's liabilities.

🔮 Misalignment Risk Assessment

Standard arguments lack compelling evidence for inevitable misalignment

Shah finds that existing arguments show misalignment is plausible, but none establishes it as the default outcome. That justifies caution, not panic.

Short-horizon training limits deceptive alignment

Current RL happens over weeks or months, not years, making it unlikely to produce the long-horizon goals necessary for world takeover strategies.

Current 'scheming' is role-playing, not real misalignment

Observed deceptive behaviors in models resemble science fiction role-play rather than competent pursuit of misaligned goals.

Current steerability offers limited evidence for future safety

Today's models lack the capabilities that create the scary oversight problems Shah originally worried about.

🚫 The Problem with Firm Commitments

Evolving research makes commitments potentially harmful

Shah cites the shift in thinking on pretraining with alignment data: once encouraged, such data is now filtered out to keep models from learning malicious personas or details of mitigations.

Companies inevitably abandon unrealistic commitments

Anthropic's Responsible Scaling Policy removed strong 'commit' language in later versions, demonstrating that binding promises get relaxed when impractical.

Conservative language builds more trust than ambitious promises

Google DeepMind deliberately avoids 'commit' language in its Frontier Safety Framework, which Shah argues makes it more honest and trustworthy than competitors' stronger rhetoric.

🔍 Alternative Governance Approaches

Third-party audits preferred over public commitments

Shah recommends external evaluators with reasonable access to verify practices rather than rigid public promises that may become outdated.

Google's paranoid approach to promises increases credibility

DeepMind's conservative stance on commitments reflects internal skepticism that ensures they only promise what they can actually deliver.

Bottom Line

Organizations should prioritize accurate communication about current safety practices and third-party verification over rigid long-term commitments, as research progress rapidly changes which safety measures are actually beneficial.
