I lead AGI safety at Google DeepMind – here's the view from the inside | Rohin Shah
TL;DR
Rohin Shah, Head of AGI Safety at Google DeepMind, argues that catastrophic misalignment is unlikely by default given current training methods, and warns that rigid safety commitments are counterproductive because rapidly evolving research may turn today's best practices into tomorrow's liabilities.
🔮 Misalignment Risk Assessment
Standard arguments lack compelling evidence for inevitable misalignment
Shah finds existing arguments suggest misalignment is plausible but none establish it as the default outcome, justifying caution but not panic.
Short-horizon training limits deceptive alignment
Current RL happens over weeks or months, not years, making it unlikely to produce the long-horizon goals necessary for world takeover strategies.
Current 'scheming' is role-playing, not real misalignment
Observed deceptive behaviors in models resemble science fiction role-play rather than competent pursuit of misaligned goals.
Current steerability offers limited evidence for future safety
Today's models lack the capabilities that create the scary oversight problems Shah originally worried about.
🚫 The Problem with Firm Commitments
Evolving research makes commitments potentially harmful
Shah cites the shift on pretraining with alignment data: once encouraged, such data is now filtered out to prevent models from learning malicious personas or mitigation details.
Companies inevitably abandon unrealistic commitments
Anthropic's Responsible Scaling Policy removed strong 'commit' language in later versions, demonstrating that binding promises get relaxed when impractical.
Conservative language builds more trust than ambitious promises
Google DeepMind deliberately avoids 'commit' language in its Frontier Safety Framework, which Shah argues makes the framework more honest and trustworthy than competitors' more strongly worded policies.
🔍 Alternative Governance Approaches
Third-party audits preferred over public commitments
Shah recommends external evaluators with reasonable access to verify practices rather than rigid public promises that may become outdated.
Google's paranoid approach to promises increases credibility
DeepMind's conservative stance on commitments reflects an internal skepticism that ensures it only promises what it can actually deliver.
Bottom Line
Organizations should prioritize accurate communication about current safety practices and third-party verification over rigid long-term commitments, as research progress rapidly changes which safety measures are actually beneficial.