I lead AGI safety at Google DeepMind – here's the view from the inside | Rohin Shah

80,000 Hours Podcast (Rob Wiblin)

Podcasts | June 02, 2026 | 14.8 thousand views | 2:48:27

TL;DR

Rohin Shah, Head of AGI Safety at Google DeepMind, argues that catastrophic misalignment is unlikely by default given current training methods, and warns that rigid safety commitments are counterproductive because rapidly evolving research may turn today's best practices into tomorrow's liabilities.

🔮 Misalignment Risk Assessment 4 insights

Standard arguments lack compelling evidence for inevitable misalignment

Shah finds that existing arguments show misalignment is plausible, but none establishes it as the default outcome. That justifies caution, not panic.

Short-horizon training limits deceptive alignment

Current RL happens over weeks or months, not years, making it unlikely to produce the long-horizon goals necessary for world takeover strategies.

Current 'scheming' is role-playing, not real misalignment

Observed deceptive behaviors in models resemble science fiction role-play rather than competent pursuit of misaligned goals.

Current steerability offers limited evidence for future safety

Today's models lack the capabilities that create the scary oversight problems Shah originally worried about, so the ease of steering them says little about whether more capable future models will be safe.

🚫 The Problem with Firm Commitments 3 insights

Evolving research makes commitments potentially harmful

Shah cites the shift in practice on pretraining with alignment data: once encouraged, such data is now filtered out to keep models from learning malicious personas or the details of mitigations.

Companies inevitably abandon unrealistic commitments

Anthropic's Responsible Scaling Policy removed strong 'commit' language in later versions, demonstrating that binding promises get relaxed when impractical.

Conservative language builds more trust than ambitious promises

Google DeepMind deliberately avoids 'commit' language in its Frontier Safety Framework, which Shah argues makes it more honest and trustworthy than competitors' stronger rhetoric.

🔍 Alternative Governance Approaches 2 insights

Third-party audits preferred over public commitments

Shah recommends external evaluators with reasonable access to verify practices rather than rigid public promises that may become outdated.

Google's paranoid approach to promises increases credibility

DeepMind's conservative stance on commitments reflects internal skepticism that ensures they only promise what they can actually deliver.

Bottom Line

Organizations should prioritize accurate communication about current safety practices and third-party verification over rigid long-term commitments, as research progress rapidly changes which safety measures are actually beneficial.
