AI Security After Codex and Claude Code — Zico Kolter & Matt Fredrikson, Gray Swan

| Podcasts | June 22, 2026 | 770 views | 1:07:31

TL;DR

Gray Swan co-founders Zico Kolter and Matt Fredrikson explain why AI systems require a fundamentally different security approach than traditional software, highlighting how their automated red teaming system 'Shade' has begun to outperform human experts at finding model vulnerabilities. They emphasize the urgent need to treat AI agents as inherently untrusted entities capable of correlated failures across the software ecosystem.

🛡️ The AI Security Paradigm 3 insights

Treating AI as Inherently Untrusted Entities

Unlike traditional cybersecurity, Gray Swan approaches AI models as software with unique behavioral vulnerabilities that can be tricked like humans, introducing novel risks when integrated into networks and granted autonomous tool access.

Systemic Risk from Model Concentration

The widespread deployment of specific agents like Codex and Claude Code creates dangerous correlated failure risks, where a single exploit can simultaneously compromise numerous systems unlike distributed traditional software flaws.

Beyond Traditional Cybersecurity

AI security focuses on the model's own susceptibility to jailbreaks and prompt injection rather than using AI merely as a tool to identify bugs in existing codebases.

🤖 Automated Red Teaming 3 insights

Automated Systems Surpass Human Capabilities

Gray Swan's specialized system 'Shade' now finds significantly more model breaks than human red teamers in fixed-time competitions, marking a shift where trained AI exceeds human adversarial capabilities.

Why Frontier Models Make Poor Red Teamers

Major AI models refuse adversarial tasks due to safety training, making them ineffective at red teaming compared to explicitly trained specialized models designed to bypass normal behaviors.

Crowdsourced Adversarial Testing

The company operates a 15,000-member 'Arena' community hosting prize challenges to generate training data and identify vulnerabilities for frontier labs like Anthropic.

⚠️ Agent Vulnerabilities 2 insights

Indirect Prompt Injection in Coding Agents

Testing revealed that coding agents remain vulnerable to indirect prompt injection when fetching untrusted web content, allowing attackers to hijack objectives, leak data, or steal credentials.

Alien Intelligence Failure Modes

AI systems exhibit fundamentally different intelligence than humans, failing in ways people never would and vice versa, requiring new security frameworks incapable of being predicted by human intuition alone.

Bottom Line

Organizations deploying AI agents must adopt a security mindset that treats models as inherently untrusted entities prone to correlated failures, implementing automated red teaming and strict sandboxing before granting tool access or network integration.

More from Latent Space

View all
The AI Frontier: from FLOPs to Megawatts — Anjney Midha, AMP
1:00:37
Latent Space Latent Space

The AI Frontier: from FLOPs to Megawatts — Anjney Midha, AMP

Anjney Midha argues that AI infrastructure is facing a crisis of inefficiency and cultural misalignment, proposing that compute be treated as a utility through an Independent System Operator model that pools multi-cloud resources while embedding community incentives directly into unit economics.

5 days ago · 10 points
🔬 The Limits of AI in Science - Why We Need Self-Driving Labs — Joseph Krause, Radical AI
1:16:50
Latent Space Latent Space

🔬 The Limits of AI in Science - Why We Need Self-Driving Labs — Joseph Krause, Radical AI

Joseph Krause explains why AI alone cannot discover new industrial materials—unlike biology, alloys cannot be represented as simple strings and require physical ground truth across synthesis, microstructure, and processing. Radical AI is building self-driving labs to close the loop between AI hypothesis generation and automated experimentation, aiming to compress the 15-30 year materials development timeline.

6 days ago · 7 points