AI Security After Codex and Claude Code — Zico Kolter & Matt Fredrikson, Gray Swan
TL;DR
Gray Swan co-founders Zico Kolter and Matt Fredrikson explain why AI systems require a fundamentally different security approach than traditional software, highlighting how their automated red teaming system 'Shade' has begun to outperform human experts at finding model vulnerabilities. They emphasize the urgent need to treat AI agents as inherently untrusted entities capable of correlated failures across the software ecosystem.
🛡️ The AI Security Paradigm 3 insights
Treating AI as Inherently Untrusted Entities
Unlike traditional cybersecurity, Gray Swan approaches AI models as software with unique behavioral vulnerabilities that can be tricked like humans, introducing novel risks when integrated into networks and granted autonomous tool access.
Systemic Risk from Model Concentration
The widespread deployment of specific agents like Codex and Claude Code creates dangerous correlated failure risks, where a single exploit can simultaneously compromise numerous systems unlike distributed traditional software flaws.
Beyond Traditional Cybersecurity
AI security focuses on the model's own susceptibility to jailbreaks and prompt injection rather than using AI merely as a tool to identify bugs in existing codebases.
🤖 Automated Red Teaming 3 insights
Automated Systems Surpass Human Capabilities
Gray Swan's specialized system 'Shade' now finds significantly more model breaks than human red teamers in fixed-time competitions, marking a shift where trained AI exceeds human adversarial capabilities.
Why Frontier Models Make Poor Red Teamers
Major AI models refuse adversarial tasks due to safety training, making them ineffective at red teaming compared to explicitly trained specialized models designed to bypass normal behaviors.
Crowdsourced Adversarial Testing
The company operates a 15,000-member 'Arena' community hosting prize challenges to generate training data and identify vulnerabilities for frontier labs like Anthropic.
⚠️ Agent Vulnerabilities 2 insights
Indirect Prompt Injection in Coding Agents
Testing revealed that coding agents remain vulnerable to indirect prompt injection when fetching untrusted web content, allowing attackers to hijack objectives, leak data, or steal credentials.
Alien Intelligence Failure Modes
AI systems exhibit fundamentally different intelligence than humans, failing in ways people never would and vice versa, requiring new security frameworks incapable of being predicted by human intuition alone.
Bottom Line
Organizations deploying AI agents must adopt a security mindset that treats models as inherently untrusted entities prone to correlated failures, implementing automated red teaming and strict sandboxing before granting tool access or network integration.
More from Latent Space
View all
⚡️Every product of the future will be a living system — Ronak Malde, Trajectory.ai
Ronak Malde explains leaving DeepMind (and $2 billion in acquisition earnings) to found Trajectory.ai, arguing that AI products must evolve from static tools into "living systems" that continually learn from real-world user corrections across enterprise verticals like legal and finance.
The AI Frontier: from FLOPs to Megawatts — Anjney Midha, AMP
Anjney Midha argues that AI infrastructure is facing a crisis of inefficiency and cultural misalignment, proposing that compute be treated as a utility through an Independent System Operator model that pools multi-cloud resources while embedding community incentives directly into unit economics.
🔬 The Limits of AI in Science - Why We Need Self-Driving Labs — Joseph Krause, Radical AI
Joseph Krause explains why AI alone cannot discover new industrial materials—unlike biology, alloys cannot be represented as simple strings and require physical ground truth across synthesis, microstructure, and processing. Radical AI is building self-driving labs to close the loop between AI hypothesis generation and automated experimentation, aiming to compress the 15-30 year materials development timeline.
⚡️Making DeepSeek v4 outperform Opus 4.7 with Taste — @AhmadAwais , CommandCode.ai
Ahmad Awais reveals how CommandCode.ai fixed DeepSeek v4's 'tool confusion' through deterministic repair logic, enabling the open-source model to outperform Claude Opus 4.7 by eliminating repetitive schema errors that previously caused an average of 56 failed tool calls per session.