How Google DeepMind Runs Agents at Scale — KP Sawhney & Ian Ballantyne, Google DeepMind
TL;DR
Google DeepMind engineers Ian Ballantyne and KP Sawhney demonstrate their internal "Antigravity" agent platform, revealing how the organization manages massive-scale deployment through strict quota controls, hybrid model architectures, and collaborative multi-agent workflows while grappling with token consumption costs and evaluation complexity.
🛠️ Antigravity: Internal Agent Infrastructure 3 insights
IDE-integrated agent manager with browser control
Antigravity provides a VS Code-style interface where agents can spawn multiple instances, control browsers, inspect DOM elements, and capture screenshots or videos of interactions.
Human-in-the-loop workflow
The system generates editable implementation plans and scratchpad notes that allow users to review, modify, or reject agent actions before proceeding.
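The review step above can be pictured as a gate between planning and execution. The following is a minimal sketch, not Antigravity's actual data model: `PlanStep`, `review`, and the decision vocabulary are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class PlanStep:
    description: str
    status: str = "pending"  # one of: pending, approved, rejected


def review(plan, decisions):
    """Apply reviewer decisions keyed by step index.

    A decision is "approve", "reject", or any other string, which is
    treated as replacement text for the step (an edit implies approval).
    Returns only the steps the agent is allowed to execute.
    """
    for index, decision in decisions.items():
        step = plan[index]
        if decision == "approve":
            step.status = "approved"
        elif decision == "reject":
            step.status = "rejected"
        else:
            step.description = decision
            step.status = "approved"
    # Steps the reviewer never touched stay pending and are not run.
    return [step for step in plan if step.status == "approved"]


plan = [
    PlanStep("Read the specification"),
    PlanStep("Rewrite the whole module"),
    PlanStep("Run the tests in the browser"),
    PlanStep("Deploy to production"),
]
runnable = review(plan, {0: "approve", 1: "Refactor only the parser", 3: "reject"})
print([step.description for step in runnable])
```

Leaving untouched steps in a pending state, rather than approving them by default, is one way to keep the human in control of what proceeds.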
Built-in planning and tool use
Agents autonomously create to-do lists and utilize internal tools to analyze specifications, write code, and verify implementations through browser automation.
⚡ Scaling Challenges & Resource Management 3 insights
Token hunger as primary bottleneck
Managing per-user and per-team quotas is critical at Google scale to prevent power users from overwhelming systems with token-hungry agentic workflows.
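As a rough illustration of per-user and per-team quotas, the sketch below rejects any request that would push either budget over its cap. The `QuotaLedger` class, the limits, and the user and team names are all hypothetical; the talk summary does not describe how the real system is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class QuotaLedger:
    """Tracks token spend against hypothetical per-user and per-team caps."""

    user_limit: int
    team_limit: int
    user_spend: dict = field(default_factory=dict)
    team_spend: dict = field(default_factory=dict)

    def try_spend(self, user: str, team: str, tokens: int) -> bool:
        """Record the spend and return True only if both caps still hold."""
        new_user = self.user_spend.get(user, 0) + tokens
        new_team = self.team_spend.get(team, 0) + tokens
        if new_user > self.user_limit or new_team > self.team_limit:
            return False  # rejected requests are not recorded
        self.user_spend[user] = new_user
        self.team_spend[team] = new_team
        return True


ledger = QuotaLedger(user_limit=1_000, team_limit=1_500)
results = [
    ledger.try_spend("alice", "research", 800),  # within both caps
    ledger.try_spend("alice", "research", 300),  # alice would exceed her cap
    ledger.try_spend("bob", "research", 600),    # within both caps
    ledger.try_spend("bob", "research", 200),    # the team would exceed its cap
]
print(results)
```

Checking both caps before recording anything means one power user cannot exhaust a team's budget, and one busy team cannot be drained by a single member.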
Hybrid model routing strategy
DeepMind mixes free local models like Gemma 4 for general tasks with advanced APIs for specific components to optimize costs and quota efficiency.
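One way to picture this kind of routing is a small function that sends only selected task kinds to the metered API and everything else to a local model. This is a sketch under assumptions: the model labels, the `Task` type, and the set of task kinds that justify the hosted model are invented for illustration, not taken from the talk.

```python
from dataclasses import dataclass

# Hypothetical labels; they are not real model identifiers.
LOCAL_MODEL = "local-open-model"
HOSTED_MODEL = "hosted-frontier-api"

# Assumed set of task kinds worth spending hosted quota on.
HOSTED_KINDS = {"planning", "code_review"}


@dataclass
class Task:
    kind: str
    prompt: str


def route(task: Task, hosted_tokens_left: int) -> str:
    """Pick a backend: hosted only for selected kinds while quota remains."""
    if task.kind in HOSTED_KINDS and hosted_tokens_left > 0:
        return HOSTED_MODEL
    return LOCAL_MODEL


choices = [
    route(Task("summarize", "Summarize this log"), hosted_tokens_left=5_000),
    route(Task("planning", "Plan the migration"), hosted_tokens_left=5_000),
    route(Task("planning", "Plan the migration"), hosted_tokens_left=0),
]
print(choices)
```

Falling back to the local model when hosted quota runs out keeps the workflow moving instead of failing outright, at the cost of possibly weaker output on those steps.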
Incompatibility of subscription models
Traditional subscription pricing breaks down for agentic systems. New metering approaches are needed; for now, brute-force quota limits are the primary defense.
🔮 Evolution of Deep Research & Multi-Agent Systems 3 insights
Shift from monolithic to collaborative architecture
Deep Research is evolving from passing massive text blobs between stages to agents collaborating via shared filesystem workspaces.
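The difference between passing text blobs and sharing a workspace can be sketched with two toy stages that communicate only through files in a common directory. The agent functions, file names, and findings below are placeholders, and no model is called; this is not how Deep Research itself is built.

```python
import json
import tempfile
from pathlib import Path


def research_agent(workspace: Path, topic: str) -> Path:
    """Write structured notes into the shared workspace."""
    notes = workspace / "notes.json"
    payload = {"topic": topic, "findings": ["finding A", "finding B"]}
    notes.write_text(json.dumps(payload))
    return notes


def writer_agent(workspace: Path) -> Path:
    """Read the notes from the workspace and produce a report file."""
    data = json.loads((workspace / "notes.json").read_text())
    lines = [f"# {data['topic']}"] + [f"- {item}" for item in data["findings"]]
    report = workspace / "report.md"
    report.write_text("\n".join(lines))
    return report


with tempfile.TemporaryDirectory() as tmp:
    shared = Path(tmp)
    research_agent(shared, "agent quotas")
    report_text = writer_agent(shared).read_text()

print(report_text)
```

Because each stage reads only the files it needs, intermediate results do not have to travel through every prompt, and a later agent can pick up the same workspace without a handoff message.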
Darwinian skill library approach
Internal users contribute to a sprawling library of skills where only the best survive through organizational natural selection, democratizing expert knowledge.
Human-supervised digital assembly lines
The future vision involves humans overseeing parallel agent tracks that communicate efficiently, moving beyond current single-agent limitations.
🔍 Evaluation & Observability 3 insights
Custom trajectory stores and tracing
DeepMind built internal tools, including agent trajectory stores, to pinpoint exact failure points and looping behaviors, with hierarchical drill-downs to the raw model requests.
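A trajectory store can be approximated as an ordered log of tool calls with helpers that locate the first failure and any run of identical repeated calls. The `Step` and `Trajectory` classes below are a hypothetical sketch of that idea, not DeepMind's internal tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    tool: str
    args: str
    ok: bool


@dataclass
class Trajectory:
    steps: list = field(default_factory=list)

    def record(self, tool: str, args: str, ok: bool = True) -> None:
        self.steps.append(Step(tool, args, ok))

    def first_failure(self):
        """Return the index of the first failed step, or None."""
        for index, step in enumerate(self.steps):
            if not step.ok:
                return index
        return None

    def find_loop(self, min_repeats: int = 3):
        """Return the start index of the first run of at least
        min_repeats identical consecutive calls, or None."""
        run_start = 0
        for i in range(1, len(self.steps) + 1):
            same = i < len(self.steps) and (
                (self.steps[i].tool, self.steps[i].args)
                == (self.steps[run_start].tool, self.steps[run_start].args)
            )
            if not same:
                if i - run_start >= min_repeats:
                    return run_start
                run_start = i
        return None


trajectory = Trajectory()
trajectory.record("read_file", "spec.md")
for _ in range(3):
    trajectory.record("run_tests", "unit", ok=False)
trajectory.record("edit_file", "main.py")

print(trajectory.first_failure(), trajectory.find_loop())
```

A real store would also keep the raw model request behind each step so a reviewer can drill down from the looping run to the exact prompt that produced it.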
Expensive evaluation infrastructure
Testing requires sandboxed environments and mock TPUs to validate agent harnesses without consuming scarce production compute resources.
Skills preferred over MCP
While both are supported, KP Sawhney advocates for skills combined with guardrailed CLI interactions over MCP, viewing MCP as potentially transient except for its role in authentication.
Bottom Line
Organizations scaling agents must implement strict quota controls and hybrid model routing to manage costs. To improve reliability, they should also move from monolithic architectures to collaborative multi-agent systems built on shared workspaces.