Build Agents That Run for Hours (Without Losing the Plot) — Ash Prabaker & Andrew Wilson, Anthropic

| Podcasts | May 18, 2026 | 34.3 Thousand views | 1:15:40

TL;DR

Anthropic engineers Ash Prabaker and Andrew Wilson explain how to build AI agents that run for hours or days. Their approach pairs model improvements with 'harness' scaffolding (persistent state management, verification loops, and deterministic orchestration patterns) to address three problems: context limitations, planning failures, and unreliable self-evaluation.

⚠️ Why Agents Fail at Long Runs

Context window anxiety and rot

As agents consume tokens, they suffer from 'context rot' (degraded coherence) and 'context anxiety' (rushing to finish when nearing the context limit). Starting a fresh session avoids both, but the agent then has no memory of prior work.

Inability to plan and execute consistently

Without scaffolding, models attempt one-shot solutions, leave features half-implemented when the context window fills up, or abandon tasks partway through instead of making persistent, incremental progress.

Sycophantic self-evaluation

Models cannot reliably judge the quality of their own output. They often declare broken or incomplete implementations (such as a frontend with no backend) complete in order to satisfy the prompt.

🔧 Harness Evolution and Key Primitives

The Ralph Wiggum deterministic loop

This technique breaks complex tasks into discrete sub-tasks processed in fresh context windows, embracing predictable failure over chaotic long-context degradation.
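
The loop described above can be sketched roughly as follows. This is a minimal illustration, not the speakers' implementation: `SubTask`, `run_loop`, and the `run_in_fresh_context` callable (which would start a new model session per call) are hypothetical names, and the retry limit is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SubTask:
    name: str
    done: bool = False
    attempts: int = 0


def run_loop(
    tasks: List[SubTask],
    run_in_fresh_context: Callable[[str], bool],  # hypothetical: one new session per call
    max_attempts: int = 3,                        # assumed retry budget
) -> List[str]:
    """Run each sub-task in its own fresh context; return names that never passed."""
    for task in tasks:
        while not task.done and task.attempts < max_attempts:
            task.attempts += 1
            # Only the sub-task description is passed in, never the prior
            # transcript, so a failure stays contained to this attempt.
            task.done = run_in_fresh_context(task.name)
    return [t.name for t in tasks if not t.done]
```

The point of the design is that a failed attempt is cheap and repeatable: the harness simply retries with a clean context instead of letting one long session degrade.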

Persistent artifact architecture

Production harnesses use machine-readable files like JSON feature lists and progress trackers (instead of markdown) to maintain state across sessions, paired with Git checkpoints for recovery.
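
As a rough sketch of this pattern, the snippet below keeps a JSON feature list on disk and commits a Git checkpoint after each unit of work. The file layout (`features`, `id`, `status`) and all function names are assumptions made for illustration; the talk summary does not specify a schema.

```python
import json
import subprocess
from pathlib import Path
from typing import Optional


def load_state(path: Path) -> dict:
    """Read the feature list, or start empty if no earlier session wrote one."""
    if path.exists():
        return json.loads(path.read_text())
    return {"features": []}


def save_state(path: Path, state: dict) -> None:
    """Write to a temp file, then rename, so a crash cannot leave a half-written file."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state, indent=2))
    tmp.replace(path)


def next_pending(state: dict) -> Optional[dict]:
    """Return the first feature still marked pending, or None when all are done."""
    for feature in state["features"]:
        if feature["status"] == "pending":
            return feature
    return None


def git_checkpoint(repo_dir: Path, message: str) -> None:
    """Commit the working tree so a later session can recover from this point."""
    subprocess.run(["git", "add", "-A"], cwd=repo_dir, check=True)
    subprocess.run(["git", "commit", "-m", message], cwd=repo_dir, check=True)
```

A new session would call `load_state`, pick up `next_pending`, do the work, then `save_state` and `git_checkpoint` before exiting.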

Agent teams and sub-agent coordination

Modern harnesses deploy specialized sub-agents that communicate directly with each other rather than routing everything through a central controller, enabling parallel workstreams.
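
One way to picture direct sub-agent communication is a set of per-agent inboxes that peers write to without a controller relaying each message. The `Mailboxes` class below is a hypothetical, in-process illustration of that idea only; a real harness would likely use files, sockets, or a message queue.

```python
from collections import deque
from typing import Deque, Dict, Iterable, Optional, Tuple


class Mailboxes:
    """Per-agent inboxes; any agent can deliver straight to any other agent."""

    def __init__(self, agent_names: Iterable[str]) -> None:
        self._inboxes: Dict[str, Deque[Tuple[str, str]]] = {
            name: deque() for name in agent_names
        }

    def send(self, sender: str, recipient: str, body: str) -> None:
        # Delivered directly to the recipient's inbox; no central router decides.
        self._inboxes[recipient].append((sender, body))

    def receive(self, name: str) -> Optional[Tuple[str, str]]:
        """Pop the oldest (sender, body) message for this agent, or None if empty."""
        inbox = self._inboxes[name]
        return inbox.popleft() if inbox else None
```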

🏗️ Architecture for 12-Hour Agents

Initializer agent decomposition

Long runs begin with a dedicated planning agent that converts vague prompts into structured feature lists, initialization scripts, and test suites before any code is written.
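
A harness might validate the initializer's output before any coding session starts. The sketch below assumes the planning agent returns JSON with a `features` array whose items carry `id`, `description`, and `test` fields; that schema and the names `Feature` and `parse_plan` are illustrative assumptions, not details from the talk.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Feature:
    id: str
    description: str
    test: str  # assumed: a command or test name that verifies this feature


def parse_plan(raw: str) -> List[Feature]:
    """Parse the planner's JSON and reject plans that cannot be tracked or verified."""
    data = json.loads(raw)
    features: List[Feature] = []
    seen = set()
    for item in data["features"]:
        feature = Feature(item["id"], item["description"], item["test"])
        if feature.id in seen:
            raise ValueError(f"duplicate feature id: {feature.id}")
        if not feature.test.strip():
            raise ValueError(f"feature {feature.id} has no test")
        seen.add(feature.id)
        features.append(feature)
    if not features:
        raise ValueError("plan contains no features")
    return features
```

Rejecting a plan with missing tests at this stage keeps every later session tied to a check the harness can actually run.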

Objective verification loops

Rather than trusting the model's judgment, harnesses use deterministic tools like Puppeteer to test implementations objectively, preventing false 'done' declarations.
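
The gating idea can be sketched as follows. The talk names Puppeteer as the checking tool; this sketch swaps in a generic external command (for example, a script that drives a browser test) and treats exit code 0 as a pass. The function names and the timeout are assumptions.

```python
import subprocess
import sys
from typing import List


def verify(command: List[str], timeout_s: float = 60.0) -> bool:
    """Run an external check; only exit code 0 within the timeout counts as a pass."""
    try:
        result = subprocess.run(command, capture_output=True, timeout=timeout_s)
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def accept_if_verified(feature: dict, agent_claims_done: bool, command: List[str]) -> dict:
    """Mark a feature done only when the agent's claim is backed by a passing check."""
    passed = agent_claims_done and verify(command)
    return {**feature, "status": "done" if passed else "pending"}


if __name__ == "__main__":
    # Stand-in for a real check: a child process that always exits 0.
    always_passes = [sys.executable, "-c", "raise SystemExit(0)"]
    print(accept_if_verified({"id": "login", "status": "pending"}, True, always_passes))
```

The agent's own "done" claim is necessary but never sufficient: the status changes only when the external check agrees.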

Co-evolution of models and scaffolding

As base models improve from 1-hour to 12-hour capability windows, harnesses evolve from complex memory management toward lighter orchestration, with new gaps identified and filled iteratively.

Bottom Line

Build long-running agents on deterministic harness scaffolding (persistent JSON state files, verification loops, and a fresh context window per task) rather than relying on the base model to manage its own memory and planning over extended runs.
