Build Agents That Run for Hours (Without Losing the Plot) — Ash Prabaker & Andrew Wilson, Anthropic

| Podcasts | May 18, 2026 | 80.5 Thousand views | 1:15:40

TL;DR

Anthropic engineers Ash Prabaker and Andrew Wilson explain how to build AI agents that run for hours or days. Their approach pairs model improvements with 'harness' scaffolding (persistent state management, verification loops, and deterministic orchestration patterns) to address three failure modes: context limitations, planning failures, and unreliable self-evaluation.

⚠️ Why Agents Fail at Long Runs 3 insights

Context window anxiety and rot

As agents consume tokens, they suffer from 'context rot' (coherence degrades as the window fills) and 'context anxiety' (the agent rushes to finish as it nears the limit). Starting a fresh session is no cure on its own, because the new session has no memory of prior work.

Inability to plan and execute consistently

Without scaffolding, models attempt one-shot solutions, leave features half-implemented when contexts max out, or abandon tasks mid-stream rather than maintaining persistent progress.

Sycophantic self-evaluation

Models cannot reliably judge the quality of their own output. They often declare broken or incomplete implementations (such as a frontend with no backend) complete in order to satisfy the prompt.

🔧 Harness Evolution and Key Primitives 3 insights

The Ralph Wiggum deterministic loop

This technique breaks a complex task into discrete sub-tasks and processes each one in a fresh context window. It accepts predictable, per-task failures in exchange for avoiding the chaotic degradation of a single long context.
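
The loop described above can be sketched in a few lines. This is a minimal illustration, not the speakers' implementation: `run_agent` is a hypothetical callable standing in for whatever starts a brand-new model session, and `fake_agent_factory` is an invented stub that lets the loop run without a model.

```python
from typing import Callable, Dict, List


def run_fresh_context_loop(
    subtasks: List[str],
    run_agent: Callable[[str], bool],
    max_attempts: int = 3,
) -> Dict[str, bool]:
    """Process each sub-task in its own fresh agent session.

    `run_agent` is a hypothetical callable: it is assumed to start a new
    session with no carried-over history and return True on success.
    """
    results: Dict[str, bool] = {}
    for task in subtasks:
        done = False
        for _ in range(max_attempts):
            # Each attempt receives only the task text, never a prior transcript.
            if run_agent(task):
                done = True
                break
        results[task] = done
    return results


def fake_agent_factory() -> Callable[[str], bool]:
    """Stand-in for a real model call: fails the first attempt at each task."""
    attempts: Dict[str, int] = {}

    def fake_agent(prompt: str) -> bool:
        attempts[prompt] = attempts.get(prompt, 0) + 1
        return attempts[prompt] >= 2

    return fake_agent
```

Because every attempt starts from the same short prompt, a failed attempt costs one retry rather than polluting the context of the tasks that follow.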

Persistent artifact architecture

Production harnesses use machine-readable files like JSON feature lists and progress trackers (instead of markdown) to maintain state across sessions, paired with Git checkpoints for recovery.
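
As a rough sketch of what such a state file might look like in practice, the code below keeps a JSON feature list on disk and commits the working tree after each step. The file layout and the field names (`id`, `description`, `passes`) are assumptions made for illustration; the talk summary does not specify a schema.

```python
import json
import subprocess
from pathlib import Path
from typing import List, Optional, TypedDict


class Feature(TypedDict):
    id: str
    description: str
    passes: bool  # assumed field: flipped to True once the feature is verified


def save_features(path: Path, features: List[Feature]) -> None:
    # Write to a temp file first so a crash cannot leave a half-written state file.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(features, indent=2))
    tmp.replace(path)


def load_features(path: Path) -> List[Feature]:
    return json.loads(path.read_text())


def next_unfinished(features: List[Feature]) -> Optional[Feature]:
    """Return the first feature not yet marked as passing, or None if all pass."""
    for feature in features:
        if not feature["passes"]:
            return feature
    return None


def mark_passed(path: Path, feature_id: str) -> None:
    features = load_features(path)
    for feature in features:
        if feature["id"] == feature_id:
            feature["passes"] = True
    save_features(path, features)


def git_checkpoint(repo_dir: str, message: str) -> None:
    """Commit the working tree so a later session can roll back to this point.

    Raises CalledProcessError if git fails, including when there is
    nothing to commit.
    """
    subprocess.run(["git", "add", "-A"], cwd=repo_dir, check=True)
    subprocess.run(["git", "commit", "-m", message], cwd=repo_dir, check=True)
```

A new session would call `load_features` and `next_unfinished` to find out where the previous one stopped, which is what lets it start with an empty context and still pick up the work.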

Agent teams and sub-agent coordination

Modern harnesses deploy specialized sub-agents that communicate directly with each other rather than routing everything through a central controller, enabling parallel workstreams.
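
One way to picture direct sub-agent communication is a set of peers that each own an inbox and write into one another's inboxes, with no controller in between. The `SubAgent` class below is a hypothetical in-memory model of that idea only; it is single-threaded and does not call a model or run anything in parallel.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List, Tuple


@dataclass
class SubAgent:
    """Hypothetical sub-agent with a private inbox and direct links to peers."""

    name: str
    peers: Dict[str, "SubAgent"] = field(default_factory=dict, repr=False)
    inbox: Deque[Tuple[str, str]] = field(default_factory=deque)

    def connect(self, other: "SubAgent") -> None:
        # Links are symmetric: either side can message the other.
        self.peers[other.name] = other
        other.peers[self.name] = self

    def send(self, to: str, message: str) -> None:
        # Deliver straight into the peer's inbox; nothing routes through a hub.
        self.peers[to].inbox.append((self.name, message))

    def drain(self) -> List[Tuple[str, str]]:
        """Return all pending (sender, message) pairs and empty the inbox."""
        messages = list(self.inbox)
        self.inbox.clear()
        return messages
```

For example, a `frontend` agent connected to a `backend` agent can call `frontend.send("backend", "need GET /items")`, and the backend sees that request the next time it drains its inbox.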

🏗️ Architecture for 12-Hour Agents 3 insights

Initializer agent decomposition

Long runs begin with a dedicated planning agent that converts vague prompts into structured feature lists, initialization scripts, and test suites before any code is written.
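
A harness could enforce this ordering by refusing to start coding until the planner's output parses into a well-formed feature list. The sketch below assumes the planning agent emits a JSON array; the `PlannedFeature` fields (`id`, `description`, `test_command`) are illustrative choices, not a format given in the talk.

```python
import json
from dataclasses import dataclass
from typing import List

REQUIRED_KEYS = {"id", "description", "test_command"}  # assumed schema


@dataclass(frozen=True)
class PlannedFeature:
    id: str
    description: str
    test_command: str


def parse_plan(raw: str) -> List[PlannedFeature]:
    """Validate planner output; raise ValueError if it is not usable as a plan."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"plan is not valid JSON: {exc}") from exc
    if not isinstance(data, list) or not data:
        raise ValueError("plan must be a non-empty JSON array")

    features: List[PlannedFeature] = []
    seen_ids = set()
    for index, item in enumerate(data):
        if not isinstance(item, dict):
            raise ValueError(f"entry {index} is not an object")
        missing = REQUIRED_KEYS - set(item)
        if missing:
            raise ValueError(f"entry {index} is missing {sorted(missing)}")
        if item["id"] in seen_ids:
            raise ValueError(f"duplicate feature id {item['id']!r}")
        seen_ids.add(item["id"])
        features.append(
            PlannedFeature(item["id"], item["description"], item["test_command"])
        )
    return features
```

Rejecting a malformed plan at this point is cheap; discovering hours into a run that a feature never had a test attached is not.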

Objective verification loops

Rather than trusting the model's judgment, harnesses use deterministic tools like Puppeteer to test implementations objectively, preventing false 'done' declarations.
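
The gating logic can be sketched independently of any browser tooling. The example below swaps Puppeteer for a generic external command whose exit code serves as the objective check, so it shows the shape of the loop rather than the speakers' setup. `attempt` is a hypothetical stand-in for one agent work session.

```python
import subprocess
from typing import Callable, List


def check_passes(command: List[str], timeout_s: float = 60.0) -> bool:
    """Run an external check; only a zero exit code counts as passing."""
    try:
        completed = subprocess.run(command, capture_output=True, timeout=timeout_s)
    except (subprocess.TimeoutExpired, OSError):
        # A hung or missing check is treated as a failure, never as a pass.
        return False
    return completed.returncode == 0


def work_until_verified(
    attempt: Callable[[int], None],
    command: List[str],
    max_rounds: int = 3,
) -> bool:
    """Alternate agent work with an objective check until the check passes.

    `attempt` is a hypothetical callable representing one agent session.
    Whatever the agent claims about being done is ignored; only the
    check's result ends the loop early.
    """
    for round_number in range(1, max_rounds + 1):
        attempt(round_number)
        if check_passes(command):
            return True
    return False
```

In a real harness, `command` might invoke a test runner or a browser automation script; the point of the design is that the decision to stop lives outside the model.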

Co-evolution of models and scaffolding

As base models improve from 1-hour to 12-hour capability windows, harnesses evolve from complex memory management toward lighter orchestration, with new gaps identified and filled iteratively.

Bottom Line

Build long-running agents on deterministic harness scaffolding (persistent JSON state files, verification loops, and a fresh context window per task) rather than relying on the base model to manage its own memory and planning over extended durations.
