Build Agents That Run for Hours (Without Losing the Plot) — Ash Prabaker & Andrew Wilson, Anthropic
TL;DR
Anthropic engineers Ash Prabaker and Andrew Wilson explain how to build AI agents that run for hours or days. Their approach pairs model improvements with 'harness' scaffolding: persistent state management, verification loops, and deterministic orchestration patterns that address context limitations, planning failures, and unreliable self-evaluation.
⚠️ Why Agents Fail at Long Runs
Context window anxiety and rot
As agents consume tokens, they suffer from 'context rot' (degraded coherence) and 'context anxiety' (rushing to finish when nearing limits), while fresh sessions cause amnesia about prior work.
Inability to plan and execute consistently
Without scaffolding, models attempt one-shot solutions, leave features half-implemented when contexts max out, or abandon tasks mid-stream rather than maintaining persistent progress.
Sycophantic self-evaluation
Models cannot reliably judge their own output quality, often declaring broken or incomplete implementations (like frontends without backends) as successfully completed to satisfy the prompt.
🔧 Harness Evolution and Key Primitives
The Ralph Wiggum deterministic loop
This technique breaks complex tasks into discrete sub-tasks processed in fresh context windows, embracing predictable failure over chaotic long-context degradation.
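The loop described above can be sketched in a few lines. This is a minimal illustration, not the speakers' implementation: `ralph_loop`, `run_agent`, and `max_attempts` are hypothetical names, and `run_agent` stands in for whatever call starts a brand-new agent session and reports success or failure.

```python
from typing import Callable, Dict, List


def ralph_loop(
    tasks: List[str],
    run_agent: Callable[[str], bool],  # hypothetical: starts a fresh session per call
    max_attempts: int = 3,
) -> Dict[str, bool]:
    """Run each sub-task in its own fresh agent session, retrying on failure."""
    results: Dict[str, bool] = {}
    for task in tasks:
        done = False
        for _ in range(max_attempts):
            try:
                # Only the task text is passed in. No transcript from earlier
                # attempts is carried over, so every attempt starts clean.
                done = run_agent(task)
            except Exception:
                # A crashed session is treated like any other failed attempt.
                done = False
            if done:
                break
        results[task] = done
    return results
```

The design choice is that a failed attempt is cheap and predictable: the loop simply starts over with a clean context instead of trying to repair a long, degraded one.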
Persistent artifact architecture
Production harnesses use machine-readable files like JSON feature lists and progress trackers (instead of markdown) to maintain state across sessions, paired with Git checkpoints for recovery.
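One way to picture the persistent state is a small JSON feature list that every fresh session reads and updates. The sketch below is an assumption-laden illustration: the file layout, the `id` and `passes` fields, and all function names are hypothetical, and `checkpoint` assumes `git` is installed and the directory is already a repository.

```python
import json
import os
import subprocess
from pathlib import Path
from typing import List, Optional


def save_features(path: Path, features: List[dict]) -> None:
    """Write the feature list atomically so a crash never leaves a half-written file."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(features, indent=2))
    os.replace(tmp, path)


def load_features(path: Path) -> List[dict]:
    return json.loads(path.read_text())


def next_pending(features: List[dict]) -> Optional[dict]:
    """Return the first feature not yet marked as passing, or None when all pass."""
    return next((f for f in features if not f["passes"]), None)


def mark_passing(path: Path, feature_id: str) -> None:
    features = load_features(path)
    matches = [f for f in features if f["id"] == feature_id]
    if not matches:
        raise KeyError(feature_id)
    for f in matches:
        f["passes"] = True
    save_features(path, features)


def checkpoint(repo: Path, message: str) -> None:
    """Commit the working tree as a recovery point.

    Raises CalledProcessError if git fails, including when there is nothing to commit.
    """
    subprocess.run(["git", "add", "-A"], cwd=repo, check=True)
    subprocess.run(["git", "commit", "-m", message], cwd=repo, check=True)
```

A new session would call `next_pending(load_features(path))` to find its task, so it needs no memory of earlier sessions.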
Agent teams and sub-agent coordination
Modern harnesses deploy specialized sub-agents that communicate directly with each other rather than routing everything through a central controller, enabling parallel workstreams.
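As a toy illustration of direct sub-agent communication, the sketch below has two coroutines share a queue instead of reporting through a controller. Everything here is hypothetical: the agent names, the "API contract" message, and the use of `asyncio.Queue` as the channel are assumptions chosen to keep the example small, and no model is actually called.

```python
import asyncio
from typing import List


async def backend_agent(outbox: asyncio.Queue) -> str:
    # Stand-in for real work: decide on an API contract, then hand it
    # straight to the peer that needs it.
    contract = {"method": "GET", "endpoint": "/todos"}
    await outbox.put(contract)
    return "backend done"


async def frontend_agent(inbox: asyncio.Queue) -> str:
    # Waits for the peer's message instead of asking a central controller.
    contract = await inbox.get()
    return f"frontend calls {contract['method']} {contract['endpoint']}"


async def run_team() -> List[str]:
    channel: asyncio.Queue = asyncio.Queue()
    # Both agents run concurrently; gather returns results in argument order.
    return await asyncio.gather(backend_agent(channel), frontend_agent(channel))


if __name__ == "__main__":
    print(asyncio.run(run_team()))
```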
🏗️ Architecture for 12-Hour Agents
Initializer agent decomposition
Long runs begin with a dedicated planning agent that converts vague prompts into structured feature lists, initialization scripts, and test suites before any code is written.
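A harness can check the initializer's output before any coding session starts. The sketch below assumes the planning agent returns a JSON list of features; the required keys (`id`, `description`, `test`) and the `parse_plan` name are hypothetical, not a schema from the talk.

```python
import json
from typing import List

# Hypothetical schema: every planned feature names itself, says what it
# does, and points at the test that will later verify it.
REQUIRED_KEYS = {"id", "description", "test"}


def parse_plan(raw: str) -> List[dict]:
    """Parse the initializer's JSON output and reject malformed plans early."""
    features = json.loads(raw)
    if not isinstance(features, list) or not features:
        raise ValueError("plan must be a non-empty JSON list")
    seen = set()
    for feature in features:
        if not isinstance(feature, dict):
            raise ValueError("each feature must be a JSON object")
        missing = REQUIRED_KEYS - set(feature)
        if missing:
            raise ValueError(f"feature is missing keys: {sorted(missing)}")
        if feature["id"] in seen:
            raise ValueError(f"duplicate feature id: {feature['id']}")
        seen.add(feature["id"])
        # Nothing counts as done until a verifier says so.
        feature["passes"] = False
    return features
```

Rejecting a malformed plan at this point is much cheaper than discovering hours later that a feature had no test attached.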
Objective verification loops
Rather than trusting the model's judgment, harnesses use deterministic tools like Puppeteer to test implementations objectively, preventing false 'done' declarations.
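The idea can be sketched as a gate that ignores the agent's own verdict unless an external check also passes. This is a simplified stand-in: the summary mentions Puppeteer, whereas the sketch runs any command and reads its exit code, and `verify` and `accept_if_verified` are hypothetical names.

```python
import subprocess
from typing import List


def verify(cmd: List[str], timeout: float = 60.0) -> bool:
    """Run a deterministic check and report whether it exited with status 0."""
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except (subprocess.TimeoutExpired, OSError):
        # A hung check or a missing executable counts as a failed verification.
        return False
    return result.returncode == 0


def accept_if_verified(claimed_done: bool, cmd: List[str]) -> bool:
    """Accept a 'done' claim only when the external check agrees."""
    return claimed_done and verify(cmd)
```

In practice `cmd` would launch the project's test runner or a browser-automation script; the harness only needs its exit status.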
Co-evolution of models and scaffolding
As base models improve from 1-hour to 12-hour capability windows, harnesses evolve from complex memory management toward lighter orchestration, with new gaps identified and filled iteratively.
Bottom Line
Build long-running agents on deterministic harness scaffolding (persistent JSON state files, verification loops, and a fresh context window per task) rather than relying on the base model to manage its own memory and planning over extended durations.