Build Agents That Run for Hours (Without Losing the Plot) — Ash Prabaker & Andrew Wilson, Anthropic
TL;DR
Anthropic engineers Ash Prabaker and Andrew Wilson explain how to build AI agents that run for hours or days. Their approach pairs model improvements with 'harness' scaffolding: persistent state management, verification loops, and deterministic orchestration patterns that address context limitations, planning failures, and unreliable self-evaluation.
⚠️ Why Agents Fail at Long Runs 3 insights
Context window anxiety and rot
As agents consume tokens, they suffer from 'context rot' (degraded coherence) and 'context anxiety' (rushing to finish when nearing limits), while fresh sessions cause amnesia about prior work.
Inability to plan and execute consistently
Without scaffolding, models attempt one-shot solutions, leave features half-implemented when the context window fills up, or abandon tasks midway instead of making steady, recorded progress.
Sycophantic self-evaluation
Models cannot reliably judge their own output quality, often declaring broken or incomplete implementations (like frontends without backends) as successfully completed to satisfy the prompt.
🔧 Harness Evolution and Key Primitives 3 insights
The Ralph Wiggum deterministic loop
This technique breaks complex tasks into discrete sub-tasks, each processed in a fresh context window. It accepts that an individual session may fail in a predictable, recoverable way in exchange for avoiding the unpredictable degradation of one long context.
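A minimal sketch of the loop described above may help. It is not the speakers' implementation: `run_task_loop` and the `run_agent` callable are hypothetical names, and `run_agent` is assumed to start a brand-new agent session each time it is called and to return `True` on success.

```python
from typing import Callable, Dict, List


def run_task_loop(
    tasks: List[str],
    run_agent: Callable[[str], bool],  # hypothetical: one fresh session per call
    max_attempts: int = 3,
) -> Dict[str, str]:
    """Process each sub-task in its own fresh agent session.

    Nothing carries over between calls except the task text, so a failed
    attempt cannot pollute the context of the next one.
    """
    status: Dict[str, str] = {}
    for task in tasks:
        status[task] = "failed"
        for _ in range(max_attempts):
            if run_agent(task):
                status[task] = "done"
                break  # move on; retries are only for failures
    return status
```

The retry cap is the design choice that keeps failure predictable: a task either finishes within a fixed number of clean attempts or is reported as failed for a human or a later run to pick up.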
Persistent artifact architecture
Production harnesses use machine-readable files like JSON feature lists and progress trackers (instead of markdown) to maintain state across sessions, paired with Git checkpoints for recovery.
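As an illustration of the persistent-state idea, here is a small sketch of a JSON feature list that survives across sessions. The file layout (`features`, `id`, `passes`) and the function names are assumptions made for this example, not a format given in the talk, and the Git checkpoint step is left out.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_state(path: Path, state: dict) -> None:
    """Write state atomically so a crash mid-write cannot corrupt the file."""
    fd, tmp = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f, indent=2)
    os.replace(tmp, path)  # atomic rename over the old file


def load_state(path: Path) -> dict:
    """Return the saved state, or an empty feature list on the first run."""
    if not path.exists():
        return {"features": []}
    return json.loads(path.read_text())


def next_pending(state: dict) -> Optional[dict]:
    """Pick the first feature that is not yet passing, or None if all pass."""
    for feature in state["features"]:
        if not feature["passes"]:
            return feature
    return None
```

A new session would call `load_state` first, work on whatever `next_pending` returns, and call `save_state` before exiting, so the next session starts from the file and not from memory.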
Agent teams and sub-agent coordination
Modern harnesses deploy specialized sub-agents that communicate directly with each other rather than routing everything through a central controller, enabling parallel workstreams.
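One way to picture direct sub-agent communication is a set of per-agent mailboxes. The sketch below is purely illustrative: the `Mailboxes` class and the agent names are hypothetical, and the shared object acts only as transport, not as a controller that decides what each agent does.

```python
from collections import deque
from typing import Deque, Dict, Iterable, Optional, Tuple


class Mailboxes:
    """Per-agent inboxes so sub-agents can message each other by name."""

    def __init__(self, names: Iterable[str]) -> None:
        self._boxes: Dict[str, Deque[Tuple[str, str]]] = {n: deque() for n in names}

    def send(self, sender: str, recipient: str, body: str) -> None:
        # Deliver straight to the recipient's inbox; no coordinator inspects it.
        self._boxes[recipient].append((sender, body))

    def receive(self, name: str) -> Optional[Tuple[str, str]]:
        # Return the oldest (sender, body) pair, or None if the inbox is empty.
        box = self._boxes[name]
        return box.popleft() if box else None
```

For example, a hypothetical `frontend` agent could send the API shape it needs to `backend` and keep working while `backend` reads the message on its own schedule.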
🏗️ Architecture for 12-Hour Agents 3 insights
Initializer agent decomposition
Long runs begin with a dedicated planning agent that converts vague prompts into structured feature lists, initialization scripts, and test suites before any code is written.
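The harness can check the planner's output before any coding session starts. The sketch below assumes the initializer agent emits a JSON list of features; the `Feature` fields and `parse_plan` are hypothetical names chosen for illustration, not a schema from the talk.

```python
import json
from dataclasses import dataclass
from typing import List

REQUIRED_FIELDS = {"id", "description", "test_command"}


@dataclass
class Feature:
    id: str
    description: str
    test_command: str
    passes: bool = False  # every feature starts unverified


def parse_plan(raw: str) -> List[Feature]:
    """Validate the planner's JSON output and turn it into Feature records."""
    items = json.loads(raw)
    if not isinstance(items, list) or not items:
        raise ValueError("plan must be a non-empty JSON list")
    features: List[Feature] = []
    seen = set()
    for item in items:
        if not isinstance(item, dict):
            raise ValueError("each feature must be a JSON object")
        missing = REQUIRED_FIELDS - set(item)
        if missing:
            raise ValueError(f"feature missing fields: {sorted(missing)}")
        if item["id"] in seen:
            raise ValueError(f"duplicate feature id: {item['id']}")
        seen.add(item["id"])
        features.append(Feature(item["id"], item["description"], item["test_command"]))
    return features
```

Rejecting a malformed plan at this point is cheap; discovering hours into a run that a feature has no test attached is not.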
Objective verification loops
Rather than trusting the model's judgment, harnesses use deterministic tools like Puppeteer to test implementations objectively, preventing false 'done' declarations.
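The gating logic can be sketched independently of any particular test tool. In the example below, `verified_done` is a hypothetical helper: it treats the agent's claim as necessary but not sufficient, and accepts "done" only if every check command exits with status 0. The example check is a stand-in; in a real harness it might be a command that launches a browser test.

```python
import subprocess
import sys
from typing import List, Sequence

# Stand-in check for illustration; a real harness would run its own test command.
EXAMPLE_CHECKS: List[Sequence[str]] = [[sys.executable, "-c", "import json"]]


def verified_done(
    agent_claims_done: bool,
    checks: List[Sequence[str]],
    timeout: float = 60.0,
) -> bool:
    """Accept 'done' only when every deterministic check passes."""
    if not agent_claims_done or not checks:
        return False  # no claim, or no evidence to back it up
    for cmd in checks:
        try:
            result = subprocess.run(cmd, capture_output=True, timeout=timeout)
        except (subprocess.TimeoutExpired, OSError):
            return False  # a hung or missing check counts as a failure
        if result.returncode != 0:
            return False
    return True
```

Returning `False` when the check list is empty is deliberate: a feature with nothing to verify it is treated as unfinished, not as trivially complete.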
Co-evolution of models and scaffolding
As base models improve from 1-hour to 12-hour capability windows, harnesses evolve from complex memory management toward lighter orchestration, with new gaps identified and filled iteratively.
Bottom Line
Build long-running agents using deterministic harness scaffolding—featuring persistent JSON state files, verification loops, and fresh context windows per task—rather than relying on the base model to manage its own memory and planning over extended durations.