Agents Building Agents - Alfonso Graziano, Nearform
TL;DR
Alfonso Graziano from NearForm demonstrates how coding agents can autonomously improve AI agent performance through iterative evaluation loops, raising a naive agent's accuracy from 18% to 83% and finding a further 10% improvement on a production agent that engineers had already optimized.
🧪 Golden Datasets and Evaluation Frameworks 3 insights
Golden datasets as non-deterministic test suites
Subject matter experts create datasets that pair each input with its expected output, including the specific tool calls, their parameters, and the order in which they are chained. Running the agent against this dataset gives an accuracy baseline even though the system is non-deterministic.
Scorers measure agent accuracy quantitatively
Custom scoring functions evaluate whether agent outputs match expected results, enabling regression detection and iterative improvement tracking.
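A golden dataset and a scorer can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names ToolCall, GoldenCase, exact_chain_scorer, accuracy, and toy_agent are hypothetical and not taken from the talk, and the scorer uses the strictest possible rule (the whole tool-call chain must match exactly), whereas a real scorer might award partial credit.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ToolCall:
    """One expected or observed tool call (hypothetical shape)."""
    name: str
    params: Tuple[Tuple[str, object], ...]  # sorted (key, value) pairs


def call(name: str, **params) -> ToolCall:
    # Sort parameters so that argument order does not affect equality.
    return ToolCall(name, tuple(sorted(params.items())))


@dataclass(frozen=True)
class GoldenCase:
    """One row of the golden dataset: an input and the expected call chain."""
    user_input: str
    expected_chain: Tuple[ToolCall, ...]


def exact_chain_scorer(expected, actual) -> float:
    # Strict scorer: same tools, same parameters, same order.
    return 1.0 if tuple(expected) == tuple(actual) else 0.0


def accuracy(dataset, agent, scorer=exact_chain_scorer) -> float:
    # Mean score over the dataset; `agent` maps an input to a list of ToolCall.
    scores = [scorer(c.expected_chain, agent(c.user_input)) for c in dataset]
    return sum(scores) / len(scores)


GOLDEN = [
    GoldenCase("Weather in Rome tomorrow?",
               (call("get_weather", city="Rome", day="tomorrow"),)),
    GoldenCase("Book a table for two in Rome",
               (call("find_restaurant", city="Rome"),
                call("book_table", seats=2))),
]


def toy_agent(user_input: str):
    # Stand-in for a real agent; it forgets the booking step on purpose.
    if "Weather" in user_input:
        return [call("get_weather", city="Rome", day="tomorrow")]
    return [call("find_restaurant", city="Rome")]


baseline = accuracy(GOLDEN, toy_agent)
print(baseline)
```

Because the toy agent drops the second call in the booking chain, one of the two cases fails. Tracking this number across changes is what makes regressions visible.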
Common failure modes in agent systems
Poor evaluation performance typically stems from missing tools, inadequate system prompts, or insufficient context retrieval mechanisms.
🔄 AutoAgent: Autonomous Optimization Loops 3 insights
Coding agents iteratively improve target agents
Inspired by Andrej Karpathy's autoresearch project, the approach uses Claude Code as an optimization engine that modifies agent code, system prompts, and tool descriptions based on evaluation feedback.
Branch-based hypothesis testing with rollback
The system creates git branches for each hypothesis, runs evaluation suites, and automatically rolls back changes that cause regressions while preserving improvements.
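The keep-or-roll-back logic described above can be sketched without touching a real repository. In the sketch below, a dictionary copy stands in for a git branch, and a toy evaluate function stands in for the evaluation suite. All names (optimize, evaluate, the configuration keys) are hypothetical, and the scoring rule is invented purely to make the example runnable. In a real setup each hypothesis would live on its own branch and a failed one would simply be abandoned.

```python
from typing import Callable, Dict, List, Tuple

Config = Dict[str, str]


def optimize(baseline: Config,
             hypotheses: List[Tuple[str, Config]],
             evaluate: Callable[[Config], int]):
    """Try each hypothesis in isolation; keep it only if the score improves."""
    best = dict(baseline)
    best_score = evaluate(best)
    log = []
    for name, patch in hypotheses:
        candidate = {**best, **patch}   # the "branch": a copy with one change
        score = evaluate(candidate)     # run the evaluation suite on it
        if score > best_score:
            best, best_score = candidate, score  # the "merge"
            log.append((name, score, "kept"))
        else:
            log.append((name, score, "rolled back"))  # discard the branch
    return best, best_score, log


def evaluate(cfg: Config) -> int:
    # Toy evaluation suite: number of passing cases out of 10 (made-up rule).
    return (2
            + 3 * (cfg.get("prompt") == "detailed")
            + 3 * (cfg.get("tool_docs") == "explicit")
            - 1 * (cfg.get("temperature") == "high"))


baseline_cfg = {"prompt": "naive", "tool_docs": "terse", "temperature": "low"}
hypotheses = [
    ("detailed system prompt", {"prompt": "detailed"}),
    ("raise temperature", {"temperature": "high"}),
    ("explicit tool descriptions", {"tool_docs": "explicit"}),
]

final_cfg, final_score, history = optimize(baseline_cfg, hypotheses, evaluate)
for entry in history:
    print(entry)
```

The second hypothesis lowers the score and is discarded, so the third one builds on the last good state rather than on the regression.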
Documented performance gains beyond human tuning
The loop improved a naive agent from 18% to 83% accuracy in ten iterations and found an additional 10% improvement on a production agent already optimized by engineers.
📊 Live Data and Production Feedback 3 insights
Trace clustering identifies real-world failure patterns
User feedback (thumbs up/down) and subject matter expert annotations on production traces enable automated clustering to group similar failure modes.
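As a rough illustration of grouping similar failures, the sketch below clusters thumbs-down traces by word overlap (Jaccard similarity) between their annotation notes. This is a deliberately simple stand-in: the talk does not say which clustering method is used, and the Trace fields, the 0.5 threshold, and the function names are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Trace:
    """A production trace with user feedback and an SME note (hypothetical)."""
    trace_id: str
    thumbs_up: bool
    note: str


def tokens(text: str) -> Set[str]:
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_failures(traces: List[Trace], threshold: float = 0.5):
    """Greedy clustering: join the first cluster whose representative
    note is similar enough, otherwise start a new cluster."""
    clusters = []  # list of (representative tokens, member trace ids)
    for trace in traces:
        if trace.thumbs_up:
            continue  # only failures are clustered
        toks = tokens(trace.note)
        for rep, members in clusters:
            if jaccard(rep, toks) >= threshold:
                members.append(trace.trace_id)
                break
        else:
            clusters.append((toks, [trace.trace_id]))
    return [members for _, members in clusters]


TRACES = [
    Trace("t1", False, "wrong date range in search tool"),
    Trace("t2", True, ""),
    Trace("t3", False, "wrong date range passed to search tool"),
    Trace("t4", False, "missing refund policy context"),
    Trace("t5", False, "refund policy context missing from retrieval"),
]

groups = cluster_failures(TRACES)
print(groups)
```

Each resulting group is a candidate failure mode that a subject matter expert could review as a unit instead of reading traces one by one.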
SME validation before automated implementation
Subject matter experts validate identified failure clusters before coding agents generate and implement fix proposals.
Historical traces prevent regressions
Before deployment to production, new fixes are tested against collected historical traces to ensure that previously resolved issues don't recur.
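A minimal sketch of such a regression gate follows, assuming each stored trace records an input and an output that was previously accepted as correct. The record layout, the regression_report function, and the exact-match comparison are assumptions for illustration. A non-deterministic agent would more likely need a tolerant scorer than strict equality.

```python
from typing import Callable, Dict, List


def regression_report(history: List[Dict[str, str]],
                      candidate: Callable[[str], str]) -> Dict[str, object]:
    """Replay historical inputs and list traces whose accepted output changed."""
    regressions = [
        rec["trace_id"]
        for rec in history
        if candidate(rec["input"]) != rec["accepted_output"]
    ]
    return {"safe_to_deploy": not regressions, "regressions": regressions}


HISTORY = [
    {"trace_id": "t-101", "input": "refund window?",
     "accepted_output": "30 days"},
    {"trace_id": "t-102", "input": "shipping time?",
     "accepted_output": "3-5 business days"},
]


def candidate_agent(question: str) -> str:
    # Stand-in for the patched agent; it breaks the shipping answer.
    answers = {"refund window?": "30 days", "shipping time?": "2 days"}
    return answers.get(question, "")


report = regression_report(HISTORY, candidate_agent)
print(report)
```

Here the candidate changes a previously accepted answer, so the report flags that trace and marks the fix as not safe to deploy.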
Bottom Line
Deploy coding agents in an iterative evaluation loop with human oversight to autonomously optimize AI agent performance, using golden datasets for baseline testing and clustering analysis of live user feedback to systematically eliminate failure modes.