Agents Building Agents - Alfonso Graziano, Nearform

| Podcasts | June 28, 2026 | 1.45 Thousand views | 30:14

TL;DR

Alfonso Graziano from NearForm demonstrates how coding agents can autonomously improve AI agent performance through iterative evaluation loops, achieving 18% to 83% accuracy gains on new agents and 10% improvements on production systems already optimized by humans.

🧪 Golden Datasets and Evaluation Frameworks 3 insights

Golden datasets as non-deterministic test suites

Subject matter experts create datasets defining inputs and expected outputs—including specific tool calls, parameters, and chains—to establish accuracy baselines in non-deterministic systems.

Scorers measure agent accuracy quantitatively

Custom scoring functions evaluate whether agent outputs match expected results, enabling regression detection and iterative improvement tracking.

Common failure modes in agent systems

Poor evaluation performance typically stems from missing tools, inadequate system prompts, or insufficient context retrieval mechanisms.

🔄 AutoAgent: Autonomous Optimization Loops 3 insights

Coding agents iteratively improve target agents

Inspired by Andrej Karpathy's auto research, Claude Code functions as an optimization engine that modifies agent code, system prompts, and tool descriptions based on evaluation feedback.

Branch-based hypothesis testing with rollback

The system creates git branches for each hypothesis, runs evaluation suites, and automatically rolls back changes that cause regressions while preserving improvements.

Documented performance gains beyond human tuning

The loop improved a naive agent from 18% to 83% accuracy in ten iterations and found an additional 10% improvement on a production agent already optimized by engineers.

📊 Live Data and Production Feedback 3 insights

Trace clustering identifies real-world failure patterns

User feedback (thumbs up/down) and subject matter expert annotations on production traces enable automated clustering to group similar failure modes.

SME validation before automated implementation

Subject matter experts validate identified failure clusters before coding agents generate and implement fix proposals.

Historical traces prevent regressions

New fixes are tested against collected historical traces to ensure resolved issues don't recur before deployment to production.

Bottom Line

Deploy coding agents in an iterative evaluation loop with human oversight to autonomously optimize AI agent performance, using golden datasets for baseline testing and clustering analysis of live user feedback to systematically eliminate failure modes.

More from AI Engineer

View all
Frontier results, on device - RL Nabors, Arize
30:52
AI Engineer AI Engineer

Frontier results, on device - RL Nabors, Arize

Rachel Lee Neighbors introduces a framework for replacing expensive cloud-based frontier models with Small Language Models (SLMs) running on-device, demonstrating how a systematic 'prototype big, deploy small' approach using evaluation tools like Phoenix can cut inference costs to zero while maintaining 90% accuracy and enabling offline functionality.

about 11 hours ago · 10 points
The Future Is Domain-Specific Agents - Justin Schroeder, StandardAgents
30:38
AI Engineer AI Engineer

The Future Is Domain-Specific Agents - Justin Schroeder, StandardAgents

Justin Schroeder argues that the future of AI lies in domain-specific agents—small, specialized agents that compose together rather than general-purpose agents bloated with tools and skills, delivering 80%+ token efficiency and 137x cost savings compared to monolithic approaches.

about 12 hours ago · 9 points
The Agentic AI Engineer - Benedikt Sanftl, Mutagent
34:50
AI Engineer AI Engineer

The Agentic AI Engineer - Benedikt Sanftl, Mutagent

Benedikt Sanftl and Burak from Mutagent present the 'Agentic AI Engineer' paradigm, where specialized AI agents autonomously manage the entire lifecycle of building, evaluating, and optimizing other agents through automated offline and online loops, solving the scalability bottlenecks of manual development.

about 13 hours ago · 10 points