Ship Real Agents: Hands-On Evals for Agentic Applications — Laurie Voss, Arize
TL;DR
Laurie Voss presents a practical framework for evaluating AI agents, emphasizing the shift from manual 'vibe checks' to automated test suites that combine code evals, LLM judges, and human validation to catch cascading failures in production systems.
🧪 The Evaluation Imperative
Evals solve the 'vibes problem'
Manual testing fails because agents break on edge cases, adversarial inputs, and unexpected vocabulary that developers miss during casual testing but users trigger immediately.
Traditional unit tests fail for LLMs
Standard string matching doesn't work when semantically identical outputs vary in wording, requiring evals that understand meaning rather than exact text matches.
Traces provide evaluation data
Traces capture every LLM invocation, tool call, and intermediate step as nested JSON spans, creating the structured log data necessary for systematic debugging and testing.
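As a rough illustration of what "nested JSON spans" can look like in practice, the sketch below walks a small hand-written trace. The span shape (`name`, `kind`, `children`) and the helper names are assumptions made for this example, not the schema of any particular tracing tool.

```python
from typing import Iterator

# Hypothetical trace: one agent run containing LLM and tool spans.
trace = {
    "name": "agent_run",
    "kind": "agent",
    "children": [
        {"name": "plan", "kind": "llm", "children": []},
        {
            "name": "lookup_price",
            "kind": "tool",
            "children": [
                {"name": "summarize", "kind": "llm", "children": []},
            ],
        },
    ],
}


def walk_spans(span: dict, depth: int = 0) -> Iterator[tuple[int, dict]]:
    """Yield (depth, span) pairs in depth-first order."""
    yield depth, span
    for child in span.get("children", []):
        yield from walk_spans(child, depth + 1)


def spans_of_kind(root: dict, kind: str) -> list[str]:
    """Collect the names of every span of a given kind."""
    return [span["name"] for _, span in walk_spans(root) if span["kind"] == kind]


print(spans_of_kind(trace, "llm"))  # ['plan', 'summarize']
```

Flattening the tree this way makes it easy to feed individual LLM or tool spans into the evals described in the following sections.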
⚖️ Three Complementary Evaluation Types
Code evals check deterministic constraints
Use deterministic functions to validate JSON format, token limits, forbidden phrases, and required fields since they execute in milliseconds at negligible cost.
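A minimal sketch of such a code eval, using only the standard library. The required fields, forbidden phrase, and character limit are placeholder assumptions, and the character limit is a crude stand-in for a real token count.

```python
import json

# Hypothetical constraints; tune these to your own application.
REQUIRED_FIELDS = ("answer", "sources")
FORBIDDEN_PHRASES = ("as an ai language model",)
MAX_CHARS = 500  # crude stand-in for a token limit


def run_code_evals(raw_output: str) -> dict[str, bool]:
    """Return one pass/fail flag per deterministic check."""
    lowered = raw_output.lower()
    results = {
        "valid_json": False,
        "has_required_fields": False,
        "within_length": len(raw_output) <= MAX_CHARS,
        "no_forbidden_phrases": not any(p in lowered for p in FORBIDDEN_PHRASES),
    }
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return results  # format checks stay False when parsing fails
    results["valid_json"] = True
    results["has_required_fields"] = isinstance(parsed, dict) and all(
        field in parsed for field in REQUIRED_FIELDS
    )
    return results


print(run_code_evals('{"answer": "42", "sources": []}'))
```

Returning a flag per check, rather than a single boolean, keeps failures easy to categorize later.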
LLM judges assess semantic quality
Deploy more powerful LLMs as judges to evaluate tone, factual accuracy, and faithfulness to source material using rubric-based prompts, despite higher latency and nondeterminism.
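One way to structure a rubric-based judge is sketched below. The rubric wording and the `complete` callable are assumptions: `complete` stands in for whatever function sends a prompt to your judge model and returns its text reply, so no specific vendor API is implied.

```python
from typing import Callable

# Hypothetical rubric; a real one would be tuned against labeled examples.
RUBRIC = """You are grading an answer for faithfulness to the source.

Source:
{source}

Answer:
{answer}

Reply with exactly one word: "faithful" or "unfaithful"."""

VALID_LABELS = {"faithful", "unfaithful"}


def judge_faithfulness(
    source: str, answer: str, complete: Callable[[str], str]
) -> str:
    """Ask the judge model for a label and normalize its reply."""
    reply = complete(RUBRIC.format(source=source, answer=answer))
    label = reply.strip().lower().strip('."')
    # Judges are nondeterministic, so guard against off-rubric replies.
    return label if label in VALID_LABELS else "unparseable"


# Stub judge used only to show the call shape.
def fake_judge(prompt: str) -> str:
    return "Faithful."


print(judge_faithfulness("The sky is blue.", "The sky is blue.", fake_judge))
```

Constraining the judge to a small label set and treating anything else as unparseable makes its output usable by downstream code.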
Human evals build golden datasets
Reserve human judgment for creating ground-truth datasets and validating failure modes, though human annotators exhibit roughly 50% error rates due to fatigue at scale.
🤖 Agent-Specific Complexity
Cascading failures multiply errors
Early missteps in tool selection or parameter parsing compound through subsequent calls, potentially producing radically wrong outputs like researching Nikola Tesla instead of Tesla stock.
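Because later steps inherit earlier mistakes, it can help to locate the first step that went wrong rather than grading only the final output. The sketch below assumes a hypothetical flat list of steps and per-tool check functions; the tool names and checks are invented for illustration.

```python
from typing import Callable, Optional

Step = dict
Check = Callable[[Step], bool]


def first_failing_step(steps: list[Step], checks: dict[str, Check]) -> Optional[int]:
    """Return the index of the earliest step whose check fails, else None."""
    for index, step in enumerate(steps):
        check = checks.get(step["tool"])
        if check is not None and not check(step):
            return index
    return None


# Hypothetical run where entity resolution picked the inventor, not the stock.
steps = [
    {"tool": "resolve_entity", "output": "Nikola Tesla (inventor)"},
    {"tool": "web_search", "output": "biography pages"},
    {"tool": "write_report", "output": "A report about the inventor"},
]
checks = {
    "resolve_entity": lambda s: "TSLA" in s["output"],
    "write_report": lambda s: "stock" in s["output"].lower(),
}

print(first_failing_step(steps, checks))  # 0
```

Here the report step would also fail its check, but the root cause is step 0, which is where a fix should start.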
Multi-agent systems add routing challenges
Evaluation must verify that routing LLMs select correct sub-agents and that information transfers accurately between specialized agents in complex workflows.
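A routing check can be as simple as comparing the router's choice against labeled examples and tallying where it goes wrong. In this sketch the sub-agent names, the labeled cases, and the keyword-based `keyword_router` stub are all hypothetical; in a real system `route` would call the routing LLM.

```python
from collections import Counter
from typing import Callable


def routing_report(
    cases: list[tuple[str, str]], route: Callable[[str], str]
) -> tuple[float, Counter]:
    """Return routing accuracy and a count of (expected, actual) misroutes."""
    if not cases:
        raise ValueError("need at least one labeled case")
    misroutes: Counter = Counter()
    correct = 0
    for query, expected in cases:
        actual = route(query)
        if actual == expected:
            correct += 1
        else:
            misroutes[(expected, actual)] += 1
    return correct / len(cases), misroutes


def keyword_router(query: str) -> str:
    """Stub router standing in for a routing LLM."""
    lowered = query.lower()
    if "refund" in lowered:
        return "billing"
    if "password" in lowered:
        return "account"
    return "general"


cases = [
    ("I want a refund", "billing"),
    ("Reset my password", "account"),
    ("Why was I charged twice?", "billing"),
    ("What are your hours?", "general"),
]
accuracy, misroutes = routing_report(cases, keyword_router)
print(accuracy, dict(misroutes))  # 0.75 {('billing', 'general'): 1}
```

The misroute counter shows which pairs of sub-agents get confused, which is usually more actionable than the accuracy figure alone.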
Over-prescriptive evals block valid optimizations
Avoid mandating specific tool sequences in evals, as upgraded models may discover more efficient valid paths that differ from expected execution patterns.
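The difference between a sequence-based and an outcome-based eval can be sketched as follows. The tool names and the answer check are made up for this example; the point is only that the second function constrains what must and must not happen, not the order in which it happens.

```python
from typing import Callable


def strict_sequence_eval(tool_calls: list[str], expected: list[str]) -> bool:
    """Over-prescriptive: passes only on one exact path."""
    return tool_calls == expected


def outcome_eval(
    tool_calls: list[str],
    final_answer: str,
    required_tools: set[str],
    forbidden_tools: set[str],
    answer_check: Callable[[str], bool],
) -> bool:
    """Passes any path that uses required tools, avoids forbidden ones,
    and produces an acceptable answer."""
    used = set(tool_calls)
    return (
        required_tools <= used
        and not (forbidden_tools & used)
        and answer_check(final_answer)
    )


old_path = ["search", "fetch_page", "get_quote"]
new_path = ["get_quote"]  # a shorter path an upgraded model might find
answer = "TSLA closed higher today."


def mentions_ticker(text: str) -> bool:
    return "TSLA" in text


print(strict_sequence_eval(new_path, old_path))  # False
print(outcome_eval(new_path, answer, {"get_quote"}, {"delete_record"}, mentions_ticker))  # True
```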
🔄 Implementation Strategy
Analyze traces before writing evals
Categorize actual production failures by reading trace data to determine what metrics to measure, rather than guessing requirements before seeing real error patterns.
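Once traces have been read and hand-labeled, a simple tally shows which failure modes deserve an eval first. The trace IDs and category names below are invented sample data, not results from the talk.

```python
from collections import Counter
from typing import Optional

# Hypothetical hand-annotated traces; None means the run looked fine.
annotated: list[dict[str, Optional[str]]] = [
    {"trace_id": "t1", "failure": "wrong_tool"},
    {"trace_id": "t2", "failure": None},
    {"trace_id": "t3", "failure": "bad_params"},
    {"trace_id": "t4", "failure": "wrong_tool"},
    {"trace_id": "t5", "failure": "hallucinated_fact"},
    {"trace_id": "t6", "failure": "wrong_tool"},
    {"trace_id": "t7", "failure": "bad_params"},
    {"trace_id": "t8", "failure": None},
]


def failure_breakdown(traces: list[dict]) -> list[tuple[str, int]]:
    """Count failure categories, most frequent first."""
    counts = Counter(t["failure"] for t in traces if t["failure"])
    return counts.most_common()


print(failure_breakdown(annotated))
# [('wrong_tool', 3), ('bad_params', 2), ('hallucinated_fact', 1)]
```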
Meta-evaluation verifies judge accuracy
Test LLM judges against labeled datasets to ensure they correctly identify success and failure cases, preventing misaligned evals from corrupting your feedback loop.
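A minimal meta-evaluation compares judge verdicts with human labels on the same examples. The sketch assumes boolean labels where `True` means "pass", and the sample labels are invented for illustration.

```python
def judge_agreement(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """Compare judge verdicts to human labels (True = pass)."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and equal in length")
    pairs = list(zip(human, judge))
    human_fail = [j for h, j in pairs if not h]
    human_pass = [j for h, j in pairs if h]
    return {
        "accuracy": sum(h == j for h, j in pairs) / len(pairs),
        # Of the cases humans marked as failures, how many did the judge catch?
        "failure_recall": (
            sum(not j for j in human_fail) / len(human_fail) if human_fail else 1.0
        ),
        # Of the cases humans marked as passes, how many did the judge accept?
        "pass_recall": (
            sum(j for j in human_pass) / len(human_pass) if human_pass else 1.0
        ),
    }


human_labels = [True, True, True, True, False, False, False, False]
judge_labels = [True, True, True, False, False, False, False, True]
print(judge_agreement(human_labels, judge_labels))
# {'accuracy': 0.75, 'failure_recall': 0.75, 'pass_recall': 0.75}
```

Reporting failure recall separately matters because a judge that passes everything can still score a high accuracy on a dataset dominated by passing cases.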
Evals enable safe model upgrades
Comprehensive eval suites allow teams to switch between models like Claude 4.5 and 4.6 or modify prompts without introducing regressions that break existing functionality.
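A regression gate for a model or prompt change can be sketched as a comparison of per-case results between a baseline run and a candidate run. The case names and results are hypothetical, and each dictionary maps an eval case ID to whether it passed.

```python
def find_regressions(
    baseline: dict[str, bool], candidate: dict[str, bool]
) -> list[str]:
    """Cases that passed on the baseline but fail (or are missing) on the candidate."""
    return sorted(
        case
        for case, passed in baseline.items()
        if passed and not candidate.get(case, False)
    )


def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of cases that passed; 0.0 for an empty run."""
    return sum(results.values()) / len(results) if results else 0.0


# Hypothetical results from running the same suite on two model versions.
baseline = {"refund_flow": True, "stock_lookup": True, "json_format": False}
candidate = {"refund_flow": True, "stock_lookup": False, "json_format": True}

print(find_regressions(baseline, candidate))  # ['stock_lookup']
```

In this sample the overall pass rate is unchanged between the two runs, yet one previously working case broke, which is why the per-case comparison is worth running alongside the aggregate score.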
Bottom Line
Implement a hybrid evaluation pipeline combining deterministic code checks, LLM judges for semantic analysis, and human-validated golden datasets, integrated into CI to catch cascading failures before production deployment.