Ship Real Agents: Hands-On Evals for Agentic Applications — Laurie Voss, Arize

Podcasts | May 14, 2026 | 11.5 thousand views

TL;DR

Laurie Voss presents a practical framework for evaluating AI agents, emphasizing the shift from manual 'vibe checks' to automated test suites that combine code evals, LLM judges, and human validation to catch cascading failures in production systems.

🧪 The Evaluation Imperative

Evals solve the 'vibes problem'

Manual testing fails because agents break on edge cases, adversarial inputs, and unexpected vocabulary that developers miss during casual testing but users trigger immediately.

Traditional unit tests fail for LLMs

Exact string matching breaks down when semantically identical outputs vary in wording, so evals must compare meaning rather than exact text.

Traces provide evaluation data

Traces capture every LLM invocation, tool call, and intermediate step as nested JSON spans, creating the structured log data necessary for systematic debugging and testing.
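
As a rough illustration of what "nested JSON spans" can look like, the sketch below parses a small hand-written trace and walks it to list the tool calls in order. The field names (`kind`, `name`, `children`, `input`) and the span contents are assumptions made for this example, not the schema of any particular tracing library.

```python
import json

# Hypothetical trace; the field names are illustrative, not a real schema.
TRACE_JSON = """
{
  "name": "agent_run", "kind": "agent",
  "children": [
    {"name": "plan", "kind": "llm", "children": []},
    {"name": "search_stock_price", "kind": "tool",
     "input": {"ticker": "TSLA"}, "children": []},
    {"name": "summarize", "kind": "llm", "children": [
      {"name": "format_report", "kind": "tool", "input": {}, "children": []}
    ]}
  ]
}
"""


def iter_spans(span, depth=0):
    """Yield (depth, span) for a span and all of its descendants, depth-first."""
    yield depth, span
    for child in span.get("children", []):
        yield from iter_spans(child, depth + 1)


def tool_calls(trace):
    """Return the names of all tool spans in depth-first order."""
    return [s["name"] for _, s in iter_spans(trace) if s.get("kind") == "tool"]


trace = json.loads(TRACE_JSON)
print(tool_calls(trace))  # ['search_stock_price', 'format_report']
```

Flattening the tree this way gives evals a simple list to inspect, while the nesting stays available for debugging which step triggered which call.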

⚖️ Three Complementary Evaluation Types

Code evals check deterministic constraints

Use deterministic functions to validate JSON format, token limits, forbidden phrases, and required fields since they execute in milliseconds at negligible cost.
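
A minimal sketch of such a code eval, assuming the agent is expected to return a JSON object with `answer` and `sources` fields. The field names, the forbidden phrases, and the 50-word cap (a crude stand-in for a real token limit) are all hypothetical choices for illustration.

```python
import json

# All three constants are illustrative assumptions, not values from the talk.
REQUIRED_FIELDS = ("answer", "sources")
FORBIDDEN_PHRASES = ("as an ai language model", "i cannot help")
MAX_WORDS = 50  # whitespace word count, used here instead of a real tokenizer


def code_eval(output: str) -> dict:
    """Run deterministic checks on one agent output; each check is True or False."""
    results = {
        "valid_json_object": False,
        "has_required_fields": False,
        "within_length": False,
        "no_forbidden_phrases": False,
    }
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return results  # nothing else can be checked without parsed JSON
    if not isinstance(data, dict):
        return results
    results["valid_json_object"] = True
    results["has_required_fields"] = all(f in data for f in REQUIRED_FIELDS)
    answer = str(data.get("answer", ""))
    results["within_length"] = len(answer.split()) <= MAX_WORDS
    lowered = answer.lower()
    results["no_forbidden_phrases"] = not any(p in lowered for p in FORBIDDEN_PHRASES)
    return results


good = code_eval('{"answer": "TSLA closed higher today.", "sources": ["feed"]}')
bad = code_eval("not json at all")
print(all(good.values()), any(bad.values()))  # True False
```

Returning one boolean per check, rather than a single pass/fail, makes it easier to see which constraint an output violated.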

LLM judges assess semantic quality

Deploy more powerful LLMs as judges to evaluate tone, factual accuracy, and faithfulness to source material using rubric-based prompts, despite higher latency and nondeterminism.
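
One way a rubric-based judge might be wired up is sketched below. The rubric wording, the one-word verdict format, and the `call_llm` parameter are assumptions for this example; `call_llm` stands for whatever function sends a prompt to the judge model and returns its text reply, and is replaced by a stub here so that no real API is invoked.

```python
from typing import Callable, Optional

# Illustrative rubric; real rubrics are usually tuned against labeled examples.
RUBRIC_PROMPT = """You are grading an assistant's answer for faithfulness.

Source material:
{source}

Answer to grade:
{answer}

Reply with exactly one word: "faithful" if every claim in the answer is
supported by the source, otherwise "unfaithful"."""


def judge_faithfulness(
    source: str, answer: str, call_llm: Callable[[str], str]
) -> Optional[bool]:
    """Return True/False for a clear verdict, or None if the reply is unparseable."""
    prompt = RUBRIC_PROMPT.format(source=source, answer=answer)
    reply = call_llm(prompt).strip().lower().strip(".\"'")
    if reply == "faithful":
        return True
    if reply == "unfaithful":
        return False
    return None  # judges are nondeterministic; surface odd replies instead of guessing


def stub_judge(prompt: str) -> str:
    """Stand-in for a real model call, used only to exercise the parsing logic."""
    return "Faithful."


print(judge_faithfulness("TSLA rose 2%.", "Tesla stock went up.", stub_judge))  # True
```

Passing the model call in as a parameter keeps the parsing logic testable without network access and makes it simple to swap judge models later.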

Human evals build golden datasets

Reserve human judgment for creating ground-truth datasets and validating failure modes, though human annotators exhibit roughly 50% error rates due to fatigue at scale.

🤖 Agent-Specific Complexity

Cascading failures multiply errors

Early missteps in tool selection or parameter parsing compound through subsequent calls, potentially producing radically wrong outputs like researching Nikola Tesla instead of Tesla stock.

Multi-agent systems add routing challenges

Evaluation must verify that routing LLMs select correct sub-agents and that information transfers accurately between specialized agents in complex workflows.
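
A routing check can be sketched as a small labeled set of queries plus an accuracy report. Everything here is hypothetical: the sub-agent names, the test cases, and `keyword_router`, which is a toy stand-in for the routing LLM so the example runs on its own.

```python
def keyword_router(query: str) -> str:
    """Toy stand-in for a routing LLM; a real router would be a model call."""
    q = query.lower()
    if any(w in q for w in ("stock", "share price", "earnings")):
        return "finance_agent"
    if any(w in q for w in ("born", "inventor", "biography")):
        return "biography_agent"
    return "general_agent"


# Hypothetical labeled cases: (query, sub-agent a human says should handle it).
ROUTING_CASES = [
    ("What is Tesla stock doing today?", "finance_agent"),
    ("When was Nikola Tesla born?", "biography_agent"),
    ("Summarize Tesla's latest earnings call", "finance_agent"),
    ("Tell me about Tesla", "finance_agent"),  # ambiguous on purpose
]


def routing_report(router, cases):
    """Return (accuracy, misses), where misses lists (query, expected, actual)."""
    misses = []
    for query, expected in cases:
        actual = router(query)
        if actual != expected:
            misses.append((query, expected, actual))
    accuracy = 1 - len(misses) / len(cases)
    return accuracy, misses


accuracy, misses = routing_report(keyword_router, ROUTING_CASES)
print(accuracy)  # 0.75
print(misses)    # [('Tell me about Tesla', 'finance_agent', 'general_agent')]
```

Keeping the misses alongside the score shows which ambiguous queries are misrouted, which is usually more actionable than the accuracy number alone.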

Over-prescriptive evals block valid optimizations

Avoid mandating specific tool sequences in evals, as upgraded models may discover more efficient valid paths that differ from expected execution patterns.
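
The difference between a sequence-pinned eval and an outcome-focused one can be shown with two small checks. The tool names and the example run are hypothetical; the point is only that the strict version rejects a shorter path that still reaches a correct answer.

```python
def strict_sequence_eval(actual_tools, expected_sequence):
    """Over-prescriptive: passes only if the exact tool sequence matches."""
    return list(actual_tools) == list(expected_sequence)


def outcome_eval(actual_tools, required_tools, final_answer, must_mention):
    """Looser: required tools were used (any order) and the answer has key terms."""
    used_required = set(required_tools) <= set(actual_tools)
    answer = final_answer.lower()
    answer_ok = all(term.lower() in answer for term in must_mention)
    return used_required and answer_ok


# Hypothetical run: an upgraded model skips the lookup step and goes straight
# to the quote tool, which is still a valid way to answer the question.
expected = ["search_ticker", "get_quote", "summarize"]
actual = ["get_quote", "summarize"]
answer = "TSLA is trading higher today."

print(strict_sequence_eval(actual, expected))              # False
print(outcome_eval(actual, ["get_quote"], answer, ["TSLA"]))  # True
```

The outcome check still fails runs that never fetch a quote or never mention the ticker, so it catches real errors without pinning the path.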

🔄 Implementation Strategy

Analyze traces before writing evals

Categorize actual production failures by reading trace data to determine what metrics to measure, rather than guessing requirements before seeing real error patterns.

Meta-evaluation verifies judge accuracy

Test LLM judges against labeled datasets to ensure they correctly identify success and failure cases, preventing misaligned evals from corrupting your feedback loop.
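
Meta-evaluation reduces to comparing the judge's verdicts with human labels on the same examples. The sketch below assumes both are lists of `"pass"`/`"fail"` strings; the labels shown are made up for illustration and are not results from the talk.

```python
def meta_evaluate(judge, human):
    """Compare judge verdicts with human labels on the same examples."""
    if len(judge) != len(human) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(judge, human))
    agreement = sum(j == h for j, h in pairs) / len(pairs)
    caught = sum(j == "fail" and h == "fail" for j, h in pairs)
    human_fails = sum(h == "fail" for h in human)
    judge_fails = sum(j == "fail" for j in judge)
    return {
        "agreement": agreement,
        # Share of human-labeled failures that the judge also flagged.
        "fail_recall": caught / human_fails if human_fails else None,
        # Share of judge-flagged failures that humans agree are failures.
        "fail_precision": caught / judge_fails if judge_fails else None,
    }


# Made-up labels for five examples.
human_labels = ["pass", "fail", "fail", "pass", "fail"]
judge_labels = ["pass", "fail", "pass", "pass", "fail"]
report = meta_evaluate(judge_labels, human_labels)
print(report["agreement"])  # 0.8
```

Here the judge agrees with humans on four of five examples but misses one of three real failures, the kind of gap that would otherwise silently weaken the feedback loop.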

Evals enable safe model upgrades

Comprehensive eval suites let teams switch between models like Claude 4.5 and 4.6 or modify prompts while catching regressions before they break existing functionality.
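
One possible shape for an upgrade gate is to run the same eval suite against the current and candidate configurations and flag any eval whose pass rate drops. The eval names, the pass rates, and the 0.02 tolerance below are invented for illustration.

```python
def find_regressions(baseline, candidate, tolerance=0.02):
    """Return evals whose pass rate dropped by more than `tolerance`.

    Both arguments map eval name -> pass rate in [0, 1]. An eval missing
    from the candidate run is reported with a candidate rate of None.
    """
    regressions = {}
    for name, base_rate in baseline.items():
        cand_rate = candidate.get(name)
        if cand_rate is None:
            regressions[name] = (base_rate, None)
        elif base_rate - cand_rate > tolerance:
            regressions[name] = (base_rate, cand_rate)
    return regressions


# Made-up pass rates for a current model and a proposed replacement.
baseline = {"valid_json": 0.99, "faithfulness": 0.91, "routing": 0.95}
candidate = {"valid_json": 1.00, "faithfulness": 0.84, "routing": 0.94}

print(find_regressions(baseline, candidate))  # {'faithfulness': (0.91, 0.84)}
```

A CI job could fail the build whenever this dictionary is non-empty, so a model or prompt change ships only after each flagged eval has been reviewed.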

Bottom Line

Implement a hybrid evaluation pipeline combining deterministic code checks, LLM judges for semantic analysis, and human-validated golden datasets, integrated into CI to catch cascading failures before production deployment.
