Ship Real Agents: Hands-On Evals for Agentic Applications — Laurie Voss, Arize

Podcasts · May 14, 2026 · 2,640 views

TL;DR

Laurie Voss presents a practical framework for evaluating AI agents, emphasizing the shift from manual 'vibe checks' to automated test suites that combine code evals, LLM judges, and human validation to catch cascading failures in production systems.

🧪 The Evaluation Imperative

Evals solve the 'vibes problem'

Manual testing fails because agents break on edge cases, adversarial inputs, and unexpected vocabulary that developers miss during casual testing but users trigger immediately.

Traditional unit tests fail for LLMs

Standard string matching doesn't work when semantically identical outputs vary in wording, requiring evals that understand meaning rather than exact text matches.

Traces provide evaluation data

Traces capture every LLM invocation, tool call, and intermediate step as nested JSON spans, creating the structured log data necessary for systematic debugging and testing.
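A trace can be thought of as a tree of spans, which evals then walk step by step. The sketch below uses an illustrative span schema (`span_id`, `kind`, `children` are assumptions, not a specific tracing standard):

```python
# Hypothetical trace: one agent run captured as nested spans.
# The field names here are illustrative, not a particular tracing schema.
trace = {
    "span_id": "root", "kind": "agent", "name": "research_agent",
    "children": [
        {"span_id": "s1", "kind": "llm", "name": "plan",
         "input": "What is TSLA trading at?",
         "output": "call stock_price tool", "children": []},
        {"span_id": "s2", "kind": "tool", "name": "stock_price",
         "input": {"ticker": "TSLA"}, "output": {"price": 242.1},
         "children": []},
    ],
}

def flatten(span):
    """Depth-first walk yielding every span, so an eval can inspect each step."""
    yield span
    for child in span.get("children", []):
        yield from flatten(child)

# Which tools did the agent actually invoke on this run?
tool_calls = [s for s in flatten(trace) if s["kind"] == "tool"]
print([s["name"] for s in tool_calls])
```

Because every intermediate step is captured, an eval can assert on the agent's behavior (which tool, with what parameters) rather than only on its final answer.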

⚖️ Three Complementary Evaluation Types

Code evals check deterministic constraints

Use deterministic functions to validate JSON format, token limits, forbidden phrases, and required fields since they execute in milliseconds at negligible cost.
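A minimal sketch of such a code eval, assuming an agent that must return JSON with `answer` and `sources` fields (the field names, length cap, and forbidden-phrase list are illustrative):

```python
import json

MAX_CHARS = 2000  # illustrative proxy for a token limit
FORBIDDEN = ["as an ai language model"]
REQUIRED_FIELDS = {"answer", "sources"}

def code_eval(raw_output: str) -> dict:
    """Cheap deterministic checks; each runs in microseconds at zero API cost."""
    results = {}
    results["within_length"] = len(raw_output) <= MAX_CHARS
    results["no_forbidden_phrase"] = not any(
        p in raw_output.lower() for p in FORBIDDEN
    )
    try:
        payload = json.loads(raw_output)
        results["valid_json"] = True
        results["has_required_fields"] = REQUIRED_FIELDS <= payload.keys()
    except (json.JSONDecodeError, AttributeError):
        results["valid_json"] = False
        results["has_required_fields"] = False
    return results

print(code_eval('{"answer": "TSLA closed at 242.10", "sources": ["nasdaq.com"]}'))
```

Because these checks are deterministic and fast, they can run on every output in CI and in production without adding meaningful latency or cost.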

LLM judges assess semantic quality

Deploy more powerful LLMs as judges to evaluate tone, factual accuracy, and faithfulness to source material using rubric-based prompts, despite higher latency and nondeterminism.
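A sketch of the rubric-prompt pattern. `call_model` is a stand-in for whatever client library you use (the real call and the one-word-verdict format are assumptions); the rubric text is the part that carries the eval:

```python
# LLM-as-judge sketch: a stronger model grades the agent's answer for
# faithfulness to source material using an explicit rubric.
JUDGE_RUBRIC = """You are grading an agent's answer for faithfulness.
Question: {question}
Source material: {source}
Agent answer: {answer}
Respond with exactly one word: "faithful" if every claim in the answer is
supported by the source material, otherwise "unfaithful"."""

def call_model(prompt: str) -> str:
    # Stub for illustration: replace with a real call to a judge model.
    return "faithful"

def judge_faithfulness(question: str, source: str, answer: str) -> bool:
    prompt = JUDGE_RUBRIC.format(question=question, source=source, answer=answer)
    verdict = call_model(prompt).strip().lower()
    return verdict == "faithful"

print(judge_faithfulness(
    "What did Q3 revenue do?",
    "Q3 revenue grew 12% year over year.",
    "Revenue rose 12% in Q3.",
))
```

Constraining the judge to a fixed-vocabulary verdict keeps its nondeterminism out of the parsing step, though the verdicts themselves still need validation (see meta-evaluation below).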

Human evals build golden datasets

Reserve human judgment for creating ground-truth datasets and validating failure modes, though human annotators exhibit roughly 50% error rates due to fatigue at scale.

🤖 Agent-Specific Complexity

Cascading failures multiply errors

Early missteps in tool selection or parameter parsing compound through subsequent calls, potentially producing radically wrong outputs like researching Nikola Tesla instead of Tesla stock.

Multi-agent systems add routing challenges

Evaluation must verify that routing LLMs select correct sub-agents and that information transfers accurately between specialized agents in complex workflows.
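A routing check can be expressed as a trace-level assertion. The span fields (`kind`, `selected_agent`) are illustrative, not a specific framework's schema:

```python
# Routing eval sketch: verify the router span in a trace chose the
# sub-agent a human labeler expected for this input.
def routed_correctly(trace: dict, expected_agent: str) -> bool:
    routing_span = next(s for s in trace["spans"] if s["kind"] == "router")
    return routing_span["selected_agent"] == expected_agent

trace = {"spans": [
    {"kind": "router", "selected_agent": "stock_agent"},
    {"kind": "agent", "name": "stock_agent"},
]}
print(routed_correctly(trace, "stock_agent"))  # True
```

The same pattern extends to handoff fidelity: assert that the fields the router passed to the sub-agent match what appeared in the user's request.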

Over-prescriptive evals block valid optimizations

Avoid mandating specific tool sequences in evals, as upgraded models may discover more efficient valid paths that differ from expected execution patterns.

🔄 Implementation Strategy

Analyze traces before writing evals

Categorize actual production failures by reading trace data to determine what metrics to measure, rather than guessing requirements before seeing real error patterns.
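The triage step can be as simple as bucketing traces by failure mode and counting. The category names and trace fields below are illustrative assumptions:

```python
from collections import Counter

# Sketch: bucket production traces by failure mode before deciding which
# evals to write. The categories and fields here are invented for illustration.
def categorize(trace: dict) -> str:
    if trace.get("error") == "tool_timeout":
        return "tool_failure"
    if trace.get("wrong_tool_selected"):
        return "routing_error"
    if not trace.get("answer_grounded", True):
        return "hallucination"
    return "ok"

traces = [
    {"error": "tool_timeout"},
    {"wrong_tool_selected": True},
    {"answer_grounded": False},
    {"answer_grounded": False},
    {},
]
counts = Counter(categorize(t) for t in traces)
print(counts.most_common())
```

The biggest buckets tell you which eval to build first: routing errors call for trace-level assertions, hallucinations for an LLM judge, format failures for cheap code evals.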

Meta-evaluation verifies judge accuracy

Test LLM judges against labeled datasets to ensure they correctly identify success and failure cases, preventing misaligned evals from corrupting your feedback loop.
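A minimal meta-evaluation harness, assuming a judge that returns a boolean verdict and a human-labeled golden set (the toy judge below is a deliberately crude stand-in):

```python
def meta_eval(judge, labeled_examples):
    """Compare judge verdicts to human labels; return agreement stats."""
    tp = fp = tn = fn = 0
    for example, human_label in labeled_examples:
        verdict = judge(example)
        if verdict and human_label: tp += 1
        elif verdict and not human_label: fp += 1
        elif not verdict and human_label: fn += 1
        else: tn += 1
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        # Judge passes an answer the human failed -- the dangerous direction.
        "false_pass_rate": fp / max(fp + tn, 1),
    }

# Toy judge that calls any answer containing a digit "grounded".
toy_judge = lambda answer: any(c.isdigit() for c in answer)
labeled = [("revenue grew 12%", True), ("revenue grew a lot", False),
           ("profit was 3M", True), ("see 2019 report", False)]
print(meta_eval(toy_judge, labeled))
```

Tracking the false-pass rate separately matters: a judge that waves through failures silently corrupts every downstream decision the eval suite informs.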

Evals enable safe model upgrades

Comprehensive eval suites allow teams to switch between models like Claude 4.5 and 4.6 or modify prompts without introducing regressions that break existing functionality.
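The upgrade gate reduces to a pass-rate comparison. This sketch assumes a hypothetical `run_eval_suite(model, case)` harness returning pass/fail per case:

```python
# CI-gate sketch: run the same eval suite against the current and candidate
# models; block the upgrade if the aggregate pass rate regresses.
def pass_rate(run_eval_suite, model_name, cases):
    passed = sum(run_eval_suite(model_name, case) for case in cases)
    return passed / len(cases)

def safe_to_upgrade(run_eval_suite, current, candidate, cases, tolerance=0.0):
    return (pass_rate(run_eval_suite, candidate, cases)
            >= pass_rate(run_eval_suite, current, cases) - tolerance)

# Toy harness: pretend the candidate model fixes one previously failing case.
results = {("current", "c1"): True, ("current", "c2"): False,
           ("candidate", "c1"): True, ("candidate", "c2"): True}
suite = lambda model, case: results[(model, case)]
print(safe_to_upgrade(suite, "current", "candidate", ["c1", "c2"]))  # True
```

The same gate covers prompt changes: any edit that drops the pass rate below the current baseline is caught before it reaches production.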

Bottom Line

Implement a hybrid evaluation pipeline combining deterministic code checks, LLM judges for semantic analysis, and human-validated golden datasets, integrated into CI to catch cascading failures before production deployment.
