Ship Real Agents: Hands-On Evals for Agentic Applications — Laurie Voss, Arize
TL;DR
Laurie Voss presents a practical framework for evaluating AI agents, emphasizing the shift from manual 'vibe checks' to automated test suites that combine code evals, LLM judges, and human validation to catch cascading failures in production systems.
🧪 The Evaluation Imperative
Evals solve the 'vibes problem'
Manual testing fails because agents break on edge cases, adversarial inputs, and unexpected vocabulary that developers miss during casual testing but users trigger immediately.
Traditional unit tests fail for LLMs
Standard string matching doesn't work when semantically identical outputs vary in wording, requiring evals that understand meaning rather than exact text matches.
Traces provide evaluation data
Traces capture every LLM invocation, tool call, and intermediate step as nested JSON spans, creating the structured log data necessary for systematic debugging and testing.
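As a rough sketch of what that structure looks like (the field names here are hypothetical, not any specific tracing schema), a single agent turn might serialize to nested spans like this:

```python
# Hypothetical span schema: one root span per agent turn, with nested
# child spans for each LLM invocation and tool call. Values are illustrative.
trace = {
    "span_id": "root-001",
    "name": "agent_turn",
    "input": "What is Tesla's stock price?",
    "children": [
        {
            "span_id": "llm-001",
            "name": "llm_call",  # the model decides which tool to invoke
            "output": {"tool": "stock_lookup", "args": {"ticker": "TSLA"}},
        },
        {
            "span_id": "tool-001",
            "name": "tool_call",  # the tool actually executed
            "input": {"ticker": "TSLA"},
            "output": {"price": 242.18},
        },
    ],
}
```

Evals then become functions over this structure: any intermediate step, not just the final answer, can be checked.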
⚖️ Three Complementary Evaluation Types
Code evals check deterministic constraints
Use deterministic functions to validate JSON format, token limits, forbidden phrases, and required fields since they execute in milliseconds at negligible cost.
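A minimal sketch of such checks (the required fields, length cap, and phrase list are placeholders to adapt to your own output contract):

```python
import json

FORBIDDEN_PHRASES = ["as an AI language model"]  # placeholder list
MAX_OUTPUT_CHARS = 4000  # placeholder proxy for a token budget

def code_eval(output: str) -> dict:
    """Cheap deterministic checks that run in milliseconds per output."""
    results = {}
    try:
        parsed = json.loads(output)
        results["valid_json"] = isinstance(parsed, dict)
        results["has_required_fields"] = (
            isinstance(parsed, dict) and {"answer", "sources"} <= parsed.keys()
        )
    except json.JSONDecodeError:
        results["valid_json"] = False
        results["has_required_fields"] = False
    results["within_length"] = len(output) <= MAX_OUTPUT_CHARS
    results["no_forbidden_phrases"] = not any(
        phrase.lower() in output.lower() for phrase in FORBIDDEN_PHRASES
    )
    return results
```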
LLM judges assess semantic quality
Deploy more powerful LLMs as judges to evaluate tone, factual accuracy, and faithfulness to source material using rubric-based prompts, despite higher latency and nondeterminism.
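A sketch of one rubric-based judge, assuming a generic `complete(prompt)` callable that wraps whatever judge model you use (the rubric wording is illustrative, not from the talk):

```python
JUDGE_RUBRIC = """You are grading an AI assistant's answer.
Question: {question}
Source material: {source}
Answer: {answer}

Is the answer faithful to the source material, making no claims the
source does not support? Respond with exactly one word on the first
line: "faithful" or "unfaithful". Then briefly explain your reasoning.
"""

def judge_faithfulness(question: str, source: str, answer: str, complete) -> bool:
    """Ask a stronger judge model whether the answer sticks to its sources.

    `complete` is a hypothetical callable wrapping your judge model's API.
    """
    verdict = complete(
        JUDGE_RUBRIC.format(question=question, source=source, answer=answer)
    )
    return verdict.strip().lower().startswith("faithful")
```

Constraining the judge to a fixed vocabulary ("faithful"/"unfaithful") keeps parsing deterministic even though the judgment itself is not.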
Human evals build golden datasets
Reserve human judgment for creating ground-truth datasets and validating failure modes, though human annotators exhibit roughly 50% error rates due to fatigue at scale.
🤖 Agent-Specific Complexity
Cascading failures multiply errors
Early missteps in tool selection or parameter parsing compound through subsequent calls, potentially producing radically wrong outputs like researching Nikola Tesla instead of Tesla stock.
Multi-agent systems add routing challenges
Evaluation must verify that routing LLMs select correct sub-agents and that information transfers accurately between specialized agents in complex workflows.
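Because routing decisions are recorded in traces, they can be checked directly. A sketch against the hypothetical span schema above:

```python
def eval_routing(trace: dict, expected_agent: str) -> bool:
    """Check that the routing step handed off to the expected sub-agent.

    Assumes the hypothetical span schema sketched earlier, with a span
    named "route" whose output records the chosen sub-agent.
    """
    for span in trace.get("children", []):
        if span.get("name") == "route":
            return span["output"].get("agent") == expected_agent
    return False  # no routing span at all is itself a failure
```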
Over-prescriptive evals block valid optimizations
Avoid mandating specific tool sequences in evals, as upgraded models may discover more efficient valid paths that differ from expected execution patterns.
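In practice this means asserting on outcomes rather than trajectories. A hedged sketch, reusing the Tesla example:

```python
# Brittle: breaks the moment a smarter model finds a shorter valid path.
def eval_trajectory(tool_calls: list[str]) -> bool:
    return tool_calls == ["search", "fetch_page", "summarize"]

# Robust: any trajectory passes as long as the outcome is correct.
def eval_outcome(final_answer: dict) -> bool:
    return (
        final_answer.get("ticker") == "TSLA"  # right entity, not Nikola Tesla
        and "price" in final_answer           # required field present
    )
```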
🔄 Implementation Strategy
Analyze traces before writing evals
Categorize actual production failures by reading trace data to determine what metrics to measure, rather than guessing requirements before seeing real error patterns.
Meta-evaluation verifies judge accuracy
Test LLM judges against labeled datasets to ensure they correctly identify success and failure cases, preventing misaligned evals from corrupting your feedback loop.
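A sketch of that meta-evaluation loop, assuming a `judge` callable that returns a boolean verdict and a dataset of human-labeled examples:

```python
def meta_eval(judge, labeled_examples: list[dict]) -> dict:
    """Score an LLM judge against human-labeled ground truth.

    Each example is {"input": ..., "output": ..., "human_label": bool};
    `judge` is a hypothetical callable returning a bool verdict.
    """
    tp = fp = fn = tn = 0
    for ex in labeled_examples:
        verdict = judge(ex["input"], ex["output"])
        if verdict and ex["human_label"]:
            tp += 1
        elif verdict and not ex["human_label"]:
            fp += 1
        elif not verdict and ex["human_label"]:
            fn += 1
        else:
            tn += 1
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }
```

Low precision means the judge flags good outputs as failures; low recall means it waves real failures through. Either misalignment corrupts the feedback loop.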
Evals enable safe model upgrades
Comprehensive eval suites allow teams to switch between models like Claude 4.5 and 4.6 or modify prompts without introducing regressions that break existing functionality.
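Wired into CI, the suite can look like ordinary tests, one assertion per eval. A sketch using pytest, with hypothetical imports standing in for your agent harness and the eval sketches above:

```python
import pytest

from my_agent import run_agent  # hypothetical: your agent harness entry point
from my_evals import code_eval, eval_routing  # the sketches from earlier sections

GOLDEN_CASES = [
    ("What is TSLA trading at?", "stock_agent"),
    ("Summarize our Q3 board deck", "docs_agent"),
]

@pytest.mark.parametrize("question,expected_agent", GOLDEN_CASES)
def test_routing_regression(question, expected_agent):
    """Fails CI if a model or prompt change breaks sub-agent routing."""
    trace = run_agent(question)
    assert eval_routing(trace, expected_agent)

def test_output_contract():
    """Fails CI if the agent stops emitting valid, in-budget JSON."""
    trace = run_agent("What is TSLA trading at?")
    checks = code_eval(trace["final_output"])
    assert all(checks.values()), f"failed checks: {checks}"
```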
Bottom Line
Implement a hybrid evaluation pipeline combining deterministic code checks, LLM judges for semantic analysis, and human-validated golden datasets, integrated into CI to catch cascading failures before production deployment.
More from AI Engineer
Mind the Gap (In your Agent Observability) — Amy Boyd & Nitya Narasimhan, Microsoft
Microsoft's Amy Boyd and Nitya Narasimhan present the 'Mind the Gap' framework for AI agent observability, emphasizing continuous evaluation, OpenTelemetry tracing, and integrated safety guardrails to bridge the divide between development requirements and production reality.
Make your own event-sourced agent harness using stream processors — Jonas Templestein, Iterate
Jonas Templestein and Misha from Iterate demonstrate a prototype event-sourced architecture for building distributed AI agent harnesses where all state changes are captured as immutable events in HTTP-accessible streams, enabling debuggability and composability across different languages and environments.
Give Your Agent a Computer — Nico Albanese, Vercel
Nico Albanese demonstrates building AI agents with Vercel's AI SDK 6, introducing the new tool loop agent pattern and three essential building blocks for 2026: agent runtimes, sophisticated tool ecosystems, and sandboxed computer environments for state persistence and code execution.
Viktor: AI Coworker That Lives in Slack — Fryderyk Wiatrowski
Fryderyk Wiatrowski presents Viktor, an AI coworker that lives natively in Slack and automates complex cross-functional tasks by leveraging shared company context and 3,000+ tool integrations, tracing its evolution from early browser-based agents toward solving the unique memory and permission challenges of multi-user enterprise environments.