Ship Real Agents: Hands-On Evals for Agentic Applications — Laurie Voss, Arize

AI Engineer

Podcasts | May 14, 2026 | 2.64K views

TL;DR

Laurie Voss presents a practical framework for evaluating AI agents, emphasizing the shift from manual 'vibe checks' to automated test suites that combine code evals, LLM judges, and human validation to catch cascading failures in production systems.

🧪 The Evaluation Imperative

Evals solve the 'vibes problem'

Manual testing fails because agents break on edge cases, adversarial inputs, and unexpected vocabulary that developers miss during casual testing but users trigger immediately.

Traditional unit tests fail for LLMs

Standard string matching doesn't work when semantically identical outputs vary in wording, requiring evals that understand meaning rather than exact text matches.
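
A minimal sketch of why exact matching breaks down, assuming a made-up shipping answer. The `contains_facts` helper is hypothetical and only a crude stand-in for a meaning-aware eval: it checks that the required facts appear, regardless of wording.

```python
import re

def exact_match(expected: str, actual: str) -> bool:
    # Classic unit-test style comparison: any rewording fails.
    return expected == actual

def contains_facts(actual: str, required_facts: list) -> bool:
    # Lowercase, replace punctuation with spaces, collapse whitespace,
    # then require every fact to appear somewhere in the output.
    normalized = re.sub(r"[^a-z0-9 ]", " ", actual.lower())
    normalized = " ".join(normalized.split())
    return all(fact.lower() in normalized for fact in required_facts)

expected = "Your order ships in 3 days."
actual = "The order will be shipped within 3 days!"

print(exact_match(expected, actual))                # False
print(contains_facts(actual, ["order", "3 days"]))  # True
```

A substring check like this still misses paraphrases such as "three days", which is the gap an LLM judge is meant to cover.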

Traces provide evaluation data

Traces capture every LLM invocation, tool call, and intermediate step as nested JSON spans, creating the structured log data necessary for systematic debugging and testing.
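
As an illustration, a trace can be pictured as a tree of spans. The field names below (`name`, `kind`, `children`) are assumptions for this sketch, not the schema of any particular tracing tool.

```python
import json
from typing import Iterator, Tuple

# Hypothetical trace: one agent run containing an LLM call and a tool call.
trace_json = """
{
  "name": "agent_run", "kind": "agent",
  "children": [
    {"name": "plan", "kind": "llm", "children": []},
    {"name": "search_stock_price", "kind": "tool", "children": [
      {"name": "summarize", "kind": "llm", "children": []}
    ]}
  ]
}
"""

def walk(span: dict, depth: int = 0) -> Iterator[Tuple[int, dict]]:
    # Depth-first traversal that yields each span with its nesting depth.
    yield depth, span
    for child in span.get("children", []):
        yield from walk(child, depth + 1)

trace = json.loads(trace_json)
flat = [(depth, s["kind"], s["name"]) for depth, s in walk(trace)]
tool_calls = [name for _, kind, name in flat if kind == "tool"]

for depth, kind, name in flat:
    print("  " * depth + f"{kind}: {name}")
```

Flattening the tree this way makes it straightforward to pull out just the tool calls or just the LLM invocations for later checks.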

⚖️ Three Complementary Evaluation Types

Code evals check deterministic constraints

Use deterministic functions to validate JSON format, token limits, forbidden phrases, and required fields since they execute in milliseconds at negligible cost.
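
A code eval along these lines might look like the sketch below. The field names, forbidden phrase, and limit are illustrative assumptions, and a whitespace word count stands in for a real token count.

```python
import json

FORBIDDEN = ["as an ai language model"]  # assumed phrase list
REQUIRED = ["answer", "sources"]         # assumed required fields
MAX_WORDS = 50                           # word count as a rough proxy for tokens

def code_eval(raw_output: str) -> dict:
    results = {
        "valid_json": False,
        "required_fields": False,
        "within_limit": False,
        "no_forbidden": False,
    }
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return results  # nothing else can be checked
    if not isinstance(data, dict):
        return results  # valid JSON, but not the object we expect
    results["valid_json"] = True
    results["required_fields"] = all(key in data for key in REQUIRED)
    answer = str(data.get("answer", ""))
    results["within_limit"] = len(answer.split()) <= MAX_WORDS
    results["no_forbidden"] = not any(p in answer.lower() for p in FORBIDDEN)
    return results

good = '{"answer": "TSLA closed higher today.", "sources": ["feed"]}'
print(code_eval(good))
print(code_eval("not json"))
```

Returning one boolean per check, rather than a single pass/fail, keeps it clear which constraint an output violated.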

LLM judges assess semantic quality

Deploy more powerful LLMs as judges to evaluate tone, factual accuracy, and faithfulness to source material using rubric-based prompts, despite higher latency and nondeterminism.
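
One way to structure a rubric-based judge is sketched below. The rubric wording and the `call_llm` parameter are assumptions: `call_llm` is a placeholder for whatever client sends a prompt to the judge model and returns its text reply, and a stub is used here so the example runs offline.

```python
from typing import Callable, Optional

RUBRIC = """You are grading an answer for faithfulness to the source.
Source: {source}
Answer: {answer}
Reply with exactly one word: "faithful" or "unfaithful"."""

def judge_faithfulness(
    source: str, answer: str, call_llm: Callable[[str], str]
) -> Optional[bool]:
    prompt = RUBRIC.format(source=source, answer=answer)
    # Normalize the reply: trim whitespace, lowercase, drop trailing punctuation.
    reply = call_llm(prompt).strip().lower().strip(".\"'")
    if reply == "faithful":
        return True
    if reply == "unfaithful":
        return False
    return None  # the judge did not follow the rubric's output format

def stub_llm(prompt: str) -> str:
    # Stand-in for a real model call.
    return "Unfaithful."

verdict = judge_faithfulness("Revenue rose 5%.", "Revenue fell sharply.", stub_llm)
print(verdict)  # False
```

Constraining the judge to a fixed label set, and treating anything else as unparseable, limits how much its nondeterminism can leak into the results.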

Human evals build golden datasets

Reserve human judgment for creating ground-truth datasets and validating failure modes, though human annotators exhibit roughly 50% error rates due to fatigue at scale.

🤖 Agent-Specific Complexity

Cascading failures multiply errors

Early missteps in tool selection or parameter parsing compound through subsequent calls, potentially producing radically wrong outputs like researching Nikola Tesla instead of Tesla stock.

Multi-agent systems add routing challenges

Evaluation must verify that routing LLMs select correct sub-agents and that information transfers accurately between specialized agents in complex workflows.
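
A routing check can be as simple as scoring the router against labeled queries. In this sketch the agent names and test cases are invented, and `keyword_router` is a deterministic stand-in for the routing LLM.

```python
from typing import Callable, List, Tuple

# (query, expected sub-agent) pairs; all names are hypothetical.
cases = [
    ("What is TSLA trading at?", "finance_agent"),
    ("Who was Nikola Tesla?", "research_agent"),
    ("Tesla quarterly earnings", "finance_agent"),
    ("How is Tesla doing on the market?", "finance_agent"),
]

def keyword_router(query: str) -> str:
    # Stand-in for a routing LLM: naive keyword matching.
    q = query.lower()
    if any(word in q for word in ("tsla", "stock", "earnings")):
        return "finance_agent"
    return "research_agent"

def routing_report(
    router: Callable[[str], str], cases: List[Tuple[str, str]]
) -> Tuple[float, list]:
    misses = []
    for query, expected in cases:
        chosen = router(query)
        if chosen != expected:
            misses.append((query, expected, chosen))
    accuracy = (len(cases) - len(misses)) / len(cases) if cases else 0.0
    return accuracy, misses

accuracy, misses = routing_report(keyword_router, cases)
print(accuracy)  # 0.75
print(misses)
```

The list of misses is usually more useful than the accuracy number, since it shows which phrasing sent the query to the wrong agent.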

Over-prescriptive evals block valid optimizations

Avoid mandating specific tool sequences in evals, as upgraded models may discover more efficient valid paths that differ from expected execution patterns.
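
The difference between a sequence-based and an outcome-based eval can be sketched as follows. The tool names and runs are hypothetical; the point is that the second check constrains what must happen, not the order or number of steps.

```python
def strict_sequence_eval(tool_calls: list, expected_sequence: list) -> bool:
    # Over-prescriptive: passes only if the exact path is reproduced.
    return tool_calls == expected_sequence

def outcome_eval(
    tool_calls: list, final_answer: str, required_tools: list, must_contain: str
) -> bool:
    # Looser: required tools were used at some point and the answer
    # contains the expected content.
    used_required = set(required_tools) <= set(tool_calls)
    return used_required and must_contain.lower() in final_answer.lower()

old_run = ["search_ticker", "get_quote", "format_answer"]
new_run = ["get_quote"]  # an upgraded model that skips the lookup step
answer = "TSLA is trading at the latest quoted price."

print(strict_sequence_eval(new_run, old_run))                # False
print(outcome_eval(new_run, answer, ["get_quote"], "TSLA"))  # True
```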

🔄 Implementation Strategy

Analyze traces before writing evals

Categorize actual production failures by reading trace data to determine what metrics to measure, rather than guessing requirements before seeing real error patterns.
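
Once traces have been read and labeled, tallying the categories shows which evals to write first. The trace IDs and failure labels here are invented for illustration.

```python
from collections import Counter

# Hypothetical annotations produced while reading production traces.
annotated = [
    {"trace_id": "t1", "failure": "wrong_tool"},
    {"trace_id": "t2", "failure": None},  # successful run
    {"trace_id": "t3", "failure": "bad_parameters"},
    {"trace_id": "t4", "failure": "wrong_tool"},
    {"trace_id": "t5", "failure": "hallucinated_fact"},
]

counts = Counter(t["failure"] for t in annotated if t["failure"])
failure_rate = sum(counts.values()) / len(annotated)

for category, n in counts.most_common():
    print(f"{category}: {n}")
print(f"failure rate: {failure_rate:.0%}")
```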

Meta-evaluation verifies judge accuracy

Test LLM judges against labeled datasets to ensure they correctly identify success and failure cases, preventing misaligned evals from corrupting your feedback loop.
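
Meta-evaluation amounts to scoring the judge like a classifier. This sketch assumes boolean labels where `True` means "this output is a failure", with made-up human and judge labels.

```python
def meta_evaluate(human: list, judge: list) -> dict:
    if len(human) != len(judge):
        raise ValueError("label lists must be the same length")
    pairs = list(zip(human, judge))
    tp = sum(1 for h, j in pairs if h and j)          # both flag a failure
    fp = sum(1 for h, j in pairs if not h and j)      # judge flags a good output
    fn = sum(1 for h, j in pairs if h and not j)      # judge misses a failure
    tn = sum(1 for h, j in pairs if not h and not j)  # both say the output is fine
    return {
        "agreement": (tp + tn) / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

human_labels = [True, True, False, False, True, False]
judge_labels = [True, False, False, True, True, False]

report = meta_evaluate(human_labels, judge_labels)
print(report)
```

Low recall means the judge is letting real failures through, while low precision means it is flagging good outputs; either one distorts the feedback loop in a different direction.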

Evals enable safe model upgrades

Comprehensive eval suites allow teams to switch between models like Claude 4.5 and 4.6 or modify prompts without introducing regressions that break existing functionality.
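
A regression check between two configurations can be sketched as a comparison of per-case results. The case IDs and outcomes are hypothetical; in practice each dictionary would come from running the same eval suite against the old and new model or prompt.

```python
def find_regressions(baseline: dict, candidate: dict) -> list:
    # A regression is a case that passed before and does not pass now.
    # A case missing from the candidate run counts as a failure.
    return sorted(
        case for case, passed in baseline.items()
        if passed and not candidate.get(case, False)
    )

def find_improvements(baseline: dict, candidate: dict) -> list:
    return sorted(
        case for case, passed in candidate.items()
        if passed and not baseline.get(case, False)
    )

baseline = {"case_1": True, "case_2": True, "case_3": False}
candidate = {"case_1": True, "case_2": False, "case_3": True}

print(find_regressions(baseline, candidate))   # ['case_2']
print(find_improvements(baseline, candidate))  # ['case_3']
```

Comparing case by case matters because an unchanged overall pass rate, as in this example, can hide a new failure behind an unrelated fix.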

Bottom Line

Implement a hybrid evaluation pipeline combining deterministic code checks, LLM judges for semantic analysis, and human-validated golden datasets, integrated into CI to catch cascading failures before production deployment.
