Mind the Gap (In your Agent Observability) — Amy Boyd & Nitya Narasimhan, Microsoft
TL;DR
Microsoft's Amy Boyd and Nitya Narasimhan present the 'Mind the Gap' framework for AI agent observability, emphasizing continuous evaluation, OpenTelemetry tracing, and integrated safety guardrails to bridge the divide between development requirements and production reality.
🚇 The 'Mind the Gap' Framework 3 insights
Bridge Requirements and Reality Gaps
The London Tube analogy illustrates how agents (trains) often misalign with platform requirements, necessitating continuous evaluation as both evolve independently.
Implement Guardrails as Safety Warnings
Observability must include guardrails that alert users to dangers while monitoring exactly how customers engage with agent capabilities.
Monitor Continuously Across Time
Developers need persistent monitoring of agent fleets, not just single deployment checks, as user behavior and environments constantly change.
🔍 Three-Phase Observability Strategy 3 insights
Instrument Early with OpenTelemetry
Build observability in from day one using Hotel standards to trace heterogeneous agents across different hosting platforms into a unified control plane.
Debug Complex Multi-Agent Workflows
Evaluate specific workflow stages—intent resolution, tool calls, and task completion—to precisely identify where non-deterministic behavior emerges.
Scale to Fleet-Wide Management
Progress from single-agent monitoring to centralized observability across multiple multi-agent systems using Azure Monitor integration.
🛡️ Evaluation Types and Safety 3 insights
Evaluate Holistic Agent Performance
Move beyond LLM output scoring to assess end-to-end agent workflows using built-in quality metrics and custom evaluators tailored to specific scenarios.
Red Team Before Production
Proactively attack agents using open-source tools like Microsoft's Pirate repository to uncover vulnerabilities that standard quality checks miss.
Distinguish Quality from Security
Quality evaluation verifies normal operations while safeguarding specifically tests adversarial resilience against prompt attacks and malicious inputs.
⚙️ Developer Workflow Integration 2 insights
Solve Cold Start Evaluation Problems
Address the challenge of building prototypes with no existing data and selecting from 11,000+ models by using structured evaluation frameworks.
Close the Continuous Optimize Loop
Transform observability data into concrete improvements through continuous evaluation during code changes and scheduled production assessments.
Bottom Line
Implement observability from day one using OpenTelemetry standards to trace multi-agent workflows, continuously evaluate at every stage from intent resolution to task completion, and integrate red teaming to close the gap between expected and actual agent behavior.
More from AI Engineer
View all
Ship Real Agents: Hands-On Evals for Agentic Applications — Laurie Voss, Arize
Laurie Voss presents a practical framework for evaluating AI agents, emphasizing the shift from manual 'vibe checks' to automated test suites that combine code evals, LLM judges, and human validation to catch cascading failures in production systems.
Make your own event-sourced agent harness using stream processors — Jonas Templestein, Iterate
Jonas Templestein and Misha from Iterate demonstrate a prototype event-sourced architecture for building distributed AI agent harnesses where all state changes are captured as immutable events in HTTP-accessible streams, enabling debuggability and composability across different languages and environments.
Give Your Agent a Computer — Nico Albanese, Vercel
Nico Albanese demonstrates building AI agents with Vercel's AI SDK 6, introducing the new tool loop agent pattern and three essential building blocks for 2026: agent runtimes, sophisticated tool ecosystems, and sandboxed computer environments for state persistence and code execution.
Viktor: AI Coworker That Lives in Slack — Fryderyk Wiatrowski
Fryderyk Wiatrowski presents Victor, an AI employee that lives natively in Slack to automate complex cross-functional tasks by leveraging shared company context and 3,000+ tool integrations, evolving from early browser-based agents to solve the unique memory and permission challenges of multi-user enterprise environments.