Mind the Gap (In your Agent Observability) — Amy Boyd & Nitya Narasimhan, Microsoft

| Podcasts | May 14, 2026 | 1.43 Thousand views | 1:20:07

TL;DR

Microsoft's Amy Boyd and Nitya Narasimhan present the 'Mind the Gap' framework for AI agent observability, emphasizing continuous evaluation, OpenTelemetry tracing, and integrated safety guardrails to bridge the divide between development requirements and production reality.

🚇 The 'Mind the Gap' Framework 3 insights

Bridge Requirements and Reality Gaps

The London Tube analogy illustrates how agents (trains) often misalign with platform requirements, necessitating continuous evaluation as both evolve independently.

Implement Guardrails as Safety Warnings

Observability must include guardrails that alert users to dangers while monitoring exactly how customers engage with agent capabilities.

Monitor Continuously Across Time

Developers need persistent monitoring of agent fleets, not just single deployment checks, as user behavior and environments constantly change.

🔍 Three-Phase Observability Strategy 3 insights

Instrument Early with OpenTelemetry

Build observability in from day one using Hotel standards to trace heterogeneous agents across different hosting platforms into a unified control plane.

Debug Complex Multi-Agent Workflows

Evaluate specific workflow stages—intent resolution, tool calls, and task completion—to precisely identify where non-deterministic behavior emerges.

Scale to Fleet-Wide Management

Progress from single-agent monitoring to centralized observability across multiple multi-agent systems using Azure Monitor integration.

🛡️ Evaluation Types and Safety 3 insights

Evaluate Holistic Agent Performance

Move beyond LLM output scoring to assess end-to-end agent workflows using built-in quality metrics and custom evaluators tailored to specific scenarios.

Red Team Before Production

Proactively attack agents using open-source tools like Microsoft's Pirate repository to uncover vulnerabilities that standard quality checks miss.

Distinguish Quality from Security

Quality evaluation verifies normal operations while safeguarding specifically tests adversarial resilience against prompt attacks and malicious inputs.

⚙️ Developer Workflow Integration 2 insights

Solve Cold Start Evaluation Problems

Address the challenge of building prototypes with no existing data and selecting from 11,000+ models by using structured evaluation frameworks.

Close the Continuous Optimize Loop

Transform observability data into concrete improvements through continuous evaluation during code changes and scheduled production assessments.

Bottom Line

Implement observability from day one using OpenTelemetry standards to trace multi-agent workflows, continuously evaluate at every stage from intent resolution to task completion, and integrate red teaming to close the gap between expected and actual agent behavior.

More from AI Engineer

View all
Give Your Agent a Computer — Nico Albanese, Vercel
AI Engineer AI Engineer

Give Your Agent a Computer — Nico Albanese, Vercel

Nico Albanese demonstrates building AI agents with Vercel's AI SDK 6, introducing the new tool loop agent pattern and three essential building blocks for 2026: agent runtimes, sophisticated tool ecosystems, and sandboxed computer environments for state persistence and code execution.

3 days ago · 8 points
Viktor: AI Coworker That Lives in Slack — Fryderyk Wiatrowski
AI Engineer AI Engineer

Viktor: AI Coworker That Lives in Slack — Fryderyk Wiatrowski

Fryderyk Wiatrowski presents Victor, an AI employee that lives natively in Slack to automate complex cross-functional tasks by leveraging shared company context and 3,000+ tool integrations, evolving from early browser-based agents to solve the unique memory and permission challenges of multi-user enterprise environments.

4 days ago · 8 points