Mind the Gap (In your Agent Observability) — Amy Boyd & Nitya Narasimhan, Microsoft

| Podcasts | May 14, 2026 | 3.32 Thousand views | 1:20:07

TL;DR

Microsoft's Amy Boyd and Nitya Narasimhan present the 'Mind the Gap' framework for AI agent observability, emphasizing continuous evaluation, OpenTelemetry tracing, and integrated safety guardrails to bridge the divide between development requirements and production reality.

🚇 The 'Mind the Gap' Framework 3 insights

Bridge Requirements and Reality Gaps

The London Tube analogy illustrates how agents (trains) often misalign with platform requirements, necessitating continuous evaluation as both evolve independently.

Implement Guardrails as Safety Warnings

Observability must include guardrails that alert users to dangers while monitoring exactly how customers engage with agent capabilities.

Monitor Continuously Across Time

Developers need persistent monitoring of agent fleets, not just single deployment checks, as user behavior and environments constantly change.

🔍 Three-Phase Observability Strategy 3 insights

Instrument Early with OpenTelemetry

Build observability in from day one using Hotel standards to trace heterogeneous agents across different hosting platforms into a unified control plane.

Debug Complex Multi-Agent Workflows

Evaluate specific workflow stages—intent resolution, tool calls, and task completion—to precisely identify where non-deterministic behavior emerges.

Scale to Fleet-Wide Management

Progress from single-agent monitoring to centralized observability across multiple multi-agent systems using Azure Monitor integration.

🛡️ Evaluation Types and Safety 3 insights

Evaluate Holistic Agent Performance

Move beyond LLM output scoring to assess end-to-end agent workflows using built-in quality metrics and custom evaluators tailored to specific scenarios.

Red Team Before Production

Proactively attack agents using open-source tools like Microsoft's Pirate repository to uncover vulnerabilities that standard quality checks miss.

Distinguish Quality from Security

Quality evaluation verifies normal operations while safeguarding specifically tests adversarial resilience against prompt attacks and malicious inputs.

⚙️ Developer Workflow Integration 2 insights

Solve Cold Start Evaluation Problems

Address the challenge of building prototypes with no existing data and selecting from 11,000+ models by using structured evaluation frameworks.

Close the Continuous Optimize Loop

Transform observability data into concrete improvements through continuous evaluation during code changes and scheduled production assessments.

Bottom Line

Implement observability from day one using OpenTelemetry standards to trace multi-agent workflows, continuously evaluate at every stage from intent resolution to task completion, and integrate red teaming to close the gap between expected and actual agent behavior.

More from AI Engineer

View all
Frontier results, on device - RL Nabors, Arize
30:52
AI Engineer AI Engineer

Frontier results, on device - RL Nabors, Arize

Rachel Lee Neighbors introduces a framework for replacing expensive cloud-based frontier models with Small Language Models (SLMs) running on-device, demonstrating how a systematic 'prototype big, deploy small' approach using evaluation tools like Phoenix can cut inference costs to zero while maintaining 90% accuracy and enabling offline functionality.

about 11 hours ago · 10 points
The Future Is Domain-Specific Agents - Justin Schroeder, StandardAgents
30:38
AI Engineer AI Engineer

The Future Is Domain-Specific Agents - Justin Schroeder, StandardAgents

Justin Schroeder argues that the future of AI lies in domain-specific agents—small, specialized agents that compose together rather than general-purpose agents bloated with tools and skills, delivering 80%+ token efficiency and 137x cost savings compared to monolithic approaches.

about 12 hours ago · 9 points
The Agentic AI Engineer - Benedikt Sanftl, Mutagent
34:50
AI Engineer AI Engineer

The Agentic AI Engineer - Benedikt Sanftl, Mutagent

Benedikt Sanftl and Burak from Mutagent present the 'Agentic AI Engineer' paradigm, where specialized AI agents autonomously manage the entire lifecycle of building, evaluating, and optimizing other agents through automated offline and online loops, solving the scalability bottlenecks of manual development.

about 13 hours ago · 10 points