LLM Observability, Evaluation, Experimentation Platform — Dat Ngo, Arize

AI Engineer

Podcasts | June 07, 2026 | 7.25 thousand views

TL;DR

Dat Ngo from Arize AI explains how modern AI systems require reimagined observability and evaluation patterns built on OpenTelemetry to manage non-deterministic agents, emphasizing that the future of AI engineering lies in automated experimentation flywheels that eliminate manual dashboard work.

🔍 Observability Architecture 3 insights

OpenTelemetry-first instrumentation

Arize builds on OpenTelemetry (OTel) standards, with auto-instrumentation that needs only one line of code to capture traces and spans across any framework. The captured traces serve as an audit record of agent behavior, because reading the code alone cannot tell you what a non-deterministic system actually did on a given run.

Multi-layer visibility requirements

Comprehensive observability requires examining granular traces (individual LLM calls), sessions (state and conversation history), and trajectory distributions (all possible agent paths) to understand back-and-forth states and branching logic.

Distributional path analysis

Viewing aggregations of all agent instantiations reveals what percentage of traffic flows down specific branches, helping identify path-dependent latency issues or component ordering errors that single trace views miss.
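
As a rough illustration of this kind of aggregation, the sketch below groups traces by the path the agent took and reports each path's share of traffic and mean latency. The trace records, field names, and component names are made up for the example and do not come from the talk.

```python
from collections import Counter

# Hypothetical trace records: the ordered components an agent visited,
# plus end-to-end latency in milliseconds.
traces = [
    {"path": ["router", "search", "answer"], "latency_ms": 900},
    {"path": ["router", "search", "answer"], "latency_ms": 1100},
    {"path": ["router", "answer"], "latency_ms": 300},
    {"path": ["router", "search", "rerank", "answer"], "latency_ms": 2600},
]


def path_distribution(traces):
    """Return {path: (share_of_traffic, mean_latency_ms)} over all traces."""
    counts = Counter(tuple(t["path"]) for t in traces)
    totals = Counter()
    for t in traces:
        totals[tuple(t["path"])] += t["latency_ms"]
    n = len(traces)
    return {p: (c / n, totals[p] / c) for p, c in counts.items()}


dist = path_distribution(traces)
# Print paths from most to least common.
for path, (share, mean_ms) in sorted(dist.items(), key=lambda kv: -kv[1][0]):
    print(f"{' -> '.join(path)}: {share:.0%} of traffic, mean {mean_ms:.0f} ms")
```

In this toy data the path that includes `rerank` carries a quarter of the traffic but has by far the highest mean latency, a pattern that is easy to miss when inspecting traces one at a time.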

⚖️ Evaluation Frameworks 3 insights

Five flavors of signal

Effective evaluation combines LLM-as-judge, human feedback, golden datasets for domain-specific quality tuning, deterministic logic checks (like JSON validation), and business metrics covering revenue, cost savings, and time efficiency.
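
Of the five signal types, the deterministic check is the simplest to show in code. The following is a minimal sketch of a JSON validation evaluator; the function name, the `label`/`reason` result shape, and the required keys are assumptions chosen for illustration.

```python
import json


def json_check(output: str, required_keys: set) -> dict:
    """Deterministic eval: is the output a JSON object with the required keys?"""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError as exc:
        return {"label": "fail", "reason": f"invalid JSON: {exc.msg}"}
    if not isinstance(parsed, dict):
        return {"label": "fail", "reason": "top-level value is not an object"}
    missing = sorted(required_keys - set(parsed))
    if missing:
        return {"label": "fail", "reason": f"missing keys: {missing}"}
    return {"label": "pass", "reason": ""}


print(json_check('{"answer": "hi", "sources": []}', {"answer", "sources"}))
print(json_check('{"answer": "hi"}', {"answer", "sources"}))
```

Checks like this are cheap and repeatable, so they can run on every output, while LLM-as-judge and human review are reserved for qualities that code cannot decide.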

Scoped evaluation depths

Evaluations operate at span level (single component I/O), multi-span (data passing between agents), trajectory level (end-to-end process completion), and session level (conversation satisfaction and state machine behavior).
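
The four scopes can be sketched as four small functions over the same data. The data classes and the pass/fail rules below are hypothetical simplifications: in particular, the session-level score here is just the fraction of successful turns, a stand-in for richer measures such as conversation satisfaction.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    input: str
    output: str


@dataclass
class Trace:
    spans: list
    completed: bool  # did the end-to-end task finish?


@dataclass
class Session:
    traces: list = field(default_factory=list)


def eval_span(span: Span) -> bool:
    """Span level: check one component's output in isolation (non-empty)."""
    return bool(span.output.strip())


def eval_handoffs(trace: Trace) -> bool:
    """Multi-span level: each span's output is what the next span received."""
    return all(a.output == b.input for a, b in zip(trace.spans, trace.spans[1:]))


def eval_trajectory(trace: Trace) -> bool:
    """Trajectory level: the run completed and every span passed."""
    return trace.completed and all(eval_span(s) for s in trace.spans)


def eval_session(session: Session) -> float:
    """Session level: fraction of turns in the conversation that succeeded."""
    if not session.traces:
        return 0.0
    return sum(eval_trajectory(t) for t in session.traces) / len(session.traces)


good = Trace([Span("retrieve", "q", "docs"), Span("answer", "docs", "final")], True)
bad = Trace([Span("retrieve", "q", "docs"), Span("answer", "other", "")], True)
session = Session([good, bad])
print(eval_handoffs(good), eval_handoffs(bad), eval_session(session))
```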

Role-based collaboration model

Technical users should handle framework coding while domain experts and product managers define evaluation criteria through no-code interfaces, allowing each group to work within their expertise.

🤖 Experimentation & Automation 3 insights

Systematic improvement loops

Experimentation means testing changes to prompts, models, and orchestration configurations against curated datasets to catch regressions. This matters because, in non-deterministic systems, fixing one issue often introduces unexpected failures elsewhere.
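
A regression check of this kind can be sketched in a few lines. Everything below is a toy under stated assumptions: the two "task" functions stand in for a baseline and a candidate prompt or model, the dataset is invented, and exact match is used as the evaluator only because it is easy to follow.

```python
def run_experiment(task, dataset, evaluator):
    """Run `task` on every example and return {example_id: passed}."""
    return {ex["id"]: evaluator(task(ex["input"]), ex["expected"]) for ex in dataset}


def regressions(baseline, candidate):
    """Example ids that passed under the baseline but fail under the candidate."""
    return sorted(i for i, ok in baseline.items() if ok and not candidate.get(i, False))


# Toy stand-ins for two variants; a real task would call an LLM.
def baseline_task(text):
    return text.strip().lower()


def candidate_task(text):
    return text.strip().lower().replace("colour", "color")


def exact_match(output, expected):
    return output == expected


dataset = [
    {"id": "a", "input": " Hello ", "expected": "hello"},
    {"id": "b", "input": "Colour", "expected": "colour"},
    {"id": "c", "input": "COLOR", "expected": "color"},
    {"id": "d", "input": "Colour scheme", "expected": "color scheme"},
]

base = run_experiment(baseline_task, dataset, exact_match)
cand = run_experiment(candidate_task, dataset, exact_match)
print("regressions:", regressions(base, cand))
```

Here the candidate fixes example `d` but breaks example `b`, which is the failure mode a curated dataset is meant to surface before a change ships.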

Code-native workflows

The industry is moving away from dashboards toward CLI tools and coding agents (such as Arize's 'Alyx') that programmatically analyze traces, detect issues such as high latency or errors, and run evaluations directly in developers' existing environments.
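
The programmatic checks mentioned here can be as simple as a scan over exported spans. The sketch below is an assumption-laden example, not a description of any Arize tool: the span fields (`start_ms`, `end_ms`, `status`) and the 2000 ms threshold are illustrative choices.

```python
def find_issues(spans, latency_threshold_ms=2000):
    """Flag spans that errored or ran longer than the latency threshold."""
    issues = []
    for span in spans:
        duration_ms = span["end_ms"] - span["start_ms"]
        if span.get("status") == "ERROR":
            issues.append((span["name"], "error"))
        if duration_ms > latency_threshold_ms:
            issues.append((span["name"], f"slow: {duration_ms} ms"))
    return issues


# Hypothetical exported spans from a single agent run.
spans = [
    {"name": "retrieve", "start_ms": 0, "end_ms": 120, "status": "OK"},
    {"name": "llm_call", "start_ms": 120, "end_ms": 3400, "status": "OK"},
    {"name": "tool_call", "start_ms": 3400, "end_ms": 3450, "status": "ERROR"},
]

issues = find_issues(spans)
for name, problem in issues:
    print(f"{name}: {problem}")
```

A script like this can run from a terminal or be called by a coding agent, which is what lets the analysis live in the developer's existing environment instead of a dashboard.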

The fully automated flywheel

The ultimate goal is complete automation where AI agents handle observability, generate appropriate evaluations dynamically based on context, and fix systems without human intervention, compressing the entire improvement cycle into an autonomous process.

Bottom Line

AI engineering should move toward fully automated observability and evaluation flywheels where AI agents handle detection, diagnosis, and fixing, allowing developers to focus on building rather than manual monitoring.
