LLM Observability, Evaluation, Experimentation Platform — Dat Ngo, Arize

Podcasts | June 07, 2026 | 2.09 thousand views

TL;DR

Dat Ngo of Arize AI argues that non-deterministic agents need observability and evaluation patterns rebuilt on OpenTelemetry. In his view, the future of AI engineering is an automated experimentation flywheel that replaces manual dashboard work.

🔍 Observability Architecture (3 insights)

OpenTelemetry-first instrumentation

Arize builds on OpenTelemetry (OTel) standards, with auto-instrumentation that needs only one line of code to capture traces and spans across any framework. The resulting traces serve as an audit record of agent behavior, because reading the code alone cannot show what a non-deterministic system actually did.

Multi-layer visibility requirements

Comprehensive observability covers three layers: granular traces (individual LLM calls), sessions (state and conversation history), and trajectory distributions (all possible agent paths). Together they show how an agent moves back and forth between states and where its logic branches.

Distributional path analysis

Viewing aggregations of all agent instantiations reveals what percentage of traffic flows down specific branches, helping identify path-dependent latency issues or component ordering errors that single trace views miss.
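As a rough illustration of the aggregation described above, the sketch below groups agent runs by the path they took and reports each path's share of traffic and median end-to-end latency. The trace records, step names, and latencies are invented for the example, and the list-of-tuples layout is an assumption, not the format of any particular tracing backend.

```python
from collections import Counter, defaultdict
from statistics import median

# Hypothetical traces: each one is the ordered list of (step_name, latency_ms)
# pairs that a single agent run passed through. All values are made up.
TRACES = [
    [("router", 20), ("retriever", 180), ("llm", 900)],
    [("router", 25), ("retriever", 210), ("llm", 950)],
    [("router", 22), ("llm", 700)],
    [("router", 30), ("tool_call", 400), ("retriever", 200), ("llm", 1600)],
]


def path_of(trace):
    """Collapse a trace into a string key naming the branch it followed."""
    return " -> ".join(step for step, _ in trace)


def path_distribution(traces):
    """Return, per path, its share of all runs and its median total latency."""
    counts = Counter(path_of(t) for t in traces)
    latencies = defaultdict(list)
    for t in traces:
        latencies[path_of(t)].append(sum(ms for _, ms in t))
    total = len(traces)
    return {
        path: {
            "share": n / total,
            "median_latency_ms": median(latencies[path]),
        }
        for path, n in counts.items()
    }


if __name__ == "__main__":
    dist = path_distribution(TRACES)
    # Most-travelled paths first.
    for path, stats in sorted(dist.items(), key=lambda kv: -kv[1]["share"]):
        print(f"{stats['share']:.0%}  {stats['median_latency_ms']} ms  {path}")
```

A single-trace view would show each of these runs as unremarkable; only the grouped view makes it visible that one rarely taken branch carries most of the latency.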

⚖️ Evaluation Frameworks (3 insights)

Five flavors of signal

Effective evaluation combines LLM-as-judge, human feedback, golden datasets for domain-specific quality tuning, deterministic logic checks (like JSON validation), and business metrics covering revenue, cost savings, and time efficiency.
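Of the five signals, the deterministic check is the simplest to show in code. The sketch below validates that a model response is a JSON object with the expected fields. The `REQUIRED_FIELDS` schema and the `check_json_output` name are hypothetical, chosen only for illustration.

```python
import json

# Hypothetical schema: field name -> expected Python type after parsing.
REQUIRED_FIELDS = {"answer": str, "sources": list}


def check_json_output(raw: str) -> dict:
    """Deterministic evaluator: pass only if `raw` is a JSON object
    containing every required field with the expected type."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return {"passed": False, "reason": f"invalid JSON: {exc.msg}"}
    if not isinstance(data, dict):
        return {"passed": False, "reason": "top-level value is not an object"}
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            return {"passed": False, "reason": f"missing field: {field}"}
        if not isinstance(data[field], expected_type):
            return {"passed": False, "reason": f"wrong type for field: {field}"}
    return {"passed": True, "reason": "ok"}


if __name__ == "__main__":
    print(check_json_output('{"answer": "42", "sources": []}'))
    print(check_json_output('{"answer": "42"}'))
    print(check_json_output("not json at all"))
```

Checks like this are cheap and repeatable, which is why they are usually run alongside the slower or costlier signals (LLM-as-judge, human review) rather than instead of them.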

Scoped evaluation depths

Evaluations operate at span level (single component I/O), multi-span (data passing between agents), trajectory level (end-to-end process completion), and session level (conversation satisfaction and state machine behavior).
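To make the four scopes concrete, here is a minimal sketch with one evaluator per level. The `Span`, `Trace`, and `Session` classes and the specific checks (non-empty output, matching handoffs, ending on a `respond` step, fraction of completed turns) are placeholder assumptions. A real setup would substitute its own criteria, for example an LLM judge for session satisfaction.

```python
from dataclasses import dataclass


@dataclass
class Span:
    name: str
    input: str
    output: str


@dataclass
class Trace:
    spans: list  # ordered list of Span


@dataclass
class Session:
    traces: list  # one Trace per conversation turn


def span_eval(span: Span) -> bool:
    """Span level: did this single component produce any output?"""
    return bool(span.output.strip())


def multi_span_eval(trace: Trace) -> bool:
    """Multi-span level: was each span's output handed on unchanged
    as the next span's input?"""
    return all(b.input == a.output for a, b in zip(trace.spans, trace.spans[1:]))


def trajectory_eval(trace: Trace, final_step: str = "respond") -> bool:
    """Trajectory level: did the run end on the expected final step?"""
    return bool(trace.spans) and trace.spans[-1].name == final_step


def session_eval(session: Session) -> float:
    """Session level: fraction of turns whose trajectory completed
    (a crude stand-in for conversation satisfaction)."""
    if not session.traces:
        return 0.0
    return sum(trajectory_eval(t) for t in session.traces) / len(session.traces)


if __name__ == "__main__":
    good = Trace([Span("plan", "q", "steps"), Span("respond", "steps", "answer")])
    stalled = Trace([Span("plan", "q", "steps")])
    print(session_eval(Session([good, stalled])))
```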

Role-based collaboration model

Technical users should handle framework coding while domain experts and product managers define evaluation criteria through no-code interfaces, allowing each group to work within their expertise.

🤖 Experimentation & Automation (3 insights)

Systematic improvement loops

Experimentation means testing changes to prompts, models, and orchestration configurations against curated datasets. The goal is to catch regressions, because in non-deterministic systems fixing one issue often creates unexpected failures elsewhere.
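A minimal sketch of such a regression check follows. It runs a baseline and a candidate over the same curated dataset and lists which examples were fixed and which regressed. The dataset, the two stub systems, and all function names are invented for illustration. The stubs are deterministic lookups standing in for real model calls, which would normally need repeated runs to account for variance.

```python
# Hypothetical curated dataset of inputs with expected answers.
DATASET = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "3*3", "expected": "9"},
]

# Stub systems: canned answers standing in for "old prompt" and "new prompt".
_BASELINE_ANSWERS = {"2+2": "4", "capital of France": "paris", "3*3": "9"}
_CANDIDATE_ANSWERS = {"2+2": "4", "capital of France": "Paris", "3*3": "6"}


def baseline(question: str) -> str:
    return _BASELINE_ANSWERS[question]


def candidate(question: str) -> str:
    return _CANDIDATE_ANSWERS[question]


def run_experiment(system, dataset):
    """Return one pass/fail flag per example (exact-match scoring)."""
    return [system(ex["input"]) == ex["expected"] for ex in dataset]


def compare(dataset, baseline_results, candidate_results):
    """List inputs the candidate fixed and inputs it broke."""
    rows = list(zip(dataset, baseline_results, candidate_results))
    return {
        "fixes": [ex["input"] for ex, b, c in rows if not b and c],
        "regressions": [ex["input"] for ex, b, c in rows if b and not c],
    }


if __name__ == "__main__":
    report = compare(
        DATASET,
        run_experiment(baseline, DATASET),
        run_experiment(candidate, DATASET),
    )
    print(report)
```

In this toy run the candidate fixes the capitalization failure but breaks an arithmetic case that used to pass, which an aggregate accuracy number alone (2 of 3 in both runs) would hide.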

Code-native workflows

The industry is moving away from dashboards toward CLI tools and coding agents (like Arize's 'Alex') that programmatically analyze traces, detect issues such as high latency or errors, and run evaluations directly in developers' existing environments.
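The kind of programmatic trace analysis described here can be approximated with a short script. The sketch below scans exported span records and flags errors and slow spans. It is not how Arize's tooling works internally. The record fields (`trace_id`, `name`, `latency_ms`, `status`), the threshold, and the `find_issues` function are all assumptions made for the example.

```python
# Hypothetical exported span records; field names are assumed, not a real schema.
SPANS = [
    {"trace_id": "t1", "name": "retriever", "latency_ms": 180, "status": "ok"},
    {"trace_id": "t1", "name": "llm", "latency_ms": 850, "status": "ok"},
    {"trace_id": "t2", "name": "llm", "latency_ms": 4200, "status": "ok"},
    {"trace_id": "t3", "name": "tool_call", "latency_ms": 90, "status": "error"},
]


def find_issues(spans, latency_threshold_ms=2000):
    """Return (trace_id, span_name, issue) tuples for failed or slow spans.

    A span with status "error" is reported as an error; otherwise it is
    reported as high latency when it exceeds the threshold.
    """
    issues = []
    for span in spans:
        if span["status"] == "error":
            issues.append((span["trace_id"], span["name"], "error"))
        elif span["latency_ms"] > latency_threshold_ms:
            issues.append((span["trace_id"], span["name"], "high_latency"))
    return issues


if __name__ == "__main__":
    for trace_id, name, issue in find_issues(SPANS):
        print(f"{trace_id}\t{name}\t{issue}")
```

Because the output is plain text, a script like this can run from a terminal or be called by a coding agent without anyone opening a dashboard.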

The fully automated flywheel

The ultimate goal is complete automation where AI agents handle observability, generate appropriate evaluations dynamically based on context, and fix systems without human intervention, compressing the entire improvement cycle into an autonomous process.

Bottom Line

AI engineering should move toward fully automated observability and evaluation flywheels where AI agents handle detection, diagnosis, and fixing, allowing developers to focus on building rather than manual monitoring.
