LLM Observability, Evaluation, Experimentation Platform — Dat Ngo, Arize
TL;DR
Dat Ngo from Arize AI explains why non-deterministic agents need observability and evaluation patterns rebuilt on OpenTelemetry. He argues that the future of AI engineering lies in automated experimentation flywheels that eliminate manual dashboard work.
🔍 Observability Architecture
OpenTelemetry-first instrumentation
Arize builds on the OpenTelemetry (OTel) standard, with auto-instrumentation that needs only one line of code to capture traces and spans across any framework. The captured traces act as an audit record of agent behavior, which matters because reading the code alone cannot tell you what a non-deterministic system actually did.
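To make the idea of traces as an audit record concrete, here is a minimal stdlib-only sketch of what span capture amounts to. It is not the OpenTelemetry API and not Arize's auto-instrumentation; `Span`, `traced`, `RECORDED`, and the two example functions are hypothetical names chosen for illustration.

```python
import contextvars
import time
import uuid
from dataclasses import dataclass, field
from functools import wraps
from typing import List, Optional


@dataclass
class Span:
    """One recorded unit of work (hypothetical shape, loosely modeled on a trace span)."""
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    start: float
    end: float = 0.0
    attributes: dict = field(default_factory=dict)


RECORDED: List[Span] = []  # stand-in for an exporter/backend
_current = contextvars.ContextVar("current_span", default=None)


def traced(fn):
    """Record a span for every call to fn, linking it to the enclosing span."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        parent = _current.get()
        span = Span(
            name=fn.__name__,
            trace_id=parent.trace_id if parent else uuid.uuid4().hex,
            span_id=uuid.uuid4().hex[:16],
            parent_id=parent.span_id if parent else None,
            start=time.perf_counter(),
        )
        token = _current.set(span)
        try:
            result = fn(*args, **kwargs)
            span.attributes["output"] = repr(result)
            return result
        except Exception as exc:
            span.attributes["error"] = repr(exc)
            raise
        finally:
            span.end = time.perf_counter()
            _current.reset(token)
            RECORDED.append(span)
    return wrapper


@traced
def retrieve(query):
    return ["doc-1", "doc-2"]  # placeholder for a retrieval step


@traced
def answer(query):
    docs = retrieve(query)
    return f"answer from {len(docs)} docs"  # placeholder for an LLM call


answer("what is a span?")
for s in RECORDED:
    print(s.name, s.trace_id[:8], s.parent_id)
```

The inner `retrieve` span finishes first and carries the `answer` span's id as its parent, so the record preserves which step ran inside which, independent of what the source code suggests should happen.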
Multi-layer visibility requirements
Comprehensive observability requires three views: granular traces (individual LLM calls), sessions (state and conversation history), and trajectory distributions (all possible agent paths). Together they show how state evolves across turns and where the agent's logic branches.
Distributional path analysis
Viewing aggregations of all agent instantiations reveals what percentage of traffic flows down specific branches, helping identify path-dependent latency issues or component ordering errors that single trace views miss.
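A rough sketch of this kind of aggregation, assuming each trace has already been reduced to an ordered tuple of component names plus a total latency. The record layout and the `path_distribution` helper are assumptions made for illustration, not a format from the talk.

```python
from collections import defaultdict

# Hypothetical, already-flattened traces: the path taken and the total latency.
traces = [
    {"path": ("router", "search", "answer"), "latency_ms": 800},
    {"path": ("router", "search", "answer"), "latency_ms": 1000},
    {"path": ("router", "answer"), "latency_ms": 300},
    {"path": ("router", "search", "search", "answer"), "latency_ms": 2400},
]


def path_distribution(traces):
    """Group traces by path and report traffic share and mean latency per path."""
    grouped = defaultdict(list)
    for trace in traces:
        grouped[tuple(trace["path"])].append(trace["latency_ms"])

    total = len(traces)
    rows = [
        {
            "path": " -> ".join(path),
            "share": len(latencies) / total,
            "mean_latency_ms": sum(latencies) / len(latencies),
        }
        for path, latencies in grouped.items()
    ]
    # Most-travelled paths first.
    return sorted(rows, key=lambda row: row["share"], reverse=True)


rows = path_distribution(traces)
for row in rows:
    print(f'{row["share"]:.0%}  {row["mean_latency_ms"]:.0f} ms  {row["path"]}')
```

In this toy data the path that calls `search` twice carries a quarter of the traffic but by far the highest latency, which is the sort of path-dependent issue a single-trace view would not surface.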
⚖️ Evaluation Frameworks
Five flavors of signal
Effective evaluation combines LLM-as-judge, human feedback, golden datasets for domain-specific quality tuning, deterministic logic checks (like JSON validation), and business metrics covering revenue, cost savings, and time efficiency.
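Of the five signal types, the deterministic check is the easiest to show in code. The following is a minimal sketch of a JSON validation evaluator; the function name, the result shape, and the required keys are assumptions made for the example.

```python
import json


def check_json_output(raw, required_keys):
    """Deterministic check: is the model output a JSON object with the expected keys?"""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        return {"passed": False, "reason": f"invalid JSON: {exc.msg}"}

    if not isinstance(parsed, dict):
        return {"passed": False, "reason": "top-level value is not an object"}

    missing = sorted(set(required_keys) - set(parsed))
    if missing:
        return {"passed": False, "reason": f"missing keys: {missing}"}

    return {"passed": True, "reason": "ok"}


good = check_json_output('{"intent": "refund", "confidence": 0.9}', ["intent", "confidence"])
partial = check_json_output('{"intent": "refund"}', ["intent", "confidence"])
broken = check_json_output("Sure! Here is the JSON you asked for", ["intent"])
print(good, partial, broken, sep="\n")
```

Checks like this are cheap and repeatable, so they are a natural complement to the noisier signals (LLM-as-judge, human feedback) in the same evaluation suite.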
Scoped evaluation depths
Evaluations operate at span level (single component I/O), multi-span (data passing between agents), trajectory level (end-to-end process completion), and session level (conversation satisfaction and state machine behavior).
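The four scopes can be sketched as four small functions over the same data. The span layout, the specific pass criteria, and every function name here are illustrative assumptions; real evaluators at each level would typically be richer (for example an LLM-as-judge at session level).

```python
# A session is a list of trajectories; a trajectory is an ordered list of spans.
trace_1 = [
    {"name": "plan", "input": "refund order 42", "output": "lookup_order(42)"},
    {"name": "tool", "input": "lookup_order(42)", "output": "order 42: delivered"},
    {"name": "respond", "input": "order 42: delivered", "output": "Your refund is on its way."},
]
trace_2 = [
    {"name": "plan", "input": "thanks", "output": "no tool needed"},
    # The planner's output never reached the responder: a broken handoff.
    {"name": "respond", "input": "", "output": "You're welcome."},
]


def span_ok(span):
    """Span level: did a single component produce non-empty output?"""
    return bool(span["output"].strip())


def handoff_ok(upstream, downstream):
    """Multi-span level: did data pass intact from one component to the next?"""
    return downstream["input"] == upstream["output"]


def trajectory_ok(trace):
    """Trajectory level: did the end-to-end process complete cleanly?"""
    spans_pass = all(span_ok(span) for span in trace)
    handoffs_pass = all(handoff_ok(a, b) for a, b in zip(trace, trace[1:]))
    return spans_pass and handoffs_pass and trace[-1]["name"] == "respond"


def session_score(session):
    """Session level: fraction of trajectories in the conversation that completed."""
    return sum(trajectory_ok(trace) for trace in session) / len(session)


print(session_score([trace_1, trace_2]))
```

Note that every span in `trace_2` passes the span-level check on its own; the failure only appears at the multi-span and trajectory levels, which is the reason for evaluating at more than one depth.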
Role-based collaboration model
Technical users should handle framework coding while domain experts and product managers define evaluation criteria through no-code interfaces, allowing each group to work within their expertise.
🤖 Experimentation & Automation
Systematic improvement loops
Experimentation means testing changes to prompts, models, and orchestration configurations against curated datasets. The goal is to catch regressions, because in non-deterministic systems fixing one issue often creates unexpected failures elsewhere.
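A minimal sketch of such a regression check, assuming an exact-match evaluator and two stand-in "systems" implemented as lookup tables. In practice `baseline` and `candidate` would wrap real prompt or model variants; all names and data below are hypothetical.

```python
# Curated dataset: each example has a stable id, an input, and an expected answer.
dataset = [
    {"id": "q1", "input": "2 + 2", "expected": "4"},
    {"id": "q2", "input": "capital of France", "expected": "Paris"},
    {"id": "q3", "input": "opposite of hot", "expected": "cold"},
]


def baseline(text):
    """Stand-in for the current prompt/model configuration."""
    return {"2 + 2": "4", "capital of France": "Lyon", "opposite of hot": "cold"}[text]


def candidate(text):
    """Stand-in for the changed configuration under test."""
    return {"2 + 2": "4", "capital of France": "Paris", "opposite of hot": "warm"}[text]


def run_experiment(system, dataset):
    """Run every example through the system and record pass/fail per example id."""
    return {ex["id"]: system(ex["input"]) == ex["expected"] for ex in dataset}


def compare(before, after):
    """List examples the change fixed and examples it broke."""
    fixed = sorted(k for k in before if not before[k] and after[k])
    regressed = sorted(k for k in before if before[k] and not after[k])
    return {"fixed": fixed, "regressed": regressed}


report = compare(run_experiment(baseline, dataset), run_experiment(candidate, dataset))
print(report)
```

Both configurations score two out of three here, so an aggregate accuracy number would hide the change entirely. Comparing per example shows that the candidate fixed `q2` and broke `q3`.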
Code-native workflows
The industry is moving away from dashboards toward CLI tools and coding agents (like Arize's 'Alex') that programmatically analyze traces, detect issues such as high latency or errors, and run evaluations directly in developers' existing environments.
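The kind of check such a tool or agent might run can be sketched as a plain function over exported span records. This is not Arize's CLI or agent; the record fields, the `find_issues` helper, and the 2000 ms threshold are assumptions chosen for illustration.

```python
# Hypothetical exported span records.
spans = [
    {"trace_id": "t1", "name": "retrieve", "latency_ms": 180, "status": "ok"},
    {"trace_id": "t1", "name": "llm_call", "latency_ms": 4200, "status": "ok"},
    {"trace_id": "t2", "name": "tool_call", "latency_ms": 90, "status": "error"},
]


def find_issues(spans, latency_threshold_ms=2000):
    """Flag spans that errored or exceeded the latency threshold."""
    issues = []
    for span in spans:
        if span.get("status") == "error":
            issues.append((span["trace_id"], span["name"], "error status"))
        if span["latency_ms"] > latency_threshold_ms:
            issues.append(
                (span["trace_id"], span["name"], f'high latency: {span["latency_ms"]} ms')
            )
    return issues


issues = find_issues(spans)
for trace_id, name, problem in issues:
    print(f"{trace_id} {name}: {problem}")
```

Because the output is structured data rather than a chart, a script, CI job, or coding agent can act on it directly, for instance by pulling the flagged traces into an evaluation run.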
The fully automated flywheel
The ultimate goal is complete automation where AI agents handle observability, generate appropriate evaluations dynamically based on context, and fix systems without human intervention, compressing the entire improvement cycle into an autonomous process.
Bottom Line
AI engineering should move toward fully automated observability and evaluation flywheels where AI agents handle detection, diagnosis, and fixing, allowing developers to focus on building rather than manual monitoring.