Shipping complex AI applications — Braintrust & Trainline
TL;DR
This workshop demonstrates how to bridge the gap between AI prototypes and production systems using Braintrust's observability platform, featuring Trainline's experience deploying multi-agent AI applications serving 27 million users.
🔧 The Production Gap in AI Systems
POCs fail to reach production
Organizations successfully build local demos but struggle to industrialize them due to the non-deterministic nature of LLM systems compared to traditional deterministic software engineering.
Prompt patching isn't sustainable
Developers often fix production issues by tweaking prompts locally without systematic tracking, creating a cycle where failures repeat until proper operational workflows are established.
Observability beats logging
While logs record what happened in a system, true observability enables deep behavioral analysis to understand why AI systems produce specific outputs or failures.
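To make the distinction concrete, here is a minimal sketch of structured tracing in plain Python. The `Tracer` class, `fake_llm`, and `handle_request` are hypothetical names invented for illustration and are not any vendor's SDK. The idea is that each step records its inputs, output, error, and parent step, so a failure can be traced back through the call tree instead of being pieced together from flat log lines.

```python
import time
import uuid
from contextlib import contextmanager


class Tracer:
    """Hypothetical in-memory tracer, for illustration only."""

    def __init__(self):
        self.spans = []    # finished spans, in completion order
        self._stack = []   # currently open spans, innermost last

    @contextmanager
    def span(self, name, **inputs):
        record = {
            "id": uuid.uuid4().hex,
            "parent_id": self._stack[-1]["id"] if self._stack else None,
            "name": name,
            "inputs": inputs,
            "output": None,
            "error": None,
            "start": time.time(),
        }
        self._stack.append(record)
        try:
            yield record
        except Exception as exc:
            record["error"] = repr(exc)  # keep the failure attached to its step
            raise
        finally:
            record["end"] = time.time()
            self._stack.pop()
            self.spans.append(record)


def fake_llm(prompt):
    # Stand-in for a model call (assumption: a real system would call an LLM here).
    return "REFUND" if "refund" in prompt.lower() else "OTHER"


def handle_request(tracer, user_message):
    with tracer.span("handle_request", user_message=user_message) as root:
        with tracer.span("classify_intent", prompt=user_message) as step:
            step["output"] = fake_llm(user_message)
        root["output"] = f"routed to {step['output']}"
        return root["output"]


tracer = Tracer()
result = handle_request(tracer, "I want a refund")
```

After the call, `tracer.spans` holds one record per step, and each child points at its parent through `parent_id`, which is what lets someone ask why a given output was produced rather than only that it was.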
🧠 Braintrust's Platform & Methodology
Series B momentum and valuation
Braintrust, founded by Ankur Goyal after his previous company Impira was acquired by Figma, has raised $80 million at an $800 million valuation to build AI observability infrastructure.
Brainstore database architecture
The platform runs on Brainstore, a purpose-built database designed for the semi-structured AI trace data that traditional analytical systems struggle to handle at scale.
The evaluation flywheel approach
Teams should implement continuous cycles of instrumentation, failure mode identification through golden datasets, systematic remediation, and production monitoring to incrementally improve quality without targeting perfect 100% coverage.
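One turn of this flywheel can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `GOLDEN` cases, the keyword-based `classify` stand-in for the system under test, and the helper names `run_eval` and `summarize`. None of it is Braintrust's API. The point is that a golden dataset gives a score plus a tally of failure modes, which tells the team what to remediate next.

```python
from collections import Counter

# Hypothetical golden dataset: inputs paired with the expected intent.
GOLDEN = [
    {"input": "I want my money back for yesterday's ticket", "expected": "refund"},
    {"input": "Can I change my 9am train to a later one?", "expected": "change_ticket"},
    {"input": "My train is delayed, what are my options?", "expected": "disruption"},
    {"input": "Please move my booking to Friday", "expected": "change_ticket"},
]


def classify(message):
    # Stand-in for the AI system under test (a real one would call a model).
    text = message.lower()
    if "refund" in text or "money back" in text:
        return "refund"
    if "change" in text:
        return "change_ticket"
    if "delay" in text or "cancelled" in text:
        return "disruption"
    return "other"


def run_eval(task, dataset):
    # Run every golden case through the task and record pass/fail.
    results = []
    for case in dataset:
        actual = task(case["input"])
        results.append({**case, "actual": actual, "passed": actual == case["expected"]})
    return results


def summarize(results):
    # Overall score plus a count of each "expected -> actual" failure mode.
    failures = Counter(
        f"{r['expected']} -> {r['actual']}" for r in results if not r["passed"]
    )
    score = sum(r["passed"] for r in results) / len(results)
    return score, failures


score, failures = summarize(run_eval(classify, GOLDEN))
print(score, dict(failures))
```

Here the stand-in misses the "move my booking" phrasing, so the tally points at one concrete failure mode to fix before the next run, which is the incremental improvement the flywheel aims for.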
🚄 Trainline's Multi-Agent Implementation
Massive operational scale
Trainline sells 6.3 billion tickets annually across European rail networks to 27 million active users, requiring AI systems that maintain reliability under significant production load.
Agentic travel assistant capabilities
Their system extends beyond chatbots to autonomously handle complex actions including refunds, ticket changes, and disruption management through multi-agent workflows without requiring human handover.
Hybrid ML architecture
Trainline combines classical machine learning models for predicting train disruptions with modern LLM-based agentic systems to address diverse customer service needs.
Real-world signal evaluation
Production AI deployment requires capturing and evaluating actual user interactions rather than relying solely on synthetic test datasets to identify true edge cases and failure modes.
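A rough sketch of how production signals might feed an evaluation dataset is shown below. The trace fields (`input`, `output`, `error`, `user_feedback`), the sample data, and the `select_for_review` helper are all hypothetical. The talk summary does not describe Trainline's actual pipeline. The sketch picks out interactions that errored or drew negative feedback, skips ones already covered, and queues the rest for human labeling.

```python
def normalize(text):
    # Collapse case and whitespace so near-identical inputs deduplicate.
    return " ".join(text.lower().split())


def select_for_review(traces, existing_inputs):
    """Return flagged production interactions not yet in the eval dataset."""
    seen = {normalize(x) for x in existing_inputs}
    picked = []
    for trace in traces:
        key = normalize(trace["input"])
        flagged = (
            trace.get("error") is not None
            or trace.get("user_feedback") == "thumbs_down"
        )
        if flagged and key not in seen:
            seen.add(key)
            picked.append({
                "input": trace["input"],
                "observed_output": trace.get("output"),
                "expected": None,  # to be filled in by a human reviewer
            })
    return picked


# Hypothetical sample of production traces.
PRODUCTION_TRACES = [
    {"input": "Refund my ticket", "output": "Sorry, I can't help.", "user_feedback": "thumbs_down"},
    {"input": "refund  my ticket ", "output": "Sorry, I can't help.", "user_feedback": "thumbs_down"},
    {"input": "Change seat", "output": "Done.", "user_feedback": "thumbs_up"},
    {"input": "Is my train late?", "output": None, "error": "TimeoutError"},
    {"input": "Where is platform 4?", "output": "Unknown.", "user_feedback": "thumbs_down"},
]
EXISTING_INPUTS = ["where is platform 4?"]

candidates = select_for_review(PRODUCTION_TRACES, EXISTING_INPUTS)
```

Leaving `expected` empty is deliberate in this sketch: the production trace shows what the system did, not what it should have done, so a reviewer supplies the correct answer before the case joins the golden dataset.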
Bottom Line
Implement a continuous evaluation flywheel—instrument your AI system, identify failure modes with golden datasets, remediate systematically, and monitor production traces—to ship reliable agentic applications at scale.