Shipping complex AI applications — Braintrust & Trainline
TL;DR
This workshop demonstrates how to bridge the gap between AI prototypes and production systems using Braintrust's observability platform, featuring Trainline's experience deploying multi-agent AI applications to 27 million users.
🔧 The Production Gap in AI Systems 3 insights
POCs fail to reach production
Organizations successfully build local demos but struggle to industrialize them due to the non-deterministic nature of LLM systems compared to traditional deterministic software engineering.
Prompt patching isn't sustainable
Developers often fix production issues by tweaking prompts locally without systematic tracking, creating a cycle where failures repeat until proper operational workflows are established.
Observability beats logging
While logs record what happened in a system, true observability enables deep behavioral analysis to understand why AI systems produce specific outputs or failures.
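The distinction above can be made concrete with a minimal sketch: instead of emitting free-text log lines, each model call is wrapped so that its inputs, outputs, errors, and latency are captured as a structured span that can later be queried and scored. The decorator, the in-memory `TRACES` store, and the `answer` function are all hypothetical illustrations, not Braintrust's actual API.

```python
import functools
import time
import uuid

TRACES = []  # in-memory span store; a real system would ship these to an observability platform


def traced(fn):
    """Record inputs, outputs, errors, and latency for each call as a structured span."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        span = {
            "id": str(uuid.uuid4()),
            "name": fn.__name__,
            "input": {"args": args, "kwargs": kwargs},
        }
        start = time.perf_counter()
        try:
            span["output"] = fn(*args, **kwargs)
            return span["output"]
        except Exception as exc:
            span["error"] = repr(exc)
            raise
        finally:
            span["latency_ms"] = (time.perf_counter() - start) * 1000
            TRACES.append(span)
    return wrapper


@traced
def answer(question: str) -> str:
    # stand-in for an LLM call
    return f"Echo: {question}"


answer("When is my train?")
```

Because each span pairs the exact input with the exact output, you can ask "why did the system respond this way?" rather than only "what happened?", which is the difference the insight above draws between observability and logging.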
🧠 Braintrust's Platform & Methodology 3 insights
Series B momentum and valuation
Braintrust has raised $80 million at an $800 million valuation to build AI observability infrastructure; it was founded by Ankur Goyal after his previous company, Impira, was acquired by Figma.
Brainstore database architecture
The platform is built on Brainstore, a proprietary database designed specifically for semi-structured AI trace data that traditional analytical systems cannot handle at scale.
The evaluation flywheel approach
Teams should implement continuous cycles of instrumentation, failure mode identification through golden datasets, systematic remediation, and production monitoring to incrementally improve quality without targeting perfect 100% coverage.
🚄 Trainline's Multi-Agent Implementation 4 insights
Massive operational scale
Trainline handles £6.3 billion in annual ticket sales across European rail networks for 27 million active users, requiring AI systems that stay reliable under significant production load.
Agentic travel assistant capabilities
Their system extends beyond chatbots to autonomously handle complex actions including refunds, ticket changes, and disruption management through multi-agent workflows without requiring human handover.
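A minimal sketch of the dispatch pattern such a system implies: a planner (in production, an LLM) classifies the request and routes it to a specialist agent that completes the action end to end. The handler names, the keyword routing, and the returned strings are all hypothetical, not Trainline's actual implementation.

```python
from typing import Callable, Dict


def handle_refund(message: str) -> str:
    return "refund initiated"


def handle_change(message: str) -> str:
    return "ticket change options sent"


def handle_disruption(message: str) -> str:
    return "rebooking on next available service"


# Registry of specialist agents; a planner selects one per request.
AGENTS: Dict[str, Callable[[str], str]] = {
    "refund": handle_refund,
    "change": handle_change,
    "disruption": handle_disruption,
}


def route(message: str) -> str:
    """Stand-in planner: keyword rules in place of an LLM classifier."""
    msg = message.lower()
    if "refund" in msg:
        intent = "refund"
    elif "delayed" in msg or "cancelled" in msg:
        intent = "disruption"
    else:
        intent = "change"
    return AGENTS[intent](message)
```

The key property is that the selected agent carries the action to completion, which is what distinguishes this from a chatbot that merely answers and hands off to a human.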
Hybrid ML architecture
Trainline combines classical machine learning models for predicting train disruptions with modern LLM-based agentic systems to address diverse customer service needs.
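One way such a hybrid can be wired together, as a hedged sketch: a classical model scores disruption risk, and that prediction is injected into the LLM agent's context so the agent can act on it. The function names, service IDs, threshold, and probabilities are hypothetical illustrations.

```python
def disruption_probability(service_id: str) -> float:
    """Stand-in for a classical ML model (e.g., gradient-boosted trees)
    trained on historical delay data. Hypothetical values."""
    return 0.82 if service_id == "LDN-MAN-0902" else 0.05


def build_agent_context(service_id: str) -> dict:
    """Feed the classical model's prediction into the LLM agent's context
    so it can proactively offer rebooking."""
    risk = disruption_probability(service_id)
    return {
        "service": service_id,
        "disruption_risk": risk,
        "suggest_rebooking": risk > 0.5,  # assumed threshold for illustration
    }
```

The design choice here is that each component does what it is best at: the classical model handles a well-defined prediction task, while the agentic layer handles the open-ended conversation and actions.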
Real-world signal evaluation
Production AI deployment requires capturing and evaluating actual user interactions rather than relying solely on synthetic test datasets to identify true edge cases and failure modes.
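Closing the loop from production back to evaluation can be sketched as harvesting flagged interactions into the golden dataset. The trace shape, the `user_feedback` field, and `corrected_output` are assumptions for illustration, not a real schema.

```python
# Hypothetical shape of production traces with user feedback attached.
production_traces = [
    {"input": "Refund my split ticket", "output": "Sorry, I can't help.",
     "user_feedback": "thumbs_down", "corrected_output": "refund initiated"},
    {"input": "When is the next train to York?", "output": "14:02",
     "user_feedback": "thumbs_up"},
]


def harvest_failures(traces, golden):
    """Promote interactions users flagged as bad into the golden dataset,
    so the next evaluation run covers this real-world failure mode."""
    for trace in traces:
        if trace.get("user_feedback") == "thumbs_down":
            golden.append({"input": trace["input"],
                           "expected": trace.get("corrected_output")})
    return golden


golden = harvest_failures(production_traces, [])
```

This is how real user signal, rather than synthetic test data, ends up driving the evaluation flywheel: every flagged production interaction becomes a permanent regression case.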
Bottom Line
Implement a continuous evaluation flywheel—instrument your AI system, identify failure modes with golden datasets, remediate systematically, and monitor production traces—to ship reliable agentic applications at scale.