Playground in Prod - Optimising Agents in Production Environments — Samuel Colvin, Pydantic

Podcasts | May 07, 2026 | 5.8 thousand views

TL;DR

Samuel Colvin demonstrates optimizing AI agent prompts in production using GEPA, a genetic algorithm library that breeds high-performing prompt variations. He pairs it with Logfire's managed variables for structured configuration and with deterministic evaluation against golden datasets.

🧬 GEPA: Genetic Prompt Optimization 2 insights

Pareto Frontier Breeding Strategy

GEPA optimizes prompts with a genetic algorithm that breeds new candidates from the Pareto frontier of the best performers. Colvin likens this to selectively breeding racehorses, as opposed to relying on random mutation.
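The frontier selection step can be illustrated with a minimal sketch. It assumes each candidate prompt has one score per evaluation case and that higher is better; the function names and the scores are hypothetical and are not taken from the library or the talk.

```python
from typing import Dict, List


def dominates(a: List[float], b: List[float]) -> bool:
    """True if a is at least as good as b on every case and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_frontier(scores: Dict[str, List[float]]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, own in scores.items()
        if not any(
            dominates(other, own)
            for other_name, other in scores.items()
            if other_name != name
        )
    ]


# Hypothetical per-case scores for three prompt candidates on three eval cases.
scores = {
    "prompt_a": [1.0, 0.0, 1.0],
    "prompt_b": [0.0, 1.0, 1.0],
    "prompt_c": [0.0, 0.0, 1.0],
}
frontier = pareto_frontier(scores)
print(frontier)  # ['prompt_a', 'prompt_b']
```

Here prompt_a and prompt_b each win on a case the other loses, so both stay on the frontier as breeding parents, while prompt_c is dominated by prompt_a and is dropped.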

String Optimization Beyond Text

The library optimizes any string value, whether simple text prompts or JSON data containing complex structured configurations.
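One way to picture this, as a sketch only: a structured configuration is serialized to a JSON string, the optimizer treats that string as the candidate, and the result is parsed back before use. The `AgentConfig` fields and helper names below are hypothetical and use only the standard library.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class AgentConfig:
    # Hypothetical fields, chosen for illustration only.
    instructions: str
    temperature: float
    max_retries: int


def to_candidate(config: AgentConfig) -> str:
    """Serialize the config into a string that a string optimizer can rewrite."""
    return json.dumps(asdict(config), sort_keys=True)


def from_candidate(candidate: str) -> AgentConfig:
    """Parse a candidate string back into a config.

    Raises json.JSONDecodeError on invalid JSON and TypeError on
    missing or unexpected keys. Field types are not checked here.
    """
    return AgentConfig(**json.loads(candidate))


original = AgentConfig(instructions="Be concise.", temperature=0.2, max_retries=2)
candidate = to_candidate(original)
restored = from_candidate(candidate)
```

Because a mutated candidate may no longer parse, the parsing step doubles as a cheap validity check before a candidate is evaluated.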

⚙️ Managed Variables and Production Infrastructure 3 insights

Structured Configuration Management

Logfire's managed variables support any object definable by a Pydantic model, extending beyond simple text prompts to enable management of complex structured agent parameters.
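As a rough illustration of what "any object definable by a Pydantic model" allows, the sketch below validates a structured value with Pydantic v2. The `ExtractionSettings` model and its fields are assumptions made for this example, and no Logfire API is shown; only the validation side is sketched.

```python
from pydantic import BaseModel, Field, ValidationError


class ExtractionSettings(BaseModel):
    """Hypothetical agent parameters; not a schema from the talk."""

    instructions: str
    temperature: float = Field(default=0.0, ge=0.0, le=2.0)
    excluded_relations: list[str] = Field(default_factory=list)


# A stored value arrives as JSON and is validated into a typed object.
raw = (
    '{"instructions": "List ancestors who held political office.",'
    ' "excluded_relations": ["spouse", "sibling"]}'
)
settings = ExtractionSettings.model_validate_json(raw)

# An out-of-range value is rejected instead of reaching the agent.
try:
    ExtractionSettings.model_validate_json('{"instructions": "x", "temperature": 9}')
    rejected = False
except ValidationError:
    rejected = True
```

The benefit over a bare text prompt is that every field carries a type and constraints, so a bad edit fails at validation time.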

AI Observability as Feature Not Category

Colvin argues that AI observability will eventually be absorbed by general observability platforms or AI frameworks, serving as a feature rather than a standalone category.

Autonomous Optimization Pipeline

The platform is evolving toward autonomous agent optimization where variables are tuned directly from the observability interface without manual intervention.

📊 Deterministic Evaluation Methodology 3 insights

Golden Dataset Over LLM Judges

Deterministic evaluators comparing outputs against verified golden datasets provide more reliable benchmarks than LLM-as-judge approaches, which Colvin describes as 'lunatics running the asylum'.
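A deterministic evaluator of this kind can be sketched in a few lines. The version below assumes each case's output is a set of names and scores it with set-based F1 against the golden answer; the metric, function names, and placeholder data are assumptions, not details from the talk.

```python
from typing import Dict, Set


def f1_score(predicted: Set[str], expected: Set[str]) -> float:
    """Set-based F1: 1.0 for an exact match, 0.0 for no overlap."""
    if not predicted and not expected:
        return 1.0  # correctly predicting "nothing" counts as a match
    true_positives = len(predicted & expected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


def evaluate(outputs: Dict[str, Set[str]], golden: Dict[str, Set[str]]) -> float:
    """Mean F1 over every golden case; a missing output scores as an empty set."""
    if not golden:
        raise ValueError("golden dataset is empty")
    scores = [
        f1_score(outputs.get(case, set()), expected)
        for case, expected in golden.items()
    ]
    return sum(scores) / len(scores)


# Placeholder case ids and labels, not data from the talk.
golden = {"case_1": {"relative_a", "relative_b"}, "case_2": set()}
outputs = {"case_1": {"relative_a"}, "case_2": set()}
mean_f1 = evaluate(outputs, golden)
```

The same inputs always produce the same score, which is what makes the number usable as a benchmark across optimization runs.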

Political Dynasty Extraction Demo

The demonstration uses Pydantic AI with structured outputs to analyze Wikipedia data for UK MPs, specifically optimizing prompts to identify ancestral political relationships while filtering out spouses and siblings.
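The target of the extraction can be pictured as a small typed schema. The sketch below is an assumption about what such a schema might look like, written with the standard library rather than the Pydantic AI API. In the demo the filtering is achieved through the prompt; the filter function here only illustrates which results count as correct.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Relation(str, Enum):
    # Hypothetical relation labels for illustration.
    PARENT = "parent"
    GRANDPARENT = "grandparent"
    SPOUSE = "spouse"
    SIBLING = "sibling"


# Only ancestral relations count toward a political dynasty in this sketch.
ANCESTRAL = {Relation.PARENT, Relation.GRANDPARENT}


@dataclass(frozen=True)
class PoliticalRelative:
    name: str
    relation: Relation


def ancestors_only(relatives: List[PoliticalRelative]) -> List[PoliticalRelative]:
    """Keep ancestral relatives and drop spouses and siblings."""
    return [r for r in relatives if r.relation in ANCESTRAL]


# Placeholder names, not real people or data from the talk.
extracted = [
    PoliticalRelative("Person A", Relation.PARENT),
    PoliticalRelative("Person B", Relation.SPOUSE),
    PoliticalRelative("Person C", Relation.SIBLING),
]
kept = ancestors_only(extracted)
```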

Pydantic Gateway for Model Access

The gateway service provides unified API access to multiple model providers with built-in observability, caching, and fallback capabilities for production environments.
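The fallback and caching behaviour can be sketched generically. The code below is not the gateway's API: `FallbackClient` and the two stand-in providers are hypothetical, and a real deployment would call model providers over the network.

```python
from typing import Callable, Dict, List, Tuple

Provider = Callable[[str], str]


class FallbackClient:
    """Try providers in order, caching the first successful response per prompt."""

    def __init__(self, providers: List[Tuple[str, Provider]]) -> None:
        self.providers = providers
        self.cache: Dict[str, str] = {}

    def complete(self, prompt: str) -> str:
        if prompt in self.cache:
            return self.cache[prompt]  # cache hit: no provider is called
        errors = []
        for name, call in self.providers:
            try:
                result = call(prompt)
            except Exception as exc:  # any provider failure triggers fallback
                errors.append(f"{name}: {exc}")
                continue
            self.cache[prompt] = result
            return result
        raise RuntimeError("all providers failed: " + "; ".join(errors))


def flaky(prompt: str) -> str:
    """Stand-in provider that always fails."""
    raise TimeoutError("simulated outage")


def stable(prompt: str) -> str:
    """Stand-in provider that always succeeds."""
    return f"echo: {prompt}"


client = FallbackClient([("primary", flaky), ("backup", stable)])
answer = client.complete("hello")
print(answer)  # echo: hello
```

Caching on the exact prompt string is the simplest possible key; a production cache would likely also key on the model and its parameters.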

Bottom Line

Use deterministic evaluations against golden datasets rather than LLM-as-judge for reliable agent benchmarking, and implement prompt optimization through genetic algorithms that breed high-performing variations rather than relying on random mutation.
