Playground in Prod - Optimising Agents in Production Environments — Samuel Colvin, Pydantic

Podcasts · May 07, 2026 · 2.52K views

TL;DR

Samuel Colvin demonstrates optimizing AI agent prompts in production using Jepper, a genetic algorithm library that breeds high-performing prompt variations, combined with Logfire's managed variables for structured configuration and deterministic evaluation against golden datasets.

🧬 Jepper: Genetic Prompt Optimization

Pareto Frontier Breeding Strategy

Jepper optimizes prompts using genetic algorithms that selectively breed candidates from the Pareto frontier of best performers, similar to selective racehorse breeding rather than random mutation.
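The talk doesn't show the library's internals, but the Pareto-frontier selection idea can be sketched in plain Python: keep only candidates that no other candidate beats on every objective, then breed children from that frontier. The candidates, scores, and crossover operator below are purely illustrative.

```python
import random

def pareto_frontier(candidates, scores):
    """Return the candidates not dominated on any objective.

    scores[i] is a tuple of per-objective scores (higher is better)
    for candidates[i]. A candidate is dominated if some other candidate
    is at least as good on every objective and strictly better on one.
    """
    frontier = []
    for i, cand in enumerate(candidates):
        dominated = any(
            all(s >= t for s, t in zip(scores[j], scores[i]))
            and any(s > t for s, t in zip(scores[j], scores[i]))
            for j in range(len(candidates))
            if j != i
        )
        if not dominated:
            frontier.append(cand)
    return frontier

def breed(parent_a, parent_b, rng):
    """Toy crossover: splice the two parent prompts at random word cuts."""
    a, b = parent_a.split(), parent_b.split()
    cut_a, cut_b = rng.randrange(len(a) + 1), rng.randrange(len(b) + 1)
    return " ".join(a[:cut_a] + b[cut_b:])

candidates = [
    "Extract political ancestors.",
    "List every relative of the MP.",
    "Extract only parent-child political links.",
]
# (accuracy, brevity) scores per candidate, purely illustrative
scores = [(0.6, 0.9), (0.4, 0.5), (0.8, 0.7)]

frontier = pareto_frontier(candidates, scores)
child = breed(frontier[0], frontier[-1], random.Random(0))
```

Here the second candidate is dominated (worse on both objectives than the first), so only the other two survive to become breeding parents, which is the "selective racehorse breeding" intuition.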

String Optimization Beyond Text

The library optimizes any string value, whether simple text prompts or JSON data containing complex structured configurations.
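Because candidates are opaque strings, structured configuration can ride along by serializing it to JSON: the optimizer edits the string, and the evaluation side parses it back. A minimal sketch (the config fields and the mutation rule are illustrative, not the library's actual operators):

```python
import json

# A structured agent configuration, serialized to a single string so a
# string-based optimizer can treat it like any other prompt candidate.
config = {"system_prompt": "Extract political ancestors.", "temperature": 0.2}
candidate = json.dumps(config)

def mutate(candidate_str):
    """Illustrative mutation: parse, tweak one field, re-serialize.
    A real optimizer would propose the edit via an LLM or GA operator."""
    cfg = json.loads(candidate_str)
    cfg["temperature"] = round(cfg["temperature"] + 0.1, 2)
    return json.dumps(cfg)

mutated = mutate(candidate)
```

The round-trip through `json.loads` also acts as a cheap validity check: a mutation that breaks the JSON structure fails immediately rather than reaching the agent.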

⚙️ Managed Variables and Production Infrastructure

Structured Configuration Management

Logfire's managed variables support any object definable by a Pydantic model, extending beyond simple text prompts to enable management of complex structured agent parameters.
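The summary doesn't show Logfire's exact API, but the idea can be sketched: define the managed value as a Pydantic model so every update is schema-validated before an agent consumes it. The `AgentSettings` name and its fields are assumptions for illustration.

```python
from pydantic import BaseModel, Field

class AgentSettings(BaseModel):
    """Illustrative schema for a managed variable: richer than a bare
    prompt string, but still one versionable, validated value."""
    system_prompt: str
    temperature: float = Field(ge=0.0, le=2.0)
    max_retries: int = 3

# Any update pulled from the configuration service is validated
# against the schema before the agent uses it.
raw = {"system_prompt": "Extract political ancestors.", "temperature": 0.2}
settings = AgentSettings.model_validate(raw)
```

An out-of-range update (say `temperature: 5.0`) would raise a `ValidationError` at load time instead of silently misconfiguring the agent in production.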

AI Observability as Feature Not Category

Colvin argues that AI observability will eventually be absorbed by general observability platforms or AI frameworks, serving as a feature rather than a standalone category.

Autonomous Optimization Pipeline

The platform is evolving toward autonomous agent optimization where variables are tuned directly from the observability interface without manual intervention.

📊 Deterministic Evaluation Methodology

Golden Dataset Over LLM Judges

Deterministic evaluators comparing outputs against verified golden datasets provide more reliable benchmarks than LLM-as-judge approaches, which Colvin describes as 'lunatics running the asylum'.
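A deterministic evaluator of this kind is just exact-match scoring against verified cases: the same outputs always produce the same score, with no judge model in the loop. A minimal sketch (the golden set and the stub predictor are illustrative; Tony Benn was the father of fellow MP Hilary Benn):

```python
def evaluate(predict, golden_dataset):
    """Score a model function against a golden dataset by exact match.
    No LLM judge, so identical outputs always yield identical scores."""
    hits = sum(
        1 for case in golden_dataset if predict(case["input"]) == case["expected"]
    )
    return hits / len(golden_dataset)

# Tiny illustrative golden set for the MP-ancestry task.
golden = [
    {"input": "Hilary Benn", "expected": ["Tony Benn"]},
    {"input": "Unrelated MP", "expected": []},
]

def stub_predict(name):
    # Stand-in for the real agent; returns a fixed lookup.
    return {"Hilary Benn": ["Tony Benn"]}.get(name, [])

accuracy = evaluate(stub_predict, golden)
```

Determinism is what makes the score usable as an optimization target: a genetic algorithm can trust that a higher score reflects a better prompt, not judge noise.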

Political Dynasty Extraction Demo

The demonstration uses Pydantic AI with structured outputs to analyze Wikipedia data for UK MPs, specifically optimizing prompts to identify ancestral political relationships while filtering out spouses and siblings.
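The demo's actual code isn't reproduced in the summary; a sketch of the shape of the task is a Pydantic output schema plus the spouse/sibling filter. The `Relation` model, field names, and relationship vocabulary are assumptions for illustration.

```python
from pydantic import BaseModel

class Relation(BaseModel):
    """One extracted family link between two politicians (illustrative schema,
    the kind of structured output a Pydantic AI agent would return)."""
    mp: str
    relative: str
    relationship: str  # e.g. "father", "mother", "spouse", "sibling"

ANCESTRAL = {"father", "mother", "grandfather", "grandmother"}

def ancestral_only(relations):
    """Keep ancestor links, dropping spouses and siblings as in the demo task."""
    return [r for r in relations if r.relationship in ANCESTRAL]

extracted = [
    Relation(mp="Hilary Benn", relative="Tony Benn", relationship="father"),
    Relation(mp="Some MP", relative="Spouse Name", relationship="spouse"),
]
kept = ancestral_only(extracted)
```

Pinning the output to a schema like this is what makes the task evaluable deterministically: filtered, structured relations can be compared field-by-field against a golden dataset.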

Pydantic Gateway for Model Access

The gateway service provides unified API access to multiple model providers with built-in observability, caching, and fallback capabilities for production environments.
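The gateway's fallback behavior can be sketched in a few lines: try providers in priority order and return the first success. The provider names and error type below are illustrative, not the gateway's real API.

```python
class ProviderError(Exception):
    """Illustrative stand-in for a provider-side failure (rate limit, outage)."""

def call_with_fallback(providers, prompt):
    """Try each (name, callable) provider in order and return the first
    successful response, the basic fallback pattern a gateway implements."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            errors.append((name, exc))
    raise ProviderError(f"all providers failed: {errors}")

def flaky(prompt):
    raise ProviderError("rate limited")

def healthy(prompt):
    return f"response to: {prompt}"

name, response = call_with_fallback([("primary", flaky), ("backup", healthy)], "hello")
```

A production gateway layers caching and observability on top of this, but the control flow, try, record the failure, move to the next provider, is the core of it.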

Bottom Line

Use deterministic evaluations against golden datasets rather than LLM-as-judge for reliable agent benchmarking, and implement prompt optimization through genetic algorithms that breed high-performing variations rather than relying on random mutation.
