Skill Issue: How We Used AI to Make Agents Actually Good at Supabase — Pedro Rodrigues, Supabase

| Podcasts | May 04, 2026 | 9.83 Thousand views | 1:18:41

TL;DR

Pedro Rodrigues from Supabase details how structured 'skills'—markdown-based instruction sets with progressive disclosure—dramatically improve AI agent performance with complex products, distinguishing them from MCP tools and establishing an evaluation-driven development framework for systematic testing.

📁 Skills Architecture & Progressive Disclosure 3 insights

Skills function as structured knowledge books

A skill consists of a skill.md file containing front matter (name and description) that acts as an 'index on steroids,' plus optional reference files and scripts stored in a references folder.
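As a rough illustration of that layout, the sketch below splits a skill file into its front matter fields and its body. The `---` delimiters, the `key: value` syntax, and the example skill text are assumptions made for this example; the summary only states that the front matter carries a name and a description.

```python
from typing import Dict, Tuple


def split_front_matter(text: str) -> Tuple[Dict[str, str], str]:
    """Split a skill file into (front matter fields, body).

    Assumes the front matter sits between two '---' lines and uses
    simple 'key: value' pairs. Both are assumptions for illustration.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    fields: Dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Everything after the closing delimiter is the body.
            return fields, "\n".join(lines[i + 1:])
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip()
    # No closing delimiter found: treat the whole text as body.
    return {}, text


# Hypothetical skill file, invented for this example.
EXAMPLE_SKILL = """---
name: example-skill
description: Hypothetical skill used only to illustrate the layout.
---
# Instructions
See references/views.md for details.
"""

meta, body = split_front_matter(EXAMPLE_SKILL)
```

Here `meta` holds the short index an agent would see first, and `body` holds the longer instructions.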

Progressive disclosure minimizes context bloat

Unlike loading all documentation immediately, skills use progressive disclosure where only the skill.md front matter loads initially, and the agent fetches additional reference files only when specifically needed.
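A minimal sketch of progressive disclosure, assuming a skill folder on disk: only the front matter is read when the skill is registered, and reference files are read when the agent asks for them. The `LazySkill` class, the file names, and the `---` delimiters are hypothetical and do not come from the talk.

```python
import tempfile
from pathlib import Path
from typing import List


class LazySkill:
    """Loads only the front matter up front; everything else on demand."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.loaded: List[str] = []  # reference files fetched so far
        self.index = self._read_front_matter()

    def _read_front_matter(self) -> str:
        lines = (self.root / "skill.md").read_text().splitlines()
        # Assumption: front matter sits between the first two '---' lines.
        end = lines.index("---", 1)
        return "\n".join(lines[1:end])

    def fetch(self, relative_path: str) -> str:
        """Read one reference file only when it is actually needed."""
        self.loaded.append(relative_path)
        return (self.root / relative_path).read_text()


# Build a throwaway skill folder so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "references").mkdir()
    (root / "skill.md").write_text(
        "---\nname: demo\ndescription: Hypothetical demo skill.\n---\nLong body\n"
    )
    (root / "references" / "views.md").write_text("How to create a view.\n")

    skill = LazySkill(root)                      # only the index is in context
    detail = skill.fetch("references/views.md")  # loaded on demand
```

The design choice is that the cost of carrying a skill in context stays close to the size of its index until a task actually needs the details.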

Reference files create navigable knowledge graphs

Skills support nested references where files can point to other files, creating graph-like structures that allow agents to traverse complex information hierarchies efficiently.
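One way to picture the graph structure is a breadth-first walk over the links between files. In this sketch the files live in a dictionary, links are assumed to be markdown-style `[text](path.md)` links, and all file names are invented for the example.

```python
import re
from collections import deque
from typing import Dict, List

# Assumption: references are written as markdown links ending in .md.
LINK = re.compile(r"\]\(([^)]+\.md)\)")


def reachable(files: Dict[str, str], start: str) -> List[str]:
    """Return files reachable from `start`, in breadth-first order."""
    seen = [start]
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for target in LINK.findall(files.get(current, "")):
            if target not in seen:  # guards against cycles between files
                seen.append(target)
                queue.append(target)
    return seen


# Hypothetical skill contents.
FILES = {
    "skill.md": "Start with [schema](references/schema.md) or [auth](references/auth.md).",
    "references/schema.md": "Views are covered in [views](references/views.md).",
    "references/auth.md": "Back to [schema](references/schema.md).",
    "references/views.md": "Leaf file with no further links.",
}

order = reachable(FILES, "skill.md")
```

An agent would normally follow only the links relevant to its task rather than visiting every file; the full walk here just shows that the structure is a graph, including a back-link that must not cause a loop.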

⚖️ Skills vs MCP Tools 2 insights

Complementary roles for integrations and context

MCP tools handle external integrations and remote server-side execution, while skills provide progressive context disclosure and detailed workflows that exceed tool description character limits.

Local execution environment requirements

Unlike MCP tools, which run remotely, skill scripts execute locally in the user's environment, so they must work on the user's operating system (Linux, macOS, or Windows) and need bash access.
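Because the script runs on the user's machine, a skill may need to choose a variant per platform. The following is only a sketch of such a check; the script paths and the PowerShell fallback are hypothetical and not something the talk describes.

```python
import platform
import shutil
from typing import Optional


def pick_script(system: str, has_bash: bool) -> Optional[str]:
    """Choose a script variant for the local environment.

    `system` is the value returned by platform.system(), for example
    'Linux', 'Darwin' (macOS), or 'Windows'. Paths are hypothetical.
    """
    if has_bash and system in ("Linux", "Darwin", "Windows"):
        return "scripts/check.sh"
    if system == "Windows":
        # Assumed fallback when bash is not installed on Windows.
        return "scripts/check.ps1"
    return None  # no usable shell: the agent should not try to run scripts


# Inspect the machine this code is running on.
current = pick_script(platform.system(), shutil.which("bash") is not None)
```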

🧪 Evaluation-Driven Development 3 insights

Nondeterministic testing via evaluations

Traditional exact-match unit tests are a poor fit for LLMs because the same prompt can produce differently worded output on each run; evaluations (evals) instead assess agent reasoning steps, tool usage patterns, and behavioral outcomes.
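A minimal sketch of grading behavior rather than exact text, assuming each run is recorded as a list of tool calls plus a final answer. The `Trace` type, the tool names, and the three checks are hypothetical and chosen only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Trace:
    tool_calls: List[str]  # tools the agent invoked, in order
    final_answer: str      # what the agent reported at the end


def is_subsequence(needed: Iterable[str], calls: Iterable[str]) -> bool:
    """True if `needed` appears in `calls` in order (gaps allowed)."""
    remaining = iter(calls)
    return all(step in remaining for step in needed)


def grade(trace: Trace, required_order: List[str],
          forbidden: List[str], must_mention: str) -> Dict[str, bool]:
    return {
        "used_required_tools_in_order": is_subsequence(required_order, trace.tool_calls),
        "avoided_forbidden_tools": not any(t in trace.tool_calls for t in forbidden),
        "outcome_mentions_fix": must_mention.lower() in trace.final_answer.lower(),
    }


REQUIRED = ["list_tables", "execute_sql"]  # hypothetical tool names
FORBIDDEN = ["drop_table"]

# Two acceptable runs with different wording, and one bad run.
run_a = Trace(["list_tables", "read_reference", "execute_sql"],
              "Created a view to fix the report.")
run_b = Trace(["list_tables", "execute_sql"],
              "The report works now; I added a VIEW over the reviews table.")
run_c = Trace(["execute_sql", "drop_table"], "Done.")

results = [grade(t, REQUIRED, FORBIDDEN, "view") for t in (run_a, run_b, run_c)]
```

An exact string comparison would reject either `run_a` or `run_b`, while these behavioral checks accept both and still reject `run_c`.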

Iterative eval-driven development cycle

The framework involves defining success metrics, creating the skill, running controlled test scenarios with inputs/expected outputs, grading agent behavior, and iterating—similar to TDD but accounting for LLM variability.
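The cycle can be sketched as a loop that runs each scenario several times and compares the pass rate against a target, instead of expecting every single run to pass. The stub agent, the scenario, the number of trials, and the 0.9 threshold are all assumptions for illustration; a real setup would call an actual agent and use richer graders.

```python
from itertools import cycle
from typing import Callable, List, Tuple

# A case pairs an input prompt with a check on the agent's output.
Case = Tuple[str, Callable[[str], bool]]


def pass_rate(agent: Callable[[str], str], cases: List[Case], trials: int = 4) -> float:
    """Run every case `trials` times and return the fraction that passed."""
    passed = 0
    total = 0
    for prompt, check in cases:
        for _ in range(trials):
            total += 1
            if check(agent(prompt)):
                passed += 1
    return passed / total if total else 0.0


def make_flaky_agent() -> Callable[[str], str]:
    """Stub agent that alternates between a good and a bad answer."""
    answers = cycle(["CREATE VIEW team_scores AS SELECT 1", "I am not sure"])
    return lambda prompt: next(answers)


# Hypothetical scenario: the expected behavior is that a view gets created.
CASES: List[Case] = [
    ("Fix the failing report query", lambda out: "CREATE VIEW" in out.upper()),
]

rate = pass_rate(make_flaky_agent(), CASES, trials=4)
meets_bar = rate >= 0.9  # assumed success metric; iterate on the skill if False
```

When `meets_bar` is false, the next step in the cycle is to revise the skill and run the same scenarios again.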

Observability platforms track agent behavior

Tools like Braintrust provide systematic evaluation execution with full observability into agent decision-making during controlled test scenarios.

💻 Practical Implementation at Supabase 2 insights

Prioritizing DAX over traditional DX

Supabase focuses on 'Developer Experience for Agents' (DAX) rather than just human DX, optimizing how AI agents interact with their backend-as-a-service platform and Postgres databases.

Performance review application demonstration

The workshop demonstrates building a skill to guide agents in fixing database errors within a performance review app containing four employees, specifically creating SQL views to resolve data issues.

Bottom Line

Combine MCP tools for external integrations with skills for progressive context disclosure, then systematically test using evaluation-driven development cycles that assess agent reasoning rather than deterministic outputs.
