Skill Issue: How We Used AI to Make Agents Actually Good at Supabase — Pedro Rodrigues, Supabase
TL;DR
Pedro Rodrigues from Supabase explains how structured 'skills' (markdown-based instruction sets loaded through progressive disclosure) improve AI agent performance on complex products. He distinguishes skills from MCP tools and lays out an evaluation-driven development framework for testing them systematically.
📁 Skills Architecture & Progressive Disclosure
Skills function as structured knowledge books
A skill consists of a SKILL.md file whose front matter (name and description) acts as an 'index on steroids,' plus optional reference files (kept in a references folder) and scripts.
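To make the layout concrete, here is a minimal sketch of splitting a skill file into its front matter and body. It assumes the front matter is a `---`-delimited block of simple `key: value` lines; the skill name, description, and reference path in `EXAMPLE` are hypothetical, and a real loader would likely use a full YAML parser.

```python
from dataclasses import dataclass


@dataclass
class SkillIndex:
    name: str
    description: str
    body: str


def parse_skill(text: str) -> SkillIndex:
    """Split a skill file into its front matter fields and markdown body."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("skill file must start with a '---' front matter block")
    # Find the line that closes the front matter block.
    end = next((i for i in range(1, len(lines)) if lines[i].strip() == "---"), None)
    if end is None:
        raise ValueError("front matter block is not closed")
    fields = {}
    for line in lines[1:end]:
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    missing = {"name", "description"} - fields.keys()
    if missing:
        raise ValueError(f"front matter is missing: {sorted(missing)}")
    body = "\n".join(lines[end + 1:]).strip()
    return SkillIndex(fields["name"], fields["description"], body)


# Hypothetical skill file, for illustration only.
EXAMPLE = """---
name: supabase-postgres
description: Guidance for diagnosing and fixing Postgres errors in a Supabase project.
---
# Workflow
See references/views.md when an error involves a missing view.
"""

skill = parse_skill(EXAMPLE)
print(skill.name)
print(skill.description)
```

Only `name` and `description` need to reach the agent up front; the body and any reference files can stay on disk until they are needed.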
Progressive disclosure minimizes context bloat
Rather than loading all documentation up front, a skill initially exposes only the SKILL.md front matter; the agent fetches the rest of the skill and its reference files only when a task calls for them.
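The effect on context size can be sketched with a small in-memory model. This is an illustration under stated assumptions, not how any particular agent runtime implements it: the `ProgressiveSkill` class, the file names, and the file contents are all hypothetical, and a dictionary stands in for files on disk.

```python
class ProgressiveSkill:
    """Toy model of progressive disclosure: index first, files on demand."""

    def __init__(self, name: str, description: str, files: dict[str, str]):
        self.name = name
        self.description = description
        self._files = files          # path -> content, stands in for the skill folder
        self.loaded: list[str] = []  # paths the agent has pulled into context so far

    def initial_context(self) -> str:
        # Only the front matter fields are visible before the skill is used.
        return f"{self.name}: {self.description}"

    def fetch(self, path: str) -> str:
        # Called when the agent decides it needs a specific file.
        if path not in self._files:
            raise KeyError(f"unknown skill file: {path}")
        if path not in self.loaded:
            self.loaded.append(path)
        return self._files[path]

    def context_size(self) -> int:
        # Characters currently occupying the agent's context for this skill.
        return len(self.initial_context()) + sum(len(self._files[p]) for p in self.loaded)


skill = ProgressiveSkill(
    name="supabase-postgres",
    description="Fix Postgres errors in a Supabase project.",
    files={
        "references/views.md": "How to create and replace SQL views. " * 20,
        "references/rls.md": "How to write row level security policies. " * 20,
    },
)

before = skill.context_size()
skill.fetch("references/views.md")  # the task involves a view, so load only that file
after = skill.context_size()
print(before, after, skill.loaded)
```

The policy reference never enters the context because the task did not need it, which is the saving progressive disclosure is meant to provide.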
Reference files create navigable knowledge graphs
Skills support nested references where files can point to other files, creating graph-like structures that allow agents to traverse complex information hierarchies efficiently.
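One way to picture the graph is to follow markdown links between files, breadth first, starting from the skill's entry file. The sketch below assumes references are written as ordinary markdown links with paths relative to the skill root; the file names and contents are hypothetical.

```python
import re
from collections import deque

# Matches markdown links such as [views](references/views.md).
LINK = re.compile(r"\[[^\]]*\]\(([^)\s]+\.md)\)")


def reachable(files: dict[str, str], start: str) -> list[str]:
    """Return files reachable from `start` in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        path = queue.popleft()
        order.append(path)
        for target in LINK.findall(files[path]):
            # Skip links to files that do not exist and files already visited,
            # so cycles between references cannot loop forever.
            if target in files and target not in seen:
                seen.add(target)
                queue.append(target)
    return order


FILES = {
    "SKILL.md": "See [views](references/views.md) and [rls](references/rls.md).",
    "references/views.md": "Background in [schemas](references/schemas.md).",
    "references/rls.md": "Related: [views](references/views.md).",
    "references/schemas.md": "No further links.",
    "references/unused.md": "Nothing links here.",
}

print(reachable(FILES, "SKILL.md"))
# ['SKILL.md', 'references/views.md', 'references/rls.md', 'references/schemas.md']
```

A file that nothing links to, like `references/unused.md` here, is never discovered, so a check of this kind can also flag orphaned references while a skill is being written.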
⚖️ Skills vs MCP Tools
Complementary roles for integrations and context
MCP tools handle external integrations and remote server-side execution, while skills provide progressive context disclosure and detailed workflows that exceed tool description character limits.
Local execution environment requirements
Unlike MCP tools hosted on a remote server, skill scripts execute locally in the user's environment, so they must work on the user's operating system (Linux, macOS, or Windows) and need bash access.
🧪 Evaluation-Driven Development
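Because the script runs on the user's machine, a skill may need to check what is available before choosing what to run. The following is a minimal sketch using only the Python standard library; the `EnvironmentReport` type, the `pick_script` helper, and the script paths are hypothetical.

```python
import platform
import shutil
from dataclasses import dataclass
from typing import Optional


@dataclass
class EnvironmentReport:
    os_name: str              # e.g. "Linux", "Darwin" (macOS), or "Windows"
    bash_path: Optional[str]  # absolute path to bash, or None if not on PATH

    @property
    def can_run_bash(self) -> bool:
        return self.bash_path is not None


def check_environment() -> EnvironmentReport:
    """Inspect the local machine the skill script would run on."""
    return EnvironmentReport(os_name=platform.system(), bash_path=shutil.which("bash"))


def pick_script(report: EnvironmentReport, scripts: dict[str, str]) -> Optional[str]:
    """Return the script registered for this OS, or None if nothing fits."""
    return scripts.get(report.os_name)


# Hypothetical per-OS scripts bundled with a skill.
SCRIPTS = {
    "Linux": "scripts/check_db.sh",
    "Darwin": "scripts/check_db.sh",
    "Windows": "scripts/check_db.ps1",
}

report = check_environment()
print(report.os_name, report.can_run_bash, pick_script(report, SCRIPTS))
```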
Nondeterministic testing via evaluations
Exact-match unit tests are a poor fit for LLMs because the same prompt can produce different outputs; evaluations (evals) instead assess the agent's reasoning steps, tool usage patterns, and behavioral outcomes.
Iterative eval-driven development cycle
The framework involves defining success metrics, creating the skill, running controlled test scenarios with inputs/expected outputs, grading agent behavior, and iterating—similar to TDD but accounting for LLM variability.
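The cycle can be sketched as a small harness that runs an agent several times per scenario and grades behavior rather than exact text. Everything here is an assumption made for illustration: the `Trace` and `EvalCase` types, the tool names, and `stub_agent`, which stands in for a real agent and simulates variability with a seeded random generator.

```python
import random
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Trace:
    tool_calls: list[str] = field(default_factory=list)
    final_answer: str = ""


@dataclass
class EvalCase:
    prompt: str
    required_calls: list[str]  # behavior expected from the agent, in this order
    answer_must_contain: str   # loose outcome check instead of exact matching


Agent = Callable[[str, random.Random], Trace]


def grade(case: EvalCase, trace: Trace) -> bool:
    """Pass if the required calls occur in order and the outcome is mentioned."""
    remaining = iter(trace.tool_calls)
    # `call in remaining` consumes the iterator, so this is an in-order check.
    calls_ok = all(call in remaining for call in case.required_calls)
    return calls_ok and case.answer_must_contain.lower() in trace.final_answer.lower()


def pass_rate(agent: Agent, case: EvalCase, trials: int, seed: int = 0) -> float:
    """Run the same scenario several times to account for nondeterminism."""
    rng = random.Random(seed)
    passed = sum(grade(case, agent(case.prompt, rng)) for _ in range(trials))
    return passed / trials


def stub_agent(prompt: str, rng: random.Random) -> Trace:
    """Hypothetical agent that sometimes skips reading the reference file."""
    calls = ["read_skill"]
    if rng.random() < 0.8:
        calls.append("read_reference")
    calls.append("execute_sql")
    answer = "Created a view." if "read_reference" in calls else "Retried the query."
    return Trace(calls, answer)


case = EvalCase(
    prompt="The dashboard query fails. Fix it.",
    required_calls=["read_skill", "read_reference", "execute_sql"],
    answer_must_contain="view",
)

rate = pass_rate(stub_agent, case, trials=50)
print(f"pass rate: {rate:.0%}")
```

The success metric is then a threshold on the pass rate, and the iteration step is to edit the skill and rerun the same cases to see whether the rate moves.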
Observability platforms track agent behavior
Tools like Braintrust run evaluations systematically and give full observability into agent decision-making during controlled test scenarios.
💻 Practical Implementation at Supabase
Prioritizing DAX over traditional DX
Supabase focuses on 'Developer Experience for Agents' (DAX) rather than just human DX, optimizing how AI agents interact with its backend-as-a-service platform and Postgres databases.
Performance review application demonstration
The workshop demonstrates building a skill to guide agents in fixing database errors within a performance review app containing four employees, specifically creating SQL views to resolve data issues.
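The kind of fix described, creating a view so that a failing query succeeds, can be sketched as follows. This is not the workshop's schema or data: the tables, the view name, the employee names, and the scores are all hypothetical, and an in-memory SQLite database stands in for Supabase's Postgres so the example runs on its own.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny, made-up performance review database with four employees."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE reviews (
            employee_id INTEGER NOT NULL REFERENCES employees(id),
            score INTEGER NOT NULL
        );
        INSERT INTO employees VALUES
            (1, 'Employee A'), (2, 'Employee B'), (3, 'Employee C'), (4, 'Employee D');
        INSERT INTO reviews VALUES (1, 4), (1, 5), (2, 3), (3, 5);
        """
    )
    return conn


# The application query expects a view that does not exist yet.
QUERY = "SELECT name, review_count FROM review_summary ORDER BY name"


def query_fails(conn: sqlite3.Connection) -> bool:
    try:
        conn.execute(QUERY)
    except sqlite3.OperationalError:
        return True
    return False


def create_view(conn: sqlite3.Connection) -> None:
    # LEFT JOIN keeps employees who have no reviews, with a count of zero.
    conn.execute(
        """
        CREATE VIEW review_summary AS
        SELECT e.name AS name, COUNT(r.score) AS review_count
        FROM employees AS e
        LEFT JOIN reviews AS r ON r.employee_id = e.id
        GROUP BY e.id, e.name
        """
    )


conn = build_demo_db()
print(query_fails(conn))  # True: the view is missing
create_view(conn)
print(conn.execute(QUERY).fetchall())
# [('Employee A', 2), ('Employee B', 1), ('Employee C', 1), ('Employee D', 0)]
```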
Bottom Line
Combine MCP tools for external integrations with skills for progressive context disclosure, then systematically test using evaluation-driven development cycles that assess agent reasoning rather than deterministic outputs.