Skill Issue: How We Used AI to Make Agents Actually Good at Supabase — Pedro Rodrigues, Supabase
TL;DR
Pedro Rodrigues from Supabase details how structured 'skills'—markdown-based instruction sets with progressive disclosure—dramatically improve AI agent performance on complex products, how they differ from MCP tools, and how an evaluation-driven development framework lets teams test them systematically.
📁 Skills Architecture & Progressive Disclosure 3 insights
Skills function as structured knowledge books
A skill consists of a skill.md file containing front matter (name and description) that acts as an 'index on steroids,' plus optional reference files and scripts stored in a references folder.
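A minimal sketch of what such a skill.md might look like on disk (the name, description, and reference file names here are illustrative, not Supabase's actual skill):

```markdown
---
name: supabase-postgres
description: How to inspect, migrate, and debug a Supabase Postgres database.
  Use this skill when a task involves schema changes, RLS policies, or SQL errors.
---

# Supabase Postgres skill

When the user reports a database error:
1. Inspect the existing schema before writing any SQL.
2. For migration workflows, read references/migrations.md.
3. For row-level security questions, read references/rls-policies.md.
```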
Progressive disclosure minimizes context bloat
Unlike loading all documentation immediately, skills use progressive disclosure where only the skill.md front matter loads initially, and the agent fetches additional reference files only when specifically needed.
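A rough sketch of how that loading pattern could work on the agent-runtime side (simplified and hypothetical; the talk summary doesn't specify the actual mechanism): only the front matter enters the initial context, and reference files are read when the agent asks for them.

```python
from pathlib import Path

def load_front_matter(skill_dir: str) -> str:
    """Return only the YAML front matter block from skill.md."""
    text = Path(skill_dir, "skill.md").read_text()
    # The front matter sits between the first two '---' markers.
    _, front_matter, _body = text.split("---", 2)
    return front_matter.strip()

def load_reference(skill_dir: str, name: str) -> str:
    """Fetch a single reference file only when the agent needs it."""
    return Path(skill_dir, "references", name).read_text()

# Startup: only the lightweight index enters the context window.
context = [load_front_matter("supabase-skill")]

# Later, if the agent decides it needs migration details:
context.append(load_reference("supabase-skill", "migrations.md"))
```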
Reference files create navigable knowledge graphs
Skills support nested references where files can point to other files, creating graph-like structures that allow agents to traverse complex information hierarchies efficiently.
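For example, a reference file can itself point at deeper references, so the agent only walks the branch of the graph it needs (file names hypothetical):

```markdown
<!-- references/migrations.md -->
# Writing migrations

Follow the steps below for schema changes.
If the table has row-level security, also read references/rls-policies.md.
If the change touches auth tables, also read references/auth-schema.md.
```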
⚖️ Skills vs MCP Tools 2 insights
Complementary roles for integrations and context
MCP tools handle external integrations and remote server-side execution, while skills provide progressive context disclosure and detailed workflows that exceed tool description character limits.
Local execution environment requirements
Unlike MCP tools that run remotely, skill scripts execute locally in the user's environment, requiring OS-specific compatibility (Linux, macOS, Windows) and bash access.
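Because these scripts run on the user's machine, they have to tolerate whatever OS and tooling is present. A trivial illustration of the kind of environment guard a bundled script might need (hypothetical script, not from the talk):

```python
import platform
import shutil
import sys

def main() -> None:
    """A skill script runs locally, so verify the environment before doing work."""
    system = platform.system()  # 'Linux', 'Darwin' (macOS), or 'Windows'
    if shutil.which("psql") is None:
        sys.exit(f"psql not found on {system}; install the Postgres client first.")
    print(f"Environment OK on {system}: psql is available.")

if __name__ == "__main__":
    main()
```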
🧪 Evaluation-Driven Development 3 insights
Nondeterministic testing via evaluations
Traditional unit tests fail with LLMs; evaluations (evals) assess agent reasoning steps, tool usage patterns, and behavioral outcomes rather than exact output matching.
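As a concrete illustration, an eval scorer might grade the trace of tool calls the agent made rather than string-matching its final answer. This is a generic sketch, not Supabase's actual eval code; the trace shape and tool names (list_tables, execute_sql) are assumptions.

```python
def score_tool_usage(trace: list[dict]) -> float:
    """Grade behavior from the tool-call trace instead of exact output:
    did the agent inspect the schema, run a fix, and avoid destructive SQL?
    Assumes trace entries look like {"tool": ..., "input": ...}."""
    tools = [step["tool"] for step in trace]
    sql = " ".join(s.get("input", "") for s in trace if s["tool"] == "execute_sql")

    checks = [
        "list_tables" in tools,           # looked at the schema first
        "execute_sql" in tools,           # actually attempted a fix
        "drop table" not in sql.lower(),  # nothing destructive
    ]
    return sum(checks) / len(checks)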
Iterative eval-driven development cycle
The framework involves defining success metrics, creating the skill, running controlled test scenarios with inputs/expected outputs, grading agent behavior, and iterating—similar to TDD but accounting for LLM variability.
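The cycle might be wired together like this (a minimal harness sketch; `run_agent` stands in for however the agent with the skill installed is invoked, and the scenarios and pass threshold are made up for illustration):

```python
test_cases = [
    {
        "input": "The performance review page errors when loading employee scores.",
        "expect": {"creates_view": True, "tables_touched": ["employees", "reviews"]},
    },
    {
        "input": "Add a column for 2025 review cycles.",
        "expect": {"creates_view": False, "tables_touched": ["reviews"]},
    },
]

def run_eval(run_agent, grade, threshold: float = 0.8) -> bool:
    """Run every scenario, grade the agent's behavior, and report pass/fail.
    Because LLMs are nondeterministic, scores are aggregated rather than
    asserted as exact output matches."""
    scores = []
    for case in test_cases:
        trace = run_agent(case["input"])   # returns the agent's tool-call trace
        scores.append(grade(trace, case["expect"]))
    mean = sum(scores) / len(scores)
    print(f"eval score: {mean:.2f} over {len(scores)} cases")
    return mean >= threshold
```

Each iteration of the skill is then a matter of editing skill.md, rerunning the harness, and comparing scores.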
Observability platforms track agent behavior
Platforms like Braintrust run evals systematically and provide full observability into the agent's decision-making during controlled test scenarios.
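With Braintrust specifically, the same scenarios can be registered as an eval so every run is logged with its full trace. A rough sketch assuming the Braintrust Python SDK's `Eval` entry point (check the current SDK docs for the exact signature; the project name, scorer, and task are illustrative):

```python
from braintrust import Eval

def run_agent(prompt: str) -> str:
    """Stand-in for invoking the agent with the skill installed."""
    raise NotImplementedError

def behavior_score(input, output, expected):
    """Simplified behavioral check; a real scorer would inspect the tool-call trace."""
    return 1.0 if "create view" in str(output).lower() else 0.0

Eval(
    "supabase-skill-evals",  # Braintrust project name (illustrative)
    data=lambda: [{"input": "Fix the broken employee scores page",
                   "expected": "agent creates a SQL view"}],
    task=run_agent,
    scores=[behavior_score],
)
```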
💻 Practical Implementation at Supabase 2 insights
Prioritizing DAX over traditional DX
Supabase focuses on 'Developer Experience for Agents' (DAX) rather than just human DX, optimizing how AI agents interact with their backend-as-a-service platform and Postgres databases.
Performance review application demonstration
The workshop demonstrates building a skill that guides an agent through fixing database errors in a performance review app with four employees, specifically by creating SQL views to resolve the data issues.
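The kind of guidance such a skill might encode for the demo app could look like the reference file below; the schema and view are entirely hypothetical, since the talk summary doesn't include the actual data model.

```markdown
<!-- references/fix-review-scores.md -->
# Fixing missing review scores

When the app expects a table or column that does not exist, prefer creating a
view over changing application code, for example:

    create view employee_review_scores as
    select e.id, e.name, avg(r.score) as avg_score
    from employees e
    left join reviews r on r.employee_id = e.id
    group by e.id, e.name;

Verify the view with a select before reporting the error as fixed.
```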
Bottom Line
Combine MCP tools for external integrations with skills for progressive context disclosure, then systematically test using evaluation-driven development cycles that assess agent reasoning rather than deterministic outputs.
More from AI Engineer
Training an LLM from Scratch, Locally — Angelos Perivolaropoulos, ElevenLabs
Angelos Perivolaropoulos from ElevenLabs demonstrates how to train a GPT-2 style language model from scratch using only PyTorch and minimal dependencies, revealing that modern LLM development relies 80% on training methodology and optimization rather than architectural novelty.
Ralph Loops: Build Dumb AI Loops That Ship — Chris Parsons, Cherrypick
Chris Parsons introduces 'Ralph Loops'—a minimalist automation approach where repeatedly prompting an AI agent with the same task outperforms complex orchestration workflows, leveraging the model's self-correction to ship better code with less maintenance.
TLMs: Tiny LLMs and Agents on Edge Devices with LiteRT-LM — Cormac Brick, Google
Cormac Brick from Google AI Edge introduces Tiny LLMs (TLMs) and on-device agent capabilities powered by LiteRT-LM and the new Gemma 4 models, demonstrating how fine-tuned small models (100M-4B parameters) can now deliver sophisticated AI experiences—including multimodal reasoning and tool use—directly on mobile phones, laptops, and even Raspberry Pis without cloud dependency.
Mergeable by default: Building the context engine to save time and tokens — Peter Werry, Unblocked
Peter Werry argues that as AI agents move toward autonomous 'YOLO mode' execution, simple RAG and MCP connections fail to provide adequate organizational context, creating bottlenecks and 'satisfaction of search' failures where agents stop at superficial answers instead of understanding the historical 'why' behind code decisions.