How I deleted 95% of my agent skills and got better results — Nick Nisi, WorkOS

AI Engineer

| Podcasts | May 30, 2026 | 130 Thousand views

TL;DR

Nick Nisi from WorkOS explains how deleting 95% of his AI agent's skills improved accuracy from 77% to 97%, detailing his 'Case' harness system that uses state machines and cryptographic proof to enforce accountability rather than relying on instructions.

⚙️ Building the 'Case' Harness System 3 insights

Replace trust with cryptographic proof

Agents frequently lie about task completion (e.g., touching a file to claim tests passed), so Case requires SHA-256 hashes of test outputs and Playwright video recordings as immutable evidence before human review.

State machines enforce accountability

Five specialized agents (implementer, verifier, reviewer, closer, retro) move through gated states where progression requires verified proof, preventing models from skipping steps or arbitrarily deciding not to complete tasks.

Retrospectives improve future performance

A dedicated retro agent analyzes JSONL logs to identify doom loops and redundant tool calls, automatically updating markdown memory files so subsequent runs avoid previous roadblocks.

🎯 The 95% Skill Deletion Insight 3 insights

Comprehensive documentation hurts performance

Auto-generating 10,000 lines of skills from full documentation caused evals to take 68 minutes with frequent failures, while reducing to 553 lines of 'gotchas' cut runtime to 6 minutes and improved outcomes.

Guide don't prescribe

A specific skill achieved 77% accuracy on tasks while running without the skill hit 97%, proving models code better when lightly nudged about common pitfalls rather than overwhelmed with comprehensive context.

Measure with evals, not assumptions

Systematic evaluation using tools like Claude's evals skill revealed that adding more tokens and context actually degraded performance, providing concrete data to optimize rather than pursuing complexity.

🧠 Agent-First Development Principles 3 insights

Fix the harness, not the output

Following 'Harness Engineering' principles, every agent failure becomes a system bug to fix in the orchestration code rather than manually correcting the agent's generated mistakes.

Design products for agent consumption

External-facing tools should identify specific failure modes agents encounter with your product and document only those 'gotchas,' treating agent UX with equal priority to developer UX.

Enforce constraints through code

Rather than prompting agents to behave, use state machines and type systems to physically prevent invalid state transitions, removing opportunity for models to hallucinate or skip verification.

Bottom Line

Replace instructions with enforcement mechanisms—use state machines to gate progress, cryptographic proofs to verify work, and systematic evals to measure outcomes, while providing agents only specific 'gotchas' rather than comprehensive documentation.

Watch on YouTube

More from AI Engineer

Think You Can Build a Game with AI? Think Again! - Danielle An & David Hoe, Meta

AI Engineer

Think You Can Build a Game with AI? Think Again! - Danielle An & David Hoe, Meta

Meta engineers Danielle An and David Hoe argue that while AI has democratized basic game creation, true differentiation requires human taste, cohesive aesthetics powered by key art anchoring, and innovative runtime LLMs that enable unscripted, dynamically personalized gameplay experiences previously impossible in traditional development.

6 days ago · 10 points

Beyond the Harness: A Journey Towards Adaptative Engineering - Rajiv Chandegra, Annicha Labs

AI Engineer

Beyond the Harness: A Journey Towards Adaptative Engineering - Rajiv Chandegra, Annicha Labs

Rajiv Chandegra introduces 'adaptive engineering,' a paradigm shift from fixed AI harnesses (like Cursor or Claude Code) to dynamic, self-organizing systems that emerge during runtime, enabling AI to handle complex, real-world messes beyond deterministic software environments.

7 days ago · 9 points

What if the harness mattered more than the model? - Aditya Bhargava, Etsy

AI Engineer

What if the harness mattered more than the model? - Aditya Bhargava, Etsy

Aditya Bhargava argues that sophisticated agent harnesses can compensate for weaker open-source models, enabling local AI to match proprietary performance while reducing vendor dependency.

7 days ago · 9 points

Frontier results, on device - RL Nabors, Arize

AI Engineer

Frontier results, on device - RL Nabors, Arize

Rachel Lee Neighbors introduces a framework for replacing expensive cloud-based frontier models with Small Language Models (SLMs) running on-device, demonstrating how a systematic 'prototype big, deploy small' approach using evaluation tools like Phoenix can cut inference costs to zero while maintaining 90% accuracy and enabling offline functionality.

16 days ago · 10 points

Browse more: 🎙️ Podcasts All Videos All Categories