[State of MechInterp] SAEs in Production, Circuit Tracing, AI4Science, "Pragmatic" Interp — Goodfire
TL;DR
Goodfire researchers discuss how mechanistic interpretability has evolved from pure research to practical deployment in 2025, highlighting production applications like PII detection and scientific discovery while navigating the field's pivot toward 'pragmatic' tools that prioritize real-world utility over complete mechanistic understanding.
🏭 Production Deployments & Enterprise Use
PII detection via feature probing
Rakuten deploys a 'sidecar' interpretability model that detects when PII-related features fire in customer chats, achieving higher recall than LLM judges at 1/500th of the cost.
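A minimal sketch of what runtime PII monitoring via feature probing could look like. It assumes a sidecar model has already produced sparse SAE feature activations per token; the feature IDs and threshold below are hypothetical placeholders, not Goodfire's or Rakuten's actual values.

```python
# Hypothetical feature ids for PII concepts (emails, SSNs, etc.)
PII_FEATURE_IDS = {1024, 2177, 4096}
FIRE_THRESHOLD = 0.5  # hypothetical activation threshold

def flags_pii(token_feature_acts: list[dict[int, float]]) -> bool:
    """Return True if any PII-related feature fires on any token.

    token_feature_acts: sparse activations per token, mapping
    feature id -> activation strength (SAE latents are mostly zero).
    """
    return any(
        acts.get(fid, 0.0) > FIRE_THRESHOLD
        for acts in token_feature_acts
        for fid in PII_FEATURE_IDS
    )

# Example: the second token strongly activates a hypothetical "email" feature.
chat_acts = [{7: 0.9}, {2177: 0.8, 3: 0.1}]
print(flags_pii(chat_acts))  # True
```

The check is just a threshold on precomputed sparse activations, which is why this style of monitor can run far cheaper than invoking an LLM judge per message.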
Interpretability enters model cards
Major labs now integrate interpretability into evaluation workflows, with techniques appearing in Gemini 3 and Claude 4 model cards and red teaming processes.
AI for scientific discovery
Researchers apply interpretability to superhuman biological models in genomics and proteomics to identify novel disease biomarkers from opaque 'base pairs in, base pairs out' systems.
🔬 Unsupervised Techniques & Model Control
Direct latent manipulation
Goodfire's paint.goodfire.ai tool lets users manipulate Stable Diffusion's internal feature space directly, dragging and positioning unsupervised concepts like animals onto the canvas without text prompts.
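The core operation behind this kind of latent manipulation can be sketched as adding a scaled concept direction to an internal activation vector. The vectors below are toy numbers, not the actual Stable Diffusion internals that paint.goodfire.ai operates on.

```python
def steer(activation: list[float],
          concept_direction: list[float],
          strength: float) -> list[float]:
    """Shift an activation vector along a concept direction.

    Positive strength pushes the representation toward the concept;
    negative strength pushes away from it.
    """
    return [a + strength * c for a, c in zip(activation, concept_direction)]

activation = [0.2, -0.1, 0.5]          # toy internal activation
cat_direction = [1.0, 0.0, -1.0]       # hypothetical "cat" feature direction
steered = steer(activation, cat_direction, strength=0.3)
print(steered)  # [0.5, -0.1, 0.2]
```

Because the concept directions are discovered unsupervised (e.g. as SAE features), no text prompt is needed to express what to inject.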
Memorization follows a spectrum
Recent research demonstrates that memorization ranges from rote storage of repeated documents to logical reasoning capabilities, with factual recall existing between these extremes and proving difficult to disentangle from core cognition.
Cross-layer circuit tracing
Anthropic's circuit-tracing paper introduces cross-layer transcoders that construct attribution graphs mapping model computations across layers, though Goodfire's replication efforts confirm these remain computationally intensive to scale.
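A toy sketch of building an attribution graph's edge list. In cross-layer transcoder work, an edge weight is roughly the upstream feature's activation times its direct linear effect on the downstream feature; the numbers and threshold here are illustrative only, and real attribution graphs span many layers and millions of features, which is where the scaling cost comes from.

```python
def attribution_edges(upstream_acts: list[float],
                      effect_weights: list[list[float]],
                      threshold: float = 0.1):
    """Return (upstream_idx, downstream_idx, attribution) edges.

    upstream_acts[i]: activation of upstream feature i
    effect_weights[i][j]: direct linear effect of feature i on
    downstream feature j; edges below `threshold` are pruned.
    """
    edges = []
    for i, act in enumerate(upstream_acts):
        for j, w in enumerate(effect_weights[i]):
            attr = act * w
            if abs(attr) > threshold:
                edges.append((i, j, attr))
    return edges

acts = [0.0, 0.8]                     # feature 0 inactive, feature 1 active
weights = [[0.5, 0.0], [0.3, -0.9]]   # toy direct-effect weights
print(attribution_edges(acts, weights))
```

Inactive features contribute no edges, so the graph stays sparse even though the full weight matrix is dense.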
🎯 The Pragmatic Turn in Interpretability
Beyond alignment science
DeepMind's pivot to 'pragmatic interpretability' signals an industry shift toward tools that solve immediate deployment challenges rather than pursuing complete mechanistic understanding of model internals.
Technique-specific tooling
Effective deployment requires matching methods to use cases, such as feature probing for runtime monitoring, circuit tracing for alignment verification, and specialized approaches for post-training model diffing.
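The technique-to-use-case matching above can be expressed as a simple lookup table. The mapping mirrors the examples in this episode and is illustrative, not an exhaustive or official taxonomy.

```python
# Illustrative mapping of deployment use case -> interpretability technique,
# following the episode's examples.
TECHNIQUE_FOR_USE_CASE = {
    "runtime_monitoring": "feature probing",
    "alignment_verification": "circuit tracing",
    "post_training_diffing": "model diffing",
}

def pick_technique(use_case: str) -> str:
    """Look up the technique suggested for a given deployment use case."""
    try:
        return TECHNIQUE_FOR_USE_CASE[use_case]
    except KeyError:
        raise ValueError(f"no technique mapped for {use_case!r}") from None

print(pick_technique("runtime_monitoring"))  # feature probing
```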
Limits of knowledge editing
Updating specific facts remains intractable because knowledge is entangled with reasoning capabilities, making true 'unlearning' impossible and fact editing risky: edits tend to carry broader cognitive side effects.
Bottom Line
Mechanistic interpretability is transitioning from academic research to practical engineering, with 2025 marking the shift toward production deployments in privacy, science, and monitoring—but success depends on selecting the right interpretability technique for each specific use case rather than pursuing universal mechanistic understanding.