[State of MechInterp] SAEs in Production, Circuit Tracing, AI4Science, "Pragmatic" Interp — Goodfire
TL;DR
Goodfire researchers discuss how mechanistic interpretability has evolved from pure research to practical deployment in 2025, highlighting production applications like PII detection and scientific discovery while navigating the field's pivot toward 'pragmatic' tools that prioritize real-world utility over complete mechanistic understanding.
🏭 Production Deployments & Enterprise Use 3 insights
PII detection via feature probing
Racketin deploys a 'sidecar' interpretability model that detects when PII-related features fire in customer chats, achieving higher recall than LLM judges at 1/500th the cost.
Interpretability enters model cards
Major labs now integrate interpretability into evaluation workflows, with techniques appearing in Gemini 3 and Claude 4 model cards and red teaming processes.
AI for scientific discovery
Researchers apply interpretability to superhuman biological models in genomics and proteomics to identify novel disease biomarkers from opaque 'base pairs in, base pairs out' systems.
🔬 Unsupervised Techniques & Model Control 3 insights
Direct latent manipulation
Goodfire's paint.goodfire.ai tool allows users to manipulate Stable Diffusion's internal feature space directly, enabling users to drag and position unsupervised concepts like animals without text prompts.
Memorization follows a spectrum
Recent research demonstrates that memorization ranges from rote storage of repeated documents to logical reasoning capabilities, with factual recall existing between these extremes and proving difficult to disentangle from core cognition.
Cross-layer circuit tracing
The Topics paper introduces cross-layer transcoders that construct attribution graphs mapping model computations across layers, though Goodfire's replication efforts confirm these remain computationally intensive to scale.
🎯 The Pragmatic Turn in Interpretability 3 insights
Beyond alignment science
DeepMind's pivot to 'pragmatic interpretability' signals an industry shift toward tools that solve immediate deployment challenges rather than pursuing complete mechanistic understanding of model internals.
Technique-specific tooling
Effective deployment requires matching methods to use cases, such as feature probing for runtime monitoring, circuit tracing for alignment verification, and specialized approaches for post-training model diffing.
Limits of knowledge editing
Updating specific facts remains intractable because knowledge is entangled with reasoning capabilities, making true 'unlearning' impossible and fact editing risky without broader cognitive side effects.
Bottom Line
Mechanistic interpretability is transitioning from academic research to practical engineering, with 2025 marking the shift toward production deployments in privacy, science, and monitoring—but success depends on selecting the right interpretability technique for each specific use case rather than pursuing universal mechanistic understanding.
More from Latent Space
View all
The Agent Cloud: Databricks’ Bet on the Future of AI — Matei Zaharia and Reynold Xin
Matei Zaharia and Reynold Xin detail Databricks' open-source 'Agent Cloud' platform (Omnigen), arguing that standardized protocols and persistent infrastructure—not just better models—will determine which enterprises successfully deploy collaborative, secure AI agents at scale.
AI Security After Codex and Claude Code — Zico Kolter & Matt Fredrikson, Gray Swan
Gray Swan co-founders Zico Kolter and Matt Fredrikson explain why AI systems require a fundamentally different security approach than traditional software, highlighting how their automated red teaming system 'Shade' has begun to outperform human experts at finding model vulnerabilities. They emphasize the urgent need to treat AI agents as inherently untrusted entities capable of correlated failures across the software ecosystem.
⚡️Every product of the future will be a living system — Ronak Malde, Trajectory.ai
Ronak Malde explains leaving DeepMind (and $2 billion in acquisition earnings) to found Trajectory.ai, arguing that AI products must evolve from static tools into "living systems" that continually learn from real-world user corrections across enterprise verticals like legal and finance.
The AI Frontier: from FLOPs to Megawatts — Anjney Midha, AMP
Anjney Midha argues that AI infrastructure is facing a crisis of inefficiency and cultural misalignment, proposing that compute be treated as a utility through an Independent System Operator model that pools multi-cloud resources while embedding community incentives directly into unit economics.