[State of MechInterp] SAEs in Production, Circuit Tracing, AI4Science, "Pragmatic" Interp — Goodfire
TL;DR
Goodfire researchers discuss how mechanistic interpretability has evolved from pure research to practical deployment in 2025, highlighting production applications like PII detection and scientific discovery while navigating the field's pivot toward 'pragmatic' tools that prioritize real-world utility over complete mechanistic understanding.
🏭 Production Deployments & Enterprise Use
PII detection via feature probing
Rakuten deploys a 'sidecar' interpretability model that flags customer chats when PII-related features fire, achieving higher recall than LLM judges at 1/500th the cost.
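As a rough illustration of what "detecting when PII-related features fire" can mean, here is a minimal sketch that thresholds sparse feature activations token by token. The feature indices, labels, threshold, and the `flag_pii` helper are all hypothetical; the episode does not describe the deployed system at this level of detail, and a real probe would likely be trained rather than hand-picked.

```python
from typing import Dict, List, Sequence

# Hypothetical feature indices assumed to respond to PII (illustrative only).
PII_FEATURES: Dict[int, str] = {101: "email address", 202: "phone number"}


def flag_pii(
    token_activations: Sequence[Dict[int, float]],
    threshold: float = 0.5,
) -> List[dict]:
    """Return one record per (token, feature) pair where a PII feature exceeds the threshold."""
    hits = []
    for position, activations in enumerate(token_activations):
        for feature_id, label in PII_FEATURES.items():
            value = activations.get(feature_id, 0.0)
            if value > threshold:
                hits.append(
                    {
                        "position": position,
                        "feature": feature_id,
                        "label": label,
                        "activation": value,
                    }
                )
    return hits


# Toy sparse activations for a three-token message (feature id -> activation).
example = [{7: 0.9}, {101: 1.3, 7: 0.1}, {202: 0.2}]
hits = flag_pii(example)
```

Because the check is a lookup and a comparison over activations that are computed once per token, it needs no additional LLM call per message, which is one plausible reason such a monitor could be much cheaper than an LLM judge.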
Interpretability enters model cards
Major labs now integrate interpretability into evaluation workflows, with techniques appearing in Gemini 3 and Claude 4 model cards and red teaming processes.
AI for scientific discovery
Researchers apply interpretability to superhuman biological models in genomics and proteomics to identify novel disease biomarkers from opaque 'base pairs in, base pairs out' systems.
🔬 Unsupervised Techniques & Model Control
Direct latent manipulation
Goodfire's paint.goodfire.ai tool lets users manipulate Stable Diffusion's internal feature space directly, dragging and positioning unsupervised concepts such as animals without text prompts.
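One way to picture spatial feature manipulation is adding a concept direction to the latent vectors inside a chosen region of a feature map. The sketch below assumes a 2-D grid of latent vectors and a single concept direction; `paint_concept`, the grid layout, and the `strength` parameter are illustrative assumptions, not a description of how paint.goodfire.ai is implemented.

```python
from typing import List, Tuple

Vector = List[float]
Grid = List[List[Vector]]


def paint_concept(
    latent_grid: Grid,
    direction: Vector,
    region: Tuple[int, int, int, int],
    strength: float = 1.0,
) -> Grid:
    """Return a copy of the grid with `direction` added inside region (top, left, bottom, right)."""
    top, left, bottom, right = region
    # Copy every vector so the input grid is left untouched.
    painted = [[list(vec) for vec in row] for row in latent_grid]
    for r in range(top, bottom):
        for c in range(left, right):
            painted[r][c] = [v + strength * d for v, d in zip(painted[r][c], direction)]
    return painted


# Toy 2x2 grid of 2-dimensional latents, all zeros.
grid = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
# Hypothetical "animal" direction painted into the top-left cell only.
painted = paint_concept(grid, direction=[1.0, 0.0], region=(0, 0, 1, 1), strength=2.0)
```

Moving the region while keeping the direction fixed corresponds loosely to dragging a concept across the canvas.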
Memorization follows a spectrum
Recent research demonstrates that memorization ranges from rote storage of repeated documents to logical reasoning capabilities, with factual recall existing between these extremes and proving difficult to disentangle from core cognition.
Cross-layer circuit tracing
The Anthropic paper introduces cross-layer transcoders that construct attribution graphs mapping model computations across layers, though Goodfire's replication efforts confirm that these remain computationally expensive to scale.
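To make the idea of an attribution graph concrete, the sketch below computes edge weights between features in a simplified linear setting: each edge is the source feature's activation times the dot product of its decoder vector with the target feature's encoder vector. This ignores attention, normalization, and any intermediate layers, so it is a toy under stated assumptions rather than the published method; all names are hypothetical.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def attribution_edges(
    source_acts: Dict[str, float],
    source_decoders: Dict[str, Vector],
    target_encoders: Dict[str, Vector],
) -> Dict[Tuple[str, str], float]:
    """Edge weight = source activation * (source decoder . target encoder)."""
    edges: Dict[Tuple[str, str], float] = {}
    for source, activation in source_acts.items():
        if activation == 0.0:
            continue  # inactive features contribute nothing on this input
        for target, encoder in target_encoders.items():
            edges[(source, target)] = activation * dot(source_decoders[source], encoder)
    return edges


# Toy example: two earlier-layer features writing to, and two later-layer features reading from, a 2-d residual stream.
acts = {"s0": 2.0, "s1": 0.0}
decoders = {"s0": [1.0, 0.0], "s1": [0.0, 1.0]}
encoders = {"t0": [0.5, 0.5], "t1": [0.0, 1.0]}
edges = attribution_edges(acts, decoders, encoders)
```

Even in this toy form the number of candidate edges grows with the product of source and target feature counts, which gives some intuition for why full graphs are expensive to build at scale.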
🎯 The Pragmatic Turn in Interpretability
Beyond alignment science
DeepMind's pivot to 'pragmatic interpretability' signals an industry shift toward tools that solve immediate deployment challenges rather than pursuing complete mechanistic understanding of model internals.
Technique-specific tooling
Effective deployment requires matching methods to use cases, such as feature probing for runtime monitoring, circuit tracing for alignment verification, and specialized approaches for post-training model diffing.
Limits of knowledge editing
Updating specific facts remains intractable because knowledge is entangled with reasoning capabilities, which makes true 'unlearning' impossible and makes fact editing risky, since any edit can carry broader cognitive side effects.
Bottom Line
Mechanistic interpretability is transitioning from academic research to practical engineering, with 2025 marking the shift toward production deployments in privacy, science, and monitoring. Success depends on selecting the right interpretability technique for each use case rather than pursuing universal mechanistic understanding.