[State of MechInterp] SAEs in Production, Circuit Tracing, AI4Science, "Pragmatic" Interp — Goodfire

| Podcasts | December 31, 2025 | 980 views | 21:48

TL;DR

Goodfire researchers discuss how mechanistic interpretability has evolved from pure research to practical deployment in 2025, highlighting production applications like PII detection and scientific discovery while navigating the field's pivot toward 'pragmatic' tools that prioritize real-world utility over complete mechanistic understanding.

🏭 Production Deployments & Enterprise Use 3 insights

PII detection via feature probing

Rakuten deploys a 'sidecar' interpretability model that detects when PII-related features fire in customer chats, achieving higher recall than LLM judges at 1/500th the cost.
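As a rough illustration of feature probing for runtime monitoring, the sketch below fits a small logistic-regression probe on activation vectors and flags inputs where the probe fires. Everything here is an assumption made for illustration: the activations are synthetic, the "PII direction" is planted by hand, and the names `train_probe` and `pii_direction` are hypothetical. It is not the deployed system described in the episode.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_probe(acts, labels, lr=0.5, steps=500):
    """Fit a logistic-regression probe (weights, bias) by gradient descent."""
    n, d = acts.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        err = sigmoid(acts @ w + b) - labels  # gradient of log-loss w.r.t. logits
        w -= lr * (acts.T @ err) / n
        b -= lr * err.mean()
    return w, b

# Synthetic stand-in for model activations (assumption: real activations
# would be read from a chosen layer of the monitored model).
rng = np.random.default_rng(0)
d = 16
pii_direction = rng.normal(size=d)
pii_direction /= np.linalg.norm(pii_direction)
labels = rng.integers(0, 2, size=400).astype(float)
acts = rng.normal(size=(400, d)) + 4.0 * labels[:, None] * pii_direction

# Train on the first 300 examples, evaluate recall on the held-out 100.
w, b = train_probe(acts[:300], labels[:300])
test_acts, test_labels = acts[300:], labels[300:]
flagged = sigmoid(test_acts @ w + b) >= 0.5
positives = test_labels == 1
recall = (flagged & positives).sum() / max(positives.sum(), 1)
print(f"held-out recall: {recall:.2f}")
```

The cost argument follows from the shape of the computation: at inference time the probe is a single dot product per activation vector, whereas an LLM judge needs a full forward pass over the text.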

Interpretability enters model cards

Major labs now integrate interpretability into evaluation workflows, with techniques appearing in Gemini 3 and Claude 4 model cards and red teaming processes.

AI for scientific discovery

Researchers apply interpretability to superhuman biological models in genomics and proteomics to identify novel disease biomarkers from opaque 'base pairs in, base pairs out' systems.

🔬 Unsupervised Techniques & Model Control 3 insights

Direct latent manipulation

Goodfire's paint.goodfire.ai tool lets users manipulate Stable Diffusion's internal feature space directly, dragging and positioning unsupervised concepts such as animals without text prompts.
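One way to picture spatial feature manipulation is as adding a feature's direction to the activations at selected positions of a spatial activation map. The sketch below does that on a random array. The function name `paint_feature`, the shapes, and the additive update are assumptions for illustration, not a description of how the tool is implemented.

```python
import numpy as np

def paint_feature(latent_map, direction, mask, strength=1.0):
    """Add a feature direction to the activations at masked positions.

    latent_map: (H, W, D) activations, direction: (D,), mask: (H, W) bool.
    Returns a modified copy; the input is left untouched.
    """
    out = latent_map.copy()
    out[mask] += strength * direction
    return out

rng = np.random.default_rng(0)
latent_map = rng.normal(size=(8, 8, 4))   # hypothetical spatial activations
direction = rng.normal(size=4)            # hypothetical "animal" feature direction
mask = np.zeros((8, 8), dtype=bool)
mask[2:5, 3:6] = True                     # region where the concept is "painted"

painted = paint_feature(latent_map, direction, mask, strength=2.0)
```

Only the masked region changes, which is what makes the edit spatially local: positions outside the mask keep their original activations.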

Memorization follows a spectrum

Recent research demonstrates that memorization ranges from rote storage of repeated documents to logical reasoning capabilities, with factual recall existing between these extremes and proving difficult to disentangle from core cognition.

Cross-layer circuit tracing

Anthropic's circuit tracing paper introduces cross-layer transcoders, which are used to build attribution graphs that map model computations across layers, though Goodfire's replication efforts confirm these remain computationally intensive to scale.
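A minimal sketch of the forward pass of a cross-layer transcoder may help show where the cost comes from. It assumes a simplified form: features at each layer are encoded from that layer's residual stream with a plain ReLU, and each layer's output is reconstructed from the features of that layer and all earlier layers. Weights are random and untrained, bias terms are omitted, and the names `clt_forward`, `enc`, and `dec` are hypothetical.

```python
import numpy as np

def clt_forward(resid, enc, dec):
    """Simplified cross-layer transcoder forward pass.

    resid: list of (d_model,) residual-stream vectors, one per layer.
    enc:   list of (n_feat, d_model) encoder matrices, one per layer.
    dec:   dict mapping (src_layer, tgt_layer) -> (d_model, n_feat) decoder,
           defined for every tgt_layer >= src_layer.
    """
    n_layers = len(resid)
    # Non-negative feature activations per layer (ReLU is a simplification).
    feats = [np.maximum(enc[l] @ resid[l], 0.0) for l in range(n_layers)]
    recon = []
    for tgt in range(n_layers):
        out = np.zeros_like(resid[tgt])
        # Features from this layer and every earlier layer write to tgt.
        for src in range(tgt + 1):
            out += dec[(src, tgt)] @ feats[src]
        recon.append(out)
    return feats, recon

rng = np.random.default_rng(0)
n_layers, d_model, n_feat = 4, 8, 32
resid = [rng.normal(size=d_model) for _ in range(n_layers)]
enc = [rng.normal(size=(n_feat, d_model)) for _ in range(n_layers)]
dec = {(s, t): 0.1 * rng.normal(size=(d_model, n_feat))
       for s in range(n_layers) for t in range(s, n_layers)}

feats, recon = clt_forward(resid, enc, dec)
n_decoders = len(dec)  # n_layers * (n_layers + 1) / 2 decoder matrices
```

In this formulation the number of decoder matrices grows quadratically with depth, which is one reason such models are expensive to train and run at scale.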

🎯 The Pragmatic Turn in Interpretability 3 insights

Beyond alignment science

DeepMind's pivot to 'pragmatic interpretability' signals an industry shift toward tools that solve immediate deployment challenges rather than pursuing complete mechanistic understanding of model internals.

Technique-specific tooling

Effective deployment requires matching methods to use cases, such as feature probing for runtime monitoring, circuit tracing for alignment verification, and specialized approaches for post-training model diffing.

Limits of knowledge editing

Updating specific facts remains intractable because knowledge is entangled with reasoning capabilities, which makes true 'unlearning' impossible and makes fact editing risky, since edits tend to cause broader cognitive side effects.

Bottom Line

Mechanistic interpretability is transitioning from academic research to practical engineering, with 2025 marking the shift toward production deployments in privacy, science, and monitoring. Success, however, depends on selecting the right interpretability technique for each specific use case rather than pursuing universal mechanistic understanding.
