Don't Fight Backprop: Goodfire's Vision for Intentional Design, w/ Dan Balsam & Tom McGrath

| Podcasts | March 05, 2026 | 287 Thousand views | 1:49:53

TL;DR

Goodfire CTO Dan Balsam and Chief Scientist Tom McGrath discuss the company's $150M Series B and its new 'intentional design' research agenda, which aims to shape training dynamics and loss landscapes rather than only reverse-engineer trained models. They also cover advances in geometric interpretability, which maps continuous conceptual manifolds rather than discrete features.

🎯 Intentional Design Paradigm 3 insights

Don't Fight Backprop

Goodfire advocates shaping loss landscapes so models naturally learn desired behaviors rather than imposing constraints that gradient descent will inevitably circumvent.

Frozen Probe Hallucination Reduction

Their proof-of-concept reduces hallucinations by running detection probes on a frozen copy of the model during training, making it easier to learn correct behaviors than to evade detection.
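
As a rough illustration of the idea (not Goodfire's implementation), the sketch below scores a response with a probe that reads activations from a frozen copy of the model, then subtracts that score from the training reward. Every name here (`frozen_activations`, `probe_score`, `shaped_reward`), the random weights, the feature vector standing in for a response, and the penalty weight are hypothetical. The only point is that the probe's inputs come from parameters the optimizer never updates.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 8

# Hypothetical frozen copy of the model: fixed weights, never updated in training.
FROZEN_W = rng.normal(size=(HIDDEN, HIDDEN))
# Hypothetical linear hallucination probe, assumed to be fit on the frozen copy.
PROBE_W = rng.normal(size=HIDDEN)


def frozen_activations(response_features):
    """Activations of the frozen copy for a response (toy one-layer stand-in)."""
    return np.tanh(FROZEN_W @ response_features)


def probe_score(response_features):
    """Probe's estimated hallucination probability, read from the frozen copy."""
    logit = PROBE_W @ frozen_activations(response_features)
    return 1.0 / (1.0 + np.exp(-logit))


def shaped_reward(task_reward, response_features, penalty=1.0):
    """Task reward minus a penalty proportional to the frozen-probe score."""
    return task_reward - penalty * probe_score(response_features)
```

Because `FROZEN_W` and `PROBE_W` stay fixed, the model being trained can lower the penalty only by changing what it outputs, not by shifting the internal representations the probe reads.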

Safety-First Deployment Stance

While intentional design offers promise, the researchers acknowledge these techniques remain immature and should not yet be applied to frontier models.

🔬 Geometric Interpretability 3 insights

Beyond Discrete Features

The field is shifting from sparse autoencoders that label discrete concepts toward mapping continuous geometric manifolds that represent conceptual relationships in latent space.

Manifolds Enable True Generalization

Understanding these geometric structures is necessary for circuit explanations that generalize across all possible inputs rather than merely tracing individual execution paths.

Structure in Representations

Concepts like days of the week form structured geometric patterns such as circular manifolds rather than random disconnected points, driven by co-occurrence statistics in training data.
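
To make the geometry concrete, here is a small synthetic sketch, not taken from real model activations. It places seven day tokens evenly on a circle embedded in a higher-dimensional space and checks two properties one would expect of a circular manifold: nearly all variance lies in a two-dimensional plane, and calendar-adjacent days sit closer together than distant ones. The embedding dimension, the random plane, and the function names are assumptions made for illustration.

```python
import numpy as np

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def circular_embedding(n_items, dim=16, seed=0):
    """Place n_items evenly on a unit circle lying in a random 2D plane of R^dim."""
    rng = np.random.default_rng(seed)
    angles = 2 * np.pi * np.arange(n_items) / n_items
    circle = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (n, 2)
    # Orthonormal basis for a random 2D plane; columns of `basis` span the plane.
    basis, _ = np.linalg.qr(rng.normal(size=(dim, 2)))  # shape (dim, 2)
    return circle @ basis.T  # shape (n, dim)


def top2_variance_fraction(points):
    """Fraction of total variance captured by the top two principal components."""
    centered = points - points.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values ** 2
    return variance[:2].sum() / variance.sum()


emb = circular_embedding(len(DAYS))
# Pairwise Euclidean distances between day embeddings.
dist = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)

print("variance in top 2 components:", top2_variance_fraction(emb))
print("Mon-Tue:", dist[0, 1], "Mon-Sun:", dist[0, 6], "Mon-Thu:", dist[0, 3])
```

In this toy setup Monday is equally close to Tuesday and Sunday, reflecting the wrap-around of the week, while Thursday is much farther away. The same two checks could in principle be run on real activations, where the structure would be noisier.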

🧠 Research Breakthroughs 2 insights

Separating Memory from Reasoning

Goodfire demonstrated that removing weights specialized for fact memorization can improve model performance on certain reasoning tasks.

Debugging Medical AI

Their Prima collaboration revealed that an Alzheimer's diagnosis model relied on DNA fragment length rather than intended biological markers, showcasing interpretability's value for detecting spurious correlations.

Bottom Line

Goodfire argues that AI safety depends less on constraining model behavior after training and more on intentionally designing loss landscapes and training dynamics that make desirable behaviors the path of least resistance for gradient descent.
