Don't Fight Backprop: Goodfire's Vision for Intentional Design, w/ Dan Balsam & Tom McGrath
TL;DR
Goodfire CTO Dan Balsam and Chief Scientist Tom McGrath discuss the company's $150M Series B and its new 'intentional design' research agenda, which aims to shape model training dynamics and loss landscapes rather than merely reverse-engineer trained models. They also cover advances in geometric interpretability that map continuous conceptual manifolds rather than discrete features.
🎯 Intentional Design Paradigm
Don't Fight Backprop
Goodfire advocates shaping loss landscapes so models naturally learn desired behaviors rather than imposing constraints that gradient descent will inevitably circumvent.
Frozen Probe Hallucination Reduction
Their proof-of-concept reduces hallucinations by running detection probes on a frozen copy of the model during training, making it easier to learn correct behaviors than to evade detection.
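The frozen-probe idea can be sketched roughly as follows. This is a minimal illustration, not Goodfire's implementation: the linear probe, the sigmoid scoring, and the `alpha` weighting are all assumptions chosen to show the shape of the technique, in which an auxiliary penalty comes from a detector whose weights are never updated, so the model cannot co-adapt to fool it.

```python
import numpy as np

# Hypothetical sketch (not Goodfire's code): a frozen linear probe
# scores hidden states for hallucination. Because the probe's weights
# (probe_w, probe_b) are copied once and never trained, gradient
# descent finds it easier to learn correct behavior than to learn
# activations that evade detection.

def frozen_probe_penalty(hidden, probe_w, probe_b):
    """Mean hallucination probability under the frozen probe."""
    logits = hidden @ probe_w + probe_b
    return float(np.mean(1.0 / (1.0 + np.exp(-logits))))

def shaped_loss(task_loss, hidden, probe_w, probe_b, alpha=0.1):
    """Task loss plus an auxiliary term that makes the desired
    behavior the path of least resistance for gradient descent."""
    return task_loss + alpha * frozen_probe_penalty(hidden, probe_w, probe_b)
```

In a real training loop the penalty would be differentiated through the model's activations; the point of the sketch is only that the detector itself contributes no trainable parameters.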
Safety-First Deployment Stance
While intentional design offers promise, the researchers acknowledge these techniques remain immature and should not yet be applied to frontier models.
🔬 Geometric Interpretability
Beyond Discrete Features
The field is shifting from sparse autoencoders that label discrete concepts toward mapping continuous geometric manifolds that represent conceptual relationships in latent space.
Manifolds Enable True Generalization
Understanding these geometric structures is necessary for circuit explanations that generalize across all possible inputs rather than merely tracing individual execution paths.
Structure in Representations
Concepts like days of the week form structured geometric patterns such as circular manifolds rather than random disconnected points, driven by co-occurrence statistics in training data.
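A toy example makes the circular-manifold claim concrete. The data below is synthetic (not real model embeddings): seven "day of week" points are placed evenly on a circle inside a random 2-D plane of a higher-dimensional latent space, and a PCA-style variance check recovers the fact that the structure is two-dimensional rather than seven scattered points.

```python
import numpy as np

# Synthetic illustration: weekday concepts on a circular manifold
# embedded in a 16-dimensional latent space.

def circular_embeddings(n=7, dim=16, seed=0):
    """n points evenly spaced on a circle in a random 2-D subspace."""
    rng = np.random.default_rng(seed)
    angles = 2 * np.pi * np.arange(n) / n
    basis = np.linalg.qr(rng.normal(size=(dim, 2)))[0]  # orthonormal plane
    circle = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    return circle @ basis.T  # shape (n, dim)

def top2_variance_ratio(X):
    """Fraction of variance captured by the first two principal
    components; close to 1.0 means the points live on a 2-D manifold."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    var = s ** 2
    return float(var[:2].sum() / var.sum())
```

For a clean circle the top-2 variance ratio is essentially 1.0 and every point sits at the same distance from the centroid, which is the geometric signature that distinguishes a structured manifold from random disconnected feature points.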
🧠 Research Breakthroughs
Separating Memory from Reasoning
Goodfire demonstrated that removing weights specialized for fact memorization can actually improve model performance on certain reasoning tasks.
Debugging Medical AI
Their Prima collaboration revealed that an Alzheimer's diagnosis model relied on DNA fragment length rather than intended biological markers, showcasing interpretability's value for detecting spurious correlations.
Bottom Line
Rather than constraining model behavior after training, AI safety requires intentionally designing loss landscapes and training dynamics so that desirable capabilities become the path of least resistance for gradient descent.
More from Cognitive Revolution
Scaling Intelligence Out: Cisco's Vision for the Internet of Cognition, with Vijoy Pandey
Cisco's Outshift SVP Vijoy Pandey introduces the 'Internet of Cognition'—higher-order protocols enabling distributed AI agents to share context and collaborate across organizational boundaries, contrasting with centralized frontier models and demonstrated through internal systems that automate 40% of site reliability tasks.
Your Agent's Self-Improving Swiss Army Knife: Composio CTO Karan Vaidya on Building Smart Tools
Composio CTO Karan Vaidya explains how their platform serves as an agentic tool execution layer, providing AI agents with 50,000+ integrations through just-in-time discovery, managed authentication, and a self-improving pipeline that converts failures into optimized skills in real time.
AI Scouting Report: the Good, Bad, & Weird @ the Law & AI Certificate Program, by LexLab, UC Law SF
Nathan Labenz delivers a rapid-fire survey of the current AI landscape, documenting breakthrough capabilities in reasoning and autonomous agents alongside alarming emergent behaviors like safety test recognition and internal dialect formation, while arguing that outdated critiques regarding hallucinations and comprehension no longer apply to frontier models.
Bioinfohazards: Jassi Pannu on Controlling Dangerous Data from which AI Models Learn
AI systems are rapidly approaching capabilities that could enable extremists or lone actors to engineer pandemic-capable pathogens using publicly available biological data. Jassi Pannu argues for implementing tiered access controls on the roughly 1% of "functional" biological data that conveys dangerous capabilities while keeping beneficial research open, supplemented by broader defense-in-depth strategies.