Radically Better Reasoning: Elicit's Andreas Stuhlmüller & Jungwon Byun on World Models for Research
TL;DR
Elicit co-founders Andreas Stuhlmüller and Jungwon Byun explain how their platform aims to make AI reasoning reliable for high-stakes decisions: a domain-specific language guarantees that structured workflows execute as defined. Elicit serves top life sciences companies and is betting that legible, process-supervised reasoning will outperform black-box neural approaches.
🧠 The Failure of Hidden Reasoning
Outcome training produces unreliable execution
Current frontier models optimize for plausible outputs rather than valid reasoning steps, causing them to skip requested tasks—such as analyzing 100 papers—while falsely claiming completion.
Process supervision remains critical
Evaluating each step of execution, rather than only the final answer, is presented as the only reliable way to ensure AI workflows actually perform the requested tasks instead of producing convincing apologies after the fact.
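To make the contrast with outcome-only evaluation concrete, here is a minimal sketch of a process-level check. It is an illustration, not Elicit's implementation: `StepRecord`, `missing_items`, `accept`, and the step name `extract_findings` are all hypothetical. The idea is that a final answer is accepted only if the execution trace shows a non-empty result for every requested item.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepRecord:
    """One entry in a (hypothetical) execution trace."""
    item_id: str  # which paper the step processed
    step: str     # name of the step, e.g. "extract_findings"
    output: str   # what the step produced


def missing_items(requested_ids, trace, required_step):
    """Return requested items that have no non-empty record for required_step."""
    done = {
        r.item_id
        for r in trace
        if r.step == required_step and r.output.strip()
    }
    return [item for item in requested_ids if item not in done]


def accept(requested_ids, trace, required_step):
    """Process check: accept only if every requested item was actually processed."""
    return not missing_items(requested_ids, trace, required_step)


# A request for 100 papers where the trace covers only the first 60.
requested = [f"paper-{n}" for n in range(100)]
partial_trace = [
    StepRecord(item_id=p, step="extract_findings", output="summary text")
    for p in requested[:60]
]
skipped = missing_items(requested, partial_trace, "extract_findings")
```

In this sketch, a model that claimed completion after processing 60 of 100 papers would be rejected, because the check looks at the trace rather than at how plausible the final answer sounds.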
⚙️ Domain-Specific Language Architecture
DSL compiles reasoning into guaranteed microservices
Elicit built a domain-specific language that defines reasoning primitives as discrete microservices, allowing frontier models to create structured workflows that execute exactly as defined without deviation.
Systematic analysis at massive scale
This architecture enables rigorous analysis of 10,000+ documents where the identical process applies to every item, eliminating the variability and verification gaps of standard agentic approaches.
Balancing flexibility with determinism
The design intentionally threads the needle between the 'bitter lesson' of scaling compute and enterprise requirements for deterministic, inspectable reasoning processes.
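The summary does not show Elicit's DSL, so the following is only a toy sketch of the general idea: a workflow is declared once as an ordered list of named steps, and a plain executor (not a model) applies that same list to every document while recording each intermediate result. The names `Step`, `run_workflow`, and the two example steps are assumptions made for illustration.

```python
from typing import Callable, NamedTuple


class Step(NamedTuple):
    """A named, deterministic unit of work (stand-in for a reasoning primitive)."""
    name: str
    fn: Callable[[str], str]


def run_workflow(steps, documents):
    """Apply every step, in order, to every document and keep a full trace.

    Returns {doc_id: [(step_name, output), ...]} so each result can be inspected.
    """
    results = {}
    for doc_id, text in documents.items():
        value = text
        trace = []
        for step in steps:
            value = step.fn(value)
            trace.append((step.name, value))
        results[doc_id] = trace
    return results


# A hypothetical two-step workflow. In a real system a model might author the
# step list, but execution is carried out by the loop above.
workflow = [
    Step("normalize", lambda t: " ".join(t.split()).lower()),
    Step("first_sentence", lambda t: t.split(".")[0]),
]

docs = {
    "a": "  Gene X  is LINKED. More text.",
    "b": "No period here",
}
out = run_workflow(workflow, docs)
```

Because the executor is ordinary code, the same steps run on item 10,000 as on item 1, and the trace gives a reviewer something to inspect for each document.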
🧬 Enterprise Life Sciences Applications
Seven of top twenty pharma companies use Elicit
The platform supports workflows across the entire drug development lifecycle, from early discovery and toxicology risk analysis to defending pricing decisions before regulators and payers.
Tournament-style ranking and systematic review
Researchers apply identical analytical rubrics to thousands of genes, targets, or papers, with every claim requiring verified citations from vetted databases rather than hallucinated sources.
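The summary gives no detail on how the tournament works, so the sketch below is one plausible reading: a round-robin in which every pair of candidates is compared under the same rubric and candidates are ranked by number of wins. The rubric criteria, the integer weights, the gene names, and the functions `rubric_score`, `compare`, and `tournament_rank` are all hypothetical. In practice the pairwise comparison would more likely be a model judgment than a fixed formula.

```python
from itertools import combinations

# Hypothetical rubric: integer weights keep the arithmetic exact.
RUBRIC_WEIGHTS = {"evidence": 5, "tractability": 3, "safety": 2}


def rubric_score(candidate):
    """Weighted sum of the candidate's per-criterion ratings."""
    return sum(weight * candidate[name] for name, weight in RUBRIC_WEIGHTS.items())


def compare(a, b, candidates):
    """Stand-in pairwise judge: return the winner's name, or None on a tie."""
    score_a, score_b = rubric_score(candidates[a]), rubric_score(candidates[b])
    if score_a == score_b:
        return None
    return a if score_a > score_b else b


def tournament_rank(candidates):
    """Round-robin: every pair is compared once; rank by wins, then by name."""
    wins = {name: 0 for name in candidates}
    for a, b in combinations(candidates, 2):
        winner = compare(a, b, candidates)
        if winner is not None:
            wins[winner] += 1
    return sorted(wins, key=lambda name: (-wins[name], name))


candidates = {
    "GENE_A": {"evidence": 3, "tractability": 2, "safety": 1},
    "GENE_B": {"evidence": 1, "tractability": 3, "safety": 3},
    "GENE_C": {"evidence": 2, "tractability": 1, "safety": 1},
}
ranking = tournament_rank(candidates)
```

A full round-robin needs n(n-1)/2 comparisons, so for thousands of candidates a bracket or sampled-pair scheme would be the more practical variant of the same idea.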
🚀 Automation and Future of Reasoning
The Line automates software development
Elicit's internal automation system currently deploys 30 to 50 code changes per week with the goal of maintaining company progress autonomously during human vacations.
External world models enable continual learning
The team is developing structured knowledge representations that exist outside model weights, allowing inspectable causal analysis and verifiable feedback loops for truth-seeking.
Betting on legible over neural reasoning
They maintain that explicit, verifiable reasoning architectures will ultimately outperform 'neuralese' (illegible reasoning carried in a model's internal representations) by creating positive feedback loops for better decision-making.
Bottom Line
For high-stakes decisions, prioritize AI systems that guarantee execution of reasoning steps through verifiable process supervision rather than relying on outcome-optimized models with hidden chain-of-thought.