Moonlake: Multimodal, Interactive, and Efficient World Models — with Fan-yun Sun and Chris Manning
TL;DR
Moonlake founders Fan-yun Sun and Chris Manning argue that true world models require action-conditioned symbolic reasoning about physics and consequences, not just pixel prediction, enabling spatial intelligence with orders of magnitude less data than pure scaling approaches.
🌍 What World Models Actually Are 2 insights
Action-conditioned prediction separates world models from video generators
Unlike Sora-style video generators that predict pixels, true world models must predict the consequences of specific actions minutes into the future, requiring understanding of 3D physics and object permanence.
Long-term consistency requires semantic abstraction
Maintaining coherent game states or simulations over extended periods requires abstract symbolic representations rather than raw pixel processing. The speakers point to human cognition, which filters out most visual input, as supporting evidence.
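The distinction in this section can be sketched in a few lines of Python. This is a minimal illustration, not Moonlake's implementation: the `State` fields, the action names, and the `step` and `rollout` functions are all hypothetical. The point is only that the transition function takes an action as input and operates on a small symbolic state, whereas a pure video generator would map past frames to the next frame with no action argument at all.

```python
from dataclasses import dataclass, replace
from typing import Sequence


@dataclass(frozen=True)
class State:
    """Hypothetical symbolic state: a ball on a 1-D track and a door."""
    ball_x: int
    door_open: bool


def step(state: State, action: str) -> State:
    """Action-conditioned transition: the next state depends on the action."""
    if action == "push_right":
        return replace(state, ball_x=state.ball_x + 1)
    if action == "push_left":
        return replace(state, ball_x=state.ball_x - 1)
    if action == "toggle_door":
        return replace(state, door_open=not state.door_open)
    if action == "wait":
        return state
    raise ValueError(f"unknown action: {action}")


def rollout(state: State, actions: Sequence[str]) -> State:
    """Predict the consequence of a whole action sequence."""
    for action in actions:
        state = step(state, action)
    return state


start = State(ball_x=0, door_open=False)
# Same starting state, different action sequences, different futures.
plan_a = rollout(start, ["push_right", "push_right", "toggle_door"])
plan_b = rollout(start, ["push_left", "wait", "wait"])
```

Because the state is a handful of symbols rather than a frame buffer, it stays exactly consistent no matter how long the rollout runs, which is the property the long-term consistency insight is after.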
🏗️ Structure vs. Scale Thesis 2 insights
Symbolic abstraction enables five orders of magnitude efficiency gains
While not rejecting the bitter lesson entirely, Moonlake bets that structured reasoning traces incorporating geometry, physics, and affordances can achieve what pixel-only models would need orders of magnitude more data and compute to learn.
Current video data lacks essential action labels
Mining observational videos from the internet fails to capture the actions causing state changes, making it difficult to learn causal relationships without expensive action-conditioned simulation data.
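A toy example makes the missing-label problem concrete. The sketch below assumes a hypothetical three-cell track with walls at both ends (the `ACTIONS`, `TRACK_LENGTH`, `step`, and `consistent_actions` names are invented for illustration). Given only a before/after observation, as in unlabeled video, it lists every action that could have produced it.

```python
ACTIONS = ("left", "stay", "right")
TRACK_LENGTH = 3  # hypothetical track with positions 0, 1, 2


def step(pos: int, action: str) -> int:
    """Move along the track; walls at both ends block further movement."""
    delta = {"left": -1, "stay": 0, "right": 1}[action]
    return min(max(pos + delta, 0), TRACK_LENGTH - 1)


def consistent_actions(before: int, after: int) -> list[str]:
    """All actions that explain an observed (before, after) pair."""
    return [a for a in ACTIONS if step(before, a) == after]


# An agent seen at the left wall in two consecutive frames:
ambiguous = consistent_actions(0, 0)    # ['left', 'stay']
# An agent seen moving from the middle to the right end:
unambiguous = consistent_actions(1, 2)  # ['right']
```

Even in this tiny world, some observed transitions are explained by more than one action, so the action cannot always be recovered from observation alone. Data that records the action alongside each transition removes that ambiguity.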
💬 The Role of Language in Spatial Intelligence 2 insights
Language serves as a cognitive tool for abstraction
Following philosopher Dan Dennett, Manning argues that language provides a uniquely symbolic form of knowledge representation, one that gave humans evolutionary advantages in planning and tool-building beyond what vision alone provides.
Philosophical divergence from LeCun's visual-centric JEPA
Moonlake fundamentally disagrees with Yann LeCun's dismissal of symbolic representations, asserting that latent visual abstractions alone cannot achieve the causal reasoning and long-term planning necessary for embodied AI.
Bottom Line
Building world models requires structured symbolic reasoning about actions and physics rather than pure pixel prediction, leveraging language as a cognitive tool to achieve efficient, consistent spatial intelligence.