Moonlake: Multimodal, Interactive, and Efficient World Models — with Fan-yun Sun and Chris Manning

| Podcasts | April 02, 2026 | 2.99K views | 1:06:48

TL;DR

Moonlake founders Fan-yun Sun and Chris Manning argue that true world models require action-conditioned symbolic reasoning about physics and consequences, not just pixel prediction, enabling spatial intelligence with orders of magnitude less data than pure scaling approaches.

🌍 What World Models Actually Are

Action-conditioned prediction separates world models from video generators

Unlike Sora-style video generators that predict pixels, true world models must predict the consequences of specific actions minutes into the future, requiring understanding of 3D physics and object permanence.
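
To make the contrast concrete, here is a minimal Python sketch (not from the episode; the interfaces and type names are illustrative assumptions): a video generator extrapolates pixels from past frames, while a world model maps a state and an action to a successor state.

```python
from typing import Protocol, Sequence

# Illustrative placeholder types, not from the episode.
Frame = bytes    # raw pixels
State = dict     # symbolic scene description: objects, poses, contacts
Action = str     # e.g. "push the red block 10 cm to the left"

class VideoGenerator(Protocol):
    def predict(self, past_frames: Sequence[Frame]) -> Sequence[Frame]:
        """Extrapolate pixels, p(x_{t+1:T} | x_{1:t}); no notion of an agent's action."""
        ...

class WorldModel(Protocol):
    def step(self, state: State, action: Action) -> State:
        """Predict the consequence of an action, p(s_{t+1} | s_t, a_t)."""
        ...
```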

Long-term consistency requires semantic abstraction

Maintaining coherent game states or simulations over extended periods requires abstract symbolic representations rather than raw pixel processing, a point the founders support by analogy to human cognition, which filters out most visual input.
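
A rough sketch of why abstraction helps, assuming a step function like the one above (the names are hypothetical, not the episode's): each step rewrites a compact symbolic record instead of regenerating every pixel, so errors have far fewer places to accumulate over a long horizon.

```python
from typing import Callable

State = dict     # e.g. {"ball": (0.2, 1.0, 0.0), "door_open": False}
Action = str     # e.g. "open_door"

def rollout(step: Callable[[State, Action], State],
            state: State, actions: list[Action]) -> list[State]:
    """Roll a symbolic state forward through a sequence of actions."""
    trajectory = [state]
    for action in actions:
        state = step(state, action)   # symbolic update, no frame synthesis
        trajectory.append(state)
    return trajectory
```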

🏗️ Structure vs. Scale Thesis

Symbolic abstraction enables five orders of magnitude efficiency gains

While not rejecting the bitter lesson entirely, Moonlake bets that structured reasoning traces incorporating geometry, physics, and affordances can learn what pixel-only models would need exponentially more data and compute to acquire.

Current video data lacks essential action labels

Mining observational videos from the internet fails to capture the actions causing state changes, making it difficult to learn causal relationships without expensive action-conditioned simulation data.
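
The data gap can be stated as a schema difference (a hypothetical sketch, not the episode's notation): scraped video yields consecutive frames with the intervening action unobserved, whereas the transitions needed to fit p(s' | s, a) must record the action explicitly.

```python
from typing import NamedTuple

class ObservedPair(NamedTuple):
    # What internet video provides: consecutive frames, cause unrecorded.
    frame_t: bytes
    frame_t_plus_1: bytes

class Transition(NamedTuple):
    # What action-conditioned simulation provides: the (state, action,
    # next state) triple needed to learn the consequences of actions.
    state_t: dict
    action_t: str
    state_t_plus_1: dict
```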

💬 The Role of Language in Spatial Intelligence

Language serves as a cognitive tool for abstraction

Following philosopher Dan Dennett, Manning argues that language provides a uniquely powerful symbolic representation of knowledge, one that gave humans evolutionary advantages in planning and tool-building beyond what vision alone provides.

Philosophical divergence from LeCun's visual-centric JEPA

Moonlake fundamentally disagrees with Yann LeCun's dismissal of symbolic representations, asserting that latent visual abstractions alone cannot achieve the causal reasoning and long-term planning necessary for embodied AI.

Bottom Line

Building world models requires structured symbolic reasoning about actions and physics rather than pure pixel prediction, leveraging language as a cognitive tool to achieve efficient, consistent spatial intelligence.
