Stanford CS25: Transformers United V6 I From Language Models to Native Multimodal Intelligence
TL;DR
Victoria Lynn from Thinking Machines Lab traces the evolution from language models to native multimodal AI systems. She explains how tokenization lets transformers process images, audio, and video alongside text, and compares discrete-token approaches (Chameleon) with continuous, diffusion-based methods (Transfusion), weighing generation quality against understanding capability.
🧩 Tokenization Across Modalities
Unified tokenization enables multimodal processing
All modalities (text, images, audio, video) are converted into token sequences through patchification or similar techniques, allowing transformers to process them using next-token prediction paradigms similar to LLMs.
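As a rough illustration of patchification, the sketch below cuts a small grayscale image into non-overlapping square patches and flattens each one into a vector, giving a row-major sequence a transformer could consume. The function name `patchify`, the 4x4 toy image, and the patch size of 2 are assumptions made for this example and are not taken from the lecture; real systems would also project each patch through a learned embedding.

```python
def patchify(image, patch):
    """Split an H x W grid of pixel values into flattened patch vectors.

    Patches are emitted in row-major order (left to right, top to bottom),
    so the result is a sequence of "image tokens" before any embedding.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one patch x patch square into a single vector.
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens


# Hypothetical 4x4 image whose pixel value encodes its position.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(image, 2)
print(len(tokens), tokens[0])
```

With a 4x4 image and 2x2 patches, the sequence has four tokens of four values each; larger patches shorten the sequence at the cost of coarser tokens.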
Scaling laws transfer to multimodal domains
The performance improvements observed from scaling data and model sizes in language models also apply to multimodal systems, though precise scaling law equations remain underexplored compared to text-only models.
Different modalities require different attention mechanisms
While text uses causal attention, images often benefit from bidirectional attention due to differences in information density and data structure across modalities.
🔄 From Input-Only to Omni Models
Input-multimodal models dominate current products
Most state-of-the-art systems like Gemini, Kimi, and Claude process multimodal inputs but generate text-only outputs, focusing on understanding tasks rather than generation.
Omni models enable native multimodal generation
Models like GPT-4o and the Chameleon family can generate both text and other modalities, allowing for interleaved documents containing mixed content like images and text in arbitrary order.
Prompting capabilities extend to mixed modalities
Native multimodal architectures enable complex prompting with mixed inputs, allowing models to perform planning, reasoning, and instruction following across visual, auditory, and textual information simultaneously.
⚖️ Discrete vs. Continuous Representations
Chameleon uses discrete VQ-VAE tokenization
This approach converts images into discrete tokens using vector quantization, which enables pure autoregressive training. Quantization discards information, however, leaving a gap in image understanding relative to continuous encoders.
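The information loss from vector quantization can be seen in a toy sketch: each continuous patch vector is replaced by the index of its nearest codebook entry, and decoding returns only that entry, not the original vector. The three-entry codebook, the example vectors, and the helper names `quantize` and `dequantize` are hypothetical; an actual VQ-VAE learns its codebook jointly with an encoder and decoder.

```python
def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry (squared L2)."""
    ids = []
    for v in vectors:
        dists = [sum((a - b) ** 2 for a, b in zip(v, code)) for code in codebook]
        ids.append(dists.index(min(dists)))
    return ids


def dequantize(ids, codebook):
    """Look up the codebook entry for each discrete token id."""
    return [codebook[i] for i in ids]


# Hypothetical 2-D codebook and encoder outputs, chosen for illustration only.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
patches = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.8]]

ids = quantize(patches, codebook)      # discrete tokens for autoregressive training
recon = dequantize(ids, codebook)      # what a decoder would start from
print(ids, recon)
```

The ids form an ordinary token sequence, which is what makes a single next-token objective possible, but `recon` differs from `patches`: whatever detail fell between codebook entries is gone.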
Transfusion combines autoregressive and diffusion objectives
By using continuous representations with bidirectional attention for images while maintaining causal attention for text, Transfusion achieves better generation quality and token efficiency but faces an encoding dilemma for understanding tasks.
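The mixed attention pattern described here can be sketched as a boolean mask: every position attends causally to earlier positions, and positions inside the same image additionally attend to each other in both directions. The function `mixed_attention_mask` and the modality labels are assumptions for this example, and the sketch assumes consecutive image tokens belong to one image (separate images would need a text or boundary token between them). It is not presented as the exact Transfusion implementation.

```python
def mixed_attention_mask(modalities):
    """Build mask[q][k] = True where query position q may attend to key position k.

    Text positions are causal. A contiguous run of "image" positions is treated
    as one image whose positions attend to each other bidirectionally.
    """
    n = len(modalities)
    # Give each contiguous image run its own block id; text positions get None.
    block, current = [], -1
    for i, kind in enumerate(modalities):
        if kind != "image":
            block.append(None)
        else:
            if i == 0 or modalities[i - 1] != "image":
                current += 1
            block.append(current)

    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            causal = k <= q
            same_image = block[q] is not None and block[q] == block[k]
            mask[q][k] = causal or same_image
    return mask


# Hypothetical sequence: one text token, a two-patch image, one text token.
mask = mixed_attention_mask(["text", "image", "image", "text"])
for row in mask:
    print(["x" if allowed else "." for allowed in row])
```

In this toy sequence the first image patch can see the second (bidirectional within the image), while neither image patch nor the leading text token can see the trailing text token (causal elsewhere).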
Dual encoding strategies address current limitations
State-of-the-art omni models increasingly adopt separate encodings for generation and understanding tasks to overcome the trade-offs between discrete token efficiency and continuous representation quality.
Bottom Line
Native multimodal AI requires a careful architectural trade-off. Discrete tokenization enables unified autoregressive training at some cost to image understanding, while continuous representations improve generation quality but create an encoding dilemma for understanding tasks. The field is moving toward hybrid approaches that combine multiple encoding strategies.