Stanford CS25: Transformers United V6 I From Language Models to Native Multimodal Intelligence

Stanford Online

| Podcasts | June 04, 2026 | 108 Thousand views | 1:04:40

TL;DR

Victoria Lynn of Thinking Machines Lab traces the evolution from language models to native multimodal AI systems. She explains how tokenization lets transformers process images, audio, and video alongside text, then compares discrete-token approaches (Chameleon) with continuous, diffusion-based methods (Transfusion) and the trade-offs each makes between generation quality and understanding.

🧩 Tokenization Across Modalities

Unified tokenization enables multimodal processing

All modalities (text, images, audio, video) are converted into token sequences through patchification or similar techniques, so a transformer can process them with the same next-token prediction objective used by LLMs.
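
As a rough illustration of patchification, the sketch below splits a toy grayscale image into non-overlapping square patches and flattens each one into a vector, giving a sequence a transformer could consume. The function name `patchify`, the 4x4 image, and the 2x2 patch size are assumptions chosen for the example; real systems typically follow this step with a learned projection or tokenizer, which is not shown.

```python
def patchify(image, patch):
    """Split an H x W grid (list of rows) into non-overlapping patch x patch
    blocks, returned in raster order as flat lists (one entry per block)."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one block row by row.
            block = [image[r][c]
                     for r in range(top, top + patch)
                     for c in range(left, left + patch)]
            tokens.append(block)
    return tokens


# A 4x4 toy "image" whose pixel values are 0..15 (hypothetical data).
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(image, patch=2)
print(len(tokens), tokens[0])
```

The 2D image becomes a 1D sequence of four patch vectors, which is the shape of input a sequence model expects.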

Scaling laws transfer to multimodal domains

The performance improvements observed from scaling data and model sizes in language models also apply to multimodal systems, though precise scaling law equations remain underexplored compared to text-only models.

Different modalities require different attention mechanisms

While text uses causal attention, images often benefit from bidirectional attention due to differences in information density and data structure across modalities.

🔄 From Input-Only to Omni Models

Input-multimodal models dominate current products

Most state-of-the-art systems like Gemini, Kimi, and Claude process multimodal inputs but generate text-only outputs, focusing on understanding tasks rather than generation.

Omni models enable native multimodal generation

Models like GPT-4o and the Chameleon family can generate both text and other modalities, allowing for interleaved documents containing mixed content like images and text in arbitrary order.

Prompting capabilities extend to mixed modalities

Native multimodal architectures enable complex prompting with mixed inputs, allowing models to perform planning, reasoning, and instruction following across visual, auditory, and textual information simultaneously.

⚖️ Discrete vs. Continuous Representations

Chameleon uses discrete VQ-VAE tokenization

This approach converts images into discrete tokens using vector quantization, enabling pure autoregressive training but causing information loss that creates performance gaps in image understanding compared to continuous encoders.
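
To make the information loss concrete, here is a minimal sketch of the vector quantization lookup alone: a continuous patch embedding is replaced by the index of its nearest codebook entry, and whatever distance remains is discarded. The four-entry codebook, the embedding values, and the `quantize` helper are hypothetical; an actual VQ-VAE learns the encoder, codebook, and decoder jointly, which this example does not attempt.

```python
def quantize(vector, codebook):
    """Return (index, squared_error) of the nearest codebook entry."""
    best_index, best_error = -1, float("inf")
    for index, code in enumerate(codebook):
        error = sum((v - c) ** 2 for v, c in zip(vector, code))
        if error < best_error:
            best_index, best_error = index, error
    return best_index, best_error


# Hypothetical 2-D codebook with four entries.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
patch_embedding = [0.9, 0.2]

token_id, error = quantize(patch_embedding, codebook)
reconstruction = codebook[token_id]  # what a decoder would start from
print(token_id, reconstruction, error)
```

The discrete `token_id` can be predicted autoregressively like a word, but the nonzero `error` is detail the model never sees again, which is one way to picture the understanding gap described above.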

Transfusion combines autoregressive and diffusion objectives

Transfusion uses continuous image representations with bidirectional attention within images while keeping causal attention for text. This yields better generation quality and token efficiency, but it creates an encoding dilemma: the image encoding that serves generation well is not necessarily the one that serves understanding tasks best.
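
The attention pattern described here can be sketched as a boolean mask that is causal across the whole sequence and fully bidirectional within each image span. This covers only the masking idea, not the diffusion objective, and the `build_mask` helper and the segment layout are assumptions made for illustration rather than code from Transfusion.

```python
def build_mask(segments):
    """segments: list of (kind, length) with kind in {"text", "image"}.
    Returns an n x n boolean matrix; mask[q][k] is True when query
    position q may attend to key position k."""
    span_of = []  # per position: (start, end) of its image span, else None
    position = 0
    for kind, length in segments:
        span = (position, position + length) if kind == "image" else None
        span_of.extend([span] * length)
        position += length

    n = len(span_of)
    # Causal baseline: every position sees itself and everything before it.
    mask = [[k <= q for k in range(n)] for q in range(n)]
    for q in range(n):
        if span_of[q] is not None:
            start, end = span_of[q]
            for k in range(start, end):
                mask[q][k] = True  # full attention inside the same image
    return mask


# Hypothetical layout: 2 text tokens, a 3-patch image, 1 text token.
mask = build_mask([("text", 2), ("image", 3), ("text", 1)])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, the image patches form a filled square on the diagonal, while text rows stay lower-triangular, so text never attends to future positions.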

Dual encoding strategies address current limitations

State-of-the-art omni models increasingly adopt separate encodings for generation and understanding tasks to overcome the trade-offs between discrete token efficiency and continuous representation quality.

Bottom Line

Native multimodal AI requires carefully chosen architectural trade-offs between discrete tokenization (enabling unified autoregressive training but sacrificing image understanding) and continuous representations (improving generation quality but creating encoding dilemmas), with the field moving toward hybrid approaches that combine multiple encoding strategies.
