Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 17: Alignment - Multimodality
TL;DR
This lecture introduces multimodal AI, explaining how transformers process images by converting them into semantic tokens and detailing the CLIP model's contrastive learning approach that aligns visual and textual embeddings to achieve zero-shot capabilities without curated datasets.
🎯 Multimodal Foundations
The omni model vision
Future AI systems must handle arbitrary combinations of text, image, audio, and video as both inputs and outputs, requiring a unified architecture that processes any modality.
Tokenization beyond text
Unlike text subwords, which carry semantic meaning, raw pixels and audio samples are not semantically meaningful units on their own. They must first be converted into discrete or continuous tokens by specialized encoders.
Transformers as universal processors
Despite being designed for text, transformers remain the dominant architecture across all modalities at scale, necessitating methods to feed non-sequential data into token-based systems.
🖼️ CLIP Architecture and Training
Contrastive learning objective
CLIP trains by maximizing the dot products between matching image-text embeddings within a batch while minimizing them for mismatched pairs. Each example is in effect a multi-class classification problem whose candidate classes are the roughly 32,000 other items in the batch.
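The objective described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the lecture's or OpenAI's implementation: the function names are hypothetical, the embeddings are L2-normalized so the dot product is a cosine similarity, and the fixed temperature of 0.07 is an assumption (in practice the temperature is typically a learned parameter).

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_emb[k] and text_emb[k] are assumed to describe the same example.
    The temperature value is an illustrative assumption.
    """
    images = [l2_normalize(v) for v in image_emb]
    texts = [l2_normalize(v) for v in text_emb]
    n = len(images)

    # logits[i][j] = scaled similarity between image i and text j.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]

    # Image -> text: each row is an n-way classification whose correct
    # class is the diagonal entry.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n

    # Text -> image: the same classification over the columns.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n

    return 0.5 * (loss_i2t + loss_t2i)
```

With correctly paired embeddings the diagonal dominates each row and column and the loss approaches zero. Shuffling the pairing moves the large similarities off the diagonal and the loss grows.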
Web-scale training data
The model leverages 400 million noisy image-text pairs scraped from the internet rather than curated datasets like ImageNet, demonstrating that semantically aligned web captions provide sufficient supervision.
Zero-shot breakthrough
Without task-specific fine-tuning, zero-shot CLIP matched or outperformed supervised ResNet models trained on ImageNet's roughly 1.2 million labeled images, showing that competitive accuracy does not require expensive manual labeling through Amazon Mechanical Turk.
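Zero-shot classification with aligned embeddings reduces to a nearest-neighbor lookup: embed one text prompt per candidate label, embed the image, and pick the label whose text embedding is most similar. The sketch below assumes the embeddings have already been produced by some image and text encoder; the function names, the prompt wording in the docstring, and the toy vectors are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_emb, label_embs):
    """Return the label whose text embedding is closest to the image embedding.

    label_embs maps each label to the embedding of a text prompt for it,
    for example "a photo of a dog" (the prompt wording is an assumption).
    """
    return max(
        label_embs,
        key=lambda label: cosine_similarity(image_emb, label_embs[label]),
    )


# Toy 3-dimensional embeddings standing in for real encoder outputs.
label_embs = {
    "dog": [0.9, 0.1, 0.0],
    "cat": [0.1, 0.9, 0.0],
    "car": [0.0, 0.1, 0.9],
}
prediction = zero_shot_classify([0.8, 0.2, 0.1], label_embs)
```

No labeled training images for the target classes are involved. Changing the task only means changing the set of text prompts.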
⚙️ Technical Implementation
Vision Transformer encoding
CLIP uses a ViT-L/14 encoder that splits each 336x336 image into patches of 14x14 pixels (a 24x24 grid, or 576 patches), adds 1D positional embeddings, and applies attention pooling rather than averaging to aggregate the patch representations into a single vector.
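The patch arithmetic is easy to check by hand: the "14" in ViT-L/14 is the patch side length in pixels, so a 336-pixel side holds 336 / 14 = 24 patches. The sketch below computes the grid size and shows how an image is cut into a row-major sequence of flattened patches. The function names are hypothetical and the image is a plain single-channel nested list for simplicity; a real encoder would follow this with a learned linear projection and positional embeddings.

```python
def patch_grid(image_size=336, patch_size=14):
    """Return (patches per side, total patches) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side


def patchify(image, patch_size):
    """Cut an H x W nested list into flattened patches in row-major order.

    The 1D position of a patch is simply its index in the returned list.
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ])
    return patches


# A 4x4 toy image whose pixel value encodes its position (row * 4 + col).
toy_image = [[row * 4 + col for col in range(4)] for row in range(4)]
toy_patches = patchify(toy_image, patch_size=2)
```

For the 336x336 case, `patch_grid()` gives 24 patches per side and 576 patches in total, which is the sequence length the transformer sees before any pooled or class token is added.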
Batch size constraints
The contrastive loss requires massive batch sizes of approximately 32,000 because the softmax operates over the full batch, making the process computationally intensive and difficult to parallelize compared to standard language modeling.
SigLIP efficiency improvement
Google's SigLIP replaces CLIP's batch-level softmax with binary sigmoid classification that evaluates individual image-text pairs independently, improving scalability and training efficiency.
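To make the contrast with the softmax objective concrete, here is a minimal sketch of a pairwise sigmoid loss. Every image-text pair in the batch is treated as its own binary classification (label +1 on the diagonal, -1 elsewhere), so no normalization across the batch is needed. The function names are hypothetical, the embeddings are assumed to be L2-normalized already, and the `scale` and `bias` defaults are illustrative assumptions standing in for parameters that would normally be learned.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sigmoid_pairwise_loss(image_emb, text_emb, scale=10.0, bias=-10.0):
    """Sum of independent binary losses over all image-text pairs.

    image_emb[k] and text_emb[k] are assumed to be matching, unit-length
    embeddings. scale and bias are illustrative assumptions.
    """
    n = len(image_emb)
    total = 0.0
    for i in range(n):
        for j in range(n):
            similarity = sum(a * b for a, b in zip(image_emb[i], text_emb[j]))
            label = 1.0 if i == j else -1.0
            # Each pair contributes on its own; nothing here depends on
            # the other pairs in the batch.
            total += log_sigmoid(label * (scale * similarity + bias))
    return -total / n
```

Because each term depends only on one image and one text, the double loop can be split across devices in chunks without first gathering a full row of logits, which is the scalability advantage the summary refers to.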
Bottom Line
Modern multimodal AI relies on converting images into semantic tokens through encoders like CLIP's Vision Transformer, which learns aligned representations from noisy web-scale data using contrastive objectives rather than curated manual annotations.