Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 17: Alignment - Multimodality

| Podcasts | June 04, 2026 | 467 views | 1:17:40

TL;DR

This lecture introduces multimodal AI, explaining how transformers process images by converting them into semantic tokens and detailing the CLIP model's contrastive learning approach that aligns visual and textual embeddings to achieve zero-shot capabilities without curated datasets.

🎯 Multimodal Foundations 3 insights

The omni model vision

Future AI systems must handle arbitrary combinations of text, image, audio, and video as both inputs and outputs, requiring a unified architecture that processes any modality.

Tokenization beyond text

Unlike text subwords, which carry semantic meaning, raw pixels and audio samples are not semantically meaningful units and must be converted into discrete or continuous tokens by specialized encoders.

Transformers as universal processors

Despite being designed for text, transformers remain the dominant architecture across all modalities at scale, necessitating methods to feed non-sequential data into token-based systems.

🖼️ CLIP Architecture and Training 3 insights

Contrastive learning objective

CLIP trains by maximizing the dot product between the embeddings of each matching image-text pair within a batch while minimizing it for every mismatched pair, effectively performing multi-class classification across 32,000+ examples simultaneously.
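
As a rough illustration of this objective, the sketch below computes a symmetric cross-entropy over an N x N similarity matrix in plain Python. The function names, the L2 normalization step, and the fixed temperature of 0.07 are assumptions made for the example and are not taken from the lecture; a real training setup would use a tensor library and typically a learnable temperature.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; image i and text i form the matching pair.

    temperature=0.07 is an assumed illustrative value.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = similarity of image i and text j, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature
               for txt in txts] for img in imgs]
    # Image-to-text direction: each row is an N-way classification problem.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: same thing over the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Toy batch of two pairs: aligned embeddings should score a lower loss
# than the same embeddings with the captions swapped.
aligned_loss = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[0.0, 1.0], [1.0, 0.0]])
```

Each row of the matrix is treated as one classification problem whose correct class is the diagonal entry, which is why the number of "classes" equals the batch size.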

Web-scale training data

The model leverages 400 million noisy image-text pairs scraped from the internet rather than curated datasets like ImageNet, demonstrating that semantically aligned web captions provide sufficient supervision.

Zero-shot breakthrough

Without task-specific fine-tuning, CLIP outperformed ResNet models trained on 1.2 million ImageNet annotations, eliminating the need for expensive manual labeling through Amazon Mechanical Turk.
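
Zero-shot classification with aligned embeddings can be sketched as a nearest-neighbor lookup: embed one text prompt per class, then pick the class whose text embedding is most similar to the image embedding. The example below is a minimal sketch under that assumption; `zero_shot_classify`, the toy three-dimensional vectors, and the prompt wording in the docstring are hypothetical stand-ins for real encoder outputs.

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_emb, class_text_embs):
    """Return the best label and all scores.

    class_text_embs maps each label to the embedding of a text prompt for
    that class (for example, a caption-like phrase naming the label).
    """
    scores = {label: cosine(image_emb, emb)
              for label, emb in class_text_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical embeddings; a real system would obtain these from the
# trained image and text encoders.
class_text_embs = {
    "dog": [1.0, 0.0, 0.0],
    "cat": [0.0, 1.0, 0.0],
    "car": [0.0, 0.0, 1.0],
}
label, scores = zero_shot_classify([0.9, 0.1, 0.0], class_text_embs)
```

No classifier weights are trained here: adding a new class only requires embedding one more text prompt.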

⚙️ Technical Implementation 3 insights

Vision Transformer encoding

CLIP uses a ViT-L/14 encoder that splits 336x336 images into patches of 14x14 pixels (a 24x24 grid, or 576 patches) with 1D positional embeddings, applying attention pooling rather than averaging to aggregate patch representations into a single vector.
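
The patch arithmetic can be checked with a small sketch. Assuming a patch size of 14 pixels on a 336x336 input, the image is cut into a 24x24 grid, and each patch gets a single 1D position index in row-major order. The `patchify` helper and the single-channel nested-list image are simplifications for illustration only; a real encoder would also project each flattened patch to the model dimension.

```python
def patchify(image, patch_size):
    """Split an H x W image (nested lists, one channel) into flattened
    square patches, returned in row-major order."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches

# A blank 336x336 single-channel image stands in for real pixel data.
image = [[0] * 336 for _ in range(336)]
patches = patchify(image, patch_size=14)

# 1D positional indices: one integer per patch, in row-major order.
positions = list(range(len(patches)))
grid_side = 336 // 14
```

With these numbers the encoder sees 576 patch tokens per image, each holding 196 pixel values per channel before projection.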

Batch size constraints

The contrastive loss requires massive batch sizes of approximately 32,000 because the softmax operates over the full batch, making the process computationally intensive and difficult to parallelize compared to standard language modeling.
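
A back-of-the-envelope calculation shows why the full-batch softmax is costly. The sketch below counts the entries in the batch similarity matrix, assuming a batch size of exactly 32,768 (the summary only says "approximately 32,000") and 4 bytes per entry; both figures are assumptions for illustration.

```python
def similarity_matrix_cost(batch_size, bytes_per_entry=4):
    """Return (number of entries, size in GiB) of the N x N score matrix."""
    entries = batch_size * batch_size
    gib = entries * bytes_per_entry / 2**30
    return entries, gib

# Assumed batch size of 32,768 with 4-byte (float32) scores.
entries, gib = similarity_matrix_cost(32768)
```

Because each row's softmax normalizer needs scores against every other example in the batch, the embeddings have to be shared across all devices before the loss can be computed.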

SigLIP efficiency improvement

Google's SigLIP replaces CLIP's batch-level softmax with binary sigmoid classification that evaluates individual image-text pairs independently, improving scalability and training efficiency.
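
The pairwise sigmoid idea can be sketched as follows: every image-text pair in the batch becomes an independent binary classification, labeled +1 on the diagonal and -1 elsewhere, so no softmax normalizer over the batch is needed. The function name, the L2 normalization, and the fixed `scale=10.0` and `bias=-10.0` values are assumptions chosen for this example; in practice these would usually be learnable parameters.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def sigmoid_pairwise_loss(image_embs, text_embs, scale=10.0, bias=-10.0):
    """Sum of independent binary losses over all N*N pairs, divided by N.

    scale and bias are assumed illustrative values.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    total = 0.0
    for i, img in enumerate(imgs):
        for j, txt in enumerate(txts):
            label = 1.0 if i == j else -1.0  # +1 for matching pairs only
            logit = scale * sum(a * b for a, b in zip(img, txt)) + bias
            total += log_sigmoid(label * logit)
    return -total / n

# Toy batch: aligned pairs should give a lower loss than swapped captions.
aligned_sig = sigmoid_pairwise_loss([[1.0, 0.0], [0.0, 1.0]],
                                    [[1.0, 0.0], [0.0, 1.0]])
swapped_sig = sigmoid_pairwise_loss([[1.0, 0.0], [0.0, 1.0]],
                                    [[0.0, 1.0], [1.0, 0.0]])
```

Since each term depends on only one image and one text, the double loop can be split into chunks and computed on separate devices without a batch-wide normalization step.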

Bottom Line

Modern multimodal AI relies on converting images into semantic tokens through encoders like CLIP's Vision Transformer, which learns aligned representations from noisy web-scale data using contrastive objectives rather than curated manual annotations.
