Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 17: Alignment - Multimodality

Stanford Online

Podcasts | June 04, 2026 | 4.8 thousand views | 1:17:40

TL;DR

This lecture introduces multimodal AI, explaining how transformers process images by converting them into semantic tokens and detailing the CLIP model's contrastive learning approach that aligns visual and textual embeddings to achieve zero-shot capabilities without curated datasets.

🎯 Multimodal Foundations 3 insights

The omni model vision

Future AI systems must handle arbitrary combinations of text, image, audio, and video as both inputs and outputs, requiring a unified architecture that processes any modality.

Tokenization beyond text

Unlike text subwords, which carry semantic meaning, raw pixels and audio samples are not semantically meaningful units. They must be converted into discrete or continuous tokens by specialized encoders.

Transformers as universal processors

Despite being designed for text, transformers remain the dominant architecture across all modalities at scale, necessitating methods to feed non-sequential data into token-based systems.

🖼️ CLIP Architecture and Training 3 insights

Contrastive learning objective

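CLIP trains by maximizing the dot products between matching image-text embeddings within a batch while minimizing the dot products of mismatched pairs. Each image is effectively classified against every caption in the batch, and each caption against every image, which amounts to multi-class classification across 32,000+ examples simultaneously.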
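As a rough illustration of that objective, here is a minimal NumPy sketch of a symmetric contrastive loss over one batch. The function names, the toy batch of 8 pairs, and the fixed temperature of 0.07 are assumptions made for the example; they are not taken from the lecture, and a real system would learn the embeddings with two encoders rather than sample them at random.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an N x N similarity matrix.

    image_emb, text_emb: arrays of shape (N, D); row i of each is a matching pair.
    temperature: held fixed here for illustration (an assumption).
    """
    # L2-normalize so each dot product is a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = image_emb @ text_emb.T / temperature  # shape (N, N)
    diag = np.arange(logits.shape[0])

    # Image -> text: each row is an N-way classification; the correct class is the diagonal.
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Text -> image: the same classification along columns.
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_i2t + loss_t2i)

# Toy data standing in for encoder outputs.
rng = np.random.default_rng(0)
img = rng.normal(size=(8, 16))
aligned_loss = clip_contrastive_loss(img, img + 0.01 * rng.normal(size=(8, 16)))
random_loss = clip_contrastive_loss(img, rng.normal(size=(8, 16)))
```

When the text embeddings nearly coincide with their paired image embeddings, the diagonal dominates every row and column, so `aligned_loss` comes out well below `random_loss`.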

Web-scale training data

The model leverages 400 million noisy image-text pairs scraped from the internet rather than curated datasets like ImageNet, demonstrating that semantically aligned web captions provide sufficient supervision.

Zero-shot breakthrough

Without task-specific fine-tuning, CLIP outperformed ResNet models trained on ImageNet's 1.2 million annotated images, showing that a strong classifier does not have to depend on expensive manual labeling through services such as Amazon Mechanical Turk.
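Zero-shot classification with aligned embeddings can be sketched as a nearest-neighbor lookup: embed one text prompt per class, embed the image, and pick the class whose prompt is most similar. The snippet below is only a toy version of that idea. The prompt template, the class names, and the hand-written vectors standing in for encoder outputs are all assumptions for illustration.

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs):
    """Return (index of the most similar class prompt, cosine similarities)."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    class_text_embs = class_text_embs / np.linalg.norm(
        class_text_embs, axis=1, keepdims=True
    )
    sims = class_text_embs @ image_emb  # shape (num_classes,)
    return int(np.argmax(sims)), sims

# Hypothetical class names and prompt template (not from the lecture).
class_names = ["dog", "cat", "car"]
prompts = [f"a photo of a {name}" for name in class_names]

# Pretend encoder outputs: one text vector per prompt, one image vector.
text_embs = np.eye(3, 4)                     # stand-in for text-encoder outputs
image_emb = np.array([0.1, 0.9, 0.2, 0.0])   # stand-in for an image-encoder output

pred, sims = zero_shot_classify(image_emb, text_embs)
predicted_label = class_names[pred]          # "cat" for these toy vectors
```

No classifier head is trained here: changing the label set only requires writing new prompts and embedding them.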

⚙️ Technical Implementation 3 insights

Vision Transformer encoding

CLIP uses a ViT-L/14 encoder that splits each 336x336 image into patches of 14x14 pixels (a 24x24 grid, or 576 patches) and adds 1D positional embeddings. It applies attention pooling rather than averaging to aggregate the patch representations into a single vector.
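The patch arithmetic can be checked with a short sketch, assuming the 14 in ViT-L/14 is the patch side length in pixels: 336 / 14 = 24 patches per side, so 24 x 24 = 576 patches, each flattened to 14 x 14 x 3 = 588 values. The projection width `d_model = 64` and the random weights below are placeholders chosen for the example, not CLIP's actual parameters.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into a sequence of flattened patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (gh, gw, p, p, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

image = np.zeros((336, 336, 3), dtype=np.float32)
tokens = patchify(image, patch_size=14)  # shape (576, 588)

# Placeholder linear projection and 1D position table (illustrative sizes).
rng = np.random.default_rng(0)
d_model = 64
w_proj = rng.normal(scale=0.02, size=(tokens.shape[1], d_model))
# "1D" means one vector per sequence index, with no explicit row/column structure.
pos_emb = rng.normal(scale=0.02, size=(tokens.shape[0], d_model))
x = tokens @ w_proj + pos_emb  # shape (576, 64): the transformer's input sequence
```

The resulting sequence of 576 vectors plays the same role for the transformer that a sequence of subword embeddings plays in a text model.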

Batch size constraints

The contrastive loss relies on very large batches of roughly 32,000 pairs. Because the softmax normalizes over every pair in the batch, all embeddings must be gathered before the loss can be computed, which makes training computationally intensive and harder to parallelize than standard language modeling.

SigLIP efficiency improvement

Google's SigLIP replaces CLIP's batch-level softmax with binary sigmoid classification that evaluates individual image-text pairs independently, improving scalability and training efficiency.
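To make the contrast with a batch-level softmax concrete, here is a minimal sketch of a pairwise sigmoid loss: every image-text pair gets a binary label (+1 on the diagonal for matches, -1 elsewhere) and contributes its own independent term. The function names and the `scale=10.0` and `bias=-10.0` values are illustrative assumptions, and the random vectors stand in for encoder outputs.

```python
import numpy as np

def log_sigmoid(x):
    # Stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)

def sigmoid_pairwise_loss(image_emb, text_emb, scale=10.0, bias=-10.0):
    """Binary classification of every (image, text) pair in the batch.

    scale and bias are fixed here for illustration (an assumption).
    """
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = scale * (image_emb @ text_emb.T) + bias  # shape (N, N)
    n = logits.shape[0]
    labels = 2.0 * np.eye(n) - 1.0                    # +1 for matches, -1 for mismatches

    # Each entry is its own binary term; no normalization across the batch.
    return -log_sigmoid(labels * logits).sum() / n

# Toy data standing in for encoder outputs.
rng = np.random.default_rng(0)
img = rng.normal(size=(8, 16))
aligned_loss = sigmoid_pairwise_loss(img, img + 0.01 * rng.normal(size=(8, 16)))
random_loss = sigmoid_pairwise_loss(img, rng.normal(size=(8, 16)))
```

Because no term depends on a row-wise or column-wise normalizer, blocks of the N x N matrix can be computed separately and summed, which is the property that eases scaling.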

Bottom Line

Modern multimodal AI relies on converting images into semantic tokens through encoders like CLIP's Vision Transformer, which learns aligned representations from noisy web-scale data using contrastive objectives rather than curated manual annotations.
