Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 17: Alignment - Multimodality
This lecture introduces multimodal AI. It explains how transformers process images by converting them into semantic tokens, and it details the CLIP model's contrastive learning approach, which aligns visual and textual embeddings to achieve zero-shot capabilities without curated datasets.
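
To make the two ideas in the description concrete, here is a minimal NumPy sketch, not the lecture's own code. It assumes images are tokenized by cutting them into fixed-size, non-overlapping patches (the ViT-style approach; the learned projection into the embedding space is omitted) and that CLIP-style training uses a symmetric cross-entropy loss over a batch of paired image and text embeddings. The function names `patchify` and `clip_loss`, the patch size, and the fixed temperature of 0.07 are illustrative assumptions; in CLIP the temperature is a learned parameter.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch * patch * C). A real model
    would then apply a learned linear projection to get token embeddings.
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss for a batch of paired embeddings.

    Row i of img_emb and row i of txt_emb are assumed to be a matching pair;
    every other row in the batch acts as a negative.
    """
    # L2-normalize so the dot product is cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])
    # Image-to-text: each row should put its mass on the diagonal entry.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: same, but normalizing over columns.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (loss_i2t + loss_t2i) / 2
```

For example, `patchify(np.zeros((32, 32, 3)), 16)` yields 4 patch vectors of length 768, and `clip_loss` is lower when matching rows of the two embedding matrices point in the same direction than when the pairs are shuffled. Zero-shot classification then amounts to embedding one text prompt per class and picking the class whose embedding is most similar to the image embedding.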