Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 13: Data (Sources, Datasets)
This lecture establishes that training data—not architecture—is the most critical and secretive component of modern language models, examining the technical impossibility of crawling the 'entire internet,' the three-stage data pipeline from raw web to specialized post-training, and the tightening legal constraints of copyright law and terms of service that increasingly restrict what can be legally used for training.