Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 13: Data (Sources, Datasets)
TL;DR
This lecture argues that training data, not architecture, is the most critical and most closely guarded component of modern language models. It covers why crawling the 'entire internet' is technically impossible, the three-stage data pipeline from raw web text to specialized post-training data, and the copyright and terms-of-service constraints that increasingly restrict what can legally be used for training.
🔒 Data as the Secret Sauce
Data secrecy as competitive advantage
Companies like Meta disclose full architecture details in the Llama 3 paper but deliberately omit data sources, both to protect their competitive advantage and to limit copyright liability.
The human bottleneck in curation
Unlike systems engineering, data curation scales with human effort and remains a long-tail bottleneck requiring surprisingly large teams for cleaning and quality control.
Three-stage quality progression
The pipeline progresses from massive low-quality pre-training corpora, through mid-training with synthetic and high-quality web data, to specialized post-training on chat transcripts and task-specific environments.
🕸️ The Crawlable Web vs. Reality
The 'entire internet' fallacy
Pre-training uses static crawls rather than live web access, missing dynamic app-based content, authenticated platforms like Facebook and X, and pages behind paywalls.
Technical barriers to crawling
robots.txt files increasingly block specific AI crawlers (those of OpenAI, Perplexity, and Anthropic's Claude), while Cloudflare challenges, rate limits, and IP bans prevent automated data collection.
Surge in access restrictions
Research by Shayne Longpre shows that robots.txt restrictions rose sharply after mid-2023, with terms of service increasingly prohibiting AI training use across nearly half the web.
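The per-crawler blocking described above can be sketched with Python's standard `urllib.robotparser`. This is a minimal illustration, not something shown in the lecture: the sample robots.txt, the `example.com` URLs, the helper name `allowed_agents`, and the user-agent tokens `GPTBot`, `ClaudeBot`, and `PerplexityBot` are all assumptions chosen for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration; not taken from any real site.
# It blocks three named AI crawlers entirely and only restricts /private/
# for every other user agent.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed_agents(robots_txt: str, url: str, agents: list[str]) -> dict[str, bool]:
    """Return, for each user agent, whether robots_txt permits fetching url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    agents = ["GPTBot", "ClaudeBot", "PerplexityBot", "SomeOtherBot"]
    verdicts = allowed_agents(ROBOTS_TXT, "https://example.com/articles/1", agents)
    for agent, ok in verdicts.items():
        print(f"{agent}: {'allowed' if ok else 'blocked'}")
```

robots.txt is only a request that well-behaved crawlers honor; the Cloudflare challenges, rate limits, and IP bans mentioned above are enforced separately and are not modeled here.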
⚖️ Copyright and Legal Risk
Automatic copyright protection
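Under the 1976 Copyright Act, original works are protected automatically from the moment of fixation, with no registration required, which covers virtually everything on the internet. The 75-year term cited in the lecture is a simplification: current US law protects most works for the author's life plus 70 years.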
Licensing and public domain
Legally usable training data comes from public domain works (those whose copyright term has expired), Creative Commons sources like Wikipedia, or expensive licensing deals with content platforms.
Shadow libraries and piracy
Repositories like LibGen and Anna's Archive offer massive corpora of copyrighted books and papers, but using them is clear copyright infringement and exposes model developers to legal liability.
Bottom Line
Given tightening technical restrictions and copyright barriers, building legally clean, high-quality training datasets through licensing or careful curation has become the primary competitive moat in foundation model development.