Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 13: Data (Sources, Datasets)

| Podcasts | May 19, 2026 | 1.61 Thousand views | 1:22:02

TL;DR

This lecture argues that training data, not architecture, is the most critical and most closely guarded component of modern language models. It examines why crawling the 'entire internet' is technically impossible, the three-stage data pipeline from raw web text to specialized post-training data, and the tightening constraints of copyright law and terms of service on what can legally be used for training.

🔒 Data as the Secret Sauce 3 insights

Data secrecy as competitive advantage

Companies like Meta disclose full architecture details in Llama 3 papers but deliberately omit data sources to protect competitive advantages and avoid copyright liability.

The human bottleneck in curation

Unlike systems engineering, data curation scales with human effort and remains a long-tail bottleneck requiring surprisingly large teams for cleaning and quality control.

Three-stage quality progression

The pipeline progresses from massive low-quality pre-training corpora, through mid-training with synthetic and high-quality web data, to specialized post-training on chat transcripts and task-specific environments.

🕸️ The Crawlable Web vs. Reality 3 insights

The 'entire internet' fallacy

Pre-training uses static crawls rather than live web access, missing dynamic app-based content, authenticated platforms like Facebook and X, and pages behind paywalls.

Technical barriers to crawling

Robots.txt files increasingly block specific AI crawlers (those of OpenAI, Perplexity, and Claude), while Cloudflare challenges, rate limits, and IP bans block or throttle automated data collection.
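As a rough illustration of how per-crawler blocking works, the sketch below parses a made-up robots.txt with Python's standard `urllib.robotparser` and checks which user agents may fetch a page. The file contents, the `allowed_agents` helper, the `ExampleResearchBot` agent, and the example.com URL are all assumptions for illustration; `GPTBot`, `PerplexityBot`, and `ClaudeBot` are used here as the user-agent tokens for the crawlers named above, which the lecture summary itself does not spell out.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks three AI crawlers site-wide,
# and blocks every other agent only from /private/.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed_agents(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Return, for each user agent, whether robots_txt permits fetching url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    agents = ["GPTBot", "PerplexityBot", "ClaudeBot", "ExampleResearchBot"]
    result = allowed_agents(ROBOTS_TXT, agents, "https://example.com/articles/1")
    for agent, ok in result.items():
        print(f"{agent}: {'allowed' if ok else 'blocked'}")
```

Note that robots.txt is advisory: it only restricts crawlers that choose to honor it, which is why sites also rely on the challenges, rate limits, and IP bans mentioned above.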

Surge in access restrictions

Research by Shayne Longpre shows that robots.txt restrictions jumped significantly after mid-2023, with terms of service increasingly prohibiting AI training use across nearly half the web.

⚖️ Copyright and Legal Risk 3 insights

Automatic copyright protection

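Under the 1976 Copyright Act, original works are protected automatically upon fixation, with no registration required, so virtually everything on the internet is covered. The lecture gives the term as 75 years; the statutory term for most works is the author's life plus 70 years.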

Licensing and public domain

Legal training data comes from public domain works (75+ years old), Creative Commons sources like Wikipedia, or expensive licensing deals with content platforms.

Shadow libraries and piracy

Repositories like LibGen and Anna's Archive provide massive corpora of copyrighted books and papers, but using them constitutes clear copyright infringement and exposes model developers to legal liability.

Bottom Line

Given tightening technical restrictions and copyright barriers, building legally clean, high-quality training datasets through licensing or careful curation has become the primary competitive moat in foundation model development.
