Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 13: Data (Sources, Datasets)

Stanford Online

Podcasts | May 19, 2026 | 4.42 thousand views | 1:22:02

TL;DR

This lecture argues that training data, not architecture, is the most important and most closely guarded component of modern language models. It covers why crawling the 'entire internet' is technically impossible, the three-stage data pipeline from raw web text to specialized post-training data, and the copyright and terms-of-service constraints that increasingly limit what can legally be used for training.

🔒 Data as the Secret Sauce 3 insights

Data secrecy as competitive advantage

Companies like Meta disclose full architecture details in the Llama 3 paper but deliberately omit data sources, both to protect their competitive advantage and to avoid copyright liability.

The human bottleneck in curation

Unlike systems engineering, data curation scales with human effort and remains a long-tail bottleneck requiring surprisingly large teams for cleaning and quality control.

Three-stage quality progression

The pipeline progresses from massive low-quality pre-training corpora, through mid-training with synthetic and high-quality web data, to specialized post-training on chat transcripts and task-specific environments.

🕸️ The Crawlable Web vs. Reality 3 insights

The 'entire internet' fallacy

Pre-training uses static crawls rather than live web access, missing dynamic app-based content, authenticated platforms like Facebook and X, and pages behind paywalls.

Technical barriers to crawling

robots.txt files increasingly block specific AI crawlers (such as those of OpenAI, Perplexity, and Claude), while Cloudflare challenges, rate limits, and IP bans obstruct automated data collection.
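
As a rough illustration of how per-crawler blocking works, the sketch below uses Python's standard `urllib.robotparser` to evaluate a made-up robots.txt. The file contents, the `example.com` URLs, the `ExampleBot` agent, and the helper name `crawl_permissions` are all hypothetical; `GPTBot` and `ClaudeBot` are assumed here as the user-agent tokens for the OpenAI and Claude crawlers, which the summary does not spell out.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: two AI crawlers are blocked site-wide,
# everyone else is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def crawl_permissions(robots_txt, agents, url):
    """Return {agent: allowed?} for one URL under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse from a string, no network access
    return {agent: parser.can_fetch(agent, url) for agent in agents}


agents = ["GPTBot", "ClaudeBot", "ExampleBot"]
article = crawl_permissions(ROBOTS_TXT, agents, "https://example.com/articles/1")
private = crawl_permissions(ROBOTS_TXT, agents, "https://example.com/private/x")

for agent in agents:
    print(f"{agent}: article={article[agent]} private={private[agent]}")
```

Note that robots.txt is only advisory: it tells a well-behaved crawler what the site owner wants, whereas challenges, rate limits, and IP bans are enforced by the server.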

Surge in access restrictions

Research by Shayne Longpre shows that robots.txt restrictions jumped sharply after mid-2023, and that terms of service increasingly prohibit AI training use across nearly half the web.

⚖️ Copyright and Legal Risk 3 insights

Automatic copyright protection

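Under the 1976 Copyright Act, original works are protected automatically once fixed in a tangible medium, with no registration required, which covers virtually everything on the internet. The lecture cites a 75-year term; under current US law the term is generally the author's life plus 70 years, or 95 years from publication for works made for hire.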

Licensing and public domain

Legal training data comes from public domain works (75+ years old), Creative Commons sources like Wikipedia, or expensive licensing deals with content platforms.

Shadow libraries and piracy

Repositories like LibGen and Anna's Archive provide massive corpora of copyrighted books and papers, but using them is clear copyright infringement and exposes model developers to legal liability.

Bottom Line

Given tightening technical restrictions and copyright barriers, building legally clean, high-quality training datasets through licensing or careful curation has become the primary competitive moat in foundation model development.
