Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 14: Data

Stanford Online

| Podcasts | May 27, 2026 | 3.9 Thousand views | 1:24:46

TL;DR

This lecture covers the pre-training data pipeline: converting raw HTML and PDFs into linear text, then using classifier-based filtering to curate domain-specific datasets. Throughout, it emphasizes the trade-off between data quality and training duration.

🔄 Raw Data Transformation 3 insights

HTML linearization is inherently lossy

Converting hierarchical HTML structures into linear token sequences destroys layout information, making nested tables and visual formatting particularly challenging to represent accurately.

Rule-based extraction dominates for speed

Processing relies on fast heuristic rules to remove boilerplate like navigation and ads rather than slow model-based methods, accepting imperfections for throughput.
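As a rough illustration of the rule-based approach described above, here is a minimal sketch using only Python's standard-library HTML parser. The tag lists `SKIP_TAGS` and `BLOCK_TAGS` and the helper `extract_text` are hypothetical choices made for this example, not the rules used by any particular extraction tool.

```python
from html.parser import HTMLParser

# Hypothetical rule set: elements whose contents are treated as boilerplate.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}
# Elements that should start and end on their own line in the output.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # > 0 while inside a boilerplate element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS and self.skip_depth == 0:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS and self.skip_depth == 0:
            self.parts.append("\n")

    def handle_data(self, data):
        # Keep text only when we are outside every boilerplate element.
        if self.skip_depth == 0:
            self.parts.append(data)


def extract_text(html: str) -> str:
    """Linearize HTML into plain text, dropping boilerplate elements."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


page = (
    "<html><head><style>p{}</style></head><body>"
    "<nav><a href='/'>Home</a></nav>"
    "<p>Hello <b>world</b>.</p>"
    "<table><tr><td>a</td><td>b</td></tr></table>"
    "<footer>Copyright</footer></body></html>"
)
print(extract_text(page))
```

The navigation and footer text are dropped, but the two table cells come out fused as `ab`: a small instance of the layout loss that linearization causes.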

PDFs require expensive visual processing

PDFs in Common Crawl often need recrawling due to truncation and require OCR or vision-language models to extract text, but offer higher average quality than HTML.

🎯 Classifier-Based Filtering 4 insights

Filtering matches raw data to target distributions

The standard approach trains fast classifiers like FastText to distinguish desired target data from random web samples, keeping only high-scoring documents.
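The filtering recipe above can be sketched in a few lines. To keep the example dependency-free, a tiny bag-of-words naive Bayes model stands in for FastText; the class `TinyQualityClassifier`, the function `filter_documents`, the example documents, and the 0.5 threshold are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


class TinyQualityClassifier:
    """Bag-of-words naive Bayes; a toy stand-in for a FastText classifier."""

    def __init__(self, positives, negatives):
        # Class 1 = target-like data, class 0 = random web samples.
        self.counts = {1: Counter(), 0: Counter()}
        for doc in positives:
            self.counts[1].update(tokenize(doc))
        for doc in negatives:
            self.counts[0].update(tokenize(doc))
        self.vocab_size = len(set(self.counts[0]) | set(self.counts[1])) + 1
        self.totals = {c: sum(self.counts[c].values()) for c in (0, 1)}
        n = len(positives) + len(negatives)
        self.prior = {1: len(positives) / n, 0: len(negatives) / n}

    def score(self, text):
        """Return the estimated probability that `text` is target-like."""
        logp = {}
        for c in (0, 1):
            lp = math.log(self.prior[c])
            for tok in tokenize(text):
                # Add-one smoothing so unseen tokens get non-zero probability.
                lp += math.log(
                    (self.counts[c][tok] + 1) / (self.totals[c] + self.vocab_size)
                )
            logp[c] = lp
        m = max(logp.values())
        e1, e0 = math.exp(logp[1] - m), math.exp(logp[0] - m)
        return e1 / (e0 + e1)


def filter_documents(docs, clf, threshold=0.5):
    """Keep only documents whose score reaches the threshold."""
    return [d for d in docs if clf.score(d) >= threshold]


target = [
    "the theorem follows from the lemma by induction",
    "we prove the bound using a standard argument",
]
random_web = [
    "click here to buy cheap watches now",
    "subscribe now for free shipping deals",
]
clf = TinyQualityClassifier(target, random_web)
raw = ["we prove the lemma by induction", "buy cheap deals now"]
kept = filter_documents(raw, clf)
```

In a real pipeline the threshold is a tunable knob: raising it keeps fewer, more target-like documents.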

Language identification scales to 176 languages

FastText models trained on Wikipedia and translation sites efficiently detect languages, though code-switching remains a subtle challenge.

Domain-specific filters dramatically boost performance

Targeted collections like OpenWebMath use LaTeX detection and perplexity scoring to curate math data. Models trained on the filtered set outperform models trained on 20x more unfiltered data.
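A math filter of the kind described above combines a cheap rule with a language-model score. The sketch below is a toy version under stated assumptions: the LaTeX command list in `LATEX_PATTERN` is hypothetical, a smoothed unigram model (`UnigramLM`) stands in for whatever language model a real pipeline would use, and the perplexity cutoff of 30 is arbitrary.

```python
import math
import re
from collections import Counter

# Hypothetical rule: a handful of LaTeX commands that suggest math content.
LATEX_PATTERN = re.compile(r"\\(?:frac|sum|int|begin|alpha|sqrt|mathbb)\b")


def has_latex(text):
    return LATEX_PATTERN.search(text) is not None


class UnigramLM:
    """Add-one-smoothed unigram model, used only to illustrate perplexity."""

    def __init__(self, corpus):
        self.counts = Counter(tok for doc in corpus for tok in doc.lower().split())
        self.total = sum(self.counts.values())
        self.vocab_size = len(self.counts) + 1  # +1 slot for unseen tokens

    def perplexity(self, text):
        toks = text.lower().split()
        if not toks:
            return float("inf")
        log_prob = sum(
            math.log((self.counts[t] + 1) / (self.total + self.vocab_size))
            for t in toks
        )
        return math.exp(-log_prob / len(toks))


def keep_math_doc(text, lm, max_perplexity):
    """Keep a document only if it contains LaTeX and looks like the reference text."""
    return has_latex(text) and lm.perplexity(text) <= max_perplexity


reference = [
    "let x be a real number and let f be a function",
    "the sum of the series converges",
]
lm = UnigramLM(reference)
math_doc = r"let f be a function and \frac{1}{x} converges"
spam_doc = "buy cheap watches now"
noisy_doc = r"\frac zzz qqq"
```

Here `math_doc` passes both checks, `spam_doc` fails the LaTeX rule, and `noisy_doc` contains a LaTeX command but is rejected because its perplexity under the reference model is too high.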

Quality definitions require application-specific classifiers

Different filters optimize for educational value, toxicity removal, or encyclopedic style using distinct positive examples from Wikipedia, GPT-4 judgments, or human annotations.

⚖️ Quality versus Quantity Trade-offs 2 insights

Optimal filtering thresholds depend on training duration

Stricter quality filters suit shorter training runs, where the small filtered set is seen only a few times. Longer runs would have to repeat that set many times and risk overfitting, so they can make better use of larger pools of lower-quality data.

High-quality data shows diminishing returns on repetition

Experiments show that repeating small high-quality datasets like DCLM causes overfitting, whereas training longer on lower-quality unfiltered data eventually achieves better loss.
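The arithmetic behind this trade-off is simple: the number of passes over the data is the training token budget divided by the size of the filtered pool. The sketch below makes that concrete. The helper names, the pool size, the keep fractions, and the `max_epochs` limit of 4 are all hypothetical values chosen for illustration, not figures from the lecture.

```python
def epochs_over_filtered_pool(pool_tokens, keep_fraction, train_tokens):
    """Number of passes over the filtered data needed to fill the token budget."""
    filtered_tokens = pool_tokens * keep_fraction
    return train_tokens / filtered_tokens


def strictest_usable_filter(pool_tokens, keep_fractions, train_tokens, max_epochs):
    """Return the smallest keep fraction that stays within `max_epochs` repeats.

    Returns None when even the loosest filter would require too much repetition.
    """
    for frac in sorted(keep_fractions):  # strictest (smallest) first
        if epochs_over_filtered_pool(pool_tokens, frac, train_tokens) <= max_epochs:
            return frac
    return None


POOL = 10**12  # hypothetical raw pool of 1T tokens
FRACTIONS = [0.01, 0.1, 1.0]  # hypothetical filter strictness levels

short_run = strictest_usable_filter(POOL, FRACTIONS, train_tokens=2 * 10**10, max_epochs=4)
long_run = strictest_usable_filter(POOL, FRACTIONS, train_tokens=2 * 10**11, max_epochs=4)
```

Under these assumed numbers, the short run can afford the strictest filter (keeping 1% of the pool), while the longer run has to loosen it to 10% to avoid repeating the data too many times.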

Bottom Line

Effective data curation means balancing filter strictness against total token volume for a given compute budget. Fast classifiers select high-quality, domain-specific subsets that work well for short training runs, while long runs need looser filters and broader coverage so that a small dataset is not repeated to the point of overfitting.
