Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 14: Data
TL;DR
This lecture details the pre-training data pipeline: transforming raw HTML and PDFs into linear text, filtering with classifiers to curate domain-specific datasets, and trading off data quality against training duration.
🔄 Raw Data Transformation 3 insights
HTML linearization is inherently lossy
Converting hierarchical HTML structures into linear token sequences destroys layout information, making nested tables and visual formatting particularly challenging to represent accurately.
Rule-based extraction dominates for speed
Processing relies on fast heuristic rules to remove boilerplate like navigation and ads rather than slow model-based methods, accepting imperfections for throughput.
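To make the rule-based approach concrete, here is a minimal sketch using only Python's standard-library `html.parser`. The set of tags treated as boilerplate (`SKIP_TAGS`) and the helper name `extract_text` are illustrative assumptions, not the rules used by any specific extraction tool.

```python
from html.parser import HTMLParser

# Assumption: these tags usually wrap boilerplate (navigation, ads, scripts).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class TextExtractor(HTMLParser):
    """Collects visible text while ignoring anything inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside at least one skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_text(html: str) -> str:
    """Linearize HTML into newline-separated text chunks."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


page = (
    "<html><body><nav><a href='/'>Home</a></nav>"
    "<p>Main content.</p><script>var x = 1;</script>"
    "<footer>Copyright</footer></body></html>"
)
print(extract_text(page))
```

A rule set like this is cheap to run at crawl scale, but it also shows the lossiness: table structure and visual layout are flattened into plain lines.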
PDFs require expensive visual processing
PDFs in Common Crawl often need recrawling due to truncation and require OCR or vision-language models to extract text, but offer higher average quality than HTML.
🎯 Classifier-Based Filtering 4 insights
Filtering matches raw data to target distributions
The standard approach trains fast classifiers like FastText to distinguish desired target data from random web samples, keeping only high-scoring documents.
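The filtering recipe can be sketched as follows. To keep the example dependency-free, a tiny add-one-smoothed Naive Bayes model stands in for FastText; the class name `NaiveBayesFilter`, the toy documents, and the threshold of 0.0 are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


class NaiveBayesFilter:
    """Bag-of-words stand-in for a fast classifier: target data vs. random web."""

    def __init__(self, positives, negatives):
        self.pos = Counter(t for d in positives for t in tokenize(d))
        self.neg = Counter(t for d in negatives for t in tokenize(d))
        self.vocab = set(self.pos) | set(self.neg)
        self.pos_total = sum(self.pos.values())
        self.neg_total = sum(self.neg.values())

    def score(self, doc):
        """Average per-token log-odds that doc resembles the target data."""
        tokens = [t for t in tokenize(doc) if t in self.vocab]
        if not tokens:
            return 0.0  # no evidence either way
        v = len(self.vocab)
        total = 0.0
        for t in tokens:
            p_pos = (self.pos[t] + 1) / (self.pos_total + v)  # add-one smoothing
            p_neg = (self.neg[t] + 1) / (self.neg_total + v)
            total += math.log(p_pos / p_neg)
        return total / len(tokens)


def filter_docs(clf, docs, threshold=0.0):
    """Keep only documents whose score exceeds the threshold."""
    return [d for d in docs if clf.score(d) > threshold]


target = ["the theorem follows from the lemma", "we prove the theorem by induction"]
random_web = ["buy cheap shoes online now", "click here to win a free prize"]
clf = NaiveBayesFilter(target, random_web)

pool = ["the proof uses induction and the lemma", "click now to buy cheap shoes"]
kept = filter_docs(clf, pool)
```

Raising the threshold makes the filter stricter: fewer documents survive, but those that do look more like the target set.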
Language identification scales to 176 languages
FastText models trained on Wikipedia and translation sites efficiently detect languages, though code-switching remains a subtle challenge.
Domain-specific filters dramatically boost performance
Targeted collections like OpenMathText use LaTeX detection and perplexity scoring to curate math data, yielding better models than those trained on 20x as much unfiltered data.
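A rough sketch of how LaTeX detection and perplexity scoring might combine into one filter is shown below. The regular expression, the add-one-smoothed unigram model (a simple stand-in for a real n-gram language model), the toy reference corpus, and the perplexity cutoff are all assumptions for illustration, not the settings of any actual dataset.

```python
import math
import re
from collections import Counter

# Assumption: a handful of common LaTeX commands signals mathematical content.
# Dollar signs are deliberately not matched, since prices would trigger them.
LATEX_PATTERN = re.compile(r"\\(?:frac|sum|int|sqrt|begin|end)\b")


def has_latex(text: str) -> bool:
    return LATEX_PATTERN.search(text) is not None


class UnigramLM:
    """Add-one-smoothed unigram model fit on a reference (in-domain) corpus."""

    def __init__(self, corpus):
        self.counts = Counter(t for doc in corpus for t in doc.lower().split())
        self.total = sum(self.counts.values())
        self.vocab_size = len(self.counts) + 1  # one extra slot for unseen tokens

    def perplexity(self, text: str) -> float:
        tokens = text.lower().split()
        if not tokens:
            return float("inf")
        log_prob = sum(
            math.log((self.counts[t] + 1) / (self.total + self.vocab_size))
            for t in tokens
        )
        return math.exp(-log_prob / len(tokens))


def keep_math_doc(text: str, lm: UnigramLM, max_perplexity: float) -> bool:
    """Keep a document only if it contains LaTeX and reads like the reference text."""
    return has_latex(text) and lm.perplexity(text) < max_perplexity


reference = ["let x be a real number", "the integral of the function is finite"]
lm = UnigramLM(reference)

math_doc = "the integral \\frac{a}{b} of the function"
spam_doc = "buy cheap shoes online"
```

Lower perplexity means the text looks more like the reference corpus, so the cutoff acts as a second quality gate after the cheap LaTeX rule.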
Quality definitions require application-specific classifiers
Different filters optimize for educational value, toxicity removal, or encyclopedic style using distinct positive examples from Wikipedia, GPT-4 judgments, or human annotations.
⚖️ Quality versus Quantity Trade-offs 2 insights
Optimal filtering thresholds depend on training duration
Stricter quality filters suit shorter training runs, which do not exhaust the small filtered pool. Longer runs would have to repeat that pool and overfit, so they benefit from larger pools of lower-quality data.
High-quality data shows diminishing returns on repetition
Experiments show that repeating small high-quality datasets like DCLM causes overfitting, whereas training longer on lower-quality unfiltered data eventually achieves better loss.
Bottom Line
Effective data curation balances filtering stringency against total token volume for a given compute budget: use fast classifiers to select strict, high-quality subsets for short training runs, and loosen the filters for long runs so the model gets broader coverage instead of repeating a small pool and overfitting.