Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 12: Evaluation
This lecture explores how to evaluate language models. It examines different definitions of "good," from benchmark scores to economic usage; takes a deep dive into perplexity as a foundational metric, along with its limitations; and traces the evolution of exam-based benchmarks such as MMLU from novel challenges to saturated metrics.