Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 7 - Evaluation
TL;DR
This Stanford lecture establishes aesthetics and prompt adherence as the dual pillars for evaluating text-to-image models, compares human evaluation methods from noisy absolute ratings to reliable pairwise comparisons, and details the ELO rating system for robust model benchmarking before addressing the scalability crisis that necessitates automated metrics.
📊 Core Evaluation Dimensions
Prompt adherence vs. aesthetic quality
Models must be judged on two axes: how accurately the output follows the input prompt, and how physically plausible and visually appealing it is.
Safety, diversity, and bias requirements
Beyond quality and adherence, evaluations must consider safety concerns, output diversity, and bias mitigation to ensure robust deployment.
👥 Human Evaluation Paradigms
Absolute rating scales introduce noise
While 1-5 scales offer nuance, they introduce subjective variance as humans struggle to consistently distinguish between adjacent ratings like 4 and 5.
Binary classification simplifies judgment
Reducing the judgment to "good" or "bad" simplifies the task, but it is still an absolute judgment: without a reference point, raters struggle to decide where the threshold lies.
Pairwise comparisons minimize subjectivity
Asking humans to choose between two images is cognitively easier and produces more consistent relative judgments than absolute scoring.
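As a minimal sketch of how pairwise judgments might be recorded and aggregated, the snippet below tallies (winner, loser) votes into a raw win rate per model. The function name `win_rates`, the model names, and the vote format are illustrative assumptions, not something specified in the lecture.

```python
from collections import Counter


def win_rates(votes):
    """Turn a list of (winner, loser) pairwise votes into a win rate per model.

    Hypothetical helper for illustration: each vote is one human choice
    between two images, recorded as the names of the winning and losing models.
    """
    wins = Counter()
    games = Counter()
    for winner, loser in votes:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    # Win rate = comparisons won / comparisons the model took part in.
    return {model: wins[model] / games[model] for model in games}


# Made-up votes, purely for illustration.
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
]
print(win_rates(votes))  # {'model_a': 1.0, 'model_b': 0.5, 'model_c': 0.0}
```

A raw win rate like this treats every opponent as equally strong, which is the limitation a rating system is meant to address.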
🏆 The ELO Rating System
Win rates ignore opponent strength
Defeating a weak model should not count as much as defeating a state-of-the-art model, which makes raw win rates misleading.
Dynamic adjustment based on expected outcomes
The system computes an expected win probability from the two models' current ratings, then adjusts each rating in proportion to the "surprise": the gap between the actual outcome and the expected one.
Efficient leaderboard maintenance
ELO enables dynamic model ranking without requiring exhaustive round-robin comparisons every time a new model enters the evaluation pool.
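The update described above can be sketched with the standard Elo formulation: the expected score of A against B is 1 / (1 + 10^((R_B - R_A) / 400)), and each rating moves by K times the difference between the actual and expected score. The scale constant 400, the K-factor of 32, and the function names are conventional or illustrative assumptions here; the summary does not state which values the lecture uses.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one pairwise comparison.

    k is the assumed K-factor: it caps how far a single result can move a rating.
    """
    expected_a = expected_score(rating_a, rating_b)
    actual_a = 1.0 if a_won else 0.0
    # The "surprise" is (actual - expected): large for upsets, small for expected wins.
    delta = k * (actual_a - expected_a)
    return rating_a + delta, rating_b - delta


# Equal ratings: the winner gains exactly half of K.
print(update_ratings(1500.0, 1500.0, a_won=True))  # (1516.0, 1484.0)

# Upset: a 1400-rated model beats a 1600-rated one and gains more than K/2.
print(update_ratings(1400.0, 1600.0, a_won=True))
```

Because each update only needs the two ratings involved, a newly added model can be placed on the leaderboard after a modest number of comparisons instead of a full round-robin.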
🤖 Automation Imperatives
Human evaluation faces scalability limits
Manual rating is prohibitively expensive, time-consuming, and vulnerable to human inconsistency and fatigue.
Necessity of automated metrics
These limitations create the need for reference-free automated evaluation methods that can keep pace with model development cycles.
Bottom Line
Use pairwise comparisons with ELO ratings rather than absolute scales to obtain reliable human judgments of generative models, then transition to automated metrics to achieve evaluation at scale.