Stanford CS221 | Autumn 2025 | Lecture 14: Bayesian Networks and Learning
TL;DR
This lecture explains how to learn Bayesian network parameters from fully observed data through simple counting and normalization, while reviewing probabilistic inference methods and d-separation rules for determining conditional independence.
🔗 Conditional Independence and Inference 3 insights
D-separation determines conditional independence
Variables are conditionally independent given set C if all paths between them are blocked by C according to three graphical patterns: chains, common causes, and common effects.
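The chain pattern can be checked numerically on a tiny network. The sketch below uses a hypothetical chain A → B → C with made-up CPT numbers (not from the lecture) and verifies by brute-force enumeration that conditioning on the middle node B blocks the path, so A adds no information about C.

```python
import itertools

# Hypothetical chain A -> B -> C with made-up CPT numbers (not from the lecture).
pA = {0: 0.7, 1: 0.3}
pB_given_A = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.4, 1: 0.6}}  # pB_given_A[a][b]
pC_given_B = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}  # pC_given_B[b][c]

def joint(a, b, c):
    # Bayesian network factorization: P(A) P(B|A) P(C|B)
    return pA[a] * pB_given_A[a][b] * pC_given_B[b][c]

def cond_prob_C1(given):
    """P(C=1 | given), where `given` maps a variable name to its observed value."""
    num = den = 0.0
    for a, b, c in itertools.product([0, 1], repeat=3):
        assign = {"A": a, "B": b, "C": c}
        if all(assign[k] == v for k, v in given.items()):
            den += joint(a, b, c)
            if c == 1:
                num += joint(a, b, c)
    return num / den

# Conditioning on B blocks the chain: adding A does not change P(C=1).
p1 = cond_prob_C1({"B": 1})
p2 = cond_prob_C1({"B": 1, "A": 0})
print(abs(p1 - p2) < 1e-12)  # True
```

The same enumeration trick works for the common-cause pattern; only the common-effect (V-structure) pattern behaves oppositely, as the next insight describes.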
Explaining away occurs in V-structures
Conditioning on a common effect (like Alarm) or its descendants makes its parents (Burglary and Earthquake) dependent, creating the explaining away phenomenon.
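Explaining away can be seen in numbers. This sketch uses hypothetical priors and, for simplicity, a deterministic-OR alarm (the lecture's network need not use these values): observing the alarm raises belief in a burglary, but additionally observing an earthquake drops it back to the prior.

```python
import itertools

# Hypothetical priors; Alarm is modeled as a deterministic OR of its parents.
pB = {0: 0.99, 1: 0.01}  # P(Burglary)
pE = {0: 0.98, 1: 0.02}  # P(Earthquake)

def pA_given(b, e, a):
    # Deterministic CPT: alarm fires iff burglary or earthquake occurs.
    return 1.0 if a == (b or e) else 0.0

def joint(b, e, a):
    return pB[b] * pE[e] * pA_given(b, e, a)

def p_burglary(given):
    """P(Burglary=1 | given) by summing over the full joint."""
    num = den = 0.0
    for b, e, a in itertools.product([0, 1], repeat=3):
        assign = {"B": b, "E": e, "A": a}
        if all(assign[k] == v for k, v in given.items()):
            den += joint(b, e, a)
            if b == 1:
                num += joint(b, e, a)
    return num / den

print(round(p_burglary({"A": 1}), 3))           # 0.336: alarm raises belief in burglary
print(round(p_burglary({"A": 1, "E": 1}), 3))   # 0.01: earthquake explains the alarm away
```

Marginally, Burglary and Earthquake are independent; conditioning on their common effect Alarm couples them, which is exactly the V-structure behavior.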
Inference uses exact or sampling methods
Exact inference marginalizes joint probability tables directly, while approximate algorithms like rejection sampling and Gibbs sampling estimate probabilities through simulation.
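Rejection sampling, the simplest of the sampling methods mentioned, can be sketched on the same kind of toy alarm network (illustrative numbers, deterministic-OR alarm, not the lecture's exact model): draw complete samples from the prior and discard any that contradict the evidence.

```python
import random

random.seed(0)

# Toy Burglary/Earthquake/Alarm network; the numbers are illustrative.
def sample():
    b = random.random() < 0.01   # Burglary ~ Bernoulli(0.01)
    e = random.random() < 0.02   # Earthquake ~ Bernoulli(0.02)
    a = b or e                   # deterministic-OR alarm
    return b, e, a

def rejection_estimate(n=200_000):
    """Estimate P(Burglary=1 | Alarm=1) by discarding samples where Alarm=0."""
    kept = hits = 0
    for _ in range(n):
        b, e, a = sample()
        if a:                    # keep only samples consistent with the evidence
            kept += 1
            hits += b
    return hits / kept

print(rejection_estimate())  # close to the exact answer of ~0.336
```

Note the inefficiency this exposes: almost 97% of samples are rejected because the evidence (Alarm=1) is rare, which is one motivation for Gibbs sampling and other smarter approximate-inference schemes.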
📊 Parameter Learning from Complete Data 3 insights
Fully observed setting enables direct counting
When training data contains complete assignments to all variables, parameter estimation reduces to counting occurrences and normalizing into probability distributions.
Local distributions are estimated independently
Each node's conditional probability table is learned separately by counting only the relevant parent-child value combinations and ignoring other variables.
Multi-parent nodes require stratified counting
For nodes with multiple parents, maintain separate count tables for each parent value combination and normalize each into a conditional distribution.
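The count-and-normalize recipe above fits in a few lines. This is a minimal sketch with made-up data and variable names: a node C with two parents (A, B), one count table per parent-value combination, each normalized into a conditional distribution.

```python
from collections import Counter, defaultdict

# Maximum-likelihood CPT estimation for a node C with parents (A, B),
# from fully observed data (the rows below are invented for illustration).
data = [
    {"A": 0, "B": 0, "C": 1},
    {"A": 0, "B": 0, "C": 0},
    {"A": 0, "B": 1, "C": 1},
    {"A": 1, "B": 0, "C": 1},
    {"A": 1, "B": 0, "C": 1},
    {"A": 1, "B": 1, "C": 0},
]

# Stratified counting: one count table per parent-value combination.
counts = defaultdict(Counter)
for row in data:
    counts[(row["A"], row["B"])][row["C"]] += 1

# Normalize each stratum independently into P(C | A, B).
cpt = {
    parents: {c: n / sum(ctr.values()) for c, n in ctr.items()}
    for parents, ctr in counts.items()
}
print(cpt[(0, 0)])  # {1: 0.5, 0: 0.5}
```

Because each node's CPT depends only on its own parent-child counts, the same loop learns every local distribution in the network independently, with no iterative optimization.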
🎬 Practical Learning Examples 2 insights
Single variables use frequency counts
Learning a standalone movie rating distribution requires simply counting occurrences of each rating value and dividing by the total number of observations.
Conditional probabilities stratify by parents
To learn P(Rating|Genre), count rating occurrences separately within each genre category (Drama vs. Comedy) and normalize within each group independently.
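The movie-rating example translates directly to code. The (genre, rating) pairs below are invented for illustration, not taken from the lecture; the point is that each genre's counts are normalized separately.

```python
from collections import Counter, defaultdict

# Learning P(Rating | Genre) from fully observed pairs (invented data).
observations = [
    ("Drama", 5), ("Drama", 4), ("Drama", 5), ("Drama", 3),
    ("Comedy", 2), ("Comedy", 4), ("Comedy", 4),
]

# Count rating occurrences separately within each genre.
by_genre = defaultdict(Counter)
for genre, rating in observations:
    by_genre[genre][rating] += 1

# Normalize within each genre independently.
p_rating_given_genre = {
    genre: {r: n / sum(ctr.values()) for r, n in ctr.items()}
    for genre, ctr in by_genre.items()
}
print(p_rating_given_genre["Drama"][5])   # 0.5: 2 of the 4 Drama ratings are 5
print(p_rating_given_genre["Comedy"][4])  # 2/3: 2 of the 3 Comedy ratings are 4
```

Dropping the genre key recovers the standalone-variable case from the previous insight: a single Counter over ratings, divided by the total number of observations.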
Bottom Line
When data is fully observed, Bayesian network parameter learning requires no iterative optimization—simply count co-occurrences of variables with their parents and normalize these counts into local conditional probability tables.