Reinforcement Learning with Neural Networks: Mathematical Details

| AI & Machine Learning | April 14, 2025 | 26.8K views | 25:01

TL;DR

This video provides a step-by-step mathematical walkthrough of policy gradient reinforcement learning, demonstrating how to derive gradients via the chain rule and use binary reward signals (+1/-1) to correct update directions when training neural networks without labeled data.

🎲 Policy Gradient Fundamentals

Stochastic action selection

The neural network outputs probabilities for each possible action (e.g., P_Norm vs. P_Squatch), and the agent randomly samples from this distribution rather than picking the highest probability, enabling exploration of the action space.
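
A minimal sketch of that sampling step (the 0.7 probability is hard-coded purely for illustration; in the video these probabilities come out of the network):

```python
import numpy as np

# Suppose the network's output layer assigns these probabilities.
p_norm = 0.7
p_squatch = 1.0 - p_norm

# Sample the action from the distribution instead of taking the argmax,
# so the lower-probability action is still explored some of the time.
action = np.random.choice(["Norm", "Squatch"], p=[p_norm, p_squatch])
```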

Cross-entropy on taken actions

Since ground-truth labels are unavailable, the loss function calculates cross-entropy between the probability of the action actually selected and the ideal probability of 1.0, quantifying how far the network is from certainty.
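
In code, treating the taken action's ideal probability as 1.0 collapses the cross-entropy to a single log term; a sketch, with p_taken standing in for whichever of P_Norm or P_Squatch was actually sampled:

```python
import numpy as np

def cross_entropy_taken(p_taken):
    # The "label" for the taken action is 1.0 and 0.0 for the alternative,
    # so only the -log(p_taken) term survives in the cross-entropy sum.
    return -np.log(p_taken)

cross_entropy_taken(0.99)  # ~0.01: nearly certain, tiny loss
cross_entropy_taken(0.50)  # ~0.69: a coin flip, much larger loss
```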

🧮 Chain Rule Mathematics

Three-component gradient decomposition

The derivative of the cross-entropy with respect to the bias requires chaining through three terms: d(CE)/d(sigmoid output) × d(sigmoid output)/d(raw input) × d(raw input)/d(bias), where the raw input is the weighted sum plus bias fed into the sigmoid, so the last factor is simply 1.
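
A worked sketch of that chain for a visit to Norm's (the raw input value 1.2 and the variable names are assumptions chosen for illustration; CE here is -log(P_Norm) because Norm was the action taken):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 1.2                  # raw sum (weight * input + bias) fed into the sigmoid
p_norm = sigmoid(x)      # probability of going to Norm's

d_ce_d_p = -1.0 / p_norm              # d(CE)/d(P_Norm) for CE = -log(P_Norm)
d_p_d_x = p_norm * (1.0 - p_norm)     # d(sigmoid output)/d(raw input)
d_x_d_bias = 1.0                      # d(weight * input + bias)/d(bias)

d_ce_d_bias = d_ce_d_p * d_p_d_x * d_x_d_bias   # chain rule product
```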

Simplification via sigmoid properties

The derivative simplifies elegantly: it reduces to P_Norm when visiting Squatch's (the terms cancel through 1/(1 - P_Norm) × P_Norm(1 - P_Norm)) and to -(1 - P_Norm) when visiting Norm's, connecting the gradient directly to the probability of the action that was not taken.
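
A quick numeric check of both simplified forms (self-contained; the raw input value is arbitrary and only serves the illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

p_norm = sigmoid(1.2)   # any raw input works here

# Visit to Squatch's: d(CE)/d(P_Norm) = 1/(1 - P_Norm), so the chained
# product 1/(1 - P_Norm) * P_Norm * (1 - P_Norm) collapses to P_Norm.
squatch_grad = (1.0 / (1.0 - p_norm)) * p_norm * (1.0 - p_norm)
assert np.isclose(squatch_grad, p_norm)

# Visit to Norm's: d(CE)/d(P_Norm) = -1/P_Norm, so the product
# collapses to -(1 - P_Norm).
norm_grad = (-1.0 / p_norm) * p_norm * (1.0 - p_norm)
assert np.isclose(norm_grad, -(1.0 - p_norm))
```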

⚖️ Reward-Driven Optimization

Reward as directional correction

Multiplying the gradient by the observed reward (+1 for correct guesses, -1 for incorrect) flips the update direction when actions yield negative outcomes, effectively correcting the mistaken assumption that the taken action was optimal.
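
A small sketch of that sign flip under a standard gradient-descent update (new bias = old bias - learning rate × reward × derivative); the 0.1 learning rate and -0.3 derivative are illustrative numbers only:

```python
learning_rate = 0.1
grad = -0.3   # e.g. -(1 - P_Norm) with P_Norm = 0.7 after visiting Norm's

for reward in (+1.0, -1.0):
    step = learning_rate * reward * grad
    # reward +1: step = -0.03, so subtracting it moves the bias up and
    #            pushes P_Norm toward 1 (the choice is reinforced).
    # reward -1: step = +0.03, so the bias moves down and P_Norm falls
    #            (the mistaken assumption is corrected).
    print(reward, step)
```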

Iterative bias convergence

Repeatedly sampling actions, computing reward-weighted derivatives, and updating the bias via gradient descent causes the network to converge—after many input examples—to a stable bias value (around -10 in the demo) that optimizes behavior across all states.
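
Putting the pieces together, a minimal single-parameter sketch of that loop (the environment rule, learning rate, and iteration count are made-up simplifications, not the video's exact demo):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
bias, learning_rate = 0.0, 0.1   # only the bias is trained in this sketch

for _ in range(5000):
    p_norm = sigmoid(bias)                    # current policy
    went_to_norms = rng.random() < p_norm     # sample an action

    # Hypothetical environment: Squatch's is always the rewarding choice,
    # so going to Norm's earns -1 and going to Squatch's earns +1.
    reward = -1.0 if went_to_norms else +1.0

    # Simplified reward-weighted derivative from the sections above.
    grad = -(1.0 - p_norm) if went_to_norms else p_norm
    bias -= learning_rate * reward * grad     # gradient-descent update

print(bias)  # drifts to a strongly negative value, so P_Norm approaches 0
             # and the agent almost always chooses Squatch's
```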

Bottom Line

Multiply every policy gradient by the environment's observed reward signal—positive rewards reinforce the current update direction while negative rewards reverse it, converting trial-and-error experience into precise parameter updates without ground-truth labels.
