Reinforcement Learning with Neural Networks: Mathematical Details

| AI & Machine Learning | April 14, 2025 | 26.8K views | 25:01

TL;DR

This video provides a step-by-step mathematical walkthrough of policy gradient reinforcement learning, demonstrating how to derive gradients via the chain rule and use binary reward signals (+1/-1) to correct update directions when training neural networks without labeled data.

🎲 Policy Gradient Fundamentals

Stochastic action selection

The neural network outputs probabilities for each possible action (e.g., P_Norm vs. P_Squatch), and the agent randomly samples from this distribution rather than picking the highest probability, enabling exploration of the action space.
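
A minimal sketch of that sampling step (the 0.7 probability is hard-coded purely for illustration; in the video these probabilities come out of the network):

```python
import numpy as np

# Suppose the network's output layer assigns these probabilities.
p_norm = 0.7
p_squatch = 1.0 - p_norm

# Sample the action from the distribution instead of taking the argmax,
# so the lower-probability action is still explored some of the time.
action = np.random.choice(["Norm", "Squatch"], p=[p_norm, p_squatch])
```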

Cross-entropy on taken actions

Since ground-truth labels are unavailable, the loss function calculates cross-entropy between the probability of the action actually selected and the ideal probability of 1.0, quantifying how far the network is from certainty.
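
In code, treating the taken action's ideal probability as 1.0 collapses the cross-entropy to a single log term; a sketch, with p_taken standing in for whichever of P_Norm or P_Squatch was actually sampled:

```python
import numpy as np

def cross_entropy_taken(p_taken):
    # The "label" for the taken action is 1.0 and 0.0 for the alternative,
    # so only the -log(p_taken) term survives in the cross-entropy sum.
    return -np.log(p_taken)

cross_entropy_taken(0.99)  # ~0.01: nearly certain, tiny loss
cross_entropy_taken(0.50)  # ~0.69: a coin flip, much larger loss
```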

🧮 Chain Rule Mathematics

Three-component gradient decomposition

The derivative of the cross-entropy with respect to the bias requires chaining through three terms: d(CE)/d(sigmoid output) × d(sigmoid output)/d(raw input) × d(raw input)/d(bias), where the raw input is the weighted sum plus bias fed into the sigmoid, so the last factor is simply 1.
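
A worked sketch of that chain for a visit to Norm's (the raw input value 1.2 and the variable names are assumptions chosen for illustration; CE here is -log(P_Norm) because Norm was the action taken):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 1.2                  # raw sum (weight * input + bias) fed into the sigmoid
p_norm = sigmoid(x)      # probability of going to Norm's

d_ce_d_p = -1.0 / p_norm              # d(CE)/d(P_Norm) for CE = -log(P_Norm)
d_p_d_x = p_norm * (1.0 - p_norm)     # d(sigmoid output)/d(raw input)
d_x_d_bias = 1.0                      # d(weight * input + bias)/d(bias)

d_ce_d_bias = d_ce_d_p * d_p_d_x * d_x_d_bias   # chain rule product
```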

Simplification via sigmoid properties

The derivative simplifies elegantly: it reduces to P_Norm when visiting Squatch's (the terms cancel through 1/(1 - P_Norm) × P_Norm(1 - P_Norm)) and to -(1 - P_Norm) when visiting Norm's, connecting the gradient directly to the probability of the action that was not taken.
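
A quick numeric check of both simplified forms (self-contained; the raw input value is arbitrary and only serves the illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

p_norm = sigmoid(1.2)   # any raw input works here

# Visit to Squatch's: d(CE)/d(P_Norm) = 1/(1 - P_Norm), so the chained
# product 1/(1 - P_Norm) * P_Norm * (1 - P_Norm) collapses to P_Norm.
squatch_grad = (1.0 / (1.0 - p_norm)) * p_norm * (1.0 - p_norm)
assert np.isclose(squatch_grad, p_norm)

# Visit to Norm's: d(CE)/d(P_Norm) = -1/P_Norm, so the product
# collapses to -(1 - P_Norm).
norm_grad = (-1.0 / p_norm) * p_norm * (1.0 - p_norm)
assert np.isclose(norm_grad, -(1.0 - p_norm))
```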

⚖️ Reward-Driven Optimization

Reward as directional correction

Multiplying the gradient by the observed reward (+1 for correct guesses, -1 for incorrect) flips the update direction when actions yield negative outcomes, effectively correcting the mistaken assumption that the taken action was optimal.
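
A small sketch of that sign flip under a standard gradient-descent update (new bias = old bias - learning rate × reward × derivative); the 0.1 learning rate and -0.3 derivative are illustrative numbers only:

```python
learning_rate = 0.1
grad = -0.3   # e.g. -(1 - P_Norm) with P_Norm = 0.7 after visiting Norm's

for reward in (+1.0, -1.0):
    step = learning_rate * reward * grad
    # reward +1: step = -0.03, so subtracting it moves the bias up and
    #            pushes P_Norm toward 1 (the choice is reinforced).
    # reward -1: step = +0.03, so the bias moves down and P_Norm falls
    #            (the mistaken assumption is corrected).
    print(reward, step)
```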

Iterative bias convergence

Repeatedly sampling actions, computing reward-weighted derivatives, and updating the bias via gradient descent causes the network to converge—after many input examples—to a stable bias value (around -10 in the demo) that optimizes behavior across all states.
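
Putting the pieces together, a minimal single-parameter sketch of that loop (the environment rule, learning rate, and iteration count are made-up simplifications, not the video's exact demo):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
bias, learning_rate = 0.0, 0.1   # only the bias is trained in this sketch

for _ in range(5000):
    p_norm = sigmoid(bias)                    # current policy
    went_to_norms = rng.random() < p_norm     # sample an action

    # Hypothetical environment: Squatch's is always the rewarding choice,
    # so going to Norm's earns -1 and going to Squatch's earns +1.
    reward = -1.0 if went_to_norms else +1.0

    # Simplified reward-weighted derivative from the sections above.
    grad = -(1.0 - p_norm) if went_to_norms else p_norm
    bias -= learning_rate * reward * grad     # gradient-descent update

print(bias)  # drifts to a strongly negative value, so P_Norm approaches 0
             # and the agent almost always chooses Squatch's
```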

Bottom Line

Multiply every policy gradient by the environment's observed reward signal—positive rewards reinforce the current update direction while negative rewards reverse it, converting trial-and-error experience into precise parameter updates without ground-truth labels.
