Policy Gradients and Variance Reduction

Learning from Experience

Learning Objectives

From Last Time

Part 1: The RL Problem

The Agent-Environment Loop

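The loop can be sketched in a few lines. The `ToyEnv` class below is an illustrative stand-in, not any particular library's API: a 5-state corridor where the agent moves left or right and earns a reward at the right end.

```python
# Minimal agent-environment loop. ToyEnv and the random policy are
# illustrative assumptions, not a specific library's interface.
import random

class ToyEnv:
    """A 5-state corridor: move left/right; reward 1.0 at the right end."""
    def reset(self):
        self.pos = 0
        return self.pos  # initial state

    def step(self, action):          # action: 0 = left, 1 = right
        self.pos = max(0, min(4, self.pos + (1 if action == 1 else -1)))
        reward = 1.0 if self.pos == 4 else 0.0
        done = self.pos == 4
        return self.pos, reward, done

env = ToyEnv()
state, total_reward, done = env.reset(), 0.0, False
while not done:
    action = random.choice([0, 1])          # agent picks an action
    state, reward, done = env.step(action)  # environment responds
    total_reward += reward                  # agent observes a scalar reward
```

Note that the agent's actions determine which states it visits next, which is exactly the feedback loop absent from supervised learning.
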
Supervised Learning vs. Reinforcement Learning

| | Supervised Learning | Reinforcement Learning |
|---|---|---|
| Feedback | Correct label for each input | Scalar reward (no 'correct answer') |
| Data | i.i.d. samples | Sequential, correlated, agent-generated |
| Consequences | None — predictions are independent | Actions affect future states |
| Core challenge | Generalization | Credit assignment + exploration |
Why Value-Based Methods Don't Scale

The Policy-Based Insight

Part 2: The Policy Gradient Theorem

Interactive Demo: Policy Gradient

The RL Objective

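A standard statement of the objective: maximize the expected return over trajectories sampled from the policy (notation: trajectory $\tau$, discount $\gamma$, per-step reward $r_t$):

```latex
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \right]
          = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=0}^{T} \gamma^{t} r_t \right],
\qquad
\theta^\star = \arg\max_\theta J(\theta).
```
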
The Fundamental Problem

The Log-Derivative Trick

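The trick converts the gradient of an expectation into an expectation of a gradient, which samples can estimate:

```latex
\nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]
= \nabla_\theta \int \pi_\theta(\tau)\, R(\tau)\, d\tau
= \int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, R(\tau)\, d\tau
= \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(\tau)\, R(\tau) \right],
```

using the identity $\nabla_\theta \pi_\theta = \pi_\theta \, \nabla_\theta \log \pi_\theta$.
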
The Policy Gradient Theorem

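Applying the log-derivative trick and noting that actions cannot affect past rewards gives the standard form, with the return-to-go $G_t$ weighting each step's log-probability gradient:

```latex
\nabla_\theta J(\theta)
= \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
    \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t
  \right],
\qquad
G_t = \sum_{k=t}^{T} \gamma^{\,k-t} r_k .
```
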
Intuition: What Does It Mean?

The REINFORCE Algorithm

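REINFORCE can be sketched end-to-end on a toy problem. The two-armed bandit and its payouts below are made-up illustrative choices; the update rule itself (reward-weighted gradient of the log-probability) is the algorithm:

```python
# REINFORCE on a two-armed bandit: a minimal sketch (numpy only).
# Illustrative rewards: arm 1 pays 1.0, arm 0 pays 0.0.
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)          # policy logits, one per action
lr = 0.1

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

for episode in range(500):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)           # sample an action from pi_theta
    r = 1.0 if a == 1 else 0.0           # environment's scalar reward
    # grad of log pi(a) for a softmax policy: one_hot(a) - probs
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += lr * r * grad_log_pi        # REINFORCE: reward-weighted update
```

After training, `softmax(theta)[1]` is close to 1: the policy has shifted probability mass toward the rewarding arm using only scalar feedback.
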
The Problem: High Variance

Part 3: Variance Reduction

Interactive Demo: Advantage & Variance

Baselines: A Simple Fix

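Subtracting any state-dependent baseline $b(s_t)$ from the return leaves the gradient unbiased, because the extra term has zero mean under the policy:

```latex
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, b(s) \right]
= b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s)
= b(s)\, \nabla_\theta 1 = 0,
```

so $\nabla_\theta J(\theta) = \mathbb{E}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\bigl(G_t - b(s_t)\bigr)\right]$, with lower variance when $b(s_t) \approx \mathbb{E}[G_t \mid s_t]$.
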
The Advantage Function

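Choosing the state-value function as the baseline makes the weight on each log-probability the advantage: how much better action $a$ is than the policy's average behavior from state $s$:

```latex
A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s),
\qquad
V^{\pi}(s) = \mathbb{E}_{a \sim \pi}\!\left[ Q^{\pi}(s, a) \right],
\quad\text{so}\quad
\mathbb{E}_{a \sim \pi}\!\left[ A^{\pi}(s, a) \right] = 0 .
```
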
Actor-Critic Architecture

Generalized Advantage Estimation (GAE)

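GAE blends multi-step advantage estimates with exponentially decaying weights $\lambda$, computed as a backward pass over TD errors. A minimal sketch; the reward and value arrays are made-up illustrative numbers:

```python
# Generalized Advantage Estimation:
#   A_t = sum_l (gamma * lam)^l * delta_{t+l},
#   delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)   (the TD error).
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """values has one extra entry, V(s_T), for bootstrapping the last step."""
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):                             # backward pass
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Illustrative numbers, not from any real rollout:
rewards = np.array([0.0, 0.0, 1.0])
values  = np.array([0.5, 0.6, 0.7, 0.0])   # V(s_0..s_3); terminal V = 0

adv = compute_gae(rewards, values)          # lam = 0.95: blended estimate
adv_td = compute_gae(rewards, values, lam=0.0)  # lam = 0: one-step TD error
```

Setting `lam=0.0` recovers the one-step TD error (low variance, biased by the critic); `lam=1.0` recovers the Monte Carlo advantage (unbiased, high variance).
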
The λ Bias-Variance Tradeoff

Part 4: Mapping to LLMs

Actor-Critic for Language Models

RL Concepts → LLM Training

| RL Concept | LLM Equivalent |
|---|---|
| Policy $\pi_\theta(a \mid s)$ | Transformer softmax over vocabulary |
| Trajectory $\tau$ | One complete generated response |
| Return $G_t$ | Score assigned to the response |
| Advantage $\hat{A}_t$ | Was this token better/worse than expected? |
| Actor | Transformer (language model head) |
| Critic | Transformer (scalar value head) |
| Policy gradient | Reward-weighted cross-entropy loss |
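
The last row can be made concrete: the policy-gradient loss over a generated token sequence is cross-entropy weighted by per-token advantages. A minimal numpy sketch; the logits, token ids, and advantage values are made-up for illustration:

```python
# Policy gradient as advantage-weighted cross-entropy over generated tokens.
# All numbers below are made-up illustrative values.
import numpy as np

def log_softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

T = 3                                            # tokens in the response
logits = np.array([[2.0, 0.1, 0.1, 0.1, 0.1],    # per-step vocab logits
                   [0.1, 2.0, 0.1, 0.1, 0.1],
                   [0.1, 0.1, 2.0, 0.1, 0.1]])
tokens = np.array([0, 1, 2])                     # sampled token ids
advantages = np.array([0.5, -0.2, 1.0])          # per-token advantages

log_probs = log_softmax(logits)[np.arange(T), tokens]
# Minimizing this loss performs gradient ascent on the RL objective:
pg_loss = -(advantages * log_probs).mean()
```

Tokens with positive advantage have their log-probability pushed up, tokens with negative advantage pushed down; with all advantages equal to 1 this reduces to ordinary cross-entropy on the sampled sequence.
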
Summary

All Interactive Demos

Lecture Summary

Supplementary Resources