```
repeat forever:
    Agent observes state s_t
    Agent samples action a_t ~ π_θ(·|s_t)
    Environment returns reward r_t
    Environment transitions to s_{t+1}
```

| | Supervised Learning | Reinforcement Learning |
|---|---|---|
| Feedback | Correct label for each input | Scalar reward (no 'correct answer') |
| Data | i.i.d. samples | Sequential, correlated, agent-generated |
| Consequences | None: predictions are independent | Actions affect future states |
| Core challenge | Generalization | Credit assignment + exploration |
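The interaction loop above can be sketched in plain Python. The two-armed bandit environment and the uniform-random placeholder policy here are illustrative assumptions, not part of the original text:

```python
import random

random.seed(0)

def bandit_step(action):
    """Toy environment: arm 1 pays off more often than arm 0."""
    payoff = [0.3, 0.7][action]
    return 1.0 if random.random() < payoff else 0.0

def random_policy(state):
    """Placeholder policy: samples an action uniformly, ignoring state."""
    return random.randrange(2)

state = 0  # a bandit has a single (dummy) state
total_reward = 0.0
for t in range(1000):
    action = random_policy(state)   # a_t ~ π(·|s_t)
    reward = bandit_step(action)    # environment returns r_t
    total_reward += reward          # s_{t+1} equals s_t in a bandit

print(total_reward / 1000)  # roughly 0.5: a uniform policy averages the two payoff rates
```

Even this degenerate environment shows the key difference from supervised learning: the data the agent sees depends on the actions it takes.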
```
for each training iteration:
    # 1. Sample trajectory using current policy
    trajectory = rollout(π_θ)

    # 2. Compute returns for each timestep
    G = []
    running_return = 0
    for t in reversed(range(len(trajectory))):
        running_return = r[t] + γ * running_return
        G.insert(0, running_return)

    # 3. Policy gradient update
    loss = -sum(log π_θ(a_t|s_t) * G[t] for t)
    loss.backward()
    optimizer.step()
```

| RL Concept | LLM Equivalent |
|---|---|
| Policy $\pi_\theta(a|s)$ | Transformer softmax over vocabulary |
| Trajectory $\tau$ | One complete generated response |
| Return $G_t$ | Score assigned to the response |
| Advantage $\hat{A}_t$ | Was this token better/worse than expected? |
| Actor | Transformer (language model head) |
| Critic | Transformer (scalar value head) |
| Policy gradient | Reward-weighted cross-entropy loss |
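To make the REINFORCE pseudocode above concrete, here is a minimal self-contained sketch on a two-armed bandit, with the softmax policy, the backward return computation, and the gradient step written out by hand. The environment, horizon, and hyperparameters are illustrative assumptions; rewards are kept deterministic for clarity, and the update uses the standard softmax identity ∇_θk log π(a) = 1[k=a] − π(k):

```python
import math
import random

random.seed(0)

GAMMA = 0.99           # discount factor (illustrative choice)
LR = 0.1               # learning rate (illustrative choice)
HORIZON = 5            # pulls per trajectory
PAYOFF = [0.3, 0.7]    # arm 1 is better in expectation

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

theta = [0.0, 0.0]  # policy parameters: one logit per arm

for iteration in range(500):
    # 1. Sample a trajectory using the current policy
    probs_seq, actions, rewards = [], [], []
    for t in range(HORIZON):
        probs = softmax(theta)
        a = 0 if random.random() < probs[0] else 1
        probs_seq.append(probs)
        actions.append(a)
        rewards.append(PAYOFF[a])  # deterministic reward for clarity

    # 2. Compute discounted returns backwards, as in the pseudocode
    G = []
    running_return = 0.0
    for t in reversed(range(HORIZON)):
        running_return = rewards[t] + GAMMA * running_return
        G.insert(0, running_return)

    # 3. Manual policy-gradient step: d/dθ_k log π(a) = 1[k=a] - π(k)
    for t in range(HORIZON):
        for k in range(2):
            grad_log = (1.0 if k == actions[t] else 0.0) - probs_seq[t][k]
            theta[k] += LR * grad_log * G[t]

print(softmax(theta))  # probability mass should concentrate on arm 1
```

The same loss, with tokens as actions and a sequence-level score as the return, is exactly the reward-weighted cross-entropy in the last row of the table above.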