Watch the Bellman update propagate values through a 3×4 grid world
V(s) ← max_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ·V(s')]

For each non-terminal state s:
    For each action a ∈ {↑,↓,←,→}:
        Q(s,a) = Σ_{s'} P(s'|s,a) [R(s,a,s') + γ·V(s')]
    V(s) ← max_a Q(s,a)
    π(s) ← argmax_a Q(s,a)
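The sweep above can be sketched in Python as follows. This is a minimal illustration, not the demo's own code. The grid layout (a wall at row 1, column 1; terminals worth +1 and -1 in the last column), the step reward of -0.04, γ = 0.9, and the 0.8/0.1/0.1 slip probabilities are all assumptions borrowed from the common textbook version of this grid world. The helper names (`move`, `transitions`, `reward`, `q_value`, `value_iteration`) are hypothetical.

```python
# Value iteration on a 3x4 grid world.
# Every constant below is an illustrative assumption, not taken from the demo.
ROWS, COLS = 3, 4
WALL = {(1, 1)}                            # blocked cell (assumed)
TERMINALS = {(0, 3): 1.0, (1, 3): -1.0}    # reward received on entering (assumed)
STEP_REWARD = -0.04                        # reward for entering any other cell (assumed)
GAMMA = 0.9                                # discount factor γ (assumed)

ACTIONS = {"↑": (-1, 0), "↓": (1, 0), "←": (0, -1), "→": (0, 1)}
# The agent slips to one of the two perpendicular directions with probability 0.1 each.
PERPENDICULAR = {"↑": ("←", "→"), "↓": ("←", "→"), "←": ("↑", "↓"), "→": ("↑", "↓")}


def move(s, a):
    """Deterministic result of heading in direction a; bumping into an edge or wall stays put."""
    r, c = s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1]
    if 0 <= r < ROWS and 0 <= c < COLS and (r, c) not in WALL:
        return (r, c)
    return s


def transitions(s, a):
    """Return [(P(s'|s,a), s'), ...] for taking action a in state s."""
    side1, side2 = PERPENDICULAR[a]
    return [(0.8, move(s, a)), (0.1, move(s, side1)), (0.1, move(s, side2))]


def reward(s_next):
    """R(s,a,s'), which here depends only on the state entered."""
    return TERMINALS.get(s_next, STEP_REWARD)


def q_value(V, s, a):
    """Q(s,a) = Σ_{s'} P(s'|s,a) [R(s,a,s') + γ·V(s')]."""
    return sum(p * (reward(s2) + GAMMA * V[s2]) for p, s2 in transitions(s, a))


def value_iteration(tol=1e-6, max_sweeps=1000):
    """Repeat the Bellman sweep until the largest change in V falls below tol."""
    states = [(r, c) for r in range(ROWS) for c in range(COLS) if (r, c) not in WALL]
    V = {s: 0.0 for s in states}   # terminal states keep V = 0 throughout
    policy = {}
    for _ in range(max_sweeps):
        delta = 0.0
        for s in states:
            if s in TERMINALS:
                continue
            q = {a: q_value(V, s, a) for a in ACTIONS}
            best = max(q, key=q.get)             # π(s) ← argmax_a Q(s,a)
            delta = max(delta, abs(q[best] - V[s]))
            V[s] = q[best]                       # V(s) ← max_a Q(s,a), updated in place
            policy[s] = best
        if delta < tol:
            break
    return V, policy


if __name__ == "__main__":
    V, policy = value_iteration()
    for r in range(ROWS):
        print("  ".join(
            " wall " if (r, c) in WALL else f"{V[(r, c)]:+.3f}" for c in range(COLS)
        ))
```

Because V is updated in place, each state's new value is visible to the states that follow it in the same sweep. Values therefore spread outward from the terminal cells a little faster than they would if every sweep worked from a frozen copy of V. Both variants converge to the same fixed point for γ < 1.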