πŸ–¨οΈ Printing Instructions: Press Ctrl/Cmd + P and select "Save as PDF".

Search, Games, and MDPs

From Optimal Paths to Sequential Decisions


Learning Objectives


The Big Picture


Part 1: Informed Search β€” A*


Search as Problem Solving


From Uninformed to Informed Search


A* Evaluation Function


Admissibility and Consistency


A* Search Algorithm

function AStar(start, goal, h)
  OPEN ← priority queue containing start, with g(start) ← 0
  CLOSED ← empty set
  g(n) ← ∞ for every node n other than start

  while OPEN is not empty do
    n ← node in OPEN with lowest f(n) = g(n) + h(n)
    if n = goal then
      return reconstruct_path(n)  // follow parent pointers back to start
    move n from OPEN to CLOSED
    for each successor m of n do
      if m ∈ CLOSED then continue  // safe to skip when h is consistent
      tentative_g ← g(n) + cost(n, m)
      if tentative_g < g(m) then   // found a cheaper route to m
        g(m) ← tentative_g
        parent(m) ← n
        add or update m in OPEN
  return failure  // OPEN exhausted: no path exists
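The same algorithm can be sketched in Python. The graph encoding (a function yielding (successor, cost) pairs) and the small example graph are illustrative choices, not part of the slides.

```python
import heapq

def a_star(start, goal, neighbors, h):
    """A* search. `neighbors(n)` yields (successor, cost) pairs and
    `h(n)` is an admissible heuristic estimate of cost-to-goal."""
    open_heap = [(h(start), start)]           # entries are (f, node)
    g = {start: 0}                            # best known cost from start
    parent = {start: None}
    closed = set()
    while open_heap:
        _, n = heapq.heappop(open_heap)
        if n in closed:                       # stale queue entry; skip it
            continue
        if n == goal:                         # walk parent pointers back
            path = []
            while n is not None:
                path.append(n)
                n = parent[n]
            return path[::-1]
        closed.add(n)
        for m, cost in neighbors(n):
            tentative_g = g[n] + cost
            if tentative_g < g.get(m, float("inf")):  # cheaper route to m
                g[m] = tentative_g
                parent[m] = n
                heapq.heappush(open_heap, (tentative_g + h(m), m))
    return None                               # no path exists

# Hypothetical 4-node graph; with h ≑ 0, A* reduces to Dijkstra's algorithm.
graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 1), ("D", 5)], "C": [("D", 1)], "D": []}
```

With the zero heuristic, `a_star("A", "D", lambda n: graph[n], lambda n: 0)` traces the cheapest path A β†’ B β†’ C β†’ D (total cost 3). Stale heap entries are tolerated rather than updated in place, which keeps the code short at the cost of a few extra pops.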

A* Worked Example


Interactive Demo: A* Algorithm


A* and LLM Decoding


Part 2: Game Playing β€” Adversarial Search


Adversarial Search: The Challenge


Game Trees


Minimax Algorithm


Minimax Worked Example


Depth-Limited Minimax


Alpha-Beta Pruning: Intuition


Alpha-Beta Pruning: Mechanics
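The pruning mechanics can be sketched in Python. The nested-list tree encoding (leaves are numeric utilities) and the sample tree are illustrative assumptions, not the slides' notation.

```python
def alphabeta(node, alpha, beta, maximizing):
    """Minimax with alpha-beta pruning over a tree given as nested lists.
    alpha = best value MAX can guarantee so far; beta = best for MIN."""
    if not isinstance(node, list):            # leaf: return its utility
        return node
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:                 # beta cutoff: MIN would never allow this
                break
        return value
    else:
        value = float("inf")
        for child in node:
            value = min(value, alphabeta(child, alpha, beta, True))
            beta = min(beta, value)
            if alpha >= beta:                 # alpha cutoff: MAX has a better option
                break
        return value

# Classic textbook tree (MAX root, three MIN children): minimax value is 3.
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
```

On this tree, after the first MIN child resolves to 3, the second child is abandoned as soon as the leaf 2 is seen, since MIN would drive that branch to 2 or less.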


Alpha-Beta Worked Example


The Limits of Minimax + Alpha-Beta


Monte Carlo Tree Search (MCTS): Overview


MCTS Phase 1–2: Selection and Expansion


MCTS Phase 3–4: Simulation and Backpropagation


UCB1: Exploration vs. Exploitation
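The UCB1 rule used in MCTS selection scores each child by its empirical mean value (exploitation) plus a bonus that shrinks with visits (exploration). A minimal sketch, assuming a (value_sum, visits) bookkeeping per child and the conventional exploration constant c β‰ˆ √2:

```python
import math

def ucb1(value_sum, visits, parent_visits, c=1.414):
    """UCB1 score: mean value plus an exploration bonus that grows
    for rarely visited children relative to their parent."""
    if visits == 0:
        return float("inf")                   # always try unvisited children first
    exploit = value_sum / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore

def select_child(children, parent_visits):
    """Index of the UCB1-maximizing child; children are (value_sum, visits) pairs."""
    return max(range(len(children)),
               key=lambda i: ucb1(children[i][0], children[i][1], parent_visits))
```

Note how a child with fewer visits can outrank one with a higher mean: with a parent visited 14 times, a child at 6/4 beats a child at 5/10 because its exploration bonus is larger.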


MCTS Worked Example


Interactive Demo: MCTS


From MCTS to AlphaGo


Part 3: From Search to Learning β€” MDPs


The Pivot: Why We Need RL


Search/Planning vs. Reinforcement Learning

                  Classical Search             Reinforcement Learning
Model of world    Known exactly                Unknown; learned from interaction
Outcomes          Deterministic (typically)    Stochastic
Objective         Reach a goal state           Maximize cumulative reward
Approach          Compute an optimal plan      Learn optimal behavior from experience

The MDP Formalism


Markov Property and Policies


The Key Connection to LLMs


Return and Value Functions


The Bellman Equation
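The Bellman optimality equation can be turned into an algorithm by applying its update repeatedly until the values stop changing (value iteration). A minimal sketch over a hypothetical two-state MDP; the state/action encoding and the example dynamics are illustrative assumptions.

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-8):
    """Iterate the Bellman optimality update
      V(s) ← max_a Ξ£_{s'} P(s'|s,a) [R(s,a,s') + Ξ³ V(s')]
    until the largest change in any state's value falls below tol."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in P(s, a).items())
                for a in actions(s)
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

# Hypothetical 2-state MDP: "go" moves to state 1; leaving state 0 pays reward 1.
states = [0, 1]
actions = lambda s: ["stay", "go"]
P = lambda s, a: {1: 1.0} if a == "go" else {s: 1.0}
R = lambda s, a, s2: 1.0 if (s == 0 and s2 == 1) else 0.0
```

Here the values converge to V(0) = 1 and V(1) = 0: state 1 earns nothing forever, so the only value in state 0 is the one-time reward for leaving it.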


Interactive Demo: Bellman Value Iteration


LLM as an MDP

MDP Concept                LLM Equivalent
State $s$                  Prompt + tokens generated so far
Action $a$                 Next token from the vocabulary (|V| β‰ˆ 100K)
Policy $\pi_\theta(a|s)$   Transformer's softmax output distribution
Transition $P(s'|s,a)$     Deterministic: append the chosen token to the context
Reward $R$                 Score at end of generation (human, reward model, or verifier)
Episode                    One complete generated response
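The mapping can be made concrete with a toy rollout loop. The four-token vocabulary, the uniform stand-in "policy", and the length-based terminal reward below are placeholders for a real transformer and reward model, chosen only to show the state/action/transition/reward roles.

```python
import random

random.seed(0)
VOCAB = ["the", "cat", "sat", "<eos>"]        # toy vocabulary (a real |V| β‰ˆ 100K)

def policy(state):
    """Stand-in for Ο€_ΞΈ(a|s): a distribution over the next token given the
    context. Uniform here; in an LLM this is the softmax over logits."""
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}

def step(state, action):
    """Transition P(s'|s,a) is deterministic: append the chosen token."""
    return state + [action]

def rollout(prompt, max_len=10):
    """One episode: sample tokens until <eos> (or max_len), then assign a
    single terminal reward to the whole response."""
    state = list(prompt)
    while len(state) < max_len:
        probs = policy(state)
        action = random.choices(list(probs), weights=list(probs.values()))[0]
        state = step(state, action)
        if action == "<eos>":
            break
    reward = float(len(state))                # placeholder terminal score
    return state, reward
```

The sparse, episode-level reward is the key structural point: nothing scores individual tokens, which is exactly the credit-assignment problem RL methods for LLMs must solve.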

Summary


All Interactive Demos


Lecture Summary


Supplementary Resources