Markov Decision Process

Model: mathematical models of the dynamics and reward
Policy: function mapping the agent’s states to actions
Value function: expected future rewards from being in a state and/or taking an action while following a particular policy

Markov Processes

graph LR
    World -->|State| Agent
    World -->|Reward| Agent
    Agent -->|Action| World

Markov Property

State $s_{t}$ is Markov if and only if:

$p (s_{t + 1} ∣ s_{t}, a_{t}) = p (s_{t + 1} ∣ h_{t}, a_{t})$
$t$ is the timestep
$a_{t}$ is the action at time $t$
$h_{t}$ is the history: the sequence of all states, actions, and rewards observed up to the current time

Explanation: the Markov property says that the future state depends only on the current state (and the action taken in it), not on the sequence of states or events that came before the current state.

Put differently, the process is memoryless: states and actions that occurred before the current state have no effect on the future state once the current state is known.

Markov Process / Markov Chain

A Markov process is a stochastic process that satisfies the Markov property: a random process in which the future state depends only on the current state and not on the sequence of events that led to it.

A Markov chain is a Markov process that is discrete in time.

Definition

$S$ is a (finite) set of states ( $s \in S$ )
$P$ is the transition (dynamics) model, $P (s^{'} ∣ s) = p (s_{t + 1} = s^{'} ∣ s_{t} = s)$

A set of states + a dynamics model that gives the probability of moving to each next state given the current state + the Markov property = a Markov process

NOTE: no rewards or actions are involved at this point

Notation

Writing $p (a ∣ b)$ means: given $b$, the probability of getting $a$ is $p (a ∣ b)$. In the same way, $p (s_{t + 1} = s^{'} ∣ s_{t} = s)$ is the probability of moving to state $s^{'}$ at time $t + 1$ given that the state at time $t$ is $s$.

Example and Transition Matrix

Suppose we have a weather model with two states, rainy and sunny. A Markov chain models the weather as a random process like this:

If it is sunny today, tomorrow is sunny with probability 70% and rainy with probability 30%
If it is rainy today, tomorrow is sunny with probability 40% and rainy with probability 60%

We can write the transition matrix like this:

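$P = \begin{pmatrix} 0.7 & 0.4 \\ 0.3 & 0.6 \end{pmatrix}$

Here the columns are indexed by today's state and the rows by tomorrow's state, both in the order (sunny, rainy), so each column sums to 1.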
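
To make the example concrete, here is a minimal Python sketch of the weather chain. The function names `step_distribution` and `sample_chain` are illustrative choices, not part of the notes, and the transition probabilities are stored as a nested dictionary keyed by today's state and then tomorrow's state so that the row/column convention cannot be confused.

```python
import random

# Transition probabilities from the weather example:
# P[today][tomorrow] = probability of tomorrow's weather given today's.
P = {
    "sunny": {"sunny": 0.7, "rainy": 0.3},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}


def step_distribution(dist, P):
    """Push a probability distribution over today's weather one day forward."""
    nxt = {s: 0.0 for s in P}
    for today, p_today in dist.items():
        for tomorrow, p in P[today].items():
            nxt[tomorrow] += p_today * p
    return nxt


def sample_chain(start, n_steps, P, rng):
    """Sample a trajectory of n_steps transitions starting from `start`."""
    state, path = start, [start]
    for _ in range(n_steps):
        states = list(P[state])
        weights = [P[state][s] for s in states]
        # The next state is drawn using only the current state (Markov property).
        state = rng.choices(states, weights=weights, k=1)[0]
        path.append(state)
    return path


# Weather distribution one and two days after a sunny day.
day1 = step_distribution({"sunny": 1.0, "rainy": 0.0}, P)
day2 = step_distribution(day1, P)
print(day1)
print(day2)

# One sampled 5-day trajectory (seeded so that it is reproducible).
print(sample_chain("sunny", 5, P, random.Random(0)))
```

Two days after a sunny day, the chance of sun is 0.7 × 0.7 + 0.3 × 0.4 = 0.61, which is what `day2["sunny"]` holds up to floating-point rounding.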

Markov Reward Processes (MRPs)

A Markov reward process (MRP) is a Markov chain plus rewards

Definition

$S$ is a (finite) set of states ( $s \in S$ )
$P$ is the transition (dynamics) model, $P (s^{'} ∣ s) = p (s_{t + 1} = s^{'} ∣ s_{t} = s)$
$R$ is a reward function, $R (s_{t} = s) = E [r_{t} ∣ s_{t} = s]$ (the expected reward for being in state $s$)
$γ \in [0, 1]$ is the discount factor: $γ = 0$ counts only the immediate reward, while $γ = 1$ weights future rewards as much as the immediate one

NOTE: no actions are involved at this point

Expected Return

For an MRP, the goal is to compute the expected return from a state $s$, given by the equation

$V (s) = E [\sum_{t = 0}^{\infty} γ^{t} R (s_{t}) ∣ s_{0} = s]$
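
As a small illustration of the quantity inside the expectation, the sketch below computes the discounted return of one finite reward sequence. The function name `discounted_return` and the truncation to a finite list are assumptions made for the example; $V (s)$ itself is the average of this quantity over all trajectories that start in $s$.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t] for one finite reward sequence."""
    total = 0.0
    # Work backwards through the sequence: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], 0.5))
```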

Evaluating this expectation directly is cumbersome, so we compute the expected return another way: as the immediate reward plus the discounted future reward.
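
A short sketch of why this split is valid, assuming the transition model and reward function do not change over time (an assumption these notes do not state explicitly). The $t = 0$ term is peeled off the sum, the remainder is conditioned on the next state using the Markov property, and the inner expectation is recognized as the value of that next state:

```latex
\begin{aligned}
V(s) &= E\Big[\sum_{t=0}^{\infty} \gamma^{t} R(s_{t}) \,\Big|\, s_{0} = s\Big] \\
     &= R(s) + \gamma\, E\Big[\sum_{t=0}^{\infty} \gamma^{t} R(s_{t+1}) \,\Big|\, s_{0} = s\Big] \\
     &= R(s) + \gamma \sum_{s' \in S} P(s' \mid s)\,
        E\Big[\sum_{t=0}^{\infty} \gamma^{t} R(s_{t+1}) \,\Big|\, s_{1} = s'\Big] \\
     &= R(s) + \gamma \sum_{s' \in S} P(s' \mid s)\, V(s')
\end{aligned}
```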

Bellman Equation for MRP

We can compute the expected return recursively with the Bellman equation

$V (s) = R (s) + γ \sum_{s^{'} \in S} P (s^{'} ∣ s) V (s^{'})$

$R (s)$ is the immediate reward obtained in state $s$,
$γ$ is the discount factor,
$P (s^{'} ∣ s)$ is the probability of transitioning from state $s$ to state $s^{'}$,
$V (s^{'})$ is the value of the next state $s^{'}$.
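
The recursion can be turned into a simple iterative procedure: start from $V = 0$ and apply the right-hand side of the Bellman equation repeatedly until the values stop changing. The sketch below does this for the two-state weather chain. The rewards (+1 for sunny, -1 for rainy), the discount factor 0.9, and the name `evaluate_mrp` are hypothetical choices for illustration, not values from these notes, and the loop is only guaranteed to converge when the discount factor is below 1.

```python
# Weather chain from the example; P[s][s2] = P(s2 | s).
P = {
    "sunny": {"sunny": 0.7, "rainy": 0.3},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}
# Hypothetical rewards (not from the notes): +1 on a sunny day, -1 on a rainy day.
R = {"sunny": 1.0, "rainy": -1.0}
GAMMA = 0.9  # assumed discount factor, must be < 1 for guaranteed convergence


def evaluate_mrp(P, R, gamma, tol=1e-10, max_iters=10_000):
    """Repeat the Bellman backup V(s) <- R(s) + gamma * sum_s' P(s'|s) V(s')."""
    V = {s: 0.0 for s in P}
    for _ in range(max_iters):
        new_V = {
            s: R[s] + gamma * sum(p * V[s2] for s2, p in P[s].items())
            for s in P
        }
        # Stop once no state's value changed by more than the tolerance.
        if max(abs(new_V[s] - V[s]) for s in P) < tol:
            return new_V
        V = new_V
    return V


V = evaluate_mrp(P, R, GAMMA)
print(V)
```

With these assumed rewards, the sunny state ends up with the higher value, since it both pays more immediately and is more likely to be followed by another sunny day.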