Markov Decision Process


Model: mathematical models of the dynamics and reward
Policy: function mapping the agent's states to actions
Value function: future rewards from being in a state and/or taking an action when following a particular policy

Markov Processes

graph LR
    World -->|State| Agent
    World -->|Reward| Agent
    Agent -->|Action| World

Markov Property

State $s_t$ is Markov if and only if:

$p(s_{t+1} \mid s_{t}, a_{t}) = p(s_{t+1} \mid h_{t}, a_{t})$

$t$ is the timestep
$a_{t}$ is the action taken at time $t$
$h_{t}$ is the history: the sequence of all states, actions, and rewards seen up to the current timestep

Explanation: the Markov property says that the future state depends only on the current state (and the action taken in it), not on the sequence of states and events that came before.

Put another way, the process is memoryless: states and actions before the current state have no effect on the future state.

Markov Process / Markov Chain

A Markov process is a stochastic process that satisfies the Markov property, i.e. a random process in which the future state depends only on the current state and not on the sequence of events that led to it.

A Markov chain is a Markov process that is discrete in time.

Definition

$S$ is a (finite) set of states ($s \in S$)
$P$ is the transition/dynamics model, $P = p(s_{t+1}=s' \mid s_{t}=s)$

A set of states + a dynamics model giving the probability of each next state given the current state + the Markov property = a Markov process.

NOTE: no rewards or actions are involved yet

Notation

Writing $p(a \mid b)$ means: given $b$, the probability of getting $a$ is $p(a \mid b)$. Likewise, $p(s_{t+1}=s' \mid s_{t}=s)$ is the probability of moving to state $s'$ at time $t+1$, given that the state at time $t$ is $s$.

Example and Transition Matrix

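Suppose we have a weather model with two states, rainy and sunny. A Markov chain models the weather as a random process like this:

If it is sunny today, tomorrow is sunny with probability 70% and rainy with probability 30%
If it is rainy today, tomorrow is sunny with probability 40% and rainy with probability 60%

We can write the transition matrix like this (rows are today's state, columns are tomorrow's, both in the order sunny, rainy):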

$P=\begin{pmatrix} 0.7 & 0.3\\ 0.4 & 0.6 \end{pmatrix}$
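The weather chain can be explored numerically. The sketch below assumes NumPy and the state ordering sunny = 0, rainy = 1; the helper names `distribution_after` and `sample_path` are made up for this illustration.

```python
import numpy as np

# Transition matrix from the weather example: row = today, column = tomorrow.
# Assumed ordering for this sketch: index 0 = sunny, index 1 = rainy.
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])

# Every row must be a probability distribution over next states.
assert np.allclose(P.sum(axis=1), 1.0)

def distribution_after(start, n_steps):
    """State distribution after n_steps, starting from state index `start`."""
    dist = np.zeros(P.shape[0])
    dist[start] = 1.0
    for _ in range(n_steps):
        dist = dist @ P  # one step of the chain
    return dist

def sample_path(start, n_steps, seed=0):
    """Sample one trajectory of state indices of length n_steps + 1."""
    rng = np.random.default_rng(seed)
    path = [start]
    for _ in range(n_steps):
        # The next state depends only on the current one (Markov property).
        path.append(int(rng.choice(P.shape[0], p=P[path[-1]])))
    return path

# Sunny today: chance of sunny / rainy two days from now.
two_days = distribution_after(0, 2)
print(two_days)  # [0.61 0.39]
```

Multiplying the distribution by $P$ once per day is all the dynamics model does; sampling a path just draws from the row of the current state.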

Markov Reward Processes (MRPs)

Markov Reward Process is a Markov Chain + rewards

Definition

$S$ is a (finite) set of states ($s \in S$)
$P$ is the transition/dynamics model, $P = p(s_{t+1}=s' \mid s_{t}=s)$
$R$ is a reward function, $R(s_{t}=s) = E[r_{t} \mid s_{t}=s]$ (the expected reward for being in state $s$)
$\gamma \in [0, 1]$ is the discount factor (it trades off immediate reward against future reward)

NOTE: no actions are involved yet

Expected Return

The goal in an MRP is to compute the expected return of a state $s$, given by

$V(s) = E\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_{t}) \mid s_{0}=s\right]$

This sum is awkward to compute directly, so we use another way to get the expected return: add the immediate reward to the discounted future reward.
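To see where that split comes from, one can peel the first term off the sum. This is a sketch that assumes the transition model does not change over time, so the value of the state reached at $t=1$ is again given by $V$:

```latex
\begin{aligned}
V(s) &= E\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_{t}) \mid s_{0}=s\right] \\
     &= R(s) + \gamma\, E\left[\sum_{t=1}^{\infty} \gamma^{t-1} R(s_{t}) \mid s_{0}=s\right] \\
     &= R(s) + \gamma\, E\left[V(s_{1}) \mid s_{0}=s\right]
\end{aligned}
```

The last step uses the Markov property: once $s_{1}$ is known, the rest of the sum no longer depends on $s_{0}$.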

Bellman Equation for MRP

We can compute the expected return recursively with the Bellman equation:

$V(s) = R(s) + \gamma \displaystyle\sum_{s'} P(s' \mid s) V(s')$

$R(s)$ is the immediate reward obtained from state $s$,
$\gamma$ is the discount factor,
$P(s' \mid s)$ is the probability of transitioning from state $s$ to state $s'$,
$V(s')$ is the value function of the next state $s'$.
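For a small MRP the Bellman equation can be solved directly, because in matrix form it reads $V = R + \gamma P V$. The sketch below reuses the weather transition matrix; the rewards (+1 for sunny, -1 for rainy) and $\gamma = 0.9$ are assumptions made up for this illustration, not values from the text.

```python
import numpy as np

# Weather chain (index 0 = sunny, 1 = rainy).
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
R = np.array([1.0, -1.0])  # hypothetical rewards: +1 sunny, -1 rainy
gamma = 0.9                # hypothetical discount factor (must be < 1 here)

# Closed form: V = R + gamma * P V  =>  (I - gamma * P) V = R
V_exact = np.linalg.solve(np.eye(len(R)) - gamma * P, R)

# Same answer by applying the Bellman equation repeatedly, starting from zero.
V_iter = np.zeros(len(R))
for _ in range(1000):
    V_iter = R + gamma * P @ V_iter

print(V_exact)
print(V_iter)
```

The linear solve is exact but costs roughly cubic time in the number of states, so the repeated update is usually the more practical choice once the state space grows.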

Markov Decision Processes (MDPs)

Markov Decision Process is Markov Reward Process + actions

Definition

$S$ is a (finite) set of states ($s \in S$)
$A$ is a (finite) set of actions ($a \in A$)
$P$ is the transition/dynamics model for each action, $P = p(s_{t+1}=s' \mid s_{t}=s, a_{t}=a)$
$R$ is a reward function, $R(s_{t}=s, a_{t}=a) = E[r_{t} \mid s_{t}=s, a_{t}=a]$ (the expected reward for taking action $a$ in state $s$)
$\gamma \in [0, 1]$ is the discount factor (it trades off immediate reward against future reward)

An MDP is a tuple: $(S, A, P, R, \gamma)$
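One way to hold a finite MDP in code is a pair of arrays plus the discount factor. The sketch below is only an illustration: the class name `MDP`, the array layout (`P[a, s, s2]`, `R[s, a]`), and the two-state, two-action numbers are all assumptions, and the sets $S$ and $A$ are represented implicitly by index ranges.

```python
from dataclasses import dataclass
import numpy as np

@dataclass(eq=False)
class MDP:
    P: np.ndarray   # shape (A, S, S): P[a, s, s2] = p(s2 | s, a)
    R: np.ndarray   # shape (S, A):    R[s, a] = expected reward for a in s
    gamma: float    # discount factor in [0, 1]

    def __post_init__(self):
        # Basic consistency checks on the tuple.
        if self.P.ndim != 3 or self.P.shape[1] != self.P.shape[2]:
            raise ValueError("P must have shape (A, S, S)")
        if self.R.shape != (self.P.shape[1], self.P.shape[0]):
            raise ValueError("R must have shape (S, A)")
        if not np.allclose(self.P.sum(axis=2), 1.0):
            raise ValueError("each P[a, s, :] must sum to 1")
        if not 0.0 <= self.gamma <= 1.0:
            raise ValueError("gamma must lie in [0, 1]")

    @property
    def n_states(self):
        return self.P.shape[1]

    @property
    def n_actions(self):
        return self.P.shape[0]

# Hypothetical MDP with 2 states and 2 actions.
mdp = MDP(
    P=np.array([[[0.7, 0.3], [0.4, 0.6]],    # transitions under action 0
                [[0.9, 0.1], [0.2, 0.8]]]),  # transitions under action 1
    R=np.array([[1.0, 0.0],
                [-1.0, 2.0]]),
    gamma=0.9,
)
print(mdp.n_states, mdp.n_actions)  # 2 2
```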

MDP Policy

A policy specifies which action to take in each state.

It can be deterministic (each state maps to one fixed action) or stochastic (the action is drawn at random according to a probability distribution).

It can be written as $\pi(a \mid s) = P(a_{t}=a \mid s_{t}=s)$

We can think of it this way:

MDP + $\pi(a \mid s)$ = Markov Reward Process (once the policy is fixed)
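Fixing the policy averages the actions out of the model, which is what turns the MDP into an MRP: $P^\pi(s' \mid s) = \sum_a \pi(a \mid s)\, p(s' \mid s, a)$ and $R^\pi(s) = \sum_a \pi(a \mid s)\, R(s, a)$. The sketch below shows this with NumPy; the array layout and all numbers are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical MDP with 2 states and 2 actions.
# P[a, s, s2] = p(s2 | s, a), R[s, a] = expected reward for action a in state s.
P = np.array([[[0.7, 0.3], [0.4, 0.6]],
              [[0.9, 0.1], [0.2, 0.8]]])
R = np.array([[1.0, 0.0],
              [-1.0, 2.0]])

# Stochastic policy: pi[s, a] = probability of taking action a in state s.
pi = np.array([[0.5, 0.5],   # state 0: pick either action with equal chance
               [1.0, 0.0]])  # state 1: always action 0 (deterministic)

# Average the per-action dynamics and rewards under the policy.
P_pi = np.einsum("sa,ast->st", pi, P)  # P_pi[s, t] = sum_a pi[s, a] * P[a, s, t]
R_pi = (pi * R).sum(axis=1)            # R_pi[s]    = sum_a pi[s, a] * R[s, a]

print(P_pi)  # [[0.8 0.2], [0.4 0.6]]
print(R_pi)  # [ 0.5 -1. ]
```

The pair `(P_pi, R_pi)` together with $\gamma$ is an ordinary MRP, so anything that works for MRPs (such as the Bellman equation) applies to it unchanged.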

MDP Policy Evaluation

$V^\pi_{k}(s) = R(s, \pi(s)) + \gamma \displaystyle\sum_{s'\in S} p(s' \mid s, \pi(s)) V^\pi_{k-1}(s')$

We want to compute the value of each state under a given policy $\pi$, that is, the expected return we get if we start in state $s$ and follow $\pi$.

Here we compute it iteratively: start from an initial value function and apply the update above repeatedly.

where:

$V^\pi_{k}(s)$ is the value of state $s$ at iteration $k$ under policy $\pi$
$R(s, \pi(s))$ is the immediate reward for taking the policy's action in state $s$
$p(s' \mid s, \pi(s))$ is the probability of moving to state $s'$ after taking the policy's action in state $s$
$\gamma$ is the discount factor that shrinks future rewards (e.g. with $\gamma = 0.9$, a reward one step in the future counts for 90% of the same reward received now)

Policy Evaluation Procedure

Initialize $V_0(s)$ for every $s \in S$ (e.g. start from zero)
Update $V_k(s)$ with the formula above until the values stop changing or change very little (convergence)
The result is $V^\pi(s)$, which tells us the expected return of state $s$ when following policy $\pi$
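The three steps above can be sketched as a short loop. This is a minimal illustration for a deterministic policy, assuming NumPy; the function name `evaluate_policy`, the array layout, the stopping tolerance, and the example numbers are all assumptions rather than part of the original notes.

```python
import numpy as np

def evaluate_policy(P, R, policy, gamma, tol=1e-8, max_iters=10_000):
    """Iterative policy evaluation for a deterministic policy.

    P[a, s, s2] = p(s2 | s, a), R[s, a] = expected reward,
    policy[s] = index of the action taken in state s.
    """
    n_states = R.shape[0]
    states = np.arange(n_states)
    P_pi = P[policy, states]   # row s is p(. | s, policy[s])
    R_pi = R[states, policy]   # R(s, policy[s])

    V = np.zeros(n_states)     # step 1: V_0(s) = 0 for every state
    for _ in range(max_iters):
        V_next = R_pi + gamma * P_pi @ V          # step 2: one update of V_k
        if np.max(np.abs(V_next - V)) < tol:      # stop once changes are tiny
            return V_next
        V = V_next
    return V                   # step 3: approximate V^pi

# Hypothetical MDP with 2 states and 2 actions.
P = np.array([[[0.7, 0.3], [0.4, 0.6]],
              [[0.9, 0.1], [0.2, 0.8]]])
R = np.array([[1.0, 0.0],
              [-1.0, 2.0]])
policy = np.array([1, 0])      # action 1 in state 0, action 0 in state 1

V_pi = evaluate_policy(P, R, policy, gamma=0.9)
print(V_pi)
```

With $\gamma < 1$ each update shrinks the error by a factor of $\gamma$, which is why the loop settles; with $\gamma = 1$ it may not converge, so the `max_iters` cap acts as a safety net.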
