Problem description:
A gambler plays a game whose outcome is decided by flips of a coin. At the start of each round he must stake part of his capital. If the coin comes up heads, he wins as much as he staked (the stake is doubled); if it comes up tails, he loses the stake. The game ends when the gambler reaches his goal of $100 or runs out of money.
Problem abstraction
The problem can be modeled as a discrete Markov decision process:
- $s$: state, the gambler's current capital, $s \in \mathcal{S} = \{1, 2, \dots, 99\}$ (capital 0 and the goal of 100 are terminal)
- $a$: action, the amount staked on a round, $a \in \mathcal{A}(s) = \{1, 2, \dots, \min(s, 100 - s)\}$
- $r$: reward, 1 only when the goal is reached, 0 otherwise, $r \in \mathcal{R} = \{0, 1\}$
- $\gamma$: discount factor, $\gamma = 1$
- $p_h$: probability that the coin comes up heads. (The one-step backup these quantities define is sketched just below.)
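Both methods that follow repeatedly back up the same quantity: the expected return of staking $a$ dollars with capital $s$. A minimal sketch, assuming the same convention as the code below (where $V(100)$ is pinned to 1 so the terminal value carries the +1 reward); `one_step_return` is a hypothetical helper, not part of the original code:

```python
import numpy as np

# One-step backup for the gambler's problem, under the convention used in the
# classes below: V[goal] is fixed to 1.0, so the +1 reward for reaching the
# goal is carried by the terminal state's value and the explicit reward is 0.
def one_step_return(s, a, V, ph=0.4, gamma=1.0):
    # heads: capital becomes s + a; tails: capital becomes s - a
    return ph * gamma * V[s + a] + (1 - ph) * gamma * V[s - a]

V = np.zeros(101)
V[100] = 1.0                        # terminal value stands in for the +1 reward
print(one_step_return(50, 50, V))   # betting everything at s = 50 -> 0.4
```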
Solution methods
Method 1: Policy Iteration
- Initialization:
  - $V(s) \in \mathbb{R}$ and $\pi(s) \in \mathcal{A}(s)$ arbitrarily for all $s \in \mathcal{S}$
- Policy Evaluation:
  - Loop:
    - $\Delta \gets 0$
    - Loop for each $s \in \mathcal{S}$:
      - $v \gets V(s)$
      - $V(s) \gets \sum_{s',r} p(s',r \mid s,\pi(s))\,[r + \gamma V(s')]$
      - $\Delta \gets \max(\Delta, |v - V(s)|)$
  - until $\Delta \le \theta$ (a small positive number determining the accuracy of estimation)
- Policy Improvement:
  - policy-stable $\gets$ true
  - For each $s \in \mathcal{S}$:
    - old-action $\gets \pi(s)$
    - $\pi(s) \gets \operatorname*{argmax}_a \sum_{s',r} p(s',r \mid s,a)\,[r + \gamma V(s')]$
    - If old-action $\neq \pi(s)$, then policy-stable $\gets$ false
  - If policy-stable, then stop and return $V \approx v_*$ and $\pi \approx \pi_*$; else go back to Policy Evaluation
```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


class PolicyIteration:
    def __init__(self, goal=100, proba_h=0.4, theta=1e-9, gamma=1):
        self.ph = proba_h                  # probability of heads
        self.gamma = gamma                 # discount factor
        self.goal = goal                   # target capital
        self.theta = theta                 # convergence threshold
        self.states = np.arange(self.goal + 1)
        self.state_values = np.zeros(self.goal + 1)
        # With gamma = 1, fixing V(goal) = 1 plays the role of the +1 reward
        # for reaching the goal; all other rewards are 0.
        self.state_values[self.goal] = 1.0
        self.policy = np.zeros(self.goal + 1)
        self.sweeps_history = []           # value estimates recorded each sweep

    def policy_evaluation(self):
        while True:
            old_state_values = self.state_values.copy()
            self.sweeps_history.append(old_state_values)
            for s in self.states[1:self.goal]:
                # stakes 1 .. min(s, goal - s)
                actions = np.arange(min(s, self.goal - s)) + 1
                action_returns = []
                for a in actions:
                    ret = self.ph * (self.gamma * self.state_values[s + a]) + \
                          (1 - self.ph) * (self.gamma * self.state_values[s - a])
                    action_returns.append(ret)
                # Note: averaging over all stakes evaluates an equiprobable
                # random policy rather than the current greedy policy pi(s).
                self.state_values[s] = np.mean(action_returns)
            delta = abs(self.state_values - old_state_values).max()
            if delta <= self.theta:
                break

    def policy_improvement(self):
        policy_stable = True
        for s in self.states[1:self.goal]:
            actions = np.arange(min(s, self.goal - s)) + 1
            old_a = self.policy[s]
            action_returns = []
            for a in actions:
                ret = self.ph * (self.gamma * self.state_values[s + a]) + \
                      (1 - self.ph) * (self.gamma * self.state_values[s - a])
                action_returns.append(ret)
            # Round before argmax so ties are not broken by floating-point noise.
            argmax_a = actions[np.argmax(np.round(action_returns, 5))]
            self.policy[s] = argmax_a
            if policy_stable and argmax_a != old_a:
                policy_stable = False
        return policy_stable
```
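The class defines the two halves of policy iteration but not the outer loop that alternates them; a minimal driver sketch, assuming the `PolicyIteration` class exactly as above:

```python
# Alternate evaluation and improvement until the greedy policy stops changing.
pi_solver = PolicyIteration(goal=100, proba_h=0.4)
while True:
    pi_solver.policy_evaluation()
    if pi_solver.policy_improvement():    # True once the policy is stable
        break

print(pi_solver.state_values[1:100])      # V ≈ v*: probability of reaching the goal
print(pi_solver.policy[1:100])            # π ≈ π*: stake chosen in each state
```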
[Figure: value function and optimal policy, $p_h = 0.4$ (policy iteration)]
Method 2: Value Iteration
- Initialization:
  - $V(s) \in \mathbb{R}$ and $\pi(s) \in \mathcal{A}(s)$ arbitrarily for all $s \in \mathcal{S}$
- Truncated Policy Evaluation:
  - Loop:
    - $\Delta \gets 0$
    - Loop for each $s \in \mathcal{S}$:
      - $v \gets V(s)$
      - $V(s) \gets \max_a \sum_{s',r} p(s',r \mid s,a)\,[r + \gamma V(s')]$
      - $\Delta \gets \max(\Delta, |v - V(s)|)$
  - until $\Delta \le \theta$ (a small positive number determining the accuracy of estimation)
- Optimal policy, $\pi \approx \pi_*$:
  - $\pi(s) = \operatorname*{argmax}_a \sum_{s',r} p(s',r \mid s,a)\,[r + \gamma V(s')]$
```python
class ValueIteration:
    def __init__(self, goal=100, proba_h=0.4, theta=1e-9, gamma=1):
        self.ph = proba_h                  # probability of heads
        self.gamma = gamma                 # discount factor
        self.goal = goal                   # target capital
        self.theta = theta                 # convergence threshold
        self.states = np.arange(self.goal + 1)
        self.state_values = np.zeros(self.goal + 1)
        # V(goal) = 1 stands in for the +1 reward of reaching the goal.
        self.state_values[self.goal] = 1.0
        self.policy = np.zeros(self.goal + 1)
        self.sweeps_history = []           # value estimates recorded each sweep

    def truncated_policy_evaluation(self):
        while True:
            old_state_values = self.state_values.copy()
            self.sweeps_history.append(old_state_values)
            for s in self.states[1:self.goal]:
                actions = np.arange(min(s, self.goal - s)) + 1
                action_returns = []
                for a in actions:
                    ret = self.ph * (self.gamma * self.state_values[s + a]) + \
                          (1 - self.ph) * (self.gamma * self.state_values[s - a])
                    action_returns.append(ret)
                self.state_values[s] = np.max(action_returns)  # max over actions
            delta = abs(self.state_values - old_state_values).max()
            if delta <= self.theta:
                break

    def optimal_policy(self):
        # One greedy sweep with respect to the converged value function.
        for s in self.states[1:self.goal]:
            actions = np.arange(min(s, self.goal - s)) + 1
            action_returns = []
            for a in actions:
                ret = self.ph * (self.gamma * self.state_values[s + a]) + \
                      (1 - self.ph) * (self.gamma * self.state_values[s - a])
                action_returns.append(ret)
            # Round so ties are not broken by floating-point noise.
            argmax_a = actions[np.argmax(np.round(action_returns, 5))]
            self.policy[s] = argmax_a
```
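A usage sketch (not from the original post) that runs value iteration and then plots the recorded sweeps and the final policy, relying on the `matplotlib` import at the top of the script:

```python
# Run value iteration, then do one greedy sweep to extract the policy.
vi_solver = ValueIteration(goal=100, proba_h=0.4)
vi_solver.truncated_policy_evaluation()
vi_solver.optimal_policy()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
for i, sweep in enumerate(vi_solver.sweeps_history[:5]):
    ax1.plot(sweep, label=f'sweep {i + 1}')     # early value-function estimates
ax1.set_xlabel('Capital')
ax1.set_ylabel('Value estimate')
ax1.legend()
ax2.bar(vi_solver.states[1:100], vi_solver.policy[1:100])  # final stakes
ax2.set_xlabel('Capital')
ax2.set_ylabel('Final policy (stake)')
plt.show()
```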
[Figure: value function and optimal policy, $p_h = 0.4$ (value iteration)]
[Figure: value function and optimal policy, $p_h = 0.55$ (value iteration)]
Results comparison
With the same win probability, $p_h = 0.4$, policy iteration and value iteration arrive at different optimal policies in the results above; only about a quarter of the two policies coincide exactly. $\pi_{pi}$ alternates between aggressive and conservative stakes in a piecewise, segment-by-segment pattern, while $\pi_{vi}$ alternates between aggressive and conservative stakes periodically.
Key conclusions:
- Truncated policy evaluation converges faster, reducing the number of full sweeps over the entire state set.
- The main difference between Policy Iteration and Value Iteration:
  - PI (policy evaluation): $v_{k+1}(s) = \mathbb{E}_{\pi}[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s]$, an expectation over the actions prescribed by the current policy
  - VI: $v_{k+1}(s) = \max\limits_{a} \mathbb{E}[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s, A_t = a]$, the maximum over actions
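A rough way to check the first point, assuming `sweeps_history` records one entry per full sweep over the states (as in the classes above):

```python
# Compare how many full sweeps each method records before its values converge.
pi_solver = PolicyIteration(proba_h=0.4)
while True:
    pi_solver.policy_evaluation()
    if pi_solver.policy_improvement():
        break

vi_solver = ValueIteration(proba_h=0.4)
vi_solver.truncated_policy_evaluation()
vi_solver.optimal_policy()

print('policy iteration sweeps:', len(pi_solver.sweeps_history))
print('value iteration sweeps: ', len(vi_solver.sweeps_history))
```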