Markov Decision Process (MDP): The Gambler's Problem

This post models a coin-flip gambling game as a finite Markov decision process and solves it with two dynamic-programming methods, policy iteration and value iteration. Under the same win probability (0.4), the two methods yield different optimal policies: the policy-iteration result alternates between aggressive and conservative stakes in a piecewise fashion, while the value-iteration result alternates periodically. In terms of value-function convergence, truncated policy evaluation is faster. The main difference between the two methods is that the policy-iteration update here takes an expectation over the possible actions, whereas value iteration takes the maximum over actions.


Problem description

A gambler plays a game in which wins and losses are decided by coin flips. At the start of each round he must stake part of his capital. If the coin comes up heads, he wins as much as he staked (the bet is doubled); if it comes up tails, he loses the stake. The game ends when the gambler reaches his goal of $100 or runs out of money.

Problem formulation

The game can be modeled as a finite Markov decision process:

  • $s$: state, the gambler's current capital; the non-terminal states are $s \in \{1, 2, \ldots, 99\}$, while $s = 0$ (broke) and $s = 100$ (goal reached) are terminal
  • $a$: action, the stake placed in the current round, $a \in \mathcal{A}(s) = \{1, 2, \ldots, \min(s, 100 - s)\}$
  • $r$: reward, $1$ when the goal is reached and $0$ otherwise, $r \in \mathcal{R} = \{0, 1\}$
  • $\gamma$: discount factor, $\gamma = 1$
  • $p_h$: probability that the coin lands heads
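Written out, the dynamics put probability $p_h$ on moving to $s + a$ (with reward 1 only if $s + a$ equals the goal) and probability $1 - p_h$ on moving to $s - a$. A minimal sketch of $p(s', r \mid s, a)$, using a hypothetical helper name `transitions` that does not appear in the original code:

def transitions(s, a, goal=100, ph=0.4):
    """Return [(probability, next_state, reward), ...] for p(s', r | s, a)."""
    # Heads (prob ph): the stake is doubled, so the capital becomes s + a;
    # the reward of 1 is paid only when that reaches the goal.
    # Tails (prob 1 - ph): the stake is lost and the capital becomes s - a.
    return [
        (ph, s + a, 1.0 if s + a == goal else 0.0),
        (1 - ph, s - a, 0.0),
    ]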

Solution methods

Method 1: Policy Iteration
  1. Initialization:
    • $V(s) \in \mathbb{R}$ and $\pi(s) \in \mathcal{A}(s)$ arbitrarily for all $s \in \mathcal{S}$
  2. Policy Evaluation:
    • Loop:
      • $\Delta \gets 0$
      • Loop for each $s \in \mathcal{S}$:
        • $v \gets V(s)$
        • $V(s) \gets \sum_{s',r} p(s', r \mid s, \pi(s))\,[r + \gamma V(s')]$
        • $\Delta \gets \max(\Delta, |v - V(s)|)$
    • until $\Delta \le \theta$ (a small positive number determining the accuracy of estimation)
  3. Policy Improvement:
    • policy-stable $\gets$ true
    • For each $s \in \mathcal{S}$:
      • old-action $\gets \pi(s)$
      • $\pi(s) \gets \mathop{\arg\max}\limits_{a} \sum_{s',r} p(s', r \mid s, a)\,[r + \gamma V(s')]$
      • If old-action $\neq \pi(s)$, then policy-stable $\gets$ false
    • If policy-stable, then stop and return $V \approx v_*$ and $\pi \approx \pi_*$; else go to 2
Implementation:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


class PolicyIteration:
    def __init__(self, goal=100, proba_h=0.4, theta=1e-9, gamma=1):
        self.ph = proba_h                           # probability of heads
        self.gamma = gamma                          # discount factor
        self.goal = goal                            # target capital (terminal state)
        self.theta = theta                          # convergence threshold
        self.states = np.arange(self.goal + 1)      # states 0..goal
        self.state_values = np.zeros(self.goal + 1)
        self.state_values[self.goal] = 1.0          # reaching the goal is worth reward 1
        self.policy = np.zeros(self.goal + 1)       # stake chosen in each state
        self.sweeps_history = []                    # value function before each sweep

    def policy_evaluation(self):
        # Iterative evaluation sweep. Note that this update averages the
        # one-step returns over ALL feasible stakes (an equiprobable random
        # policy) rather than using only the current greedy action pi(s).
        while True:
            old_state_values = self.state_values.copy()
            self.sweeps_history.append(old_state_values)
            for s in self.states[1:self.goal]:      # skip terminal states 0 and goal
                actions = np.arange(min(s, self.goal - s)) + 1
                action_returns = []
                for a in actions:
                    # Heads: capital becomes s + a; tails: s - a.
                    # The reward is folded into state_values[goal] = 1.
                    ret = self.ph * (self.gamma * self.state_values[s + a]) + \
                          (1 - self.ph) * (self.gamma * self.state_values[s - a])
                    action_returns.append(ret)
                self.state_values[s] = np.mean(action_returns)
            delta = abs(self.state_values - old_state_values).max()
            if delta <= self.theta:
                break

    def policy_improvement(self):
        # Greedy improvement: make pi(s) the stake with the highest one-step return.
        policy_stable = True
        for s in self.states[1:self.goal]:
            actions = np.arange(min(s, self.goal - s)) + 1
            old_a = self.policy[s]
            action_returns = []
            for a in actions:
                ret = self.ph * (self.gamma * self.state_values[s + a]) + \
                      (1 - self.ph) * (self.gamma * self.state_values[s - a])
                action_returns.append(ret)
            # Rounding breaks near-ties in favour of the smallest stake.
            argmax_a = actions[np.argmax(np.round(action_returns, 5))]
            self.policy[s] = argmax_a
            if policy_stable and argmax_a != old_a:
                policy_stable = False
        return policy_stable
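The class defines the two halves of policy iteration but not the outer loop that alternates them; a minimal driver, assuming the class above (variable names are illustrative):

# Alternate evaluation and improvement until the greedy policy stops changing.
pi_solver = PolicyIteration(goal=100, proba_h=0.4)
while True:
    pi_solver.policy_evaluation()
    if pi_solver.policy_improvement():    # returns True once the policy is stable
        break

print(pi_solver.policy[1:100])        # stake chosen at each capital level 1..99
print(pi_solver.state_values[1:100])  # final value estimates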
Value function and optimal policy, $p_h = 0.4$:

(figure)

Method 2: Value Iteration
  1. Initialization:
    • $V(s) \in \mathbb{R}$ and $\pi(s) \in \mathcal{A}(s)$ arbitrarily for all $s \in \mathcal{S}$
  2. Truncated Policy Evaluation:
    • Loop:
      • $\Delta \gets 0$
      • Loop for each $s \in \mathcal{S}$:
        • $v \gets V(s)$
        • $V(s) \gets \max\limits_{a} \sum_{s',r} p(s', r \mid s, a)\,[r + \gamma V(s')]$
        • $\Delta \gets \max(\Delta, |v - V(s)|)$
    • until $\Delta \le \theta$ (a small positive number determining the accuracy of estimation)
  3. Optimal policy, $\pi \approx \pi_*$:
    • $\pi(s) = \mathop{\arg\max}\limits_{a} \sum_{s',r} p(s', r \mid s, a)\,[r + \gamma V(s')]$
Implementation:

class ValueIteration:
    def __init__(self, goal=100, proba_h=0.4, theta=1e-9, gamma=1):
        self.ph = proba_h                           # probability of heads
        self.gamma = gamma                          # discount factor
        self.goal = goal                            # target capital (terminal state)
        self.theta = theta                          # convergence threshold
        self.states = np.arange(self.goal + 1)
        self.state_values = np.zeros(self.goal + 1)
        self.state_values[self.goal] = 1.0          # reaching the goal is worth reward 1
        self.policy = np.zeros(self.goal + 1)
        self.sweeps_history = []

    def truncated_policy_evaluation(self):
        # Value-iteration sweep: back up each state with the BEST one-step
        # return (max over stakes) instead of an expectation over stakes.
        while True:
            old_state_values = self.state_values.copy()
            self.sweeps_history.append(old_state_values)
            for s in self.states[1:self.goal]:
                actions = np.arange(min(s, self.goal - s)) + 1
                action_returns = []
                for a in actions:
                    ret = self.ph * (self.gamma * self.state_values[s + a]) + \
                          (1 - self.ph) * (self.gamma * self.state_values[s - a])
                    action_returns.append(ret)
                self.state_values[s] = np.max(action_returns)   # max over actions
            delta = abs(self.state_values - old_state_values).max()
            if delta <= self.theta:
                break

    def optimal_policy(self):
        # Extract the greedy policy from the converged value function.
        for s in self.states[1:self.goal]:
            actions = np.arange(min(s, self.goal - s)) + 1
            action_returns = []
            for a in actions:
                ret = self.ph * (self.gamma * self.state_values[s + a]) + \
                      (1 - self.ph) * (self.gamma * self.state_values[s - a])
                action_returns.append(ret)
            # Rounding breaks near-ties in favour of the smallest stake.
            argmax_a = actions[np.argmax(np.round(action_returns, 5))]
            self.policy[s] = argmax_a
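As with policy iteration, the driver is not part of the class; a minimal usage sketch, assuming the class above:

vi_solver = ValueIteration(goal=100, proba_h=0.4)
vi_solver.truncated_policy_evaluation()   # sweep with the max backup until convergence
vi_solver.optimal_policy()                # then extract the greedy policy once

print(vi_solver.policy[1:100])            # stake chosen at each capital level 1..99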
Value function and optimal policy, $p_h = 0.4$:

(figure)

Value function and optimal policy, $p_h = 0.55$:

(figure)

Comparison of results

With the same win probability, $p_h = 0.4$:

(figure)

Under the same win probability, policy iteration and value iteration arrive at different "optimal" policies; the two policies coincide exactly on only about a quarter of the states. $\pi_{PI}$ alternates between aggressive and conservative stakes in a piecewise fashion, whereas $\pi_{VI}$ does so periodically.
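The overlap mentioned above can be checked directly from the two policy arrays; a small sketch, assuming pi_solver and vi_solver from the driver snippets above have been run:

# Fraction of capital levels 1..99 where both methods pick the same stake.
agree = pi_solver.policy[1:100] == vi_solver.policy[1:100]
print(f"policies agree on {agree.mean():.0%} of states")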


Key conclusions:
  1. Truncated policy evaluation converges faster in terms of the value function, requiring fewer full sweeps over the state set.
  2. The main difference between Policy Iteration and Value Iteration:
    • PI: $v_{k+1}(s) = \mathbb{E}_{\pi}\,[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s]$, an expectation over the possible actions
    • VI: $v_{k+1}(s) = \max\limits_{a} \mathbb{E}\,[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s, A_t = a]$, the maximum over actions
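To make the difference concrete, consider the first sweep from the initial values ($V = 0$ everywhere except $V(100) = 1$) at state $s = 50$ with $p_h = 0.4$: only the stake $a = 50$ can reach the goal, so the max backup gives $0.4$, while averaging over all 50 feasible stakes gives $0.4 / 50 = 0.008$. A short check of this arithmetic:

import numpy as np

ph, goal, s = 0.4, 100, 50
V = np.zeros(goal + 1)
V[goal] = 1.0                      # the terminal goal state carries the only reward

stakes = np.arange(1, min(s, goal - s) + 1)
returns = ph * V[s + stakes] + (1 - ph) * V[s - stakes]

print(returns.max())    # value-iteration backup: 0.4
print(returns.mean())   # expectation over all stakes: 0.4 / 50 = 0.008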