PPO Effectiveness
Preliminaries
τ ~ p(τ) is the trajectory distribution
t ∈ [0, T-1] indexes the time steps of a trajectory of length T
The policy π is a probability distribution over actions a
The State Value Function is abbreviated as the V(s_t) function
$$V^{\pi}(s_t) = E_{\tau \sim p(\tau)}\left[\, R(\tau_{t:T}) \,\middle|\, \tau_{s_t} = s_t \,\right]$$
$$V^{\pi}(s_t) = E_{\tau \sim p(\tau)}\left[\, r(s_t) + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \,\right]$$
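A minimal sketch of how this expectation is typically approximated in practice: for a sampled trajectory, the discounted return R(τ_{t:T}) at every step can be accumulated backwards, and averaging these returns over many trajectories passing through a state gives a Monte Carlo estimate of V^π(s_t). The function and variable names below (discounted_returns, rewards, gamma) are illustrative assumptions, not from the original text.

```python
from typing import List

def discounted_returns(rewards: List[float], gamma: float = 0.99) -> List[float]:
    """Return R(tau_{t:T}) = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for every t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk the trajectory backwards so each step reuses the already-summed suffix.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: a hypothetical 4-step trajectory of rewards.
print(discounted_returns([1.0, 0.0, 0.0, 1.0], gamma=0.9))
# -> approximately [1.729, 0.81, 0.9, 1.0]
```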
The Bellman equation for the V(s_t) function:
$$V^{\pi}(s_t) = E_{\tau \sim p(\tau)}\left[\, r(s_t) + \gamma V^{\pi}(s_{t+1}) \,\right]$$
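One common way to exploit this recursion is temporal-difference bootstrapping, where the estimate of V(s_t) is nudged toward the Bellman target r(s_t) + γV(s_{t+1}) instead of waiting for the full return. The sketch below assumes a tabular value function and hypothetical state labels "A" and "B"; it is a generic TD(0) illustration, not PPO-specific code from the original text.

```python
from collections import defaultdict

def td0_update(V, s_t, r_t, s_next, gamma=0.99, alpha=0.1):
    """One TD(0) step: move V(s_t) toward the Bellman target r(s_t) + gamma * V(s_{t+1})."""
    target = r_t + gamma * V[s_next]
    V[s_t] += alpha * (target - V[s_t])
    return V[s_t]

# Tabular value estimates; the states "A" and "B" are hypothetical.
V = defaultdict(float)
td0_update(V, "A", 1.0, "B", gamma=0.9, alpha=0.5)
print(V["A"])  # 0.5 * (1.0 + 0.9 * 0.0) = 0.5
```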