A new member has joined our group and I need to help him get started with RL, so we decided to begin with Silver's course.
For myself, I am adding the extra requirement of carefully reading "Reinforcement Learning: An Introduction".
I did not read it very carefully before, so this time I hope to be more thorough and also write a brief summary of the corresponding topics.
13.1 Policy Approximation and its Advantages
13.2 The Policy Gradient Theorem
13.3 REINFORCE: Monte Carlo Policy Gradient
13.4 REINFORCE with Baseline
13.5 Actor-Critic Methods
13.6 Policy Gradient for Continuing Problems (Average Reward Rate)
13.7 Policy Parameterization for Continuous Actions
Policy gradient methods learn a parameterized policy that can select actions directly; a value function may still be used to learn the policy weights, but is not required for action selection.
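To make "parameterized policy" concrete, here is a minimal sketch in the spirit of the softmax-in-action-preferences parameterization from Section 13.1, with linear preferences h(s, a, θ) = θᵀx(s, a). The one-hot feature function and the state/action counts are my own illustrative assumptions, not the book's code; the point is only that the action is sampled directly from π(a|s, θ) without consulting a value function.

```python
import numpy as np

N_STATES, N_ACTIONS = 4, 3   # assumed sizes, for illustration only

def x(s, a):
    """One-hot state-action features (an assumed, illustrative choice)."""
    feat = np.zeros(N_STATES * N_ACTIONS)
    feat[s * N_ACTIONS + a] = 1.0
    return feat

def pi(s, theta):
    """Softmax policy over linear action preferences h(s, a, theta) = theta . x(s, a)."""
    prefs = np.array([theta @ x(s, a) for a in range(N_ACTIONS)])
    prefs -= prefs.max()                  # numerical stability
    e = np.exp(prefs)
    return e / e.sum()

theta = np.zeros(N_STATES * N_ACTIONS)    # policy weights
s = 2
probs = pi(s, theta)
a = np.random.choice(N_ACTIONS, p=probs)  # action selected directly from the policy,
                                          # no value function consulted
print(probs, a)
```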
Advantages and disadvantages of Policy Gradient Methods:
1) the policy may be a simpler function to approximate than the value function
2)Action-value methods have no natural way of finding stochastic optimal policies, whereas policy approximating methods can
3)policy parameterization is sometimes a good way of injecting prior knowledge
Better convergence properties (with a sufficiently small step size, each policy gradient update keeps improving the policy; the flip side is that it often ends up in a local optimum; see the REINFORCE sketch below this list)
Effective in high-dimensional or continuous action spaces (value-based methods need a max of Q over actions)
Can learn stochastic policies (taking the max of Q over actions amounts to a deterministic policy)
Disadvantages:
Typically converge to a local rather than global optimum
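The sketch referenced in the list above: a minimal REINFORCE-style update on a made-up 3-armed bandit (episodes of length 1, so the return G is just the reward). The reward means, step size, and episode count are assumptions for illustration. It shows the two advantages in miniature: with a small step size the softmax policy improves steadily, and actions are always sampled from a stochastic policy rather than chosen by an argmax over Q.

```python
import numpy as np

rng = np.random.default_rng(0)

TRUE_MEANS = np.array([0.1, 0.5, 0.9])   # assumed bandit reward means
N_ACTIONS = len(TRUE_MEANS)

theta = np.zeros(N_ACTIONS)   # one action preference per arm
alpha = 0.05                  # small step size: gradual but steady improvement

def policy(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

for episode in range(5000):
    probs = policy(theta)
    a = rng.choice(N_ACTIONS, p=probs)    # sample from the stochastic policy
    G = rng.normal(TRUE_MEANS[a], 1.0)    # return of this one-step episode
    # gradient of ln pi(a|theta) for a softmax over preferences: one_hot(a) - probs
    grad_ln_pi = -probs
    grad_ln_pi[a] += 1.0
    theta += alpha * G * grad_ln_pi       # REINFORCE update

print(policy(theta))   # probability mass concentrates on the best arm,
                       # but the policy stays stochastic throughout learning
```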