We first came to focus on what is now known as reinforcement learning in late 1979. We were both at the University of Massachusetts, working on one of the earliest projects to revive the idea that networks of neuronlike adaptive elements might prove to be a promising approach to artificial adaptive intelligence. The project explored the "heterostatic theory of adaptive systems" developed by A. Harry Klopf. Harry's work was a rich source of ideas, and we were permitted to explore them critically and compare them with the long history of prior work in adaptive systems. Our task became one of teasing the ideas apart and understanding their relationships and relative importance. This continues today, but in 1979 we came to realize that perhaps the simplest of the ideas, which had long been taken for granted, had received surprisingly little attention from a computational perspective. This was simply the idea of a learning system that wants something, that adapts its behavior in order to maximize a special signal from its environment. This was the idea of a "hedonistic" learning system, or, as we would say now, the idea of reinforcement learning.
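The passage above states the core mechanism only in words: an agent tries actions, receives a scalar reward signal from its environment, and adapts its behavior to make that signal larger. As a minimal sketch of that idea (not taken from the book; the three-armed bandit environment, its win probabilities, and the epsilon-greedy rule are illustrative assumptions), the following Python snippet shows such a "hedonistic" learner:

```python
import random

# Hypothetical 3-armed bandit environment: each action pays reward 1
# with the probability listed here, and 0 otherwise.
ARM_PROBS = [0.2, 0.5, 0.8]

values = [0.0] * len(ARM_PROBS)   # the agent's running estimate of each action's reward
counts = [0] * len(ARM_PROBS)
EPSILON = 0.1                     # small chance of trying a random action (exploration)

for _ in range(10_000):
    # Mostly pick the action currently believed best; occasionally explore.
    if random.random() < EPSILON:
        action = random.randrange(len(ARM_PROBS))
    else:
        action = max(range(len(ARM_PROBS)), key=lambda a: values[a])

    reward = 1.0 if random.random() < ARM_PROBS[action] else 0.0

    # Adapt: move the estimate toward the reward actually received.
    counts[action] += 1
    values[action] += (reward - values[action]) / counts[action]

print(values)  # estimates approach [0.2, 0.5, 0.8]; the agent ends up favoring the last arm
```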
Like others, we had a sense that reinforcement learning had been thoroughly explored in the early days of cybernetics and artificial intelligence. On closer inspection, though, we found that it had been explored only slightly. While reinforcement learning had clearly motivated some of the earliest computational studies of learning, most of these researchers had gone on to other things, such as pattern classification, supervised learning, and adaptive control, or they had abandoned the study of learning altogether. As a result, the special issues involved in learning how to get something from the environment received relatively little attention. In retrospect, focusing on this idea was the critical step that set this branch of research in motion. Little progress could be made in the computational study of reinforcement learning until it was recognized that such a fundamental idea had not yet been thoroughly explored.
The field has come a long way since then, evolving and maturing in several directions. Reinforcement learning has gradually become one of the most active research areas in machine learning, artificial intelligence, and neural network research. The field has developed strong mathematical foundations and impressive applications. The computational study of reinforcement learning is now a large field, with hundreds of active researchers around the world in diverse disciplines such as psychology, control theory, artificial intelligence, and neuroscience. Particularly important have been the contributions establishing and developing the relationships to the theory of optimal control and dynamic programming. The overall problem of learning from interaction to achieve goals is still far from being solved, but our understanding of it has improved significantly. We can now place component ideas, such as temporal-difference learning, dynamic programming, and function approximation, within a coherent perspective with respect to the overall problem.
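For readers meeting these component ideas for the first time, a small sketch may help fix the flavor of one of them. The snippet below is an illustrative assumption of ours, not material from the book: tabular TD(0) prediction on a simple five-state random walk, where each state's value is updated toward a bootstrapped target (the reward plus the estimated value of the next state), the same bootstrapping idea that links temporal-difference learning to dynamic programming.

```python
import random

# Hypothetical environment: a five-state random walk (states 1..5). Stepping
# left of state 1 terminates with reward 0; stepping right of state 5
# terminates with reward 1. True state values are 1/6, 2/6, ..., 5/6.
N_STATES, ALPHA, GAMMA = 5, 0.1, 1.0
V = [0.0] * (N_STATES + 2)        # indices 0 and 6 act as terminal states with value 0

for _ in range(5_000):            # episodes
    s = 3                         # every episode starts in the middle state
    while 1 <= s <= N_STATES:
        s_next = s + random.choice([-1, 1])
        reward = 1.0 if s_next == N_STATES + 1 else 0.0
        target = reward + GAMMA * V[s_next]     # bootstrapped one-step target
        V[s] += ALPHA * (target - V[s])         # TD(0) update
        s = s_next

print([round(V[s], 2) for s in range(1, N_STATES + 1)])  # roughly [0.17, 0.33, 0.5, 0.67, 0.83]
```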
Our goal in writing this book was to provide a clear and simple account of the key ideas and algorithms of reinforcement learning. We wanted our treatment to be accessible to readers in all of the related disciplines, but we could not cover all of these perspectives in detail. For the most part, our treatment takes the point of view of artificial intelligence and engineering. In this second edition, we plan to have one chapter summarizing the connections to psychology and neuroscience, which are many and rapidly developing. Coverage of connections to other fields we leave to others or to another time. We also chose not to produce a rigorous formal treatment of reinforcement learning. We did not reach for the highest possible level of mathematical abstraction and did not rely on a theorem-proof format. We tried to choose a level of mathematical detail that points the mathematically inclined in the right directions without distracting from the simplicity and potential generality of the underlying ideas.
The book consists of three parts. Part I is introductory and problem oriented. We focus on the simplest aspects of reinforcement learning and on its main distinguishing features. One full chapter is devoted to introducing the reinforcement learning problem whose solution we explore in the rest of the book.
1、in progress: ongoing, not yet finished
Second edition, in progress
2、The MIT Press
Press: publisher; MIT: Massachusetts Institute of Technology; Cambridge: Cambridge, Massachusetts
3、in memory of: in remembrance of
In memory of A. Harry Klopf
4、contents: table of contents
Preface: introductory remarks preceding the main text
5、Notation: symbols
Summary of Notation; the general summary of notation follows the contents
6、scope: range, extent
Limitations and scope
7、extended: worked out at greater length
An extended example
8、bibliographical: relating to the literature and references
Bibliographical remarks
9、revive: to bring back into use
Revive the idea
10、neuronlike: resembling neurons
Revive the idea that networks of neuronlike adaptive elements
11、adaptive: able to adjust to its conditions
Adaptive elements might prove to be a promising approach to artificial adaptive intelligence
12、artificial: man-made
See above
13、heterostatic: term from Klopf's theory of adaptive systems
"Heterostatic theory of adaptive systems" (A. Harry Klopf)
14、be permitted to: be allowed to
We were permitted to explore them critically
15、prior: earlier, previous
The long history of prior work in adaptive systems
16、tease apart: to separate carefully into parts
One of teasing the ideas apart
17、granted: taken for granted = assumed without question
Which had long been taken for granted
18、computational: relating to computation; perspective: point of view
Had received surprisingly little attention from a computational perspective
19、hedonistic: pleasure-seeking
This was the idea of a "hedonistic" learning system
20、as we would say now: as we would put it today
21、cybernetics: the study of control and communication in machines and living systems
22、on closer inspection: when examined more carefully
On closer inspection, though, we found that it had been explored only slightly
23、retrospect: a looking back on the past; motion: movement
In retrospect, focusing on this idea was the critical step that set this branch of research in motion
24、fundamental: basic, foundational; thoroughly: completely
Such a fundamental idea had not yet been thoroughly explored
25、evolve: to develop gradually
Evolving and maturing in several directions
26、diverse disciplines: varied fields of study
With hundreds of active researchers around the world in diverse disciplines
27、optimal: best possible
Optimal control and dynamic programming
28、overall: general, taken as a whole
The overall problem of learning from interaction
29、component: constituent, forming part of a whole
We can now place component ideas
30、temporal-difference: temporal difference (learning)
31、coherent: unified and consistent; perspective: viewpoint; with respect to: concerning
Within a coherent perspective with respect to the overall problem
32、accessible: easy to understand, approachable
We wanted our treatment to be accessible to readers in all of the related disciplines
33、rigorous: strict, exact
We also chose not to produce a rigorous formal treatment of reinforcement learning
34、theorem-proof: theorem-and-proof style
Did not rely on a theorem-proof format
35、inclined: having a natural interest or tendency; underlying: fundamental, lying beneath
Points the mathematically inclined in the right directions without distracting from the simplicity and potential generality of the underlying ideas
36、supplemented: added to as a supplement
Perhaps supplemented by reading from the literature
37、subset: a part of a larger set
Cover only a subset of the material
38、in sequence: in order, one after another
39、omit: to leave out
These can be omitted on first reading without creating problems
40、elementary: basic, introductory
41、substantially: to a large degree
42、manuals: handbooks
Solution manuals are available to instructors
43、relevant: related, pertinent
Relevant historical background
44、authoritative: reliable and definitive
Make these sections authoritative and complete
45、critically: in an analytical, questioning way