1. The Agent-Environment Interface, Goals and Rewards
The reinforcement learning problem is meant to be a straightforward framing of the problem of learning from interaction to achieve a goal. The learner and decision-maker is called the agent. The thing it interacts with, comprising everything outside the agent, is called the environment. These interact continually, the agent selecting actions and the environment responding to those actions and presenting new situations to the agent. The environment also gives rise to rewards, special numeric values that the agent tries to maximize over time. A complete specification of an environment defines a task, one instance of the reinforcement learning problem.
More specifically, the agent and environment interact at each of a sequence of discrete time steps, $t = 0, 1, 2, 3, \ldots$. At each time step $t$, the agent receives some representation of the environment's state, $s_t \in S$, where $S$ is the set of possible states, and on that basis selects an action, $a_t \in A(s_t)$, where $A(s_t)$ is the set of actions available in state $s_t$. One time step later, in part as a consequence of its action, the agent receives a numerical reward, $r_{t+1} \in \mathbb{R}$, and finds itself in a new state, $s_{t+1}$.
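The interaction described above can be sketched as a simple loop. This is only an illustration under stated assumptions: the `CorridorEnv` class, its five-cell layout, its reward of 1 on reaching the last cell, and the uniformly random action choice are all hypothetical and do not come from the text.

```python
import random


class CorridorEnv:
    """Hypothetical environment: a corridor of 5 cells, start in cell 0, goal in cell 4."""

    def __init__(self):
        self.n_cells = 5
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def actions(self, state):
        # A(s): the set of actions available in state s.
        # In the leftmost cell the agent can only move right.
        return [+1] if state == 0 else [-1, +1]

    def step(self, action):
        # The environment responds with the next state and a numerical reward.
        self.state += action
        done = self.state == self.n_cells - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def run_episode(env, rng, max_steps=1000):
    """Run one episode and record (s_t, a_t, r_{t+1}, s_{t+1}) for each time step."""
    s = env.reset()
    trajectory = []
    for _ in range(max_steps):
        a = rng.choice(env.actions(s))      # the agent selects an action
        s_next, r, done = env.step(a)       # one time step later: reward and new state
        trajectory.append((s, a, r, s_next))
        s = s_next
        if done:
            break
    return trajectory


trajectory = run_episode(CorridorEnv(), random.Random(0))
```

Each recorded tuple pairs the state and action at time $t$ with the reward and state at time $t+1$, which mirrors the indexing used in the text.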
At each time step, the agent implements a mapping from states to probabilities of selecting each possible action. This mapping is called the agent's policy and is denoted $\pi_t$, where $\pi_t(s, a)$ is the probability that $a_t = a$ if $s_t = s$. Reinforcement learning methods specify how the agent changes its policy as a result of its experience. The agent's goal, roughly speaking, is to maximize the total amount of reward it receives over the long run.
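A policy of this kind can be represented as a table of action probabilities per state. The sketch below is a minimal illustration; the state names, action names, and probabilities are made up for the example, and the helper functions `validate_policy` and `select_action` are hypothetical.

```python
import random

# Hypothetical policy table: policy[s][a] is the probability of choosing a in state s.
policy = {
    "low_battery": {"recharge": 0.9, "search": 0.1},
    "high_battery": {"search": 0.7, "wait": 0.3},
}


def validate_policy(policy, tol=1e-9):
    """Check that each state's action probabilities are non-negative and sum to 1."""
    for state, dist in policy.items():
        if any(p < 0 for p in dist.values()):
            raise ValueError(f"negative probability in state {state!r}")
        if abs(sum(dist.values()) - 1.0) > tol:
            raise ValueError(f"probabilities in state {state!r} do not sum to 1")


def select_action(policy, state, rng):
    """Sample an action a with probability policy[state][a]."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


validate_policy(policy)
action = select_action(policy, "low_battery", random.Random(0))
```

A learning method would change the numbers in this table over time; the sampling step itself stays the same.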
In reinforcement learning, the purpose or goal of the agent is formalized in terms of a special reward signal passing from the environment to the agent. At each time step, the reward is a simple number, $r_t \in \mathbb{R}$. Informally, the agent's goal is to maximize the total amount of reward it receives. This means maximizing not immediate reward, but cumulative reward in the long run.
The use of a reward signal to formalize the idea of a goal is one of the most distinctive features of reinforcement learning. Although this way of formulating goals might at first appear limiting, in practice it has proved to be flexible and widely applicable. The best way to see this is to consider examples of how it has been, or could be, used. For example, to make a robot learn to walk, researchers have provided reward on each time step proportional to the robot's forward motion. In making a robot learn how to escape from a maze, the reward is often zero until it escapes, when it becomes +1. Another common approach in maze learning is to give a reward of -1 for every time step that passes prior to escape; this encourages the agent to escape as quickly as possible. To make a robot learn to find and collect empty soda cans for recycling, one might give it a reward of zero most of the time, and then a reward of +1 for each can collected (and confirmed as empty). One might also want to give the robot negative rewards when it bumps into things or when somebody yells at it. For an agent to learn to play checkers or chess, the natural rewards are +1 for winning, -1 for losing, and 0 for drawing and for all nonterminal positions.
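The two maze reward schemes can be compared with a short sketch. It assumes that rewards are simply summed without discounting, that escape happens on the final step of an episode, and that the escaping step itself earns 0 under the second scheme; these are illustrative choices, and the function names are hypothetical.

```python
def goal_reward(escaped):
    """Scheme 1: reward is 0 on every step until escape, then 1."""
    return 1.0 if escaped else 0.0


def step_penalty_reward(escaped):
    """Scheme 2: reward is -1 on every step that passes prior to escape."""
    return 0.0 if escaped else -1.0


def episode_rewards(n_steps, reward_fn):
    """Rewards for an episode of n_steps in which escape happens on the last step."""
    return [reward_fn(escaped=(t == n_steps - 1)) for t in range(n_steps)]


# Total (undiscounted) reward for a fast escape and a slow escape under each scheme.
fast_goal = sum(episode_rewards(5, goal_reward))
slow_goal = sum(episode_rewards(50, goal_reward))
fast_penalty = sum(episode_rewards(5, step_penalty_reward))
slow_penalty = sum(episode_rewards(50, step_penalty_reward))
```

Under these assumptions the first scheme gives the same total for both episodes, so it cannot distinguish a fast escape from a slow one, while the second scheme gives a higher total to the faster escape.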
You can see what is happening in all of these examples. The agent always learns to maximize its reward. If we want it to do something for us, we must provide rewards to it in such a way that in maximizing them the agent will also achieve our goals. It is thus critical that the rewards we set up truly indicate what we want accomplished. In particular, the reward signal is not the place to impart to the agent prior knowledge about how to achieve what we want it to do. For example, a chess-playing agent should be rewarded only for actually winning, not for achieving subgoals such as taking its opponent's pieces or gaining control of the center of the board. If achieving these sorts of subgoals were rewarded, then the agent might find a way to achieve them without achieving the real goal. For example, it might find a way to take the opponent's pieces even at the cost of losing the game. The reward signal is your way of communicating to the agent what you want it to achieve, not how you want it achieved.
2. Markov Decision Processes
In the reinforcement learning framework, the agent makes its decisions as a function of a signal from the environment called the environment's state. In this section we discuss what is required of the state signal, and what kind of information we should and should not expect it to provide. In particular, we formally define a property of environments and their state signals that is of particular interest, called the Markov property.
In this book, by 'the state' we mean whatever information is available to the agent. We assume that the state is given by some preprocessing system that is nominally part of the environment. We do not address the issues of constructing, changing, or learning the state signal in this book. We take this approach not because we consider state representation to be unimportant, but in order to focus fully on the decision-making issues. In other words, our main concern is not with designing the state signal, but with deciding what action to take as a function of whatever state signal is available.
Certainly the state signal should include immediate sensations such as sensory measurements, but it can contain much more than that. State representations can be highly processed versions of original sensations, or they can be complex structures built up over time from the sequence of sensations. For example, we can move our eyes over a scene, with only a tiny spot corresponding to th
1. The Agent-Environment Interface, Goals and Rewards
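More specifically, the agent and environment interact at every step of a discrete time sequence, $t = 0, 1, 2, 3, \ldots$. At each time step $t$, the agent receives some representation of the environment's state, $s_t \in S$, where $S$ is the set of all possible states, and on that basis selects an action, $a_t \in A(s_t)$, where $A(s_t)$ is the set of actions available in state $s_t$. One time step later, as a result of that action, the agent receives a numerical reward, $r_{t+1} \in \mathbb{R}$, and arrives at a new state, $s_{t+1}$.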
2. Markov Decision Processes
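Ideally, what we would like is a state signal that summarizes past sensations compactly, yet in such a way that all relevant information is retained. This normally requires more than the immediate sensations, but never more than the complete history of all past sensations. A state signal that succeeds in retaining all relevant information is said to be Markov, or to have the Markov property (we define this formally below). For example, a board position (the current configuration of all the pieces on the board) would serve as a Markov state because it summarizes everything important about the complete sequence of positions that led to it. Much of the information about the sequence is lost, but everything that really matters for the future of the game is retained. Similarly, the current position and velocity of a cannonball are all that matter for its future flight; how that position and velocity came about does not matter. This is sometimes also referred to as an "independence of path" property, because all that matters is in the current state signal; its meaning is independent of the "path," or history, of signals that led up to it.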
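The cannonball example can be made concrete with a small sketch. It assumes a hypothetical simulator with no air resistance and a simple Euler integration step; the function name `step`, the time step, and the numbers are illustrative only. The point is that position together with velocity determines the next state, whereas position alone does not.

```python
G = 9.81  # gravitational acceleration in m/s^2


def step(state, dt=0.1):
    """Advance a cannonball one time step using only its current state.

    state is (x, y, vx, vy). No history is needed: the next state is a
    function of the current position and velocity alone.
    """
    x, y, vx, vy = state
    return (x + vx * dt, y + vy * dt, vx, vy - G * dt)


# Two cannonballs at the same position but with different velocities.
rising = (10.0, 5.0, 3.0, 2.0)
falling = (10.0, 5.0, 3.0, -2.0)

same_position_now = rising[:2] == falling[:2]
same_position_next = step(rising)[:2] == step(falling)[:2]
```

Here `same_position_now` is true but `same_position_next` is false, so a state signal containing only the position would not be Markov, while the full `(x, y, vx, vy)` tuple is under the assumptions of this sketch.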
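We now formally define the Markov property for the reinforcement learning problem. To keep the mathematics simple, we assume here that there are a finite number of states and reward values. This lets us work in terms of sums and probabilities rather than integrals and probability densities, but the argument can easily be extended to include continuous states and rewards. Consider how a general environment might respond at time $t+1$ to the action taken at time $t$. In the most general, causal case, this response may depend on everything that has happened earlier. In this case the dynamics can be defined only by specifying the complete probability distribution:
$$\Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0\} \qquad (2.1)$$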
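for all $s'$, $r$, and all possible values of the past events: $s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0$. If, on the other hand, the state signal has the Markov property, then the environment's response at $t+1$ depends only on the state and action representations at $t$, in which case the environment's dynamics can be defined by specifying only
$$\Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t\} \qquad (2.2)$$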
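for all $s'$, $r$, $s_t$, and $a_t$. In other words, a state signal has the Markov property, and is a Markov state, if and only if (2.2) is equal to (2.1) for all $s'$, $r$, and histories $s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0$. In this case, the environment and task as a whole are also said to have the Markov property.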
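A reinforcement learning task that satisfies the Markov property is called a Markov decision process, or MDP. If the state and action spaces are finite, then it is called a finite Markov decision process (finite MDP). Finite MDPs are particularly important to the theory of reinforcement learning. We refer to them throughout this book; they account for roughly 90% of the reinforcement learning you need to understand.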
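Given any state $s$ and action $a$, the probability of each possible next state $s'$ is
$$P^a_{ss'} = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}.$$
These quantities are called transition probabilities. Similarly, given any current state $s$ and action $a$, together with any next state $s'$, the expected value of the next reward is
$$R^a_{ss'} = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\}.$$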
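Both quantities, the transition probabilities and the expected next rewards, can be derived from a joint table of one-step dynamics. The sketch below assumes a hypothetical finite MDP stored as a dictionary from `(state, action)` to a list of `(next_state, reward, probability)` triples; the state names, rewards, and probabilities are invented for illustration.

```python
from collections import defaultdict

# Hypothetical one-step dynamics:
# dynamics[(s, a)] is a list of (next_state, reward, probability) outcomes.
dynamics = {
    ("high", "search"): [("high", 1.0, 0.6), ("high", 2.0, 0.1), ("low", 1.0, 0.3)],
    ("low", "recharge"): [("high", 0.0, 1.0)],
}


def transition_probabilities(dynamics, s, a):
    """Probability of each next state s' given s and a, summing over rewards."""
    probs = defaultdict(float)
    for s_next, _reward, p in dynamics[(s, a)]:
        probs[s_next] += p
    return dict(probs)


def expected_rewards(dynamics, s, a):
    """Expected next reward given s, a, and each possible next state s'."""
    mass = defaultdict(float)
    weighted = defaultdict(float)
    for s_next, reward, p in dynamics[(s, a)]:
        mass[s_next] += p
        weighted[s_next] += p * reward
    # Condition on s' by dividing by the probability of reaching it.
    return {s_next: weighted[s_next] / mass[s_next] for s_next in mass}


P = transition_probabilities(dynamics, "high", "search")
R = expected_rewards(dynamics, "high", "search")
```

The expected reward is conditioned on the next state, which is why each weighted sum is divided by the probability of reaching that next state.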
3. Optimal Value Functions
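Solving a reinforcement learning task means, roughly, finding a policy that achieves a lot of reward over the long run. For finite MDPs, we can precisely define an optimal policy in the following way. Value functions define a partial ordering over policies. A policy $\pi$ is defined to be better than or equal to a policy $\pi'$ if its expected return is greater than or equal to that of $\pi'$ for all states. In other words, $\pi \ge \pi'$ if and only if $V^{\pi}(s) \ge V^{\pi'}(s)$ for all $s \in S$. There is always at least one policy that is better than or equal to all other policies. This is an optimal policy. Although there may be more than one, we denote all the optimal policies by $\pi^*$. They share the same state-value function, called the optimal state-value function, denoted $V^*$, and defined as
$$V^*(s) = \max_{\pi} V^{\pi}(s) \qquad (3.1)$$
for all $s \in S$.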
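Optimal policies also share the same optimal action-value function, denoted $Q^*$, and defined as
$$Q^*(s, a) = \max_{\pi} Q^{\pi}(s, a) \qquad (3.2)$$
for all $s \in S$ and $a \in A(s)$.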
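For a very small finite MDP, the maximization over policies in these definitions can be carried out by brute force. The sketch below is an illustration only: the two-state MDP, its deterministic transitions, the discount factor `GAMMA`, and the restriction to deterministic policies (which suffices for finite MDPs) are assumptions made for the example, and the policy evaluation is a plain fixed-point iteration rather than a method taken from the text.

```python
from itertools import product

GAMMA = 0.9  # assumed discount factor, not specified in the text
STATES = ["A", "B"]
ACTIONS = ["stay", "move"]

# Hypothetical deterministic model: MODEL[(s, a)] = (next_state, reward).
MODEL = {
    ("A", "stay"): ("A", 0.0),
    ("A", "move"): ("B", 1.0),
    ("B", "stay"): ("B", 2.0),
    ("B", "move"): ("A", 0.0),
}


def evaluate(policy, sweeps=2000):
    """Approximate the state-value function of a deterministic policy by iteration."""
    v = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        v = {
            s: MODEL[(s, policy[s])][1] + GAMMA * v[MODEL[(s, policy[s])][0]]
            for s in STATES
        }
    return v


def all_policies():
    """Enumerate every deterministic policy as a dict from state to action."""
    for choice in product(ACTIONS, repeat=len(STATES)):
        yield dict(zip(STATES, choice))


def q_pi(v_pi, s, a):
    """Action value under a policy: take a in s, then follow the policy."""
    s_next, reward = MODEL[(s, a)]
    return reward + GAMMA * v_pi[s_next]


values = [evaluate(pi) for pi in all_policies()]

# Optimal values as the maximum over policies, state by state.
v_star = {s: max(v[s] for v in values) for s in STATES}
q_star = {(s, a): max(q_pi(v, s, a) for v in values) for s in STATES for a in ACTIONS}
```

Enumeration grows exponentially with the number of states, so this only serves to make the definitions tangible on a toy problem.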
4. TD Prediction
Figure 4.1 Tabular TD(0) estimation
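A minimal sketch of tabular TD(0) in its commonly stated form is given here as an illustration, not as a reproduction of the figure. Each update moves the estimate for the current state toward the sampled reward plus the discounted estimate for the next state. The step size `alpha`, the discount `gamma`, the function names, and the five-state random-walk task used to exercise it are all assumptions made for the example.

```python
import random


def td0_evaluate(sample_episode, n_episodes, alpha=0.1, gamma=1.0, seed=0):
    """Estimate state values for a fixed policy with tabular TD(0).

    sample_episode(rng) must yield (s, r, s_next, terminal) transitions
    generated by following the policy being evaluated.
    """
    rng = random.Random(seed)
    V = {}  # value table; unseen states start at 0
    for _ in range(n_episodes):
        for s, r, s_next, terminal in sample_episode(rng):
            v_s = V.setdefault(s, 0.0)
            # Terminal states are taken to have value 0.
            target = r if terminal else r + gamma * V.setdefault(s_next, 0.0)
            # Move V(s) a fraction alpha of the way toward the one-step target.
            V[s] = v_s + alpha * (target - v_s)
    return V


def random_walk_episode(rng):
    """Hypothetical task: states 1..5, start at 3, step left or right at random.

    Stepping off the right end (state 6) gives reward 1; off the left end
    (state 0) gives reward 0. Both ends terminate the episode.
    """
    s = 3
    while True:
        s_next = s + rng.choice([-1, 1])
        terminal = s_next in (0, 6)
        r = 1.0 if s_next == 6 else 0.0
        yield s, r, s_next, terminal
        if terminal:
            return
        s = s_next


V = td0_evaluate(random_walk_episode, n_episodes=5000, alpha=0.02)
```

For this random walk the true value of state `s` is `s / 6`, so the learned table can be compared against those values as a sanity check. Because the step size is constant, the estimates keep fluctuating slightly around the true values instead of converging exactly.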
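Figure 4.1 specifies the TD(0) algorithm completely in procedural form, and Figure 4.2 shows its backup diagram. The value estimate for the state node at the top of the backup diagram is updated on the basis of the one sample transition from it to the immediately following state. We refer to TD and Monte Carlo updates as sample backups because they both involve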