强化学习(ppt文档可编辑修改).ppt

资源描述

高级人工智能第十章第十章史忠植史忠植史忠植史忠植中国科学院计算技术研究所中国科学院计算技术研究所中国科学院计算技术研究所中国科学院计算技术研究所强化学习2024/5/25 周六1强化学习史忠植内容提要l引言引言l强化学习模型强化学习模型l动态规划动态规划l蒙特卡罗方法蒙特卡罗方法l时序差分学习时序差分学习lQ学习学习l强化学习中的函数估计强化学习中的函数估计l应用应用2024/5/25 周六2强化学习史忠植引言人类通常从与外界环境的交互中学习。所谓强化（reinforcement）学习是指从环境状态到行为映射的学习，以使系统行为从环境中获得的累积奖励值最大。在强化学习中，我们设计算法来把外界环境转化为最大化奖励量的方式的动作。我们并没有直接告诉主体要做什么或者要采取哪个动作,而是主体通过看哪个动作得到了最多的奖励来自己发现。主体的动作的影响不只是立即得到的奖励，而且还影响接下来的动作和最终的奖励。试错搜索(trial-and-error search)和延期强化(delayed reinforcement)这两个特性是强化学习中两个最重要的特性。2024/5/25 周六3强化学习史忠植引言强化学习技术是从控制理论、统计学、心理学等相关学科发展而来，最早可以追溯到巴甫洛夫的条件反射实验。但直到上世纪八十年代末、九十年代初强化学习技术才在人工智能、机器学习和自动控制等领域中得到广泛研究和应用，并被认为是设计智能系统的核心技术之一。特别是随着强化学习的数学基础研究取得突破性进展后，对强化学习的研究和应用日益开展起来，成为目前机器学习领域的研究热点之一。2024/5/25 周六4强化学习史忠植引言引言o强化思想最先来源于心理学的研究。1911年Thorndike提出了效果律（Law of Effect）：一定情景下让动物感到舒服的行为，就会与此情景增强联系（强化），当此情景再现时，动物的这种行为也更易再现；相反，让动物感觉不舒服的行为，会减弱与情景的联系，此情景再现时，此行为将很难再现。换个说法，哪种行为会“记住”，会与刺激建立联系，取决于行为产生的效果。o动物的试错学习,包含两个含义：选择（selectional）和联系（associative），对应计算上的搜索和记忆。所以，1954年，Minsky在他的博士论文中实现了计算上的试错学习。同年，Farley和Clark也在计算上对它进行了研究。强化学习一词最早出现于科技文献是1961年Minsky 的论文“Steps Toward Artificial Intelligence”，此后开始广泛使用。1969年，Minsky因在人工智能方面的贡献而获得计算机图灵奖。2024/5/25 周六5强化学习史忠植引言引言o1953到1957年，Bellman提出了求解最优控制问题的一个有效方法：动态规划（dynamic programming）oBellman于 1957年还提出了最优控制问题的随机离散版本，就是著名的马尔可夫决策过程（MDP,Markov decision processe），1960年Howard提出马尔可夫决策过程的策略迭代方法，这些都成为现代强化学习的理论基础。o1972年，Klopf把试错学习和时序差分结合在一起。1978年开始，Sutton、Barto、Moore，包括Klopf等对这两者结合开始进行深入研究。o1989年Watkins提出了Q-学习Watkins 1989，也把强化学习的三条主线扭在了一起。o1992年，Tesauro用强化学习成功了应用到西洋双陆棋（backgammon）中，称为TD-Gammon。2024/5/25 周六6强化学习史忠植内容提要l引言引言l强化学习模型强化学习模型l动态规划动态规划l蒙特卡罗方法蒙特卡罗方法l时序差分学习时序差分学习lQ学习学习l强化学习中的函数估计强化学习中的函数估计l应用应用2024/5/25 周六7强化学习史忠植主体主体主体主体强化学习模型i:inputr:reward s:statea:action状态 sisi+1ri+1奖励 ri环境环境环境环境动作动作 aia0a1a2s0s1s2s32024/5/25 周六8强化学习史忠植描述一个环境描述一个环境（问题）oAccessible vs.inaccessibleoDeterministic vs.non-deterministicoEpisodic vs.non-episodicoStatic vs.dynamicoDiscrete vs.continuousThe most complex general class of environments are inaccessible,non-deterministic,non-episodic,dynamic,and continuous.2024/5/25 周六9强化学习史忠植强化学习问题强化学习问题oAgent-environment interactionnStates,Actions,RewardsoTo define a finite MDPnstate and action sets:S and Anone-step“dynamics”defined by transition probabilities(Markov Property):nreward probabilities:EnvironmentactionstaterewardRLAgent2024/5/25 周六10强化学习史忠植与监督学习对比与监督学习对比oReinforcement Learning Learn from interactionnlearn from its own experience,and the objective is to get as much reward as possible.The learner is not told which actions to take,but instead must discover which actions yield the most reward by trying them.RLSystemInputsOutputs(“actions”)Training Info =evaluations(“rewards”/“penalties”)oSupervised Learning Learn from examples provided by a knowledgable external supervisor.2024/5/25 周六11强化学习史忠植强化学习要素强化学习要素oPolicy:stochastic rule for selecting actionsoReturn/Reward:the function of future rewards agent tries to maximizeoValue:what is good because it predicts rewardoModel:what follows whatPolicyRewardValueModel ofenvironmentIs unknownIs my goalIs I can getIs my method2024/5/25 周六12强化学习史忠植在策略在策略下的下的BellmanBellman公式公式The basic idea:So:Or,without the expectation operator:is the discount rate2024/5/25 周六13强化学习史忠植Bellman最优策略公式2024/5/25 周六14强化学习史忠植MARKOV DECISION PROCESS k-armed bandit gives immediate reward DELAYED REWARD?Characteristics of MDP:a set of states :Sa set of actions:Aa reward function:R:S x A RA state transition function:T:S x A (S)T(s,a,s):probability of transition from s to s using action a2024/5/25 周六15强化学习史忠植MDP EXAMPLE:TransitionfunctionStates and rewardsBellman Equation:(Greedy policy selection)2024/5/25 周六16强化学习史忠植MDP Graphical Representation,:T(s,action,s )Similarity to Hidden Markov Models(HMMs)2024/5/25 周六17强化学习史忠植动态规划Dynamic Programming-ProblemoA discrete-time dynamic systemnStates 1,n+termination state 0nControl U(i)nTransition Probability pij(u)oAccumulative cost structureoPolicies2024/5/25 周六18强化学习史忠植oFinite Horizon ProblemoInfinite Horizon ProblemoValue Iteration动态规划Dynamic Programming Iterative Solution 2024/5/25 周六19强化学习史忠植动态规划中的策略迭代/值迭代 policy evaluationpolicy improvement“greedification”Policy IterationValue Iteration2024/5/25 周六20强化学习史忠植动态规划方法TTTTTTTTTTTTT2024/5/25 周六21强化学习史忠植自适应动态规划(ADP)Idea:use the constraints(state transition probabilities)between states to speed learning.Solve=value determination.No maximization over actions because agent is passive unlike in value iteration.using DPLarge state spacee.g.Backgammon:1050 equations in 1050 variables2024/5/25 周六22强化学习史忠植Value Iteration AlgorithmAN ALTERNATIVE ITERATION:(Singh,1993)(Important for model free learning)Stop Iteration when V(s)differs less than.Policy difference ratio=2/(1-)(Williams&Baird 1993b)2024/5/25 周六23强化学习史忠植Policy Iteration Algorithm Policies converge faster than values.Why faster convergence?2024/5/25 周六24强化学习史忠植Reinforcement Learning Deterministic transitionsStochastic transitionsis the probability to reaching state j when taking action a in state istart3211234+1-1A simple environment that presents the agent with a sequential decision problem:Move cost=0.04(Temporal)credit assignment problem sparse reinforcement problemOffline alg:action sequences determined ex anteOnline alg:action sequences is conditional on observations along the way;Important in stochastic environment(e.g.jet flying)2024/5/25 周六25强化学习史忠植Reinforcement Learning M=0.8 in direction you want to go 0.2 in perpendicular 0.1 left0.1 rightPolicy:mapping from states to actions3211234+1-10.7053211234+1-1 0.8120.762 0.868 0.912 0.660 0.655 0.611 0.388An optimal policy for the stochastic environment:utilities of states:EnvironmentObservable(accessible):percept identifies the statePartially observableMarkov property:Transition probabilities depend on state only,not on the path to the state.Markov decision problem(MDP).Partially observable MDP(POMDP):percepts does not have enough info to identify transition probabilities.2024/5/25 周六26强化学习史忠植Model Free MethodsModels of the environment:T:S x A (S)and R:S x A RDo we know them?Do we have to know them?oMonte Carlo MethodsoAdaptive Heuristic CriticoQ Learning2024/5/25 周六27强化学习史忠植Monte Carlo策略策略评价评价oGoal:learn Vp p(s)under P and R are unknown in advanceoGiven:some number of episodes under p p which contain soIdea:Average returns observed after visits to soEvery-Visit MC:average returns for every time s is visited in an episodeoFirst-visit MC:average returns only for first time s is visited in an episodeoBoth converge asymptotically123452024/5/25 周六28强化学习史忠植蒙特卡罗方法 Monte Carlo Methods oIdea:Hold statistics about rewards for each state Take the average This is the V(s)oBased only on experience oAssumes episodic tasks (Experience is divided into episodes and all episodes will terminate regardless of the actions selected.)oIncremental in episode-by-episode sense not step-by-step sense.2024/5/25 周六29强化学习史忠植Problem:Unvisited pairs(problem of maintaining exploration)For every make sure that:P(selected as a start state and action)0 (Assumption of exploring starts )蒙特卡罗方法蒙特卡罗方法 2024/5/25 周六30强化学习史忠植Monte Carlo方法TTTTTTTTTTTTTTTTTTTT2024/5/25 周六31强化学习史忠植蒙特卡罗控制蒙特卡罗控制How to select Policies:(Similar to policy evaluation)MC policy iteration:Policy evaluation using MC methods followed by policy improvement Policy improvement step:greedify with respect to value(or action-value)function2024/5/25 周六32强化学习史忠植时序差分学习时序差分学习 Temporal-Differencetarget:the actual return after time ttarget:an estimate of the return2024/5/25 周六33强化学习史忠植时序差分学习时序差分学习(TD)Idea:Do ADP backups on a per move basis,not for the whole state space.Theorem:Average value of U(i)converges to the correct value.Theorem:If is appropriately decreased as a function of times a state is visited(=Ni),then U(i)itself converges to the correct value2024/5/25 周六34强化学习史忠植时序差分学习时序差分学习 TDTTTTTTTTTTTTTTTTTTTT2024/5/25 周六35强化学习史忠植TD()A Forward ViewoTD()is a method for averaging all n-step backups nweight by n-1(time since visitation)n-return:oBackup using-return:2024/5/25 周六36强化学习史忠植时序差分学习算法时序差分学习算法 TD()Idea:update from the whole epoch,not just on state transition.Special cases:=1:Least-mean-square(LMS),Mont Carlo=0:TDIntermediate choice of (between 0 and 1)is best.Interplay with 2024/5/25 周六37强化学习史忠植时序差分学习算法时序差分学习算法2024/5/25 周六38强化学习史忠植时序差分学习算法收敛性时序差分学习算法收敛性TD()Theorem:Converges w.p.1 under certain boundaries conditions.Decrease i(t)s.t.In practice,often a fixed is used for all i and t.2024/5/25 周六39强化学习史忠植时序差分学习时序差分学习 TD2024/5/25 周六40强化学习史忠植Q-LearningWatkins,1989oEstimate the Q-function using some approximator (for example,linear regression or neural networks or decision trees etc.).oDerive the estimated policy as an argument of the maximum of the estimated Q-function.oAllow different parameter vectors at different time points.oLet us illustrate the algorithm with linear regression as the approximator,and of course,squared error as the appropriate loss function.2024/5/25 周六41强化学习史忠植Q-learningQ(a,i)Direct approach(ADP)would require learning a model .Q-learning does not:Do this update after each state transition:2024/5/25 周六42强化学习史忠植ExplorationTradeoff between exploitation(control)and exploration(identification)Extremes:greedy vs.random acting(n-armed bandit models)Q-learning converges to optimal Q-values if*Every state is visited infinitely often(due to exploration),*The action selection becomes greedy as time approaches infinity,and*The learning rate is decreased fast enough but not too fast(as we discussed in TD learning)2024/5/25 周六43强化学习史忠植Common exploration methods1.In value iteration in an ADP agent:Optimistic estimate of utility U+(i)2.-greedy methodNongreedy actions Greedy action3.Boltzmann explorationExploration funcR+if nNu o.w.2024/5/25 周六44强化学习史忠植Q-Learning AlgorithmoSetoForoThe estimated policy satisfies2024/5/25 周六45强化学习史忠植What is the intuition?oBellman equation gives oIf and the training set were infinite,then Q-learning minimizes which is equivalent to minimizing2024/5/25 周六46强化学习史忠植A-Learning Murphy,2003 and Robins,2004oEstimate the A-function(advantages)using some approximator,as in Q-learning.oDerive the estimated policy as an argument of the maximum of the estimated A-function.oAllow different parameter vectors at different time points.oLet us illustrate the algorithm with linear regression as the approximator,and of course,squared error as the appropriate loss function.2024/5/25 周六47强化学习史忠植A-Learning Algorithm(Inefficient Version)oForoThe estimated policy satisfies2024/5/25 周六48强化学习史忠植Differences between Q and A-learningoQ-learningnAt time t we model the main effects of the history,(St,At-1)and the action At and their interactionnOur Yt-1 is affected by how we modeled the main effect of the history in time t,(St,At-1)oA-learningnAt time t we only model the effects of At and its interaction with(St,At-1)nOur Yt-1 does not depend on a model of the main effect of the history in time t,(St,At-1)2024/5/25 周六49强化学习史忠植Q-Learning Vs.A-LearningoRelative merits and demerits are not completely known till now.oQ-learning has low variance but high bias.oA-learning has high variance but low bias.oComparison of Q-learning with A-learning involves a bias-variance trade-off.2024/5/25 周六50强化学习史忠植POMDP部分感知马氏决策过程 oRather than observing the state we observe some function of the state.oOb Observable functiona random variable for each states.oProblem:different states may look similarThe optimal strategy might need to consider the history.2024/5/25 周六51强化学习史忠植Framework of POMDPPOMDP由六元组定义。其中定义了环境潜在的马尔可夫决策模型上，是观察的集合，即系统可以感知的世界状态集合，观察函数：SAPD（）。系统在采取动作a转移到状态s时，观察函数确定其在可能观察上的概率分布。记为（s,a,o）。1 可以是S的子集，也可以与S无关2024/5/25 周六52强化学习史忠植POMDPsWhat if state information(from sensors)is noisy?Mostly the case!MDP techniques are suboptimal!Two halls are not the same.2024/5/25 周六53强化学习史忠植POMDPs A Solution StrategySE:Belief State Estimator(Can be based on HMM):MDP Techniques2024/5/25 周六54强化学习史忠植POMDP_信度状态方法oIdea:Given a history of actions and observable value,we compute a posterior distribution for the state we are in(belief state)oThe belief-state MDPnStates:distribution over S(states of the POMDP)nActions:as in POMDPnTransition:the posterior distribution(given the observation)Open Problem:How to deal with the continuous distribution?2024/5/25 周六55强化学习史忠植The Learning Process of Belief MDP2024/5/25 周六56强化学习史忠植Major Methods to Solve POMDP 算法名称基本思想学习值函数Memoryless policies直接采用直接采用标标准的准的强强化学化学习习算法算法Simple memory based approaches使用使用k个个历历史史观观察表示当前状察表示当前状态态UDM(Utile Distinction Memory)分解状分解状态态，构建有限状，构建有限状态态机模型机模型NSM(Nearest Sequence Memory)存存储储状状态历态历史，史，进进行距离度量行距离度量USM(Utile Suffix Memory)综综合合UDM和和NSM两种方法两种方法Recurrent-Q使用循使用循环环神神经经网网络进络进行状行状态预测态预测策略搜索Evolutionary algorithms使用使用遗传遗传算法直接算法直接进进行策略搜索行策略搜索Gradient ascent method使用梯度下降（上升）法搜索使用梯度下降（上升）法搜索2024/5/25 周六57强化学习史忠植强化学习中的函数估计RLFASubset of statesValue estimate as targetsV(s)Generalization of the value function to the entire state spaceis the TD operator.is the function approximation operator.2024/5/25 周六58强化学习史忠植并行两个迭代过程o值函数迭代过程o值函数逼近过程How to construct the M function?Using state cluster,interpolation,decision tree or neural network?2024/5/25 周六59强化学习史忠植oFunction Approximator:V(s)=f(s,w)oUpdate:Gradient-descent Sarsa:w w+a rt+1+g Q(st+1,at+1)-Q(st,at)w f(st,at,w)weight vectorStandard gradienttarget valueestimated valueOpen Problem:How to design the non-liner FA system which can converge with the incremental instances?2024/5/25 周六60强化学习史忠植Semi-MDPDiscrete timeHomogeneous discountContinuous timeDiscrete eventsInterval-dependent discountDiscrete timeDiscrete eventsInterval-dependent discountA discrete-time SMDP overlaid on an MDPCan be analyzed at either level.One approach to Temporal Hierarchical RL2024/5/25 周六61强化学习史忠植The equations2024/5/25 周六62强化学习史忠植Multi-agent MDPoDistributed RLoMarkov GameoBest ResponseEnvironmentactionstaterewardRLAgentRLAgent2024/5/25 周六63强化学习史忠植三种观点三种观点问题空间主要方法算法准则合作多agent强化学习分布、同构、分布、同构、合作合作环环境境交交换换状状态态提高学提高学习习收收敛敛速度速度交交换经验换经验交交换换策略策略交交换换建建议议基于平衡解多agent强化学习同构或异构、同构或异构、合作或合作或竞竞争争环环境境极小极大极小极大-Q-Q理性和收理性和收敛敛性性NASH-QNASH-QCE-QCE-QWoLFWoLF最佳响应多agent强化学习异构、异构、竞竞争争环环境境PHCPHC收收敛敛性和不性和不遗遗憾性憾性IGAIGAGIGAGIGAGIGA-WoLFGIGA-WoLF2024/5/25 周六64强化学习史忠植马尔可夫对策马尔可夫对策o在在n个个agent的系统中，定义离散的状态集的系统中，定义离散的状态集S（即对策集合（即对策集合G），），agent动作动作集集Ai的集合的集合A,联合奖赏函数联合奖赏函数Ri：SA1An 和状态转移函数和状态转移函数P：SA1AnPD（S）。2024/5/25 周六65强化学习史忠植基于平衡解方法的强化学习基于平衡解方法的强化学习Open Problem:Nash equilibrium or other equilibrium is enough?The optimal policy in single game is Nash equilibrium.2024/5/25 周六66强化学习史忠植Applications of RLoCheckers Samuel 59oTD-Gammon Tesauro 92oWorlds best downpeak elevator dispatcher Crites at al 95oInventory management Bertsekas et al 95n10-15%better than industry standardoDynamic channel assignment Singh&Bertsekas,Nie&Haykin 95nOutperforms best heuristics in the literatureoCart-pole Michie&Chambers 68-with bang-bang controloRobotic manipulation Grupen et al.93-oPath planningoRobot docking Lin 93oParkingoFootball Stone98oTetrisoMultiagent RL Tan 93,Sandholm&Crites 95,Sen 94-,Carmel&Markovitch

展开阅读全文