Search Results - Nantong Municipal Library

Policy iteration based Q-learning for linear nonzero-sum quadratic differential games

Science China (Information Sciences), 2019, Vol. 62, No. 5, pp. 195-213

Authors: Xinxing LI, Zhihong PENG, Li LIANG, Wenzhong ZHA (School of Automation, Beijing Institute of Technology; State Key Laboratory of Intelligent Control and Decision of Complex System; Information Science Academy, China Electronics Technology Group Corporation)

In this paper, a policy iteration-based Q-learning algorithm is proposed to solve infinite horizon linear nonzero-sum quadratic differential games with completely unknown dynamics. The Q-learning algorithm, which employs off-policy reinforcement learning (RL), can learn the Nash equilibrium and the corresponding value functions online, using the data sets generated by behavior policies. First, we prove equivalence between the proposed off-policy Q-learning algorithm and an offline PI algorithm by selecting specific initially admissible policies that can be learned online. Then, the convergence of the off-policy Q-learning algorithm is proved under a mild rank condition that can be easily met by injecting appropriate probing noises into behavior policies. The generated data sets can be repeatedly used during the learning process, which is computationally efficient. The simulation results demonstrate the effectiveness of the proposed Q-learning algorithm.

Keywords: adaptive dynamic programming (ADP); Q-learning; reinforcement learning (RL); linear nonzero-sum quadratic differential games; policy iteration (PI); off-policy

A novel policy iteration based deterministic Q-learning for discrete-time nonlinear systems

Source: Tongfang Journal Database

Science China (Information Sciences), 2015, Vol. 58, No. 12, pp. 147-161

Authors: WEI QingLai, LIU DeRong (State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences; School of Automation and Electrical Engineering, University of Science and Technology Beijing)

In this paper, a novel iterative Q-learning algorithm, called "policy iteration based deterministic Q-learning algorithm", is developed to solve the optimal control problems for discrete-time deterministic nonlinear systems. The idea is to use an iterative adaptive dynamic programming (ADP) technique to construct the iterative control law which optimizes the iterative Q function. When the optimal Q function is obtained, the optimal control law can be achieved by directly minimizing the optimal Q function, where the mathematical model of the system is not necessary. The convergence property is analyzed to show that the iterative Q function is monotonically non-increasing and converges to the solution of the optimality equation. It is also proven that any of the iterative control laws is a stable control law. Neural networks are employed to implement the policy iteration based deterministic Q-learning algorithm, by approximating the iterative Q function and the iterative control law, respectively. Finally, two simulation examples are presented to illustrate the performance of the developed algorithm.

Keywords: adaptive critic designs; adaptive dynamic programming; approximate dynamic programming; Q-learning; policy iteration; neural networks; nonlinear systems; optimal control
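
The alternation between evaluating an iterative Q function and minimizing it, as described in the abstract above, can be sketched in tabular form. This is only a minimal illustration under stated assumptions, not the paper's method: the integer-line system `step`, the stage cost `utility`, the state and action sets, and the initial admissible policy are all hypothetical, and a lookup table stands in for the neural networks. The function `step` is used purely as a simulator.

```python
STATES = range(-4, 5)          # hypothetical finite state set
ACTIONS = (-2, -1, 0, 1, 2)    # hypothetical finite control set

def step(x, u):
    """Hypothetical deterministic dynamics: move on the integer line, clipped to [-4, 4]."""
    return max(-4, min(4, x + u))

def utility(x, u):
    """Hypothetical stage cost, zero only at the origin with zero control."""
    return x * x + u * u

def evaluate(policy, sweeps=100):
    """Policy evaluation: iterate Q(x, u) = U(x, u) + Q(x', policy(x'))."""
    q = {(x, u): 0.0 for x in STATES for u in ACTIONS}
    for _ in range(sweeps):
        q = {(x, u): utility(x, u) + q[(step(x, u), policy[step(x, u)])]
             for x in STATES for u in ACTIONS}
    return q

def policy_iteration(policy, max_iters=20):
    """Alternate evaluation and greedy improvement until the policy stops changing."""
    costs = []
    for _ in range(max_iters):
        q = evaluate(policy)
        costs.append(q[(4, policy[4])])   # cost of the current policy from x = 4
        improved = {x: min(ACTIONS, key=lambda u: q[(x, u)]) for x in STATES}
        if improved == policy:
            break
        policy = improved
    return policy, q, costs

# Initial admissible policy (assumption): walk one step toward the origin.
initial = {x: (x < 0) - (x > 0) for x in STATES}
policy, q, costs = policy_iteration(initial)
print(costs)  # prints [34.0, 27.0]
```

In this toy run the cost from `x = 4` does not increase between iterations, which matches the monotonicity property the abstract states for the iterative Q function.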

Multiagent Reinforcement Learning: Rollout and Policy Iteration

Source: Tongfang Journal Database

IEEE/CAA Journal of Automatica Sinica, 2021, Vol. 8, No. 2, pp. 249-272

Author: Dimitri Bertsekas (Arizona State University (ASU), Tempe, AZ 85281, USA, and also with Massachusetts Institute of Technology (MIT), Cambridge, MA 02139)

We discuss the solution of complex multistage decision problems using methods that are based on the idea of policy iteration (PI), i.e., start from some base policy and generate an improved policy. Rollout is the simplest method of this type, where just one improved policy is generated. We can view PI as repeated application of rollout, where the rollout policy at each iteration serves as the base policy for the next iteration. In contrast with PI, rollout has a robustness property: it can be applied on-line and is suitable for on-line replanning. Moreover, rollout can use as base policy one of the policies produced by PI, thereby improving on that policy. This is the type of scheme underlying the prominently successful AlphaZero chess program. In this paper we focus on rollout and PI-like methods for problems where the control consists of multiple components each selected (conceptually) by a separate agent. This is the class of multiagent problems where the agents have a shared objective function, and a shared and perfect state information. Based on a problem reformulation that trades off control space complexity with state space complexity, we develop an approach, whereby at every stage, the agents sequentially (one-at-a-time) execute a local rollout algorithm that uses a base policy, together with some coordinating information from the other agents. The amount of total computation required at every stage grows linearly with the number of agents. By contrast, in the standard rollout algorithm, the amount of total computation grows exponentially with the number of agents. Despite the dramatic reduction in required computation, we show that our multiagent rollout algorithm has the fundamental cost improvement property of standard rollout: it guarantees an improved performance relative to the base policy. We also discuss autonomous multiagent rollout schemes that allow the agents to make decisions autonomously through the use of precomputed signaling information, which is sufficient to maintain the c...

Keywords: dynamic programming; multiagent problems; neuro-dynamic programming; policy iteration; reinforcement learning; rollout
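
The contrast between standard and one-agent-at-a-time rollout described in the abstract above can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's implementation: the line-world dynamics, `TARGETS`, `stage_cost`, the greedy `base_policy`, and the truncated horizon in `q_factor` are all hypothetical. Standard rollout scores every joint action, while the multiagent variant lets each agent optimize its own component in turn, with earlier agents fixed at their chosen actions and later agents at their base-policy actions.

```python
from itertools import product

ACTIONS = (-1, 0, 1)   # hypothetical per-agent moves on a line
TARGETS = (-3, 5)      # hypothetical task locations

def step(state, joint_action):
    """Each agent moves by its own action component."""
    return tuple(p + a for p, a in zip(state, joint_action))

def stage_cost(state):
    """Shared cost: each target is charged its distance to the nearest agent."""
    return sum(min(abs(p - t) for p in state) for t in TARGETS)

def base_policy(state):
    """Base policy (assumption): every agent walks toward its own nearest target."""
    def move(p):
        t = min(TARGETS, key=lambda t: abs(p - t))
        return (t > p) - (t < p)
    return tuple(move(p) for p in state)

def q_factor(state, joint_action, horizon=6):
    """Cost of applying joint_action now, then following base_policy."""
    state = step(state, joint_action)
    total = stage_cost(state)
    for _ in range(horizon - 1):
        state = step(state, base_policy(state))
        total += stage_cost(state)
    return total

def standard_rollout(state, q=q_factor):
    """Minimize over all joint actions: len(ACTIONS) ** m evaluations."""
    return min(product(ACTIONS, repeat=len(state)), key=lambda u: q(state, u))

def multiagent_rollout(state, q=q_factor):
    """Agents choose one at a time: len(ACTIONS) * m evaluations."""
    joint = list(base_policy(state))
    for i in range(len(state)):
        def with_choice(a, i=i):
            # Agents before i keep their rollout choices; agents after i use base actions.
            return tuple(joint[:i] + [a] + joint[i + 1:])
        joint[i] = min(ACTIONS, key=lambda a: q(state, with_choice(a)))
    return tuple(joint)
```

With three agents and three actions each, `standard_rollout` makes 27 Q-factor evaluations per stage and `multiagent_rollout` makes 9. Because each agent's candidate set includes its base action, the Q-factor of the multiagent choice is never worse than that of the base joint action at the current stage.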

A policy iteration method for improving robot assembly trajectory efficiency

Chinese Journal of Aeronautics, 2023, Vol. 36, No. 3, pp. 436-448

Authors: Qi ZHANG, Zongwu XIE, Baoshi CAO, Yang LIU (State Key Laboratory of Robotics and System, Harbin Institute of Technology, Harbin 150001, China)

Bolt assembly by robots is a vital and difficult task for replacing astronauts in extravehicular activities (EVA), but the trajectory efficiency still needs to be improved during the wrench insertion into the hex hole of [...]. In this paper, a policy iteration method based on reinforcement learning (RL) is proposed, by which the problem of trajectory efficiency improvement is constructed as an issue of RL-based objective [...]. [...], the projection relation between raw data and state-action space is established, and then a policy iteration initialization method is designed based on the projection to provide the initialization policy for [...] iteration based on the protective policy is applied to continuously evaluate and optimize the action-value function of all state-action pairs till the convergence is [...]. To verify the feasibility and effectiveness of the proposed method, a noncontact demonstration experiment with human supervision is [...]. [...] results show that the initialization policy and the generated policy can be obtained by the policy iteration method in a limited number of demonstrations. A comparison between the experiments with two different assembly tolerances shows that the convergent generated policy possesses higher trajectory efficiency than the conservative [...]. In addition, this method can ensure safety during the training process and improve utilization efficiency of demonstration data.

Keywords: bolt assembly; policy initialization; policy iteration; reinforcement learning (RL); robotic assembly; trajectory efficiency

Policy Iteration Approach to Average Optimal Control Problems for Boolean Control Networks

36th Chinese Control Conference

Authors: Yuhu Wu, Ximing Sun, Wei Wang, Tielong Shen (School of Control Science and Engineering, Dalian University of Technology; Department of Mechanical Engineering, Sophia University)

ISBN: (print) 9781538629185

This paper investigates the average infinite horizon optimal control problem for Boolean control networks (BCNs). Based on the semi-tensor product of matrices and Jordan decomposition technique, an optimality equation for the average infinite horizon problem of BCNs is [...]. By resorting to Laurent series expression, a policy iteration algorithm, which can find the optimal solution in finite iteration steps, is [...]. As applications, the output tracking problem for BCNs and the intervention problem of cAMP receptor protein are investigated.

Keywords: Boolean control networks; logical networks; semi-tensor product; optimal control; policy iteration

Source: CNKI Conference

Approximate policy iteration: a survey and some new methods

Journal of Control Theory and Applications (控制理论与应用, English edition), 2011, Vol. 9, No. 3, pp. 310-335

Author: Dimitri P. BERTSEKAS (Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology)

We consider the classical policy iteration method of dynamic programming (DP), where approximations and simulation are used to deal with the curse of dimensionality. We survey a number of issues: convergence and rate of convergence of approximate policy evaluation methods, singularity and susceptibility to simulation noise of policy evaluation, exploration issues, constrained and enhanced policy iteration, policy oscillation and chattering, and optimistic and distributed policy iteration. Our discussion of policy evaluation is couched in general terms and aims to unify the available methods in the light of recent research developments and to compare the two main policy evaluation approaches: projected equations and temporal differences (TD), and aggregation. In the context of these approaches, we survey two different types of simulation-based algorithms: matrix inversion methods, such as least-squares temporal difference (LSTD), and iterative methods, such as least-squares policy evaluation (LSPE) and TD(λ), and their scaled variants. We discuss a recent method, based on regression and regularization, which rectifies the unreliability of LSTD for nearly singular projected Bellman equations. An iterative version of this method belongs to the LSPE class of methods and provides the connecting link between LSTD and LSPE. Our discussion of policy improvement focuses on the role of policy oscillation and its effect on performance guarantees. We illustrate that policy evaluation when done by the projected equation/TD approach may lead to policy oscillation, but when done by aggregation it does not. This implies better error bounds and more regular performance for aggregation, at the expense of some loss of generality in cost function representation capability. Hard aggregation provides the connecting link between projected equation/TD-based and aggregation-based policy evaluation, and is characterized by favorable error bounds.

Keywords: dynamic programming; policy iteration; projected equation; aggregation; chattering; regularization
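
As a hedged illustration of the matrix inversion, iterative, and regularized methods named in the abstract above, the usual discounted-cost formulation can be written as below. All symbols are assumptions introduced here rather than taken from this record: $\Phi$ is the feature matrix, $\Xi$ the diagonal matrix of steady-state probabilities, $P$ and $g$ the transition matrix and expected stage cost vector of the policy being evaluated, $\alpha$ the discount factor, $\hat C$ and $\hat d$ simulation-based estimates, $\gamma$ a stepsize, $G$ a scaling matrix, and $\beta$, $\Sigma$, $\bar r$ a regularization weight, a positive definite weighting matrix, and a prior guess.

```latex
% Projected Bellman equation in low-dimensional form
C r^* = d, \qquad C = \Phi^\top \Xi (I - \alpha P)\Phi, \qquad d = \Phi^\top \Xi g

% LSTD (matrix inversion): solve with simulation-based estimates
\hat r = \hat C^{-1} \hat d

% LSPE-type (iterative) scheme with stepsize \gamma and scaling matrix G
r_{k+1} = r_k - \gamma\, G \bigl(\hat C r_k - \hat d\bigr)

% Regression with regularization, intended for nearly singular \hat C
\hat r = \bigl(\hat C^\top \Sigma^{-1} \hat C + \beta I\bigr)^{-1}
         \bigl(\hat C^\top \Sigma^{-1} \hat d + \beta \bar r\bigr)
```

When $\hat C$ is nearly singular, the plain inverse in the LSTD line amplifies simulation noise in $\hat d$, which is the unreliability the regularized form is meant to reduce.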

Optimal Tracking Control for Reconfigurable Manipulators Based on Critic-only Policy Iteration Algorithm

36th Chinese Control Conference

Authors: Hongbing Xia, Bo Zhao, Yuanchun Li (Department of Control Science and Engineering, Changchun University of Technology; The State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences)

ISBN: (print) 9781538629185

This paper tackles the optimal tracking control problem for reconfigurable manipulators based on the critic-only policy iteration (CoPI) algorithm. By system transformation, the optimal tracking control problem is transformed into an optimal regulation problem. The optimal tracking controller is composed of the desired controller and the approximate optimal feedback one. The desired controller is developed to maintain the desired tracking performance at the steady-state, while the approximate optimal feedback controller is designed to stabilize the tracking error dynamics in an optimal manner. Then, a critic neural network is used to estimate the optimal performance index function, and the optimal feedback control is obtained by the CoPI algorithm. The convergence of the proposed method is analyzed and it is shown that the closed-loop system based on CoPI is uniformly ultimately bounded by using the Lyapunov approach. Finally, simulation studies are given to show the effectiveness of the developed method.

Keywords: adaptive dynamic programming; reconfigurable manipulators; optimal tracking control; policy iteration; neural networks

Source: CNKI Conference

A novel policy iteration based deterministic Q-learning for discrete-time nonlinear systems

Science China Chemistry, 2015, Vol. 58, No. 12, pp. 143-157

Keywords: adaptive critic designs; adaptive dynamic programming; approximate dynamic programming; Q-learning; policy iteration; neural networks; nonlinear systems; optimal control

Adaptive Optimal Control of Space Tether System for Payload Capture via Policy Iteration

Source: VIP Journal Database

Transactions of Nanjing University of Aeronautics and Astronautics, 2021, Vol. 38, No. 4, pp. 560-570

Authors: FENG Yiting, ZHANG Ming, GUO Wenhao, WANG Changqing (School of Automation, Northwestern Polytechnical University, Xi’an 710129, P.R. China; Beijing Institute of Aerospace Systems Engineering, Beijing 100076, P.R. China)

The libration control problem of space tether system (STS) for post-capture of payload is [...]. The process of payload capture will cause tether swing and deviation from the nominal position, resulting in the failure of capture [...]. Due to unknown inertial parameters after capturing the payload, an adaptive optimal control based on policy iteration is developed to stabilize the uncertain dynamic system in the post-capture [...]. By introducing integral reinforcement learning (IRL) scheme, the algebraic Riccati equation (ARE) can be solved online without known [...]. To avoid computational burden from iteration equations, the online implementation of policy iteration algorithm is provided by the least-squares solution [...]. [...], the effectiveness of the algorithm is validated by numerical simulations.

Keywords: space tether system (STS); payload capture; policy iteration; integral reinforcement learning (IRL); state feedback
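
For orientation, the textbook form of policy iteration with integral reinforcement learning for a linear-quadratic problem is sketched below. This is an assumption-laden illustration and not necessarily the formulation used in the paper above: the linear model $\dot x = Ax + Bu$, the weights $Q$ and $R$, the reinforcement interval $T$, and the stabilizing initial gain $K_0$ are all introduced here. In this standard form the evaluation step does not use $A$, while the improvement step still uses $B$.

```latex
% Cost and current policy
J = \int_0^\infty \bigl(x^\top Q x + u^\top R u\bigr)\,dt, \qquad u = -K_k x

% Policy evaluation (IRL Bellman equation over an interval of length T)
x(t)^\top P_k\, x(t) - x(t+T)^\top P_k\, x(t+T)
  = \int_t^{t+T} x(\tau)^\top \bigl(Q + K_k^\top R K_k\bigr) x(\tau)\,d\tau

% Policy improvement
K_{k+1} = R^{-1} B^\top P_k

% Fixed point: the algebraic Riccati equation
A^\top P + P A + Q - P B R^{-1} B^\top P = 0
```

Collecting the evaluation equation at several time instants gives a linear system in the entries of $P_k$, which is where a least-squares solution of the kind mentioned in the abstract would enter.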