
Q-greedyUCB: a New Exploration Policy to Learn Resource-Efficient Scheduling


Authors: Yu Zhao; Joohyun Lee; Wei Chen

Affiliations: Department of Electrical and Electronic Engineering, Hanyang University, Ansan 15588, South Korea; Department of Electronic Engineering and Beijing National Research Center for Information Science and Technology, Tsinghua National Laboratory for Information Science and Technology (TNList), Tsinghua University, Beijing 100084, China

Publication: China Communications

Year/Volume/Issue: 2021, Vol. 18, No. 6

Pages: 12-23


Subject classification: 080904 [Engineering - Electromagnetic Field and Microwave Technology]; 12 [Management]; 0809 [Engineering - Electronic Science and Technology (Engineering or Science degree)]; 08 [Engineering]; 0810 [Engineering - Information and Communication Engineering]; 1201 [Management - Management Science and Engineering (Management or Engineering degree)]; 081104 [Engineering - Pattern Recognition and Intelligent Systems]; 080402 [Engineering - Measuring and Testing Technologies and Instruments]; 0804 [Engineering - Instrument Science and Technology]; 081001 [Engineering - Communication and Information Systems]; 0835 [Engineering - Software Engineering]; 0811 [Engineering - Control Science and Engineering]; 0812 [Engineering - Computer Science and Technology (Engineering or Science degree)]

Funding: This work was supported by the research fund of Hanyang University (HY-2019-N) and the National Key Research & Development Program (2018YFA0701601).

Keywords: reinforcement learning for average rewards; infinite-horizon Markov decision process; upper confidence bound; queue scheduling

Abstract: This paper proposes a Reinforcement Learning (RL) algorithm to find an optimal scheduling policy that minimizes the delay for a given energy constraint in a communication system where the environment, such as traffic arrival rates, is not known in advance and can change over time. For this purpose, the problem is formulated as an infinite-horizon Constrained Markov Decision Process (CMDP). To handle the constrained optimization problem, we first adopt the Lagrangian relaxation technique to solve it. Then, we propose a variant of Q-learning, Q-greedyUCB, which combines the ε-greedy and Upper Confidence Bound (UCB) algorithms to solve this constrained MDP problem. We mathematically prove that the Q-greedyUCB algorithm converges to an optimal policy. Simulation results also show that Q-greedyUCB finds an optimal scheduling strategy and is more efficient than Q-learning with ε-greedy, R-learning, and the Average-payoff RL (ARL) algorithm in terms of the cumulative regret. We also show that our algorithm can learn and adapt to changes of the environment, so as to obtain an optimal scheduling strategy under a given power constraint for the new environment.
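The abstract describes Q-greedyUCB as a tabular Q-learning variant whose action selection combines ε-greedy exploration with a UCB exploration bonus. A minimal illustrative sketch of that combination is below. It is not the authors' algorithm: the function names, the bonus constant `c`, the ε schedule, and the use of a discounted update in place of the paper's average-reward formulation are all assumptions made for illustration.

```python
import math
import random

def select_action(Q, N, s, t, c=1.0, eps=0.1):
    """Pick an action in state s by Q-value plus a UCB bonus,
    with an epsilon-greedy random fallback (illustrative mix only)."""
    if random.random() < eps:  # epsilon-greedy component: explore uniformly
        return random.randrange(len(Q[s]))
    # UCB component: untried actions get an infinite bonus so each
    # (state, action) pair is visited at least once
    scores = [
        Q[s][a] + (c * math.sqrt(math.log(t + 1) / N[s][a])
                   if N[s][a] > 0 else float("inf"))
        for a in range(len(Q[s]))
    ]
    return max(range(len(scores)), key=scores.__getitem__)

def q_learning_ucb(step_fn, n_states, n_actions,
                   steps=2000, alpha=0.1, gamma=0.9):
    """Tabular Q-learning driven by the UCB/epsilon-greedy selector above.
    step_fn(s, a) -> (next_state, reward) is a user-supplied environment;
    in the paper's setting the reward would fold in the Lagrangian energy
    penalty, which is omitted here for brevity."""
    Q = [[0.0] * n_actions for _ in range(n_states)]
    N = [[0] * n_actions for _ in range(n_states)]  # visit counts for UCB
    s = 0
    for t in range(steps):
        a = select_action(Q, N, s, t)
        s2, r = step_fn(s, a)
        N[s][a] += 1
        # standard discounted Q-learning update (the paper instead uses
        # an average-reward formulation for the infinite-horizon CMDP)
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2
    return Q
```

The UCB bonus shrinks as a (state, action) pair is visited more often, so exploration concentrates on actions whose value estimates are still uncertain, which is the intuition behind the regret improvement the abstract reports over plain ε-greedy.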
