Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes
Zihan Zhang, Qiaomin Xie

TL;DR
This paper introduces provably efficient model-free reinforcement learning algorithms for infinite-horizon average-reward MDPs. The algorithms achieve regret with optimal dependence on $T$ in the online setting and sample complexity close to the minimax lower bound in the simulator setting.
Contribution
The paper develops the first algorithms with optimal T-dependence for weakly communicating MDPs and introduces two new techniques for average-reward RL.
Findings
Achieves $\tilde{O}(S^5A^2\,\text{sp}(h^*)\sqrt{T})$ regret in the online setting.
Achieves sample complexity close to the minimax lower bound in the simulator setting.
Introduces value-difference estimation and confidence region construction techniques.
Abstract
We develop several provably efficient model-free reinforcement learning (RL) algorithms for infinite-horizon average-reward Markov Decision Processes (MDPs). We consider both the online setting and the setting with access to a simulator. In the online setting, we propose model-free RL algorithms based on reference-advantage decomposition. Our algorithm achieves $\tilde{O}(S^5A^2\,\text{sp}(h^*)\sqrt{T})$ regret after $T$ steps, where $S \times A$ is the size of the state-action space, and $\text{sp}(h^*)$ is the span of the optimal bias function. Our results are the first to achieve optimal dependence in $T$ for weakly communicating MDPs. In the simulator setting, we propose a model-free RL algorithm that finds an $\epsilon$-optimal policy using samples, whereas the minimax lower…
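For readers unfamiliar with the quantities named in the abstract, the block below sketches the standard textbook definitions for a weakly communicating average-reward MDP. The symbols $r$ (reward function), $P$ (transition kernel), and $\rho^*$ (optimal average reward) are notation assumed here for illustration and do not appear in the abstract; the paper's own notation may differ.

```latex
% Assumed notation: r(s,a) is the reward, P(s'|s,a) the transition kernel,
% \rho^* the optimal average reward (gain), h^* the optimal bias function.

% Bellman optimality equation satisfied by (\rho^*, h^*):
\rho^* + h^*(s) = \max_{a} \Big[ r(s,a) + \sum_{s'} P(s' \mid s,a)\, h^*(s') \Big]
\quad \text{for all states } s.

% Span of the optimal bias function:
\mathrm{sp}(h^*) = \max_{s} h^*(s) - \min_{s} h^*(s).

% Regret after T steps along the trajectory (s_1,a_1),\dots,(s_T,a_T):
\mathrm{Regret}(T) = T\rho^* - \sum_{t=1}^{T} r(s_t, a_t).
```

Under these definitions, a regret bound growing as $\sqrt{T}$ means the average per-step gap to $\rho^*$ shrinks at rate $1/\sqrt{T}$.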
Taxonomy
Topics: Age of Information Optimization · Reinforcement Learning in Robotics
