Almost Optimal Model-Free Reinforcement Learning via Reference-Advantage Decomposition
Zihan Zhang, Yuan Zhou, Xiangyang Ji

TL;DR
This paper introduces UCB-Advantage, a model-free reinforcement learning algorithm for finite-horizon episodic MDPs whose regret bound matches the best known model-based algorithms and the information-theoretic lower bound up to logarithmic factors.
Contribution
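The paper presents UCB-Advantage, a model-free RL algorithm whose regret bound improves on that of [Jin et al., 2018]. The algorithm also has low local switching cost and applies to concurrent reinforcement learning, improving on the results of [Bai et al., 2019].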
Findings
Achieves a $\tilde{O}(\sqrt{H^2SAT})$ regret bound
Matches the best known model-based algorithms and the information-theoretic lower bound, up to logarithmic factors
Has low local switching cost and supports concurrent RL
Abstract
We study the reinforcement learning problem in the setting of finite-horizon episodic Markov Decision Processes (MDPs) with $S$ states, $A$ actions, and episode length $H$. We propose a model-free algorithm UCB-Advantage and prove that it achieves $\tilde{O}(\sqrt{H^2SAT})$ regret where $T = KH$ and $K$ is the number of episodes to play. Our regret bound improves upon the results of [Jin et al., 2018] and matches the best known model-based algorithms as well as the information theoretic lower bound up to logarithmic factors. We also show that UCB-Advantage achieves low local switching cost and applies to concurrent reinforcement learning, improving upon the recent results of [Bai et al., 2019].
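To get a feel for the size of the improvement, the sketch below compares leading-order regret terms with constants and logarithmic factors dropped. It is only an illustration, not part of the paper: the helper `leading_regret` and the values of `H`, `S`, `A`, and `K` are hypothetical, and the $\tilde{O}(\sqrt{H^3SAT})$ rate used for the earlier model-free bound is the Bernstein-bonus result from [Jin et al., 2018], which this page does not state explicitly.

```python
import math


def leading_regret(H: int, S: int, A: int, K: int, h_power: int) -> float:
    """Leading-order term sqrt(H^h_power * S * A * T) with T = K * H.

    Constants and logarithmic factors are ignored, so the returned values
    are only meaningful relative to one another.
    """
    T = K * H  # total number of steps across K episodes of length H
    return math.sqrt(H ** h_power * S * A * T)


# Hypothetical problem sizes, chosen only for illustration.
H, S, A, K = 16, 10, 4, 10_000

ucb_advantage = leading_regret(H, S, A, K, h_power=2)  # sqrt(H^2 SAT)
ucb_bernstein = leading_regret(H, S, A, K, h_power=3)  # sqrt(H^3 SAT), assumed prior rate

# The ratio of the two leading terms is sqrt(H), independent of S, A, and K.
ratio = ucb_bernstein / ucb_advantage
print(f"UCB-Advantage leading term: {ucb_advantage:.1f}")
print(f"Prior model-free leading term: {ucb_bernstein:.1f}")
print(f"Ratio: {ratio:.3f} (sqrt(H) = {math.sqrt(H):.3f})")
```

Under these assumptions the gap between the two bounds is a factor of $\sqrt{H}$, so it grows with the episode length and does not depend on the number of states, actions, or episodes.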
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Adversarial Robustness in Machine Learning
