Regret-Optimal Model-Free Reinforcement Learning for Discounted MDPs with Short Burn-In Time
Xiang Ji, Gen Li

TL;DR
This paper introduces the first regret-optimal model-free reinforcement learning algorithm for tabular infinite-horizon discounted MDPs. It combines variance reduction with slow-yet-adaptive switching of the execution policy, and it reaches optimal sample efficiency after only a short burn-in time.
Contribution
It presents a novel regret-optimal, model-free RL algorithm for discounted MDPs that needs only a short burn-in time and avoids the high memory and computational cost of existing regret-optimal algorithms.
Findings
Achieves regret optimality in discounted MDPs
Requires significantly less burn-in time than existing optimal algorithms
Uses variance reduction and adaptive policy switching techniques
Abstract
A crucial problem in reinforcement learning is learning the optimal policy. We study this in tabular infinite-horizon discounted Markov decision processes under the online setting. The existing algorithms either fail to achieve regret optimality or have to incur a high memory and computational cost. In addition, existing optimal algorithms all require a long burn-in time in order to achieve optimal sample efficiency, i.e., their optimality is not guaranteed unless the sample size surpasses a high threshold. We address both open problems by introducing a model-free algorithm that employs variance reduction and a novel technique that switches the execution policy in a slow-yet-adaptive manner. This is the first regret-optimal model-free algorithm in the discounted setting, with the additional benefit of a low burn-in time.
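To make the setting concrete, here is a minimal Python sketch of online learning in a tabular discounted MDP. It is not the paper's algorithm: it runs plain optimistic Q-learning with no exploration bonus and no variance reduction, and it stands in for "slow" policy switching with a simple doubling rule (the execution policy at a state is refreshed only when a visit count reaches a power of two), which is an assumption made for illustration. The regret it accumulates, the sum over steps of V*(s_t) - V^{pi_t}(s_t), is likewise assumed here as a common regret notion for this setting. All function names are hypothetical.

```python
import numpy as np


def value_iteration(P, R, gamma, iters=2000):
    """Approximate V* for a tabular MDP with P[s, a, s'] and R[s, a]."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        V = (R + gamma * (P @ V)).max(axis=1)
    return V


def policy_value(P, R, gamma, pi):
    """Exact V^pi for a deterministic policy pi[s], from (I - gamma P_pi) V = R_pi."""
    S = P.shape[0]
    idx = np.arange(S)
    return np.linalg.solve(np.eye(S) - gamma * P[idx, pi], R[idx, pi])


def run_slow_switch_q_learning(P, R, gamma, T, seed=0):
    """Online Q-learning on one trajectory; returns (regret, number of switches)."""
    rng = np.random.default_rng(seed)
    S, A = R.shape
    H = 1.0 / (1.0 - gamma)               # effective horizon
    Q = np.full((S, A), H)                # optimistic init, assumes rewards in [0, 1]
    N = np.zeros((S, A), dtype=int)       # visit counts
    pi = np.zeros(S, dtype=int)           # execution policy actually followed
    V_star = value_iteration(P, R, gamma)
    V_pi = policy_value(P, R, gamma, pi)
    s, regret, switches = 0, 0.0, 0
    for _ in range(T):
        regret += V_star[s] - V_pi[s]     # gap of the current execution policy
        a = pi[s]
        s_next = rng.choice(S, p=P[s, a])
        N[s, a] += 1
        n = int(N[s, a])
        lr = (H + 1.0) / (H + n)          # a standard rescaled learning rate
        Q[s, a] = (1.0 - lr) * Q[s, a] + lr * (R[s, a] + gamma * Q[s_next].max())
        # Doubling rule (illustrative): refresh pi[s] only when n is a power of two.
        if (n & (n - 1)) == 0:
            greedy = int(Q[s].argmax())
            if greedy != pi[s]:
                pi[s] = greedy
                switches += 1
                V_pi = policy_value(P, R, gamma, pi)
        s = s_next
    return regret, switches


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A, gamma = 5, 3, 0.9
    P = rng.dirichlet(np.ones(S), size=(S, A))   # random transition kernel
    R = rng.uniform(size=(S, A))                 # rewards in [0, 1]
    regret, switches = run_slow_switch_q_learning(P, R, gamma, T=5000)
    print(f"regret={regret:.1f}, policy switches={switches}")
```

Under the doubling rule, each state-action pair can trigger at most about log2(T) policy changes, so the execution policy changes rarely even though the Q-table is updated at every step. The paper's switching rule is described as adaptive and may differ substantially from this stand-in.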
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Smart Grid Energy Management
