Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP
Kefan Dong, Yuanhao Wang, Xiaoyu Chen, Liwei Wang

TL;DR
This paper demonstrates that a Q-learning algorithm with UCB exploration is sample efficient for infinite-horizon discounted MDPs, achieving near-optimal exploration complexity without a generative model.
Contribution
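It adapts Q-learning with UCB exploration to infinite-horizon discounted MDPs and proves a bound on the sample complexity of exploration that improves on the previous best result (delayed Q-learning) and matches the lower bound in $S$, $A$, and $\epsilon$ up to logarithmic factors.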
Findings
Sample complexity bound of $\tilde{O}\left(\frac{SA}{\epsilon^2(1-\gamma)^7}\right)$
Improves upon the previous bound of $\tilde{O}\left(\frac{SA}{\epsilon^4(1-\gamma)^8}\right)$
Matches lower bounds in $S$, $A$, and $\epsilon$, up to logarithmic factors
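To make the size of the improvement concrete, the two bounds listed above can be compared numerically. The sketch below drops all constants and logarithmic factors, so the function names and the numbers they return are illustrative only and are not results from the paper.

```python
def new_bound(S, A, eps, gamma):
    """SA / (eps^2 (1 - gamma)^7), ignoring constants and log factors."""
    return S * A / (eps ** 2 * (1 - gamma) ** 7)


def old_bound(S, A, eps, gamma):
    """SA / (eps^4 (1 - gamma)^8), ignoring constants and log factors."""
    return S * A / (eps ** 4 * (1 - gamma) ** 8)


def improvement_factor(eps, gamma):
    """Ratio old_bound / new_bound; S and A cancel out."""
    return 1.0 / (eps ** 2 * (1 - gamma))
```

Under these simplifications, with eps = 0.1 and gamma = 0.9 the ratio is roughly 1000: the gain comes from two fewer powers of 1/eps and one fewer power of 1/(1 - gamma).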
Abstract
A fundamental question in reinforcement learning is whether model-free algorithms are sample efficient. Recently, Jin et al. \cite{jin2018q} proposed a Q-learning algorithm with a UCB exploration policy and proved that it has a nearly optimal regret bound for finite-horizon episodic MDPs. In this paper, we adapt Q-learning with a UCB exploration bonus to infinite-horizon MDPs with discounted rewards \emph{without} accessing a generative model. We show that the \textit{sample complexity of exploration} of our algorithm is bounded by $\tilde{O}\left(\frac{SA}{\epsilon^2(1-\gamma)^7}\right)$. This improves the previously best known result of $\tilde{O}\left(\frac{SA}{\epsilon^4(1-\gamma)^8}\right)$ in this setting, achieved by delayed Q-learning \cite{strehl2006pac}, and matches the lower bound in terms of $\epsilon$ as well as $S$ and $A$, except for logarithmic factors.
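As a rough illustration of the idea in the abstract, the following is a minimal sketch of tabular Q-learning with a count-based UCB-style bonus on a single continuing trajectory, with no generative model. It is not the paper's algorithm as published: the effective horizon `H`, the step-size schedule, the bonus constant `c`, the logarithmic term, and the function name `ucb_q_learning` are all assumptions chosen for illustration.

```python
import math
import random


def ucb_q_learning(P, R, gamma, steps, c=1.0, seed=0):
    """P[s][a] is a list of next-state probabilities; R[s][a] is a reward in [0, 1]."""
    rng = random.Random(seed)
    S, A = len(P), len(P[0])
    v_max = 1.0 / (1.0 - gamma)              # upper bound on any discounted return
    H = 1.0 / (1.0 - gamma)                  # effective horizon (assumed choice)
    Q = [[v_max] * A for _ in range(S)]      # raw optimistic estimates
    Q_hat = [[v_max] * A for _ in range(S)]  # running minimum of Q, used for acting
    N = [[0] * A for _ in range(S)]          # visit counts per state-action pair
    s = 0
    for _ in range(steps):
        # Act greedily with respect to the optimistic estimate.
        a = max(range(A), key=lambda i: Q_hat[s][i])
        # Sample the next state from the environment along the trajectory.
        s_next = rng.choices(range(S), weights=P[s][a])[0]
        N[s][a] += 1
        k = N[s][a]
        alpha = (H + 1.0) / (H + k)          # step size (assumed schedule)
        # Exploration bonus shrinking like 1/sqrt(k); constants are assumptions.
        bonus = c * v_max * math.sqrt(H * math.log(k + 1) / k)
        target = R[s][a] + bonus + gamma * max(Q_hat[s_next])
        Q[s][a] = (1.0 - alpha) * Q[s][a] + alpha * target
        # Keep the acting estimate non-increasing and bounded by v_max.
        Q_hat[s][a] = min(Q_hat[s][a], Q[s][a])
        s = s_next
    return Q_hat, N
```

The bonus keeps rarely visited state-action pairs looking attractive, so exploration is driven by optimism rather than by random action selection; as counts grow, the bonus decays and the estimates move toward the Bellman target.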
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Machine Learning and Algorithms
Methods: Q-Learning
