Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP

Kefan Dong; Yuanhao Wang; Xiaoyu Chen; Liwei Wang

arXiv:1901.09311·cs.LG·September 30, 2019·37 cites

TL;DR

This paper demonstrates that a Q-learning algorithm with UCB exploration is sample efficient for infinite-horizon discounted MDPs, achieving near-optimal exploration complexity without a generative model.

Contribution

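It adapts Q-learning with a UCB exploration bonus to infinite-horizon discounted MDPs and proves a sample complexity of exploration of $\tilde{O}(SA/(\epsilon^2(1-\gamma)^7))$, which improves on the previous best bound for this setting and matches the lower bound in $S$, $A$, and $\epsilon$ up to logarithmic factors.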

Findings

01

Sample complexity bound of $\tilde{O}(SA/(\epsilon^2(1-\gamma)^7))$

02

Improves upon previous bound of $\tilde{O}(SA/(\epsilon^4(1-\gamma)^8))$

03

Matches the lower bound in $S$, $A$, and $\epsilon$, up to logarithmic factors
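
To make the gap between the two bounds concrete, here is a small sketch that evaluates $SA/(\epsilon^2(1-\gamma)^7)$ and $SA/(\epsilon^4(1-\gamma)^8)$ with logarithmic factors and constants dropped. The function names and the example values of $S$, $A$, $\epsilon$, and $\gamma$ are illustrative assumptions, not taken from the paper.

```python
def new_bound(n_states, n_actions, eps, gamma):
    """SA / (eps^2 (1 - gamma)^7), ignoring log factors and constants."""
    return n_states * n_actions / (eps ** 2 * (1.0 - gamma) ** 7)


def old_bound(n_states, n_actions, eps, gamma):
    """SA / (eps^4 (1 - gamma)^8), ignoring log factors and constants."""
    return n_states * n_actions / (eps ** 4 * (1.0 - gamma) ** 8)


def improvement_factor(eps, gamma):
    """Ratio old_bound / new_bound, which simplifies to 1 / (eps^2 (1 - gamma))."""
    return 1.0 / (eps ** 2 * (1.0 - gamma))


if __name__ == "__main__":
    # Hypothetical problem size and accuracy, chosen only for illustration.
    S, A, eps, gamma = 10, 4, 0.1, 0.9
    print("new bound:", new_bound(S, A, eps, gamma))
    print("old bound:", old_bound(S, A, eps, gamma))
    print("ratio    :", improvement_factor(eps, gamma))
```

The ratio does not depend on $S$ or $A$: it is $1/(\epsilon^2(1-\gamma))$, so with $\epsilon = 0.1$ and $\gamma = 0.9$ the newer bound is smaller by a factor of about 1000.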

Abstract

A fundamental question in reinforcement learning is whether model-free algorithms are sample efficient. Recently, Jin et al. \cite{jin2018q} proposed a Q-learning algorithm with a UCB exploration policy and proved that it has a nearly optimal regret bound for finite-horizon episodic MDPs. In this paper, we adapt Q-learning with a UCB exploration bonus to infinite-horizon MDPs with discounted rewards \emph{without} access to a generative model. We show that the \textit{sample complexity of exploration} of our algorithm is bounded by $\tilde{O}\left(\frac{SA}{\epsilon^{2}(1-\gamma)^{7}}\right)$. This improves on the previously best known result of $\tilde{O}\left(\frac{SA}{\epsilon^{4}(1-\gamma)^{8}}\right)$ in this setting, achieved by delayed Q-learning \cite{strehl2006pac}, and matches the lower bound in terms of $\epsilon$ as well as $S$ and $A$, up to logarithmic factors.
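
The idea of Q-learning with a UCB exploration bonus on a single, non-episodic trajectory can be sketched as follows. This is a minimal illustration under stated assumptions, not a transcription of the paper's algorithm: the learning-rate schedule, the effective horizon `horizon`, the form of the bonus, the constant `c`, and the two-state `toy_step` environment are all choices made here for illustration.

```python
import math


def toy_step(state, action):
    """Hypothetical deterministic 2-state MDP with rewards in [0, 1]."""
    if state == 0:
        return (0.1, 0) if action == 0 else (0.0, 1)
    return (1.0, 1) if action == 0 else (0.0, 0)


def ucb_q_learning(step, n_states, n_actions, gamma=0.9, eps=0.1,
                   delta=0.1, c=1.0, n_steps=5000, start_state=0):
    """Tabular Q-learning with an optimistic count-based bonus (sketch)."""
    v_max = 1.0 / (1.0 - gamma)  # largest possible discounted return
    # Assumed effective horizon: steps until gamma^H drops to about (1 - gamma) * eps.
    horizon = math.log(1.0 / ((1.0 - gamma) * eps)) / math.log(1.0 / gamma)

    q = [[v_max] * n_actions for _ in range(n_states)]      # raw estimates
    q_hat = [[v_max] * n_actions for _ in range(n_states)]  # running minimum
    counts = [[0] * n_actions for _ in range(n_states)]

    s = start_state
    for _ in range(n_steps):
        # Act greedily with respect to the optimistic estimate.
        a = max(range(n_actions), key=lambda i: q_hat[s][i])
        r, s_next = step(s, a)

        counts[s][a] += 1
        k = counts[s][a]
        alpha = (horizon + 1.0) / (horizon + k)  # assumed learning rate
        log_term = math.log(n_states * n_actions * (k + 1) * (k + 2) / delta)
        bonus = c / (1.0 - gamma) * math.sqrt(horizon * log_term / k)

        v_next = max(q_hat[s_next])
        q[s][a] = (1.0 - alpha) * q[s][a] + alpha * (r + bonus + gamma * v_next)
        q_hat[s][a] = min(q_hat[s][a], q[s][a])  # keep estimates non-increasing
        s = s_next  # no reset and no generative model: follow one trajectory
    return q_hat, counts


if __name__ == "__main__":
    q_hat, counts = ucb_q_learning(toy_step, n_states=2, n_actions=2)
    print(q_hat)
    print(counts)
```

The agent only ever observes the transition it actually takes, which is what distinguishes this setting from one with a generative model that can sample any state-action pair on demand. The bonus shrinks as a pair is visited more often, so rarely tried actions keep an inflated value and continue to be explored.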

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Machine Learning and Algorithms

Methods: Q-Learning