Is Q-learning Provably Efficient?

Chi Jin; Zeyuan Allen-Zhu; Sebastien Bubeck; Michael I. Jordan

arXiv:1807.03765·cs.LG·July 11, 2018·25 cites

TL;DR

This paper proves that, in episodic MDPs, Q-learning with UCB exploration achieves regret $\tilde{O}(\sqrt{H^{3} S A T})$, matching the best model-based approaches up to a single $\sqrt{H}$ factor. It is the first $\sqrt{T}$ regret bound for a model-free algorithm that does not require a simulator.

Contribution

It provides the first $\sqrt{T}$ regret bound for model-free Q-learning in episodic MDPs without simulator access, matching the optimal regret of model-based methods up to a single $\sqrt{H}$ factor.

Findings

01

Q-learning with UCB exploration achieves regret $\tilde{O}(\sqrt{H^{3} S A T})$

02

Sample efficiency matches that of model-based approaches up to a single $\sqrt{H}$ factor

03

First analysis showing $\sqrt{T}$ regret for model-free RL without a simulator.
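
To make the $\sqrt{T}$ rate concrete, here is a short worked calculation. It is a sketch that drops constants and logarithmic factors (written $\lesssim$), and it uses average regret per step as an illustrative quantity that the listing itself does not discuss.

```latex
\begin{align*}
\mathrm{Regret}(T) &\lesssim \sqrt{H^{3} S A T} \\
\frac{\mathrm{Regret}(T)}{T} &\lesssim \sqrt{\frac{H^{3} S A}{T}} \\
\sqrt{\frac{H^{3} S A}{T}} \le \epsilon &\iff T \ge \frac{H^{3} S A}{\epsilon^{2}}
\end{align*}
```

Under these simplifications the average regret per step vanishes as $T$ grows, and it falls below a target $\epsilon$ once $T$ is on the order of $H^{3} S A / \epsilon^{2}$.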

Abstract

Model-free reinforcement learning (RL) algorithms, such as Q-learning, directly parameterize and update value functions or policies without explicitly modeling the environment. They are typically simpler, more flexible to use, and thus more prevalent in modern deep RL than model-based approaches. However, empirical work has suggested that model-free algorithms may require more samples to learn [Deisenroth and Rasmussen 2011, Schulman et al. 2015]. The theoretical question of "whether model-free algorithms can be made sample efficient" is one of the most fundamental questions in RL, and remains unsolved even in the basic scenario with finitely many states and actions. We prove that, in an episodic MDP setting, Q-learning with UCB exploration achieves regret $\tilde{O}(\sqrt{H^{3} S A T})$, where $S$ and $A$ are the numbers of states and actions, $H$ is the number of steps per episode, and…
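
The abstract describes tabular Q-learning with UCB exploration but does not spell out the update rule. The Python sketch below shows one plausible form of it. The learning rate `(H + 1) / (H + t)`, the bonus `c * sqrt(H**3 * iota / t)`, and the optimistic initialization to `H` are assumptions based on a Hoeffding-style bonus and are not stated in this listing. The toy environment (`toy_reset`, `toy_step`) and the parameter values are hypothetical, and this simple bonus is not claimed to attain the $\tilde{O}(\sqrt{H^{3} S A T})$ bound quoted above.

```python
import math
import random


def q_learning_ucb(step, reset, S, A, H, K, c=1.0, p=0.05):
    """Tabular episodic Q-learning with an optimistic UCB bonus (sketch).

    step(h, s, a) -> (reward in [0, 1], next_state); reset() -> initial state.
    The learning rate and bonus forms are assumptions, not taken from the listing.
    """
    T = K * H                          # total number of steps
    iota = math.log(S * A * T / p)     # log factor inside the bonus (assumed form)
    # Optimistic initialization: Q[h][s][a] = H for h < H; Q[H] is the zero terminal layer.
    Q = [[[float(H)] * A for _ in range(S)] for _ in range(H)]
    Q.append([[0.0] * A for _ in range(S)])
    N = [[[0] * A for _ in range(S)] for _ in range(H)]  # visit counts
    returns = []
    for _ in range(K):
        s = reset()
        total = 0.0
        for h in range(H):
            # Act greedily with respect to the optimistic Q estimate.
            a = max(range(A), key=lambda i: Q[h][s][i])
            r, s_next = step(h, s, a)
            N[h][s][a] += 1
            t = N[h][s][a]
            alpha = (H + 1) / (H + t)                  # assumed learning rate
            bonus = c * math.sqrt(H ** 3 * iota / t)   # assumed exploration bonus
            v_next = min(float(H), max(Q[h + 1][s_next]))  # value clipped at H
            Q[h][s][a] = (1 - alpha) * Q[h][s][a] + alpha * (r + v_next + bonus)
            total += r
            s = s_next
        returns.append(total)
    return Q, returns


# Hypothetical toy environment: action 1 always pays reward 1, action 0 pays 0.
def toy_reset():
    return 0


def toy_step(h, s, a):
    reward = 1.0 if a == 1 else 0.0
    next_state = random.randrange(2)
    return reward, next_state


random.seed(0)
Q, returns = q_learning_ucb(toy_step, toy_reset, S=2, A=2, H=3, K=200, c=0.05)
print(sum(returns[-20:]) / 20)  # average return over the last 20 episodes
```

The bonus shrinks as `1 / sqrt(t)` with the visit count `t`, so rarely tried state-action pairs keep an inflated estimate and are revisited by the greedy policy. No model of the transitions is stored, which is what makes the method model-free.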

Code & Models

Repositories

microsoft/intrepid
pytorch

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Machine Learning and Algorithms

Methods: Q-Learning