Is Q-learning Provably Efficient?
Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, Michael I. Jordan

TL;DR
This paper proves that Q-learning with UCB exploration achieves near-optimal regret in episodic MDPs, matching model-based approaches up to a $\sqrt{H}$ factor, and establishes the first such bound for a model-free algorithm without access to a simulator.
Contribution
It provides the first regret bound for model-free Q-learning in episodic MDPs that matches the optimal regret of model-based methods up to a single $\sqrt{H}$ factor.
Findings
Q-learning with UCB exploration achieves regret $\tilde{O}(\sqrt{H^3 SAT})$.
Sample efficiency matches that of model-based approaches up to a single $\sqrt{H}$ factor.
First analysis showing $\sqrt{T}$ regret for model-free RL without a simulator.
Abstract
Model-free reinforcement learning (RL) algorithms, such as Q-learning, directly parameterize and update value functions or policies without explicitly modeling the environment. They are typically simpler, more flexible to use, and thus more prevalent in modern deep RL than model-based approaches. However, empirical work has suggested that model-free algorithms may require more samples to learn [Deisenroth and Rasmussen 2011, Schulman et al. 2015]. The theoretical question of "whether model-free algorithms can be made sample efficient" is one of the most fundamental questions in RL, and remains unsolved even in the basic scenario with finitely many states and actions. We prove that, in an episodic MDP setting, Q-learning with UCB exploration achieves regret $\tilde{O}(\sqrt{H^3 SAT})$, where $S$ and $A$ are the numbers of states and actions, $H$ is the number of steps per episode, and…
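To make "Q-learning with UCB exploration" concrete, here is a minimal tabular sketch in Python. It uses the simpler Hoeffding-style bonus rather than the Bernstein-style bonus behind the $\tilde{O}(\sqrt{H^3 SAT})$ bound, so it should be read as an illustration of the idea, not as the exact algorithm that attains that rate. The names `ucb_q_learning`, `toy_step`, and `toy_reward`, the bonus constant `c`, the failure probability `p`, the fixed initial state, and the two-state toy MDP are all assumptions made for this example and do not come from the page above.

```python
import math
import random


def ucb_q_learning(S, A, H, K, step, reward, c=1.0, p=0.1, seed=0):
    """Tabular episodic Q-learning with a Hoeffding-style UCB bonus (sketch).

    step(h, s, a, rng) -> next state index; reward(h, s, a) -> value in [0, 1].
    """
    rng = random.Random(seed)
    T = K * H                        # total number of steps
    iota = math.log(S * A * T / p)   # log factor used inside the bonus
    # Optimistic initialisation: every Q value starts at H.
    Q = [[[float(H)] * A for _ in range(S)] for _ in range(H)]
    # V has one extra row at h = H that stays zero (nothing left to collect).
    V = [[float(H)] * S for _ in range(H)] + [[0.0] * S]
    N = [[[0] * A for _ in range(S)] for _ in range(H)]  # visit counts
    returns = []
    for _ in range(K):
        s, total = 0, 0.0            # assumption: episodes start in state 0
        for h in range(H):
            # Act greedily with respect to the optimistic Q estimate.
            a = max(range(A), key=lambda act: Q[h][s][act])
            r = reward(h, s, a)
            s_next = step(h, s, a, rng)
            N[h][s][a] += 1
            t = N[h][s][a]
            alpha = (H + 1) / (H + t)                  # learning rate
            bonus = c * math.sqrt(H ** 3 * iota / t)   # exploration bonus
            target = r + V[h + 1][s_next] + bonus
            Q[h][s][a] = (1 - alpha) * Q[h][s][a] + alpha * target
            V[h][s] = min(float(H), max(Q[h][s]))      # clip the value at H
            total += r
            s = s_next
        returns.append(total)
    return Q, V, returns


def toy_step(h, s, a, rng):
    # Hypothetical dynamics: action 1 usually leads to state 0, action 0 to state 1.
    stay = 0.9 if a == 1 else 0.1
    return 0 if rng.random() < stay else 1


def toy_reward(h, s, a):
    # Hypothetical reward: being in state 0 pays 1, state 1 pays 0.
    return 1.0 if s == 0 else 0.0


Q, V, returns = ucb_q_learning(S=2, A=2, H=3, K=2000,
                               step=toy_step, reward=toy_reward, c=0.1)
print(sum(returns[-100:]) / 100)  # average return over the last 100 episodes
```

The learning rate $\alpha_t = (H+1)/(H+t)$ puts more weight on recent updates than the usual $1/t$ schedule, and the bonus shrinks as $1/\sqrt{t}$ with the visit count, so rarely tried state-action pairs stay optimistic for longer. The value `c=0.1` is chosen only so the toy run settles quickly; it is not a constant from the analysis.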
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Machine Learning and Algorithms
Methods: Q-Learning
