Minimax Regret Bounds for Reinforcement Learning
Mohammad Gheshlaghi Azar, Ian Osband, Rémi Munos

TL;DR
This paper introduces a reinforcement learning algorithm for finite-horizon MDPs whose provable regret bound improves on the previous best, achieved by UCRL2, and matches the known lower bound, up to a logarithmic factor, in certain regimes.
Contribution
It presents an optimistic variant of value iteration with a tighter regret bound, obtained by applying concentration inequalities to the optimal value function as a whole and by using Bernstein-based exploration bonuses.
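To make the idea concrete, here is a minimal sketch of backward induction with a variance-aware (Bernstein-style) bonus. It is an illustration under stated assumptions, not the paper's exact algorithm: the function name `optimistic_value_iteration`, the bonus constants, the confidence level `delta`, and the assumption of known rewards in [0, 1] are all hypothetical choices made for this example.

```python
import numpy as np

def optimistic_value_iteration(counts, reward, H, delta=0.1):
    """Backward induction with a Bernstein-style bonus (illustrative constants).

    counts[s, a, s2] : observed transition counts
    reward[s, a]     : rewards, assumed known and in [0, 1]
    Returns Q of shape (H, S, A) and V of shape (H + 1, S).
    """
    S, A, _ = counts.shape
    n = counts.sum(axis=2)                    # visits to each (s, a)
    n_safe = np.maximum(n, 1)                 # avoid division by zero
    p_hat = counts / n_safe[:, :, None]       # empirical transition estimate
    log_term = np.log(2 * S * A * H / delta)  # assumed confidence term

    Q = np.zeros((H, S, A))
    V = np.zeros((H + 1, S))                  # V[H] = 0 at the end of the episode
    for h in range(H - 1, -1, -1):
        mean_next = p_hat @ V[h + 1]          # expected next value, shape (S, A)
        # Empirical variance of the next-state value under p_hat.
        var_next = np.maximum(p_hat @ V[h + 1] ** 2 - mean_next ** 2, 0.0)
        # Variance term plus a lower-order 1/n term (constants are assumptions).
        bonus = (np.sqrt(2 * var_next * log_term / n_safe)
                 + (H - h) * log_term / n_safe)
        q = reward + mean_next + bonus
        # Unvisited pairs get the largest possible return from step h.
        q = np.where(n > 0, q, H - h)
        Q[h] = np.minimum(q, H - h)           # never exceed the maximum return
        V[h] = Q[h].max(axis=1)
    return Q, V
```

The bonus shrinks as a state-action pair is visited more often, and it shrinks faster where the next-state values have low empirical variance, which is the intuition behind using a Bernstein-type bonus in place of a worst-case range.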
Findings
Achieves a regret bound of Õ(√(HSAT) + H²S²A + H√T)
Matches the lower bound Ω(√(HSAT)), up to a logarithmic factor, when T ≥ H³S³A and SA ≥ H
Improves scaling in state space and horizon compared to prior algorithms
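The regime in which the bound matches the lower bound can be checked numerically. The sketch below compares the three terms of the bound while ignoring constants and logarithmic factors. The helper names `regret_terms` and `leading_term_dominates` are hypothetical and introduced only for this example. Comparing terms directly shows that H√T ≤ √(HSAT) exactly when H ≤ SA, and that H²S²A ≤ √(HSAT) exactly when T ≥ H³S³A.

```python
import math

def regret_terms(H, S, A, T):
    """The three terms of the bound, without constants or log factors."""
    return {
        "sqrt(HSAT)": math.sqrt(H * S * A * T),
        "H^2 S^2 A": H**2 * S**2 * A,
        "H sqrt(T)": H * math.sqrt(T),
    }

def leading_term_dominates(H, S, A, T):
    """True when sqrt(HSAT) is at least as large as both other terms."""
    terms = regret_terms(H, S, A, T)
    lead = terms["sqrt(HSAT)"]
    return all(value <= lead for value in terms.values())

# Example: H=5, S=10, A=4 gives SA = 40 >= H and H^3 S^3 A = 500000.
print(leading_term_dominates(5, 10, 4, 10**6))   # T above the threshold
print(leading_term_dominates(5, 10, 4, 10**3))   # T below the threshold
```

For small T the H²S²A term dominates, so the match with the lower bound only holds once enough time-steps have elapsed.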
Abstract
We consider the problem of provably optimal exploration in reinforcement learning for finite horizon MDPs. We show that an optimistic modification to value iteration achieves a regret bound of $\tilde{O}(\sqrt{HSAT} + H^2S^2A + H\sqrt{T})$ where $H$ is the time horizon, $S$ the number of states, $A$ the number of actions and $T$ the number of time-steps. This result improves over the best previous known bound $\tilde{O}(HS\sqrt{AT})$ achieved by the UCRL2 algorithm of Jaksch et al., 2010. The key significance of our new results is that when $T \geq H^3S^3A$ and $SA \geq H$, it leads to a regret of $\tilde{O}(\sqrt{HSAT})$ that matches the established lower bound of $\Omega(\sqrt{HSAT})$ up to a logarithmic factor. Our analysis contains two key insights. We use careful application of concentration inequalities to the optimal value function as a whole, rather than to the transitions…
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Machine Learning and Algorithms
