Variance-Aware Regret Bounds for Undiscounted Reinforcement Learning in MDPs
Mohammad Sadegh Talebi, Odalric-Ambrym Maillard

TL;DR
This paper introduces variance-aware regret bounds for undiscounted reinforcement learning in MDPs. The bounds replace the diameter of the MDP with the local variance of the bias function, which gives tighter performance guarantees than existing diameter-based bounds.
Contribution
It provides a novel analysis of the KL-UCRL algorithm with regret bounds that depend on local variance, enhancing understanding of regret in ergodic MDPs.
Findings
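The regret of KL-UCRL scales as $\widetilde{O}\big(\sqrt{S \sum_{s,a} V^\star_{s,a} T}\big)$ with high probability for ergodic MDPs, where $V^\star_{s,a}$ is the local variance of the bias function at the state-action pair $(s,a)$.
The new bound improves upon the previous $\widetilde{O}(DS\sqrt{AT})$ bound, where $D$ is the diameter of the MDP and $A$ the maximum number of actions per state.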
In some benchmark MDPs, the leading term of the new bound is an order of magnitude smaller than that of the previous bound.
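A short worked comparison shows why a variance-based leading term cannot exceed the diameter-based one, up to constants and logarithmic factors. It is a sketch under stated assumptions, not the paper's own derivation: rewards lie in $[0,1]$, and it uses two standard facts that this summary does not spell out, namely Popoviciu's inequality (a random variable with range $R$ has variance at most $R^2/4$) and the bound $\mathrm{span}(b^\star) \le D$ on the optimal bias function $b^\star$ of a communicating MDP. Here $V^\star_{s,a}$ is the variance of $b^\star$ under the next-state distribution of $(s,a)$, $S$ and $A$ are the numbers of states and actions, $D$ is the diameter, and $T$ is the horizon.

```latex
\begin{align*}
V^\star_{s,a}
  &\le \tfrac{1}{4}\,\mathrm{span}(b^\star)^2
   \le \tfrac{1}{4} D^2
   && \text{(Popoviciu, then } \mathrm{span}(b^\star) \le D\text{)} \\
\sum_{s,a} V^\star_{s,a}
  &\le \tfrac{1}{4}\, S A D^2 \\
\sqrt{S \sum_{s,a} V^\star_{s,a}\, T}
  &\le \sqrt{S \cdot \tfrac{1}{4} S A D^2 \cdot T}
   = \tfrac{1}{2}\, D S \sqrt{A T}
\end{align*}
```

The gap between the two sides is large when most next-state distributions put their mass on states with similar bias values, since each such pair contributes a variance far below $D^2/4$.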
Abstract
The problem of reinforcement learning in an unknown and discrete Markov Decision Process (MDP) under the average-reward criterion is considered, when the learner interacts with the system in a single stream of observations, starting from an initial state without any reset. We revisit the minimax lower bound for that problem by making appear the local variance of the bias function in place of the diameter of the MDP. Furthermore, we provide a novel analysis of the KL-UCRL algorithm establishing a high-probability regret bound scaling as $\widetilde{O}\big(\sqrt{S \sum_{s,a} V^\star_{s,a} T}\big)$ for this algorithm for ergodic MDPs, where $S$ denotes the number of states and where $V^\star_{s,a}$ is the variance of the bias function with respect to the next-state distribution following action $a$ in state $s$. The resulting bound improves upon the best…
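To make the quantities in the abstract concrete, the following minimal sketch computes the local variance of a bias function under each next-state distribution and compares the two leading terms, $\sqrt{S \sum_{s,a} V_{s,a} T}$ and $DS\sqrt{AT}$, with constants and logarithmic factors ignored. Everything here is an assumption made for illustration: the function names `local_variances` and `leading_terms`, the two-state transition kernel, the bias vector, and the diameter value are hypothetical and do not come from the paper or its benchmarks.

```python
import numpy as np


def local_variances(P, bias):
    """Variance of bias(s') for s' drawn from P[s, a, :], for every (s, a).

    P    : array of shape (S, A, S), a transition kernel (rows sum to 1).
    bias : array of shape (S,), a bias function over states.
    Returns an array of shape (S, A).
    """
    mean = P @ bias              # E[bias(s')] for each (s, a)
    second = P @ (bias ** 2)     # E[bias(s')^2] for each (s, a)
    return second - mean ** 2


def leading_terms(P, bias, diameter, T):
    """Leading terms of the two bounds, without constants or log factors."""
    S, A, _ = P.shape
    V = local_variances(P, bias)
    variance_term = np.sqrt(S * V.sum() * T)
    diameter_term = diameter * S * np.sqrt(A * T)
    return variance_term, diameter_term


# Hypothetical toy MDP: 2 states, 2 actions (illustrative numbers only).
P = np.array([
    [[0.9, 0.1], [0.5, 0.5]],   # from state 0 under actions 0 and 1
    [[0.1, 0.9], [0.5, 0.5]],   # from state 1 under actions 0 and 1
])
bias = np.array([0.0, 1.0])     # placeholder bias function, not derived from rewards
diameter = 2.0                  # action 1 reaches the other state in 2 steps on average

V = local_variances(P, bias)
var_term, diam_term = leading_terms(P, bias, diameter, T=10_000)
print("local variances:\n", V)
print("variance-based term:", var_term)
print("diameter-based term:", diam_term)
```

In this toy instance the variance-based term is several times smaller than the diameter-based one. A faithful comparison would use the optimal bias function of the MDP, obtained for example from value iteration, rather than a placeholder vector.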
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Age of Information Optimization
