Variance-Aware Regret Bounds for Undiscounted Reinforcement Learning in MDPs

Mohammad Sadegh Talebi; Odalric-Ambrym Maillard

arXiv:1803.01626 · stat.ML · March 6, 2018 · 5 cites

TL;DR

This paper introduces variance-aware regret bounds for undiscounted reinforcement learning in MDPs. The bounds depend on the local variance of the bias function rather than on the diameter of the MDP, which yields tighter performance guarantees.

Contribution

It provides a novel analysis of the KL-UCRL algorithm with regret bounds that depend on local variance, enhancing understanding of regret in ergodic MDPs.

Findings

01

The regret bound for KL-UCRL scales as $O\big(\sqrt{S \sum_{s,a} V_{s,a}^{\star} T}\big)$ for ergodic MDPs.

02

The new bound improves upon the previous $O(DS\sqrt{AT})$ bound, where $D$ is the diameter of the MDP and $A$ is the maximum number of actions per state.

03

In some benchmarks, the new bound offers an order of magnitude improvement.

Abstract

The problem of reinforcement learning in an unknown and discrete Markov Decision Process (MDP) under the average-reward criterion is considered, when the learner interacts with the system in a single stream of observations, starting from an initial state without any reset. We revisit the minimax lower bound for that problem so that the local variance of the bias function appears in place of the diameter of the MDP. Furthermore, we provide a novel analysis of the KL-UCRL algorithm, establishing for ergodic MDPs a high-probability regret bound scaling as $O\big(\sqrt{S \sum_{s,a} V_{s,a}^{\star} T}\big)$, where $S$ denotes the number of states and $V_{s,a}^{\star}$ is the variance of the bias function with respect to the next-state distribution following action $a$ in state $s$. The resulting bound improves upon the best…
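
To make the quantity $V_{s,a}^{\star}$ concrete, the following is a minimal sketch, not the authors' code. It computes the variance of a given bias vector under each next-state distribution and compares the two leading terms, $\sqrt{S \sum_{s,a} V_{s,a}^{\star} T}$ and $DS\sqrt{AT}$, with constants and logarithmic factors ignored. The toy transition tensor `P`, the vector `bias`, the diameter value, and the function names are all assumptions made for illustration; in particular, `bias` is a placeholder rather than the optimal bias function of any specific reward structure.

```python
import numpy as np


def local_variances(P, bias):
    """Return V[s, a] = Var_{s' ~ P[s, a, :]}(bias[s']) for every pair (s, a)."""
    mean = P @ bias                 # E[bias(s')], shape (S, A)
    second_moment = P @ bias ** 2   # E[bias(s')^2], shape (S, A)
    return second_moment - mean ** 2


def leading_terms(P, bias, diameter, T):
    """Leading terms of the two bounds, ignoring constants and log factors."""
    S, A, _ = P.shape
    V = local_variances(P, bias)
    variance_term = np.sqrt(S * V.sum() * T)        # sqrt(S * sum_{s,a} V_{s,a} * T)
    diameter_term = diameter * S * np.sqrt(A * T)   # D * S * sqrt(A * T)
    return variance_term, diameter_term


# Hypothetical 2-state, 2-action MDP; P[s, a, s'] is a transition probability.
P = np.array([
    [[0.9, 0.1], [0.5, 0.5]],
    [[0.5, 0.5], [0.1, 0.9]],
])
bias = np.array([0.0, 1.0])  # placeholder bias vector (assumption)
diameter = 2.0               # assumed diameter for this toy MDP

V = local_variances(P, bias)
variance_term, diameter_term = leading_terms(P, bias, diameter, T=10_000)
print(V)
print(variance_term, diameter_term)
```

In this toy instance the variance-based term is the smaller of the two; how large the gap is in general depends on the MDP.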

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Age of Information Optimization