Improved Exploration in Factored Average-Reward MDPs

Mohammad Sadegh Talebi; Anders Jonsson; Odalric-Ambrym Maillard

arXiv:2009.04575 · cs.LG · March 12, 2021 · 1 cites

TL;DR

This paper introduces DBN-UCRL, a new algorithm for regret minimization in factored average-reward MDPs, which leverages the known factorization to achieve improved theoretical regret bounds and empirical performance.

Contribution

The paper presents DBN-UCRL, a novel regret minimization strategy for factored MDPs that improves upon existing bounds by exploiting the known factorization structure.

Findings

01 DBN-UCRL achieves a regret bound whose leading term strictly improves over existing regret bounds for factored MDPs.

02 The regret of DBN-UCRL scales favorably with the size of the factored state space.

03 Numerical experiments show substantial empirical improvement over existing methods.

Abstract

We consider a regret minimization task under the average-reward criterion in an unknown Factored Markov Decision Process (FMDP). More specifically, we consider an FMDP where the state-action space $X$ and the state-space $S$ admit the respective factored forms of $X = \otimes_{i=1}^{n} X_{i}$ and $S = \otimes_{i=1}^{m} S_{i}$, and the transition and reward functions are factored over $X$ and $S$. Assuming known factorization structure, we introduce a novel regret minimization strategy inspired by the popular UCRL2 strategy, called DBN-UCRL, which relies on Bernstein-type confidence sets defined for individual elements of the transition function. We show that for a generic factorization structure, DBN-UCRL achieves a regret bound whose leading term strictly improves over existing regret bounds in terms of the dependencies on…
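The two ingredients named in the abstract, a product-form state space and a Bernstein-type confidence interval on a single transition probability, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names `factored_space` and `bernstein_width` are hypothetical, and the constants follow a generic empirical-Bernstein inequality rather than the exact confidence sets used by DBN-UCRL.

```python
import math
from itertools import product


def factored_space(factor_sizes):
    """Enumerate a product space S = S_1 x ... x S_m.

    factor_sizes[i] is the number of values factor i can take
    (hypothetical encoding: each factor is range(size)).
    """
    return list(product(*(range(size) for size in factor_sizes)))


def bernstein_width(p_hat, n, delta):
    """Half-width of a generic empirical-Bernstein interval around p_hat.

    p_hat: empirical estimate of one transition probability
    n:     number of samples behind the estimate
    delta: confidence parameter in (0, 1)
    The constants are illustrative, not the paper's.
    """
    if n == 0:
        return 1.0  # no data yet: the interval is trivial
    log_term = math.log(3.0 / delta)
    variance = p_hat * (1.0 - p_hat)  # empirical variance of a Bernoulli
    width = math.sqrt(2.0 * variance * log_term / n) + 3.0 * log_term / n
    return min(1.0, width)


states = factored_space([2, 3])  # two factors with 2 and 3 values
print(len(states), "joint states")
print("width at p_hat=0.1:", bernstein_width(0.1, 1000, 0.05))
print("width at p_hat=0.5:", bernstein_width(0.5, 1000, 0.05))
```

The variance term makes the interval narrower for estimates near 0 or 1 than for estimates near 0.5, which is the usual motivation for Bernstein-type sets over Hoeffding-type ones when many transition probabilities are small.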

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Advanced Control Systems Optimization