Improved Exploration in Factored Average-Reward MDPs
Mohammad Sadegh Talebi, Anders Jonsson, Odalric-Ambrym Maillard

TL;DR
This paper introduces DBN-UCRL, a new algorithm for regret minimization in factored average-reward MDPs, which leverages the known factorization to achieve improved theoretical regret bounds and empirical performance.
Contribution
The paper presents DBN-UCRL, a novel regret minimization strategy for factored MDPs that improves upon existing bounds by exploiting the known factorization structure.
Findings
For a generic factorization structure, DBN-UCRL achieves a regret bound whose leading term strictly improves over existing regret bounds.
The regret of DBN-UCRL scales favorably with the size of the factored state space.
Numerical experiments show substantial empirical improvement over existing methods.
Abstract
We consider a regret minimization task under the average-reward criterion in an unknown Factored Markov Decision Process (FMDP). More specifically, we consider an FMDP where the state-action space $\mathcal{X}$ and the state-space $\mathcal{S}$ admit the respective factored forms of $\mathcal{X} = \otimes_{i=1}^{n} \mathcal{X}_i$ and $\mathcal{S} = \otimes_{i=1}^{m} \mathcal{S}_i$, and the transition and reward functions are factored over $\mathcal{X}$ and $\mathcal{S}$. Assuming known factorization structure, we introduce a novel regret minimization strategy inspired by the popular UCRL2 strategy, called DBN-UCRL, which relies on Bernstein-type confidence sets defined for individual elements of the transition function. We show that for a generic factorization structure, DBN-UCRL achieves a regret bound, whose leading term strictly improves over existing regret bounds in terms of the dependencies on…
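To make the notion of a factored transition function concrete, here is a minimal Python sketch. It is not code from the paper: the function name, the `scopes` and `factor_probs` data layout, and the toy numbers are all assumptions chosen for illustration. It shows only the generic idea that the probability of a next state is a product of per-factor probabilities, each depending on a small subset (scope) of the state-action components.

```python
from itertools import product


def factored_transition_prob(x, s_next, scopes, factor_probs):
    """Return P(s_next | x) as a product over state factors.

    x            : tuple of state-action components (hypothetical layout)
    s_next       : tuple with one value per state factor
    scopes       : scopes[i] lists the indices of x that factor i depends on
    factor_probs : factor_probs[i][parent_values] is a list of probabilities
                   over the values of state factor i
    """
    prob = 1.0
    for i, scope in enumerate(scopes):
        # Restrict x to the components that factor i actually depends on.
        parent_values = tuple(x[j] for j in scope)
        prob *= factor_probs[i][parent_values][s_next[i]]
    return prob


# Toy example (made-up numbers): x = (s1, s2, a), two binary state factors.
# Factor 0 depends on (s1, a); factor 1 depends on s2 only.
scopes = [(0, 2), (1,)]
factor_probs = [
    {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
     (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9]},
    {(0,): [0.7, 0.3], (1,): [0.4, 0.6]},
]

x = (0, 1, 1)
p = factored_transition_prob(x, (1, 0), scopes, factor_probs)

# Summing over all next states should give 1 if each factor is normalized.
total = sum(
    factored_transition_prob(x, s, scopes, factor_probs)
    for s in product(range(2), repeat=2)
)
```

In this layout each factor only needs a table indexed by its own scope, so the number of entries to estimate grows with the scope sizes rather than with the full state-action space. That is the structural property a factored approach can exploit.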
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Advanced Control Systems Optimization
