Achieving Tractable Minimax Optimal Regret in Average Reward MDPs
Victor Boone, Zihan Zhang

TL;DR
This paper introduces a computationally efficient algorithm for average-reward MDPs that achieves minimax optimal regret without prior knowledge of the bias span, whereas earlier algorithms either had sub-optimal regret guarantees or were computationally inefficient.
Contribution
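The paper presents the first tractable algorithm with minimax optimal regret for average-reward MDPs. It is built on a novel subroutine, Projected Mitigated Extended Value Iteration (PMEVI), which can also be applied to existing algorithms to improve their regret bounds.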
Findings
Achieves regret of \(\widetilde{O}(\sqrt{\text{sp}(h^*) S A T})\) with high probability.
Does not require prior knowledge of the bias span \(\text{sp}(h^*)\).
Introduces PMEVI, a new subroutine for bias-constrained policy computation.
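For readers unfamiliar with the quantities in these findings, the block below writes out the standard textbook definitions of the span and of regret in the average-reward setting. The symbols \(g^*\) (optimal gain) and \(r_t\) (reward collected at step \(t\)) do not appear in this summary; they are assumed here to match the paper's notation.

```latex
% Span of a function h over the state space (max minus min):
\text{sp}(h) = \max_{s} h(s) - \min_{s} h(s)

% Regret after T learning steps, measured against the optimal gain g^*:
\text{Reg}(T) = T g^* - \sum_{t=1}^{T} r_t
```

Under these definitions, the bias span measures how much the long-run advantage of starting in one state rather than another can vary, which is why it appears in the regret bound in place of quantities such as the diameter.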
Abstract
In recent years, significant attention has been directed towards learning average-reward Markov Decision Processes (MDPs). However, existing algorithms either suffer from sub-optimal regret guarantees or computational inefficiencies. In this paper, we present the first tractable algorithm with minimax optimal regret of \(\widetilde{O}(\sqrt{\text{sp}(h^*) S A T})\), where \(\text{sp}(h^*)\) is the span of the optimal bias function \(h^*\), \(S \times A\) is the size of the state-action space and \(T\) the number of learning steps. Remarkably, our algorithm does not require prior information on \(\text{sp}(h^*)\). Our algorithm relies on a novel subroutine, Projected Mitigated Extended Value Iteration (PMEVI), to compute bias-constrained optimal policies efficiently. This subroutine can be applied to various previous algorithms to improve regret bounds.
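To make the bias span concrete, here is a minimal sketch that computes the optimal gain and bias of a small MDP and then evaluates the leading term of the bound. It is not the paper's algorithm: it assumes the transitions and rewards are fully known (so nothing is learned), it uses plain relative value iteration rather than PMEVI, and it drops the constants and logarithmic factors hidden in the \(\widetilde{O}\). The toy two-state MDP and all function names are hypothetical.

```python
import numpy as np


def relative_value_iteration(P, r, max_iters=10_000, tol=1e-12):
    """Estimate the optimal gain and bias of a KNOWN MDP.

    P has shape (S, A, S) with P[s, a, s2] the transition probability,
    r has shape (S, A). State 0 is the reference state, so bias[0] == 0.
    Convergence is assumed (e.g. an aperiodic, communicating MDP).
    """
    h = np.zeros(P.shape[0])
    gain = 0.0
    for _ in range(max_iters):
        q = r + P @ h                # Bellman backup, shape (S, A)
        backed_up = q.max(axis=1)    # greedy maximum over actions
        gain = backed_up[0]          # value at the reference state
        new_h = backed_up - gain     # renormalise so new_h[0] == 0
        converged = np.max(np.abs(new_h - h)) < tol
        h = new_h
        if converged:
            break
    return float(gain), h


def span(h):
    """Span of a vector: largest entry minus smallest entry."""
    return float(np.max(h) - np.min(h))


# Hypothetical toy MDP: state 1 pays reward 1, state 0 pays nothing.
# Action 0 mostly stays in the current state; action 1 moves with prob. 0.5.
P = np.array([
    [[0.9, 0.1], [0.5, 0.5]],   # from state 0: action 0, action 1
    [[0.1, 0.9], [0.5, 0.5]],   # from state 1: action 0, action 1
])
r = np.array([
    [0.0, 0.0],
    [1.0, 1.0],
])

gain, h = relative_value_iteration(P, r)
S, A = r.shape
T = 10_000

# Leading term of the bound only: no constants, no log factors.
leading_term = float(np.sqrt(span(h) * S * A * T))

print("optimal gain:", gain)
print("bias span:", span(h))
print("sqrt(sp(h*) S A T):", leading_term)
```

For this toy MDP the Poisson equation can be solved by hand, giving a gain of 5/6 and a bias span of 5/3. The learning algorithm described in the abstract has no access to `P` or `r` and, per the abstract, does not need the span as an input.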
