Logarithmic Regret of Exploration in Average Reward Markov Decision Processes
Victor Boone, Bruno Gaujal

TL;DR
This paper shows that, in average reward Markov decision processes, ending episodes with the Vanishing Multiplicative (VM) rule instead of the Doubling Trick (DT) gives regret that is as good if not better, in theory and in practice, while greatly improving one-shot behavior.
Contribution
The paper introduces the Vanishing Multiplicative (VM) rule for episode management and shows its advantages over the Doubling Trick (DT), without modifying Extended Value Iteration (EVI).
Findings
Under the VM rule, the regret of exploration during bad episodes becomes logarithmic
The VM rule greatly improves one-shot behavior
In both theory and practice, regret under VM is as good as, if not better than, regret under DT
Abstract
In average reward Markov decision processes, state-of-the-art algorithms for regret minimization follow a well-established framework: they are model-based, optimistic and episodic. First, they maintain a confidence region from which optimistic policies are computed using a well-known subroutine called Extended Value Iteration (EVI). Second, these policies are used over time windows called episodes, each ended by the Doubling Trick (DT) rule or a variant thereof. In this work, without modifying EVI, we show that there is a significant advantage in replacing (DT) by another simple rule, that we call the Vanishing Multiplicative (VM) rule. When managing episodes with (VM), the algorithm's regret is, both in theory and in practice, as good as, if not better than, with (DT), while the one-shot behavior is greatly improved. More specifically, the management of bad episodes (when sub-optimal…
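The excerpt above names the two episode-ending rules but does not define them, so the following is only a sketch under stated assumptions. It assumes (DT) takes the familiar UCRL2-style form, where an episode ends once the visits to a state-action pair within the episode reach that pair's count at the episode start. It also assumes (VM) replaces the implied factor 2 with 1 + f(t_k), where t_k is the episode start time and f vanishes as t_k grows. The function names, the toy single-pair simulation, and the choice f(t) = 1/sqrt(t + 1) are hypothetical and chosen purely for illustration.

```python
import math


def dt_should_stop(count_at_start, visits_in_episode, episode_start_time=None):
    """Doubling-style rule (assumed UCRL2 form): stop once the visits made
    during the episode reach the visit count held at the episode start."""
    return visits_in_episode >= max(1, count_at_start)


def illustrative_f(t):
    """Hypothetical vanishing function; the source does not specify f."""
    return 1.0 / math.sqrt(t + 1)


def vm_should_stop(count_at_start, visits_in_episode, episode_start_time,
                   f=illustrative_f):
    """Assumed vanishing-multiplicative form: stop once the total count
    exceeds (1 + f(t_k)) times the count held at the episode start."""
    total = count_at_start + visits_in_episode
    return total > (1.0 + f(episode_start_time)) * max(1, count_at_start)


def count_episodes(horizon, should_stop):
    """Toy setting: a single state-action pair is visited at every step.
    Returns how many episodes the given stopping rule produces."""
    count_at_start = 0   # visits accumulated before the current episode
    visits = 0           # visits accumulated during the current episode
    start_time = 0       # time step at which the current episode began
    episodes = 1
    for t in range(horizon):
        visits += 1
        if should_stop(count_at_start, visits, start_time):
            count_at_start += visits
            visits = 0
            start_time = t + 1
            episodes += 1
    return episodes


if __name__ == "__main__":
    print("DT episodes:", count_episodes(1000, dt_should_stop))
    print("VM episodes:", count_episodes(1000, vm_should_stop))
```

In this toy setting, the assumed VM rule ends episodes after proportionally fewer new visits as time goes on, so it produces more, shorter episodes than the doubling rule. Whether this matches the paper's exact definition and choice of f should be checked against the full text.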
Taxonomy
Topics: Advanced Data Processing Techniques · Reservoir Engineering and Simulation Methods · Process Optimization and Integration
