Asymptotically optimal regret in communicating Markov decision processes
Victor Boone

TL;DR
This paper introduces a learning algorithm for communicating Markov decision processes that achieves asymptotically optimal regret by explicitly estimating a key constant, balancing exploration and exploitation effectively.
Contribution
The paper presents a novel algorithm that attains asymptotically optimal regret in average-reward communicating MDPs by explicitly tracking the constant K(M), and it handles the discontinuity of K(M) with a regularization mechanism.
Findings
Achieves regret of K(M) log(T) + o(log(T)) in communicating MDPs.
Develops a regularization mechanism to estimate K(M) with arbitrary precision from empirical data.
Demonstrates the discontinuity of the function K(M).
Abstract
In this paper, we present a learning algorithm that achieves asymptotically optimal regret for Markov decision processes in average reward under a communicating assumption. That is, given a communicating Markov decision process M, our algorithm has regret K(M) log(T) + o(log(T)), where T is the number of learning steps and K(M) is the best possible constant. This algorithm works by explicitly tracking the constant K(M) to learn optimally, then balances the trade-off between exploration (playing sub-optimally to gain information), co-exploration (playing optimally to gain information) and exploitation (playing optimally to score maximally). We further show that the function M ↦ K(M) is discontinuous, which is a consequential challenge for our approach. To that end, we describe a regularization mechanism to estimate K(M) with arbitrary precision from empirical data.
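To make the communicating assumption concrete, here is a minimal sketch, not taken from the paper, of how one might check it for a small tabular MDP. It uses the standard definition: a finite MDP is communicating when every state can be reached from every other state with positive probability under some policy. This is equivalent to strong connectivity of the directed graph that has an edge from s to s2 whenever at least one action moves from s to s2 with positive probability. The function name `is_communicating`, the nested-list layout `P[s][a][s2]`, and the two toy MDPs are assumptions made for illustration.

```python
from collections import deque


def is_communicating(P, tol=0.0):
    """Check the communicating property of a finite MDP.

    P[s][a][s2] is the probability of moving from state s to state s2
    under action a (hypothetical layout chosen for this sketch).
    """
    n = len(P)
    # Edge s -> s2 exists if at least one action reaches s2 with probability > tol.
    succ = [
        {s2 for a in range(len(P[s])) for s2 in range(n) if P[s][a][s2] > tol}
        for s in range(n)
    ]
    # Reverse graph: pred[s2] holds every s with an edge s -> s2.
    pred = [set() for _ in range(n)]
    for s in range(n):
        for s2 in succ[s]:
            pred[s2].add(s)

    def reaches_all(adj):
        # Breadth-first search from state 0 over the adjacency sets.
        seen = {0}
        queue = deque([0])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        return len(seen) == n

    # Strongly connected iff state 0 reaches every state in the graph
    # and in its reverse.
    return n > 0 and reaches_all(succ) and reaches_all(pred)


# Toy MDP 1: from each state, one action goes to state 0 and the other to state 1.
P_comm = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]

# Toy MDP 2: state 1 is absorbing, so state 0 can never be reached from it.
P_absorbing = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]

print(is_communicating(P_comm))       # True
print(is_communicating(P_absorbing))  # False
```

The check only looks at which transitions have positive probability, not at rewards, so it says nothing about K(M) itself; it only tests whether an MDP falls within the class the regret guarantee is stated for.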
