The regret lower bound for communicating Markov Decision Processes
Victor Boone, Odalric-Ambrym Maillard

TL;DR
This paper extends regret lower bounds to communicating MDPs, revealing complex exploration behaviors and navigation constraints, and provides computational hardness results and an approximation algorithm.
Contribution
It introduces a new regret lower bound framework for communicating MDPs, highlighting co-exploration and navigation constraints, with complexity analysis and an approximation method.
Findings
The lower bound is the solution of an optimization problem that encodes exploration, co-exploration, and navigation constraints.
Computational hardness results show the problem is $\Sigma_2^{\text{P}}$-hard.
An algorithm for approximating the lower bound is proposed.
Abstract
This paper is devoted to the extension of the regret lower bound beyond ergodic Markov decision processes (MDPs) in the problem-dependent setting. While the regret lower bound for ergodic MDPs is well-known and reached by tractable algorithms, we prove that the regret lower bound becomes significantly more complex in communicating MDPs. Our lower bound revisits the necessary explorative behavior of consistent learning agents and further explains that all optimal regions of the environment must be overvisited compared to sub-optimal ones, a phenomenon that we refer to as co-exploration. In tandem, we show that these two explorative and co-explorative behaviors are intertwined with navigation constraints obtained by scrutinizing the navigation structure at logarithmic scale. The resulting lower bound is expressed as the solution of an optimization problem that, in many standard classes…
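To make the ergodic/communicating distinction concrete, here is a minimal sketch that tests whether a small tabular MDP is communicating. It assumes the standard definition: every state can be reached from every other state under some policy, which is equivalent to the directed graph with an edge s -> t whenever some action has positive probability of moving s to t being strongly connected. The function name `is_communicating` and the nested-list layout `P[s][a][t]` are hypothetical choices for illustration and do not come from the paper.

```python
def is_communicating(P):
    """Return True if the MDP with kernel P[s][a][t] is communicating.

    P[s][a][t] is the probability of moving from state s to state t
    under action a (illustrative layout, not taken from the paper).
    """
    n = len(P)
    # reach[s][t] starts as: s == t, or some action moves s to t
    # with positive probability.
    reach = [
        [s == t or any(P[s][a][t] > 0 for a in range(len(P[s])))
         for t in range(n)]
        for s in range(n)
    ]
    # Boolean transitive closure (Floyd-Warshall style).
    for k in range(n):
        for s in range(n):
            if reach[s][k]:
                for t in range(n):
                    if reach[k][t]:
                        reach[s][t] = True
    # Communicating iff every state reaches every other state.
    return all(all(row) for row in reach)


# Two states; in each state, action 0 stays and action 1 switches.
P_comm = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]

# Same as above, except state 1 is absorbing under both actions.
P_abs = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]

result_comm = is_communicating(P_comm)
result_abs = is_communicating(P_abs)
```

In `P_comm` the policy that always plays action 0 never leaves its starting state, so the MDP is communicating although not every policy induces an irreducible chain. That is the kind of environment the abstract has in mind when it moves beyond the ergodic case.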
Taxonomy
Topics: Bayesian Modeling and Causal Inference
