Reinforcement Learning in a Birth and Death Process: Breaking the Dependence on the State Space
Jonatha Anselmi (POLARIS, LIG), Bruno Gaujal (POLARIS, LIG), Louis-Sébastien Rebuffi (POLARIS, LIG, UGA)

TL;DR
This paper shows that, in MDPs with a birth and death structure, reinforcement learning regret bounds can be made independent of the state space size by exploiting the problem's structure, even though diameter-based bounds suggest that learning is inefficient in this setting.
Contribution
The authors show that for MDPs with a birth and death structure, the regret bound of a modified UCRL2 algorithm is independent of the number of states, breaking traditional dependence on the diameter.
Findings
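Regret bound is $\tilde{\mathcal{O}}(\sqrt{E_2 A T})$, with $E_2 A$ bounded independently of the number of states.
Existing diameter-based regret bounds suggest that learning is inefficient here because the diameter is very large; the structure-aware analysis avoids that dependence.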
The approach relies on analyzing non-uniform state visitations in structured MDPs.
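To make the idea of non-uniform state visitation concrete, the following is a minimal sketch of a birth and death chain with impatient jobs. The rates (`arrival`, `service`, `impatience`) and the function names are hypothetical choices for illustration, and the plain second moment of the stationary distribution is used as a stand-in for the second-moment quantity in the regret bound, whose exact weighting is not given in this summary. Under these assumptions, high states are visited so rarely that the second moment stops changing as the number of states grows.

```python
from math import fsum


def stationary_distribution(num_states, arrival=1.0, service=1.0, impatience=0.5):
    """Stationary law of a birth-death chain on {0, ..., num_states - 1}.

    Assumed (hypothetical) rates: births occur at rate `arrival` in every
    state; deaths in state i occur at rate `service + i * impatience`,
    modelling service plus abandonment by impatient jobs.
    """
    weights = [1.0]
    for i in range(1, num_states):
        # Detailed balance: pi[i] * death(i) = pi[i - 1] * birth(i - 1)
        weights.append(weights[-1] * arrival / (service + i * impatience))
    total = fsum(weights)
    return [w / total for w in weights]


def second_moment(pi):
    """Plain second moment sum_i i^2 * pi[i] (a stand-in, not the paper's E_2)."""
    return fsum(i * i * p for i, p in enumerate(pi))


for num_states in (5, 10, 50, 200):
    pi = stationary_distribution(num_states)
    # The second moment settles quickly; the top state is almost never visited.
    print(num_states, second_moment(pi), pi[-1])
```

Running the loop shows the printed second moment stabilizing while the probability of the top state collapses toward zero, which is the kind of structure a uniform, worst-case analysis over all states cannot exploit.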
Abstract
In this paper, we revisit the regret of undiscounted reinforcement learning in MDPs with a birth and death structure. Specifically, we consider a controlled queue with impatient jobs and the main objective is to optimize a trade-off between energy consumption and user-perceived performance. Within this setting, the diameter $D$ of the MDP is $\Omega(S^S)$, where $S$ is the number of states. Therefore, the existing lower and upper bounds on the regret at time $T$, of order $O(\sqrt{DSAT})$ for MDPs with $S$ states and $A$ actions, may suggest that reinforcement learning is inefficient here. In our main result however, we exploit the structure of our MDPs to show that the regret of a slightly-tweaked version of the classical learning algorithm UCRL2 is in fact upper bounded by $\tilde{\mathcal{O}}(\sqrt{E_2 A T})$ where $E_2$ is related to the weighted second moment of the…
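As a rough illustration of why the diameter can be very large in a queue with impatient jobs, the sketch below computes the expected time to climb from the empty state to the top state of a continuous-time birth and death chain under one fixed set of rates. This is not the diameter itself (which involves an optimization over policies), and the rates and the function name `expected_time_to_top` are assumptions made for the example. It uses the standard first-step recursion $h_i = 1/\lambda + (\mu_i/\lambda)\, h_{i-1}$ for the expected time to move from state $i$ to state $i+1$.

```python
def expected_time_to_top(num_states, arrival=1.0, service=1.0, impatience=0.5):
    """Expected time to go from state 0 to state num_states - 1.

    Assumed (hypothetical) rates: birth rate `arrival` in every state,
    death rate `service + i * impatience` in state i >= 1, none in state 0.
    """
    step_up = 0.0  # expected time for the previous upward step (none yet)
    total = 0.0
    for i in range(num_states - 1):
        death = 0.0 if i == 0 else service + i * impatience
        # First-step analysis: h_i = 1/birth + (death/birth) * h_{i-1}
        step_up = 1.0 / arrival + (death / arrival) * step_up
        total += step_up
    return total


for num_states in (5, 10, 20, 40):
    # Growth is faster than exponential because death rates increase with i.
    print(num_states, expected_time_to_top(num_states))
```

Because each upward step multiplies the previous one by a death rate that increases with the state, the travel time grows roughly like a factorial in the number of states, which is why diameter-based regret bounds look pessimistic in this setting.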
Taxonomy
Topics: Smart Grid Energy Management · Age of Information Optimization · Advanced Bandit Algorithms Research
