Online Markov Decision Processes with Terminal Law Constraints
Bianca Marin Moreno (EDF R&D, Thoth, FiME Lab), Margaux Brégère (EDF R&D, LPSM (UMR_8001)), Pierre Gaillard (Thoth), Nadia Oudjane (EDF R&D, FiME Lab)

TL;DR
This paper introduces a reset-free, periodic framework for online Markov decision processes with terminal law constraints, providing the first non-asymptotic guarantees for multi-agent settings without resets.
Contribution
It formalizes the problem of finding optimal periodic policies with terminal constraints and proposes algorithms with sublinear regret guarantees for multi-agent MDPs.
Findings
Achieved sublinear periodic regret of order $\tilde O(T^{3/4})$
First non-asymptotic guarantees for reset-free multi-agent learning
Introduced the periodic regret measure for evaluating policies
Abstract
Traditional reinforcement learning usually assumes either episodic interactions with resets or continuous operation to minimize average or cumulative loss. While episodic settings have many theoretical results, resets are often unrealistic in practice. The infinite-horizon setting avoids this issue but lacks non-asymptotic guarantees in online scenarios with unknown dynamics. In this work, we move towards closing this gap by introducing a reset-free framework called the periodic framework, where the goal is to find periodic policies: policies that not only minimize cumulative loss but also return the agents to their initial state distribution after a fixed number of steps. We formalize the problem of finding optimal periodic policies and identify sufficient conditions under which it is well-defined for tabular Markov decision processes. To evaluate algorithms in this framework, we…
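The terminal law constraint described in the abstract (agents must return to their initial state distribution after a fixed number of steps) can be illustrated with a minimal sketch for a tabular MDP. This is not the paper's algorithm: the function names `propagate`, `terminal_law`, and `is_periodic`, the two-state toy MDP, and the tolerance `tol` are all assumptions made for illustration, and the transition kernel is assumed known here, whereas the paper treats unknown dynamics.

```python
# Illustrative sketch only: checks whether a sequence of policies brings the
# state distribution back to its initial law after len(policies) steps.
# All names and the toy MDP below are hypothetical, not taken from the paper.

def propagate(mu, P, policy):
    """One step of the state law: mu'(s') = sum_{s,a} mu(s) * policy[s][a] * P[s][a][s']."""
    n = len(mu)
    nxt = [0.0] * n
    for s in range(n):
        for a, prob_a in enumerate(policy[s]):
            for s2 in range(n):
                nxt[s2] += mu[s] * prob_a * P[s][a][s2]
    return nxt

def terminal_law(mu0, P, policies):
    """State distribution reached after applying one policy per step."""
    mu = list(mu0)
    for pi in policies:
        mu = propagate(mu, P, pi)
    return mu

def is_periodic(mu0, P, policies, tol=1e-9):
    """True if the terminal law matches the initial law up to tol."""
    mu_end = terminal_law(mu0, P, policies)
    return all(abs(x - y) <= tol for x, y in zip(mu0, mu_end))

# Toy MDP (assumed): 2 states, 2 deterministic actions.
# P[s][a][s'] with action 0 = stay, action 1 = switch state.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # from state 0
    [[0.0, 1.0], [1.0, 0.0]],  # from state 1
]
mu0 = [1.0, 0.0]                     # all agents start in state 0
stay = [[1.0, 0.0], [1.0, 0.0]]      # policy[s][a]: always pick action 0
switch = [[0.0, 1.0], [0.0, 1.0]]    # policy[s][a]: always pick action 1

returns_home = is_periodic(mu0, P, [switch, switch])  # switch twice
ends_away = is_periodic(mu0, P, [switch, stay])       # switch, then stay
```

With a period of two steps, switching twice satisfies the constraint, while switching once and then staying leaves all the mass in the other state and violates it. A periodic policy in the abstract's sense would additionally minimize cumulative loss among the policies that pass this check.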
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Game Theory and Applications
