Online Markov Decision Processes with Terminal Law Constraints

Bianca Marin Moreno (EDF R&D; Thoth; FiME Lab); Margaux Brégère (EDF R&D; LPSM (UMR_8001)); Pierre Gaillard (Thoth); Nadia Oudjane (EDF R&D; FiME Lab)

arXiv:2601.07492 · math.OC · January 13, 2026

Open Access

TL;DR

This paper introduces a reset-free, periodic framework for online Markov decision processes with terminal law constraints, providing the first non-asymptotic guarantees for multi-agent settings without resets.

Contribution

It formalizes the problem of finding optimal periodic policies with terminal constraints and proposes algorithms with sublinear regret guarantees for multi-agent MDPs.

Findings

01

Achieved sublinear periodic regret of order $\tilde O(T^{3/4})$

02

First non-asymptotic guarantees for reset-free multi-agent learning

03

Introduced the periodic regret measure for evaluating policies

Abstract

Traditional reinforcement learning usually assumes either episodic interactions with resets or continuous operation to minimize average or cumulative loss. While episodic settings have many theoretical results, resets are often unrealistic in practice. The infinite-horizon setting avoids this issue but lacks non-asymptotic guarantees in online scenarios with unknown dynamics. In this work, we move towards closing this gap by introducing a reset-free framework called the periodic framework, where the goal is to find periodic policies: policies that not only minimize cumulative loss but also return the agents to their initial state distribution after a fixed number of steps. We formalize the problem of finding optimal periodic policies and identify sufficient conditions under which it is well-defined for tabular Markov decision processes. To evaluate algorithms in this framework, we…
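The terminal law constraint described in the abstract can be illustrated with a small sketch. It pushes a state distribution forward under a policy and measures how far the law after a fixed number of steps is from the initial law. Everything here is an assumption made for illustration and is not taken from the paper: the names `propagate` and `terminal_gap`, the toy two-state MDP, the use of a stationary policy, and the L1 distance as the gap measure.

```python
def propagate(mu0, P, policy, n_steps):
    """Push a state distribution forward n_steps under a stationary policy.

    mu0[s]       : initial probability of state s
    P[s][a][s2]  : probability of moving from s to s2 under action a
    policy[s][a] : probability of choosing action a in state s
    (Hypothetical helper; a stationary policy is an illustrative assumption.)
    """
    n_states = len(mu0)
    mu = list(mu0)
    for _ in range(n_steps):
        nxt = [0.0] * n_states
        for s in range(n_states):
            for a, p_a in enumerate(policy[s]):
                for s2 in range(n_states):
                    nxt[s2] += mu[s] * p_a * P[s][a][s2]
        mu = nxt
    return mu


def terminal_gap(mu0, P, policy, n_steps):
    """L1 distance between the law after n_steps and the initial law."""
    mu_n = propagate(mu0, P, policy, n_steps)
    return sum(abs(x - y) for x, y in zip(mu_n, mu0))


# Toy deterministic MDP: action 0 stays in place, action 1 switches state.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # transitions from state 0
    [[0.0, 1.0], [1.0, 0.0]],  # transitions from state 1
]
mu0 = [1.0, 0.0]                           # all mass starts in state 0
always_switch = [[0.0, 1.0], [0.0, 1.0]]   # always play action 1

gap_period_2 = terminal_gap(mu0, P, always_switch, n_steps=2)
gap_period_3 = terminal_gap(mu0, P, always_switch, n_steps=3)
```

In this toy case the always-switch policy returns to the initial distribution after two steps, so its gap is zero for a period of 2, while for a period of 3 all mass ends in the other state and the gap is maximal. A periodic policy in the sense of the abstract would additionally have to minimize cumulative loss, which this sketch does not model.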

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Game Theory and Applications