Near-Constant Strong Violation and Last-Iterate Convergence for Online CMDPs via Decaying Safety Margins
Qian Zuo, Zhiyong Wang, Fengxiang He

TL;DR
This paper introduces FlexDOME, a primal-dual algorithm for safe online reinforcement learning in CMDPs that achieves near-constant strong constraint violation and sublinear strong regret by adding time-varying safety margins and regularization terms.
Contribution
The paper presents the first algorithm for safe online CMDPs that provably achieves near-constant strong constraint violation together with sublinear strong regret and non-asymptotic last-iterate convergence, using time-varying safety margins.
Findings
FlexDOME achieves near-constant strong constraint violation.
FlexDOME guarantees sublinear strong regret.
Experiments validate theoretical guarantees.
Abstract
We study safe online reinforcement learning in Constrained Markov Decision Processes (CMDPs) under strong regret and violation metrics, which forbid error cancellation over time. Existing primal-dual methods that achieve sublinear strong reward regret inevitably incur growing strong constraint violation or are restricted to average-iterate convergence due to inherent oscillations. To address these limitations, we propose the Flexible safety Domain Optimization via Margin-regularized Exploration (FlexDOME) algorithm, the first to provably achieve near-constant strong constraint violation alongside sublinear strong regret and non-asymptotic last-iterate convergence. FlexDOME incorporates time-varying safety margins and regularization terms into the primal-dual framework. Our theoretical analysis relies on a novel term-wise asymptotic dominance strategy, where the safety…
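The distinction between strong metrics and their weaker, cancellation-permitting counterparts can be illustrated with a small sketch. This is not code from the paper: the function names, the "cost must stay at or below a threshold" convention, and the per-episode numbers are assumptions chosen only to show why a policy sequence that oscillates around the constraint can look safe on a signed sum while accumulating strong violation.

```python
from typing import Sequence


def weak_violation(costs: Sequence[float], threshold: float) -> float:
    """Signed sum, clipped once at the end: overshoots and undershoots cancel."""
    return max(0.0, sum(c - threshold for c in costs))


def strong_violation(costs: Sequence[float], threshold: float) -> float:
    """Clip every episode before summing: no cancellation over time."""
    return sum(max(0.0, c - threshold) for c in costs)


# Hypothetical per-episode constraint costs oscillating around the threshold.
costs = [1.5, 0.5, 1.5, 0.5]
threshold = 1.0

weak = weak_violation(costs, threshold)
strong = strong_violation(costs, threshold)
print(f"weak violation:   {weak}")
print(f"strong violation: {strong}")
```

In this toy sequence the two unsafe episodes are exactly offset by the two over-conservative ones, so the signed sum is zero, while the strong metric still counts each overshoot.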
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Age of Information Optimization
