Near-Constant Strong Violation and Last-Iterate Convergence for Online CMDPs via Decaying Safety Margins

Qian Zuo; Zhiyong Wang; Fengxiang He

arXiv:2602.10917 · cs.LG · March 4, 2026

TL;DR

This paper introduces FlexDOME, a primal-dual algorithm for safe online reinforcement learning in CMDPs. By adding time-varying safety margins and regularization terms, it achieves near-constant $\tilde{O}(1)$ strong constraint violation together with sublinear strong regret and last-iterate convergence.

Contribution

The paper presents FlexDOME, the first algorithm for safe online CMDPs that provably achieves near-constant strong constraint violation alongside sublinear strong regret and non-asymptotic last-iterate convergence, using time-varying safety margins.

Findings

01 FlexDOME achieves near-constant $\tilde{O}(1)$ strong constraint violation.

02 FlexDOME guarantees sublinear strong regret.

03 Experiments validate the theoretical guarantees.

Abstract

We study safe online reinforcement learning in Constrained Markov Decision Processes (CMDPs) under strong regret and violation metrics, which forbid error cancellation over time. Existing primal-dual methods that achieve sublinear strong reward regret inevitably incur growing strong constraint violation or are restricted to average-iterate convergence due to inherent oscillations. To address these limitations, we propose the Flexible safety Domain Optimization via Margin-regularized Exploration (FlexDOME) algorithm, the first to provably achieve near-constant $\tilde{O} (1)$ strong constraint violation alongside sublinear strong regret and non-asymptotic last-iterate convergence. FlexDOME incorporates time-varying safety margins and regularization terms into the primal-dual framework. Our theoretical analysis relies on a novel term-wise asymptotic dominance strategy, where the safety…
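
The phrase "forbid error cancellation over time" can be made concrete with a small sketch. It assumes a constraint of the form "per-episode constraint value at least a threshold"; that convention, the function names, and the numbers below are illustrative and not taken from the paper. A weak metric clips after summing, so safe episodes can offset unsafe ones, whereas a strong metric clips each episode before summing.

```python
from typing import Sequence


def weak_violation(constraint_values: Sequence[float], threshold: float) -> float:
    """Sum the signed shortfalls first, then clip at zero.

    Over-satisfied episodes (negative shortfall) can cancel violations.
    """
    total_shortfall = sum(threshold - v for v in constraint_values)
    return max(total_shortfall, 0.0)


def strong_violation(constraint_values: Sequence[float], threshold: float) -> float:
    """Clip each episode's shortfall at zero, then sum.

    No cancellation: every violating episode contributes to the total.
    """
    return sum(max(threshold - v, 0.0) for v in constraint_values)


# Hypothetical per-episode constraint values for a policy that oscillates
# around the threshold (assumed constraint: value >= threshold).
values = [0.75, 0.25, 0.75, 0.25]
threshold = 0.5

print(weak_violation(values, threshold))    # 0.0
print(strong_violation(values, threshold))  # 0.5
```

In this toy sequence the oscillating policy looks perfectly safe under the weak metric but accumulates violation under the strong one, which is the kind of behavior the abstract attributes to oscillating primal-dual methods.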

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Age of Information Optimization