Optimal Strong Regret and Violation in Constrained MDPs via Policy Optimization
Francesco Emanuele Stradi, Matteo Castiglioni, Alberto Marchesi, Nicola Gatti

TL;DR
This paper introduces an efficient policy optimization algorithm for constrained MDPs that achieves optimal $\widetilde{\mathcal{O}}(\sqrt{T})$ strong regret and strong violation with a primal-dual scheme, improving on the suboptimal bounds of previous policy optimization methods.
Contribution
It presents the first policy optimization method achieving optimal $\widetilde{\mathcal{O}}(\sqrt{T})$ bounds for strong regret and violation in constrained MDPs.
Findings
Achieves $\widetilde{\mathcal{O}}(\sqrt{T})$ bounds for strong regret and violation.
Uses a primal-dual scheme with policy optimization and UCB-like dual updates.
Outperforms previous algorithms with suboptimal bounds.
Abstract
We study online learning in constrained MDPs (CMDPs), focusing on the goal of attaining sublinear strong regret and strong cumulative constraint violation. Differently from their standard (weak) counterparts, these metrics do not allow negative terms to compensate positive ones, raising considerable additional challenges. Efroni et al. (2020) were the first to propose an algorithm with sublinear strong regret and strong violation, by exploiting linear programming. Thus, their algorithm is highly inefficient, leaving as an open problem achieving sublinear bounds by means of policy optimization methods, which are much more efficient in practice. Very recently, Muller et al. (2024) have partially addressed this problem by proposing a policy optimization method that attains $\widetilde{\mathcal{O}}(T^{0.93})$ strong regret/violation. This still leaves open the question of…
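To make the weak/strong distinction in the abstract concrete, here is a minimal sketch, not taken from the paper. It assumes each episode produces a signed gap (for regret, optimal value minus achieved value; for violation, constraint cost minus threshold). The function name and the numbers are hypothetical, chosen only for illustration.

```python
def weak_and_strong_sums(gaps):
    """Return (weak, strong) cumulative sums of per-episode signed gaps.

    weak:   plain sum, so negative gaps can cancel positive ones.
    strong: sum of positive parts only, so nothing cancels.
    """
    weak = sum(gaps)
    strong = sum(max(g, 0.0) for g in gaps)
    return weak, strong


# Hypothetical per-episode constraint gaps: the learner violates the
# constraint in episodes 1 and 3 and over-satisfies it in episodes 2 and 4.
gaps = [0.5, -0.5, 0.25, -0.25]
weak, strong = weak_and_strong_sums(gaps)
print(weak, strong)  # 0.0 0.75
```

In this toy sequence the weak violation is zero even though the constraint is broken in half of the episodes, while the strong violation still records 0.75. This is why a sublinear bound on the strong metrics is the more demanding guarantee.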
