Optimal Strong Regret and Violation in Constrained MDPs via Policy Optimization

Francesco Emanuele Stradi; Matteo Castiglioni; Alberto Marchesi; Nicola Gatti

arXiv:2410.02275·cs.LG·October 4, 2024

TL;DR

This paper introduces an efficient policy optimization algorithm for constrained MDPs that attains optimal $\widetilde{O}(\sqrt{T})$ strong regret and strong constraint violation through a primal-dual scheme, improving on the $O(T^{0.93})$ bounds of the previous policy optimization method.

Contribution

It presents the first policy optimization method achieving optimal $\widetilde{O}(\sqrt{T})$ bounds for strong regret and strong violation in constrained MDPs.

Findings

01

Achieves $\widetilde{O}(\sqrt{T})$ bounds for both strong regret and strong violation.

02

Uses a primal-dual scheme with policy optimization and UCB-like dual updates.

03

Improves on the $O(T^{0.93})$ strong regret/violation of the earlier policy optimization method by Müller et al. (2024).
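To make the primal-dual idea in the findings above more concrete, here is a minimal sketch on a toy single-state problem. It is not the paper's algorithm: the function name `primal_dual_toy`, the step sizes, the noise model, the confidence bonus, and the cap `lam_max` are all assumptions chosen for illustration. The primal step is a multiplicative-weights policy update on the Lagrangian, and the dual step is projected gradient ascent that uses an optimistic (confidence-bound) cost estimate as a stand-in for the "UCB-like" dual update; the paper's actual updates may differ.

```python
import numpy as np

def primal_dual_toy(rewards, costs, alpha, T=2000, eta=0.05,
                    eta_dual=0.05, lam_max=10.0, seed=0):
    """Toy Lagrangian primal-dual loop (illustrative only, not the paper's method).

    rewards: known per-action rewards in [0, 1]
    costs:   unknown per-action mean costs in [0, 1], observed with noise
    alpha:   constraint threshold, i.e. we want expected cost <= alpha
    """
    rng = np.random.default_rng(seed)
    rewards = np.asarray(rewards, dtype=float)
    costs = np.asarray(costs, dtype=float)
    n_actions = len(rewards)

    policy = np.full(n_actions, 1.0 / n_actions)  # primal variable
    lam = 0.0                                     # dual variable (Lagrange multiplier)
    cost_sum = np.zeros(n_actions)
    counts = np.zeros(n_actions)

    for t in range(1, T + 1):
        # Play an action and observe a noisy cost sample.
        a = rng.choice(n_actions, p=policy)
        cost_sum[a] += costs[a] + 0.1 * rng.standard_normal()
        counts[a] += 1

        # Optimistic cost estimate: empirical mean minus a confidence bonus
        # (assumed form of the bonus; unvisited actions get a large bonus).
        mean = cost_sum / np.maximum(counts, 1)
        bonus = np.sqrt(2.0 * np.log(t + 1) / np.maximum(counts, 1))
        cost_opt = np.clip(mean - bonus, 0.0, 1.0)

        # Primal step: multiplicative weights on the Lagrangian payoff.
        lagrangian = rewards - lam * (cost_opt - alpha)
        policy = policy * np.exp(eta * lagrangian)
        policy /= policy.sum()

        # Dual step: projected gradient ascent on the estimated violation.
        violation = float(policy @ cost_opt) - alpha
        lam = float(np.clip(lam + eta_dual * violation, 0.0, lam_max))

    return policy, lam

policy, lam = primal_dual_toy(rewards=[0.9, 0.5, 0.1],
                              costs=[0.8, 0.4, 0.1], alpha=0.5)
```

The multiplier `lam` grows while the estimated violation is positive, which shifts policy mass away from costly actions; capping it at `lam_max` keeps the Lagrangian payoffs bounded.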

Abstract

We study online learning in constrained MDPs (CMDPs), focusing on the goal of attaining sublinear strong regret and strong cumulative constraint violation. Unlike their standard (weak) counterparts, these metrics do not allow negative terms to compensate positive ones, raising considerable additional challenges. Efroni et al. (2020) were the first to propose an algorithm with sublinear strong regret and strong violation, by exploiting linear programming. As a result, their algorithm is highly inefficient, leaving open the problem of achieving sublinear bounds by means of policy optimization methods, which are much more efficient in practice. Very recently, Müller et al. (2024) partially addressed this problem by proposing a policy optimization method that attains $O(T^{0.93})$ strong regret/violation. This still leaves open the question of…
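The difference between the weak and strong metrics can be illustrated with a small example. This is a sketch under the assumption that the strong metric sums only the positive part of each per-episode term, which is how the abstract's "negative terms cannot compensate positive ones" is read here; the function name `weak_and_strong` and the numbers are hypothetical.

```python
def weak_and_strong(per_episode_terms):
    """Return (weak, strong) cumulative totals for per-episode gaps.

    Each term is, for example, a per-episode constraint violation or
    regret gap; negative values mean the episode was safe or better
    than the comparator.
    """
    weak = sum(per_episode_terms)                          # negatives cancel positives
    strong = sum(max(x, 0.0) for x in per_episode_terms)   # only positive parts count
    return weak, strong

# A learner that alternates between violating and over-satisfying a constraint.
gaps = [0.5, -0.5, 0.25, -0.25]
weak, strong = weak_and_strong(gaps)
print(weak, strong)  # 0.0 0.75
```

Here the weak total is zero even though half of the episodes violate the constraint, while the strong total still records every violation. This is why an oscillating learner can look good under the weak metric and poor under the strong one.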
