Last-Iterate Convergence of General Parameterized Policies in Constrained MDPs
Washim Uddin Mondal, Vaneet Aggarwal

TL;DR
This paper introduces PDR-ANPG, a primal-dual regularized accelerated natural policy gradient algorithm for constrained MDPs, and proves last-iterate convergence guarantees whose sample complexity depends on whether the policy class is complete.
Contribution
It proposes the PDR-ANPG algorithm with entropy and quadratic regularizers, providing the first last-iterate convergence guarantees for general parameterized policies in CMDPs.
Findings
Achieves last-iterate $\tilde{O}(\frac{1}{\epsilon^4})$ sample complexity for complete policies.
Reduces sample complexity to $\tilde{O}(\frac{1}{\epsilon^2})$ when the policy class is incomplete.
Improves upon existing state-of-the-art guarantees for parameterized CMDPs.
Abstract
This paper focuses on learning a Constrained Markov Decision Process (CMDP) via general parameterized policies. We propose a Primal-Dual based Regularized Accelerated Natural Policy Gradient (PDR-ANPG) algorithm that uses entropy and quadratic regularizers to reach this goal. For parameterized policy classes with a transferred compatibility approximation error, $\epsilon_{\mathrm{bias}}$, PDR-ANPG achieves a last-iterate $\epsilon$ optimality gap and $\epsilon$ constraint violation with a sample complexity of $\tilde{O}(\epsilon^{-2}\min\{\epsilon^{-2}, \epsilon_{\mathrm{bias}}^{-1/3}\})$. If the class is incomplete ($\epsilon_{\mathrm{bias}} > 0$), then the sample complexity reduces to $\tilde{O}(\epsilon^{-2})$ for $\epsilon < (\epsilon_{\mathrm{bias}})^{1/6}$. Moreover, for complete policies with $\epsilon_{\mathrm{bias}} = 0$, our algorithm achieves a…
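As a quick check of how the complete and incomplete regimes fit together, the worked step below assumes the general bound has the form $\tilde{O}(\epsilon^{-2}\min\{\epsilon^{-2}, \epsilon_{\mathrm{bias}}^{-1/3}\})$, where $\epsilon$ is the target accuracy and $\epsilon_{\mathrm{bias}}$ is the transferred compatibility approximation error. The shorthand $N(\epsilon)$ is introduced here for illustration only, with constants and logarithmic factors suppressed and $\epsilon_{\mathrm{bias}}$ treated as a fixed constant. This is a reading of the stated rates, not the paper's proof.

```latex
% Assumed bound (illustrative shorthand, constants and log factors dropped):
%   N(\epsilon) = \epsilon^{-2} \min\{\epsilon^{-2}, \epsilon_{\mathrm{bias}}^{-1/3}\}
\begin{align*}
% Incomplete class: raising both sides to the power -2 flips the inequality,
% so the second branch of the minimum is the active one.
\epsilon < \epsilon_{\mathrm{bias}}^{1/6}
  \iff \epsilon^{-2} > \epsilon_{\mathrm{bias}}^{-1/3}
  &\;\Longrightarrow\;
  N(\epsilon) = \epsilon^{-2}\,\epsilon_{\mathrm{bias}}^{-1/3}
  = \tilde{O}(\epsilon^{-2}), \\
% Complete class: the bias term is read as +\infty, so the first branch is active.
\epsilon_{\mathrm{bias}} = 0
  &\;\Longrightarrow\;
  N(\epsilon) = \epsilon^{-2}\cdot\epsilon^{-2}
  = \tilde{O}(\epsilon^{-4}).
\end{align*}
```

Under this reading, the $\tilde{O}(\epsilon^{-2})$ rate for incomplete classes applies only once $\epsilon$ falls below the threshold $\epsilon_{\mathrm{bias}}^{1/6}$. Above that threshold the first branch of the minimum is active and the bound scales as $\epsilon^{-4}$.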
