Finite-Time Complexity of Online Primal-Dual Natural Actor-Critic Algorithm for Constrained Markov Decision Processes

Sihan Zeng; Thinh T. Doan; Justin Romberg

arXiv:2110.11383 · math.OC · November 21, 2024 · 1 citation

Open Access

TL;DR

This paper analyzes the finite-time convergence of an online primal-dual natural actor-critic algorithm for constrained Markov decision processes (CMDPs), showing that its optimality gap and constraint violation decay at a rate of O(1/K^{1/6}), where K is the number of iterations.

Contribution

The paper provides the first finite-time convergence analysis of an online primal-dual actor-critic method for CMDPs, and supports the analysis with numerical simulations.

Findings

01. The optimality gap and the constraint violation both converge at a rate of O(1/K^{1/6}).

02. The algorithm solves constrained MDPs with proven finite-time guarantees.

03. Numerical simulations confirm the theoretical results.
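
To make the rate concrete, the bound can be inverted into an iteration count. The short derivation below reads K as the number of iterations and uses a generic constant C to absorb all problem-dependent factors; C is a placeholder for illustration, not a value taken from the paper.

```latex
\[
\frac{C}{K^{1/6}} \le \varepsilon
\quad\Longleftrightarrow\quad
K \ge \left(\frac{C}{\varepsilon}\right)^{6}
\]
```

Under this reading, reaching accuracy \varepsilon takes on the order of \varepsilon^{-6} iterations, so halving the target accuracy multiplies the required K by 2^6 = 64.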

Abstract

We consider a discounted cost constrained Markov decision process (CMDP) policy optimization problem, in which an agent seeks to maximize a discounted cumulative reward subject to a number of constraints on discounted cumulative utilities. To solve this constrained optimization program, we study an online actor-critic variant of a classic primal-dual method where the gradients of both the primal and dual functions are estimated using samples from a single trajectory generated by the underlying time-varying Markov processes. This online primal-dual natural actor-critic algorithm maintains and iteratively updates three variables: a dual variable (or Lagrangian multiplier), a primal variable (or actor), and a critic variable used to estimate the gradients of both primal and dual variables. These variables are updated simultaneously but on different time scales (using different step sizes)…
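
The three-variable update described in the abstract can be sketched on a toy tabular problem. The code below is a minimal illustration under stated assumptions, not the authors' implementation: the random 3-state model, the threshold `b`, the dual bound `lam_max`, the exploration rate `eps`, the choice of state 0 as the start state, and the step-size exponents are all hypothetical. The actor step adds the estimated Lagrangian Q-values to the softmax parameters, which is the usual form of a natural policy gradient step for a tabular softmax policy.

```python
import math
import random


def softmax(prefs):
    """Numerically stable softmax over a list of preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def sample(probs, rng):
    """Draw an index from a discrete distribution."""
    u, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if u < acc:
            return i
    return len(probs) - 1


def run_primal_dual_nac(num_iters=20000, seed=0):
    rng = random.Random(seed)
    S, A, gamma = 3, 2, 0.9            # toy sizes and discount (assumed)
    b, lam_max, eps = 2.0, 10.0, 0.1   # threshold, dual bound, exploration (assumed)

    # Random toy model: transition kernel P, reward r, utility g.
    P = [[softmax([rng.gauss(0, 1) for _ in range(S)]) for _ in range(A)]
         for _ in range(S)]
    r = [[rng.random() for _ in range(A)] for _ in range(S)]
    g = [[rng.random() for _ in range(A)] for _ in range(S)]

    theta = [[0.0] * A for _ in range(S)]  # primal variable (actor)
    q_r = [[0.0] * A for _ in range(S)]    # critic estimate for the reward
    q_g = [[0.0] * A for _ in range(S)]    # critic estimate for the utility
    lam = 0.0                              # dual variable

    def behavior(state):
        # Policy mixed with uniform exploration (an assumption of this sketch).
        return [(1 - eps) * p + eps / A for p in softmax(theta[state])]

    s = 0
    a = sample(behavior(s), rng)
    for k in range(num_iters):
        # Illustrative decaying step sizes: critic fastest, dual slowest.
        beta = 1.0 / (k + 1) ** 0.5
        alpha = 0.5 / (k + 1) ** 0.75
        eta = 0.5 / (k + 1) ** 0.9

        # One transition of the single trajectory.
        s_next = sample(P[s][a], rng)
        a_next = sample(behavior(s_next), rng)

        # Critic: TD(0) update of both Q estimates from the same sample.
        for q, signal in ((q_r, r), (q_g, g)):
            td = signal[s][a] + gamma * q[s_next][a_next] - q[s][a]
            q[s][a] += beta * td

        # Actor: step along the estimated Lagrangian Q-values.
        for si in range(S):
            for ai in range(A):
                theta[si][ai] += alpha * (q_r[si][ai] + lam * q_g[si][ai])

        # Dual: projected descent on the estimated constraint slack V_g - b.
        v_g = sum(p * q for p, q in zip(softmax(theta[0]), q_g[0]))
        lam = min(lam_max, max(0.0, lam - eta * (v_g - b)))

        s, a = s_next, a_next
    return theta, lam, q_r, q_g
```

Calling `theta, lam, q_r, q_g = run_primal_dual_nac(num_iters=5000)` returns the final actor parameters, the dual variable, and the two critic tables. All three variables change at every iteration from the same sampled transition, and only the step sizes separate their time scales.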

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Adaptive Dynamic Programming Control