Finite-Time Complexity of Online Primal-Dual Natural Actor-Critic Algorithm for Constrained Markov Decision Processes
Sihan Zeng, Thinh T. Doan, Justin Romberg

TL;DR
This paper analyzes the finite-time convergence of an online primal-dual natural actor-critic algorithm for constrained Markov decision processes, showing it converges to the global optimum at a rate of O(1/K^{1/6}).
Contribution
It provides the first finite-time convergence analysis for an online primal-dual actor-critic method applied to CMDPs, with theoretical guarantees and numerical validation.
Findings
Both the optimality gap and the constraint violation decay at a rate of O(1/K^{1/6}), where K is the number of iterations.
Algorithm effectively solves constrained MDPs with proven finite-time guarantees.
Numerical simulations confirm the theoretical results.
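To get a feel for what an O(1/K^{1/6}) rate means in practice, the short sketch below inverts the bound: if the error behaves like c / K^{1/6}, reaching accuracy eps takes roughly K >= (c / eps)^6 iterations. The constant c and the function name are hypothetical; the summary does not state the hidden constant, so the numbers only illustrate scaling.

```python
import math


def iterations_for_accuracy(eps: float, c: float = 1.0) -> int:
    """Smallest integer K with c / K**(1/6) <= eps.

    `c` stands in for the unknown constant hidden by the O(.) notation
    (an assumption for illustration, not a value from the paper).
    """
    if eps <= 0 or c <= 0:
        raise ValueError("eps and c must be positive")
    # c / K**(1/6) <= eps  <=>  K >= (c / eps)**6
    return math.ceil((c / eps) ** 6)


if __name__ == "__main__":
    for eps in (0.5, 0.25):
        print(eps, iterations_for_accuracy(eps))
```

Under this model, halving the target error multiplies the required number of iterations by 2^6 = 64, which is why the rate is considered slow compared with, say, O(1/K^{1/2}).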
Abstract
We consider a discounted cost constrained Markov decision process (CMDP) policy optimization problem, in which an agent seeks to maximize a discounted cumulative reward subject to a number of constraints on discounted cumulative utilities. To solve this constrained optimization program, we study an online actor-critic variant of a classic primal-dual method where the gradients of both the primal and dual functions are estimated using samples from a single trajectory generated by the underlying time-varying Markov processes. This online primal-dual natural actor-critic algorithm maintains and iteratively updates three variables: a dual variable (or Lagrangian multiplier), a primal variable (or actor), and a critic variable used to estimate the gradients of both primal and dual variables. These variables are updated simultaneously but on different time scales (using different step sizes)…
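The three-variable structure described in the abstract can be sketched as a single loop over one trajectory. The code below is a simplified tabular illustration, not the authors' algorithm: the softmax parameterization, the SARSA-style TD(0) critic, the per-state actor step, the crude dual-gradient estimate, the constant step sizes, and every name (`online_primal_dual_nac`, `alpha`, `beta`, `eta`, `lam_max`) are assumptions made for this example. It assumes one utility constraint of the form "discounted utility >= b".

```python
import numpy as np


def online_primal_dual_nac(P, r, u, b, K=2000, gamma=0.9,
                           alpha=0.05, beta=0.2, eta=0.01,
                           lam_max=10.0, seed=0):
    """Illustrative single-trajectory primal-dual actor-critic loop.

    P: transition tensor of shape (S, A, S); r, u: reward and utility
    tables of shape (S, A); b: constraint threshold.
    The step-size ordering beta > alpha > eta is an assumption here.
    """
    rng = np.random.default_rng(seed)
    S, A = r.shape
    theta = np.zeros((S, A))   # actor (primal variable): softmax logits
    Q = np.zeros((2, S, A))    # critic: Q[0] for reward, Q[1] for utility
    lam = 0.0                  # dual variable (Lagrange multiplier)

    def policy(s):
        z = theta[s] - theta[s].max()   # shift logits for numerical stability
        p = np.exp(z)
        return p / p.sum()

    s = rng.integers(S)
    a = rng.choice(A, p=policy(s))
    for _ in range(K):
        # One more step of the same trajectory under the current policy.
        s_next = rng.choice(S, p=P[s, a])
        a_next = rng.choice(A, p=policy(s_next))

        # Critic: TD(0) update for both reward and utility value estimates.
        signals = np.array([r[s, a], u[s, a]])
        td_error = signals + gamma * Q[:, s_next, a_next] - Q[:, s, a]
        Q[:, s, a] += beta * td_error

        # Actor: natural-gradient-style step on the Lagrangian at state s.
        theta[s] += alpha * (Q[0, s] + lam * Q[1, s])

        # Dual: projected descent step, using the critic's utility value at
        # the current state as a rough stand-in for the constraint value.
        v_u = policy(s) @ Q[1, s]
        lam = float(np.clip(lam - eta * (v_u - b), 0.0, lam_max))

        s, a = s_next, a_next
    return theta, Q, lam


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    P = rng.random((3, 2, 3))
    P /= P.sum(axis=2, keepdims=True)   # normalize rows to valid distributions
    theta, Q, lam = online_primal_dual_nac(P, rng.random((3, 2)),
                                           rng.random((3, 2)), b=2.0)
    print("dual variable:", lam)
```

All three variables are updated at every step from the same sample, and only the step sizes separate them. The multiplier grows while the estimated utility falls below b and shrinks toward zero otherwise, which is the usual primal-dual mechanism for enforcing the constraint.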
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Adaptive Dynamic Programming Control
