Regret Analysis of Unichain Average Reward Constrained MDPs with General Parameterization
Anirudh Satheesh, Vaneet Aggarwal

TL;DR
This paper introduces a primal-dual actor-critic algorithm for unichain average-reward constrained MDPs and proves finite-time regret bounds without mixing-time assumptions, extending the guarantees to CMDPs with transient states.
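To make the "dual" half of a primal-dual method concrete, here is a minimal sketch of a projected Lagrange-multiplier update. It is an illustration and not the paper's exact rule: it assumes the constraint is written as "average constraint value >= 0", and the names `dual_step`, `constraint_estimate`, `step_size`, and `lam_max` are hypothetical.

```python
def dual_step(lam, constraint_estimate, step_size, lam_max):
    """One projected update of the Lagrange multiplier.

    Assumed convention (not taken from the paper): the constraint is
    satisfied when the average constraint value is >= 0, so a negative
    estimate (a violation) should push the multiplier up.
    """
    updated = lam - step_size * constraint_estimate
    # Project back onto the interval [0, lam_max].
    return min(max(updated, 0.0), lam_max)


if __name__ == "__main__":
    lam = 1.0
    # A violated constraint (negative estimate) increases the multiplier.
    print(dual_step(lam, -0.5, step_size=0.1, lam_max=10.0))
    # A satisfied constraint (positive estimate) decreases it.
    print(dual_step(lam, 0.5, step_size=0.1, lam_max=10.0))
```

The projection keeps the multiplier bounded, which is a common device in primal-dual analyses; the actor (primal) step would then ascend the reward plus `lam` times the constraint value.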
Contribution
The paper develops an algorithm that handles unichain dynamics under general policy parameterization and achieves order-optimal regret bounds without ergodicity assumptions.
Findings
Finite-time regret bounds that scale as O(√T).
Handles unichain dynamics without mixing-time oracles.
Extends theoretical guarantees to broader CMDP classes.
Abstract
We study infinite-horizon average-reward constrained Markov decision processes (CMDPs) under the unichain assumption and general policy parameterizations. Existing regret analyses for constrained reinforcement learning largely rely on ergodicity or strong mixing-time assumptions, which fail to hold in the presence of transient states. We propose a primal-dual natural actor-critic algorithm that leverages multi-level Monte Carlo (MLMC) estimators and an explicit burn-in mechanism to handle unichain dynamics without requiring mixing-time oracles. Our analysis establishes finite-time regret and cumulative constraint violation bounds that scale as O(√T), up to approximation errors arising from policy and critic parameterization, thereby extending order-optimal guarantees to a significantly broader class of CMDPs.
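As a rough illustration of the MLMC idea mentioned in the abstract, the sketch below implements the generic multi-level Monte Carlo estimator commonly used with Markovian samples: draw a random level J with P(J = j) = 2^(-j), then correct a one-sample estimate with a scaled difference of two averages. This is a sketch of the standard construction, not the paper's exact estimator; `sample_fn`, `t_max`, and `mlmc_estimate` are hypothetical names, and the burn-in step described in the abstract is not modeled.

```python
import numpy as np


def mlmc_estimate(sample_fn, t_max, rng):
    """Generic MLMC estimate of a stationary mean.

    sample_fn(n) is assumed to return n consecutive scalar samples
    from one trajectory; t_max caps the trajectory length.
    """
    # Geometric level: P(J = j) = 2^{-j} for j = 1, 2, ...
    level = int(rng.geometric(0.5))
    n = 2 ** level
    if n > t_max:
        # Level too deep: fall back to the single-sample estimate.
        return float(np.asarray(sample_fn(1), dtype=float)[0])
    xs = np.asarray(sample_fn(n), dtype=float)
    base = xs[0]                    # estimate from 1 sample
    fine = xs.mean()                # average over 2^J samples
    coarse = xs[: n // 2].mean()    # average over 2^(J-1) samples
    return float(base + n * (fine - coarse))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy i.i.d. "chain" with mean 1.0, used only for demonstration.
    draws = [mlmc_estimate(lambda n: rng.normal(1.0, 1.0, size=n), 64, rng)
             for _ in range(10_000)]
    print(np.mean(draws))
```

The appeal of this construction is that the expected number of samples per call grows only logarithmically in `t_max`, while the estimate's bias matches that of an average over roughly `t_max` samples, so no mixing-time knowledge is needed to choose a batch length.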
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Adaptive Dynamic Programming Control
