Regret Analysis of Unichain Average Reward Constrained MDPs with General Parameterization

Anirudh Satheesh; Vaneet Aggarwal

arXiv:2602.08000·cs.LG·February 10, 2026

TL;DR

This paper introduces a primal-dual natural actor-critic algorithm for unichain average-reward constrained MDPs. It provides finite-time regret and constraint violation bounds without relying on ergodicity or mixing-time assumptions, so the guarantees also cover CMDPs with transient states.

Contribution

The paper develops an algorithm that handles unichain dynamics under general policy parameterizations and achieves order-optimal regret bounds, up to approximation errors from the policy and critic parameterization, without ergodicity assumptions.

Findings

01

Finite-time regret bounds on the order of $\tilde{O}(\sqrt{T})$.

02

Handles unichain dynamics without mixing-time oracles.

03

Extends theoretical guarantees to broader CMDP classes.

Abstract

We study infinite-horizon average-reward constrained Markov decision processes (CMDPs) under the unichain assumption and general policy parameterizations. Existing regret analyses for constrained reinforcement learning largely rely on ergodicity or strong mixing-time assumptions, which fail to hold in the presence of transient states. We propose a primal-dual natural actor-critic algorithm that leverages multi-level Monte Carlo (MLMC) estimators and an explicit burn-in mechanism to handle unichain dynamics without requiring mixing-time oracles. Our analysis establishes finite-time regret and cumulative constraint violation bounds that scale as $\tilde{O}(\sqrt{T})$, up to approximation errors arising from policy and critic parameterization, thereby extending order-optimal guarantees to a significantly broader class of CMDPs.
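The abstract does not spell out the MLMC estimator, so the following is only a minimal sketch of the standard multi-level Monte Carlo construction for estimating a long-run mean from one correlated sample stream, and not necessarily the authors' exact estimator. The names `mlmc_estimate`, `next_sample`, `t_max`, and `burn_in` are hypothetical, and the burn-in step here is a simple stand-in (discarding the first samples) for whatever mechanism the paper actually uses.

```python
import random


def mlmc_estimate(next_sample, t_max, rng, burn_in=0):
    """One MLMC estimate of a long-run mean from a single sample stream.

    next_sample: zero-argument callable returning the next (possibly
        correlated) sample, e.g. a reward along one trajectory.
    t_max: cap on the number of samples used after burn-in.
    burn_in: hypothetical stand-in for a burn-in phase; these initial
        samples are drawn and discarded.
    """
    for _ in range(burn_in):
        next_sample()

    # Draw the level Q with P(Q = q) = 2**-q for q = 1, 2, ...
    q = 1
    while rng.random() < 0.5:
        q += 1

    first = next_sample()
    n = 2 ** q
    if n > t_max:
        # Level too deep for the budget: fall back to the single sample.
        return first

    samples = [first] + [next_sample() for _ in range(n - 1)]
    fine = sum(samples) / n                     # mean of the first 2**q samples
    coarse = sum(samples[: n // 2]) / (n // 2)  # mean of the first 2**(q-1) samples
    # Reweight the correction by 1 / P(Q = q) = 2**q.
    return first + n * (fine - coarse)
```

In this construction the correction terms telescope in expectation, so the estimate has the same mean as an average over roughly `t_max` samples while drawing only on the order of `log2(t_max)` samples on average after burn-in. The user supplies only a sample cap rather than a mixing time, which is the sense in which estimators of this kind can avoid a mixing-time oracle.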

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Adaptive Dynamic Programming Control