Learning Weakly Communicating Average-Reward CMDPs: Strong Duality and Improved Regret
Kihyun Yu, Beomhan Baek, Dabeen Lee

TL;DR
This paper establishes strong duality for weakly communicating average-reward CMDPs and introduces a primal-dual clipped value iteration algorithm that achieves $\tilde{O}(T^{2/3})$ regret and constraint violation bounds.
Contribution
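It proves strong duality over stationary policies for weakly communicating average-reward CMDPs with finite state and action spaces, a setting that lacks a linear programming formulation and is nonconvex, and it develops a primal-dual clipped value iteration algorithm for linear CMDPs that improves upon the best known regret and constraint violation bounds.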
Findings
Strong duality holds for weakly communicating average-reward CMDPs.
The proposed algorithm achieves $\tilde{O}(T^{2/3})$ regret and constraint violation bounds.
The approach extends clipped value iteration to constrained, weakly communicating settings.
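For readers unfamiliar with the term, strong duality for a CMDP is usually stated as an exchange of max and min in the Lagrangian. The display below is a generic sketch, not the paper's exact statement: the symbols $\Pi_S$ (stationary policies), $J_r$ and $J_g$ (average reward and average utility), $b$ (constraint threshold), and the constraint direction $J_g(\pi) \ge b$ are all assumed notation for illustration.

```latex
% Assumed notation: \Pi_S = stationary policies, J_r = average reward,
% J_g = average utility, b = threshold, constraint J_g(\pi) \ge b.
\[
  L(\pi, \lambda) = J_r(\pi) + \lambda \bigl( J_g(\pi) - b \bigr),
  \qquad
  \max_{\pi \in \Pi_S} \, \min_{\lambda \ge 0} L(\pi, \lambda)
  \;=\;
  \min_{\lambda \ge 0} \, \max_{\pi \in \Pi_S} L(\pi, \lambda).
\]
```

The left-hand side equals the constrained optimum whenever a feasible policy exists, so the equality says that the constrained problem can be solved through its Lagrangian dual.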
Abstract
We study infinite-horizon average-reward constrained Markov decision processes (CMDPs) under the weakly communicating assumption. Our contributions are twofold. First, we establish strong duality for weakly communicating average-reward CMDPs over stationary policies with finite state and action spaces. Despite the absence of a linear programming formulation and the resulting nonconvexity under the weakly communicating setting, we show that strong duality still holds by carefully exploiting the geometric structure of the occupation measure set. Second, building on this result, we propose a primal-dual clipped value iteration algorithm for learning weakly communicating average-reward linear CMDPs. Our algorithm achieves regret and constraint violation bounds of $\tilde{O}(T^{2/3})$, improving upon the best known bounds, where $T$ denotes the number of interactions. Our…
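To give a feel for the "dual" half of a primal-dual method, here is a minimal sketch of a generic projected subgradient step on a Lagrange multiplier. It is not the paper's algorithm: the function name `dual_update`, its parameters, the constraint direction (estimated utility at least `threshold`), and the cap `lam_max` are all hypothetical choices made for illustration.

```python
def dual_update(lam, utility_estimate, threshold, step_size, lam_max):
    """One projected subgradient step on the Lagrange multiplier.

    Assumption (not from the paper): the constraint has the form
    utility_estimate >= threshold, and the multiplier lives in [0, lam_max].
    """
    # The multiplier grows when the constraint is violated
    # (utility below threshold) and shrinks when there is slack.
    lam_new = lam + step_size * (threshold - utility_estimate)
    # Project back onto the interval [0, lam_max].
    return min(max(lam_new, 0.0), lam_max)


if __name__ == "__main__":
    lam = 0.0
    # Hypothetical sequence of utility estimates against a threshold of 0.5.
    for u in [0.2, 0.3, 0.6, 0.7]:
        lam = dual_update(lam, u, threshold=0.5, step_size=1.0, lam_max=10.0)
        print(f"utility={u:.1f} -> lambda={lam:.2f}")
```

In a full primal-dual scheme, the primal step would then optimize the combined signal (reward plus `lam` times utility), for example with a value iteration routine; that part is omitted here.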
