Model-Free, Regret-Optimal Best Policy Identification in Online CMDPs
Zihan Zhou, Honghao Wei, Lei Ying

TL;DR
This paper introduces a model-free algorithm called PRI for online CMDPs that achieves low regret and constraint violation while reliably identifying an approximately optimal policy, improving upon previous bounds.
Contribution
The paper proposes the PRI algorithm, leveraging the limited stochasticity property of CMDPs, to achieve regret and violation bounds with high-probability optimal policy identification.
Findings
PRI achieves $\tilde{O}(H\sqrt{K})$ regret and constraint violation.
It guarantees high-probability approximate optimal policy identification.
A matching lower bound shows the bounds are nearly optimal.
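For orientation, regret and constraint violation over $K$ episodes are commonly defined as below in the episodic CMDP literature. This is a sketch under assumed notation, none of which appears on this page: $V_1^{\pi}$ is the reward value function, $C_1^{\pi}$ the constraint (utility) value function, $\rho$ the constraint threshold, $x_{k,1}$ the initial state of episode $k$, $\pi_k$ the policy used in episode $k$, and $\pi^*$ an optimal feasible policy. The paper's exact definitions may differ.

```latex
\mathrm{Regret}(K) = \sum_{k=1}^{K} \left( V_1^{\pi^*}(x_{k,1}) - V_1^{\pi_k}(x_{k,1}) \right),
\qquad
\mathrm{Violation}(K) = \sum_{k=1}^{K} \left( \rho - C_1^{\pi_k}(x_{k,1}) \right)
```

Under this form the violation sum is signed, so an episode that over-satisfies the constraint can offset one that violates it. A stricter notion would sum only the positive parts $\left[\rho - C_1^{\pi_k}(x_{k,1})\right]_+$.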
Abstract
This paper considers the best policy identification (BPI) problem in online Constrained Markov Decision Processes (CMDPs). We are interested in algorithms that are model-free, have low regret, and identify an approximately optimal policy with high probability. Existing model-free algorithms for online CMDPs with sublinear regret and constraint violation do not provide any convergence guarantee to an optimal policy and provide only average performance guarantees when a policy is uniformly sampled at random from all previously used policies. In this paper, we develop a new algorithm, named Pruning-Refinement-Identification (PRI), based on a fundamental structural property of CMDPs proved before, which we call limited stochasticity. The property says that for a CMDP with $N$ constraints, there exists an optimal policy with at most $N$ stochastic decisions. The proposed algorithm first…
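The limited stochasticity property can be seen on a toy problem. The sketch below is not the PRI algorithm; it is a hypothetical one-state, two-action, one-constraint example with made-up rewards and utilities, solved by a plain grid search. It shows that the best feasible policy has to randomize, and that it does so in exactly one decision, matching the single constraint.

```python
# Toy one-step CMDP (hypothetical numbers, not from the paper):
# one state, two actions, N = 1 constraint.
REWARD = {"a": 1.0, "b": 0.0}    # reward of each action
UTILITY = {"a": 0.0, "b": 1.0}   # constraint utility of each action
THRESHOLD = 0.5                  # require expected utility >= THRESHOLD


def evaluate(p_a):
    """Expected reward and utility when action 'a' is played with probability p_a."""
    reward = p_a * REWARD["a"] + (1 - p_a) * REWARD["b"]
    utility = p_a * UTILITY["a"] + (1 - p_a) * UTILITY["b"]
    return reward, utility


def best_feasible(grid_size=101):
    """Grid search over p_a for the feasible policy with the highest reward."""
    best_p, best_reward = None, float("-inf")
    for i in range(grid_size):
        p_a = i / (grid_size - 1)
        reward, utility = evaluate(p_a)
        if utility >= THRESHOLD - 1e-12 and reward > best_reward:
            best_p, best_reward = p_a, reward
    return best_p, best_reward


def count_stochastic_decisions(policy, tol=1e-9):
    """Count states whose action distribution puts mass on more than one action."""
    return sum(
        1
        for probs in policy.values()
        if sum(1 for p in probs.values() if p > tol) > 1
    )


p_star, r_star = best_feasible()
policy = {"s0": {"a": p_star, "b": 1 - p_star}}
print(p_star, r_star, count_stochastic_decisions(policy))
```

Both deterministic policies are worse here: always playing `a` violates the constraint, and always playing `b` is feasible but earns zero reward. The mixture sits between them, which is why a method restricted to deterministic policies cannot be optimal in general for CMDPs.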
Peer Reviews
Decision·Submitted to ICLR 2024
The paper is well written. Furthermore, the authors propose the first model-free algorithm achieving \sqrt{T} regret and violations in CMDPs which outputs a near optimal policy, which is a non-trivial result.
1) Since the paper refers to deterministic CMDPs (even assuming the generalisation to stochastic rewards and constraints to be trivial), the notion of violation proposed seems to be weak. Indeed, the model-based methods of [Efroni et al., 2020] achieve optimal sublinear violation when cancellations between episodes are not possible. 2) The algorithm strongly relies on the Triple-Q algorithm, employing it as a subroutine. Thus, the algorithmic novelty is partial. 3) The assumption that the CMDP
1. The paper is well-written. 2.
My only suggestion for improving the clarity is to add a short review of Triple-Q to make the paper more self-contained. Other questions are discussed in the next box.
- Exploiting specific structural properties of policies and occupancy measures in the constrained MDP case, the paper proposes an effective model-free learning algorithm with a good regret and acceptable constraint violation performance. Instead of best-iterate convergence, a stronger regret result was proved. These results may be good contributions. - The paper is extremely well-written. The algorithm design and analysis were discussed very clearly.
- Although the algorithm achieves a better regret bound compared to Triple-Q in (Wei et al., 2022a), this improvement comes at the expense of increased constraint violation. Is there a tradeoff between regret and constraint violation? If so, is it possible to achieve this tradeoff by using different hyperparameters?
