Model-Free, Regret-Optimal Best Policy Identification in Online CMDPs

Zihan Zhou; Honghao Wei; Lei Ying

arXiv:2309.15395 · cs.LG · April 16, 2024

TL;DR

This paper introduces a model-free algorithm called PRI for online CMDPs that achieves low regret and constraint violation while reliably identifying an approximately optimal policy, improving upon previous bounds.

Contribution

The paper proposes the PRI algorithm, leveraging the limited stochasticity property of CMDPs, to achieve regret and violation bounds with high-probability optimal policy identification.

Findings

01

PRI achieves $\tilde{O}(H\sqrt{K})$ regret and constraint violation.

02

It guarantees high-probability approximate optimal policy identification.

03

A matching lower bound shows the bounds are nearly optimal.
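
For orientation, the two quantities bounded above are commonly defined as follows in the episodic CMDP literature. This is a sketch in assumed notation, not taken from this page: $K$ is the number of episodes, $\pi_k$ the policy used in episode $k$, $\pi^*$ an optimal policy, $V_1^{\pi}$ and $W_1^{\pi}$ the expected cumulative reward and utility of $\pi$ from the initial state $x_1$, and $\rho$ the constraint threshold. The paper's exact definitions may differ.

```latex
\mathrm{Regret}(K) = \sum_{k=1}^{K} \left( V_1^{\pi^*}(x_1) - V_1^{\pi_k}(x_1) \right),
\qquad
\mathrm{Violation}(K) = \sum_{k=1}^{K} \left( \rho - W_1^{\pi_k}(x_1) \right)
```

Under this form of the violation, an episode that over-satisfies the constraint can offset one that violates it, because the per-episode terms are summed with their signs.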

Abstract

This paper considers the best policy identification (BPI) problem in online Constrained Markov Decision Processes (CMDPs). We are interested in algorithms that are model-free, have low regret, and identify an approximately optimal policy with a high probability. Existing model-free algorithms for online CMDPs with sublinear regret and constraint violation do not provide any convergence guarantee to an optimal policy and provide only average performance guarantees when a policy is uniformly sampled at random from all previously used policies. In this paper, we develop a new algorithm, named Pruning-Refinement-Identification (PRI), based on a fundamental structural property of CMDPs proved before, which we call limited stochasticity. The property says for a CMDP with $N$ constraints, there exists an optimal policy with at most $N$ stochastic decisions. The proposed algorithm first…
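
The limited stochasticity property can be seen in a toy example. The sketch below is not the PRI algorithm; it uses a hypothetical one-state, one-step CMDP with two actions and $N = 1$ constraint (all names and numbers are illustrative assumptions). No deterministic policy is both feasible and optimal, but a policy that randomizes in a single decision is.

```python
# Hypothetical one-step CMDP: one state, two actions, N = 1 constraint.
# All values are illustrative assumptions, not taken from the paper.
REWARD = {"a": 1.0, "b": 0.0}
UTILITY = {"a": 0.0, "b": 1.0}
THRESHOLD = 0.5  # constraint: expected utility must be >= THRESHOLD


def evaluate(p_a):
    """Return (expected reward, expected utility) when action 'a' is chosen with probability p_a."""
    reward = p_a * REWARD["a"] + (1 - p_a) * REWARD["b"]
    utility = p_a * UTILITY["a"] + (1 - p_a) * UTILITY["b"]
    return reward, utility


def best_feasible(candidates):
    """Return the p_a with the highest expected reward among candidates that satisfy the constraint."""
    feasible = [p for p in candidates if evaluate(p)[1] >= THRESHOLD]
    return max(feasible, key=lambda p: evaluate(p)[0])


# Deterministic policies: always 'b' (p_a = 0.0) or always 'a' (p_a = 1.0).
deterministic_best = best_feasible([0.0, 1.0])

# Stochastic policies on a grid of mixing probabilities.
stochastic_best = best_feasible([i / 100 for i in range(101)])

print("best deterministic:", deterministic_best, evaluate(deterministic_best))
print("best stochastic:   ", stochastic_best, evaluate(stochastic_best))
```

In this example the best feasible deterministic policy always plays `b` and earns reward 0, while mixing the two actions equally earns reward 0.5 and meets the constraint with equality. The optimal policy therefore randomizes in exactly one decision, matching the bound of at most $N$ stochastic decisions for $N = 1$.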

Peer Reviews

Decision · Submitted to ICLR 2024

Reviewer 01 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

The paper is well written. Furthermore, the authors propose the first model-free algorithm achieving $\sqrt{T}$ regret and violation in CMDPs that also outputs a near-optimal policy, which is a non-trivial result.

Weaknesses

1) Since the paper refers to a deterministic CMDP (even assuming the generalisation to stochastic rewards and constraints is trivial), the proposed notion of violation seems weak. Indeed, the model-based methods of [Efroni et al., 2020] achieve optimal sublinear violation when cancellations between episodes are not possible.
2) The algorithm relies strongly on the Triple-Q algorithm, employing it as a subroutine. Thus, the algorithmic novelty is partial.
3) The assumption that the CMDP

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

1. The paper is well-written. 2.

Weaknesses

My only suggestion for improving the clarity is to add a short review of Triple-Q to make the paper more self-contained. Other questions are discussed in the next box.

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 2

Strengths

- Exploiting specific structural properties of policies and occupancy measures in the constrained MDP case, the paper proposes an effective model-free learning algorithm with a good regret and acceptable constraint violation performance. Instead of best-iterate convergence, a stronger regret result was proved. These results may be good contributions.
- The paper is extremely well-written. The algorithm design and analysis were discussed very clearly.

Weaknesses

- Although the algorithm achieves a better regret bound compared to Triple-Q in (Wei et al., 2022a), this improvement comes at the expense of increased constraint violation. Is there a tradeoff between regret and constraint violation? If so, is it possible to achieve this tradeoff by using different hyperparameters?
