Best-of-Both-Worlds Policy Optimization for CMDPs with Bandit Feedback

Francesco Emanuele Stradi; Anna Lunghi; Matteo Castiglioni; Alberto Marchesi; Nicola Gatti

arXiv:2410.02269·cs.LG·February 10, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces a new policy optimization algorithm for constrained Markov decision processes that works effectively with bandit feedback, handling both stochastic and adversarial constraints with optimal regret and violation bounds.

Contribution

It presents the first bandit-feedback compatible best-of-both-worlds algorithm for CMDPs, improving efficiency over previous occupancy-measure-based methods.

Findings

01

Achieves $\tilde{O}(\sqrt{T})$ regret and constraint violation for stochastic constraints.

02

Attains $\tilde{O}(\sqrt{T})$ constraint violation under adversarial constraints.

03

Uses a policy optimization approach, enhancing computational efficiency.

Abstract

We study online learning in constrained Markov decision processes (CMDPs) in which rewards and constraints may be either stochastic or adversarial. In such settings, Stradi et al. (2024) proposed the first best-of-both-worlds algorithm able to seamlessly handle stochastic and adversarial constraints, achieving optimal regret and constraint violation bounds in both cases. This algorithm suffers from two major drawbacks. First, it only works under full feedback, which severely limits its applicability in practice. Moreover, it relies on optimizing over the space of occupancy measures, which requires solving convex optimization problems, a highly inefficient task. In this paper, we provide the first best-of-both-worlds algorithm for CMDPs with bandit feedback. Specifically, when the constraints are stochastic, the algorithm achieves $\tilde{O}(\sqrt{T})$ regret and constraint…

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 3 · Confidence 4

Strengths

1. Clarity on presentation.
2. Sound analysis.

Weaknesses

**Minor Comments**
1. I think the definition of the occupancy measure is wrong. See the questions below for more details.
2. The bonus update seems ambiguous (lines 312, 313). In this update, both LHS and RHS have the term $B_t(x, a)$, and the only initial condition given is $B_t(x_H, a)=0$, $\forall a$ (this is mentioned in the algorithm, not in the main text). I guess the layered structure of the state space is essential to fill up all entries of $B_t(\cdot, \cdot)$ using the stated update.
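The reviewer's remark about the layered structure can be illustrated with a minimal sketch. The recursion used here, $B(x,a) = b(x,a) + c \sum_{x'} P(x' \mid x,a) \sum_{a'} \pi(a' \mid x') B(x',a')$, is an assumed generic form chosen for illustration; the paper's exact update is not reproduced in this excerpt. The function name, the `dilation` factor `c`, and the toy MDP are all hypothetical. The point is only that, with transitions going from one layer to the next and $B = 0$ on the last layer, a backward pass over layers fills every entry without circularity.

```python
def fill_bonus_backward(b, P, pi, layers, dilation=1.0):
    """Fill B(x, a) from the last layer back to the first.

    b[x][a]  : per-step bonus (hypothetical input)
    P[x][a]  : dict {next_state: probability}, supported on the next layer
    pi[x][a] : policy probabilities
    layers   : list of lists of state indices; layers[-1] is the final layer
    dilation : assumed multiplicative factor on the look-ahead term
    """
    n_actions = len(b[0])
    # Final-layer entries stay at 0, which is the initial condition.
    B = [[0.0] * n_actions for _ in b]
    for k in range(len(layers) - 2, -1, -1):
        for x in layers[k]:
            for a in range(n_actions):
                expected_next = 0.0
                for x_next, prob in P[x][a].items():
                    # Entries of the next layer were filled in the previous pass.
                    value_next = sum(pi[x_next][a2] * B[x_next][a2]
                                     for a2 in range(n_actions))
                    expected_next += prob * value_next
                B[x][a] = b[x][a] + dilation * expected_next
    return B


# Toy layered MDP: layer 0 = {0}, layer 1 = {1, 2}, final layer = {3}.
layers = [[0], [1, 2], [3]]
b = [[1.0, 1.0] for _ in range(4)]
P = [
    [{1: 1.0}, {2: 1.0}],
    [{3: 1.0}, {3: 1.0}],
    [{3: 1.0}, {3: 1.0}],
    [{}, {}],
]
pi = [[0.5, 0.5] for _ in range(4)]
B = fill_bonus_backward(b, P, pi, layers)
```

Each state's entries depend only on the next layer, so the order of states within a layer does not matter.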

Reviewer 02 · Rating 6 · Confidence 2

Strengths

- The proposed method extends prior work to a more practical setting while retaining similar guarantees.
- The authors do a good job in presenting the problem setting and algorithm.

Weaknesses

- It is unclear why it is useful to consider a weaker baseline, since it is unclear what the benefit is (from a practical standpoint) of only considering the optimal policy that respects the constraints at each episode.
- Simple numerical experiments would help showcase the benefit of updating the primal with a policy optimization approach rather than in the occupancy measure space.

Reviewer 03 · Rating 3 · Confidence 4

Strengths

The paper considers the CMDP setting with bandit feedback in both stochastic and adversarial settings, which is more practical. The authors provide theoretical guarantees on regret and constraint violation, with a specific analysis of the dependency on Slater’s condition. By focusing on policy space optimization rather than occupancy measures, the paper presents a more computationally efficient alternative for real-world applications where solving convex programs per episode is infeasible.
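To make the efficiency argument concrete, here is a minimal sketch of one common closed-form policy-space update: a per-state multiplicative-weights (exponential-weights) step. It is not claimed to be the paper's exact update. The function name, the `loss_x` estimates, and the step size `eta` are assumptions for illustration. The contrast with occupancy-measure methods is that each state is updated with a simple normalization, with no convex program to solve.

```python
import math


def exp_weights_update(pi_x, loss_x, eta):
    """One multiplicative-weights step for a single state.

    pi_x   : current action probabilities at this state
    loss_x : estimated loss of each action at this state (hypothetical input)
    eta    : step size, assumed positive
    """
    # Down-weight each action exponentially in its estimated loss.
    weights = [p * math.exp(-eta * l) for p, l in zip(pi_x, loss_x)]
    total = sum(weights)
    # Renormalize so the result is again a probability distribution.
    return [w / total for w in weights]


# Two actions, uniform start; action 1 has the lower estimated loss.
new_pi = exp_weights_update([0.5, 0.5], [1.0, 0.0], eta=0.5)
```

Applying this step independently at every state costs time linear in the number of state-action pairs per episode.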

Weaknesses

- The paper’s theoretical results rely heavily on Condition 2. As discussed in Section 2.4, $\rho \in [0, H]$; however, it is unclear how this condition could hold even when $T = 1$. Even setting aside this issue, if Condition 2 does hold, it implies that $\rho$ must be a relatively large value as $T$ increases, which is likely to be the case in online settings. According to Theorem 4, the Lagrangian multiplier is bounded by $c/\rho^2$, resulting in a very small constant close to zero. In such c

Taxonomy

Topics: Advanced Bandit Algorithms Research · Optimization and Search Problems · Auction Theory and Applications