Best-of-Both-Worlds Policy Optimization for CMDPs with Bandit Feedback
Francesco Emanuele Stradi, Anna Lunghi, Matteo Castiglioni, Alberto Marchesi, Nicola Gatti

TL;DR
This paper introduces a new policy optimization algorithm for constrained Markov decision processes that works effectively with bandit feedback, handling both stochastic and adversarial constraints with optimal regret and violation bounds.
Contribution
It presents the first best-of-both-worlds algorithm for CMDPs that works with bandit feedback, improving efficiency over previous occupancy-measure-based methods.
Findings
Achieves $\tilde{O}(\sqrt{T})$ regret and constraint violation for stochastic constraints.
Attains $\tilde{O}(\sqrt{T})$ constraint violation under adversarial constraints.
Uses a policy optimization approach, enhancing computational efficiency.
Abstract
We study online learning in constrained Markov decision processes (CMDPs) in which rewards and constraints may be either stochastic or adversarial. In such settings, Stradi et al. (2024) proposed the first best-of-both-worlds algorithm able to seamlessly handle stochastic and adversarial constraints, achieving optimal regret and constraint violation bounds in both cases. This algorithm suffers from two major drawbacks. First, it only works under full feedback, which severely limits its applicability in practice. Moreover, it relies on optimizing over the space of occupancy measures, which requires solving convex optimization problems, a highly inefficient task. In this paper, we provide the first best-of-both-worlds algorithm for CMDPs with bandit feedback. Specifically, when the constraints are stochastic, the algorithm achieves regret and constraint…
Peer Reviews
Decision·Submitted to ICLR 2025
1. Clarity on presentation. 2. Sound analysis.
**Minor Comments** 1. I think the definition of the occupancy measure is wrong. See the questions below for more details. 2. The bonus update seems ambiguous (lines 312, 313). In this update, both LHS and RHS have the term $B_t(x, a)$, and the only initial condition given is $B_t(x_H, a)=0$, $\forall a$ (this is mentioned in the algorithm, not in the main text). I guess the layered structure of the state space is essential to fill up all entries of $B_t(\cdot, \cdot)$ using the stated update.
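The reviewer's guess about the layered structure can be illustrated with a short sketch. It assumes a generic recursion of the form $B(x, a) = b(x, a) + \sum_{x'} P(x' \mid x, a) \sum_{a'} \pi(a' \mid x') B(x', a')$, which is not taken from the paper; the function name, the `local_bonus` term, and the toy MDP are all hypothetical. The point is only that, when the right-hand side involves next-layer entries alone, one backward sweep from $B(x_H, a) = 0$ fills every entry.

```python
def backward_bonus(layers, actions, policy, transition, local_bonus):
    """Fill B(x, a) layer by layer, from the last layer back to the first.

    layers[k] lists the states of layer k (loop-free MDP).
    policy[x][a] is the probability of action a in state x.
    transition[(x, a)] maps next-layer states to probabilities.
    local_bonus[(x, a)] is a placeholder per-pair term (assumption).
    """
    B = {}
    # Terminal layer: B(x_H, a) = 0 for every action.
    for x in layers[-1]:
        for a in actions:
            B[(x, a)] = 0.0
    # Each layer depends only on the layer after it, so a single
    # backward sweep defines all remaining entries.
    for k in range(len(layers) - 2, -1, -1):
        for x in layers[k]:
            for a in actions:
                future = sum(
                    p * sum(policy[x2][a2] * B[(x2, a2)] for a2 in actions)
                    for x2, p in transition[(x, a)].items()
                )
                B[(x, a)] = local_bonus[(x, a)] + future
    return B


# Toy three-layer MDP (illustrative values only).
layers = [["s0"], ["s1", "s2"], ["sH"]]
actions = [0, 1]
policy = {s: [0.5, 0.5] for s in ["s0", "s1", "s2", "sH"]}
transition = {
    ("s0", 0): {"s1": 1.0},
    ("s0", 1): {"s1": 0.5, "s2": 0.5},
    ("s1", 0): {"sH": 1.0},
    ("s1", 1): {"sH": 1.0},
    ("s2", 0): {"sH": 1.0},
    ("s2", 1): {"sH": 1.0},
}
local_bonus = {
    ("s0", 0): 0.1, ("s0", 1): 0.1,
    ("s1", 0): 1.0, ("s1", 1): 1.0,
    ("s2", 0): 2.0, ("s2", 1): 2.0,
}
B = backward_bonus(layers, actions, policy, transition, local_bonus)
```

If the state space had cycles, the same equation would become a fixed-point problem rather than a one-pass computation, which is why the loop-free assumption matters here.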
- The proposed method extends prior work to a more practical setting while retaining similar guarantees.
- The authors do a good job in presenting the problem setting and algorithm.
- It is unclear why it is useful to consider a weaker baseline, since it is unclear what the benefit is (from a practical standpoint) of only considering the optimal policy that respects the constraints at each episode.
- Simple numerical experiments would help showcase the benefit of updating the primal based on a policy optimization approach vs. the occupancy measure space.
The paper considers the CMDP setting with bandit feedback in both stochastic and adversarial settings, which is more practical. The authors provide theoretical guarantees on regret and constraint violation, with a specific analysis of the dependency on Slater’s condition. By focusing on policy space optimization rather than occupancy measures, the paper presents a more computationally efficient alternative for real-world applications where solving convex programs per episode is infeasible.
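To make the efficiency argument concrete, here is a minimal sketch of a policy-space update, assuming a per-state exponential-weights step on a Lagrangian of estimated action values. This is not the paper's algorithm: the function `policy_update`, the inputs `q_reward` and `q_cost`, the single multiplier `lam`, and the learning rate `eta` are hypothetical, and the bandit-feedback estimators and bonus terms are omitted. What it shows is that each episode's update is a closed-form pass over state-action pairs, with no convex program to solve.

```python
import math


def policy_update(policy, q_reward, q_cost, lam, eta):
    """One multiplicative-weights step per state on the Lagrangian value.

    policy[x] is the list of current action probabilities in state x.
    q_reward[x][a] and q_cost[x][a] are estimated action values (assumed given).
    lam is the Lagrange multiplier that weights the constraint cost.
    eta is the learning rate.
    """
    new_policy = {}
    for x, probs in policy.items():
        # Reweight each action by exp(eta * Lagrangian value).
        weights = [
            p * math.exp(eta * (q_reward[x][a] - lam * q_cost[x][a]))
            for a, p in enumerate(probs)
        ]
        total = sum(weights)
        # Normalize so the result is again a distribution over actions.
        new_policy[x] = [w / total for w in weights]
    return new_policy


# Toy single-state example (illustrative values only).
start = {"s": [0.5, 0.5]}
rewards = {"s": [1.0, 0.0]}
costs = {"s": [1.0, 0.0]}

# With lam = 0 the constraint is ignored and action 0 gains mass.
unconstrained = policy_update(start, rewards, costs, lam=0.0, eta=0.5)
# With a large multiplier the costly action 0 loses mass instead.
penalized = policy_update(start, rewards, costs, lam=5.0, eta=0.5)
```

The cost of one step is linear in the number of state-action pairs, in contrast with an occupancy-measure method that must project onto a constrained polytope each episode.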
- The paper’s theoretical results rely heavily on Condition 2. As discussed in Section 2.4, $\rho \in [0, H]$; however, it is unclear how this condition could hold even when $T = 1$. Even setting aside this issue, if Condition 2 does hold, it implies that $\rho$ must be a relatively large value as $T$ increases, which is likely to be the case in online settings. According to Theorem 4, the Lagrangian multiplier is bounded by $c/\rho^2$, resulting in a very small constant close to zero. In such c
Taxonomy
Topics: Advanced Bandit Algorithms Research · Optimization and Search Problems · Auction Theory and Applications
