TL;DR
This paper introduces a primal-dual policy optimization algorithm for adversarial linear CMDPs, achieving sublinear regret and constraint violation bounds in an online setting with adversarial losses.
Contribution
It presents the first algorithm with sublinear regret and violation bounds for adversarial linear CMDPs, using weighted LogSumExp softmax policies and novel analysis techniques.
Findings
Achieves $\tilde{O}(K^{3/4})$ regret and violation bounds.
Introduces weighted LogSumExp softmax policies for adversarial environments.
Validates theoretical results with numerical experiments.
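As a rough illustration of the primal-dual idea summarized above, the following sketch pairs a standard softmax policy (normalized with a numerically stable log-sum-exp) with a projected dual ascent step. It is a generic textbook construction, not the paper's weighted LogSumExp softmax policy, whose definition this page does not give. The names `softmax_policy`, `lagrangian_q`, `dual_update`, `temperature`, `budget`, and `lam_max` are all hypothetical.

```python
import numpy as np

def lagrangian_q(loss_q, cost_q, lam):
    """Combine loss and cost value estimates with a Lagrange multiplier (assumed form)."""
    return np.asarray(loss_q, dtype=float) + lam * np.asarray(cost_q, dtype=float)

def softmax_policy(q_values, temperature=1.0):
    """Generic softmax over actions; lower combined value means higher probability."""
    logits = -np.asarray(q_values, dtype=float) / temperature
    shifted = logits - logits.max()          # shift for numerical stability
    log_z = np.log(np.exp(shifted).sum())    # log-sum-exp normalizer
    return np.exp(shifted - log_z)

def dual_update(lam, cost_estimate, budget, step_size, lam_max):
    """Projected ascent on the multiplier: grows when the cost estimate exceeds the budget."""
    return float(np.clip(lam + step_size * (cost_estimate - budget), 0.0, lam_max))

# Hypothetical two-action example.
lam = 1.0
policy = softmax_policy(lagrangian_q([1.0, 2.0], [0.5, 0.0], lam))
lam = dual_update(lam, cost_estimate=0.8, budget=0.5, step_size=0.1, lam_max=10.0)
```

The multiplier trades off loss against constraint cost: a larger `lam` shifts probability toward low-cost actions, and the clipping keeps it within an assumed range `[0, lam_max]`.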
Abstract
Existing work on linear constrained Markov decision processes (CMDPs) has primarily focused on stochastic settings, where the losses and costs are either fixed or drawn from fixed distributions. However, such formulations are inherently vulnerable to adversarially changing environments. To overcome this limitation, we propose a primal-dual policy optimization algorithm for online finite-horizon adversarial linear CMDPs, where the losses are adversarially chosen under full-information feedback and the costs are stochastic under bandit feedback. Our algorithm is the first to achieve sublinear regret and constraint violation bounds in this setting, both bounded by $\tilde{O}(K^{3/4})$, where $K$ denotes the number of episodes. The algorithm introduces and runs with a new class of policies, which we call weighted LogSumExp softmax policies, designed to adapt to…