Near-Optimal Primal-Dual Algorithm for Learning Linear Mixture CMDPs with Adversarial Rewards
Kihyun Yu, Seoungbin Bae, Dabeen Lee

TL;DR
This paper introduces a near-optimal primal-dual algorithm for learning linear mixture CMDPs with adversarial rewards, with provable regret and constraint violation bounds that grow sublinearly in the number of episodes.
Contribution
It presents the first provably efficient algorithm for linear mixture CMDPs with adversarial rewards. Its regret bound is near-optimal, and its analysis rests on a new regularized dual update.
Findings
Achieves regret and constraint violation bounds of $\tilde{O}(\sqrt{d^2 H^3 K})$
First efficient algorithm for linear mixture CMDPs with adversarial rewards
Extends weighted ridge regression to the constrained setting to obtain tighter confidence intervals
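The weighted ridge regression referred to above can be illustrated with a generic sketch. It assumes NumPy and uses hypothetical helper names (`weighted_ridge`, `confidence_width`). It shows the textbook weighted least-squares estimator and its elliptical confidence width, not the paper's constrained construction. The per-sample weights are plain inputs here; in variance-weighted methods for linear mixture MDPs they are typically inverse variance estimates.

```python
import numpy as np


def weighted_ridge(features, targets, weights, lam=1.0):
    """Weighted ridge regression (generic sketch, hypothetical helper).

    Solves  min_theta  sum_i w_i * (y_i - phi_i^T theta)^2 + lam * ||theta||^2
    and returns (theta_hat, gram), where
    gram = lam * I + sum_i w_i * phi_i phi_i^T.
    """
    phi = np.asarray(features, dtype=float)   # shape (n, d)
    y = np.asarray(targets, dtype=float)      # shape (n,)
    w = np.asarray(weights, dtype=float)      # shape (n,), assumed positive
    d = phi.shape[1]
    weighted_phi = phi * w[:, None]           # row i scaled by w_i
    gram = lam * np.eye(d) + weighted_phi.T @ phi
    rhs = weighted_phi.T @ y
    theta_hat = np.linalg.solve(gram, rhs)    # solve instead of forming an inverse
    return theta_hat, gram


def confidence_width(phi_query, gram, beta):
    """Elliptical width beta * ||phi||_{gram^{-1}} around the estimate."""
    phi_query = np.asarray(phi_query, dtype=float)
    return beta * float(np.sqrt(phi_query @ np.linalg.solve(gram, phi_query)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta_true = np.array([0.5, -1.0, 2.0])
    phi = rng.normal(size=(200, 3))
    y = phi @ theta_true + 0.1 * rng.normal(size=200)
    w = rng.uniform(0.5, 2.0, size=200)       # stand-in weights for illustration
    theta_hat, gram = weighted_ridge(phi, y, w, lam=1.0)
    print("estimate:", theta_hat)
    print("width at e1:", confidence_width([1.0, 0.0, 0.0], gram, beta=1.0))
```

Larger weights pull the estimate toward the corresponding samples, and the width $\beta \lVert \phi \rVert_{\Lambda^{-1}}$ shrinks as the Gram matrix accumulates data in the direction of $\phi$.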
Abstract
We study safe reinforcement learning in finite-horizon linear mixture constrained Markov decision processes (CMDPs) with adversarial rewards under full-information feedback and an unknown transition kernel. We propose a primal-dual policy optimization algorithm that achieves regret and constraint violation bounds of $\tilde{O}(\sqrt{d^2 H^3 K})$ under mild conditions, where $d$ is the feature dimension, $H$ is the horizon, and $K$ is the number of episodes. To the best of our knowledge, this is the first provably efficient algorithm for linear mixture CMDPs with adversarial rewards. In particular, our regret bound is near-optimal, matching the known minimax lower bound up to logarithmic factors. The key idea is to introduce a regularized dual update that enables a drift-based analysis. This step is essential, as strong duality-based analysis cannot be directly applied when reward…
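The regularized dual update in the abstract can be pictured with a minimal sketch that assumes a common form not taken from the paper: projected gradient descent on the Lagrange multiplier with an added $(\epsilon/2)\lambda^2$ regularizer, for a constraint of the assumed form $V_g \ge b$. The names `regularized_dual_step` and `run_dual`, the step size `eta`, and the regularization weight `eps` are hypothetical, and the policy (primal) update on the Lagrangian reward is omitted.

```python
import random


def regularized_dual_step(lam, gap, eta, eps):
    """One regularized projected dual step (assumed form, not the paper's).

    gap = b - V_g for the current policy, assuming a constraint V_g >= b,
    so gap > 0 means the constraint is violated. This is projected gradient
    descent on  lam * (V_g - b) + (eps / 2) * lam**2  over lam >= 0.
    """
    return max(0.0, (1.0 - eta * eps) * lam + eta * gap)


def run_dual(gaps, eta=0.1, eps=0.5, lam0=0.0):
    """Apply the dual step to a sequence of per-episode gaps; return all iterates."""
    lam = lam0
    history = [lam]
    for gap in gaps:
        lam = regularized_dual_step(lam, gap, eta, eps)
        history.append(lam)
    return history


if __name__ == "__main__":
    # Persistent violation: the multiplier rises toward gap / eps = 2.0.
    print(run_dual([1.0] * 5))
    # Gaps bounded by G = 1 in magnitude: iterates stay at or below G / eps.
    rng = random.Random(0)
    hist = run_dual([rng.uniform(-1.0, 1.0) for _ in range(1000)])
    print("max multiplier:", max(hist))
```

One generic consequence of the regularizer: if every per-episode gap satisfies $|b - V_g| \le G$, $\eta\epsilon \le 1$, and $\lambda_0 \le G/\epsilon$, then $\lambda_k \le G/\epsilon$ for all $k$, since $(1-\eta\epsilon)G/\epsilon + \eta G = G/\epsilon$. A bounded multiplier of this kind is the sort of property a drift-based analysis can exploit, although the paper's exact update and argument may differ.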
