Near-optimal Policy Optimization Algorithms for Learning Adversarial Linear Mixture MDPs
Jiafan He, Dongruo Zhou, Quanquan Gu

TL;DR
This paper introduces POWERS, an optimistic policy optimization algorithm for adversarial linear mixture MDPs, which achieves a near-optimal $\tilde{O}(dH\sqrt{T})$ regret bound under adversarial rewards with full-information feedback.
Contribution
The paper proposes POWERS, a new algorithm with an importance-weighted value function estimator and a tighter confidence set, achieving nearly minimax-optimal regret in adversarial linear mixture MDPs.
Findings
Achieves a $\tilde{O}(dH\sqrt{T})$ regret bound.
Proves a matching lower bound up to logarithmic factors.
Introduces a new importance-weighted value estimator.
Abstract
Learning Markov decision processes (MDPs) in the presence of an adversary is a challenging problem in reinforcement learning (RL). In this paper, we study RL in episodic MDPs with adversarial reward and full-information feedback, where the unknown transition probability function is a linear function of a given feature mapping, and the reward function can change arbitrarily from episode to episode. We propose an optimistic policy optimization algorithm, POWERS, and show that it can achieve $\tilde{O}(dH\sqrt{T})$ regret, where $H$ is the length of the episode, $T$ is the number of interactions with the MDP, and $d$ is the dimension of the feature mapping. Furthermore, we also prove a matching lower bound of $\tilde{\Omega}(dH\sqrt{T})$ up to logarithmic factors. Our key technical contributions are two-fold: (1) a new value function estimator based on importance weighting; and (2) a tighter…
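To make the setting in the abstract concrete, the following is a minimal sketch, not the POWERS algorithm itself (its estimator and confidence set are not specified in this summary). It assumes the standard linear mixture form $P(s' \mid s, a) = \langle \phi(s' \mid s, a), \theta \rangle$ and a generic exponential-weights policy step of the kind commonly used in policy optimization. The names `mixture_kernel` and `softmax_policy_update`, the toy sizes, and the step size `alpha` are all hypothetical.

```python
import numpy as np

def mixture_kernel(phi, theta):
    """Linear mixture transition model (assumed form).

    phi:   array of shape (S, A, S, d), feature of each (s, a, s') triple
    theta: array of shape (d,), the unknown parameter
    Returns P of shape (S, A, S) with P[s, a, s'] = <phi[s, a, s'], theta>.
    """
    return phi @ theta

def softmax_policy_update(pi, Q, alpha):
    """Generic exponential-weights step: pi_new(a|s) proportional to pi(a|s) * exp(alpha * Q(s, a)).

    pi and Q have shape (S, A); pi must be strictly positive.
    """
    logits = np.log(pi) + alpha * Q
    logits -= logits.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)

# Toy instance (hypothetical sizes): d base kernels mixed by theta on the simplex,
# which guarantees that the mixture is a valid transition kernel.
rng = np.random.default_rng(0)
S, A, d = 3, 2, 4
base = rng.dirichlet(np.ones(S), size=(d, S, A))  # shape (d, S, A, S)
phi = np.moveaxis(base, 0, -1)                    # shape (S, A, S, d)
theta = rng.dirichlet(np.ones(d))                 # shape (d,)
P = mixture_kernel(phi, theta)                    # shape (S, A, S)

# One policy step from the uniform policy, using made-up Q-values.
pi0 = np.full((S, A), 1.0 / A)
Q = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
pi1 = softmax_policy_update(pi0, Q, alpha=0.5)
```

In this toy construction each feature vector stacks the next-state probabilities of `d` base kernels, so a `theta` on the probability simplex yields rows of `P` that sum to one. In the adversarial setting, the Q-values would be recomputed each episode from that episode's revealed reward function.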
Taxonomy
Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Advanced Bandit Algorithms Research
