Learning Adversarial MDPs with Bandit Feedback and Unknown Transition
Chi Jin, Tiancheng Jin, Haipeng Luo, Suvrit Sra, Tiancheng Yu

TL;DR
This paper introduces an efficient algorithm for learning in adversarial episodic MDPs with unknown transitions and bandit feedback, achieving near-optimal regret bounds comparable to full-information settings.
Contribution
It presents the first algorithm to attain $\tilde{O}(\sqrt{T})$ regret in this complex setting, with novel confidence sets and an optimistic loss estimator.
Findings
Achieves $\tilde{O}(L|X|\sqrt{|A|T})$ regret with high probability.
First to ensure $\tilde{O}(\sqrt{T})$ regret in adversarial MDPs with bandit feedback.
Introduces tighter confidence sets and an inverse-weighted loss estimator.
Abstract
We consider the problem of learning in episodic finite-horizon Markov decision processes with an unknown transition function, bandit feedback, and adversarial losses. We propose an efficient algorithm that achieves $\tilde{O}(L|X|\sqrt{|A|T})$ regret with high probability, where $L$ is the horizon, $|X|$ is the number of states, $|A|$ is the number of actions, and $T$ is the number of episodes. To the best of our knowledge, our algorithm is the first to ensure $\tilde{O}(\sqrt{T})$ regret in this challenging setting; in fact it achieves the same regret bound as (Rosenberg & Mansour, 2019a) that considers an easier setting with full-information feedback. Our key technical contributions are two-fold: a tighter confidence set for the transition function, and an optimistic loss estimator that is inversely weighted by an upper occupancy bound.
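The inversely weighted loss estimator mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, argument names, and array layout are hypothetical, and the specific form (observed loss divided by an upper bound on the visitation probability plus a small positive bias `gamma`, zero for unvisited pairs) is an assumption modeled on implicit-exploration style estimators. Dividing by an upper bound rather than the true visitation probability biases the estimate downward, which is what makes it optimistic.

```python
import numpy as np

def optimistic_loss_estimate(loss, visited, upper_occupancy, gamma):
    """Sketch of an inverse-weighted loss estimate for one episode.

    All names here are hypothetical, chosen for illustration only.
    loss            : losses per (state, action); only visited entries are used
    visited         : boolean mask, True where (state, action) was visited
    upper_occupancy : assumed upper bound on the probability of visiting each pair
    gamma           : small positive bias term (an assumption of this sketch)
    """
    loss = np.asarray(loss, dtype=float)
    visited = np.asarray(visited, dtype=bool)
    upper_occupancy = np.asarray(upper_occupancy, dtype=float)
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    # Under bandit feedback only visited pairs reveal their loss;
    # every other pair gets an estimate of zero.
    return np.where(visited, loss / (upper_occupancy + gamma), 0.0)

# Usage: two states, two actions, one visited action per state.
estimate = optimistic_loss_estimate(
    loss=[[0.5, 0.0], [0.0, 0.9]],
    visited=[[True, False], [False, True]],
    upper_occupancy=[[0.4, 0.6], [0.5, 0.25]],
    gamma=0.1,
)
```

In this sketch a pair that is visited rarely (small upper bound) receives a large estimate when it is observed, which compensates for the episodes in which its loss goes unseen.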
Taxonomy
Topics: Advanced Bandit Algorithms Research · Machine Learning and Algorithms · Reinforcement Learning in Robotics
