Learning Adversarial MDPs with Bandit Feedback and Unknown Transition

Chi Jin; Tiancheng Jin; Haipeng Luo; Suvrit Sra; Tiancheng Yu

arXiv:1912.01192·cs.LG·November 3, 2020·34 cites

Open Access

TL;DR

This paper introduces an efficient algorithm for learning in adversarial episodic MDPs with unknown transitions and bandit feedback, achieving near-optimal regret bounds comparable to full-information settings.

Contribution

It presents the first algorithm to attain $\tilde{O}(\sqrt{T})$ regret in this complex setting, with novel confidence sets and an optimistic loss estimator.

Findings

01

Achieves $\tilde{O}(L|X|\sqrt{|A|T})$ regret with high probability.

02

First to ensure $\tilde{O}(\sqrt{T})$ regret in adversarial MDPs with bandit feedback.

03

Introduces tighter confidence sets and an inverse-weighted loss estimator.

Abstract

We consider the problem of learning in episodic finite-horizon Markov decision processes with an unknown transition function, bandit feedback, and adversarial losses. We propose an efficient algorithm that achieves $\tilde{O}(L|X|\sqrt{|A|T})$ regret with high probability, where $L$ is the horizon, $|X|$ is the number of states, $|A|$ is the number of actions, and $T$ is the number of episodes. To the best of our knowledge, our algorithm is the first to ensure $\tilde{O}(\sqrt{T})$ regret in this challenging setting; in fact, it achieves the same regret bound as Rosenberg & Mansour (2019a), who consider an easier setting with full-information feedback. Our key technical contributions are two-fold: a tighter confidence set for the transition function, and an optimistic loss estimator that is inversely weighted by an upper occupancy bound.
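To make the second contribution more concrete, here is a minimal sketch of an inversely weighted loss estimator. The abstract only says the estimator is optimistic and inversely weighted by an upper occupancy bound; the exact form used here, `loss / (u + gamma)` on visited state-action pairs and zero elsewhere, is an assumption for illustration. The function name `optimistic_loss_estimate` and the parameter `gamma` are hypothetical, and the upper occupancy bounds are passed in as given numbers rather than computed from a confidence set.

```python
from typing import Dict, Tuple

StateAction = Tuple[int, int]  # (state index, action index); hypothetical encoding


def optimistic_loss_estimate(
    observed_losses: Dict[StateAction, float],
    upper_occupancy: Dict[StateAction, float],
    gamma: float,
) -> Dict[StateAction, float]:
    """Sketch of an inversely weighted loss estimator for one episode.

    observed_losses: losses seen under bandit feedback, i.e. only for the
        state-action pairs actually visited in this episode.
    upper_occupancy: an upper bound u(x, a) on the probability of visiting
        (x, a) under the current policy (assumed to be supplied by the caller).
    gamma: assumed positive bias parameter that keeps the denominator away
        from zero.
    """
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    estimates: Dict[StateAction, float] = {}
    for pair, u in upper_occupancy.items():
        if pair in observed_losses:
            # Visited pair: weight the observed loss by 1 / (u + gamma).
            estimates[pair] = observed_losses[pair] / (u + gamma)
        else:
            # Unvisited pair: no feedback, so the estimate is zero.
            estimates[pair] = 0.0
    return estimates


# Toy episode: only (state 0, action 1) was visited, with loss 0.5.
observed = {(0, 1): 0.5}
upper = {(0, 0): 0.4, (0, 1): 0.6, (1, 0): 0.3}
estimates = optimistic_loss_estimate(observed, upper, gamma=0.1)
```

Under this assumed form, the estimate is optimistic in the following sense: a pair is visited with probability at most its upper occupancy bound, so dividing by that bound (plus `gamma`) gives an estimate whose expectation does not exceed the true loss.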

Taxonomy

Topics: Advanced Bandit Algorithms Research · Machine Learning and Algorithms · Reinforcement Learning in Robotics