Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback

Tiancheng Jin; Tal Lancewicki; Haipeng Luo; Yishay Mansour; Aviv Rosenberg

arXiv:2201.13172·cs.LG·January 24, 2023·1 cites

TL;DR

This paper develops algorithms for reinforcement learning in adversarial episodic MDPs with unknown, changing delays in feedback, achieving near-optimal regret bounds that significantly improve over previous methods.

Contribution

It introduces the first algorithms that attain near-optimal regret in adversarial MDPs with arbitrary, changing delays in feedback.

Findings

01

Achieved a regret bound of order $\sqrt{K + D}$, where $K$ is the number of episodes and $D$ is the total delay.

02

Improved the best known regret bound from $(K + D)^{2/3}$ to the near-optimal $\sqrt{K + D}$.

03

Demonstrated effectiveness in adversarial, delayed feedback scenarios.

Abstract

The standard assumption in reinforcement learning (RL) is that agents observe feedback for their actions immediately. In practice, however, feedback is often observed with delay. This paper studies online learning in episodic Markov decision processes (MDPs) with unknown transitions, adversarially changing costs, and unrestricted delayed bandit feedback. More precisely, the feedback for the agent in episode $k$ is revealed only at the end of episode $k + d^{k}$, where the delay $d^{k}$ can change over episodes and is chosen by an oblivious adversary. We present the first algorithms that achieve near-optimal $\sqrt{K + D}$ regret, where $K$ is the number of episodes and $D = \sum_{k = 1}^{K} d^{k}$ is the total delay, significantly improving upon the best known regret bound of $(K + D)^{2/3}$.
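
The delay model in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: it only computes, for a given delay sequence, which episodes' feedback becomes available at the end of each episode, along with the total delay $D$. The function names `feedback_schedule` and `total_delay` and the example delay sequence are hypothetical, chosen for illustration.

```python
from collections import defaultdict


def feedback_schedule(delays):
    """Map each episode to the episodes whose feedback arrives at its end.

    delays[k - 1] is d^k, the delay of episode k (episodes are 1-indexed).
    Feedback for episode k arrives at the end of episode k + d^k; if that
    exceeds K, the feedback is never observed within the K episodes.
    """
    K = len(delays)
    arrivals = defaultdict(list)
    for k, d in enumerate(delays, start=1):
        if d < 0:
            raise ValueError("delays must be non-negative")
        if k + d <= K:
            arrivals[k + d].append(k)
    # Include episodes at which nothing arrives, for a complete picture.
    return {k: arrivals.get(k, []) for k in range(1, K + 1)}


def total_delay(delays):
    """Total delay D = sum over k of d^k."""
    return sum(delays)


# Illustrative (assumed) delay sequence for K = 5 episodes.
delays = [0, 2, 1, 0, 3]
schedule = feedback_schedule(delays)
D = total_delay(delays)

for episode, arrived in schedule.items():
    print(f"end of episode {episode}: feedback from episodes {arrived}")
print(f"K = {len(delays)}, D = {D}")
```

In this example the learner acts in episodes 2 and 3 without receiving any new feedback, then receives three episodes' worth at once at the end of episode 4. The feedback for episode 5 never arrives within the horizon.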

Videos

Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback · SlidesLive

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Machine Learning and Algorithms