Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback
Tiancheng Jin, Tal Lancewicki, Haipeng Luo, Yishay Mansour, Aviv Rosenberg

TL;DR
This paper develops algorithms for reinforcement learning in adversarial episodic MDPs with unknown, changing delays in feedback, achieving near-optimal regret bounds that significantly improve over previous methods.
Contribution
It introduces the first algorithms that attain near-optimal regret in adversarial MDPs with arbitrary, changing delays in feedback.
Findings
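Achieved a regret bound of order $\sqrt{K + D}$, where $K$ is the number of episodes and $D$ is the total delay.
Improved the best known regret bound from $(K + D)^{2/3}$ to the near-optimal $\sqrt{K + D}$.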
Demonstrated effectiveness in adversarial, delayed feedback scenarios.
Abstract
The standard assumption in reinforcement learning (RL) is that agents observe feedback for their actions immediately. However, in practice feedback is often observed with delay. This paper studies online learning in episodic Markov decision processes (MDPs) with unknown transitions, adversarially changing costs, and unrestricted delayed bandit feedback. More precisely, the feedback for the agent in episode $k$ is revealed only at the end of episode $k + d^k$, where the delay $d^k$ can change over episodes and is chosen by an oblivious adversary. We present the first algorithms that achieve near-optimal $\sqrt{K + D}$ regret, where $K$ is the number of episodes and $D = \sum_{k=1}^{K} d^k$ is the total delay, significantly improving upon the best known regret bound of $(K + D)^{2/3}$.
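The delayed-feedback protocol described in the abstract can be illustrated with a small sketch. It is not the paper's algorithm; it only shows when feedback becomes available. The function names, the example delay values, and the notation (a delay `d` for episode `k`, with the total delay taken as the sum of the per-episode delays) are assumptions made for illustration.

```python
from collections import defaultdict


def feedback_schedule(delays):
    """Return, for each episode j, the episodes whose feedback arrives at the end of j.

    delays[k - 1] is the (hypothetical) delay of episode k, with episodes
    numbered 1..K. Feedback of episode k arrives at the end of episode k + delay.
    Feedback that would arrive after episode K is never observed in the horizon.
    """
    num_episodes = len(delays)
    arrivals = defaultdict(list)
    for k, d in enumerate(delays, start=1):
        arrival_episode = k + d
        if arrival_episode <= num_episodes:
            arrivals[arrival_episode].append(k)
    return {j: arrivals.get(j, []) for j in range(1, num_episodes + 1)}


def total_delay(delays):
    """Total delay, assumed here to be the sum of the per-episode delays."""
    return sum(delays)


# Illustrative delays for K = 5 episodes (made-up values).
delays = [2, 0, 1, 0, 3]
schedule = feedback_schedule(delays)
D = total_delay(delays)

for episode, revealed in schedule.items():
    print(f"end of episode {episode}: feedback revealed for episodes {revealed}")
print(f"total delay D = {D}")
```

In this example the feedback of episode 5 would arrive at the end of episode 8, which lies beyond the horizon, so the learner never sees it. A learner in this setting must pick a policy for each episode using only the feedback that has already arrived.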
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Machine Learning and Algorithms
