Optimistic Regret Bounds for Online Learning in Adversarial Markov Decision Processes
Sang Bin Moon, Abolfazl Hashemi

TL;DR
This paper introduces a new variant of Adversarial Markov Decision Processes that uses cost predictors to achieve optimistic regret bounds, improving learning efficiency in non-adversarial, dynamic environments.
Contribution
It develops a novel policy search method with optimistic regret bounds for AMDPs, overcoming limitations of existing importance-weighted estimators and feedback models.
Findings
Achieves sublinear regret with high probability
Develops a new biased cost estimator leveraging predictors
Demonstrates effectiveness through numerical experiments
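As a rough illustration of why a predictor-based estimator can help under bandit-style feedback, the sketch below contrasts a standard importance-weighted cost estimate with a predictor-centered one. It is not the paper's estimator, which this summary does not specify. The function names, the `visit_prob` argument (the probability that the policy visits a state-action pair), and the `gamma` bias term (borrowed from the general implicit-exploration idea) are all assumptions made for illustration.

```python
import random

def iw_estimate(cost, visited, visit_prob):
    """Standard importance-weighted estimate: unbiased, but its variance
    grows as visit_prob shrinks."""
    return cost / visit_prob if visited else 0.0

def optimistic_estimate(cost, prediction, visited, visit_prob, gamma=0.0):
    """Hypothetical predictor-centered estimate: start from the prediction and
    add an importance-weighted correction only when the pair was visited.
    gamma > 0 deliberately biases the estimate to damp large corrections."""
    correction = (cost - prediction) / (visit_prob + gamma) if visited else 0.0
    return prediction + correction

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def simulate(cost, prediction, visit_prob, gamma=0.0, n=20000, seed=0):
    """Draw n independent visits and return both lists of estimates."""
    rng = random.Random(seed)
    iw, opt = [], []
    for _ in range(n):
        visited = rng.random() < visit_prob
        iw.append(iw_estimate(cost, visited, visit_prob))
        opt.append(optimistic_estimate(cost, prediction, visited, visit_prob, gamma))
    return iw, opt

if __name__ == "__main__":
    iw, opt = simulate(cost=0.6, prediction=0.5, visit_prob=0.25)
    print("importance-weighted: mean", mean(iw), "variance", variance(iw))
    print("predictor-centered:  mean", mean(opt), "variance", variance(opt))
```

With `gamma=0`, both estimates have the true cost as their expectation, but the spread of the predictor-centered one scales with the prediction error `cost - prediction` rather than with the cost itself. This is the intuition behind a bound that tightens as the predictors improve. Setting `gamma > 0` trades a small bias for smaller corrections.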
Abstract
The Adversarial Markov Decision Process (AMDP) is a learning framework that deals with unknown and varying tasks in decision-making applications like robotics and recommendation systems. A major limitation of the AMDP formalism, however, is that its regret analysis is pessimistic: although the cost function can change from one episode to the next, in many settings its evolution is not adversarial. To address this, we introduce and study a new variant of AMDP, which aims to minimize regret while utilizing a set of cost predictors. For this setting, we develop a new policy search method that achieves a sublinear optimistic regret with high probability, that is, a regret bound that degrades gracefully as the estimation power of the cost predictors weakens. Establishing such optimistic regret bounds is nontrivial given that (i) as we demonstrate, the existing importance-weighted cost…
Taxonomy
Topics: Machine Learning and Algorithms · Adversarial Robustness in Machine Learning · Distributed Sensor Networks and Detection Algorithms
Methods: Sparse Evolutionary Training
