Online Convex Optimization in Adversarial Markov Decision Processes
Aviv Rosenberg, Yishay Mansour

TL;DR
This paper develops an online learning algorithm for adversarial Markov decision processes with unknown transitions, achieving improved regret bounds by using entropic regularization and extending to convex performance criteria.
Contribution
It introduces a novel online algorithm for adversarial MDPs that handles unknown transitions and convex performance measures, with improved regret bounds.
Findings
Achieves a $\tilde{O}(L|X|\sqrt{|A|T})$ regret bound.
Extends adversarial MDP model to convex performance criteria.
Uses entropic regularization for algorithm implementation.
Abstract
We consider online learning in episodic loop-free Markov decision processes (MDPs), where the loss function can change arbitrarily between episodes, and the transition function is not known to the learner. We show a $\tilde{O}(L|X|\sqrt{|A|T})$ regret bound, where $T$ is the number of episodes, $X$ is the state space, $A$ is the action space, and $L$ is the length of each episode. Our online algorithm is implemented using entropic regularization methodology, which allows us to extend the original adversarial MDP model to handle convex performance criteria (different ways to aggregate the losses of a single episode), as well as improve previous regret bounds.
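To give a feel for what "entropic regularization" means here, the following is a minimal sketch of one online mirror descent step with a KL (entropic) regularizer on a plain probability simplex. It is a deliberate simplification and not the paper's algorithm: the abstract describes an episodic MDP with unknown transitions, and this toy version ignores states, transitions, and any confidence sets, keeping only the multiplicative-weights form of the update. The names `entropic_omd_step`, `run`, and the step size `eta` are assumptions made for illustration.

```python
import numpy as np


def entropic_omd_step(q, loss, eta):
    """One mirror descent step with a KL (entropic) regularizer on the simplex.

    Solves argmin_p  eta * <p, loss> + KL(p || q)  over probability vectors p.
    The closed form is a multiplicative-weights update followed by normalization.
    """
    # Shifting the loss by its minimum avoids underflow; the shift cancels
    # when the weights are normalized.
    w = q * np.exp(-eta * (loss - loss.min()))
    return w / w.sum()


def run(losses, eta):
    """Play the update against a sequence of loss vectors (one per round).

    Starts from the uniform distribution and returns the final distribution
    together with the learner's cumulative expected loss.
    """
    n = losses.shape[1]
    q = np.full(n, 1.0 / n)
    total = 0.0
    for loss in losses:
        total += float(q @ loss)  # expected loss suffered this round
        q = entropic_omd_step(q, loss, eta)
    return q, total


if __name__ == "__main__":
    # Hypothetical loss sequence: option 0 always costs 1, option 1 costs 0.
    losses = np.tile(np.array([1.0, 0.0]), (50, 1))
    q, total = run(losses, eta=0.5)
    print(q, total)
```

The update is linear in the loss here, but the same argmin form accepts any convex loss through its gradient, which is one way to read the abstract's remark that the methodology extends to convex performance criteria.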
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Gaussian Processes and Bayesian Inference
