Online Convex Optimization in Adversarial Markov Decision Processes

Aviv Rosenberg; Yishay Mansour

arXiv:1905.07773·cs.LG·May 21, 2019·45 cites

TL;DR

This paper develops an online learning algorithm for adversarial Markov decision processes with unknown transitions, achieving improved regret bounds by using entropic regularization and extending to convex performance criteria.

Contribution

It introduces a novel online algorithm for adversarial MDPs that handles unknown transitions and convex performance measures, with improved regret bounds.

Findings

01

Achieves a $\tilde{O}(L|X|\sqrt{|A|T})$ regret bound.

02

Extends adversarial MDP model to convex performance criteria.

03

Uses entropic regularization for algorithm implementation.

Abstract

We consider online learning in episodic loop-free Markov decision processes (MDPs), where the loss function can change arbitrarily between episodes, and the transition function is not known to the learner. We show a $\tilde{O}(L|X|\sqrt{|A|T})$ regret bound, where $T$ is the number of episodes, $X$ is the state space, $A$ is the action space, and $L$ is the length of each episode. Our online algorithm is implemented using entropic regularization methodology, which allows us to extend the original adversarial MDP model to handle convex performance criteria (different ways to aggregate the losses of a single episode), as well as improve previous regret bounds.
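To give a feel for what entropic regularization means in an online update, here is a minimal Python sketch. It is not the paper's algorithm: it shows the generic exponentiated-gradient step (mirror descent with an entropy regularizer) on a plain probability vector, whereas the abstract describes learning in an MDP with unknown transitions, which this sketch does not model. The function names `entropic_md_step` and `regret_leading_term`, the learning rate `eta`, and the example numbers are all assumptions introduced for illustration. The second helper only evaluates the leading term $L|X|\sqrt{|A|T}$ of the stated bound, ignoring constants and logarithmic factors.

```python
import math


def entropic_md_step(weights, losses, eta):
    """One exponentiated-gradient step on the probability simplex.

    Illustrative only: multiplies each weight by exp(-eta * loss) and
    renormalizes, which is the closed-form mirror descent update under
    an entropy regularizer. `eta` is a hypothetical learning rate.
    """
    scaled = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    total = sum(scaled)
    return [s / total for s in scaled]


def regret_leading_term(episode_length, num_states, num_actions, num_episodes):
    """Leading term L * |X| * sqrt(|A| * T) of the bound in the abstract.

    Constants and logarithmic factors hidden by the tilde-O notation
    are deliberately ignored.
    """
    return episode_length * num_states * math.sqrt(num_actions * num_episodes)


# Hypothetical example: two choices, the second one incurs a loss of 1.
updated = entropic_md_step([0.5, 0.5], [0.0, 1.0], eta=1.0)

# Hypothetical problem sizes: L = 2, |X| = 3, |A| = 4, T = 100.
leading = regret_leading_term(2, 3, 4, 100)
```

After the step, probability mass shifts toward the choice with the lower loss while the vector still sums to one. Because the leading term grows like $\sqrt{T}$, the average regret per episode shrinks as the number of episodes increases.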

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Gaussian Processes and Bayesian Inference