Online Reinforcement Learning for Periodic MDP

Ayush Aniket; Arpan Chattopadhyay

arXiv:2207.12045·cs.LG·July 26, 2022

TL;DR

This paper introduces PUCRL2, an algorithm for learning in periodic MDPs, where the transition probabilities and reward functions vary periodically, and shows that its regret is sub-linear in the horizon length.

Contribution

The paper recasts a periodic MDP as a stationary MDP by augmenting the state space with the period index, and proposes PUCRL2, an algorithm whose regret bound grows linearly with the period.

Findings

01. Regret of PUCRL2 scales linearly with the period.

02. PUCRL2 achieves sub-linear regret with respect to horizon length.

03. Numerical results confirm the effectiveness of PUCRL2.
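
For readers unfamiliar with the term, regret in the average-reward setting is commonly defined as below. This is the standard definition from UCRL2-style analyses, stated here as an assumption; the symbols $\rho^{*}$ (optimal long-run average reward), $r_t$ (reward collected at step $t$), and $T$ (horizon length) do not appear on this page, and the paper's own notation may differ.

```latex
\Delta(T) = T\rho^{*} - \sum_{t=1}^{T} r_t
```

Sub-linear regret means $\Delta(T)/T \to 0$ as $T$ grows, so the learner's average reward approaches the optimum.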

Abstract

We study learning in periodic Markov Decision Processes (MDPs), a special type of non-stationary MDP where both the state transition probabilities and reward functions vary periodically, under the average reward maximization setting. We formulate the problem as a stationary MDP by augmenting the state space with the period index, and propose a periodic upper confidence bound reinforcement learning-2 (PUCRL2) algorithm. We show that the regret of PUCRL2 grows linearly with the period and sub-linearly with the horizon length. Numerical results demonstrate the efficacy of PUCRL2.
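
The state-augmentation step mentioned in the abstract can be illustrated with a small sketch. This is not the authors' code: the function name `augment_periodic_mdp`, the array layouts, and the indexing of the augmented state `(s, n)` as `n * S + s` are all assumptions made for illustration. It only shows how a periodic MDP with known kernels becomes a single stationary MDP; it does not implement PUCRL2 itself.

```python
import numpy as np

def augment_periodic_mdp(P, R):
    """Build a stationary MDP from a periodic one (illustrative sketch).

    P: array of shape (N, S, A, S); P[n, s, a, s2] is the probability of
       moving from s to s2 under action a when the period index is n.
    R: array of shape (N, S, A); mean reward at period index n.
    The augmented state (s, n) is stored at index n * S + s (an assumption).
    """
    N, S, A, _ = P.shape
    P_aug = np.zeros((N * S, A, N * S))
    R_aug = np.zeros((N * S, A))
    for n in range(N):
        nxt = (n + 1) % N  # the period index advances deterministically
        # States in period block n can only reach states in block nxt.
        P_aug[n * S:(n + 1) * S, :, nxt * S:(nxt + 1) * S] = P[n]
        R_aug[n * S:(n + 1) * S, :] = R[n]
    return P_aug, R_aug

# Hypothetical toy instance: period N=2, S=2 states, A=1 action.
rng = np.random.default_rng(0)
N, S, A = 2, 2, 1
P = rng.dirichlet(np.ones(S), size=(N, S, A))  # shape (N, S, A, S)
R = rng.random((N, S, A))
P_aug, R_aug = augment_periodic_mdp(P, R)
```

The augmented model has `N * S` states, which gives some intuition for why a bound obtained this way would depend on the period.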

Taxonomy

Topics: Reinforcement Learning in Robotics