Online Reinforcement Learning in Periodic MDP
Ayush Aniket, Arpan Chattopadhyay

TL;DR
This paper addresses learning in periodic Markov Decision Processes by proposing algorithms that adapt to periodicity, achieving regret bounds that depend on the period and horizon, with improved performance when transition sparsity is exploited.
Contribution
The paper introduces the PUCRL2 and PUCRLB algorithms for periodic MDPs, with regret bounds that depend on the period and the horizon, and extends them to the unknown-period setting with U-PUCRL2 and U-PUCRLB.
Findings
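PUCRL2 has regret that is linear in the period N and sublinear in the horizon length T.
PUCRLB improves the dependence on the period to sqrt(N) by exploiting the sparsity of the augmented transition matrix.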
Algorithms perform well in numerical experiments.
Abstract
We study learning in periodic Markov Decision Processes (MDPs), a special type of non-stationary MDP where both the state transition probabilities and reward functions vary periodically, under the average reward maximization setting. We formulate the problem as a stationary MDP by augmenting the state space with the period index, and propose a periodic upper confidence bound reinforcement learning-2 (PUCRL2) algorithm. We show that the regret of PUCRL2 varies linearly with the period N and sublinearly with the horizon length T. Utilizing the information about the sparsity of the transition matrix of the augmented MDP, we propose another algorithm, PUCRLB, which improves upon PUCRL2 both in terms of regret (O(sqrt(N)) dependency on the period) and empirical performance. Finally, we propose two other algorithms, U-PUCRL2 and U-PUCRLB, for extended uncertainty in the environment…
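The state augmentation described in the abstract can be illustrated with a small sketch. This is not the authors' code: the function names (`random_periodic_mdp`, `augment`, `max_support`), the indexing convention `n * S + s` for the augmented state (s, n), and the randomly drawn kernels are all assumptions made for illustration. The sketch builds the stationary transition kernel over pairs (state, period index) and checks the sparsity that PUCRLB is said to exploit: from phase n, only states in phase (n + 1) mod N are reachable.

```python
import random


def random_periodic_mdp(S, A, N, seed=0):
    """Hypothetical helper: draw N random kernels, P[n][s][a][s2]."""
    rng = random.Random(seed)
    P = []
    for _ in range(N):
        kernel = []
        for _ in range(S):
            rows = []
            for _ in range(A):
                # Strictly positive weights, normalised to a distribution.
                w = [rng.random() + 1e-3 for _ in range(S)]
                total = sum(w)
                rows.append([x / total for x in w])
            kernel.append(rows)
        P.append(kernel)
    return P


def augment(P):
    """Build a stationary kernel over augmented states (s, n).

    Assumed convention: augmented state (s, n) has index n * S + s.
    From (s, n) under action a, the next augmented state is
    (s2, (n + 1) % N) with probability P[n][s][a][s2].
    """
    N, S, A = len(P), len(P[0]), len(P[0][0])
    P_aug = [[[0.0] * (S * N) for _ in range(A)] for _ in range(S * N)]
    for n in range(N):
        nxt = (n + 1) % N
        for s in range(S):
            for a in range(A):
                for s2 in range(S):
                    P_aug[n * S + s][a][nxt * S + s2] = P[n][s][a][s2]
    return P_aug


def max_support(P_aug):
    """Largest number of nonzero next-state probabilities in any row."""
    return max(sum(1 for p in row if p > 0) for rows in P_aug for row in rows)


if __name__ == "__main__":
    S, A, N = 3, 2, 4
    P_aug = augment(random_periodic_mdp(S, A, N))
    print("augmented states:", len(P_aug))
    print("max nonzeros per row:", max_support(P_aug), "of", S * N)
```

In this sketch each row of the augmented kernel has at most S nonzero entries out of S * N, which is the structural sparsity the abstract refers to; how PUCRLB uses it in its confidence sets is not specified in the text above.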
Taxonomy
Topics: Neural and Behavioral Psychology Studies
