Reinforcement Learning for Non-Stationary Markov Decision Processes: The Blessing of (More) Optimism
Wang Chi Cheung, David Simchi-Levi, Ruihao Zhu

TL;DR
This paper introduces algorithms for un-discounted reinforcement learning in non-stationary Markov decision processes whose reward and transition distributions drift over time, and establishes dynamic regret bounds for them using a confidence widening technique.
Contribution
The paper presents SWUCRL2-CW, a sliding-window algorithm with confidence widening whose dynamic regret is bounded when the variation budgets are known, and BORL, which tunes SWUCRL2-CW to achieve the same bound without knowing the variation budgets.
Findings
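SWUCRL2-CW achieves a dynamic regret bound when the variation budgets are known.
BORL adaptively tunes SWUCRL2-CW and achieves the same dynamic regret bound without knowing the variation budgets.
A novel confidence widening technique addresses a challenge that conventional optimistic exploration faces in non-stationary MDPs.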
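To make the sliding-window and confidence-widening ideas concrete, here is a minimal sketch for a single state-action pair. It is not the paper's implementation: the class name `SlidingWindowEstimator`, the parameters `window`, `eta`, and `delta`, and the specific Hoeffding-style radius are all assumptions chosen for illustration. It only shows the two ingredients named above: estimating from the most recent samples, and adding an extra margin to the confidence radius.

```python
import math
from collections import deque


class SlidingWindowEstimator:
    """Hypothetical helper: next-state samples for one (state, action) pair."""

    def __init__(self, num_states, window, eta, delta=0.05):
        self.num_states = num_states
        self.eta = eta          # assumed widening margin added to the radius
        self.delta = delta      # assumed confidence level
        # deque(maxlen=...) silently drops the oldest sample once full,
        # which gives the sliding-window behaviour.
        self.samples = deque(maxlen=window)

    def observe(self, next_state):
        self.samples.append(next_state)

    def estimate(self):
        """Empirical next-state distribution over the window (zeros if empty)."""
        n = max(1, len(self.samples))
        counts = [0] * self.num_states
        for s in self.samples:
            counts[s] += 1
        return [c / n for c in counts]

    def radius(self):
        """Illustrative L1 confidence radius, widened by eta."""
        n = max(1, len(self.samples))
        base = math.sqrt(2 * self.num_states * math.log(1 / self.delta) / n)
        return base + self.eta


est = SlidingWindowEstimator(num_states=2, window=4, eta=0.1)
for s in [0, 0, 1, 1, 1]:
    est.observe(s)
p_hat = est.estimate()   # only the last 4 samples (0, 1, 1, 1) are counted
print(p_hat, est.radius())
```

The window trades staleness against sample size: a shorter window forgets outdated dynamics faster but leaves fewer samples per estimate, which is one reason a tuning layer such as BORL is useful when the variation budgets are unknown.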
Abstract
We consider un-discounted reinforcement learning (RL) in Markov decision processes (MDPs) under drifting non-stationarity, i.e., both the reward and state transition distributions are allowed to evolve over time, as long as their respective total variations, quantified by suitable metrics, do not exceed certain variation budgets. We first develop the Sliding Window Upper-Confidence bound for Reinforcement Learning with Confidence Widening (SWUCRL2-CW) algorithm, and establish its dynamic regret bound when the variation budgets are known. In addition, we propose the Bandit-over-Reinforcement Learning (BORL) algorithm to adaptively tune the SWUCRL2-CW algorithm to achieve the same dynamic regret bound, but in a parameter-free manner, i.e., without knowing the variation budgets. Notably, learning non-stationary MDPs via the conventional optimistic exploration technique presents a unique…
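The abstract excerpt does not spell out the metrics behind the variation budgets. One plausible formalization, given here as an assumption rather than a quotation, sums the largest per-step change in mean rewards and in transition distributions over a horizon of T steps. The symbols T, r_t, p_t, B_r, and B_p are introduced for illustration.

```latex
B_r = \sum_{t=1}^{T-1} \max_{s,a} \bigl| r_{t+1}(s,a) - r_t(s,a) \bigr|,
\qquad
B_p = \sum_{t=1}^{T-1} \max_{s,a} \bigl\| p_{t+1}(\cdot \mid s,a) - p_t(\cdot \mid s,a) \bigr\|_1 .
```

Under this reading, a stationary MDP has B_r = B_p = 0, and larger budgets allow the environment to drift further over the horizon.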
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Smart Grid Energy Management
