Reinforcement Learning for Non-Stationary Markov Decision Processes: The Blessing of (More) Optimism

Wang Chi Cheung; David Simchi-Levi; Ruihao Zhu

arXiv:2006.14389 · cs.LG · June 26, 2020 · 25 cites

TL;DR

This paper introduces algorithms for reinforcement learning in non-stationary Markov decision processes whose reward and transition distributions drift over time. It establishes dynamic regret bounds for them, using a confidence widening technique that adds extra optimism to cope with the drift.

Contribution

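The paper presents two algorithms. SWUCRL2-CW combines sliding-window estimation with confidence widening and comes with a dynamic regret bound when the variation budgets are known. BORL tunes SWUCRL2-CW adaptively and achieves the same dynamic regret bound without knowing the variation budgets.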
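As a rough illustration of the two ingredients named above, the following minimal sketch keeps only the most recent transitions and adds a widening term to the confidence radius. It is not the paper's implementation: the class name `SlidingWindowModel`, the parameters `window`, `eta` and `delta`, and the Hoeffding-style constants in `radius` are all assumptions made for this example.

```python
import math
from collections import deque


class SlidingWindowModel:
    """Transition estimates built only from the last `window` time steps."""

    def __init__(self, num_states, window, eta, delta=0.05):
        self.num_states = num_states
        self.eta = eta          # widening term (hypothetical name)
        self.delta = delta      # confidence level for the base radius
        # A bounded deque drops the oldest transition once it is full.
        self.history = deque(maxlen=window)

    def observe(self, state, action, next_state):
        self.history.append((state, action, next_state))

    def estimate(self, state, action):
        """Return (empirical next-state distribution, sample count)."""
        counts = [0] * self.num_states
        for s, a, s_next in self.history:
            if s == state and a == action:
                counts[s_next] += 1
        n = sum(counts)
        # With no samples in the window the estimate is all zeros.
        return [c / max(1, n) for c in counts], n

    def radius(self, state, action):
        """Illustrative L1 confidence radius, widened by eta (constants assumed)."""
        _, n = self.estimate(state, action)
        base = math.sqrt(2 * self.num_states * math.log(1 / self.delta) / max(1, n))
        return base + self.eta


model = SlidingWindowModel(num_states=2, window=3, eta=0.1)
for next_state in [1, 1, 0, 0]:
    model.observe(0, 0, next_state)
p_hat, n = model.estimate(0, 0)   # uses only the 3 most recent transitions
width = model.radius(0, 0)        # base radius plus the widening term
```

A shorter window forgets stale data faster but leaves fewer samples per state-action pair, which is the trade-off that the window size and the widening term have to balance.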

Findings

01 SWUCRL2-CW achieves a dynamic regret bound when the variation budgets are known.

02 BORL achieves the same dynamic regret bound without knowing the variation budgets.

03 A novel confidence widening technique addresses a challenge that conventional optimistic exploration faces in non-stationary MDPs.
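The parameter-free tuning attributed to BORL above can be pictured as a bandit that picks a tuning parameter for the base learner, block by block. The sketch below swaps in plain EXP3 over a hypothetical grid of window sizes. The paper's actual master algorithm, parameter grid and block length are not reproduced here, and `run_block` is a toy stand-in for running the base learner, not a real routine.

```python
import math
import random


class Exp3:
    """Plain EXP3 adversarial bandit; rewards must lie in [0, 1]."""

    def __init__(self, num_arms, gamma, rng=None):
        self.num_arms = num_arms
        self.gamma = gamma                      # exploration rate
        self.weights = [1.0] * num_arms
        self.rng = rng or random.Random(0)

    def probabilities(self):
        total = sum(self.weights)
        k = self.num_arms
        return [(1 - self.gamma) * w / total + self.gamma / k
                for w in self.weights]

    def select(self):
        probs = self.probabilities()
        return self.rng.choices(range(self.num_arms), weights=probs)[0]

    def update(self, arm, reward):
        probs = self.probabilities()
        # Importance-weighted reward estimate for the arm that was played.
        estimate = reward / probs[arm]
        self.weights[arm] *= math.exp(self.gamma * estimate / self.num_arms)
        # Rescale so weights stay bounded; probabilities are unchanged.
        top = max(self.weights)
        self.weights = [w / top for w in self.weights]


candidate_windows = [1, 2, 4, 8, 16]            # hypothetical grid
bandit = Exp3(num_arms=len(candidate_windows), gamma=0.2)


def run_block(window):
    """Toy stand-in: normalized reward of the base learner over one block."""
    return 0.9 if window == 4 else 0.3


for _ in range(300):
    arm = bandit.select()
    bandit.update(arm, run_block(candidate_windows[arm]))
```

Using an adversarial bandit as the outer layer is a natural choice here, because the reward of each candidate setting can itself change from block to block as the environment drifts.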

Abstract

We consider un-discounted reinforcement learning (RL) in Markov decision processes (MDPs) under drifting non-stationarity, i.e., both the reward and state transition distributions are allowed to evolve over time, as long as their respective total variations, quantified by suitable metrics, do not exceed certain variation budgets. We first develop the Sliding Window Upper-Confidence bound for Reinforcement Learning with Confidence Widening (SWUCRL2-CW) algorithm, and establish its dynamic regret bound when the variation budgets are known. In addition, we propose the Bandit-over-Reinforcement Learning (BORL) algorithm to adaptively tune the SWUCRL2-CW algorithm to achieve the same dynamic regret bound, but in a parameter-free manner, i.e., without knowing the variation budgets. Notably, learning non-stationary MDPs via the conventional optimistic exploration technique presents a unique…
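The abstract is cut off before it names the metrics. One common way to formalize variation budgets and dynamic regret is sketched below. The notation ($r_t$, $p_t$, $B_r$, $B_p$, $\rho_t^{*}$) is assumed for this illustration and may differ from the paper's exact definitions.

```latex
% Variation budgets: total drift of mean rewards and transition kernels over T steps.
B_r = \sum_{t=1}^{T-1} \max_{s,a} \left| r_{t+1}(s,a) - r_t(s,a) \right|,
\qquad
B_p = \sum_{t=1}^{T-1} \max_{s,a} \left\| p_{t+1}(\cdot \mid s,a) - p_t(\cdot \mid s,a) \right\|_1 .

% Dynamic regret: compare against the optimal long-run average reward \rho_t^{*}
% of the stationary MDP defined by (r_t, p_t) at each time t.
\text{Dyn-Reg}(T) = \sum_{t=1}^{T} \left( \rho_t^{*} - r_t(s_t, a_t) \right).
```

Under this reading, a stationary MDP has $B_r = B_p = 0$, and the benchmark $\rho_t^{*}$ reduces to a single optimal average reward.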

Videos

Reinforcement Learning for Non-Stationary Markov Decision Processes: The Blessing of (More) Optimism · SlidesLive

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Smart Grid Energy Management