Non-stationary and Varying-discounting Markov Decision Processes for Reinforcement Learning
Zhizuo Chen, Theodore T. Allen

TL;DR
This paper introduces NVMDPs, a flexible framework for reinforcement learning that handles non-stationarity and varying discount rates, extending traditional MDPs and improving algorithm robustness in dynamic environments.
Contribution
The paper develops the NVMDP framework, extending MDPs to non-stationary, varying-discount settings, and adapts dynamic programming and policy gradient algorithms with convergence guarantees.
Findings
NVMDP algorithms recover optimal trajectories in non-stationary environments.
Traditional Q-learning fails under non-stationarity, while NVMDP-based methods succeed.
Theoretical foundations and convergence proofs for NVMDP algorithms are established.
Abstract
Algorithms developed under stationary Markov Decision Processes (MDPs) often face challenges in non-stationary environments, and infinite-horizon formulations may not directly apply to finite-horizon tasks. To address these limitations, we introduce the Non-stationary and Varying-discounting MDP (NVMDP) framework, which naturally accommodates non-stationarity and allows discount rates to vary with time and transitions. Infinite-horizon, stationary MDPs emerge as special cases of NVMDPs for identifying an optimal policy, and finite-horizon MDPs are also subsumed within the NVMDP formulations. Moreover, NVMDPs provide a flexible mechanism to shape optimal policies, without altering the state space, action space, or the reward structure. We establish the theoretical foundations of NVMDPs, including assumptions, state- and action-value formulation and recursion, matrix representation,…
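To make the idea of time- and transition-dependent discounting concrete, the following is a minimal sketch of finite-horizon backward induction, not the authors' algorithm. It assumes a tabular setting and a value recursion of the form V_t(s) = max_a sum_{s'} P_t(s'|s,a) [ r_t(s,a,s') + gamma_t(s,a,s') V_{t+1}(s') ], which is one plausible reading of the abstract rather than the paper's stated formulation. The function name `backward_induction`, the array layout `(T, S, A, S)`, and the two-state toy problem are all hypothetical.

```python
import numpy as np


def backward_induction(P, R, gamma):
    """Finite-horizon backward induction with time-varying dynamics,
    rewards, and discounts (illustrative assumption, not the paper's method).

    P, R, gamma: arrays of shape (T, S, A, S), indexed [t, s, a, s_next].
    Returns V of shape (T + 1, S) and a greedy policy of shape (T, S).
    """
    T, S, A, _ = P.shape
    V = np.zeros((T + 1, S))            # terminal values V[T] are zero
    policy = np.zeros((T, S), dtype=int)
    for t in range(T - 1, -1, -1):
        # Q[s, a] = sum over s_next of P * (R + gamma * V_next);
        # V[t + 1] broadcasts along the last (s_next) axis.
        Q = np.sum(P[t] * (R[t] + gamma[t] * V[t + 1]), axis=-1)
        policy[t] = Q.argmax(axis=1)
        V[t] = Q.max(axis=1)
    return V, policy


# Hypothetical toy problem: 2 states, 2 actions, horizon 3.
# Action 0 stays in the current state, action 1 switches state.
T, S, A = 3, 2, 2
P = np.zeros((T, S, A, S))
for s in range(S):
    P[:, s, 0, s] = 1.0        # stay
    P[:, s, 1, 1 - s] = 1.0    # switch

# Non-stationary reward: landing in the target state pays 1,
# and the target changes at the last step.
target = [1, 1, 0]
R = np.zeros((T, S, A, S))
for t in range(T):
    R[t, :, :, target[t]] = 1.0

# Varying discount: 0.5 on the first step, 1.0 afterwards.
gamma = np.ones((T, S, A, S))
gamma[0] = 0.5

V, policy = backward_induction(P, R, gamma)
print(V[0])          # values at t = 0 for each start state
print(policy[:, 0])  # greedy action at each step when in state 0
```

Because the reward target moves over time, the greedy action in a given state differs across time steps, which a single stationary policy cannot express. Setting `gamma` to a constant and making `P` and `R` identical across `t` reduces the sketch to ordinary finite-horizon value iteration.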
Taxonomy
Topics: Reinforcement Learning in Robotics · Age of Information Optimization · Advanced Bandit Algorithms Research
