Non-stationary and Varying-discounting Markov Decision Processes for Reinforcement Learning

Zhizuo Chen; Theodore T. Allen

arXiv:2511.17598 · cs.LG · December 3, 2025

Open Access

TL;DR

This paper introduces NVMDPs, a flexible framework for reinforcement learning that handles non-stationarity and varying discount rates, extending traditional MDPs and improving algorithm robustness in dynamic environments.

Contribution

The paper develops the NVMDP framework, extending MDPs to non-stationary, varying-discount settings, and adapts dynamic programming and policy gradient algorithms with convergence guarantees.

Findings

01. NVMDP algorithms recover optimal trajectories in non-stationary environments.

02. Traditional Q-learning fails under non-stationarity, while NVMDP-based methods succeed.

03. Theoretical foundations and convergence proofs for NVMDP algorithms are established.

Abstract

Algorithms developed under stationary Markov Decision Processes (MDPs) often face challenges in non-stationary environments, and infinite-horizon formulations may not directly apply to finite-horizon tasks. To address these limitations, we introduce the Non-stationary and Varying-discounting MDP (NVMDP) framework, which naturally accommodates non-stationarity and allows discount rates to vary with time and transitions. Infinite-horizon, stationary MDPs emerge as special cases of NVMDPs for identifying an optimal policy, and finite-horizon MDPs are also subsumed within the NVMDP formulations. Moreover, NVMDPs provide a flexible mechanism to shape optimal policies, without altering the state space, action space, or the reward structure. We establish the theoretical foundations of NVMDPs, including assumptions, state- and action-value formulation and recursion, matrix representation,…
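To make the idea of discount rates that vary with time and transitions concrete, here is a minimal sketch of finite-horizon backward induction on a toy problem. It is not the paper's algorithm: the recursion V_t(s) = max_a sum_{s'} P_t(s'|s,a) [r_t(s,a,s') + gamma_t(s,a,s') V_{t+1}(s')] is an assumed form, and the function name `backward_induction`, the callables `P`, `R`, `gamma`, and the two-state example are all hypothetical.

```python
from typing import Callable, List, Tuple

def backward_induction(
    n_states: int,
    n_actions: int,
    horizon: int,
    P: Callable[[int, int, int], List[float]],      # P(t, s, a) -> probabilities over next states
    R: Callable[[int, int, int, int], float],       # R(t, s, a, s2) -> reward
    gamma: Callable[[int, int, int, int], float],   # gamma(t, s, a, s2) -> discount in [0, 1]
) -> Tuple[List[List[float]], List[List[int]]]:
    """Assumed recursion (not taken from the paper):
    V_t(s) = max_a sum_s2 P_t(s2|s,a) * (r_t(s,a,s2) + gamma_t(s,a,s2) * V_{t+1}(s2)),
    with terminal values V_horizon(s) = 0."""
    V = [[0.0] * n_states for _ in range(horizon + 1)]
    policy = [[0] * n_states for _ in range(horizon)]
    for t in reversed(range(horizon)):
        for s in range(n_states):
            best_q, best_a = float("-inf"), 0
            for a in range(n_actions):
                q = sum(
                    p * (R(t, s, a, s2) + gamma(t, s, a, s2) * V[t + 1][s2])
                    for s2, p in enumerate(P(t, s, a))
                )
                if q > best_q:  # ties keep the lowest-indexed action
                    best_q, best_a = q, a
            V[t][s] = best_q
            policy[t][s] = best_a
    return V, policy

# Hypothetical toy problem: two states, and action a moves deterministically to state a.
def P(t: int, s: int, a: int) -> List[float]:
    return [1.0 if s2 == a else 0.0 for s2 in range(2)]

# Non-stationary reward: state 1 pays before t = 2, state 0 pays afterwards.
def R(t: int, s: int, a: int, s2: int) -> float:
    return 1.0 if (s2 == 1 and t < 2) or (s2 == 0 and t >= 2) else 0.0

# Time-varying discount: 0.9 on early steps, 0.5 on later ones.
def gamma(t: int, s: int, a: int, s2: int) -> float:
    return 0.9 if t < 2 else 0.5

V, policy = backward_induction(n_states=2, n_actions=2, horizon=4, P=P, R=R, gamma=gamma)
print(policy)
print(V[0])
```

In this sketch the time-indexed policy switches its target state when the reward switches, which a single stationary policy cannot do. Setting `gamma` to a constant and making `P` and `R` ignore `t` reduces the recursion to the familiar finite-horizon discounted case.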

Taxonomy

Topics: Reinforcement Learning in Robotics · Age of Information Optimization · Advanced Bandit Algorithms Research