Tackling Decision Processes with Non-Cumulative Objectives using Reinforcement Learning

Maximilian Nägele; Jan Olle; Thomas Fösel; Remmy Zen; Florian Marquardt

arXiv:2405.13609·cs.LG·May 26, 2025·1 cites

TL;DR

This paper introduces a method to transform non-cumulative Markov decision processes into standard MDPs, enabling the use of existing reinforcement learning techniques for a broader class of problems with arbitrary reward functions.

Contribution

The authors propose a general mapping from NCMDPs to standard MDPs, allowing existing algorithms to be applied to non-cumulative objectives in various applications.
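
As a minimal sketch of how such a mapping can work, consider the maximum-of-rewards objective mentioned in the abstract. The notation follows the quantities quoted in the reviews on this page: a summary $h_t$ of the reward history, an update $h_{t+1}=u(h_t,\tilde{r}_t)$, and a mapped reward $r_t=\rho(h_t,\tilde{r}_t)$. The concrete choices of `u` and `rho` below, the `None` initial value for `h`, and the helper name `map_rewards` are assumptions made for illustration and are not taken from the paper. The sketch also assumes an undiscounted episode, so that the mapped rewards sum exactly to the maximum of the original rewards.

```python
from typing import List, Optional, Tuple


def u(h: Optional[float], r_tilde: float) -> float:
    """Update the reward-history summary h (here: the running maximum)."""
    return r_tilde if h is None else max(h, r_tilde)


def rho(h: Optional[float], r_tilde: float) -> float:
    """Mapped reward: the amount by which the running maximum increases."""
    return r_tilde if h is None else max(0.0, r_tilde - h)


def map_rewards(raw_rewards: List[float]) -> Tuple[List[float], Optional[float]]:
    """Hypothetical helper: convert NCMDP rewards into mapped MDP rewards."""
    h: Optional[float] = None  # assumption: no history before the first step
    mapped = []
    for r_tilde in raw_rewards:
        mapped.append(rho(h, r_tilde))  # reward handed to the standard RL agent
        h = u(h, r_tilde)               # h becomes part of the next state
    return mapped, h


raw = [1.0, -2.0, 3.5, 0.5]
mapped, h_final = map_rewards(raw)
print(mapped)       # per-step rewards of the mapped MDP
print(sum(mapped))  # equals max(raw) because the increments telescope
```

In an actual environment the agent would observe the pair $(s_t, h_t)$ rather than $s_t$ alone, since the mapped reward depends on the running maximum.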

Findings

01 · Improved performance over standard MDP approaches.

02 · Reduced training time in diverse tasks.

03 · Applicable to control, finance, and optimization problems.

Abstract

Markov decision processes (MDPs) are used to model a wide variety of applications ranging from game playing over robotics to finance. Their optimal policy typically maximizes the expected sum of rewards given at each step of the decision process. However, a large class of problems does not fit straightforwardly into this framework: Non-cumulative Markov decision processes (NCMDPs), where instead of the expected sum of rewards, the expected value of an arbitrary function of the rewards is maximized. Example functions include the maximum of the rewards or their mean divided by their standard deviation. In this work, we introduce a general mapping of NCMDPs to standard MDPs. This allows all techniques developed to find optimal policies for MDPs, such as reinforcement learning or dynamic programming, to be directly applied to the larger class of NCMDPs. Focusing on reinforcement learning,…

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 4

Strengths

Definition 1 and the possible applications of the method are interesting.

Weaknesses

The variables $u$, $h$, and $\rho$ are discussed only in the paragraph following the statements of Equations 3, 4, and 5; they should be introduced before that. Their description is also too vague. For example, "This can be achieved by extending the state space with $h_t$, which preserves all necessary information about the reward history" does not give me a good sense of what the function $u$ should be. I suggest the statement of Theorem 1 be rearranged to: assumptions then conclusion, instead of the current: assumpti

Reviewer 02 · Rating 3 · Confidence 3

Strengths

- The NCMDP setting considered in this paper fits many applications that do not directly fit into the MDP setting with cumulative rewards. Examples include the weakest-link problem in network routing, which maximizes the minimum reward, and the Sharpe ratio in finance, which maximizes the mean divided by the standard deviation.
- The paper provides a straightforward solution to NCMDPs by first mapping an NCMDP to a standard MDP, which allows direct application of black-box MDP solvers.
- It provides comp

Weaknesses

- The mapping from NCMDP to MDP provided in Equations (3)-(5) augments the state with $h_t$, which represents the necessary information about the reward history, is updated as $h_{t+1}=u(h_t,\tilde{r}_t)$, and satisfies $r_t=\rho(h_t,\tilde{r}_t)$. For an arbitrary reward function $f(r)$, an essential factor in keeping the mapped MDP at a reasonable size is finding an efficient functional form of $u$ and $\rho$ that summarizes the reward history. However, the paper only shows a list of

Reviewer 03 · Rating 6 · Confidence 4

Strengths

* The authors consider the problem of control in NCMDPs, which is important and understudied in the RL community.
* The proposed mapping of NCMDPs to standard MDPs is simple and intuitive, and can effectively solve many problem instances of NCMDPs.
* The empirical studies are concrete, showing both the necessity of considering the problem of control in NCMDPs and the effectiveness of the proposed mappings.

Weaknesses

* The paper has technical flaws, i.e., the function $f$ is ill-defined. In the original problem statement, $f$ is defined as a function on $\mathbb{R}^T$. But later on, the authors also use notation like $f(r_1,...,r_t)$, where $f$ should be treated as a function on $\mathbb{R}^t$. I think it may be helpful to define $f$ as a function of a set ($f(\{r_1,...,r_t\})$, where $\{r_1,...,r_t\}$ is the set of $t$ rewards and $t$ can be any integer between $1$ and $T$). This works for all problem instances

Code & Models

Repositories

maxnaeg/zxreinforce · tf · Official

Taxonomy

Topics: Complex Systems and Decision Making

Methods: Sparse Evolutionary Training