Distributional Method for Risk Averse Reinforcement Learning
Ziteng Cheng, Sebastian Jaimungal, Nick Martin

TL;DR
This paper presents a distributional reinforcement learning approach for risk-averse policies in Markov decision processes, leveraging neural networks to efficiently handle randomized policies and avoid the curse of dimensionality.
Contribution
It introduces a novel distributional method for risk-averse reinforcement learning that effectively incorporates randomized policies and exploits problem structure to mitigate dimensionality issues.
Findings
The proposed method successfully avoids the curse of dimensionality.
Neural network approximation effectively models the value distribution.
The approach performs well across various randomly chosen model parameters.
SJ would like to acknowledge support from the Natural Sciences and Engineering Research Council of Canada (grants RGPIN-2018-05705 and RGPAS-2018-522715).
Ziteng Cheng (1, 4), Sebastian Jaimungal (1, 4) and Nick Martin (3, 4)
(1) Department of Statistical Sciences, University of Toronto, Canada, {sebastian.jaimungal, ziteng.cheng}@utoronto.ca
(3) [email protected]
(4) Equal contribution
Abstract
We introduce a distributional method for learning the optimal policy in risk averse Markov decision processes with finite state and action spaces, latent costs, and stationary dynamics. We assume sequential observations of states, actions, and costs and assess the performance of a policy using dynamic risk measures constructed from nested Kusuoka-type conditional risk mappings. For such performance criteria, randomized policies may outperform deterministic policies; therefore, the candidate policies lie in the d-dimensional simplex, where d is the cardinality of the action space. Existing risk averse reinforcement learning methods seldom consider randomized policies, and naive extensions to the current setting suffer from the curse of dimensionality. By exploiting certain structures embedded in the corresponding dynamic programming principle, we propose a distributional learning method for seeking the optimal policy. The conditional distribution of the value function is cast into a specific type of function, which is chosen with the ease of risk averse optimization in mind. We use a deep neural network to approximate said function, illustrate that the proposed method avoids the curse of dimensionality in the exploration phase, and explore the method’s performance with a wide range of model parameters that are picked randomly.
Index Terms:
risk averse, Markov decision process, reinforcement learning, deep learning
I Introduction
Markov Decision Processes (MDPs) are a type of discrete-time stochastic control problem used for sequential decision-making in situations where costs are partially random and partially under the control of a decision maker. In risk-averse MDPs, the decision maker is concerned with the risk or variability of the outcomes beyond the expected costs. One way to incorporate risk aversion into MDPs is to use nested compositions of risk transition mappings. This approach ensures the time-consistency property and ultimately enables the use of a dynamic programming principle (DPP) to solve the corresponding sequential optimization problem. The approach is proposed in [21], where deterministic costs are considered. Both finite and infinite time horizon DPPs are derived, the latter requiring bounded costs. Subsequent studies such as [23] and [8] explore infinite time horizon risk-averse DPPs with unbounded costs in different settings. [2] considers unbounded latent costs and establishes the corresponding finite and infinite horizon DPPs. More recently, [7] has developed a framework based on Kusuoka-type conditional risk mappings that also takes into account randomized actions in a risk averse manner. Depending on the type of conditional risk mappings used, randomized actions may be preferable to deterministic actions, as illustrated in a motivating example in [7]. Other methods of incorporating risk aversion into MDPs are also available, including those discussed in [3], [8], [5], and the references therein.
The focus of this paper is on the approach of nested compositions of risk transition mappings. The main objective is to develop a reinforcement learning method that solves the infinite horizon risk-averse MDP problem presented in [7]. Specifically, the aim is to solve this problem with finite state and action spaces, deterministic latent costs, and stationary dynamics, without assuming knowledge of the controlled transition matrix or cost function. We begin by briefly reviewing some algorithms that solve infinite horizon risk-averse MDP problems.
For instance, [26] derives a policy gradient formula by combining the static gradient formula for coherent risk measures with the corresponding DPP. This approach is further developed into a sample-based method in [25], with its convergence analyzed in [12]. [29] proposes a family of sample-based algorithms to approximately solve problems with continuous state and action spaces. [24] presents and analyzes a risk-averse Q-learning algorithm, while [13] extends the previous Q-learning algorithm based on estimating a general minimax function with stochastic approximation, with detailed error analysis conducted in [14]. [17] studies a risk-averse temporal difference method that evaluates the value function using linear function approximations. Finally, the recent work in [9] develops an approach to address risk transition mappings induced by convex risk measures.
However, the methods mentioned above do not directly apply to the problem presented in [7], where the risk aversion also involves the randomness in the randomized actions. This is mainly because of the lack of linearity: the value function of a randomized action may not be a linear combination of the value functions of individual actions with respect to the randomizing action kernel. Naively extending the existing methods may result in a situation where we need to learn the value functions for numerous pairs of states and action kernels. Since the admissible action kernels form a $d$-dimensional simplex, where $d$ is the size of the action space, the exploration task that follows may suffer from the curse of dimensionality and demand a significant amount of data. On the other hand, the finite nature of the underlying state and action spaces suggests that we can avoid such an excessively expensive exploration task.
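The lack of linearity can be illustrated with a small numerical sketch. It uses the average value-at-risk (AVaR) as a stand-in for the risk mapping, which is an assumption made only for illustration; the function name `avar` and the two-action toy costs are hypothetical and not taken from [7].

```python
import numpy as np

def avar(values, probs, alpha):
    """AVaR at level alpha in [0, 1) of a discrete cost:
    the average of the worst (largest) outcomes carrying total mass 1 - alpha."""
    order = np.argsort(values)[::-1]            # worst (largest) costs first
    v = np.asarray(values, dtype=float)[order]
    p = np.asarray(probs, dtype=float)[order]
    tail = 1.0 - alpha
    # Mass taken from each atom until the tail budget is used up.
    taken = np.minimum(p, np.maximum(tail - (np.cumsum(p) - p), 0.0))
    return float(taken @ v / tail)

# Toy example (hypothetical): action 0 always costs 0, action 1 always costs 1.
# A randomized action picks each with probability 1/2.
risk_of_mixture = avar([0.0, 1.0], [0.5, 0.5], alpha=0.5)
mixture_of_risks = 0.5 * avar([0.0], [1.0], alpha=0.5) + 0.5 * avar([1.0], [1.0], alpha=0.5)
print(risk_of_mixture, mixture_of_risks)
```

In this toy case the risk of the randomized action differs from the kernel-weighted average of the risks of the individual actions, so knowing the value of each deterministic action is not enough to evaluate a randomized one.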
We propose a distributional method to address the challenges posed by the risk aversion towards randomized actions in the problem presented in [7]. The proposed method learns an auxiliary function that contains sufficient information about the value function’s distribution, avoiding the curse of dimensionality and facilitating the computation of the value function defined via a risk transition mapping. We show in Theorem III.2 that the proposed method’s exploration effort grows polynomially with the state and action space cardinalities. Although we initially considered deterministic latent costs, our method naturally handles random costs whose distribution depends on the current state, the realized action, and the next state. This type of random cost is seldom considered in existing literature on risk-averse reinforcement learning. We provide numerical examples that demonstrate the efficacy of the proposed method at the end of this report.
Using distributional methods to solve MDP problems that are not risk neutral has a long history (cf. [15], [27], [19], and the references therein). More recently, a series of works including [4], [10], [28], [20], and [30] have demonstrated that distributional methods can also achieve better results in the risk neutral setting. In this broader context, our method also contributes to the understanding of the capabilities of distributional methods in solving MDP problems.
II Preliminaries
In this section, we present the setup of this paper.
II-A Markov decision process
Let be a probability space. We consider a time-homogeneous Markov decision process (MDP) with a finite state space $\mathbb{X}$ and finite action space . For each action $k$, let $T^{k}$ be a controlled transition matrix, where $T^{k}_{ij}$ is the probability of transitioning to state $j$ at the next epoch, given the current state $i$ and action $k$. Let be a stationary Markovian policy, where is the set of probability measures on . Since is finite, is a $d$-dimensional simplex, and for , is the probability of action $k$ occurring. The state-action process subject to policy is denoted by . The MDP is associated with a bounded latent cost function $C$, where is the upper bound of the cost. Finally, we let $\gamma$ be the discount factor.
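To make the setup concrete, the following is a minimal sketch of a finite MDP with a randomized stationary policy and a latent cost depending on the current state, the action, and the next state. All sizes, array names, and the uniform cost range are assumptions chosen for illustration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3          # hypothetical sizes

# T[k, i, j]: probability of moving from state i to state j under action k.
T = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
# policy[i, k]: probability of choosing action k in state i (a point in the simplex).
policy = rng.dirichlet(np.ones(n_actions), size=n_states)
# C[i, k, j]: bounded latent cost, here assumed to lie in [0, 1].
C = rng.uniform(0.0, 1.0, size=(n_states, n_actions, n_states))

def simulate(T, policy, C, t_max, rng, x0=0):
    """Sample states, actions and costs for t_max transitions under the policy."""
    xs, acts, costs = [x0], [], []
    for _ in range(t_max):
        i = xs[-1]
        k = rng.choice(policy.shape[1], p=policy[i])   # randomized action
        j = rng.choice(T.shape[2], p=T[k, i])          # next state
        acts.append(k)
        costs.append(C[i, k, j])
        xs.append(j)
    return np.array(xs), np.array(acts), np.array(costs)

xs, acts, costs = simulate(T, policy, C, t_max=50, rng=rng)
```

The returned arrays play the role of the sequential observations of states, actions, and costs that the learning method takes as input.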
II-B Risk averse dynamic programming
In this paper, we use the notation to denote the space of bounded real-valued Borel-measurable random variables. Equality and inequality between random variables are understood in a -almost sure sense. Let be a $\sigma$-algebra. We say is a conditional risk mapping if satisfies the following conditions for any , and ,
- (i) [Monotonicity] if , then
- (ii) [Translation equivariance]
- (iii) [Convexity] if , then
- (iv) [Positive homogeneity]
Some may replace in condition (iii) (resp. (iv)) with (resp. ), which, for the most part, does not affect the development of the theory.
Consider , and . Suppose for all . [21] proposes to use a dynamic risk measure of the form
[TABLE]
for the MDP optimization problem. It can be shown that the construction above guarantees time consistency of . (Footnote 1: We do not adopt verbatim the setting from [21], for the sake of a smooth transition.)
In what follows, we let be the trivial $\sigma$-algebra, be the $\sigma$-algebra generated by , and be a set of discrete probability measures with support contained in . We consider a specific type of conditional risk mapping
[TABLE]
where the right-hand side is inspired by the Kusuoka representation of law-invariant coherent risk measures (cf. [18], [22, Section 6]). We note that
[TABLE]
is the conditional version of under the conditional distribution of given . The main goal of this paper is to develop a sample-based algorithm that solves the following infinite horizon risk averse MDP optimization problem
[TABLE]
where is defined analogously to (II.1).
The problem (II.3) can be solved using a dynamic programming principle. Specifically, we let be the Bellman operator acting on defined as
[TABLE]
We can restrict to take values in due to the boundedness of the cost. This allows us to replace in (II.4) with . It can be shown that is a $\gamma$-contraction, and the fixed point of , denoted by , is the optimal value function. If attains the infimum in for all , then is the optimal stationary policy, and in fact, it is also optimal among all history-dependent policies. We refer to [7] for more discussion in a general setting.
III Distributional method for risk-averse learning
In this section, we introduce a novel concept called -values. We then propose a learning method based on -values and establish a convergence result under suitable conditions, as stated in Theorem III.2. Finally, we provide a detailed description of the algorithm for implementing the method.
III-A -value
In view of (II.4), we define the -value as
[TABLE]
and derive the following equation for -learning
[TABLE]
However, learning the function on a fine grid of turns out to be excessively expensive. Therefore, instead of continuing with (III.2), we propose to learn the following -value. (Footnote 2: By using to express trapezoidal shaped functions and invoking dominated convergence (cf. [6, Theorem 2.8.1]) and the monotone class theorem (cf. [6, Theorem 1.9.3 (ii)]), it can be shown that the function , , characterizes the distribution of . Footnote 3: One may derive an equation for the -value analogous to (III.2), but such an equation need not lead to a contraction in general.)
[TABLE]
Such a -value has the advantage of being linear in , which helps mitigate the cost of exploration. Moreover, it is worth noting that is non-increasing and -Lipschitz for any , which will be useful in future analysis. Furthermore, we argue that the -value is aligned with our goal of solving (II.3), since by the aforementioned DPP and (III.3), we have
[TABLE]
This formula shows that the -value is a crucial ingredient in our approach for solving (II.3).
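As a rough illustration of why an expected-excess function of the threshold $q$ is convenient for risk averse optimization, the sketch below recovers an AVaR from such a function evaluated on a grid, using the standard Rockafellar-Uryasev representation $\mathrm{AVaR}_{\alpha}(Z)=\min_{q}\{q+\mathbb{E}[(Z-q)_{+}]/(1-\alpha)\}$. The displayed formula above did not survive extraction, so this is not a reconstruction of it; the choice of AVaR, the function names, and the toy distribution are all assumptions.

```python
import numpy as np

def expected_excess(values, probs, q_grid):
    """h(q) = E[(Z - q)_+] for a discrete cost Z, evaluated at every grid point."""
    return np.maximum(values[None, :] - q_grid[:, None], 0.0) @ probs

def avar_from_excess(q_grid, h, alpha):
    """Approximate AVaR at level alpha by minimising q + h(q) / (1 - alpha) over the grid."""
    return float(np.min(q_grid + h / (1.0 - alpha)))

# Toy discrete cost distribution (hypothetical).
values = np.array([0.0, 1.0, 2.0])
probs = np.array([0.5, 0.3, 0.2])
q_grid = np.linspace(0.0, 2.0, 201)

h = expected_excess(values, probs, q_grid)
print(avar_from_excess(q_grid, h, alpha=0.8))   # average of the worst 20% of outcomes
print(avar_from_excess(q_grid, h, alpha=0.0))   # reduces to the mean
```

Because `h` is an expectation, it is linear in the underlying distribution, which is the property the text relies on to keep exploration cheap; the nonlinearity only enters in the final minimisation over `q`.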
III-B Theoretical foundation
Suppose that we have observed the running states, actions and costs subject to some exploration policy up to time , resulting in a set of data , where . In order to approximate , we employ a parameterized model , where is the parameter, and is designated to approximate , where is the Dirac measure on . In view of (III.3), can be approximated by . We use to denote the estimate of the optimal parameter (if it exists). In view of the bounded cost and positive discount factor, we approximate with . Heuristically, we want to update and recursively in the following way
[TABLE]
where we define
[TABLE]
It is well-known that is the MLE of the transition probability (cf. [1]). Note that $\sum_{j\in\mathbb{X}}\hat{T}^{k}_{ij}\big(c_{t}+\gamma\hat{v}(x_{t+1})-q\big)_{+}$ is a convex and $1$-Lipschitz function of $q$ that falls within the range . Therefore, although the objective involves the supremum over an uncountable set, updating is not infeasible. However, such an update requires knowledge of . To circumvent this requirement, we observe that for fixed,
[TABLE]
as a function of , attains the infimum if
[TABLE]
We can then use the following updating scheme as an alternative
[TABLE]
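The empirical transition estimate mentioned above as the MLE is the ratio of observed $(i,k,j)$ transitions to observed $(i,k)$ visits. A minimal sketch of that count-based estimate follows; the function name, the array layout `T_hat[k, i, j]`, and the convention of leaving unvisited state-action rows at zero are assumptions made for illustration.

```python
import numpy as np

def estimate_transitions(xs, acts, n_states, n_actions):
    """Count-based estimate T_hat[k, i, j] = #(i, k, j) / #(i, k).

    xs   : integer states x_1, ..., x_{t_max}
    acts : integer actions a_1, ..., a_{t_max - 1}
    """
    xs = np.asarray(xs)
    acts = np.asarray(acts)
    counts = np.zeros((n_actions, n_states, n_states))
    # Accumulate one count per observed transition (handles repeated indices).
    np.add.at(counts, (acts, xs[:-1], xs[1:]), 1.0)
    visits = counts.sum(axis=2, keepdims=True)
    # Rows of state-action pairs that were never visited stay at zero.
    return np.divide(counts, visits, out=np.zeros_like(counts), where=visits > 0)

T_hat = estimate_transitions(xs=[0, 1, 0, 1, 1], acts=[0, 0, 0, 0], n_states=2, n_actions=2)
print(T_hat[0])
```

Under an exploration policy that visits every state-action pair often enough, each row of the estimate concentrates around the true transition probabilities.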
In order to obtain a convergence result, we make the following technical assumption.
Assumption III.1.
Let , , , be absolute constants. We assume that
- (i) the range of the cost function is contained in ;
- (ii) ;
- (iii) is subject to an exploration policy such that
[TABLE]
for any , where ;
- (iv) regardless of the data and , we always find and such that, for all ,
[TABLE]
and
[TABLE]
Conditions (i) and (ii) follow automatically from the setting above; these conditions are included in the assumption for the sake of easy navigation. Condition (iii) is a version of the parallel sampling model (PSM). PSM was originally introduced in [16] and is commonly used in the reinforcement learning literature as an exploration policy that achieves perfect exploration (cf. [11]). Condition (iv) regards the accuracy of the update. In particular, (III.8) corresponds to the computation of in (III.7). Based on the separability of the objective illustrated in (III.5), the convexity of $\big(y_{ik}-(C(i,k,j)+\gamma\hat{v}(j)-q)_{+}\big)^{2}$ in , and the observed good behavior of $q\mapsto\sum_{j\in\mathbb{X}}\hat{T}^{k}_{ij}\big(c_{t}+\gamma\hat{v}(x_{t+1})-q\big)_{+}$, we consider (III.8) reasonable.
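The role of the squared term can be sketched numerically. For a fixed state-action pair and threshold $q$, a weighted sum of squares of the form $\sum_{j}\hat{T}^{k}_{ij}\big(y-(C(i,k,j)+\gamma\hat{v}(j)-q)_{+}\big)^{2}$ is minimised in $y$ by the weighted mean of the positive parts, provided the weights sum to one. The snippet below checks this elementary fact on made-up numbers; the function names and inputs are hypothetical.

```python
import numpy as np

def excess_targets(T_hat_row, cost_row, v_hat, gamma, q_grid):
    """Weighted mean of (C_j + gamma * v_hat_j - q)_+ over next states j, for each q."""
    z = cost_row + gamma * v_hat
    return np.maximum(z[None, :] - q_grid[:, None], 0.0) @ T_hat_row

def weighted_sq_loss(y, T_hat_row, cost_row, v_hat, gamma, q):
    """Weighted squared deviation of y from the positive parts at threshold q."""
    a = np.maximum(cost_row + gamma * v_hat - q, 0.0)
    return float(T_hat_row @ (y - a) ** 2)

# Made-up inputs for one (state, action) pair with three possible next states.
T_hat_row = np.array([0.2, 0.5, 0.3])
cost_row = np.array([1.0, 0.0, 2.0])
v_hat = np.array([0.5, 1.0, 0.0])
gamma = 0.9
q_grid = np.array([0.0, 1.0])

y_star = excess_targets(T_hat_row, cost_row, v_hat, gamma, q_grid)

# Brute-force check at q = 0: no candidate on a fine grid beats the closed form.
candidates = np.linspace(0.0, 3.0, 3001)
losses = [weighted_sq_loss(y, T_hat_row, cost_row, v_hat, gamma, 0.0) for y in candidates]
best_on_grid = min(losses)
loss_at_closed_form = weighted_sq_loss(y_star[0], T_hat_row, cost_row, v_hat, gamma, 0.0)
```

This is why a least-squares fit on observed transitions can stand in for an update that would otherwise need the transition probabilities explicitly.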
Below is our main result. The proof is deferred to the appendix.
Theorem III.2.
Suppose Assumption III.1 holds. Let . Given data and an arbitrary , we compute according to (III.7), approximately as in Assumption III.1 (iv). Then, for any , there is a probability of at least
[TABLE]
that
[TABLE]
for all .
Sometimes it is advisable to assume that is proportional to . In order to maintain the same level of accuracy (in terms of the probability bound), we need to set . In this case, the effort required for exploration only needs to grow polynomially as increases.
III-C Implementation
In our algorithm, we use a deep neural network for . We let and be a pre-selected grid on . We are given data , and an a priori guess of the value function.
Instead of following strictly (III.7), we consider the updating procedure below
[TABLE]
where is a penalization for ensuring monotonicity on , defined as
[TABLE]
and is the regularization parameter. We use stochastic gradient descent for . After (III.10) is done, we perform
[TABLE]
where we recall that is a finite set of discrete probabilities on , and thus the is in fact a finite sum. We use gradient descent with random initialization for , and random search for . After (III.11) is done, we may return to (III.10) for the next round of updates. In order to obtain an approximated optimal policy , we should record the approximated minimizer of for each .
We summarize the implementation in Algorithm (1). We point out that Algorithm (1) can be integrated asynchronously into a larger implementation that involves running data.
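The exact form of the monotonicity penalization did not survive extraction, so the following is only one plausible sketch of the idea: since the target function is non-increasing in the threshold, any increase of the model output between neighbouring grid points is penalised and added to a squared-error fit with a regularization weight. The function names, the squared form of the penalty, and the toy arrays are assumptions, not the paper's definitions.

```python
import numpy as np

def monotonicity_penalty(pred):
    """pred[..., m] is the model output at grid point q_m, with q_m increasing in m.
    Penalise every increase between neighbouring grid points."""
    increases = np.maximum(np.diff(pred, axis=-1), 0.0)
    return float(np.mean(increases ** 2))

def penalised_loss(pred, target, reg):
    """Squared-error fit plus reg times the monotonicity penalty."""
    fit = float(np.mean((pred - target) ** 2))
    return fit + reg * monotonicity_penalty(pred)

# Toy outputs on a three-point grid (hypothetical).
monotone_pred = np.array([3.0, 2.0, 1.0])
bumpy_pred = np.array([1.0, 2.0, 1.0])
target = np.array([3.0, 2.0, 1.0])

print(penalised_loss(monotone_pred, target, reg=10.0))
print(penalised_loss(bumpy_pred, target, reg=10.0))
```

In an actual training loop the same expression would be written in an automatic-differentiation framework so that stochastic gradient descent can be applied to the network parameters.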
IV Numeric experiments
In this section, we present numerical experiments to validate the performance of Algorithm 1. We consider a state-action space with , and a discount factor of . We use the following for the conditional risk mapping (II.2):
[TABLE]
where denotes the Dirac measure. The transition matrices used in the experiment are randomly generated. Although our algorithm was introduced for deterministic latent costs, it also handles random costs without requiring significant modifications. We test the algorithm with various random costs, such as with depending on the current state, the realized action, and next state. We assume the knowledge of , and set as a uniform partition of with . We set and sample according to some randomly picked stationary policy. We then compute and using Algorithm 1. To ensure accuracy when updating and , we perform a thorough random search. However, we conjecture that there is a certain structure that we can take advantage of in learning and , and the computation cost does not grow exponentially as increases. In Figure 1, we plot the relative errors of for each in 10 different experiments. The benchmark in each experiment is computed using brute force search.
Appendix A Proof of Theorem III.2
We fix and for the remainder of this section. Firstly, we will introduce the contraction property of .
Lemma A.1.
For any , .
Proof.
This is an immediate consequence of [7, Lemma 3.3]. ∎
The proof of Theorem III.2 is also dependent on the following two technical lemmas.
Lemma A.2.
For any , and integer , we have
[TABLE]
Proof.
To start with, note that
[TABLE]
Therefore,
[TABLE]
In order to investigate the first term on the right-hand side of (A), we introduce an auxiliary process. For , we let
[TABLE]
Note that is a sub-martingale under the filtration . Indeed, by the Markov property of , we have
[TABLE]
where we have used Assumption III.1 (iii) in the last equality. Then, by Azuma’s inequality, for ,
[TABLE]
Regarding the second term in (A), we define and
[TABLE]
Note that is a -martingale:
[TABLE]
where we have used the Markov property of in the second line. It follows from Azuma’s inequality that
[TABLE]
Finally, by combining (A), (A) and (A.3), we complete the proof. ∎
Lemma A.3.
Let be the fixed point of defined in (II.4). Let and be introduced as in Assumption III.1 (iv). Then,
[TABLE]
Proof.
To start with, by (III.9) and the fact that ,
[TABLE]
Then, by Assumption III.1 (ii), (III.8) and Lemma A.1,
[TABLE]
The proof is complete. ∎
We are now in a position to prove Theorem III.2.
Proof of Theorem III.2.
We first simplify Lemma A.2 by letting
[TABLE]
where we have used the fact that in the second line. Consequently,
[TABLE]
Finally, under the realization that $\bigg|\frac{\sum_{t=1}^{t_{\max}-1}\mathbbm{1}_{(i,k,j)}(x_{t},a_{t},x_{t+1})}{\sum_{t=1}^{t_{\max}-1}\mathbbm{1}_{(i,k)}(x_{t},a_{t})}-T^{k}_{ij}\bigg|\leq\varepsilon$ for all , invoking (A.3) iteratively, we obtain
[TABLE]
which completes the proof. ∎
References
[1] T. W. Anderson and L. A. Goodman, “Statistical inference about Markov chains,” The Annals of Mathematical Statistics, vol. 28, no. 1, pp. 89–110, 1957.
[2] N. Bäuerle and A. Glauner, “Markov decision processes with iterated coherent risk measures,” European Journal of Operational Research, vol. 296, no. 3, pp. 953–966, 2022.
[3] N. Bäuerle and U. Rieder, “More risk-sensitive Markov decision processes,” Mathematics of Operations Research, vol. 39, no. 1, pp. 105–120, 2013.
[4] M. G. Bellemare, W. Dabney, and R. Munos, “A distributional perspective on reinforcement learning,” Proceedings of Machine Learning Research, vol. 70, pp. 449–458, 2017.
[5] T. R. Bielecki, T. Chen, and I. Cialenco, “Risk-sensitive Markov decision problems under model uncertainty: finite time horizon case,” arXiv:2104.06915, 2021.
[6] V. I. Bogachev, Measure Theory Volume I. Springer-Verlag Berlin Heidelberg, 2007.
[7] Z. Cheng and S. Jaimungal, “Markov decision processes with Kusuoka-type conditional risk mappings,” Preprint, 2022.
[8] S. Chu and Y. Zhang, “Markov decision processes with iterated coherent risk measures,” International Journal of Control, vol. 88, no. 11, pp. 2286–2293, 2014.
