Global Algorithms for Mean-Variance Optimization in Markov Decision Processes
Li Xia, Shuai Ma

L. Xia and S. Ma are both with the School of Business, Sun Yat-Sen University, Guangzhou 510275, China (email: [email protected]).
Abstract
Dynamic optimization of mean and variance in Markov decision processes (MDPs) is a long-standing challenge caused by the failure of dynamic programming. In this paper, we propose a new approach to find the globally optimal policy for combined metrics of steady-state mean and variance in an infinite-horizon undiscounted MDP. By introducing the concepts of pseudo mean and pseudo variance, we convert the original problem to a bilevel MDP problem, where the inner one is a standard MDP optimizing pseudo mean-variance and the outer one is a single parameter selection problem optimizing pseudo mean. We use the sensitivity analysis of MDPs to derive the properties of this bilevel problem. By solving inner standard MDPs for pseudo mean-variance optimization, we can identify worse policy spaces dominated by optimal policies of the pseudo problems. We propose an optimization algorithm which can find the globally optimal policy by repeatedly removing worse policy spaces. The convergence and complexity of the algorithm are studied. Another policy dominance property is also proposed to further improve the algorithm efficiency. Numerical experiments demonstrate the performance and efficiency of our algorithms. To the best of our knowledge, our algorithm is the first that efficiently finds the globally optimal policy of mean-variance optimization in MDPs. These results are also valid for solely minimizing the variance metrics in MDPs.
Keywords: Markov decision process, mean-variance optimization, bilevel MDP, pseudo mean, pseudo variance, global optimum
1 Introduction
Mean-variance optimization is an important model for risk control in financial engineering. It was first proposed by Markowitz (1952) for single-period portfolio management. Extending it to multi-period scenarios is a natural but challenging research topic, because the variance criterion over multiple periods is not additive, which induces time inconsistency and the failure of dynamic programming. This important topic has attracted research attention over the past decades (Dai et al., 2021; Gao and Li, 2013; Hernandez-Lerma et al., 1999; Sobel, 1994, 1982), but it has not been completely solved yet.
Since Markov models are widely used to study multi-period stochastic systems, there is a rich literature on Markov decision processes (MDPs) with variance-related criteria, covering discounted and undiscounted, discrete-time and continuous-time, discrete-state and continuous-state, and finite-horizon and infinite-horizon MDPs. Representative works include Chung (1994); Filar and Lee (1985); Haskell and Jain (2013); Hernandez-Lerma et al. (1999); Sobel (1982, 1994); Guo and Song (2009), to name just a few. Many of these works study the variance minimization of accumulated rewards within a policy set in which the mean performance has already been optimized. In such scenarios, the variance minimization problem can be equivalently converted to another standard MDP with a new cost function (Guo et al., 2012; Huang, 2018; Sobel, 1982; Xia, 2018). These approaches cannot directly optimize variance or mean-variance combined metrics in MDPs when the mean performance is not optimized. Another way to study the mean-variance optimization of MDPs is to reformulate the problems as mathematical programming models and investigate them analytically (Chung, 1994; Haskell and Jain, 2013; Sobel, 1994). Solving these mathematical programs efficiently remains challenging.
Another research stream on multi-period mean-variance optimization takes the perspective of stochastic control. The seminal works by Li and Ng (2000) and Zhou and Li (2000) formulated the mean-variance portfolio selection problem as a linear quadratic (LQ) control problem and used an embedding method to develop an iterative procedure that solves the problem analytically. Numerous works follow this research line (Gao and Li, 2013; Zhou and Yin, 2004; Zhu et al., 2004), and interested readers can refer to a recent survey paper (Cui et al., 2022). However, these works use an LQ model with linear state transitions, which may properly characterize the portfolio selection problem but is much less general than Markov models.
Recently, some works have also studied mean-variance optimization in the regime of reinforcement learning. Although the principle of dynamic programming fails, gradient-based algorithms for parameterized policies (represented by neural networks) still work. Most of these studies focus on improving the sampling efficiency of learning gradient estimators for variance-related metrics (Borkar, 2010; Prashanth and Ghavamzadeh, 2013; Tamar et al., 2012). A recent advance is to reformulate mean-variance optimization with Fenchel duality (Xie et al., 2018) and to adopt gradient-based algorithms to find local optima (Bisi et al., 2020; Zhang et al., 2021). However, all these gradient-based learning algorithms suffer from slow convergence and can become trapped in local optima. Globally solving the mean-variance optimization problem in MDPs is still an open question.
In this paper, we study global algorithms for the mean-variance optimization problem in an infinite-horizon discrete-time undiscounted MDP. The mean and variance of rewards are measured in a steady-state environment, similar to those in the works by Bisi et al. (2020); Chung (1994); Sobel (1994); Xia (2016). By introducing an auxiliary variable called the pseudo mean, we convert the steady-state mean-variance optimization problem to a bilevel MDP problem, where the inner level is a standard MDP optimizing the so-called pseudo mean-variance and the outer level is a single parameter selection problem optimizing the pseudo mean. With the sensitivity analysis of MDPs, we show that the optimal value of the pseudo mean-variance optimization problem is a convex piecewise quadratic function of the pseudo mean and that its global optimum equals the optimum of the mean-variance optimization problem. We further discover policy dominance properties, which help us discard the worse policies dominated by the optimal policy of the pseudo problem. Thus, the optimization complexity can be significantly reduced. Based on these properties, we develop an iterative algorithm that is shown to find the global optimum of the mean-variance optimization problem after a finite number of iterations. The computational complexity and some variants of the algorithm are also studied. Compared with existing work that can only find a local optimum of mean-variance optimization in MDPs (Xia, 2020), our algorithms guarantee global convergence. The performance and efficiency of our algorithms are also demonstrated by numerical experiments. To the best of our knowledge, our work is the first to compute the globally optimal policies of mean-variance optimization in MDPs.
The rest of the paper is organized as follows. In Section 2, we give the MDP formulation for the mean-variance optimization problem. Section 3 presents the main results of this paper, including the policy dominance property and the algorithmic analysis. Numerical experiments are conducted in Section 4 to demonstrate the performance of our algorithms. Finally, we conclude this paper in Section 5.
2 Problem Formulation
Consider an infinite-horizon discrete-time MDP denoted by a tuple , where is the state space, is the action space, is the state transition probability kernel with element where represents a mapping to the distribution on , and is the reward function with element , , . When the system is in state and action is adopted, it will transit to the next state with probability and a reward is incurred. Since deterministic policies can attain optimal mean and variance in MDPs (Haskell and Jain, 2013; Xia, 2020), we only consider stationary deterministic policies , where indicates the action adopted in state . The corresponding policy space is denoted by and we assume that the MDP with any policy is a unichain. When a policy is adopted, the state transition probability matrix is denoted by and its -th element is , . The associated steady-state distribution is denoted by an -dimensional row vector . Obviously, we have
[TABLE]
where is a column vector of 1’s with a proper dimension size. We consider long-run performance metrics of this MDP, which are independent of the initial state at time 0. The long-run average (mean) reward of the MDP under policy is defined as
[TABLE]
where indicates the expectation under policy , is the system state at time , is the action adopted at time , is an -dimensional column vector whose element is , . Similarly, the long-run variance (or steady-state variance) of the MDP under policy is defined as (Xia, 2016, 2020)
[TABLE]
where is the component-wise square of vector , i.e.,
[TABLE]
When the finite Markov chain is a unichain, we can view as a random variable whose value realization set is and distribution is , where is the vector of initial state distribution. We can verify that
[TABLE]
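The steady-state quantities defined in this section can be sketched numerically for one fixed policy. The example below is only an illustration: the 3-state chain `P_u`, the reward vector `r_u`, and the helper names are hypothetical, and power iteration is used as a simple stand-in for solving the balance equations. It computes the steady-state distribution, the long-run mean as the distribution-weighted reward, and the steady-state variance as the distribution-weighted squared deviation from that mean.

```python
def steady_state(P, iters=5000):
    """Approximate the stationary row vector pi with pi P = pi by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def mean_and_variance(P, r):
    """Long-run mean pi r and steady-state variance pi (r - mean)^2 of the rewards."""
    pi = steady_state(P)
    mean = sum(p * x for p, x in zip(pi, r))
    var = sum(p * (x - mean) ** 2 for p, x in zip(pi, r))
    return pi, mean, var


# Hypothetical 3-state chain induced by one fixed deterministic policy.
P_u = [[0.5, 0.3, 0.2],
       [0.2, 0.6, 0.2],
       [0.1, 0.4, 0.5]]
r_u = [1.0, 4.0, -2.0]

pi_u, mu_u, var_u = mean_and_variance(P_u, r_u)

# Second moment of the reward under pi, used to cross-check the variance.
second_moment = sum(p * x * x for p, x in zip(pi_u, r_u))
```

The cross-check `second_moment - mu_u ** 2` should agree with `var_u`, which mirrors viewing the steady-state reward as a random variable distributed according to the steady-state distribution.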
Mean-variance optimization was originally proposed by Markowitz (1952) for portfolio selection, where decision makers usually aim at maximizing the mean return while minimizing the variance risk, which is a multi-objective optimization problem. Usually, the Pareto frontier composed of Pareto efficient solutions is the optimization goal, which is illustrated by Fig. 1. A common way of obtaining Pareto optima is to optimize the combined objective
[TABLE]
where is the tradeoff weight between mean and variance. Therefore, our goal is to solve the mean-variance optimization problem:
[TABLE]
Note that and depend on , and we may also use and if necessary. In Fig. 1, the red star points are Pareto efficient solutions which dominate the black dot solutions. The dashed curve is the Pareto frontier which can be obtained by solving (4) with different . We can also observe that the dashed line is tangent to the Pareto frontier, where the slope is and the tangent point is .
How to efficiently solve (4) is the key to mean-variance optimization in MDPs. Since the variance function depends on both history and future behaviors through the long-run mean, it is neither additive nor Markovian. The mean-variance optimization problem (4) does not fit a standard MDP model, and the principle of dynamic programming fails (Chung, 1994; Sobel, 1994; Xia, 2016). Although recent progress has been made on this problem by using sensitivity-based optimization instead of traditional dynamic programming (Xia, 2020), that approach can only find a local optimum of the mean-variance optimization problem, and a local optimum is not guaranteed to be a Pareto efficient solution. Thus, finding the global optimum of (4) is still an unsolved problem in the mean-variance optimization of MDPs, and we address this challenge in the rest of this paper.
3 Main Results
First, we introduce the concept of pseudo mean and pseudo variance of an MDP under policy (Xia, 2016, 2020):
[TABLE]
where is called the pseudo variance of the MDP with the pseudo mean . We can derive that the difference between the pseudo variance and the real variance is
[TABLE]
We call the variance distortion caused by the pseudo mean . Interestingly, we observe that the pseudo variance is a convex quadratic function of , since . When the pseudo mean equals the real mean , the variance distortion is zero and the pseudo variance attains its minimum which is exactly the real variance, i.e.,
[TABLE]
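The relation between the pseudo variance and the real variance follows from expanding a square. The derivation below is a sketch in assumed notation, chosen here for illustration only: $\pi_u$ is the steady-state distribution under policy $u$, $r_u$ the reward vector, $\mu_u = \pi_u r_u$ the real mean, $\sigma_u^2$ the real variance, $y$ the pseudo mean, $\sigma_u^2(y)$ the pseudo variance, and $(\cdot)^2_{\odot}$ the component-wise square.

```latex
\begin{aligned}
\sigma_u^2(y)
  &= \pi_u \,(r_u - y\mathbf{1})^2_{\odot}
   = \pi_u \,\big( (r_u - \mu_u\mathbf{1}) + (\mu_u - y)\mathbf{1} \big)^2_{\odot} \\
  &= \pi_u \,(r_u - \mu_u\mathbf{1})^2_{\odot}
     + 2(\mu_u - y)\,\pi_u (r_u - \mu_u\mathbf{1})
     + (\mu_u - y)^2\,\pi_u \mathbf{1} \\
  &= \sigma_u^2 + (\mu_u - y)^2 ,
\end{aligned}
```

where the cross term vanishes because $\pi_u (r_u - \mu_u \mathbf{1}) = 0$ and $\pi_u \mathbf{1} = 1$. The pseudo variance is therefore a convex quadratic in $y$ with second derivative $2$, and it is minimized at $y = \mu_u$, where it equals the real variance.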
Remark 1. The above property of variance is analogous to CVaR (Conditional Value at Risk) discovered by Rockafellar and Uryasev (2000): the CVaR of random variable at probability level is defined as , and equals , where is the inverse distribution function of , , is a convex function of and its minimum attains at .
With this property (7), we can convert the original mean-variance optimization problem to a bilevel MDP problem and directly derive the following lemma.
Lemma 1 (Bilevel MDP).
The mean-variance optimization problem (4) is equivalent to a bilevel MDP problem where the inner one is a standard MDP with cost function :
[TABLE]
The above bilevel MDP formulation is similar to the Fenchel duality formulation (Xie et al., 2018), while our formulation (8) naturally comes from (6) of pseudo variance which was originally discovered by Xia (2016). The inner problem aims to optimize the pseudo mean-variance, which is a standard MDP denoted by tuple . We can use traditional dynamic programming to solve this MDP. For different outer variable of pseudo mean , we have to solve different MDP . The number of solving inner MDPs is equal to the number of , which is computationally intractable. Therefore, efficiently solving this bilevel MDP problem (8) is challenging.
With (7), we see that the optimal in (8) satisfies for the optimal policy . Therefore, we can restrict ’s value domain from to a much smaller set , where . Although is still computationally intractable, we know that , where and . Therefore, the bilevel MDP problem (8) can be rewritten as
[TABLE]
Since the inner problem is a standard MDP, we can derive a concise proof about the optimality of deterministic policies (detailed proofs can also be referred to Haskell and Jain (2013); Xia (2020)): Suppose is an optimal solution to (9). It is well known that there exists a deterministic policy which attains the minimum of standard MDP . It is obvious that must be the real mean of the MDP under policy . Thus, , which indicates that the deterministic policy attains the minimum of mean-variance performance.
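The equivalence between the original problem and its bilevel form can be checked by brute force on a toy instance. The sketch below rests on several assumptions that are not taken from the text: the combined objective is assumed to be a weight `BETA` times the variance minus the mean, to be minimized; the transition probabilities `P`, rewards `R`, and all function names are hypothetical; and the inner problem is solved by enumerating all deterministic policies rather than by dynamic programming.

```python
from itertools import product

BETA = 0.5  # assumed tradeoff weight on the variance term

# Hypothetical MDP: P[s][a] is the next-state distribution, R[s][a] the reward.
P = [[[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]],
     [[0.3, 0.5, 0.2], [0.6, 0.2, 0.2]],
     [[0.2, 0.2, 0.6], [0.4, 0.5, 0.1]]]
R = [[1.0, 3.0], [0.0, 2.5], [4.0, -1.0]]


def steady_state(Pu, iters=3000):
    """Power iteration for the stationary distribution of a positive chain."""
    n = len(Pu)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * Pu[i][j] for i in range(n)) for j in range(n)]
    return pi


def evaluate(u):
    """Return (mean, variance) of policy u, where u[s] is the action in state s."""
    pi = steady_state([P[s][u[s]] for s in range(len(P))])
    r = [R[s][u[s]] for s in range(len(P))]
    mu = sum(p * x for p, x in zip(pi, r))
    return mu, sum(p * (x - mu) ** 2 for p, x in zip(pi, r))


stats = {u: evaluate(u) for u in product(range(2), repeat=3)}

# Original problem: minimize BETA * variance - mean directly over policies.
direct = min(BETA * var - mu for mu, var in stats.values())


def inner(y):
    """Inner problem for a fixed pseudo mean y, with the pseudo variance
    written as variance + (mean - y)^2."""
    return min(BETA * (var + (mu - y) ** 2) - mu for mu, var in stats.values())


# Outer problem: the optimal pseudo mean is the mean of some policy,
# so it suffices to try the achievable means.
bilevel = min(inner(mu) for mu, _ in stats.values())
```

On this instance `direct` and `bilevel` coincide, and `inner(y)` never drops below `direct` for any `y`, which is the content of the bilevel reformulation under the stated assumptions.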
When the pseudo mean is fixed, the inner standard MDP is an auxiliary problem, and its long-run average performance under policy is a combined performance . We also call a pseudo mean-variance optimization problem:
[TABLE]
For notation simplicity, sometimes we may omit , and use and if no confusion caused. Therefore, the bilevel MDP (9) for mean-variance optimization can be rewritten as below.
[TABLE]
If we plot a curve of with respect to , we can observe that is the global minimum of this curve at point and the corresponding is the optimal policy of the original problem (4). From the sensitivity analysis of MDPs, we can derive the following lemma.
Lemma 2 (Critical points).
There exists a series of intervals with , in which the optimal policy of remains unvaried as when .
Proof.
We rewrite the standard MDP problem as a linear programming (LP) model:
[TABLE]
The above problem can be represented as a standard LP model:
[TABLE]
where we utilize the fact , the -by- matrix and the -dimensional column vector are determined by the constraint equations in (11), , , and are -dimensional column vector with element and , respectively. We observe that the right-hand-side of (12) is a parametric linear programming (PLP) (Gal and Greenberg, 1997; Tan and Hartman, 2011) with a linear parameter . Below we do sensitivity analysis for this PLP problem. For a given , suppose is the optimal solution of (12) and its associated basis matrix is . We can verify that in this LP is equivalent to the optimal policy of the MDP , where the optimal action in state is , . With the terminology of LP, we denote , , , and . The optimality test of the simplex method requires that all the test coefficients should be nonpositive, i.e.,
[TABLE]
For notation simplicity, we denote the -dimensional test coefficients vector as
[TABLE]
In order to find the interval that any therein makes remain optimal, we only need to solve satisfying . It is easy to verify that the solution is
[TABLE]
Obviously, we can first let , and use (15) and (16) to obtain and , respectively. Other ’s can be computed sequentially, and . The lemma is proved. ∎
We call such ’s in Lemma 2 critical points for the sensitivity analysis of MDP , and is the number of critical points. Actually, by using the specific structures of of the LP for , we can verify that
[TABLE]
which is a generalized fundamental matrix in MDPs (Xia and Glynn, 2016), where the policy corresponds to the vector of feasible basic variables of the basis matrix . The associated vector equals . The matrix has a similar structure of , i.e., , where is an -by- matrix whose element if /otherwise, is an -by- matrix whose element , is an -by- matrix of 1’s. We also observe that is the column vector of the cost function of the MDP under policy (associated with ). We can derive that is equal to the performance potential or relative value function in MDPs (Cao, 2007; Puterman, 1994). Thus, we can verify that in (13) coincides with the Poisson equation in MDPs
[TABLE]
where and are -dimensional column vector of performance potentials for the MDP under policy with cost function and , respectively. Thus, we can rewrite the element of (13) and (14) as
[TABLE]
where we utilize a fact that equals the long-run average performance of the MDP under policy , which can be verified from the Poisson equation. Therefore, we can substitute (18) and (19) into (15) and (16) to compute all the critical points ’s.
Remark 2. Equations (18) and (19) interestingly indicate that the test coefficient in LP coincides with the advantage function which is a key quantity widely used in reinforcement learning (Sutton and Barto, 2018). means that action at state induces a smaller long-run average cost than the current policy in MDPs, which hints that and the corresponding variable should be an entering basic variable in LP.
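The connection between the Poisson equation and an advantage-style optimality test can be sketched as an average-cost policy iteration for the inner MDP with a fixed pseudo mean. This is a minimal illustration, not the LP-based procedure of the proof: the inner cost `BETA * (r - Y) ** 2 - r`, the toy data `P` and `R`, the values of `BETA` and `Y`, and the sign convention (a negative advantage marks an improving action) are all assumptions made here.

```python
BETA, Y = 0.5, 1.5  # assumed variance weight and a fixed pseudo mean

# Hypothetical MDP: P[s][a] is the next-state distribution, R[s][a] the reward.
P = [[[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]],
     [[0.3, 0.5, 0.2], [0.6, 0.2, 0.2]],
     [[0.2, 0.2, 0.6], [0.4, 0.5, 0.1]]]
R = [[1.0, 3.0], [0.0, 2.5], [4.0, -1.0]]
N_S, N_A = 3, 2


def cost(s, a):
    # Assumed inner cost for fixed Y: BETA * (r - Y)^2 - r, minimized on average.
    return BETA * (R[s][a] - Y) ** 2 - R[s][a]


def evaluate(u, iters=3000):
    """Return the average cost g and potentials h with g + h = c + P h."""
    Pu = [P[s][u[s]] for s in range(N_S)]
    c = [cost(s, u[s]) for s in range(N_S)]
    pi = [1.0 / N_S] * N_S
    for _ in range(iters):  # power iteration for the steady-state distribution
        pi = [sum(pi[i] * Pu[i][j] for i in range(N_S)) for j in range(N_S)]
    g = sum(p * x for p, x in zip(pi, c))
    h = [0.0] * N_S
    for _ in range(iters):  # accumulates sum_k P^k (c - g 1), which converges here
        h = [c[s] - g + sum(Pu[s][j] * h[j] for j in range(N_S)) for s in range(N_S)]
    return g, h


def advantage(s, a, g, h):
    # Negative value: taking action a in state s lowers the long-run average cost.
    return cost(s, a) + sum(P[s][a][j] * h[j] for j in range(N_S)) - g - h[s]


def policy_iteration():
    u = [0] * N_S
    while True:
        g, h = evaluate(u)
        improved = list(u)
        for s in range(N_S):
            best = min(range(N_A), key=lambda a: advantage(s, a, g, h))
            if advantage(s, best, g, h) < -1e-9:  # strict improvement only
                improved[s] = best
        if improved == u:
            return u, g
        u = improved


u_star, g_star = policy_iteration()
```

At termination no state-action pair has a negative advantage, which plays the same role as passing the optimality test of the simplex method in the LP view.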
By using (6), we can obtain the relation between the pseudo and real mean-variance combined performances as
[TABLE]
Since the optimal policy of remains the same as for any , we have
[TABLE]
That is, each piece of curves in Fig. 2 is a convex quadratic function of , and the whole curve is the minimum of all these quadratic functions. With (20), it is interesting to note that all the piecewise curves have the same shape (the same term of ) but different positions. At each critical point , we can validate that both and are optimal policies of MDP , so is continuous in . Therefore, we directly derive the following lemma.
Lemma 3.
The pseudo mean-variance performance is a convex piecewise quadratic function and continuous in , and its global minimum is the optimal solution of (4).
From Fig. 2, we can observe that is difficult to solve, because may have multiple local optima which has a zero derivative, i.e.,
[TABLE]
This indicates that a local optimum must satisfy the following fixed point equation
[TABLE]
Note that the fixed point solutions of (21), as indicated by “inverted” triangles in Fig. 2, are not necessarily local optima of . The reason is when a fixed point is also a critical point , we can verify that its left-derivative is 0 (or positive) and its right-derivative is negative (or 0), and such point is not a local optimum of , as illustrated by Fig. 2. We can verify that the policies indicated by all these fixed point solutions of (21) are exactly the so-called local optimal policies in mixed or randomized policy space of MDPs, as discovered by Xia (2020). These two kinds of local optima are different: local optima in this paper are included by local optimal policies (fixed point solutions) defined in Xia (2020), as illustrated by triangles and “inverted” triangles in Fig. 2, respectively.
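The zero-derivative condition and the fixed point equation can be sketched piece by piece. The notation is assumed here for illustration: on the $k$-th interval the inner optimum is a policy $u_k$ with real mean $\mu_{u_k}$ and combined performance $J_{u_k}$, $\beta$ is the weight on the variance, and the pseudo performance is taken to be $J_{u_k} + \beta(\mu_{u_k} - y)^2$.

```latex
J^*(y) = J_{u_k} + \beta\,(\mu_{u_k} - y)^2
\quad\Longrightarrow\quad
\frac{\mathrm{d} J^*(y)}{\mathrm{d} y} = 2\beta\,(y - \mu_{u_k}),
\qquad
\frac{\mathrm{d} J^*(y)}{\mathrm{d} y} = 0 \iff y = \mu_{u_k}.
```

At a critical point where the inner optimum switches from $u_{k-1}$ to $u_k$, the left and right derivatives are $2\beta(y - \mu_{u_{k-1}})$ and $2\beta(y - \mu_{u_k})$ respectively. A point can therefore satisfy the fixed point equation on one side while the curve is still sloping on the other side, in which case it is not a local optimum.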
Note that is optimal only for the pseudo problem (10), not for the original problem (4). Fortunately, we discover that has a better mean-variance performance than some other policies, which is described by the following lemma.
Lemma 4 (Policy dominance).
For any , is an optimal policy of the MDP in (10). If a policy satisfying , then
[TABLE]
Proof.
Since is an optimal policy of the standard MDP problem , we have
[TABLE]
With (6), we derive
[TABLE]
Substituting the above equation into (23), we have
[TABLE]
For any policy satisfying , we have
[TABLE]
Substituting (25) into (24), we directly obtain (22), and the lemma is proved. ∎
Moreover, if satisfies , we have , and the inequality in (22) strictly holds. Therefore, Lemma 4 indicates that dominates all the policies whose means lie in the interval , and these dominated policies can be removed from the policy space to save computation. We illustrate this property by Fig. 3, where we can see that the shadow area can be discarded since the policies therein are always dominated by . Thus, we can utilize Lemma 4 to significantly reduce the complexity of the mean-variance problem (4).
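The dominance argument can be condensed into one chain of inequalities. The notation below is assumed for illustration: $u^*$ is an optimal policy of the inner MDP for pseudo mean $y$, $u$ is any other policy, $J_u$ and $\mu_u$ are the real combined performance and real mean of $u$, and $\beta > 0$ is the weight on the variance, so that the pseudo performance of $u$ is $J_u + \beta(\mu_u - y)^2$.

```latex
\begin{aligned}
J_{u^*} + \beta\,(\mu_{u^*} - y)^2 &\le J_u + \beta\,(\mu_u - y)^2
  && \text{(optimality of } u^* \text{ in the inner MDP)} \\
\Longrightarrow\quad
J_u - J_{u^*} &\ge \beta\,\big[ (\mu_{u^*} - y)^2 - (\mu_u - y)^2 \big] \\
  &\ge 0
  && \text{whenever } |\mu_u - y| \le |\mu_{u^*} - y| .
\end{aligned}
```

In words, every policy whose mean is at least as close to $y$ as the mean of $u^*$ cannot beat $u^*$ on the real objective, and the inequality is strict when its mean is strictly closer.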
With Lemma 4, we can develop an algorithm to iteratively solve the bilevel MDP problem (8), which is described by Algorithm 1. The key idea is to solve a series of auxiliary problems ’s, and repeatedly reduce the auxiliary variable ’s value domain by using Lemma 4. When is shrunk to be empty, the best-so-far solution of ’s is the global optimum of the bilevel MDP problem (8) or (9). The global convergence of Algorithm 1 is guaranteed by the following theorem.
Theorem 1.
Algorithm 1 converges to the global optimum of the mean-variance optimization problem after a finite number of iterations.
Proof.
To prove the convergence of Algorithm 1, we only need to prove that the value domain is reduced to an empty set after a finite number of iterations. From the algorithm procedure, we can observe that for each iteration of solving an auxiliary problem , we will derive a policy and remove a square area with y-axis interval , as stated by Lemma 4. This removed area at least contains the policy , as illustrated by the 1st and 2nd iterations in Fig. 4. Usually, it contains multiple policies dominated by the policy , as illustrated by Fig. 3. If the current policy has already been removed by previous domain shrinking operations, the current domain shrinking operation will remove at least the interval , as illustrated by the 3rd, 4th, and 5th iterations in Fig. 4, which can be verified by the fact of being the median value of and Lemma 4. In summary, each domain shrinking operation will either delete at least one policy (not deleted previously) or delete at least one interval . It is easy to verify that the largest number of intervals is , where each iteration only deletes a very small area around . Therefore, in the worst case, we need iterations to delete every policy and iterations to delete every interval. The algorithm will stop after at most iterations.
Since each dominates all the other policies located in the shrunk area of the -th iteration, the best-so-far solution among all ’s is the global optimum of the original mean-variance optimization problem. This completes the proof. ∎
Fig. 4 gives an illustration of the worst case for an example of a policy space with only 2 solutions, it requires iterations to cover the whole interval . From the proof of Theorem 1, we directly derive the following corollary about the computational complexity of Algorithm 1.
Corollary 1.
The computational complexity of Algorithm 1 is times of solving .
Although the above computational complexity is not attractive, it accounts for the worst case. Numerical experiments in Section 4 demonstrate that the convergence speed of Algorithm 1 is very fast in most cases.
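The domain-shrinking idea can be sketched on a toy instance. This is not a transcription of Algorithm 1: it assumes the objective is `BETA` times the variance minus the mean, picks the midpoint of an uncovered interval as the next pseudo mean, replaces the inner MDP solver with brute-force enumeration, and uses hypothetical data `P`, `R` and hypothetical function names throughout.

```python
from itertools import product

BETA = 0.5  # assumed tradeoff weight on the variance term

# Hypothetical MDP: P[s][a] is the next-state distribution, R[s][a] the reward.
P = [[[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]],
     [[0.3, 0.5, 0.2], [0.6, 0.2, 0.2]],
     [[0.2, 0.2, 0.6], [0.4, 0.5, 0.1]]]
R = [[1.0, 3.0], [0.0, 2.5], [4.0, -1.0]]


def evaluate(u, iters=3000):
    """Return (mean, variance) of the deterministic policy u."""
    n = len(P)
    Pu = [P[s][u[s]] for s in range(n)]
    r = [R[s][u[s]] for s in range(n)]
    pi = [1.0 / n] * n
    for _ in range(iters):  # power iteration for the steady-state distribution
        pi = [sum(pi[i] * Pu[i][j] for i in range(n)) for j in range(n)]
    mu = sum(p * x for p, x in zip(pi, r))
    return mu, sum(p * (x - mu) ** 2 for p, x in zip(pi, r))


STATS = [evaluate(u) for u in product(range(2), repeat=3)]  # (mean, variance)


def solve_inner(y):
    """Brute-force stand-in for the inner MDP: (pseudo value, mean, variance)."""
    return min((BETA * (v + (m - y) ** 2) - m, m, v) for m, v in STATS)


def global_search(lo, hi, eps=1e-12):
    """Shrink the pseudo-mean domain [lo, hi] until nothing is left."""
    todo, best, iterations = [(lo, hi)], float("inf"), 0
    while todo:
        a, b = todo.pop()
        y = 0.5 * (a + b)               # midpoint of an uncovered interval
        _, m, v = solve_inner(y)
        best = min(best, BETA * v - m)  # real objective of the inner optimum
        radius = abs(m - y)             # means within this radius are dominated
        iterations += 1
        for piece in ((a, y - radius), (y + radius, b)):
            if piece[1] - piece[0] > eps:  # keep only non-degenerate leftovers
                todo.append(piece)
    return best, iterations


means = [m for m, _ in STATS]
best, iterations = global_search(min(means), max(means))
```

Each pass either discards the whole interval, when the inner optimum's mean lies outside it, or cuts the interval around a policy mean that was still inside it, which reflects the two cases used in the proof of Theorem 1. In a real implementation the two end points of the initial domain would come from solving standard mean-minimizing and mean-maximizing MDPs rather than from enumeration.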
Remark 3. By changing the update rule of the pseudo mean in Algorithm 1, we can obtain different versions of the algorithm. One example is to set the pseudo mean to the real mean of the optimal policy of the current pseudo problem. Such a revised algorithm is very similar to the policy iteration algorithm for solving the local optimality equation in Xia (2020); both converge to a fixed point solution of (21).
Besides Lemma 4, we may further improve the shrinking efficiency of dominated areas by using other properties. From the viewpoint of bi-objective optimization in Fig. 1, we observe that the minimization of objective is interpreted to find the last solution tangent with the line of slope when the line is moving toward top-left. All the solutions located at the down-right side of this line have a worse objective than that of the solution . This fact is illustrated by Fig. 5.
Therefore, based on an optimal policy by solving the pseudo mean-variance MDP , we directly derive the following lemma about the shrinkage of dominated areas.
Lemma 5.
For any policy , the policies in the following areas are dominated by and can be discarded:
① any policy satisfying ;
② any policy satisfying and .
The area ① in Lemma 5 is similar to the area discarded by Lemma 4: both are square areas with no constraints on variances. Therefore, we can use rule ① in Lemma 5 to speed up the shrinking of the pseudo-mean value domain in Algorithm 1. That is, at line 7 of Algorithm 1, we can add an extra operation to discard the area ① indicated by Lemma 5:
[TABLE]
We call such algorithm revision Algorithm 1-Plus, whose performance is compared in our numerical experiments in Section 4. For example, in Fig. 4, when is relatively small, the Iteration 5 will be saved if we apply (26) for policy at the Iteration 2. This demonstrates that Algorithm 1-Plus is computationally saving compared with Algorithm 1.
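A variance-free discard rule of the kind described for area ① can be motivated by a short bound. This is a sketch under an assumed objective, not a restatement of Lemma 5: suppose the combined performance of a policy $u$ is $J_u = \beta\sigma_u^2 - \mu_u$ with $\beta \ge 0$, to be minimized, and let $u^*$ be any policy already found.

```latex
\sigma_u^2 \ge 0
\;\Longrightarrow\;
J_u = \beta\,\sigma_u^2 - \mu_u \;\ge\; -\mu_u ,
\qquad\text{so}\qquad
\mu_u \le \mu_{u^*} - \beta\,\sigma_{u^*}^2
\;\Longrightarrow\;
J_u \;\ge\; -\mu_u \;\ge\; \beta\,\sigma_{u^*}^2 - \mu_{u^*} = J_{u^*} .
```

Under this assumption every policy whose mean is at most $\mu_{u^*} - \beta\sigma_{u^*}^2$ is dominated regardless of its variance. That threshold is the intercept of the line of slope $\beta$ through the point of $u^*$ in the variance-mean plane, and it moves far down when $\beta$ is large, so little is discarded in that regime.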
Remark 4. It is easy to verify that all the results in this paper can be extended to solely minimizing the steady-state variance of MDPs. One trivial method is to make the tradeoff coefficient in (4) large enough to approximate the variance minimization of MDPs. In fact, if we replace the mean-variance objective in (4) with the variance alone, we can rigorously prove that all the previous results hold for this variance minimization problem. Algorithm 1 also finds the optimal policy that attains the global minimum of the variance in MDPs.
4 Numerical Experiments
In this section, we validate the proposed algorithms with a multi-period inventory control problem, where we consider both the steady-state mean and variance of rewards. This problem is modeled as an infinite-horizon discrete-time undiscounted MDP. The inventory capacity is . At each epoch , the inventory level is reviewed and an order is made. The demands are independent and identically distributed, where is a binomial distribution and is the probability of success. There is no lead time and the next inventory level is determined as . The reward function is , where and are ordering, holding and shortage costs per unit, respectively. By default, we set and . We run algorithms 50 replications for statistical analysis.
Fig. 6 illustrates an example of the convergence process of Algorithm 1, where the interval is covered iteratively and the global optimum is found after only 6 iterations. This demonstrates the efficiency of Algorithm 1, although the policy space is large as .
As a comparison, we also implement the local optimization algorithm proposed by Xia (2020). Considering that the mean-variance optimization of this problem usually has multiple local optima, we illustrate the performance comparison of these two algorithms in Fig. 7, where different problem sizes are used. We can see that our global algorithm has much better performance and the local algorithm by Xia (2020) may converge to different local optima shown by the whiskers of standard deviations.
Fig. 8 shows the curves of optimal pseudo mean-variance with respect to the pseudo mean . For capacity , the global optimum is and the other two local optima are 5.376 and 6.382, which coincide with the left pair of bars in Fig. 7. The pseudo mean corresponding to is , which also equals the mean of the star point in the last subfigure of Fig. 6. All these demonstrate that our Algorithm 1 truly finds the global optimum and the local algorithm by Xia (2020) randomly converges to different local optima. Moreover, when the capacity increases, the curve of has more local optima and the local algorithm is more possibly trapped in a worse local optimum. This also explains the big performance gaps in Fig. 7 when the capacity is large.
Furthermore, we study the effect of the risk coefficient on the curve of the optimal pseudo mean-variance, as illustrated in Fig. 9. We observe that the problem complexity increases with the risk coefficient. When the coefficient is small, the curve has only a single local optimum, which indicates that the problem is easy to solve. This is because a mean-variance problem with a small risk coefficient is approximately equivalent to optimizing only the mean performance, which is a standard MDP that is easy to solve. Conversely, when the coefficient is large, the curve has multiple local optima and the associated optimization problem is difficult to solve.
Finally, we study the effect of Lemma 5 on the algorithm efficiency. We compare the performance difference between Algorithm 1 and Algorithm 1-Plus under different capacities and ’s. We observe that Algorithm 1-Plus can achieve a significant efficiency improvement when the problem size (capacity) is large, as shown in Fig. 10(a). When is changed, there are three cases as shown in Fig. 10(b):
1. When is relatively small (), the variance is trivial, and the mean-variance optimization is approximately equivalent to a mean optimization problem, which is a standard MDP. The problem is relatively easy, and these two algorithms have similar efficiency;
2. When is relatively large (), Lemma 5 may rarely remove areas with means smaller than , which can be illustrated by the intercept in Fig. 5 when the line slope is large. Thus, these two algorithms also have similar efficiency in this case;
3. In other cases, Lemma 5 significantly improves the algorithm convergence speed, and Algorithm 1-Plus is considerably more efficient than Algorithm 1.
5 Discussion and Conclusion
This paper proposes global algorithms for solving multi-period mean-variance optimization in the framework of MDPs, a long-standing challenge caused by the failure of dynamic programming. We convert this problem to a bilevel MDP formulation, where the inner optimization is a standard MDP for pseudo mean-variance optimization and the outer one is a single parameter selection problem optimizing the pseudo mean. Interestingly, the optimal value of the inner problem is a convex piecewise quadratic function of the pseudo mean. Using the squared-form difference between the pseudo variance and the real variance, we discover policy dominance properties that help remove worse policy spaces iteratively. The global optimum can be found by repeatedly removing these dominated policy spaces. The convergence and efficiency of our algorithms are studied both theoretically and experimentally.
Our work demonstrates a promising approach to globally optimize the steady-state mean-variance metrics in undiscounted MDPs. It is meaningful to further extend our approach to mean-variance optimization of discounted MDPs. Another interesting topic is to develop reinforcement learning algorithms based on our global optimization approach, which can make our approach implementable in a data-driven environment.
- Bisi, L., Sabbioni, L., Vittori, E., Papini, M., and Restelli, M. (2020). Risk-averse trust region optimization for reward-volatility reduction. Proceedings of the 29th International Joint Conference on Artificial Intelligence (IJCAI'2020), Special Track on AI in FinTech, 4583-4589.
- Borkar, V. (2010). Learning algorithms for risk-sensitive control. Proceedings of the 19th International Symposium on Mathematical Theory of Networks and Systems (MTNS'2010), July 5-9, 2010, Budapest, Hungary, 1327-1332.
- Cao, X. R. (2007). Stochastic Learning and Optimization: A Sensitivity-Based Approach. New York: Springer.
- Chung, K. J. (1994). Mean-variance tradeoffs in an undiscounted MDP: The unichain case. Operations Research 42, 184-188.
- Cui, X. Y., Gao, J. J., Li, X., and Shi, Y. (2022). Survey on multi-period mean-variance portfolio selection model. Journal of the Operations Research Society of China 10, 599-622.
- Dai, M., Jin, H., Kou, S., and Xu, Y. (2021). A dynamic mean-variance analysis for log returns. Management Science 67(2), 1093-1108.
- Filar, J. A. and Lee, H. M. (1985). Gain/variability tradeoffs in undiscounted Markov decision processes. Proceedings of the 24th IEEE Conference on Decision and Control (CDC'1985), 1106-1112.
- Gal, T. and Greenberg, H. J. (1997). Advances in Sensitivity Analysis and Parametric Programming. Kluwer, Dordrecht.
