Global Algorithms for Mean-Variance Optimization in Markov Decision Processes

Li Xia; Shuai Ma

arXiv:2302.13710·math.OC·February 28, 2023

TL;DR

This paper introduces a novel global algorithm for mean-variance optimization in Markov decision processes, transforming the problem into a bilevel MDP and efficiently finding the optimal policy with proven convergence.

Contribution

It proposes a new bilevel MDP approach using pseudo mean and variance, enabling the first efficient global optimization algorithm for mean-variance in MDPs.

Findings

1. Algorithm converges to the global optimum.
2. Numerical experiments show high efficiency.
3. Applicable to variance minimization as well.

Abstract

Dynamic optimization of mean and variance in Markov decision processes (MDPs) is a long-standing challenge caused by the failure of dynamic programming. In this paper, we propose a new approach to find the globally optimal policy for combined metrics of steady-state mean and variance in an infinite-horizon undiscounted MDP. By introducing the concepts of pseudo mean and pseudo variance, we convert the original problem to a bilevel MDP problem, where the inner one is a standard MDP optimizing pseudo mean-variance and the outer one is a single parameter selection problem optimizing pseudo mean. We use the sensitivity analysis of MDPs to derive the properties of this bilevel problem. By solving inner standard MDPs for pseudo mean-variance optimization, we can identify worse policy spaces dominated by optimal policies of the pseudo problems. We propose an optimization algorithm which can…

Equations

$$\bm{\pi}^{d}\bm{P}^{d}=\bm{\pi}^{d},\qquad\bm{\pi}^{d}\bm{e}=1,\qquad\bm{P}^{d}\bm{e}=\bm{e},$$

$$\mu^{d}:=\lim_{T\to\infty}\mathbb{E}^{d}\left\{\frac{1}{T}\sum_{t=0}^{T-1}r(X_{t},A_{t})\right\}=\bm{\pi}^{d}\bm{r}^{d},$$

$$\sigma^{d}:=\lim_{T\to\infty}\mathbb{E}^{d}\left\{\frac{1}{T}\sum_{t=0}^{T-1}[r(X_{t},A_{t})-\mu^{d}]^{2}\right\}=\bm{\pi}^{d}(\bm{r}^{d}-\mu^{d}\bm{e})_{\odot}^{2},$$

$$(\bm{r}^{d}-\mu^{d}\bm{e})_{\odot}^{2}:=\left((r(1,d(1))-\mu^{d})^{2},(r(2,d(2))-\mu^{d})^{2},\dots,(r(S,d(S))-\mu^{d})^{2}\right)^{\mathrm{T}}.$$

$$\eta^{d}:=\beta\sigma^{d}-\mu^{d},$$

$$\mbox{(P0):}\qquad\begin{array}{l}\eta^{*}=\min\limits_{d\in\mathcal{D}}\{\beta\sigma^{d}-\mu^{d}\},\\ d^{*}\in\operatorname*{argmin}\limits_{d\in\mathcal{D}}\{\beta\sigma^{d}-\mu^{d}\}.\end{array}$$

$$\tilde{\sigma}^{d}(y)=\bm{\pi}^{d}(\bm{r}^{d}-y\bm{e})_{\odot}^{2},\qquad y\in\mathbb{R},$$

$$\sigma^{d}=\min_{y\in\mathbb{R}}\tilde{\sigma}^{d}(y)=\tilde{\sigma}^{d}(y^{*})\Big|_{y^{*}=\mu^{d}}.$$

$$\eta^{*}=\min_{d\in\mathcal{D}}\{\beta\sigma^{d}-\mu^{d}\}=\min_{y\in\mathbb{R}}\min_{d\in\mathcal{D}}\{\beta\tilde{\sigma}^{d}(y)-\mu^{d}\}.$$

$$\eta^{*}=\min_{d\in\mathcal{D}}\{\beta\sigma^{d}-\mu^{d}\}=\min_{y\in[\underline{r},\overline{r}]}\min_{d\in\mathcal{D}}\{\beta\tilde{\sigma}^{d}(y)-\mu^{d}\}.$$

$$(\mathcal{M}(y)):\qquad\begin{array}{l}\tilde{\eta}^{*}(y)=\min\limits_{d\in\mathcal{D}}\{\beta\tilde{\sigma}^{d}(y)-\mu^{d}\},\\ \tilde{d}^{*}(y)\in\operatorname*{argmin}\limits_{d\in\mathcal{D}}\{\beta\tilde{\sigma}^{d}(y)-\mu^{d}\}.\end{array}$$

$$\eta^{*}=\min_{y\in[\underline{r},\overline{r}]}\{\tilde{\eta}^{*}(y)\}.$$

$$\begin{array}{rcl}\tilde{\eta}^{*}(y)&=&\min\limits_{\bm{x}}\left\{\sum\limits_{i\in\mathcal{S}}\sum\limits_{a\in\mathcal{A}}[\beta(r(i,a)-y)^{2}-r(i,a)]x(i,a)\right\}\\ &\mbox{s.t.}&\sum\limits_{a\in\mathcal{A}}x(i,a)=\sum\limits_{j\in\mathcal{S}}\sum\limits_{a\in\mathcal{A}}p(i|j,a)x(j,a),\qquad\forall i\in\mathcal{S},\\ &&\sum\limits_{i\in\mathcal{S}}\sum\limits_{a\in\mathcal{A}}x(i,a)=1,\\ &&x(i,a)\geq 0,\qquad\forall i\in\mathcal{S},a\in\mathcal{A}.\end{array}$$

$$\tilde{\eta}^{*}(y)-\beta y^{2}=\min_{\bm{x}}\left\{(\bm{c}+y\bm{c}^{\prime})^{\mathrm{T}}\bm{x}\mid\bm{A}\bm{x}=\bm{b},\ \bm{x}\geq\bm{0}\right\},$$

$$(\bm{c}_{B}+y\bm{c}^{\prime}_{B})^{\mathrm{T}}\bm{B}^{-1}\bm{A}-(\bm{c}+y\bm{c}^{\prime})^{\mathrm{T}}=(\bm{c}_{B}^{\mathrm{T}}\bm{B}^{-1}\bm{A}-\bm{c}^{\mathrm{T}})+y({\bm{c}^{\prime}_{B}}^{\mathrm{T}}\bm{B}^{-1}\bm{A}-{\bm{c}^{\prime}}^{\mathrm{T}})\leq\bm{0}.$$

$$y^{c}_{k-1}=\max_{i\in\mathcal{S},a\in\mathcal{A}}\left\{-\frac{\zeta(i,a)}{\zeta^{\prime}(i,a)}\,\Bigg|\,\zeta^{\prime}(i,a)<0\right\}\qquad(\max\{\varnothing\}=-\infty),$$

$$y^{c}_{k}=\min_{i\in\mathcal{S},a\in\mathcal{A}}\left\{-\frac{\zeta(i,a)}{\zeta^{\prime}(i,a)}\,\Bigg|\,\zeta^{\prime}(i,a)>0\right\}\qquad(\min\{\varnothing\}=+\infty).$$

$$[\bm{B}^{-1}]^{\mathrm{T}}=(\bm{I}-\bm{P}^{d}+\bm{e}\bm{e}^{\mathrm{T}})^{-1},$$

$$\tilde{\eta}^{d}(y)=\beta[\sigma^{d}+(y-\mu^{d})^{2}]-\mu^{d}=\eta^{d}+\beta(y-\mu^{d})^{2},\qquad\forall d\in\mathcal{D},\ y\in\mathbb{R}.$$

$$\begin{array}{rcll}\tilde{\eta}^{*}(y)&=&\eta^{\tilde{d}^{*}_{k}}+\beta(y-\mu^{\tilde{d}^{*}_{k}})^{2},&\forall y\in[y^{c}_{k-1},y^{c}_{k}],\\ \tilde{\eta}^{*}(y)&=&\min\limits_{k=1,\dots,K}\left\{\eta^{\tilde{d}^{*}_{k}}+\beta(y-\mu^{\tilde{d}^{*}_{k}})^{2}\right\},&\forall y\in\mathbb{R}.\end{array}$$

$$\frac{\partial\tilde{\eta}^{*}(y)}{\partial y}\Big|_{\hat{y}^{*}}=2\beta(\hat{y}^{*}-\mu^{\tilde{d}^{*}_{k}})=0,\qquad\mbox{if }\hat{y}^{*}\in[y^{c}_{k-1},y^{c}_{k}].$$

$$y=\mu^{\tilde{d}^{*}(y)}.$$

$$\beta\sigma^{\tilde{d}^{*}}-\mu^{\tilde{d}^{*}}\leq\beta\sigma^{d}-\mu^{d}.$$

$$\beta\tilde{\sigma}^{\tilde{d}^{*}}(y)-\mu^{\tilde{d}^{*}}\leq\beta\tilde{\sigma}^{d}(y)-\mu^{d},\qquad\forall d\in\mathcal{D}.$$

Taxonomy

Topics: Reinforcement Learning in Robotics · Electric Vehicles and Infrastructure · Optimization and Search Problems

Full text

Global Algorithms for Mean-Variance Optimization in Markov Decision Processes

Li Xia, Shuai Ma

L. Xia and S. Ma are both with the School of Business, Sun Yat-Sen University, Guangzhou 510275, China. (email: [email protected]).

Abstract

Dynamic optimization of mean and variance in Markov decision processes (MDPs) is a long-standing challenge caused by the failure of dynamic programming. In this paper, we propose a new approach to find the globally optimal policy for combined metrics of steady-state mean and variance in an infinite-horizon undiscounted MDP. By introducing the concepts of pseudo mean and pseudo variance, we convert the original problem to a bilevel MDP problem, where the inner one is a standard MDP optimizing pseudo mean-variance and the outer one is a single parameter selection problem optimizing pseudo mean. We use the sensitivity analysis of MDPs to derive the properties of this bilevel problem. By solving inner standard MDPs for pseudo mean-variance optimization, we can identify worse policy spaces dominated by optimal policies of the pseudo problems. We propose an optimization algorithm which can find the globally optimal policy by repeatedly removing worse policy spaces. The convergence and complexity of the algorithm are studied. Another policy dominance property is also proposed to further improve the algorithm efficiency. Numerical experiments demonstrate the performance and efficiency of our algorithms. To the best of our knowledge, our algorithm is the first that efficiently finds the globally optimal policy of mean-variance optimization in MDPs. These results are also valid for solely minimizing the variance metrics in MDPs.

Keywords: Markov decision process, mean-variance optimization, bilevel MDP, pseudo mean, pseudo variance, global optimum

1 Introduction

Mean-variance optimization is an important model for risk control in financial engineering, first proposed by Markowitz (1952) for single-period portfolio management. Extending it to multi-period scenarios is a natural but challenging research topic, because the variance criterion is not additive across periods, which induces time inconsistency and the failure of dynamic programming. This topic has attracted research attention over the past decades (Dai et al., 2021; Gao and Li, 2013; Hernandez-Lerma et al., 1999; Sobel, 1994, 1982), but it has not been completely solved yet.

Since Markov models are widely used to study multi-period stochastic systems, there is a rich literature on Markov decision processes (MDPs) with variance-related criteria, covering discounted and undiscounted, discrete-time and continuous-time, discrete-state and continuous-state, and finite-horizon and infinite-horizon MDPs. Representative works include Chung (1994); Filar and Lee (1985); Haskell and Jain (2013); Hernandez-Lerma et al. (1999); Sobel (1982, 1994); Guo and Song (2009), just to name a few. Many of these works study the variance minimization of accumulated rewards within a policy set in which the mean performance has already been optimized. In such scenarios, the variance minimization problem can be equivalently converted to another standard MDP with a new cost function (Guo et al., 2012; Huang, 2018; Sobel, 1982; Xia, 2018). These approaches cannot directly optimize variance or mean-variance combined metrics in MDPs when the mean performance is not optimized. Another method to study the mean-variance optimization of MDPs is to reformulate these problems as mathematical programming models and to analyze those models further (Chung, 1994; Haskell and Jain, 2013; Sobel, 1994). Efficiently solving these mathematical programs remains challenging.

Another research stream on multi-period mean-variance optimization is from the perspective of stochastic control. The seminal works by Li and Ng (2000) and Zhou and Li (2000) formulated the mean-variance portfolio selection problem as a linear quadratic (LQ) control problem and used an embedding method to develop an iterative procedure to analytically solve this problem. Numerous works follow this research line (Gao and Li, 2013; Zhou and Yin, 2004; Zhu et al., 2004), and interested readers can refer to a recent survey paper (Cui et al., 2022). However, these works use an LQ model with linear state transitions, which may properly characterize the portfolio selection problem but is much less general than Markov models.

Recently, some works have also studied mean-variance optimization in the regime of reinforcement learning. Although the principle of dynamic programming fails, gradient-based algorithms for parameterized policies (represented by neural networks) still work. Most of these studies focus on improving the sampling efficiency of learning the gradient estimators for variance-related metrics (Borkar, 2010; Prashanth and Ghavamzadeh, 2013; Tamar et al., 2012). A recent advance is to reformulate mean-variance optimization with Fenchel duality (Xie et al., 2018), and to adopt gradient-based algorithms to find local optima (Bisi et al., 2020; Zhang et al., 2021). However, all these gradient-based learning algorithms suffer from slow convergence and can be trapped in local optima. Globally solving the mean-variance optimization problem in MDPs is still an unanswered question.

In this paper, we study global algorithms for the mean-variance optimization problem in an infinite-horizon discrete-time undiscounted MDP. The mean and variance of rewards are measured in a steady-state environment, similar to those in the works by Bisi et al. (2020); Chung (1994); Sobel (1994); Xia (2016). By introducing an auxiliary variable called pseudo mean $y\in\mathbb{R}$ , we convert the steady-state mean-variance optimization problem to a bilevel MDP problem, where the inner level is a standard MDP $\mathcal{M}(y)$ optimizing the so-called pseudo mean-variance and the outer level is a single parameter selection problem optimizing the pseudo mean $y$ . With the sensitivity analysis of MDPs, we show that the optimal value of the pseudo mean-variance optimization problem $\mathcal{M}(y)$ is a convex piecewise quadratic function with respect to $y$ and its global optimum equals the optimum of the mean-variance optimization problem. We further discover policy dominance properties which help us discard the worse policies dominated by the optimal policy of $\mathcal{M}(y)$ . Thus, the optimization complexity can be significantly reduced. Based on these properties, we develop an iterative algorithm which is shown to find the global optimum of the mean-variance optimization problem after a finite number of iterations. The computational complexity and some variants of the algorithm are also studied. Compared with existing work that can only find a local optimum of mean-variance optimization in MDPs (Xia, 2020), our algorithms guarantee global convergence. The performance and efficiency of our algorithms are also demonstrated by numerical experiments. To the best of our knowledge, our work is the first to compute the globally optimal policies of mean-variance optimization in MDPs.

The rest of the paper is organized as follows. In Section 2, we give the MDP formulation for the mean-variance optimization problem. Section 3 presents the main results of this paper, including the policy dominance property and the algorithmic analysis. Numerical experiments are conducted in Section 4 to demonstrate the performance of our algorithms. Finally, we conclude this paper in Section 5.

2 Problem Formulation

Consider an infinite-horizon discrete-time MDP denoted by a tuple $\mathcal{M}=\langle\mathcal{S},\mathcal{A},\mathcal{P},\bm{r}\rangle$ , where $\mathcal{S}=\{1,2,\dots,S\}$ is the state space, $\mathcal{A}=\{a_{1},a_{2},\dots,a_{A}\}$ is the action space, $\mathcal{P}:\mathcal{S}\times\mathcal{A}\overset{D}{\mapsto}\mathcal{S}$ is the state transition probability kernel with element $p(j|i,a)$ where $\overset{D}{\mapsto}$ represents a mapping to the distribution on $\mathcal{S}$ , and $\bm{r}:\mathcal{S}\times\mathcal{A}\mapsto\mathbb{R}$ is the reward function with element $r(i,a)$ , $i,j\in\mathcal{S}$ , $a\in\mathcal{A}$ . When the system is in state $i$ and action $a$ is adopted, it will transit to the next state $j$ with probability $p(j|i,a)$ and a reward $r(i,a)$ is incurred. Since deterministic policies can attain optimal mean and variance in MDPs (Haskell and Jain, 2013; Xia, 2020), we only consider stationary deterministic policies $d:\mathcal{S}\mapsto\mathcal{A}$ , where $d(i)\in\mathcal{A}$ indicates the action adopted in state $i$ . The corresponding policy space is denoted by $\mathcal{D}$ and we assume that the MDP with any policy $d\in\mathcal{D}$ is a unichain. When a policy $d$ is adopted, the state transition probability matrix is denoted by $\bm{P}^{d}$ and its $(i,j)$ -th element is $p(j|i,d(i))$ , $i,j\in\mathcal{S}$ . The associated steady-state distribution is denoted by an $S$ -dimensional row vector $\bm{\pi}^{d}:=(\pi^{d}(i))_{i\in\mathcal{S}}$ . Obviously, we have

$$\bm{\pi}^{d}\bm{P}^{d}=\bm{\pi}^{d},\qquad\bm{\pi}^{d}\bm{e}=1,\qquad\bm{P}^{d}\bm{e}=\bm{e},$$

where $\bm{e}$ is a column vector of 1’s with a proper dimension size. We consider long-run performance metrics of this MDP, which are independent of the initial state at time 0. The long-run average (mean) reward of the MDP under policy $d$ is defined as

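$$\mu^{d}:=\lim_{T\to\infty}\mathbb{E}^{d}\left\{\frac{1}{T}\sum_{t=0}^{T-1}r(X_{t},A_{t})\right\}=\bm{\pi}^{d}\bm{r}^{d},$$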

where $\mathbb{E}^{d}$ indicates the expectation under policy $d$ , $X_{t}$ is the system state at time $t$ , $A_{t}=d(X_{t})$ is the action adopted at time $t$ , $\bm{r}^{d}$ is an $S$ -dimensional column vector whose element is $r(i,d(i))$ , $i\in\mathcal{S}$ . Similarly, the long-run variance (or steady-state variance) of the MDP under policy $d$ is defined as (Xia, 2016, 2020)

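$$\sigma^{d}:=\lim_{T\to\infty}\mathbb{E}^{d}\left\{\frac{1}{T}\sum_{t=0}^{T-1}[r(X_{t},A_{t})-\mu^{d}]^{2}\right\}=\bm{\pi}^{d}(\bm{r}^{d}-\mu^{d}\bm{e})_{\odot}^{2},$$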

where $(\bm{r}^{d}-\mu^{d}\bm{e})_{\odot}^{2}$ is the component-wise square of vector $(\bm{r}^{d}-\mu^{d}\bm{e})$ , i.e.,

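$$(\bm{r}^{d}-\mu^{d}\bm{e})_{\odot}^{2}:=\left((r(1,d(1))-\mu^{d})^{2},(r(2,d(2))-\mu^{d})^{2},\dots,(r(S,d(S))-\mu^{d})^{2}\right)^{\mathrm{T}}.$$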

When the finite Markov chain is a unichain, we can view $r(X_{t},A_{t})$ as a random variable whose value realization set is $\{r(i,d(i)):i\in\mathcal{S}\}$ and distribution is $(\bm{P}^{d})^{t}\bm{\nu}$ , where $\bm{\nu}$ is the vector of initial state distribution. We can verify that

[TABLE]
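As a small numerical illustration of the steady-state mean $\mu^{d}=\bm{\pi}^{d}\bm{r}^{d}$ and variance $\sigma^{d}=\bm{\pi}^{d}(\bm{r}^{d}-\mu^{d}\bm{e})_{\odot}^{2}$ defined above, the following sketch evaluates both for a fixed policy. The two-state chain, its rewards, and the function names are hypothetical and chosen only for illustration; the stationary distribution is approximated by power iteration, which assumes the chain is aperiodic in addition to being a unichain.

```python
def stationary_distribution(P, iters=1000):
    """Approximate the row vector pi with pi P = pi by power iteration.

    Assumption: the chain is an aperiodic unichain, so the iteration converges.
    """
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def steady_state_mean_variance(P, r):
    """Return (mu, sigma) = (pi r, pi (r - mu e)^2 taken component-wise)."""
    pi = stationary_distribution(P)
    mu = sum(p * x for p, x in zip(pi, r))
    sigma = sum(p * (x - mu) ** 2 for p, x in zip(pi, r))
    return mu, sigma


# Hypothetical transition matrix P^d and reward vector r^d of one fixed policy d.
P_d = [[0.9, 0.1],
       [0.5, 0.5]]
r_d = [1.0, 4.0]
mu_d, sigma_d = steady_state_mean_variance(P_d, r_d)
print(mu_d, sigma_d)
```

For this chain the stationary distribution is $(5/6, 1/6)$, so the printed values are approximately 1.5 and 1.25.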

Mean-variance optimization was originally proposed by Markowitz (1952) for portfolio selection, where decision makers usually aim at maximizing the mean return while minimizing the variance risk, which is a multi-objective optimization problem. Usually, the Pareto frontier composed of Pareto efficient solutions is the optimization goal, which is illustrated by Fig. 1. A common way of obtaining Pareto optima is to optimize the combined objective

$$\eta^{d}:=\beta\sigma^{d}-\mu^{d},$$

where $\beta\geq 0$ is the tradeoff weight between mean and variance. Therefore, our goal is to solve the mean-variance optimization problem:

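$$\mbox{(P0):}\qquad\begin{array}{l}\eta^{*}=\min\limits_{d\in\mathcal{D}}\{\beta\sigma^{d}-\mu^{d}\},\\ d^{*}\in\operatorname*{argmin}\limits_{d\in\mathcal{D}}\{\beta\sigma^{d}-\mu^{d}\}.\end{array}\tag{4}$$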

Note that $\eta^{*}$ and $d^{*}$ depend on $\beta$ , and we may also use $\eta^{*}(\beta)$ and $d^{*}(\beta)$ if necessary. In Fig. 1, the red star points are Pareto efficient solutions which dominate the black dot solutions. The dashed curve is the Pareto frontier which can be obtained by solving (4) with different $\beta\geq 0$ . We can also observe that the dashed line is tangent to the Pareto frontier, where the slope is $\beta$ and the tangent point is $(\sigma^{d^{*}(\beta)},\mu^{d^{*}(\beta)})$ .

How to efficiently solve (4) is the key to mean-variance optimization in MDPs. Since the variance function $(r(i,d(i))-\mu^{d})^{2}$ depends on history and future behaviors through $\mu^{d}$ , it is neither additive nor Markovian. The mean-variance optimization problem (4) does not fit a standard model of MDPs and the principle of dynamic programming fails (Chung, 1994; Sobel, 1994; Xia, 2016). Although there is recent progress on this problem using the technique of sensitivity-based optimization instead of traditional dynamic programming (Xia, 2020), that approach can only find a local optimum of this mean-variance optimization problem. A local optimum is not guaranteed to be a Pareto efficient solution. Thus, finding the global optimum of (4) is still an unsolved problem in the mean-variance optimization of MDPs, and we address this challenge in the rest of this paper.

3 Main Results

First, we introduce the concept of pseudo mean and pseudo variance of an MDP under policy $d\in\mathcal{D}$ (Xia, 2016, 2020):

$$\tilde{\sigma}^{d}(y)=\bm{\pi}^{d}(\bm{r}^{d}-y\bm{e})_{\odot}^{2},\qquad y\in\mathbb{R},\tag{5}$$

where $\tilde{\sigma}^{d}(y)$ is called the pseudo variance of the MDP with the pseudo mean $y$ . We can derive that the difference between the pseudo variance and the real variance is

$$\Delta^{d}(y):=\tilde{\sigma}^{d}(y)-\sigma^{d}=(y-\mu^{d})^{2}.\tag{6}$$

We call $\Delta^{d}(y)$ the variance distortion caused by the pseudo mean $y$ . Interestingly, we observe that the pseudo variance $\tilde{\sigma}^{d}(y)$ is a convex quadratic function of $y$ , since $\tilde{\sigma}^{d}(y)=\sigma^{d}+(y-\mu^{d})^{2}$ . When the pseudo mean $y$ equals the real mean $\mu^{d}$ , the variance distortion is zero and the pseudo variance attains its minimum which is exactly the real variance, i.e.,

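$$\sigma^{d}=\min_{y\in\mathbb{R}}\tilde{\sigma}^{d}(y)=\tilde{\sigma}^{d}(y^{*})\Big|_{y^{*}=\mu^{d}}.\tag{7}$$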
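The identity $\tilde{\sigma}^{d}(y)=\sigma^{d}+(y-\mu^{d})^{2}$ used above can be checked in a few lines. The following is a sketch of one way to derive it from the definitions, using only $\bm{\pi}^{d}\bm{e}=1$ and $\bm{\pi}^{d}\bm{r}^{d}=\mu^{d}$; it is not claimed to be the authors' own derivation.

```latex
\begin{aligned}
\tilde{\sigma}^{d}(y)
 &= \sum_{i\in\mathcal{S}}\pi^{d}(i)\,\bigl[(r(i,d(i))-\mu^{d})+(\mu^{d}-y)\bigr]^{2}\\
 &= \underbrace{\sum_{i\in\mathcal{S}}\pi^{d}(i)\,(r(i,d(i))-\mu^{d})^{2}}_{=\,\sigma^{d}}
  \;+\; 2(\mu^{d}-y)\underbrace{\sum_{i\in\mathcal{S}}\pi^{d}(i)\,(r(i,d(i))-\mu^{d})}_{=\,0}
  \;+\; (\mu^{d}-y)^{2}\underbrace{\sum_{i\in\mathcal{S}}\pi^{d}(i)}_{=\,1}\\
 &= \sigma^{d}+(y-\mu^{d})^{2}.
\end{aligned}
```

The cross term vanishes because the stationary average of $r(i,d(i))$ is exactly $\mu^{d}$, which is why the distortion is a pure quadratic in $y$ with its minimum at $y=\mu^{d}$.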

Remark 1. The above property of variance is analogous to CVaR (Conditional Value at Risk) discovered by Rockafellar and Uryasev (2000): the CVaR of random variable $X$ at probability level $\alpha$ is defined as $\mathrm{CVaR}_{\alpha}(X):=\mathbb{E}[X|X\geq F^{-1}_{X}(\alpha)]$ , and equals $\min\limits_{y\in\mathbb{R}}\mathbb{E}[y+\frac{1}{1-\alpha}[X-y]^{+}]$ , where $F^{-1}_{X}(\cdot)$ is the inverse distribution function of $X$ , $[X-y]^{+}:=\max\{0,X-y\}$ , $\mathbb{E}[y+\frac{1}{1-\alpha}[X-y]^{+}]$ is a convex function of $y$ and its minimum attains at $y^{*}=F^{-1}_{X}(\alpha)$ .

With this property (7), we can convert the original mean-variance optimization problem to a bilevel MDP problem and directly derive the following lemma.

Lemma 1 (Bilevel MDP).

The mean-variance optimization problem (4) is equivalent to a bilevel MDP problem where the inner one is a standard MDP with cost function $\beta(\bm{r}-y\bm{e})^{2}_{\odot}-\bm{r}$ :

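$$\eta^{*}=\min_{d\in\mathcal{D}}\{\beta\sigma^{d}-\mu^{d}\}=\min_{y\in\mathbb{R}}\min_{d\in\mathcal{D}}\{\beta\tilde{\sigma}^{d}(y)-\mu^{d}\}.\tag{8}$$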

The above bilevel MDP formulation is similar to the Fenchel duality formulation (Xie et al., 2018), while our formulation (8) naturally comes from (6) of pseudo variance which was originally discovered by Xia (2016). The inner problem $\min\limits_{d\in\mathcal{D}}\{\beta\tilde{\sigma}^{d}(y)-\mu^{d}\}$ aims to optimize the pseudo mean-variance, which is a standard MDP denoted by tuple $\mathcal{M}(y):=\langle\mathcal{S},\mathcal{A},\mathcal{P},\beta(\bm{r}-y\bm{e})^{2}_{\odot}-\bm{r}\rangle$ . We can use traditional dynamic programming to solve this MDP. For each value of the pseudo mean $y$ , however, we have to solve a different MDP $\mathcal{M}(y)$ . Since $y$ ranges over $\mathbb{R}$ , there are infinitely many inner MDPs, and solving all of them is computationally intractable. Therefore, efficiently solving this bilevel MDP problem (8) is challenging.

With (7), we see that the optimal $y^{*}$ in (8) satisfies $y^{*}=\mu^{d^{*}}$ for the optimal policy $d^{*}\in\mathcal{D}$ . Therefore, we can restrict $y$ ’s value domain from $y\in\mathbb{R}$ to a much smaller set $y\in\mathcal{Y}$ , where $\mathcal{Y}:=\{\mu^{d}:\forall d\in\mathcal{D}\}$ . Although $\mathcal{Y}$ is still computationally intractable, we know that $\mathcal{Y}\subset[\underline{r},\overline{r}]$ , where $\underline{r}:=\min\limits_{i\in\mathcal{S},a\in\mathcal{A}}\{r(i,a)\}$ and $\overline{r}:=\max\limits_{i\in\mathcal{S},a\in\mathcal{A}}\{r(i,a)\}$ . Therefore, the bilevel MDP problem (8) can be rewritten as

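$$\eta^{*}=\min_{d\in\mathcal{D}}\{\beta\sigma^{d}-\mu^{d}\}=\min_{y\in[\underline{r},\overline{r}]}\min_{d\in\mathcal{D}}\{\beta\tilde{\sigma}^{d}(y)-\mu^{d}\}.\tag{9}$$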

Since the inner problem $\mathcal{M}(y^{*})$ is a standard MDP, we can derive a concise proof of the optimality of deterministic policies (detailed proofs can be found in Haskell and Jain (2013); Xia (2020)): Suppose $(y^{*},d^{*})$ is an optimal solution to (9). It is well known that there exists a deterministic policy $d_{0}$ which attains the minimum of the standard MDP $\min_{d}\{\beta\tilde{\sigma}^{d}(y^{*})-\mu^{d}\}$ . It is obvious that $y^{*}$ must be the real mean of the MDP under policy $d_{0}$ . Thus, $\beta\tilde{\sigma}^{d_{0}}(y^{*})-\mu^{d_{0}}=\beta\sigma^{d_{0}}-\mu^{d_{0}}$ , which indicates that the deterministic policy $d_{0}$ attains the minimum of mean-variance performance.

When the pseudo mean $y$ is fixed, the inner standard MDP $\mathcal{M}(y)$ is an auxiliary problem, and its long-run average performance under policy $d$ is a combined performance $\tilde{\eta}^{d}(y)=\beta\tilde{\sigma}^{d}(y)-\mu^{d}$ . We also call $\mathcal{M}(y)$ a pseudo mean-variance optimization problem:

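$$(\mathcal{M}(y)):\qquad\begin{array}{l}\tilde{\eta}^{*}(y)=\min\limits_{d\in\mathcal{D}}\{\beta\tilde{\sigma}^{d}(y)-\mu^{d}\},\\ \tilde{d}^{*}(y)\in\operatorname*{argmin}\limits_{d\in\mathcal{D}}\{\beta\tilde{\sigma}^{d}(y)-\mu^{d}\}.\end{array}\tag{10}$$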

For notational simplicity, we sometimes omit $y$ and write $\tilde{\eta}^{*}$ and $\tilde{d}^{*}$ if no confusion is caused. Therefore, the bilevel MDP (9) for mean-variance optimization can be rewritten as below.

$$\eta^{*}=\min_{y\in[\underline{r},\overline{r}]}\{\tilde{\eta}^{*}(y)\}.$$
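The bilevel reformulation can be sanity-checked by brute force on a tiny instance. The sketch below enumerates all deterministic policies of a hypothetical 2-state, 2-action MDP (all numbers are illustrative assumptions, not taken from the paper), computes the pseudo mean-variance directly from its definition, and compares the direct optimum with the outer minimum of the inner optima. Enumeration replaces dynamic programming here purely for brevity.

```python
from itertools import product

# Hypothetical MDP; P[i][a][j] = p(j | i, a), R[i][a] = r(i, a). Illustrative only.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.3, 0.7]]]
R = [[1.0, 3.0],
     [4.0, 2.0]]
BETA = 1.0


def stationary(Pd, iters=2000):
    """Power iteration for pi P = pi (assumes an aperiodic unichain)."""
    n = len(Pd)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * Pd[i][j] for i in range(n)) for j in range(n)]
    return pi


def policy_data(d):
    """Stationary distribution and reward vector under deterministic policy d."""
    Pd = [P[i][d[i]] for i in range(len(P))]
    rd = [R[i][d[i]] for i in range(len(P))]
    return stationary(Pd), rd


POLICIES = list(product(range(2), repeat=2))
DATA = {d: policy_data(d) for d in POLICIES}


def mean(d):
    pi, rd = DATA[d]
    return sum(p * x for p, x in zip(pi, rd))


def pseudo_eta(d, y):
    """beta * (pseudo variance at y) - mean, evaluated from the definition."""
    pi, rd = DATA[d]
    pseudo_var = sum(p * (x - y) ** 2 for p, x in zip(pi, rd))
    return BETA * pseudo_var - mean(d)


def eta(d):
    # At y = mu^d the pseudo variance equals the real variance.
    return pseudo_eta(d, mean(d))


def inner_optimum(y):
    """Optimal value of the inner problem M(y), by enumeration."""
    return min(pseudo_eta(d, y) for d in POLICIES)


eta_star = min(eta(d) for d in POLICIES)
# Outer minimization restricted to the candidate set {mu^d : d in D}.
outer_over_means = min(inner_optimum(mean(d)) for d in POLICIES)
# Outer minimization on a grid over [r_min, r_max] = [1, 4].
grid = [1.0 + 3.0 * k / 300 for k in range(301)]
outer_over_grid = min(inner_optimum(y) for y in grid)
print(eta_star, outer_over_means, outer_over_grid)
```

Restricting $y$ to the policy means recovers $\eta^{*}$ exactly, while a generic grid can only approach it from above unless it happens to contain $\mu^{d^{*}}$.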

If we plot a curve of $\tilde{\eta}^{*}(y)$ with respect to $y$ , we can observe that $\eta^{*}$ is the global minimum of this curve at point $y^{*}$ and the corresponding $\tilde{d}^{*}(y^{*})$ is the optimal policy of the original problem (4). From the sensitivity analysis of MDPs, we can derive the following lemma.

Lemma 2 (Critical points).

There exists a series of intervals $[y^{c}_{k-1},y^{c}_{k}]$ with $\bigcup\limits_{k=1,\dots,K}[y^{c}_{k-1},y^{c}_{k}]=[\underline{r},\overline{r}]$ , in which the optimal policy of $\mathcal{M}(y)$ remains unchanged as $\tilde{d}^{*}_{k}:=\tilde{d}^{*}(y)$ when $y\in[y^{c}_{k-1},y^{c}_{k}]$ .

Proof.

We rewrite the standard MDP problem $\mathcal{M}(y)$ as a linear programming (LP) model:

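$$\begin{array}{rcl}\tilde{\eta}^{*}(y)&=&\min\limits_{\bm{x}}\left\{\sum\limits_{i\in\mathcal{S}}\sum\limits_{a\in\mathcal{A}}[\beta(r(i,a)-y)^{2}-r(i,a)]x(i,a)\right\}\\ &\mbox{s.t.}&\sum\limits_{a\in\mathcal{A}}x(i,a)=\sum\limits_{j\in\mathcal{S}}\sum\limits_{a\in\mathcal{A}}p(i|j,a)x(j,a),\qquad\forall i\in\mathcal{S},\\ &&\sum\limits_{i\in\mathcal{S}}\sum\limits_{a\in\mathcal{A}}x(i,a)=1,\\ &&x(i,a)\geq 0,\qquad\forall i\in\mathcal{S},a\in\mathcal{A}.\end{array}\tag{11}$$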

The above problem can be represented as a standard LP model:

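$$\tilde{\eta}^{*}(y)-\beta y^{2}=\min_{\bm{x}}\left\{(\bm{c}+y\bm{c}^{\prime})^{\mathrm{T}}\bm{x}\mid\bm{A}\bm{x}=\bm{b},\ \bm{x}\geq\bm{0}\right\},\tag{12}$$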

where we utilize the fact $\sum\limits_{i\in\mathcal{S},a\in\mathcal{A}}\beta y^{2}x(i,a)=\beta y^{2}$ , the $S$ -by- $SA$ matrix $\bm{A}$ and the $S$ -dimensional column vector $\bm{b}$ are determined by the constraint equations in (11), $\bm{c}=\beta\bm{r}^{2}_{\odot}-\bm{r}$ , $\bm{c}^{\prime}=-2\beta\bm{r}$ , $\bm{r}$ and $\bm{x}$ are $SA$ -dimensional column vector with element $r(i,a)$ and $x(i,a)$ , respectively. We observe that the right-hand-side of (12) is a parametric linear programming (PLP) (Gal and Greenberg, 1997; Tan and Hartman, 2011) with a linear parameter $y$ . Below we do sensitivity analysis for this PLP problem. For a given $y$ , suppose $\bm{x}^{*}_{k}$ is the optimal solution of (12) and its associated basis matrix is $\bm{B}$ . We can verify that $\bm{x}_{k}^{*}$ in this LP is equivalent to the optimal policy $\tilde{d}^{*}_{k}$ of the MDP $\mathcal{M}(y)$ , where the optimal action in state $i$ is $\tilde{d}^{*}_{k}(i)\in\operatorname*{argmax}\limits_{a\in\mathcal{A}}\{x^{*}_{k}(i,a)\}$ , $i\in\mathcal{S}$ . With the terminology of LP, we denote $\bm{A}=[\bm{B},\bm{N}]$ , $\bm{x}=[\bm{x}_{B};\bm{x}_{N}]$ , $\bm{c}=[\bm{c}_{B};\bm{c}_{N}]$ , and $\bm{c}^{\prime}=[\bm{c}^{\prime}_{B};\bm{c}^{\prime}_{N}]$ . The optimality test of the simplex method requires that all the test coefficients should be nonpositive, i.e.,

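$$(\bm{c}_{B}+y\bm{c}^{\prime}_{B})^{\mathrm{T}}\bm{B}^{-1}\bm{A}-(\bm{c}+y\bm{c}^{\prime})^{\mathrm{T}}=(\bm{c}_{B}^{\mathrm{T}}\bm{B}^{-1}\bm{A}-\bm{c}^{\mathrm{T}})+y({\bm{c}^{\prime}_{B}}^{\mathrm{T}}\bm{B}^{-1}\bm{A}-{\bm{c}^{\prime}}^{\mathrm{T}})\leq\bm{0}.\tag{13}$$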

For notation simplicity, we denote the $SA$ -dimensional test coefficients vector as

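$$\bm{\zeta}^{\mathrm{T}}:=\bm{c}_{B}^{\mathrm{T}}\bm{B}^{-1}\bm{A}-\bm{c}^{\mathrm{T}},\qquad{\bm{\zeta}^{\prime}}^{\mathrm{T}}:={\bm{c}^{\prime}_{B}}^{\mathrm{T}}\bm{B}^{-1}\bm{A}-{\bm{c}^{\prime}}^{\mathrm{T}}.\tag{14}$$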

To find the interval $[y^{c}_{k-1},y^{c}_{k}]$ on which $\bm{x}^{*}_{k}$ remains optimal, we only need to find all $y$ satisfying $\bm{\zeta}+y\bm{\zeta}^{\prime}\leq\bm{0}$ . It is easy to verify that the solution is

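$$y^{c}_{k-1}=\max_{i\in\mathcal{S},a\in\mathcal{A}}\left\{-\frac{\zeta(i,a)}{\zeta^{\prime}(i,a)}\,\Bigg|\,\zeta^{\prime}(i,a)<0\right\}\qquad(\max\{\varnothing\}=-\infty),\tag{15}$$

$$y^{c}_{k}=\min_{i\in\mathcal{S},a\in\mathcal{A}}\left\{-\frac{\zeta(i,a)}{\zeta^{\prime}(i,a)}\,\Bigg|\,\zeta^{\prime}(i,a)>0\right\}\qquad(\min\{\varnothing\}=+\infty).\tag{16}$$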

Obviously, we can first let $y=\underline{r}$ , and use (15) and (16) to obtain $y^{c}_{0}=-\infty$ and $y^{c}_{1}$ , respectively. Other $y^{c}_{k}$ ’s can be computed sequentially, and $y^{c}_{K}=+\infty$ . The lemma is proved. ∎
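The interval computation in (15) and (16) amounts to a ratio test on the two test-coefficient vectors. A minimal sketch follows, assuming the vectors are given as flat Python lists indexed by state-action pair; the function name and the sample numbers are hypothetical.

```python
import math


def optimality_interval(zeta, zeta_prime):
    """Return (lo, hi) such that zeta + y * zeta_prime <= 0 for all y in [lo, hi].

    Entries with zeta_prime < 0 give lower bounds and entries with
    zeta_prime > 0 give upper bounds. Entries with zeta_prime == 0 impose no
    bound on y (they are assumed to satisfy zeta <= 0 already).
    """
    lo = max((-z / zp for z, zp in zip(zeta, zeta_prime) if zp < 0),
             default=-math.inf)
    hi = min((-z / zp for z, zp in zip(zeta, zeta_prime) if zp > 0),
             default=math.inf)
    return lo, hi


# Hypothetical test coefficients for four state-action pairs.
zeta = [-2.0, -1.0, 0.0, -3.0]
zeta_prime = [1.0, -0.5, 0.0, 2.0]
lo, hi = optimality_interval(zeta, zeta_prime)
print(lo, hi)  # -2.0 1.5
```

The `default` arguments implement the conventions $\max\{\varnothing\}=-\infty$ and $\min\{\varnothing\}=+\infty$ stated with the formulas.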

We call such $y^{c}_{k}$ ’s in Lemma 2 critical points for the sensitivity analysis of MDP $\mathcal{M}(y)$ , and $K+1$ is the number of critical points. Actually, by using the specific structures of $\bm{A},\bm{b},\bm{r}$ of the LP for $\mathcal{M}(y)$ , we can verify that

$$[\bm{B}^{-1}]^{\mathrm{T}}=(\bm{I}-\bm{P}^{d}+\bm{e}\bm{e}^{\mathrm{T}})^{-1},\tag{17}$$

which is a generalized fundamental matrix in MDPs (Xia and Glynn, 2016), where the policy $d$ corresponds to the vector of feasible basic variables $\bm{x}_{B}$ of the basis matrix $\bm{B}$ . The associated vector $\bm{b}$ equals $\bm{e}$ . The matrix $\bm{A}$ has a similar structure of $\bm{B}$ , i.e., $\bm{A}=\bm{I}_{e}-\bm{P}^{\mathrm{T}}_{e}+\bm{e}\bm{e}^{\mathrm{T}}$ , where $\bm{I}_{e}$ is an $S$ -by- $SA$ matrix whose element $I_{e}(i,(j,a))=1/0$ if $i=j$ /otherwise, $\bm{P}_{e}$ is an $SA$ -by- $S$ matrix whose element $P_{e}((i,a),j)=p(j|i,a)$ , $\bm{e}\bm{e}^{\mathrm{T}}$ is an $S$ -by- $SA$ matrix of 1’s. We also observe that $\bm{c}_{B}$ is the column vector of the cost function of the MDP under policy $d$ (associated with $\bm{x}_{B}$ ). We can derive that $\bm{c}^{\mathrm{T}}_{B}\bm{B}^{-1}$ is equal to the performance potential or relative value function in MDPs (Cao, 2007; Puterman, 1994). Thus, we can verify that $\bm{c}_{B}^{\mathrm{T}}\bm{B}^{-1}$ in (13) coincides with the Poisson equation in MDPs

[TABLE]

where $\bm{g}^{d}$ and ${\bm{g}^{\prime}}^{d}$ are $S$ -dimensional column vectors of performance potentials for the MDP under policy $d$ with cost functions $\beta{\bm{r}^{d}}^{2}_{\odot}-\bm{r}^{d}$ and $-2\beta\bm{r}^{d}$ , respectively. Thus, we can rewrite the elements of (13) and (14) as

[TABLE]

where we utilize a fact that $\bm{e}^{\mathrm{T}}\bm{g}^{d}$ equals the long-run average performance of the MDP under policy $d$ , which can be verified from the Poisson equation. Therefore, we can substitute (18) and (19) into (15) and (16) to compute all the critical points $y^{c}_{k}$ ’s.

Remark 2. Equations (18) and (19) interestingly indicate that the test coefficient $\zeta(i,a)$ in LP coincides with the advantage function $\tilde{A}^{d}(i,a)$ which is a key quantity widely used in reinforcement learning (Sutton and Barto, 2018). $\tilde{A}^{d}(i,a)<0$ means that action $a$ at state $i$ induces a smaller long-run average cost than the current policy $d$ in MDPs, which hints that $\zeta(i,a)>0$ and the corresponding variable $x(i,a)$ should be an entering basic variable in LP.

By using (6), we can obtain the relation between the pseudo and real mean-variance combined performances as

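$$\tilde{\eta}^{d}(y)=\beta[\sigma^{d}+(y-\mu^{d})^{2}]-\mu^{d}=\eta^{d}+\beta(y-\mu^{d})^{2},\qquad\forall d\in\mathcal{D},\ y\in\mathbb{R}.\tag{20}$$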

Since the optimal policy of $\mathcal{M}(y)$ remains the same as $\tilde{d}^{*}_{k}$ for any $y\in[y^{c}_{k-1},y^{c}_{k}]$ , we have

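$$\begin{array}{rcll}\tilde{\eta}^{*}(y)&=&\eta^{\tilde{d}^{*}_{k}}+\beta(y-\mu^{\tilde{d}^{*}_{k}})^{2},&\forall y\in[y^{c}_{k-1},y^{c}_{k}],\\ \tilde{\eta}^{*}(y)&=&\min\limits_{k=1,\dots,K}\left\{\eta^{\tilde{d}^{*}_{k}}+\beta(y-\mu^{\tilde{d}^{*}_{k}})^{2}\right\},&\forall y\in\mathbb{R}.\end{array}$$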

That is, each piece of curves in Fig. 2 is a convex quadratic function of $y$ , and the whole curve is the minimum of all these quadratic functions. With (20), it is interesting to note that all the piecewise curves have the same shape (the same term of $\beta y^{2}$ ) but different positions. At each critical point $y^{c}_{k}$ , we can validate that both $\tilde{d}^{*}_{k}$ and $\tilde{d}^{*}_{k+1}$ are optimal policies of MDP $\mathcal{M}(y^{c}_{k})$ , so $\tilde{\eta}^{*}(y)$ is continuous in $y$ . Therefore, we directly derive the following lemma.

Lemma 3.

The pseudo mean-variance performance $\tilde{\eta}^{*}(y)$ is a convex piecewise quadratic function and continuous in $y$ , and its global minimum is the optimal solution of (4).

From Fig. 2, we can observe that $\min\limits_{y}\{\tilde{\eta}^{*}(y)\}$ is difficult to solve, because $\tilde{\eta}^{*}(y)$ may have multiple local optima $\hat{y}^{*}$ , each of which has a zero derivative, i.e.,

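$$\frac{\partial\tilde{\eta}^{*}(y)}{\partial y}\Big|_{\hat{y}^{*}}=2\beta(\hat{y}^{*}-\mu^{\tilde{d}^{*}_{k}})=0,\qquad\mbox{if }\hat{y}^{*}\in[y^{c}_{k-1},y^{c}_{k}].$$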

This indicates that a local optimum $\hat{y}^{*}$ must satisfy the following fixed point equation

$$y=\mu^{\tilde{d}^{*}(y)}.\tag{21}$$

Note that the fixed point solutions of (21), as indicated by “inverted” triangles in Fig. 2, are not necessarily local optima of $\tilde{\eta}^{*}(y)$ . The reason is that when a fixed point is also a critical point $y^{c}_{k}$ , we can verify that its left-derivative is 0 (or positive) and its right-derivative is negative (or 0), so such a point is not a local optimum $\hat{y}^{*}$ of $\tilde{\eta}^{*}(y)$ , as illustrated by Fig. 2. We can verify that the policies indicated by all these fixed point solutions of (21) are exactly the so-called local optimal policies in the mixed or randomized policy space of MDPs, as discovered by Xia (2020). These two kinds of local optima are different: the local optima $\hat{y}^{*}$ in this paper form a subset of the local optimal policies (fixed point solutions) defined in Xia (2020), as illustrated by triangles and “inverted” triangles in Fig. 2, respectively.
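To make the fixed point condition $y=\mu^{\tilde{d}^{*}(y)}$ concrete, the sketch below works directly with the relation $\tilde{\eta}^{d}(y)=\eta^{d}+\beta(y-\mu^{d})^{2}$ on a hypothetical set of three policies summarized by their (mean, variance) pairs. The pairs and helper names are invented for illustration and are not claimed to come from any particular MDP.

```python
BETA = 1.0
# Hypothetical (mean, variance) pairs of three policies; illustrative only.
STATS = {"d1": (1.0, 0.2), "d2": (2.0, 2.5), "d3": (3.0, 1.9)}


def eta(name):
    """Real mean-variance performance beta * sigma - mu."""
    mu, sigma = STATS[name]
    return BETA * sigma - mu


def pseudo_eta(name, y):
    """Pseudo performance eta^d + beta * (y - mu^d)^2."""
    mu, _ = STATS[name]
    return eta(name) + BETA * (y - mu) ** 2


def inner_optimal_value(y):
    return min(pseudo_eta(name, y) for name in STATS)


def fixed_point_policies(tol=1e-12):
    """Policies d that are optimal for the inner problem at y = mu^d."""
    result = []
    for name, (mu, _) in STATS.items():
        if pseudo_eta(name, mu) <= inner_optimal_value(mu) + tol:
            result.append(name)
    return result


candidates = fixed_point_policies()
best = min(candidates, key=eta)
print(candidates, best)  # ['d1', 'd3'] d3
```

In this toy instance two policies satisfy the fixed point condition but only `d3` is globally optimal, which mirrors the multiple-local-optima difficulty described above.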

Note that $\tilde{d}^{*}$ is optimal only for the pseudo problem (10), not for the original problem (4). Fortunately, we discover that $\tilde{d}^{*}$ has a better mean-variance performance than some other policies, which is described by the following lemma.

Lemma 4 (Policy dominance).

For any $y\in\mathbb{R}$, let $\tilde{d}^{*}$ be an optimal policy of the MDP $\mathcal{M}(y)$ in (10). If a policy $d\in\mathcal{D}$ satisfies $\mu^{d}\in[y-|y-\mu^{\tilde{d}^{*}}|,\ y+|y-\mu^{\tilde{d}^{*}}|]$, then

$$\beta\sigma^{\tilde{d}^{*}}-\mu^{\tilde{d}^{*}}\ \leq\ \beta\sigma^{d}-\mu^{d}. \tag{22}$$

Proof.

Since $\tilde{d}^{*}$ is an optimal policy of the standard MDP problem $\mathcal{M}(y)$ , we have

[TABLE]

With (6), we derive

[TABLE]

Substituting the above equation into (23), we have

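$$\beta\sigma^{\tilde{d}^{*}}+\beta(y-\mu^{\tilde{d}^{*}})^{2}-\mu^{\tilde{d}^{*}}\ \leq\ \beta\sigma^{d}+\beta(y-\mu^{d})^{2}-\mu^{d},\quad\forall d\in\mathcal{D}. \tag{24}$$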

For any policy $d$ satisfying $\mu^{d}\in[y-|y-\mu^{\tilde{d}^{*}}|,\ y+|y-\mu^{\tilde{d}^{*}}|]$ , we have

$$(y-\mu^{\tilde{d}^{*}})^{2}\ \geq\ (y-\mu^{d})^{2}. \tag{25}$$

Substituting (25) into (24), we directly obtain (22), and the lemma is proved. ∎

Moreover, if $d$ satisfies $\mu^{d}\in(y-|y-\mu^{\tilde{d}^{*}}|,\ y+|y-\mu^{\tilde{d}^{*}}|)$, we have $(y-\mu^{\tilde{d}^{*}})^{2}>(y-\mu^{d})^{2}$, and the inequality in (22) holds strictly. Therefore, Lemma 4 indicates that $\tilde{d}^{*}$ dominates all the policies whose means lie in the interval $[y-|y-\mu^{\tilde{d}^{*}}|,\ y+|y-\mu^{\tilde{d}^{*}}|]$, and these dominated policies can be removed from the policy space $\mathcal{D}$ to save computation. We illustrate this property in Fig. 3, where the shaded area can be discarded since the policies therein are always dominated by $\tilde{d}^{*}$. Thus, we can use Lemma 4 to significantly reduce the complexity of the mean-variance problem (4).
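The dominance property of Lemma 4 can be checked numerically on a small example. The sketch below is illustrative only. It assumes a finite list of hypothetical `(mu, sigma)` pairs in place of the policy space, a brute-force search in place of a proper MDP solver for $\mathcal{M}(y)$, and the pseudo objective $\beta(\sigma^{d}+(\mu^{d}-y)^{2})-\mu^{d}$.

```python
from typing import List, Tuple

Policy = Tuple[float, float]  # (mean mu, variance sigma); hypothetical policy summary


def mean_variance(policy: Policy, beta: float) -> float:
    """Original objective beta * variance - mean (to be minimized)."""
    mu, sigma = policy
    return beta * sigma - mu


def pseudo_mv(policy: Policy, y: float, beta: float) -> float:
    """Pseudo objective at pseudo mean y (assumed form)."""
    mu, sigma = policy
    return beta * (sigma + (mu - y) ** 2) - mu


def dominated_interval(y: float, mu_opt: float) -> Tuple[float, float]:
    """Interval of means dominated by the optimal policy of M(y), per Lemma 4."""
    radius = abs(y - mu_opt)
    return y - radius, y + radius


def check_dominance(policies: List[Policy], y: float, beta: float,
                    eps: float = 1e-9) -> bool:
    """True if no policy with mean in the dominated interval beats the inner optimum."""
    best = min(policies, key=lambda p: pseudo_mv(p, y, beta))  # stand-in inner solver
    lo, hi = dominated_interval(y, best[0])
    return all(
        mean_variance(best, beta) <= mean_variance(p, beta) + eps
        for p in policies
        if lo <= p[0] <= hi
    )
```

For instance, `check_dominance([(-4.0, 0.5), (-3.0, 0.1), (-5.0, 0.05)], y=-3.75, beta=1.0)` evaluates the claim for one choice of $y$. The tolerance `eps` is only a guard against floating-point rounding.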

With Lemma 4, we can develop an algorithm to iteratively solve the bilevel MDP problem (8), which is described in Algorithm 1. The key idea is to solve a series of auxiliary problems $\mathcal{M}(y)$ and repeatedly shrink the value domain $\mathbb{Y}$ of the auxiliary variable $y$ by using Lemma 4. When $\mathbb{Y}$ has shrunk to the empty set, the best-so-far solution of the problems $\mathcal{M}(y)$ is the global optimum of the bilevel MDP problem (8) or (9). The global convergence of Algorithm 1 is guaranteed by the following theorem.

Theorem 1.

Algorithm 1 converges to the global optimum of the mean-variance optimization problem after a finite number of iterations.

Proof.

To prove the convergence of Algorithm 1, we only need to prove that the value domain $\mathbb{Y}$ is reduced to an empty set after a finite number of iterations. From the algorithm procedure, we observe that each iteration solves an auxiliary problem $\mathcal{M}(y_{l})$, derives a policy $\tilde{d}^{*}(y_{l})$, and removes a rectangular area with y-axis interval $[y_{l}-|y_{l}-\mu^{\tilde{d}^{*}(y_{l})}|,\ y_{l}+|y_{l}-\mu^{\tilde{d}^{*}(y_{l})}|]$, as stated by Lemma 4. This removed area contains at least the policy $\tilde{d}^{*}(y_{l})$, as illustrated by the 1st and 2nd iterations in Fig. 4. Usually, it contains multiple policies dominated by the policy $\tilde{d}^{*}(y_{l})$, as illustrated by Fig. 3. If the current policy $\tilde{d}^{*}(y_{l})$ has already been removed by previous domain shrinking operations, the current operation removes at least the interval $\mathbb{Y}_{1}$, as illustrated by the 3rd, 4th, and 5th iterations in Fig. 4. This follows from Lemma 4: since $y_{l}$ is the median value of $\mathbb{Y}_{1}$ and $\mu^{\tilde{d}^{*}(y_{l})}$ lies outside $\mathbb{Y}_{1}$, the removed interval covers $\mathbb{Y}_{1}$. In summary, each domain shrinking operation deletes either at least one policy (not deleted previously) or at least one interval $\mathbb{Y}_{1}$. It is easy to verify that the number of intervals is at most $|\mathcal{D}|+1$, which is attained when each iteration deletes only a very small area around $\tilde{d}^{*}(y_{l})$. Therefore, in the worst case, we need $|\mathcal{D}|$ iterations to delete every policy and $|\mathcal{D}|+1$ iterations to delete every interval, so the algorithm stops after at most $2|\mathcal{D}|+1$ iterations.

Since each $\tilde{d}^{*}(y_{l})$ dominates all the other policies located in the shrunk area of the $l$ -th iteration, the best-so-far solution among all $\tilde{d}^{*}(y_{l})$ ’s is the global optimum of the original mean-variance optimization problem. This completes the proof. ∎

Fig. 4 illustrates the worst case for a policy space with only 2 policies, where $2\times 2+1=5$ iterations are required to cover the whole interval $[\underline{r},\overline{r}]$. From the proof of Theorem 1, we directly derive the following corollary about the computational complexity of Algorithm 1.

Corollary 1.

The computational complexity of Algorithm 1 is at most $2|\mathcal{D}|+1$ times that of solving $\mathcal{M}(y)$.

Although the above computational complexity is not attractive, it accounts for the worst case. Numerical experiments in Section 4 demonstrate that the convergence speed of Algorithm 1 is very fast in most cases.
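The domain shrinking procedure described above can be sketched in a few lines. This is a hedged illustration based only on the textual description, not a reproduction of Algorithm 1. It assumes that the policy space is given as a finite list of hypothetical `(mu, sigma)` pairs, that $\mathcal{M}(y)$ is solved by brute force instead of a standard MDP method, and that $y_{l}$ is the midpoint of the first remaining interval. The tolerance `tol` and the cap `max_iter` are additions to guard against floating-point slivers.

```python
from typing import List, Optional, Tuple

Policy = Tuple[float, float]    # (mean mu, variance sigma); hypothetical summary
Interval = Tuple[float, float]  # closed interval [a, b] still to be covered


def mean_variance(policy: Policy, beta: float) -> float:
    mu, sigma = policy
    return beta * sigma - mu


def solve_inner(policies: List[Policy], y: float, beta: float) -> Policy:
    """Brute-force stand-in for solving the standard MDP M(y)."""
    return min(policies, key=lambda p: beta * (p[1] + (p[0] - y) ** 2) - p[0])


def remove_interval(domain: List[Interval], lo: float, hi: float,
                    tol: float) -> List[Interval]:
    """Remove [lo, hi] from a list of disjoint closed intervals."""
    out: List[Interval] = []
    for a, b in domain:
        if hi < a or lo > b:   # no overlap: keep the interval unchanged
            out.append((a, b))
            continue
        if lo - a > tol:       # part left of the removed interval survives
            out.append((a, lo))
        if b - hi > tol:       # part right of the removed interval survives
            out.append((hi, b))
    return out


def domain_shrinking_search(policies: List[Policy], beta: float,
                            r_low: float, r_high: float,
                            tol: float = 1e-12,
                            max_iter: int = 100_000
                            ) -> Tuple[Optional[Policy], int]:
    """Return the best policy found and the number of inner problems solved."""
    domain: List[Interval] = [(r_low, r_high)]
    best: Optional[Policy] = None
    iterations = 0
    while domain and iterations < max_iter:
        a, b = domain[0]
        y = 0.5 * (a + b)                   # pseudo mean for this iteration
        d = solve_inner(policies, y, beta)  # optimal policy of M(y)
        if best is None or mean_variance(d, beta) < mean_variance(best, beta):
            best = d
        radius = abs(y - d[0])              # Lemma 4: [y - radius, y + radius] is dominated
        domain = remove_interval(domain, y - radius, y + radius, tol)
        iterations += 1
    return best, iterations
```

With the four illustrative policies `[(-4.0, 0.5), (-3.0, 0.1), (-5.0, 0.05), (-2.5, 1.0)]`, `beta=1.0`, and the domain `[-5.0, -2.5]`, the sketch returns the policy that also minimizes $\beta\sigma^{d}-\mu^{d}$ by direct enumeration.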

Remark 3. By changing the update rule of $y_{l}$ in Algorithm 1, we can obtain different versions of the algorithm. One example is to let $y_{l+1}=\mu^{\tilde{d}^{*}(y_{l})}$, i.e., the pseudo mean $y_{l+1}$ is set to the real mean of the optimal policy $\tilde{d}^{*}(y_{l})$ of $\mathcal{M}(y_{l})$. Such a revised algorithm is very similar to the policy iteration algorithm for solving the local optimality equation in Xia (2020), and both converge to a fixed point solution of (21).
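The update rule in Remark 3 can be illustrated with a short sketch. It is an illustration under assumptions rather than the method of Xia (2020): policies are hypothetical `(mu, sigma)` pairs, the inner problem is solved by brute force, and the pseudo objective is taken as $\beta(\sigma^{d}+(\mu^{d}-y)^{2})-\mu^{d}$.

```python
from typing import List, Optional, Tuple

Policy = Tuple[float, float]  # (mean mu, variance sigma); hypothetical summary


def pseudo_mv(policy: Policy, y: float, beta: float) -> float:
    mu, sigma = policy
    return beta * (sigma + (mu - y) ** 2) - mu


def fixed_point_iteration(policies: List[Policy], beta: float, y0: float,
                          max_iter: int = 1000) -> Tuple[Optional[Policy], float]:
    """Iterate y <- mean of the optimal policy of M(y) until y stops changing."""
    y = y0
    d: Optional[Policy] = None
    for _ in range(max_iter):
        d = min(policies, key=lambda p: pseudo_mv(p, y, beta))
        if d[0] == y:   # y equals the mean of its own inner optimum
            break
        y = d[0]
    return d, y
```

On the illustrative set `[(-4.0, 0.5), (-3.0, 0.1), (-5.0, 0.05), (-2.5, 1.0)]` with `beta=1.0`, starting from `y0=-5.0` the iteration stays at the policy with mean $-5$, whereas starting from `y0=-2.5` it reaches the policy with mean $-3$, which has a lower value of $\beta\sigma^{d}-\mu^{d}$. The result of this update rule therefore depends on the starting point.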

Besides Lemma 4, we may further improve the shrinking efficiency of dominated areas by using other properties. From the viewpoint of bi-objective optimization in Fig. 1, the minimization of the objective $\beta\sigma^{d}-\mu^{d}$ can be interpreted as finding the last solution $d^{*}$ tangent to a line of slope $\beta$ as the line moves toward the top-left. All the solutions located on the lower-right side of this line have a worse objective $\beta\sigma^{d}-\mu^{d}$ than the solution $d^{*}$. This fact is illustrated by Fig. 5.

Therefore, based on an optimal policy $\tilde{d}^{*}$ obtained by solving the pseudo mean-variance MDP $\mathcal{M}(y)$, we directly derive the following lemma about the shrinkage of dominated areas.

Lemma 5.

For any policy $\tilde{d}^{*}\in\mathcal{D}$, the policies in the following areas are dominated by $\tilde{d}^{*}$ and can be discarded:

① any policy $d\in\mathcal{D}$ satisfying $\mu^{d}\in(-\infty,\ \mu^{\tilde{d}^{*}}-\beta\sigma^{\tilde{d}^{*}}]$;

② any policy $d\in\mathcal{D}$ satisfying $\mu^{d}\in[\mu^{\tilde{d}^{*}}-\beta\sigma^{\tilde{d}^{*}},\ +\infty)$ and $\sigma^{d}\in[\sigma^{\tilde{d}^{*}}+\frac{1}{\beta}(\mu^{d}-\mu^{\tilde{d}^{*}}),\ +\infty)$.
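A short verification of both areas may help the reader. It is a sketch under the assumptions $\beta>0$ and $\sigma^{d}\geq 0$ (as $\sigma^{d}$ is a variance), and it uses the shorthand $(\mu^{*},\sigma^{*})$ for $(\mu^{\tilde{d}^{*}},\sigma^{\tilde{d}^{*}})$, which is introduced here only for brevity.

```latex
% Assumes \beta > 0 and \sigma^{d} \ge 0; (\mu^{*}, \sigma^{*}) is shorthand for the mean and variance of \tilde{d}^{*}.
\begin{align*}
\text{Area 1:}\quad & \mu^{d}\le\mu^{*}-\beta\sigma^{*}
  \;\Longrightarrow\;
  \beta\sigma^{d}-\mu^{d}\;\ge\;-\mu^{d}\;\ge\;\beta\sigma^{*}-\mu^{*},\\
\text{Area 2:}\quad & \sigma^{d}\ge\sigma^{*}+\tfrac{1}{\beta}\,(\mu^{d}-\mu^{*})
  \;\Longrightarrow\;
  \beta\sigma^{d}-\mu^{d}\;\ge\;\beta\sigma^{*}+(\mu^{d}-\mu^{*})-\mu^{d}\;=\;\beta\sigma^{*}-\mu^{*}.
\end{align*}
```

In both cases the policy $d$ has an objective value $\beta\sigma^{d}-\mu^{d}$ no smaller than that of $\tilde{d}^{*}$, so discarding it cannot remove a strictly better solution.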

The area ① in Lemma 5 is similar to the area discarded by Lemma 4: both are rectangular areas that place no constraint on the variance. Therefore, we can use rule ① in Lemma 5 to speed up the domain shrinking of $\mathbb{Y}$ in Algorithm 1. That is, at line 7 of Algorithm 1, we can add an extra operation to discard the area ① indicated by Lemma 5:

[TABLE]

We call this revised algorithm Algorithm 1-Plus, and its performance is compared with that of Algorithm 1 in the numerical experiments in Section 4. For example, in Fig. 4, when $\beta$ is relatively small, Iteration 5 can be skipped if we apply (26) to the policy $\tilde{d}^{*}(y_{1})$ at Iteration 2. This shows that Algorithm 1-Plus can be computationally cheaper than Algorithm 1.

Remark 4. It is easy to verify that all the results in this paper can be extended to solely minimizing the steady-state variance of MDPs. One trivial method is to choose the coefficient $\beta$ in (4) large enough to approximate the variance minimization of MDPs. In fact, if we replace the mean-variance objective $\beta\sigma^{d}-\mu^{d}$ in (4) with the variance $\sigma^{d}$, we can rigorously prove that all the previous results hold for this variance minimization problem. Algorithm 1 can then also be used to find the optimal policy that attains the global minimum of the variance in MDPs.

4 Numerical Experiments

In this section, we validate the proposed algorithms with a multi-period inventory control problem, where we consider both the steady-state mean and variance of rewards. This problem is modeled as an infinite-horizon discrete-time undiscounted MDP. The inventory capacity is $C\in\mathbb{N}^{+}$. At each epoch $t=0,1,\dots$, the inventory level $s_{t}\in\mathcal{S}=\{0,1,\dots,C\}$ is reviewed and an order $a_{t}\in\mathcal{A}(s_{t})=\{0,\dots,C-s_{t}\}$ is placed. The demands $\xi_{t}\sim B(C,p)$ are independent and identically distributed, where $B(C,p)$ is a binomial distribution with $C$ trials and success probability $p$. There is no lead time, and the next inventory level is determined as $s_{t+1}=[s_{t}+a_{t}-\xi_{t}]^{+}$. The reward function is $r(s_{t},a_{t})=\mathbb{E}[r(s_{t+1}|s_{t},a_{t})]=-\mathbb{E}\{ba_{t}+hs_{t+1}+l[\xi_{t}-s_{t}-a_{t}]^{+}\}$, where $b,h$ and $l$ are the ordering, holding and shortage costs per unit, respectively. By default, we set $C=4,p=0.6,b=1,h=0.7,l=2.9,$ and $\beta=10$. We run each algorithm for 50 replications for statistical analysis.
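The inventory model above can be written out explicitly. The following sketch builds the transition probabilities and expected rewards directly from the stated dynamics. The function names and the dictionary layout are illustrative choices, not taken from the paper, and the default arguments are the default parameters listed above.

```python
from math import comb
from typing import Dict, List, Tuple

StateAction = Tuple[int, int]  # (inventory level s, order quantity a)


def binomial_pmf(n: int, p: float) -> List[float]:
    """Probability mass function of B(n, p) on {0, ..., n}."""
    return [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]


def build_inventory_mdp(C: int = 4, p: float = 0.6, b: float = 1.0,
                        h: float = 0.7, l: float = 2.9
                        ) -> Tuple[Dict[StateAction, List[float]],
                                   Dict[StateAction, float]]:
    """Return (P, r): next-state distributions and expected rewards per (s, a)."""
    pmf = binomial_pmf(C, p)
    P: Dict[StateAction, List[float]] = {}
    r: Dict[StateAction, float] = {}
    for s in range(C + 1):
        for a in range(C - s + 1):      # orders may not exceed the capacity
            stock = s + a
            probs = [0.0] * (C + 1)
            cost = b * a                # ordering cost
            for xi, q in enumerate(pmf):
                s_next = max(stock - xi, 0)
                probs[s_next] += q
                # expected holding cost plus expected shortage cost
                cost += q * (h * s_next + l * max(xi - stock, 0))
            P[(s, a)] = probs
            r[(s, a)] = -cost
    return P, r


def count_deterministic_policies(C: int) -> int:
    """Number of deterministic stationary policies: product of |A(s)| over s."""
    total = 1
    for s in range(C + 1):
        total *= C - s + 1
    return total
```

Since state $s$ has $C-s+1$ admissible orders, the product over all states equals $(C+1)!$, so `count_deterministic_policies(4)` gives 120 for the default capacity.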

Fig. 6 illustrates an example of the convergence process of Algorithm 1, where the interval $[\underline{r},\overline{r}]$ is covered iteratively and the global optimum is found after only 6 iterations. This demonstrates the efficiency of Algorithm 1, even though the policy space is as large as $|\mathcal{D}|=(C+1)!$.

As a comparison, we also implement the local optimization algorithm proposed by Xia (2020). Considering that the mean-variance optimization of this problem usually has multiple local optima, we illustrate the performance comparison of these two algorithms in Fig. 7, where different problem sizes $C\in\{4,7,10,20,30,50\}$ are used. We can see that our global algorithm performs much better, while the local algorithm by Xia (2020) may converge to different local optima, as shown by the whiskers of standard deviations.

Fig. 8 shows the curves of the optimal pseudo mean-variance $\tilde{\eta}^{*}(y)$ with respect to the pseudo mean $y$. For capacity $C=4$, the global optimum is $\eta^{*}=4.500$ and the other two local optima are 5.376 and 6.382, which coincide with the left pair of bars in Fig. 7. The pseudo mean corresponding to $\eta^{*}$ is $y^{*}=-3.891$, which also equals the mean of the star point in the last subfigure of Fig. 6. All these results demonstrate that Algorithm 1 indeed finds the global optimum, whereas the local algorithm by Xia (2020) may converge to different local optima. Moreover, as the capacity increases, the curve of $\tilde{\eta}^{*}(y)$ has more local optima and the local algorithm is more likely to be trapped in a worse local optimum. This also explains the large performance gaps in Fig. 7 when the capacity is large.

Furthermore, we study the effect of the risk coefficient $\beta$ on the curve $\tilde{\eta}^{*}(y)$, as illustrated in Fig. 9. We observe that the problem complexity increases with $\beta$. When $\beta$ is small, the curve has only a single local optimum, which indicates that the problem is easy to solve. This is because a mean-variance problem with a small $\beta$ is approximately equivalent to optimizing only the mean performance, which is a standard MDP that is easy to solve. Conversely, when $\beta$ is large, the curve has multiple local optima and the associated optimization problem is difficult to solve.

Finally, we study the effect of Lemma 5 on the algorithm efficiency. We compare the performance difference between Algorithm 1 and Algorithm 1-Plus under different capacities and $\beta$ ’s. We observe that Algorithm 1-Plus can achieve a significant efficiency improvement when the problem size (capacity) is large, as shown in Fig. 10(a). When $\beta$ is changed, there are three cases as shown in Fig. 10(b):

1. When $\beta$ is relatively small ($\leq 0.1$), the variance is negligible, and the mean-variance optimization is approximately equivalent to a mean optimization problem, which is a standard MDP. The problem is relatively easy, and the two algorithms have similar efficiency.

2. When $\beta$ is relatively large ($\geq 100$), Lemma 5 rarely removes anything, because the area it discards mostly covers means smaller than $\underline{r}$, as can be seen from the intercept in Fig. 5 when the line slope is large. Thus, the two algorithms also have similar efficiency in this case.

3. In other cases, Lemma 5 significantly improves the convergence speed, and Algorithm 1-Plus is considerably more efficient than Algorithm 1.

5 Discussion and Conclusion

This paper proposes global algorithms for solving multi-period mean-variance optimization in the framework of MDPs, which is a long-standing challenge caused by the failure of dynamic programming. We convert this problem to a bilevel MDP formulation, where the inner optimization is a standard MDP $\mathcal{M}(y)$ for pseudo mean-variance optimization and the outer one is a single parameter selection problem optimizing the pseudo mean $y$. Interestingly, the optimal value of $\mathcal{M}(y)$ is a piecewise quadratic function of $y$ with convex pieces. Exploiting the fact that the pseudo variance differs from the real variance by a squared term, we discover policy dominance properties that help remove worse policy spaces iteratively. The global optimum can be found by repeatedly removing these dominated policy spaces. The convergence and efficiency of our algorithms are studied both theoretically and experimentally.

Our work demonstrates a promising approach to globally optimize the steady-state mean-variance metrics in undiscounted MDPs. It is meaningful to further extend our approach to mean-variance optimization of discounted MDPs. Another interesting topic is to develop reinforcement learning algorithms based on our global optimization approach, which can make our approach implementable in a data-driven environment.

Bibliography

Bisi, L., Sabbioni, L., Vittori, E., Papini, M., and Restelli, M. (2020). Risk-averse trust region optimization for reward-volatility reduction. Proceedings of the 29th International Joint Conference on Artificial Intelligence (IJCAI’2020), Special Track on AI in FinTech, 4583-4589.
Borkar, V. (2010). Learning algorithms for risk-sensitive control. Proceedings of the 19th International Symposium on Mathematical Theory of Networks and Systems (MTNS’2010), July 5-9, 2010, Budapest, Hungary, 1327-1332.
Cao, X. R. (2007). Stochastic Learning and Optimization – A Sensitivity-Based Approach. New York: Springer.
Chung, K. J. (1994). Mean-variance tradeoffs in an undiscounted MDP: The unichain case. Operations Research 42, 184-188.
Cui, X. Y., Gao, J. J., Li, X., and Shi, Y. (2022). Survey on multi-period mean-variance portfolio selection model. Journal of the Operations Research Society of China 10, 599-622.
Dai, M., Jin, H., Kou, S., and Xu, Y. (2021). A dynamic mean-variance analysis for log returns. Management Science 67(2), 1093-1108.
Filar, J. A. and Lee, H. M. (1985). Gain/variability tradeoffs in undiscounted Markov decision processes. Proceedings of the 24th IEEE Conference on Decision and Control (CDC’1985), 1106-1112.
Gal, T. and Greenberg, H. J. (1997). Advances in Sensitivity Analysis and Parametric Programming. Kluwer, Dordrecht.
