Computing monotone policies for Markov decision processes: a nearly-isotonic penalty approach

Robert Mattila; Cristian R. Rojas; Vikram Krishnamurthy; Bo Wahlberg

arXiv:1704.00621·cs.SY·April 4, 2017

TL;DR

This paper introduces a novel two-stage convex optimization method leveraging nearly-isotonic regularization to efficiently compute monotone policies in Markov decision processes, significantly accelerating the solution process.

Contribution

It proposes a new alternating convex optimization scheme that exploits monotonicity in MDPs using nearly-isotonic regression, enhancing computational efficiency.

Findings

1. ADMM can be significantly accelerated with the regularization step.

2. The proposed method outperforms traditional approaches in numerical simulations.

3. Monotone policies can be efficiently computed using the two-stage scheme.

Abstract

This paper discusses algorithms for solving Markov decision processes (MDPs) that have monotone optimal policies. We propose a two-stage alternating convex optimization scheme that can accelerate the search for an optimal policy by exploiting the monotone property. The first stage is a linear program formulated in terms of the joint state-action probabilities. The second stage is a regularized problem formulated in terms of the conditional probabilities of actions given states. The regularization uses techniques from nearly-isotonic regression. While a variety of iterative methods can be used in the first formulation of the problem, we show in numerical simulations that, in particular, the alternating direction method of multipliers (ADMM) can be significantly accelerated using the regularization step.

Tables (1)

Table 1: Influence of the parameter $\rho$ on the number of iterations needed to definitively reach the tolerance in cost and feasibility. The tolerance on the relative error in cost was 1% and the tolerance on the residual was $10^{-4}$.

$\rho$	Plain ADMM: $\|r^{(n)}\|_{\infty} < \varepsilon_{r}$	Plain ADMM: $\frac{|c^{(n)} - c^{*}|}{c^{*}} < \varepsilon_{c}$	Proposed method: $\|r^{(n)}\|_{\infty} < \varepsilon_{r}$	Proposed method: $\frac{|c^{(n)} - c^{*}|}{c^{*}} < \varepsilon_{c}$
0.1	> 250	191	> 250	> 250
1.0	94	50	137	76
5.0	68	22	71	31
10.0	82	31	94	21
20.0	118	64	70	27
30.0	169	96	77	28
40.0	212	128	78	31
50.0	246	160	79	31
60.0	> 250	192	93	31
70.0	> 250	224	101	31
80.0	> 250	> 250	116	31
90.0	> 250	> 250	129	31
100.0	> 250	> 250	143	31

Equations

$$\max\{0,\, f(x) - f(x+1)\}, \tag{1}$$

$$P_{ij}(u,k) = \operatorname{Pr}[x_{k+1} = j \mid x_{k} = i,\, u_{k} = u], \tag{2}$$

$$\bm{\mu}^{*} = \operatorname*{arg\,min}_{\bm{\mu}} J_{\bm{\mu}}(x), \tag{3}$$

$$J_{\bm{\mu}}(x) = \mathbb{E}\Bigg\{\sum_{k=0}^{N-1} c(x_{k},u_{k},k) + c_{N}(x_{N}) \,\Big|\, x_{0} = x\Bigg\} \tag{4}$$

$$\tilde{\mu}_{k}(x) = \mathbb{E}\big\{u_{k} \mid x_{k} = x\big\} \tag{5}$$

$$\mathbb{E}\Big\{\sum_{k=0}^{N}\beta_{l}(x_{k},u_{k},k)\Big\} \leq \gamma_{l} \quad \text{for } l = 1,\dots,L, \tag{6}$$

$$\begin{aligned}
\min_{\pi \in \mathbb{R}^{XU(N+1)}} \quad & \sum_{x\in\mathcal{X}}\sum_{u\in\mathcal{U}} \Big\{ \sum_{k=0}^{N-1} c(x,u,k)\,\pi(x,u,k) + c_{N}(x)\,\pi(x,u,N) \Big\} \\
\text{s.t.} \quad & \sum_{u\in\mathcal{U}} \pi(x,u,0) = \text{I}\{x = x_{0}\} \quad \text{for } x \in \mathcal{X}, \\
& \sum_{u\in\mathcal{U}} \pi(j,u,k) = \sum_{i\in\mathcal{X}}\sum_{u\in\mathcal{U}} P_{ij}(u,k)\,\pi(i,u,k-1) \quad \text{for } j\in\mathcal{X},\; k = 1,2,\dots,N, \\
& \pi(x,u,k) \geq 0 \quad \text{for } x\in\mathcal{X},\; u\in\mathcal{U},\; k=0,1,\dots,N, \\
& \sum_{x\in\mathcal{X}}\sum_{u\in\mathcal{U}}\sum_{k=0}^{N} \pi(x,u,k)\,\beta_{l}(x,u,k) \leq \gamma_{l} \quad \text{for } l = 1,2,\dots,L.
\end{aligned} \tag{7}$$

$$\pi(x,u,k) = \operatorname{Pr}[x_{k} = x,\, u_{k} = u]. \tag{8}$$

$$u_{k}^{*}(x) = u \quad \text{with probability } \theta(x,u,k), \tag{9}$$

$$\theta(x,u,k) = \frac{\pi(x,u,k)}{\sum_{\bar{u}\in\mathcal{U}} \pi(x,\bar{u},k)}. \tag{10}$$

$$\tilde{\mu}_{k}(x) = \sum_{u=1}^{U} u\,\theta(x,u,k) = [1\;\; 2\;\; \dots\;\; U]\;\theta(x,:,k). \tag{11}$$

$$\sum_{x=1}^{X-1}\big\{\tilde{\mu}_{k}(x)-\tilde{\mu}_{k}(x+1)\big\}_{+} \tag{12}$$

$$\min_{\alpha}\; \ldots \quad \text{s.t.} \quad \ldots, \quad \alpha \geq 0, \tag{13}$$

$$\begin{bmatrix} \rho I & A \\ A^{T} & 0 \end{bmatrix} \begin{bmatrix} \alpha^{(n+1)} \\ \nu \end{bmatrix} + \begin{bmatrix} q - \rho\,(z^{(n)} - \eta^{(n)}) \\ -b \end{bmatrix} = 0, \tag{14}$$

$$r^{(n)} = \alpha^{(n)} - z^{(n)}. \tag{17}$$

$$\tilde{\mu}_{k}(x) = \sum_{u=1}^{U} u\,\theta(x,u,k) = [1\;\; 2\;\; \dots\;\; U]\;\theta(x,:,k), \tag{18}$$

$$p(x,k) = \operatorname{Pr}[x_{k} = x] \tag{20}$$

$$p(x,k) = \sum_{\bar{u}\in\mathcal{U}} \operatorname{Pr}[x_{k}=x,\, u_{k}=\bar{u}] = \sum_{\bar{u}\in\mathcal{U}} \pi(x,\bar{u},k), \tag{21}$$

$$\pi(x,u,k) = \operatorname{Pr}[u_{k}=u \mid x_{k}=x]\,\operatorname{Pr}[x_{k}=x] \tag{22}$$

Taxonomy

Topics: Machine Learning and Algorithms · Reinforcement Learning in Robotics · Advanced Multi-Objective Optimization Algorithms

Full text

Computing monotone policies for Markov decision processes: a nearly-isotonic penalty approach

Robert Mattila

Cristian R. Rojas

Vikram Krishnamurthy

Bo Wahlberg

Department of Automatic Control, School of Electrical Engineering, KTH Royal Institute of Technology. Stockholm, Sweden.

(e-mails: [email protected], [email protected], [email protected]).

Department of Electrical and Computer Engineering, Cornell University. Cornell Tech, NY, USA. (e-mail: [email protected])

Abstract

This paper discusses algorithms for solving Markov decision processes (MDPs) that have monotone optimal policies. We propose a two-stage alternating convex optimization scheme that can accelerate the search for an optimal policy by exploiting the monotone property. The first stage is a linear program formulated in terms of the joint state-action probabilities. The second stage is a regularized problem formulated in terms of the conditional probabilities of actions given states. The regularization uses techniques from nearly-isotonic regression. While a variety of iterative methods can be used in the first formulation of the problem, we show in numerical simulations that, in particular, the alternating direction method of multipliers (ADMM) can be significantly accelerated using the regularization step.

Keywords: stochastic control, Markov decision process (MDP), $l_{1}$-regularization, sparsity, monotone policy, alternating direction method of multipliers (ADMM), isotonic regression

This work has been accepted for presentation at the 20th World Congress of the International Federation of Automatic Control, 9-14 July 2017. This work was partially supported by the Swedish Research Council under contract 2016-06079 and the Linnaeus Center ACCESS at KTH.

1 Introduction

Supermodularity conditions that ensure that a Markov decision process (MDP) has a monotone optimal policy have been studied widely (see, e.g., Puterman (1994) or Krishnamurthy (2016), and references therein). In particular, such monotone policies provide a sparse characterization of the optimal policy when the action space is small and the state-space is large. Computing the optimal policy for such an MDP is computationally demanding; for example, the value iteration algorithm involves $\mathcal{O}(X^{2}U)$ computations per iteration, where $X$ and $U$ are the number of states and actions, respectively. We aim to exploit structural properties of this type of MDP to reduce the computational burden and hence accelerate the search for an optimal policy.

It was noted in Krishnamurthy et al. (2013) that a monotone policy has a piecewise constant structure. In the case of infinite-horizon MDPs, it was shown how the search for a stationary optimal policy could be significantly accelerated by means of techniques from sparse estimation (in particular, LASSO techniques – see, e.g., Hastie et al. (2013)). In this paper, we build upon this work but with two important differences: firstly, we generalize to finite-horizon MDPs, for which it is even more important to exploit sparsity to reduce the computational cost since they have non-stationary policies, and secondly, we exploit the monotonicity explicitly, instead of only implicitly through the piecewise constant property of monotone policies.

The search for an optimal policy can be formulated as a linear program (LP). This LP can be solved in various iterative ways. The key idea in this paper is to accelerate the iterative search by using the recently proposed nearly-isotonic regression technique by Tibshirani et al. (2011). Roughly, a penalty is attached to non-monotone iterates by means of adding an $l_{1}$ rectifier-like regularizer to the cost function:

$$\max\{0,\, f(x) - f(x+1)\}, \tag{1}$$

where $x$ is the iterate and $f$ is a monotone function of $x$.

This promotes monotonicity in the iterates because, intuitively, the regularization term makes the cost surface steeper in the direction of monotone policies, which is where the optimum is located. Unfortunately, including this regularizer in the LP is not straightforward and, moreover, it yields a non-linear, possibly non-convex, problem. We perform a relaxation of the regularized problem to obtain a convex problem. Our method allows us to swap between the original LP formulation and the regularized formulation, and thus take advantage of both by alternating between which one is used to update the iterate: the original LP formulation guarantees convergence to the global optimum and the regularized convex formulation accelerates the search by exploiting monotonicity.

The main contributions of this paper are three-fold:

•

We show in numerical simulations that the number of iterations needed to converge can be vastly reduced using the regularization. This can yield a significant speed-up for large-scale systems where each iteration is time-consuming to calculate.

•

The benefits are shown to be larger when a tuning parameter in the algorithm used to iteratively solve the LP is not chosen optimally. Since the optimal choice of this parameter is not known a priori, it provides robustness to the search algorithm.

•

Even though we generalize to finite-horizon MDPs, the method is directly applicable to infinite-horizon MDPs, and the regularizer is stronger (monotone) than the one in Krishnamurthy et al. (2013) (piecewise constant). It can as such be seen as a direct improvement.

The outline of the paper is as follows. We present preliminaries related to MDPs in Section 2, and then proceed with a discussion of the objective as well as related work in Section 3. Section 4 presents the algorithm. Conditions for an MDP to have monotone structure, as well as real-world examples, are presented along with numerical simulations in Section 5. The paper is then concluded with a brief summary and indications for future work in Section 6.

2 Preliminaries

We let $\text{I}\{\cdot\}$ denote the indicator function. For a matrix $A$, define $A(i,:)$ to be the $i$th row and $A(:,j)$ to be the $j$th column. We use the corresponding slicing notation for higher-order arrays. Let $\{x\}_{+}=\max\{0,x\}$ denote the positive part of a number $x$. In this paper, we use the words monotone, decreasing and increasing in the weak sense, e.g., increasing means non-decreasing. The $l_{\infty}$-norm of a vector $v$ is $\|v\|_{\infty}=\max_{k}|v_{k}|$, and $\|v\|_{2}$ denotes the standard Euclidean norm.

2.1 Markov decision processes

Let $k=0,1,\dots,N$ denote discrete time. A Markov decision process (MDP) is a controlled Markov chain with state-space $\mathcal{X}=\{1,2,\dots,X\}$ and state $x_{k}\in\mathcal{X}$ at time $k$. It is controlled in the sense that the transition matrices

$$P_{ij}(u,k) = \operatorname{Pr}[x_{k+1} = j \mid x_{k} = i,\, u_{k} = u], \tag{2}$$

are functions of time $k$ and action $u\in\mathcal{U}=\{1,2,\dots,U\}$. Associated with every state $i$, action $u$, and time $k$ is an immediate cost $c(i,u,k)$. We consider a time horizon of length $N$ and assume that the terminal cost $c(i,u,N)=c_{N}(i)$ is independent of action. The aim of the MDP is to find a policy $\bm{\mu}=\{\mu_{0},\mu_{1},\dots,\mu_{N-1}\}$, where each $\mu_{k}$ is a mapping from the state-space to a (possibly degenerate) probability distribution over the action set. In particular, the sought policy is an optimal policy, i.e., one such that

$$\bm{\mu}^{*} = \operatorname*{arg\,min}_{\bm{\mu}} J_{\bm{\mu}}(x), \tag{3}$$

for all initial states $x$, where

$$J_{\bm{\mu}}(x) = \mathbb{E}\Bigg\{\sum_{k=0}^{N-1} c(x_{k},u_{k},k) + c_{N}(x_{N}) \,\Big|\, x_{0} = x\Bigg\} \tag{4}$$

is the finite-horizon objective (expected cumulative cost incurred by $\bm{\mu}$), and $u_{k}$ is distributed according to $\mu_{k}(x_{k})$.

A policy is said to be deterministic at time $k$ if the probability distribution induced by $\mu_{k}$ on the action space for each state $x$ is degenerate, i.e., the probability mass is concentrated on one action. A policy is said to be monotone at time $k$ if the function

$$\tilde{\mu}_{k}(x) = \mathbb{E}\big\{u_{k} \mid x_{k} = x\big\} \tag{5}$$

is monotone in $x$. (Footnote 1: In this paper, we consider only monotonically increasing policies, and will use the words monotone and increasing interchangeably.) Note that this reduces to the standard definition when considering deterministic policies. An MDP is said to have a monotone optimal policy if there is an optimal policy that is monotone for each time $k$.

Motivated by problems in telecommunications and safety critical planning, see, e.g., Altman (1999), Krishnamurthy (2016) and El Chamie et al. (2016), we allow for average-type constraints in the problem:

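$$\mathbb{E}\Big\{\sum_{k=0}^{N}\beta_{l}(x_{k},u_{k},k)\Big\} \leq \gamma_{l} \quad \text{for } l = 1,\dots,L, \tag{6}$$

where the $L$ functions $\beta_{l}(x,u,k)$ and thresholds $\gamma_{l}$ are given. We refer to solving (3), subject to the constraints (6), as the constrained case when $L>0$.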

The search for an optimal policy, i.e., problem (3) (with or without the constraints (6)), can be approached in different ways – see, e.g., Puterman (1994) or Krishnamurthy (2016). One possibility is to formulate the optimality conditions as a linear program (LP). This has the benefit of facilitating sensitivity analysis of the obtained solution, and also facilitating the inclusion of constraints, such as (6), in the problem.

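Assume $x_{0}$ to be the initial state of the MDP. Then an optimal policy can be found using the following LP (see, e.g., Feinberg et al. (2002, Chapter 12) for details):

$$\begin{aligned}
\min_{\pi \in \mathbb{R}^{XU(N+1)}} \quad & \sum_{x\in\mathcal{X}}\sum_{u\in\mathcal{U}} \Big\{ \sum_{k=0}^{N-1} c(x,u,k)\,\pi(x,u,k) + c_{N}(x)\,\pi(x,u,N) \Big\} \\
\text{s.t.} \quad & \sum_{u\in\mathcal{U}} \pi(x,u,0) = \text{I}\{x = x_{0}\} \quad \text{for } x \in \mathcal{X}, \\
& \sum_{u\in\mathcal{U}} \pi(j,u,k) = \sum_{i\in\mathcal{X}}\sum_{u\in\mathcal{U}} P_{ij}(u,k)\,\pi(i,u,k-1) \quad \text{for } j\in\mathcal{X},\; k = 1,2,\dots,N, \\
& \pi(x,u,k) \geq 0 \quad \text{for } x\in\mathcal{X},\; u\in\mathcal{U},\; k=0,1,\dots,N, \\
& \sum_{x\in\mathcal{X}}\sum_{u\in\mathcal{U}}\sum_{k=0}^{N} \pi(x,u,k)\,\beta_{l}(x,u,k) \leq \gamma_{l} \quad \text{for } l = 1,2,\dots,L.
\end{aligned} \tag{7}$$

(Footnote 2: Although in classical textbooks, infinite-horizon MDPs are solved via linear programming, it is straightforward to formulate the solution of a finite-horizon MDP as an LP.)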

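In this formulation, $\pi$ is an occupation measure, namely:

$$\pi(x,u,k) = \operatorname{Pr}[x_{k} = x,\, u_{k} = u]. \tag{8}$$

The associated policy $\bm{\mu}^{*}$ is

$$u_{k}^{*}(x) = u \quad \text{with probability } \theta(x,u,k), \tag{9}$$

where the conditional probabilities $\theta(x,u,k)=\operatorname{Pr}[u_{k}=u \mid x_{k}=x]$ can be calculated as

$$\theta(x,u,k) = \frac{\pi(x,u,k)}{\sum_{\bar{u}\in\mathcal{U}} \pi(x,\bar{u},k)}. \tag{10}$$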

Remark 1

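It should be noted that in terms of these variables, we can re-write the function defining monotonicity, i.e., equation (5), as

$$\tilde{\mu}_{k}(x) = \sum_{u=1}^{U} u\,\theta(x,u,k) = [1\;\; 2\;\; \dots\;\; U]\;\theta(x,:,k). \tag{11}$$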
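As a hedged illustration of relations (10) and (11), the following Python sketch recovers the conditionals and the expected action from an occupation measure stored as a NumPy array of shape `(X, U, N+1)`. The array layout, the function names and the uniform fallback for states with zero probability are assumptions of this sketch, not part of the original text.

```python
import numpy as np

def conditionals_from_occupation(pi, eps=1e-12):
    """theta(x,u,k) = pi(x,u,k) / sum_ubar pi(x,ubar,k), cf. (10).

    pi has shape (X, U, N+1). States whose probability is (numerically)
    zero get a uniform conditional; that fallback is an assumption here.
    """
    p = pi.sum(axis=1, keepdims=True)            # state marginal, shape (X, 1, N+1)
    uniform = np.full_like(pi, 1.0 / pi.shape[1])
    safe_p = np.where(p > eps, p, 1.0)           # avoid division by zero
    return np.where(p > eps, pi / safe_p, uniform)

def expected_action(theta):
    """mu_tilde_k(x) = sum_u u * theta(x,u,k), actions labelled 1..U, cf. (11)."""
    actions = np.arange(1, theta.shape[1] + 1)
    return np.einsum("u,xuk->xk", actions, theta)

# Two states, two actions, a single time step.
pi = np.array([[[0.2], [0.2]],
               [[0.0], [0.6]]])
theta = conditionals_from_occupation(pi)
mu_tilde = expected_action(theta)                # shape (X, N+1)
```

A policy is monotone at time $k$ in the sense of (5) when the column `mu_tilde[:, k]` is non-decreasing.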

3 Problem Formulation and Related Work

If it is a priori known that an MDP has an optimal policy $\bm{\mu}^{*}$ that is monotone – see Section 5 for examples where this holds – then the question we aim to answer in this paper is: how can we efficiently exploit the structure to find $\bm{\mu}^{*}$?

The problem is of most interest when considering large-scale MDPs. We first note that a direct search for an optimal policy over the space of monotone policies (which is vastly smaller than the complete policy space) is, in the case of infinite-horizon MDPs, a combinatorial search over ${X+U-1\choose U-1}$ stationary policies. In the finite-horizon case, this increases to ${X+U-1\choose U-1}^{N}$ non-stationary policies. This quickly becomes prohibitively large.
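To get a feel for these numbers, the count can be evaluated directly. The sketch below is only an illustration; the function name is hypothetical, and the sizes $X=10$, $U=3$ are the ones used for the synthetic example in Section 5.2.

```python
from math import comb

def num_monotone_policies(X, U, N=1):
    """Number of monotone deterministic policies: C(X+U-1, U-1) per time step,
    raised to the power N for a non-stationary policy over N time steps."""
    return comb(X + U - 1, U - 1) ** N

per_step = num_monotone_policies(10, 3)       # 66 monotone decision rules
unrestricted = 3 ** 10                        # 59049 arbitrary deterministic rules
finite_horizon = num_monotone_policies(10, 3, N=365)
```

Even though 66 is far smaller than 59049, the finite-horizon count `finite_horizon` has several hundred digits, which is why an exhaustive search is not an option.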

Work on large-scale MDPs that does not explicitly take monotonicity into account includes approximate dynamic programming (ADP), where, e.g., the optimal value function $J_{\bm{\mu}^{*}}$ is approximated by a linear expansion in terms of some basis functions, and the related neuro-dynamic programming. See, e.g., Bertsekas (2007), de Farias and Van Roy (2003), and Bertsekas and Tsitsiklis (1995).

In the recent work by Fu et al. (2015), block-splitting methods (that are based on the alternating direction method of multipliers, ADMM, which we also use in this paper) are employed to solve large-scale MDPs by means of decomposing the problem into sub-problems that can be solved in a distributed fashion. We believe that their work could propitiously be used in conjunction with the work presented in this paper (for monotone MDPs).

Jiang and Powell (2015) provide an extensive review of real-world applications of MDPs with monotone value functions, along with a method based on ADP that exploits the monotonicity of the value function. Our method, in comparison, promotes the monotonicity directly in the policy space.

In Ngo and Krishnamurthy (2010), and see also Krishnamurthy (2016), it was proposed that the problem of finding an optimal monotone policy can be relaxed by approximating the optimal policy by a continuous representation based on sigmoidal functions. The search for an optimal policy can then be approached using simulation based stochastic optimization.

The most closely related work is Krishnamurthy et al. (2013). There, it was proposed how the monotonicity of an optimal policy, in the infinite-horizon case, can be exploited. Since the action set is finite, the number of jumps that the policy can make (as a function of state) is limited to at most $U-1$. This implies that the policy is sparse in the number of jumps. It is natural to exploit this structure by using methods from sparse estimation. In particular, the (fused) group LASSO by Yuan and Lin (2006) was employed. This, however, promotes only a piecewise constant structure in the policy – not necessarily monotonicity, which is explicitly promoted in this work. Also, the finite-horizon setup that is considered here results in a much larger problem since the policy is non-stationary and hence it has more decision variables (a factor $N$) in the corresponding LP.

4 Isotonic Regularization for Monotone MDPs

As mentioned above, the key point is that for MDPs with large state space and small action space, a monotone policy is sparse. We use an iterative optimization algorithm to solve problem (7) by exploiting sparsity. Assuming that we know that there exists an optimal policy that is monotone (see Section 5 for conditions and examples), we employ the idea from Tibshirani et al. (2011), but in a regularization setting.

The key idea is to add a rectified $l_{1}$-penalty of the form

$$\sum_{x=1}^{X-1}\big\{\tilde{\mu}_{k}(x)-\tilde{\mu}_{k}(x+1)\big\}_{+} \tag{12}$$

to the cost in the optimization problem, since the function $\tilde{\mu}_{k}(x)$ is assumed to be (monotonically) increasing in $x$ at the optimum. Intuitively, this makes the cost surface steeper in the direction of monotone policies, resulting in faster convergence of the iterative optimization algorithm.
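A small Python sketch may make the penalty (12) concrete: it is zero for an increasing sequence and grows with the size of each downward step. The function name and the array layout (states along the first axis) are assumptions of this sketch.

```python
import numpy as np

def nearly_isotonic_penalty(mu_tilde):
    """sum_{x=1}^{X-1} { mu(x) - mu(x+1) }_+ along the state axis (axis 0).

    mu_tilde may be a vector of length X (one time step) or an array of
    shape (X, N+1), in which case one penalty per time step is returned.
    """
    drops = mu_tilde[:-1] - mu_tilde[1:]       # positive where the sequence decreases
    return np.maximum(drops, 0.0).sum(axis=0)

monotone = np.array([1.0, 1.0, 2.0, 3.0])
violating = np.array([1.0, 3.0, 2.0, 3.0])
penalty_monotone = nearly_isotonic_penalty(monotone)     # 0.0
penalty_violating = nearly_isotonic_penalty(violating)   # 1.0, from the drop 3 -> 2
```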

However, there are difficulties performing this regularization. The main difficulty is that it is not possible to directly add the term (12) in the original LP (7). Adding it involves a change of variables ((10)-(11)) that turns the problem into a non-linear, and possibly non-convex, problem. Our approach is an alternating optimization scheme, where we switch between updating the iterate on the globally convergent LP formulation and the regularized problem, and hence, exploit the benefits of both formulations. We will show below the details of these two steps and how it is possible to alternate between the two formulations.

4.1 Linear program update

There are several ways to iteratively solve an LP such as (7), see, e.g., Luenberger and Ye (2008). The accelerating regularization technique that we demonstrate in this paper is applicable to any iterative method where the iterates are not restricted to the vertices of the feasible domain. The goal of the regularization is to decrease the number of iterations needed until the iterates converge to an optimal monotone solution.

This motivates our choice to use the alternating direction method of multipliers (ADMM), see Boyd et al. (2011), which is a popular method to solve large-scale optimization problems. Second-order methods (such as interior-point methods) often converge in few iterations to very high accuracy. However, for very large problems, even a single iteration of an interior point method might be computationally infeasible. In comparison, first-order methods and ADMM converge using a higher number of cheap iterations.

The ADMM update equations for LPs have been derived in Boyd et al. (2011). To utilize these, we first put the problem on standard LP form. It is straight-forward to re-write problem (7) using matrix-vector notation as

[TABLE]

where $\alpha$ is a vectorized version of the decision variable $\pi$, and $q$, $A$ and $b$ follow from the cost and the constraints. In terms of these variables, an ADMM update (from iteration $n$ to $n+1$) is obtained by first solving the set of linear equations

[TABLE]

where $\rho>0$ is the tuning parameter of ADMM, and then updating the dual variables as

[TABLE]

Remark 2

Another reason for using ADMM is apparent here: it has only one tuning parameter, namely $\rho$. ADMM is moreover very forgiving with respect to this parameter: under mild conditions, it converges for any choice of $\rho$, although the performance may vary. This is explored in the numerical examples in Section 5.

In terms of the ADMM variables, the primal residual of the LP is

$$r^{(n)} = \alpha^{(n)} - z^{(n)}. \tag{17}$$

This is a measure of how feasible the current iterate is.
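The ADMM iteration of this section can be sketched in Python for a generic standard-form LP, $\min q^{T}\alpha$ subject to $A\alpha=b$, $\alpha\geq 0$, following the scaled-form ADMM updates for LPs in Boyd et al. (2011). The standard form, the orientation of $A$ (shape `(m, n)`, full row rank), the zero initialization and the function name are assumptions of this sketch; it is not the authors' implementation.

```python
import numpy as np

def admm_lp(q, A, b, rho=1.0, iters=2000):
    """ADMM for  min q^T alpha  s.t.  A alpha = b,  alpha >= 0.

    Returns the last alpha, the non-negative copy z and the
    l_inf-norm of the primal residual r = alpha - z.
    """
    m, n = A.shape
    # KKT matrix of the equality-constrained quadratic alpha-update.
    K = np.block([[rho * np.eye(n), A.T],
                  [A, np.zeros((m, m))]])
    z = np.zeros(n)
    eta = np.zeros(n)
    for _ in range(iters):
        rhs = np.concatenate([rho * (z - eta) - q, b])
        alpha = np.linalg.solve(K, rhs)[:n]    # alpha-update (nu is discarded)
        z = np.maximum(alpha + eta, 0.0)       # projection onto alpha >= 0
        eta = eta + alpha - z                  # scaled dual update
    return alpha, z, np.max(np.abs(alpha - z))

# Toy LP: put all probability mass on the cheapest of three options.
alpha, z, residual = admm_lp(np.array([1.0, 2.0, 3.0]),
                             np.array([[1.0, 1.0, 1.0]]),
                             np.array([1.0]))
```

Since the KKT matrix does not change between iterations, a practical implementation would factorize it once and reuse the factorization; the repeated `solve` call is kept here only for brevity.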

4.2 Isotonic regularization

Recall from Section 2.1 that when an MDP has an optimal monotone policy, the scalar function from equation (5),

$$\tilde{\mu}_{k}(x) = \sum_{u=1}^{U} u\,\theta(x,u,k) = [1\;\; 2\;\; \dots\;\; U]\;\theta(x,:,k), \tag{18}$$

is monotone (increasing) in state $x$ for each time $k$ (if $\theta$ corresponds to an optimal policy).

A natural choice of regularization to include in the problem is thus the following penalty introduced in a general regression setting in Tibshirani et al. (2011):

[TABLE]

where $\lambda$ is the regularization weight. This adds a penalty whenever the iterate, i.e., the policy, is not monotone.

The main problem is that this regularization term is naturally formulated in terms of the conditional probabilities $\theta(x,u,k)$, rather than the joint occupation probabilities $\pi(x,u,k)$, in which the LP (7) is formulated. To deal with this, we reformulate problem (7) as an equivalent problem using the marginalized state probabilities and the conditionals $\theta(x,u,k)$. The key is to derive a way to swap between these two formulations.

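To do this, introduce

$$p(x,k) = \operatorname{Pr}[x_{k} = x] \tag{20}$$

as the state distribution at time $k$. The relations we need to be able to change formulation are equation (10), and the following two relations:

$$p(x,k) = \sum_{\bar{u}\in\mathcal{U}} \operatorname{Pr}[x_{k}=x,\, u_{k}=\bar{u}] = \sum_{\bar{u}\in\mathcal{U}} \pi(x,\bar{u},k), \tag{21}$$

and

$$\pi(x,u,k) = \operatorname{Pr}[u_{k}=u \mid x_{k}=x]\,\operatorname{Pr}[x_{k}=x] = \theta(x,u,k)\,p(x,k). \tag{22}$$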
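The change of variables between the joint formulation and the pair $(p,\theta)$ can be checked numerically. The following Python sketch, with hypothetical function names and a randomly generated occupation measure of shape `(X, U, N+1)`, marginalizes as in (21), conditions as in (10) and rebuilds the joint probabilities as the product of conditional and marginal.

```python
import numpy as np

def state_marginal(pi):
    """p(x,k) = sum_ubar pi(x,ubar,k); pi has shape (X, U, N+1)."""
    return pi.sum(axis=1)

def occupation_from(theta, p):
    """pi(x,u,k) = theta(x,u,k) * p(x,k)."""
    return theta * p[:, None, :]

rng = np.random.default_rng(0)
pi = rng.random((4, 3, 5))
pi /= pi.sum(axis=(0, 1), keepdims=True)   # each time slice is a joint distribution

p = state_marginal(pi)
theta = pi / p[:, None, :]                 # relation (10); valid because p > 0 here
pi_back = occupation_from(theta, p)
```

The division is only well defined for states with $p(x,k)>0$; when a state is very unlikely the conditional becomes numerically ill-conditioned, so a practical implementation needs a safeguard for that case.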

In terms of $p$ and $\theta$ , problem (7) with an included regularization term (19) reads

[TABLE]

In order to simplify the regularization update step, we i) assume $p$ to be fixed, ii) drop the redundant constraints, and iii) relax the constraints related to the initial distribution, the state transitions and the average-type constraints (6). This allows for more flexibility in the regularized update – note that these will anyway be enforced later in the original LP formulation. This yields the relaxed problem

[TABLE]

4.3 Regularized subgradient step

Again, the idea is that the regularization will promote monotonicity by increasing the slope in the direction of monotone policies (where the optimum is located). However, the simplifications done to arrive at problem (24) probably shift the minimum of the optimization problem away from the original minimum in problem (7). For this reason, we need to return to the original (globally convergent) formulation and have the effect of the regularization step diminish over time.

Therefore, and due to the non-smooth objective function, we employ the subgradient method, see Nesterov (2004). The nominal problem for the subgradient method is

[TABLE]

where $f$ is a cost function, $Q$ is a convex set and $\bar{f}$ is an inequality constraint function. In our case, compare with problem (24), we have that the decision variables $\beta$ are the conditionals $\theta$, $f$ is the regularized cost function, $Q$ are simplices for slices of $\beta$, and $\bar{f}$ is a negative equality mapping.

Denote a subgradient of the cost function $f$ as $g$ and a subgradient of the inequality constraint function $\bar{f}$ as $\bar{g}$. The subgradient method consists of the following two steps. At iteration $n$:

1. Compute $f(\beta^{(n)})$, $g(\beta^{(n)})$, $\bar{f}(\beta^{(n)})$ and $\bar{g}(\beta^{(n)})$ and set

[TABLE]

2. Set

[TABLE]

where $\pi_{Q}$ is the Euclidean projection on $Q$, and $R$ is an upper bound on the diameter of the set $Q$: $\|\beta_{1}-\beta_{2}\|_{2}\leq R,\;\forall\beta_{1},\beta_{2}\in Q$. Note that the step-size is decreasing in time, and hence so is the effect of the regularization, exactly as we wanted.
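Because $Q$ is a product of simplices, the Euclidean projection $\pi_{Q}$ separates into one simplex projection per slice $\theta(x,:,k)$. The Python sketch below uses the well-known sort-based simplex projection and a normalized subgradient step; the function names, the array layout `(X, U, N+1)` and the way the step size is passed in are assumptions of this sketch, and the switching rule between $g$ and $\bar{g}$ is deliberately left out.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of a vector v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]                       # entries in decreasing order
    css = np.cumsum(u) - 1.0
    k = np.arange(1, v.size + 1)
    rho = k[u - css / k > 0][-1]               # number of positive entries
    tau = css[rho - 1] / rho                   # common shift
    return np.maximum(v - tau, 0.0)

def projected_subgradient_step(theta, g, step):
    """Move along the normalized subgradient g, then project every
    slice theta[x, :, k] back onto the probability simplex."""
    norm = np.linalg.norm(g)
    if norm == 0.0:
        return theta
    moved = theta - step * g / norm
    out = np.empty_like(moved)
    X, _, K = moved.shape
    for x in range(X):
        for k in range(K):
            out[x, :, k] = project_simplex(moved[x, :, k])
    return out

theta = np.full((2, 3, 4), 1.0 / 3.0)
g = np.zeros_like(theta)
g[:, 0, :] = 1.0                               # pretend action 1 is locally costly
theta_new = projected_subgradient_step(theta, g, step=0.1)
```

In the alternating scheme, `step` would be taken from a decreasing sequence tied to the diameter bound $R$, for example a multiple of $1/\sqrt{n}$; the exact rule is an assumption left to the caller here.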

Remark 3

In terms of the variables of our problem, an upper bound $R$ can be found explicitly, since

[TABLE]

for all $\theta_{1}$ and $\theta_{2}$ fulfilling the simplex constraint (for each fixed pair of $x$ and $k$). We thus take $R=\sqrt{2X(N+1)}$.
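One way to see where this value comes from is the following short derivation, given here as a sketch that need not match the original argument step by step. For two points $a$ and $b$ on the same probability simplex, every entry of $a-b$ lies in $[-1,1]$ and $\|a-b\|_{1}\leq\|a\|_{1}+\|b\|_{1}=2$, so

$$\|a-b\|_{2}^{2} \;\leq\; \|a-b\|_{\infty}\,\|a-b\|_{1} \;\leq\; 1\cdot 2 \;=\; 2.$$

Summing over the $X(N+1)$ slices indexed by $(x,k)$ gives

$$\|\theta_{1}-\theta_{2}\|_{2}^{2} \;=\; \sum_{x=1}^{X}\sum_{k=0}^{N}\big\|\theta_{1}(x,:,k)-\theta_{2}(x,:,k)\big\|_{2}^{2} \;\leq\; 2X(N+1),$$

and taking the square root yields the stated $R$. The bound is attained when, in every slice, $\theta_{1}$ and $\theta_{2}$ put all mass on two different actions.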

For explicit expressions of the subgradients, see the calculations in Appendix A.

4.4 Summary of algorithm

The following scheme illustrates the algorithm:

[TABLE]

where LP is problem (7), the regularized non-linear problem (NLP) is problem (23) and the relaxed problem (RP) is problem (24). The algorithm first performs $i_{\text{ADMM}}$ ADMM updates on the LP using equations (14), (15) and (16). It then translates the problem to the regularized NLP, using relations (10) and (21), and relaxes it to obtain the RP. In this formulation, $i_{\text{SG}}$ subgradient steps are taken in the $\theta$ variable using equations (26) and (27). This could be interpreted as a sequential minimization (see Boyd and Vandenberghe, 2004, p. 133); however, instead of performing the subsequent minimization over $p$, we translate back to the original LP and repeat.
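The control flow of this scheme can be sketched as a small Python driver. All four callables are hypothetical placeholders for the ADMM update, the translation to $(p,\theta)$, the subgradient step and the translation back; how the ADMM variables are re-initialized after translating back is not specified in the text and is left to the `to_occupation` callable.

```python
def alternating_scheme(state, admm_update, to_conditionals, subgradient_update,
                       to_occupation, rounds, i_admm=10, i_sg=5):
    """Alternate i_admm ADMM updates on the LP with i_sg subgradient steps
    on the relaxed problem, for a fixed number of rounds."""
    sg_count = 0                                   # global counter for the step size
    for _ in range(rounds):
        for _ in range(i_admm):
            state = admm_update(state)             # LP formulation
        p, theta = to_conditionals(state)          # joint -> (marginal, conditional)
        for _ in range(i_sg):
            theta = subgradient_update(theta, p, sg_count)   # p is held fixed
            sg_count += 1
        state = to_occupation(state, p, theta)     # back to the LP variables
    return state

# Dummy callables that only record the order of operations.
log = []
def _admm(s): log.append("A"); return s
def _split(s): log.append("T"); return "p", "theta"
def _sg(theta, p, n): log.append(("S", n)); return theta
def _merge(s, p, theta): log.append("M"); return s

alternating_scheme("s0", _admm, _split, _sg, _merge, rounds=2, i_admm=3, i_sg=2)
```

The subgradient counter runs across rounds so that an iteration-dependent step size keeps shrinking, which is the mechanism the text relies on for the regularization to fade out.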

The convergence of the algorithm is guaranteed by the following theorem.

Theorem 1

The iterates obtained using the algorithm (29) will converge to an optimal policy.

Proof (outline): Instead of providing a formal proof of the theorem, we give the following heuristic argument. The LP is globally convergent and the effect of the subgradient steps in the RP is diminishing over time (due to the iteration dependent step-size). Hence, after a certain number of iterations, the effect of the subgradient steps will be negligible and the ADMM steps on the LP will converge due to guarantees on convergence for ADMM (see Boyd et al. (2011)). $\square$

It should be noted that after a certain point in time, the subgradient updates will be pure delays in the algorithm (since the step-size is essentially zero). Hence, it could be motivated to switch to using plain ADMM after a pre-defined number of iterations and only use the proposed method as an initial boost. This would reduce the proof to simply convergence of plain ADMM on an LP.

5 Examples

In this section, we present conditions and several examples of MDPs that have monotone optimal policies. We also provide numerical simulations illustrating the performance of the proposed algorithm.

5.1 Markov decision processes with monotone policies

We start by stating formal conditions which guarantee the existence of a monotone optimal policy. The following result and four assumptions are well-known, see, e.g., Krishnamurthy (2016) or Puterman (1994):

(A1) Costs $c(x,u,k)$ are decreasing in $x$. The terminal cost $c_{N}(x)$ is decreasing in $x$.

(A2) $P_{i}(u,k)\leq_{s}P_{i+1}(u,k)$ for each $i$ and $u$. Here $P_{i}(u,k)$ is the $i$th row of the transition matrix for action $u$ at time $k$ and $\leq_{s}$ denotes first order stochastic dominance, that is, $\sum_{l=j}^{X}P_{il}(u,k)\leq\sum_{l=j}^{X}P_{i+1,l}(u,k)$ for all $j\in\mathcal{X}$.

(A3) $c(x,u,k)$ is submodular in $(x,u)$ at each time $k$. That is, $c(x,u+1,k)-c(x,u,k)$ is decreasing in $x$.

(A4) $P_{ij}(u,k)$ is tail-sum supermodular in $(i,u)$, i.e., $\sum_{j\geq l}(P_{ij}(u+1,k)-P_{ij}(u,k))$ is increasing in $i$.

Note that these four conditions are easily checked. If they are satisfied, then the following structural result holds:

Theorem 2

Assume that an unconstrained finite-horizon MDP satisfies conditions (A1-4). Then there exists a monotone optimal policy.

Even though the assumptions (A1-4) might sound restrictive at first sight, a large class of real-world problems satisfies them. This is because they are often fulfilled in problems where a degradation takes place over time. To get some intuition of when the assumptions might hold, we provide the following simple, but representative, machine replacement example.

Let $\mathcal{X}=\{1,2\}$ represent the two states of a machine: 1 - broken, 2 - working. Let $\mathcal{U}=\{1,2\}$ be the two actions: 1 - replace, 2 - continue operation. Let $\theta$ be the probability of a working machine breaking down. The transition probability matrices are hence:

[TABLE]

Let $R\geq 0$ be the cost of performing a replacement (regardless of the state of the machine) and $\gamma\geq 0$ be the cost of not being able to utilize the machine because it is broken. Define the costs as

[TABLE]

It is easily checked that this system fulfills conditions (A1-4). An optimal policy corresponds to the optimal choices of when to replace the machine, depending on the current time and its current state, so as to maximize the profits of the operator.
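The check can also be automated. The Python sketch below tests (A1)-(A4) for a time-invariant model; the concrete matrices `P` and costs `c`, `cN` are one plausible reading of the verbal description of the machine replacement example (a replacement always yields a working machine, a broken machine stays broken if operation continues), chosen for illustration and not copied from the original equations. Indices are zero-based, so state 1 (broken) is row 0 and action 1 (replace) is index 0.

```python
import numpy as np

def tail_sums(P):
    """tail[i, l] = sum_{j >= l} P[i, j] for a row-stochastic matrix P."""
    return np.cumsum(P[:, ::-1], axis=1)[:, ::-1]

def satisfies_A1_to_A4(P, c, cN, tol=1e-12):
    """P: (U, X, X) transition matrices, c: (X, U) costs, cN: (X,) terminal cost."""
    T = np.stack([tail_sums(Pu) for Pu in P])                  # shape (U, X, X)
    a1 = np.all(np.diff(c, axis=0) <= tol) and np.all(np.diff(cN) <= tol)
    a2 = np.all(np.diff(T, axis=1) >= -tol)                    # row i+1 dominates row i
    a3 = np.all(np.diff(np.diff(c, axis=1), axis=0) <= tol)    # submodular costs
    a4 = np.all(np.diff(np.diff(T, axis=0), axis=1) >= -tol)   # tail-sum supermodular
    return bool(a1 and a2 and a3 and a4)

breakdown, R, gamma = 0.1, 1.0, 2.0            # hypothetical parameter values
P = np.array([[[0.0, 1.0], [0.0, 1.0]],                     # action 1: replace
              [[1.0, 0.0], [breakdown, 1.0 - breakdown]]])  # action 2: continue
c = np.array([[R, gamma],                      # state 1 (broken)
              [R, 0.0]])                       # state 2 (working)
cN = np.array([gamma, 0.0])

ok = satisfies_A1_to_A4(P, c, cN)
```

Reversing the state order of the costs, for instance, violates (A1) and makes the check fail.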

This model can be generalized to larger and more complex systems (e.g., with time-varying parameters). A recent example of this is medical treatment planning of abdominal aortic aneurysms, see Mattila et al. (2016), where the parameters are time-varying and the optimal policy is monotone.

A number of real-world examples of monotone MDPs (e.g., inventory models, queueing control, price determination and equipment replacement) can be found in Puterman (1994). Krishnamurthy (2016) provides several examples, including the constrained case, of, e.g., transmission scheduling over wireless channels. Jiang and Powell (2015) contains an extensive overview of applications in operations research, energy, healthcare, finance and economics, that have a monotone structure.

5.2 Numerical performance

To illustrate the performance of the proposed method, we generated a synthetic MDP of dimensions $X=10$ and $U=3$ by randomly sampling a system from the systems that fulfill assumptions (A1-4). The time-horizon in the MDP was set to $N=365$. We will first discuss the rationale for our numerical choices of the four tuning parameters: $\lambda$, $\rho$, $i_{\text{ADMM}}$ and $i_{\text{SG}}$.

First, the regularization parameter $\lambda$ was chosen so as to approximately balance the regularization term with the current cost. In particular, it was chosen as the time-horizon times the mean (in time, state and action) of the cost function, i.e.,

[TABLE]

We run $i_{\text{ADMM}}=10$ ADMM iterations and $i_{\text{SG}}=5$ subgradient steps. Note that we cannot choose too large a value of $i_{\text{SG}}$ since $p$ is assumed to be constant in the relaxed problem (24).

It is a priori difficult to know what the optimal value of $\rho$ is. To explore the influence of $\rho$ on the problem, we solved the problem for a range of values between 0.1 and 100 – see Table 1 and the discussion below. Note that this is not a feasible approach in a real problem since one does not want to re-solve the problem for every candidate value. The optimal $\rho$ appears to be in the lower region of the scale; in practice, however, one might well end up picking a larger value.

The typical performance of the proposed algorithm (dashed red), as well as plain ADMM (solid blue), can be seen in Fig. 1. A slightly higher value ($\rho=30$) than the optimal was chosen for $\rho$. The cost-plot (Fig. 1(a)) shows the difference in expected cost incurred using the policy at each iteration compared to using the optimal policy. The residual-plot (Fig. 1(b)) shows the $l_{\infty}$-norm of the primal residual, which is an indication of how feasible the policy is in terms of the constraints (e.g., transitions and sum-to-one). Note that the primal residual is formulated in terms of the ADMM variables and is not calculated when the subgradient steps are performed. The areas with gray background indicate where ADMM updates are made, and the white areas indicate where subgradient steps are taken on the regularized problem.

From Fig. 1, it is clear that the proposed algorithm steers the iterates towards the optimum, as seen by the decreases in the cost function when the regularized problem is used. In early iterations, the iterates become more infeasible when changing back to ADMM due to the simplifications made to arrive at problem (24) – for example, assuming $p$ to be constant. At some point, switching between the two formulations can become problematic due to conditioning on highly unlikely events – cf. equation (10). A work-around is to switch back to pure ADMM after some fixed number of iterations and use the regularized problem only as an initial boost – this is explored in Appendix B.
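The alternation between the two formulations, including the initial-boost work-around, can be pictured as a simple iteration schedule. The sketch below only labels iterations; it does not implement ADMM or the subgradient step. It assumes that each round starts with the ADMM block, that subgradient steps count as iterations, and that the names `phase_schedule` and `boost_iters` are purely illustrative.

```python
def phase_schedule(total_iters, i_admm, i_sg, boost_iters=None):
    """Label each iteration as 'ADMM' or 'SG' (subgradient step).

    Blocks of i_admm ADMM iterations alternate with blocks of i_sg
    subgradient steps. If boost_iters is given, every iteration from
    that index onwards is plain ADMM (regularization as initial boost).
    """
    period = i_admm + i_sg
    schedule = []
    for n in range(total_iters):
        in_boost = boost_iters is None or n < boost_iters
        if in_boost and (n % period) >= i_admm:
            schedule.append("SG")
        else:
            schedule.append("ADMM")
    return schedule


# Block lengths from the experiment above; the cutoff of 30 is an assumed value.
schedule = phase_schedule(total_iters=40, i_admm=10, i_sg=5, boost_iters=30)
```

In this picture, the "ADMM" entries correspond to the gray regions of Fig. 1 and the "SG" entries to the white ones.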

It is seen from Fig. 1 that roughly half the number of iterations are needed using the proposed algorithm, compared to plain ADMM. A quantitative comparison is made in Table 1. There, both plain ADMM and the proposed algorithm were run for a fixed number of iterations. The iteration number after which predefined thresholds held in terms of both cost and feasibility was recorded. In cost, we required the relative error to be less than 1%, i.e., $\frac{|c^{(n)}-c^{*}|}{c^{*}}<\varepsilon_{c}$, where $c^{(n)}$ is the expected cost from an initial state using the policy at iteration $n$, $c^{*}$ is the expected cost using the optimal policy and $\varepsilon_{c}$ is the tolerance of 1%. In terms of feasibility, we put a threshold on the $l_{\infty}$-norm of the residual as $\|r^{(n)}\|_{\infty}<\varepsilon_{r}$, for a threshold value of $\varepsilon_{r}=10^{-4}$.
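One way to read this criterion is as the first iteration from which both tolerances hold and keep holding for the rest of the run. The sketch below implements that reading; the function name, the list-based inputs, and the assumption that a residual norm is available at every recorded iteration are illustrative choices rather than details from the paper.

```python
def first_settled_iteration(costs, residual_norms, optimal_cost,
                            eps_c=0.01, eps_r=1e-4):
    """Smallest n such that both tolerances hold at n and at all later iterations.

    costs[n] is the expected cost c^(n), residual_norms[n] the infinity-norm
    of the residual r^(n). Returns None if the run never settles.
    """
    settled = None
    for n, (c_n, r_n) in enumerate(zip(costs, residual_norms)):
        # abs() in the denominator guards against a negative optimal cost;
        # the criterion in the text divides by c* directly.
        cost_ok = abs(c_n - optimal_cost) / abs(optimal_cost) < eps_c
        feasible = r_n < eps_r
        if cost_ok and feasible:
            if settled is None:
                settled = n
        else:
            settled = None  # a later violation resets the count
    return settled


# Made-up traces: the tolerances are met at n = 1, violated at n = 2,
# and then met from n = 3 onwards.
costs = [2.0, 1.005, 1.5, 1.004, 1.003]
residuals = [1e-2, 5e-5, 5e-5, 5e-5, 2e-5]
n_settled = first_settled_iteration(costs, residuals, optimal_cost=1.0)
```

Resetting on a violation matters for the alternating scheme, where the iterates can temporarily become less feasible right after switching back to ADMM.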

It is apparent from Table 1 that the optimal value of $\rho$ is in the lower region of the scale, roughly between 5.0 and 10.0. When $\rho$ is chosen optimally, the regularization step does not appear to make much of a difference in terms of convergence speed (for $\rho=5.0$, three more iterations are needed to fulfill the criterion). However, if $\rho$ is chosen suboptimally, the proposed algorithm needs only between 59% and 32% (or less) of the iterations that plain ADMM needs to fulfill the convergence criterion.

The proposed algorithm converges in roughly the same number of iterations for relatively high values of $\rho$ as plain ADMM does for the optimal value of $\rho$: 82 iterations for $\rho=10.0$ (ADMM) versus 93 iterations for $\rho=60.0$ (proposed method). This indicates that the regularization provides robustness to the choice of the tuning parameter $\rho$. Since the optimal value of $\rho$ is unknown to start with, using the regularization allows for a (larger) suboptimal value to be chosen without much loss in performance.

6 Conclusions

This paper has presented a method to accelerate the search for an optimal monotone policy in MDPs where such a policy is present, exploiting its inherent sparsity. A technique from the field of sparse estimation, namely, nearly-isotonic regression, was used as a regularizer to promote monotonicity in the optimization iterates. To ensure both convergence and acceleration, two problem formulations were employed: one globally convergent LP formulated in terms of occupation measures, and a relaxed regularized problem formulated in terms of conditional probabilities. Numerical simulations showed the possibility of improvement in terms of number of iterations needed for convergence when combined with a popular large-scale optimization algorithm – especially when a tuning parameter, of a priori unknown optimal value, was chosen suboptimally.

In the future, it would be of interest to consider memory efficient representations and how splitting methods could be employed for distributed computing. It would also be of interest to extend the work to partially observed MDPs.

Appendix A Computation of Subgradients

Here, we compute the subgradients needed in equation (26), i.e., when updating the iterate on the regularized problem. Compared to the nominal problem (25), we have that:

•

$\beta=\theta\in{\mathbb{R}}^{X\times U\times(N+1)}$ ,

•

$Q=\{\theta \mid \sum_{u}\theta(x,u,k)=1\}$, i.e., a simplex for every pair of $x$ and $k$,

•

$f(\beta)=f(\theta)=f_{1}(\theta)+f_{2}(\theta)$ , where

[TABLE]

and

[TABLE]

•

$\bar{f}(\beta)=\bar{f}(\theta)=\max_{x,u,k}\{-\theta(x,u,k)\}.$

We now need to evaluate a subgradient for each one of these functions. We have that $g=g_{1}+g_{2}$ is one subgradient of $f$, where

•

$g_{1}=\frac{\partial f_{1}(\theta)}{\partial\theta_{x^{\prime},u^{\prime},k^{\prime}}}=c(x^{\prime},u^{\prime},k^{\prime})p(x^{\prime},k^{\prime})$ if $k^{\prime}<N$, and $c_{N}(x^{\prime})p(x^{\prime},k^{\prime})$ otherwise.

•

[TABLE]

where the first term is only active when $x^{\prime}<X$ and the last term is only included when $x^{\prime}>1$ .

•

Let $\bar{x},\bar{u},\bar{k}$ be such that

[TABLE]

Then one subgradient is given by

[TABLE]
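Two of these quantities are simple enough to sketch in code. The first function evaluates $g_{1}$ entrywise from the expression above. The second uses the standard fact that a subgradient of a pointwise maximum of linear functions is the gradient of any function attaining the maximum, so for $\bar{f}(\theta)=\max_{x,u,k}\{-\theta(x,u,k)\}$ one subgradient is $-1$ at a maximizing index and zero elsewhere. The nested-list layouts, the zero-based indexing and the function names are assumptions made for the example, and $g_{2}$ is left out because its expression is not reproduced here.

```python
def cost_subgradient(c, c_terminal, p, N):
    """g_1[x][u][k] = c(x,u,k) * p(x,k) for k < N, and c_N(x) * p(x,N) for k = N.

    Assumed layouts: c[x][u][k] for k = 0..N-1, c_terminal[x], p[x][k] for k = 0..N.
    """
    X, U = len(c), len(c[0])
    g1 = []
    for x in range(X):
        g1.append([])
        for u in range(U):
            row = []
            for k in range(N + 1):
                stage_cost = c[x][u][k] if k < N else c_terminal[x]
                row.append(stage_cost * p[x][k])
            g1[x].append(row)
    return g1


def max_neg_subgradient(theta):
    """One subgradient of max_{x,u,k} {-theta(x,u,k)}.

    The maximum of -theta is attained where theta is smallest; ties are
    broken by taking the first such index in (x, u, k) order.
    """
    best_index, best_value = None, None
    for x, per_state in enumerate(theta):
        for u, per_action in enumerate(per_state):
            for k, value in enumerate(per_action):
                if best_value is None or value < best_value:
                    best_index, best_value = (x, u, k), value
    g_bar = [[[0.0 for _ in per_action] for per_action in per_state]
             for per_state in theta]
    bx, bu, bk = best_index
    g_bar[bx][bu][bk] = -1.0
    return g_bar


# Toy instance (made-up numbers): X = 2, U = 2, N = 1.
c = [[[1.0], [2.0]], [[3.0], [4.0]]]
c_terminal = [5.0, 6.0]
p = [[0.5, 0.25], [0.5, 0.75]]
theta = [[[0.7, 0.4], [0.3, 0.6]], [[0.1, 0.9], [0.9, 0.1]]]

g1 = cost_subgradient(c, c_terminal, p, N=1)
g_bar = max_neg_subgradient(theta)
```

Since $f_{1}$ is linear in $\theta$ when $p$ is held constant, $g_{1}$ is in fact its gradient; only the maximum term requires a genuine subgradient choice.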

Appendix B Initial Boost

In Fig. 2, we consider an example where we use the proposed method only as an initial boost. In particular, we considered a random (monotone) MDP of size $X=10$ and $U=3$ over a time-horizon $N=60$. The parameters of the method were set to $\rho=30$, $i_{\text{ADMM}}=5$ and $i_{\text{SG}}=3$.

Bibliography

Altman, E. (1999). Constrained Markov decision processes. CRC Press.
Bertsekas, D.P. (2007). Dynamic Programming and Optimal Control, Vol. II. Athena Scientific, 3rd edition.
Bertsekas, D.P. and Tsitsiklis, J.N. (1995). Neuro-dynamic programming: an overview. In Proceedings of the 34th IEEE Conference on Decision and Control (CDC’95), 560–564.
Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn., 3(1), 1–122. doi:10.1561/2200000016.
Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press, New York, NY.
de Farias, D.P. and Van Roy, B. (2003). The linear programming approach to approximate dynamic programming. Operations Research, 51(6), 850–865.
El Chamie, M., Yu, Y., and Açıkmeşe, B. (2016). Convex synthesis of randomized policies for controlled Markov chains with density safety upper bound constraints. In Proceedings of the American Control Conference (CDC’16), 6290–6295.
Feinberg, E.A., Shwartz, A., and Hillier, F.S. (eds.) (2002). Handbook of Markov Decision Processes. Springer, New York, NY.