A convex programming approach for discrete-time Markov decision processes under the expected total reward criterion
F. Dufour (CQFD), Alexandre Genadot (CQFD)

TL;DR
This paper introduces a convex programming approach for constrained discrete-time Markov decision processes with Borel spaces, establishing the equivalence of optimal values and policies under the expected total reward criterion.
Contribution
It formulates a convex programming model for constrained MDPs with Borel spaces and proves the existence of stationary optimal policies under weak assumptions.
Findings
Convex programming formulation matches the constrained MDP's optimal value.
Existence of stationary randomized policies for optimal solutions.
Supremum of expected total rewards over randomized policies equals that over stationary policies.
F. Dufour
Institut Polytechnique de Bordeaux
INRIA Bordeaux Sud Ouest, Team: CQFD
IMB, Institut de Mathématiques de Bordeaux, Université de Bordeaux, France
e-mail: [email protected]
A. Genadot
IMB, Institut de Mathématiques de Bordeaux, Université de Bordeaux, France
INRIA Bordeaux Sud Ouest, Team: CQFD
e-mail: [email protected]
Abstract
In this work, we study discrete-time Markov decision processes (MDPs) under constraints with Borel state and action spaces, where all the performance functions have the same form of the expected total reward (ETR) criterion over the infinite time horizon. One of our objectives is to propose a convex programming formulation for this type of MDP. It will be shown that the values of the constrained control problem and the associated convex program coincide and that if there exists an optimal solution to the convex program then there exists a stationary randomized policy which is optimal for the MDP. It will also be shown that, in the framework of constrained control problems, the supremum of the expected total rewards over the set of randomized policies is equal to the supremum of the expected total rewards over the set of stationary randomized policies. We consider standard hypotheses such as the so-called continuity-compactness conditions and a Slater-type condition. Our assumptions are weak enough to cover cases that have not yet been addressed in the literature. An example is presented to illustrate our results with respect to those of the literature.
Keywords: Markov decision process, expected total reward criterion, occupation measure, constraints, convex program.
AMS 2010 Subject Classification: 90C40, 60J10, 90C90.
1 Introduction
We consider a discrete-time Markov decision process with constraints when all the objectives have the same form of the expected total reward over the infinite time horizon. Markov decision processes are a general family of controlled stochastic processes, which are suitable for the modeling of sequential decision-making problems under uncertainty. They arise in many applications, such as engineering, medicine, biology, operations research, management science, economics, among others.
Markov decision processes (MDPs) under the expected total reward (ETR) criterion have been extensively studied using a variety of approaches; see, e.g., [9] for a comprehensive survey on the subject and also [15, Chapter 2] for an analysis of the topic through examples.
When dealing with constraints, the linear/convex programming approach (also called the convex analytic method, see, e.g. [4, 5]) has proved to be a very powerful technique for solving MDPs. It has been extensively studied in the literature and we refer the interested reader to the following works [2, 4, 5, 10, 14] and the references therein to get an overview of this technique. The convex programming approach can be applied to a large class of control problems including for example, the finite-horizon and the infinite-horizon discounted-reward problems; see, e.g., [5] for further examples of performance functions. For such criteria, the key idea is to reformulate the original dynamic control problem as an infinite dimensional static optimization problem over a space of finite measures given by the occupation measures of the controlled process. However, it must be emphasized that the expected total reward criterion is an exception where the convex programming formulation may not be suitable except for very specific models. As mentioned in [5, p. 357-358] and [12, p. 92-93], the ETR criterion is very demanding from a technical point of view and yields some important technical difficulties which are basically of two types:
- a) The first issue is directly related to the question of how to properly formulate a convex program associated with an MDP under the ETR criterion. Indeed, as described in [5], the classical and natural approach to formulate a convex program associated with an MDP is to consider as underlying vector space the set of signed finite measures and as variables the occupation measures of the process. However, in the context of the ETR criterion, this approach fails since the occupation measures are not necessarily finite and may take the value infinity. Therefore, the space of finite signed measures is not the appropriate vector space to define the convex program.
- b) An important issue is related to the so-called characteristic equation satisfied by the occupation measures of the process, which is of the form:
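$$\mu^{\mathbf{X}}(\Gamma)=\nu(\Gamma)+\int_{\mathbf{X}\times\mathbf{A}}Q(\Gamma|x,a)\,\mu(dx,da)$$
for any measurable subset $\Gamma$ of $\mathbf{X}$, where $\mathbf{X}$ and $\mathbf{A}$ are respectively the state and action spaces; $\nu$ is the initial distribution, $Q$ is the transition probability function of the MDP and $\mu^{\mathbf{X}}$ is the marginal of the measure $\mu$ on $\mathbf{X}$. Indeed, a solution to this equation may not correspond to any occupation measure of the controlled process. This difficulty makes the analysis of the ETR criterion by the convex programming approach very involved.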
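To see concretely how a solution of the characteristic equation can fail to be an occupation measure, the following toy computation may help. It is an illustration only and is not taken from the paper: the three states $a$, $b$, $c$, the single-action kernel and the initial distribution $\delta_a$ are all assumptions made for the example.

```latex
% Hypothetical three-state chain with a single action, so measures live on the states.
% Initial distribution: \nu = \delta_a.
% Transitions: Q(\{b\}|a) = 1, \quad Q(\{b\}|b) = 1, \quad Q(\{c\}|c) = 1.
% The characteristic equation \mu(\{y\}) = \nu(\{y\}) + \sum_x \mu(\{x\}) Q(\{y\}|x) reads:
\begin{align*}
\mu(\{a\}) &= \nu(\{a\}) = 1, \\
\mu(\{b\}) &= \mu(\{a\}) + \mu(\{b\}) = 1 + \mu(\{b\})
  \quad\Longrightarrow\quad \mu(\{b\}) = +\infty, \\
\mu(\{c\}) &= \nu(\{c\}) + \mu(\{c\})\,Q(\{c\}|c) = \mu(\{c\}).
\end{align*}
% The last line holds for every value \mu(\{c\}) \in [0,+\infty], whereas the chain
% started at a never reaches c, so its occupation measure gives \mu(\{c\}) = 0.
```

In this sketch the second line also shows an occupation measure taking the value infinity, which is the difficulty raised in item a), and the third line shows that only the smallest solution of the equation is an occupation measure.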
The objective of the current paper is to propose a suitable convex program for MDPs under the ETR criterion. Our purpose is also to show that the value of the constrained control problem corresponds to the value of an associated convex program and that if there exists an optimal solution to the associated convex program then there exists a stationary randomized policy which is optimal for the MDP. We consider standard assumptions, the so-called continuity-compactness conditions introduced by Schäl in [16, 17]. These assumptions are of two types, namely conditions (S) and (W). Roughly speaking condition (S) requires the transition kernel to be strongly continuous whereas condition (W) refers to the case where the transition kernel is weakly continuous, see, e.g., [17, p. 367-368] for a precise statement of these assumptions. We also suppose the existence of a policy in the interior of the set of admissible policies. This is the so-called Slater condition. Conditions (W) and (S) do not play the same role in the sense that when working with condition (W) instead of condition (S) we have to consider an additional hypothesis requiring the transition kernel of the model to be absolutely continuous with respect to a Markov kernel uniformly in the action variables. Our approach differs from that classically considered in the literature in the sense that the variables of the convex program are not given by the occupation measures of the controlled process but defined on the positive cone of the vector space given by the pair of finite signed stochastic kernels on the action space given the state space.
When compared to the literature, our results appear complementary and our assumptions are rather weak. References dealing with the ETR criterion by means of the convex programming formulation are very scarce. As in our work, the results in [6, 8] are concerned with general Borel state and action spaces. However, it is important to observe that the approach proposed in [6, 8] does not correspond to a linear/convex programming formulation of an MDP under the ETR criterion. Indeed, the underlying variables of the optimization problem under consideration are given by measures that may take the value infinity and therefore, this set does not enjoy the structure of a standard vector space. This technical issue aside, the results of the current paper differ significantly from those obtained in [6, 8]. The approach developed in [6] deals with models satisfying condition (W) and strongly relies on the positiveness of the cost functions. It must be emphasized that the general framework of signed cost functions cannot be addressed with the technique presented in [6]. In [8], the model under consideration satisfies condition (S) and it was assumed that the transition kernel is absolutely continuous with respect to a reference probability measure uniformly in the state and action variables. In the present work, we show that this assumption is not needed under condition (S). It must also be observed that the approach developed in [8] for signed cost functions cannot be applied under condition (W). In [2, Chapter 8], the model is transient or absorbing and is restricted to discrete state and action spaces. Here, we do not require the MDP to be transient or absorbing. Another advantage of our approach is to propose a convex programming formulation for constrained MDPs under the ETR criterion with signed reward functions and satisfying condition (W). In this context, such a formulation has not so far been investigated in the literature.
It should also be mentioned that our work imposes the so-called Slater condition, which is not required in [2, 6, 8]. However, this condition is rather weak and is a standard assumption in convex optimization problems with constraints; see, e.g., [3].
The rest of the paper is organized as follows. In Section 2, we present the control problem that will be considered throughout this work. The assumptions and the convex programming formulation of a constrained discrete-time MDP under the ETR criterion are introduced in Section 3. Important properties of the convex program as well as of the constrained control problem are established in Section 4. Our main results are presented in Section 5, showing that the original control problem is equivalent to the convex program. Section 6 is dedicated to the presentation of an example illustrating our results. Finally, a technical result used in Section 4 is derived in an appendix.
2 Description of the control problem
The main goal of this section is to introduce the notation, the parameters defining the model, and to present the construction of the controlled process.
2.1 Notation and terminology
The following basic notation will be used throughout.
The set of integers is denoted by and corresponds to the non-negative integers, that is, . The set of real numbers is given by . For any subset of , denotes and . We write for with , is the set of extended real numbers, that is, and . Given and in the Euclidean space , let be the usual inner product of and . By we will denote the norm of . Let be the element of with all components equal to zero. If and are in , we shall write when all the components of are greater than or equal to the corresponding components of .
Let be a metric space and denote by its associated Borel $\sigma$-algebra. We use the symbol (respectively ) to denote the positive part (respectively, negative part) of a function . The function is the function whose values are constant and equal to . If is a metric space, denotes the set of real-valued measurable functions defined on . Furthermore, is the space of real-valued bounded continuous functions defined on . The term measure will always refer to a countably additive, -valued set function. The set of measures defined on is denoted by and the set of probability measures on by . For and a positive function in , and for , is defined by where by convention . Consider two metric spaces and . If is a measure on then denotes the marginal of the measure on . A kernel on given is a -valued mapping defined on such that for any , and for any , is a measurable function defined on . A kernel on given is said to be finite if for any . The set of finite kernels on given is denoted by . A stochastic (or Markov) kernel on given is a kernel in satisfying for any . The set of stochastic kernels on given will be denoted by . Let be a stochastic kernel on given , then, for a function , we define as
[TABLE]
provided that is quasi-integrable with respect to the probability measure for any . For a measure on , we denote by the measure on .
2.2 The control model.
Let us consider the stationary model
[TABLE]
consisting of:
- (a) A Borel space (that is, a Borel subset of a complete and separable metric space), which is the state space.
- (b) A Borel space , representing the control or action set.
- (c) A family of non-empty measurable subsets of , where is the set of feasible controls or actions when the system is in state . We suppose that
[TABLE]
is a measurable subset of . There exists a measurable map with . For notational convenience, we introduce recursively the set of histories up to time by defining and for .
- (d) A stochastic kernel on given , which stands for the transition probability function.
- (e) The one-step reward function is given by a measurable function .
- (f) For , the measurable mappings are the one-step constraint functions.
- (g) The constraint limits are real numbers given by $\theta^{*}=\big\{\theta^{*}_{i}\big\}_{i\in\mathbb{N}_{q}}$.
- (h) Finally, the initial distribution is .
A control policy (a policy, for short) is a sequence of stochastic kernels on given such that for any . Let be the set of all policies. A policy is called a stationary randomized policy if there exists a stochastic kernel on given satisfying for any and for any and . In such a case, we will write instead of to emphasize that the corresponding stationary randomized policy is generated by . Let be the set of all stationary randomized policies.
To state the optimal control problem we are concerned with, we introduce the canonical space consisting of the set of sample paths and the associated product $\sigma$-algebra . The projections from to the state space and the action space at time are denoted by and . That is, for
[TABLE]
for . Consequently, is the state process and is the control process. It is a well known result that for every policy and any initial probability measure on there exists a unique probability measure on such that and
[TABLE]
[TABLE]
[TABLE]
, for any .
The expectation with respect to is denoted by . The so-called occupation measure generated by a policy , denoted by , is defined by
[TABLE]
for any . Denote by (respectively, ) the set of occupation measures generated by randomized (respectively, stationary) policies.
Statement of the control problem.
For and , define by
[TABLE]
where by convention . In fact, assumptions will be introduced in the next section to avoid dealing with such cases. Observe that can be written equivalently in terms of the occupation measure generated by the policy as follows
[TABLE]
In this paper, we will repeatedly use this equality without mentioning it.
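The equality between the two expressions of the performance criterion can be checked numerically on a small finite model. The sketch below is only an illustration under stated assumptions: the three-state chain, its rewards and the names `P`, `REWARD`, `NU` are hypothetical and not taken from the paper, and the infinite sums are truncated at a finite horizon. Exact rational arithmetic is used so that the two truncated sums can be compared without rounding.

```python
from fractions import Fraction as F

# Hypothetical toy chain (not from the paper): states 0 and 1 are transient,
# state 2 is an absorbing zero-reward state. There is one action per state,
# so the only policy is stationary and the focus is on the bookkeeping.
P = [
    [F(1, 2), F(1, 4), F(1, 4)],
    [F(0), F(1, 2), F(1, 2)],
    [F(0), F(0), F(1)],
]
REWARD = [F(1), F(-2), F(0)]  # signed one-step reward
NU = [F(1), F(0), F(0)]       # initial distribution


def step(law):
    """Push the law of x_t forward to the law of x_{t+1}."""
    return [sum(law[i] * P[i][j] for i in range(3)) for j in range(3)]


def total_reward_by_time(horizon):
    """Sum over t < horizon of E[r(x_t)]."""
    law, total = list(NU), F(0)
    for _ in range(horizon):
        total += sum(p * r for p, r in zip(law, REWARD))
        law = step(law)
    return total


def occupation_measure(horizon):
    """mu(j) = sum over t < horizon of P(x_t = j)."""
    law, mu = list(NU), [F(0)] * 3
    for _ in range(horizon):
        mu = [m + p for m, p in zip(mu, law)]
        law = step(law)
    return mu


def total_reward_by_measure(horizon):
    """Integrate the reward against the truncated occupation measure."""
    return sum(m * r for m, r in zip(occupation_measure(horizon), REWARD))


mu = occupation_measure(60)
print([float(m) for m in mu])
print(float(total_reward_by_time(60)), float(total_reward_by_measure(60)))
```

In this toy chain the mass on the transient states converges as the horizon grows, while the mass on the absorbing state keeps increasing, which is a finite-state picture of an occupation measure that is not finite.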
Definition 2.1
A policy $\pi\in\Pi$ is said to be admissible if $\mathcal{J}_{\nu}(c_{i},\pi)\geq\theta^{*}_{i}$ for $i\in\mathbb{N}_{q}$. The set of admissible policies will be denoted by $\Pi_{\theta^{*}}$. The optimal control problem we consider consists in maximizing the expected reward $\mathcal{J}_{\nu}(r,\pi)$ over the set of admissible policies $\Pi_{\theta^{*}}$. The value associated to this constrained control problem is given by $\sup\big\{\mathcal{J}_{\nu}(r,\pi):\pi\in\Pi_{\theta^{*}}\big\}$. A policy $\hat{\pi}$ is optimal if $\hat{\pi}\in\Pi_{\theta^{*}}$ and $\mathcal{J}_{\nu}(r,\hat{\pi})=\sup\big\{\mathcal{J}_{\nu}(r,\pi):\pi\in\Pi_{\theta^{*}}\big\}$.
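A minimal sketch of such a constrained problem, on a model small enough to solve by hand, may help fix ideas. Everything in it is an assumption made for illustration and does not come from the paper: there is one live state and one absorbing zero-reward state, the action `quit` earns reward 1 and ends the process, the action `stay` earns constraint reward 1 and keeps the process alive with probability 1/2, and the constraint limit `THETA` equals 1. A stationary randomized policy is described by the probability `p` of playing `stay`.

```python
from fractions import Fraction as F

THETA = F(1)  # hypothetical constraint limit


def performances(p):
    """Expected total reward and constraint value when 'stay' is played w.p. p.

    Each visit to the live state is followed by a return with probability p/2,
    so the expected number of visits is 1 / (1 - p/2).
    """
    visits = 1 / (1 - p / 2)
    return visits * (1 - p), visits * p


def best_on_grid(n):
    """Maximise the reward over admissible stationary policies p = k/n."""
    feasible = []
    for k in range(n + 1):
        p = F(k, n)
        j_r, j_c = performances(p)
        if j_c >= THETA:  # admissibility: constraint value at least THETA
            feasible.append((j_r, p))
    return max(feasible)


value, p_star = best_on_grid(300)
print(value, p_star)
```

In this toy model the two deterministic policies are either not admissible (`p = 0`) or earn no reward (`p = 1`), so the best admissible stationary policy found on the grid is genuinely randomized.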
3 Assumptions and the convex programming formulation
The objective of this section is both to list the assumptions we will use in this work and to introduce the convex program associated with the control problem we presented in the previous section. In this work, we deal with MDPs satisfying the so-called Conditions (W) or (S) which are standard hypotheses of the literature, see for example [16].
Condition (W)
- (W1) For any , the action set is compact and the multifunction from to defined by is upper-semicontinuous.
- (W2) For any , is continuous on .
- (W3) The reward and the constraint for are upper-semicontinuous on .
Condition (S)
- (S1) For any , is compact.
- (S2) For any and , is continuous on .
- (S3) For any , the reward and the constraint for are upper-semicontinuous on .
In order to introduce the convex program associated to an MDP under the ETR criterion, we need to make some hypotheses. First, it is assumed that the transition kernel of the MDP under consideration is absolutely continuous with respect to a Markov kernel (see Assumption A). This hypothesis is rather weak and is satisfied in a large number of practical cases, as discussed in the remark below.
- Assumption A.
There exists satisfying for any . Associated to the kernel , will denote the probability measure on defined by
[TABLE]
Remark 3.1
1. In Lemma 3.2 below, it is shown that under Conditions (S1) and (S2), Assumption A is satisfied.
2. If the sets of feasible actions are countable, that is where for any is a measurable function from to then Assumption A is satisfied for defined by
[TABLE]
for any .
3. If for any then clearly Assumption A is satisfied. This condition corresponds to the main hypothesis used in [8]. It is of course less general than Assumption A, but it is naturally satisfied for a large class of practical systems. Indeed, in many applications, the evolution of an MDP is specified by a discrete-time equation of the form where is an -valued measurable mapping defined on and is an independent and identically distributed sequence of random variables with density with respect to the Lebesgue measure on . By using the change of variable formula, we obtain that showing that for any is satisfied for defined for example by the standard normal distribution on .
Observe also that when is finite or countable, for any is satisfied when is given for example by a geometric distribution.
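The change of variable argument can be made explicit on one simple instance. The block below is a sketch under assumptions that are not in the text above: the state space is taken to be $\mathbb{R}$, the dynamics are additive, $F(x,a,\xi)=x+a+\xi$, and $g$ denotes the density of the noise.

```latex
% Assumed instance: \mathbf{X} = \mathbb{R}, F(x,a,\xi) = x + a + \xi,
% and the noise \xi_t has density g with respect to the Lebesgue measure on \mathbb{R}.
\begin{align*}
Q(\Gamma|x,a) &= \mathbb{P}\big(x + a + \xi_t \in \Gamma\big)
  = \int_{\mathbb{R}} I_{\Gamma}(x + a + z)\, g(z)\, dz \\
 &= \int_{\Gamma} g(y - x - a)\, dy \qquad (y = x + a + z).
\end{align*}
% Hence Q(\cdot|x,a) has a Lebesgue density for every (x,a). Since the standard
% normal distribution has a strictly positive density on \mathbb{R}, every measure
% with a Lebesgue density is absolutely continuous with respect to it.
```

For a non-additive mapping $F$ the same computation requires an invertibility condition on $F$ in the noise variable, which this sketch does not attempt to state.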
Lemma 3.2
Conditions (S1) and (S2) imply Assumption A, that is, with given by
[TABLE]
where is a sequence of measurable selectors from the multifunction defined from to by and satisfying for any .
Proof: The multifunction from to defined by is by assumption Borel measurable and so, weakly measurable. From (S1), Corollary 18.15 in [1] gives the existence of a sequence of measurable selectors from the multifunction satisfying for any . Now by using (S2), we obtain that for any for the Markov kernel defined by (3).
Remark 3.3
The previous proof is an extension of an argument used in the proof of Theorem 1 in [13, p. 183].
In the next definition, we introduce the set of feasible variables. It will be shown below that it is a convex subset of the vector space of finite signed kernels on given .
Definition 3.4
Suppose Assumption A holds and let be the measure introduced in (2).
- For , will denote the measure in given by
[TABLE]
recalling that is the constant function equal to infinity.
- Consider as the set of satisfying
[TABLE]
[TABLE]
and
[TABLE]
Any induces a measure that will be called the -measure generated by . is called the set of feasible variables.
Remark 3.5
Observe first that and in particular,
[TABLE]
for any and . Therefore, is a convex subset of the vector space of finite signed kernels on given .
Definition 3.6
Let . Introduce the kernel on given defined by
[TABLE]
where
[TABLE]
Observe that is a stochastic kernel satisfying for any . The stationary randomized policy will be called the policy induced by .
We will also need the following technical hypothesis:
- Assumption B.
- (B.1) $\sup\big\{\eta^{\Phi}(r^{+}):\Phi\in\boldsymbol{\mathcal{K}}_{p}\big\}<+\infty$ and $\sup\big\{\eta^{\Phi}(c^{+}_{i}):\Phi\in\boldsymbol{\mathcal{K}}_{p}\big\}<+\infty$ for any $i\in\mathbb{N}_{q}$.
- (B.2) and for any , .
This hypothesis is comparable to Assumption (A2) introduced in [8, p. 847]. Assumption (B.1) essentially imposes that the values of the unconstrained convex programs associated to a reward function given by either $r^{+}$ or $c^{+}_{i}$ for $i\in\mathbb{N}_{q}$ are different from $+\infty$, while Assumption (B.2) ensures that the performance criteria associated to the reward $r$ and the constraints $c_{i}$ for $i\in\mathbb{N}_{q}$ are not equal to $-\infty$. In particular, Assumption (B.1) will be used to introduce the convex program.
Definition 3.7
Suppose Assumptions A and (B.1) hold. The convex program, denoted by , consists in maximizing over subject to for any . The value of the convex program is given by
[TABLE]
A variable is said to be an optimal solution to the convex program if
[TABLE]
and for any .
Remark 3.8
Let be a function given by either or for . From Assumption (B.1), it follows that is well defined for any and . Therefore, we obtain from equation (5) that
[TABLE]
for any and . This implies that the mathematical program defined in (8) is indeed a convex program. In [3, p. 153], a convex program is written in terms of an infimum. The program introduced in Definition 3.7 can be equivalently written in terms of an infimum by changing the sign of the reward function. We prefer to keep this setting to deal with an MDP under a reward optimization criterion.
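The passage from the supremum form to the infimum form mentioned in this remark can be written out in one line. The display below is a sketch that reuses the symbols $\eta^{\Phi}$, $\boldsymbol{\mathcal{K}}_{p}$, $c_{i}$ and $\theta^{*}_{i}$ as they appear elsewhere in the paper, and assumes that $\eta^{\Phi}(r)$ is well defined so that $\eta^{\Phi}(-r)=-\eta^{\Phi}(r)$.

```latex
\begin{align*}
&\sup\big\{\eta^{\Phi}(r) : \Phi\in\boldsymbol{\mathcal{K}}_{p},\
   \eta^{\Phi}(c_{i})\geq\theta^{*}_{i} \text{ for } i\in\mathbb{N}_{q}\big\} \\
&\qquad = -\inf\big\{\eta^{\Phi}(-r) : \Phi\in\boldsymbol{\mathcal{K}}_{p},\
   \eta^{\Phi}(-c_{i})\leq-\theta^{*}_{i} \text{ for } i\in\mathbb{N}_{q}\big\}.
\end{align*}
```

Read this way, the objective $\Phi\mapsto\eta^{\Phi}(-r)$ is minimized under inequality constraints of the form "function of $\Phi$ less than or equal to a constant", which is the shape in which a convex program is usually stated.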
Finally, we introduce an additional standard hypothesis:
The Slater condition
- There exists $\pi\in\Pi$ such that $\mathcal{J}_{\nu}(c_{i},\pi)>\theta^{*}_{i}$ for any $i\in\mathbb{N}_{q}$.
4 Preliminary results
The main goal of this section is to establish several properties of the constrained control problem as well as properties of the convex program.
4.1 Properties of the convex program
In this subsection, we will show in Lemma 4.2 that for any stationary randomized policy there exists such that the -measure generated by is equal to the occupation measure generated by the stationary randomized policy . An important result which is a cornerstone of the paper is presented at the end of this subsection. It can be roughly stated as follows: for any feasible variable of the convex program, the reward associated to the stationary randomized policy is greater than for specific functions that will be discussed in Theorem 4.3. To get these results, we first need to establish that the occupation measures of the controlled process have a special structure, that is, the marginal on of any occupation measure is absolutely continuous with respect to the probability measure introduced in Assumption A.
Lemma 4.1
Suppose Assumption A holds. Then for any ,
[TABLE]
where is defined in (2).
Proof: For any , it can be easily shown from Lemma 9.4.3 in [11] the existence of an increasing sequence of finite measures on and a sequence of stochastic kernels on given satisfying and
[TABLE]
and
[TABLE]
for , and . Let us show by induction that for any . We have clearly . Assume that . Observe that for any implying that
[TABLE]
and so, combining (2) and (11) we have . We obtain the result by using (10).
As a consequence, we can show that the set of the -measures contains the occupation measures generated by the stationary randomized policies.
Lemma 4.2
Suppose Assumption A holds. For any , there exists such that
[TABLE]
Proof: Let . Clearly, the increasing sequence of finite measures defined on by
[TABLE]
for any converges to . From Lemma 4.1, there exists a sequence of increasing measurable -valued functions defined on such that for and so, . Therefore,
[TABLE]
where . Consequently, defined by and belongs to since .
The following result is in a way a converse of the previous one. It is a key result in our work. Roughly speaking, it states that for any feasible variable of the convex program, the reward associated to the stationary randomized policy is greater than for specific functions described below.
Theorem 4.3
Suppose that Assumption A holds. For any , there exists such that
[TABLE]
for any satisfying $\sup\big\{\eta^{\Phi}(h^{+}):\Phi\in\boldsymbol{\mathcal{K}}_{p}\big\}<+\infty$.
Proof: For satisfying $\sup\big\{\eta^{\Phi}(h^{+}):\Phi\in\boldsymbol{\mathcal{K}}_{p}\big\}<+\infty$, let us prove the result by showing that
[TABLE]
where is the stationary randomized policy induced by (see (6)). There is no loss of generality to assume that and so we have . We are going to proceed by contradiction to get (12). More precisely, if then we will introduce a sequence in satisfying contradicting the hypothesis. The proof is divided into two steps. We will first introduce and show that for any . In a second step, it will be proven that showing the result.
First step: construction of a sequence in .
Let be the occupation measure induced by the stationary randomized policy . As in the proof of Lemma 4.2, there exists a measurable -valued function defined on satisfying
[TABLE]
For , consider where is given by
[TABLE]
and is a signed kernel on given defined by
[TABLE]
Observe that in the previous definition, is well defined since . To get the result, we will proceed in two steps. First we will show that on implying that and so, for any . In a second step, we will prove that .
Let us show that .
From (4), and so, by using (7)
[TABLE]
where by convention . Recalling the definition of (see equation (6)), we easily obtain and and so, we get
[TABLE]
Therefore,
[TABLE]
Since , we have by using (16)
[TABLE]
and with (17) it follows
[TABLE]
However, is the minimal solution to the equation and so, . Combining equations (13) and (17), we obtain $\Big[I_{\boldsymbol{\mathcal{E}}_{\Phi}}(\cdot)\varphi^{*}(\mathbf{A}|\cdot)+I_{\boldsymbol{\mathcal{E}}_{\Phi}^{c}}(\cdot)\mathcal{I}_{\infty}(\cdot)\Big]\geq\mathcal{D}_{\varphi_{\Phi}}(\cdot)$. Consequently, on and according to the definition of (see equation (13)), there is no loss of generality to claim
[TABLE]
Therefore, and so, for any .
Let us show that .
Recalling the definition (see equations (14)-(15)), we have and for any . The only point which remains to prove is that satisfies
[TABLE]
Combining the definition of (see equations (14)-(15)) and the expression of (see equation (16)), we obtain
[TABLE]
where is given by
[TABLE]
To show that (20) holds, we will consider two cases.
a) Firstly, we will show that equation (20) is satisfied on . For that, let us consider . From (21), we have . However, showing that . If we show that then implying that (20) holds on . To see that , observe from (22) that
[TABLE]
Assuming that and combining (13), (17) and the previous equation we have
[TABLE]
Now, we obtain by using (18) and the fact that
[TABLE]
implying also
[TABLE]
[TABLE]
Recalling that , we have with (13) and (25)
[TABLE]
The two previous equations give
[TABLE]
[TABLE]
Recalling the definition of (see (22)) we get for with . However, equation (17) implies that is $\sigma$-finite on and combining (16) and (22), we have . Therefore, it follows that for any in , and so (20) holds on .
b) Secondly, we will show that equation (20) is satisfied on . For that, let . It is important to observe from (17) that in this case or . Therefore, we obtain on one hand by recalling (21) and using the fact that and on the other hand since by (22)
[TABLE]
where the last inequality comes from (18). Therefore, showing that (20) holds on .
Finally, equation (20) is satisfied and as a consequence for any .
Second step: .
Recalling that , we get from (16)
[TABLE]
implying also
[TABLE]
Therefore, combining (13), (16), (22) and the two previous equations we obtain easily that
[TABLE]
If then and giving the result.
4.2 Properties of the constrained control problem
The main objective of this subsection is to show that in the framework of constrained control problems, the supremum of the expected total rewards over the set of randomized policies is equal to the supremum of the expected total rewards over the set of stationary randomized policies. Our results use Theorem A.1 presented in the Appendix which is a slight modification of Theorem 1 in Schäl [17] who has established a stronger version of this type of result but in the unconstrained case. To use Schäl’s results, we need to impose Conditions (W) or (S) and in addition, to deal with the constrained case, we need to impose a Slater-type condition.
The next technical lemma shows that, roughly speaking, under Assumption (B.1), the values of the unconstrained control problems associated to a reward function given by either $r^{+}$ or $c^{+}_{i}$ for $i\in\mathbb{N}_{q}$ are different from $+\infty$.
Lemma 4.4
Suppose Assumptions A and (B.1) and either Condition (W) or (S) hold. Then,
[TABLE]
for .
Proof: The idea is to apply Theorem A.1 to the unconstrained models associated to the reward functions given by one of the following mappings: and for . Clearly, the Convergence Assumption and the Continuity and Compactness Assumptions in [17, p. 367] are satisfied. Therefore, we have by using Theorem A.1
[TABLE]
for any function given by either or for . Now, from Assumption A we can apply Lemma 4.2 to have
[TABLE]
Recalling Assumption (B.1) we obtain the result.
The next result shows that if the Slater condition is satisfied for an arbitrary policy then there exists a stationary randomized policy satisfying the same type of condition.
Proposition 4.5
Suppose Assumptions A, B and either Condition (W) or (S) hold. If the Slater condition is satisfied, then there exists satisfying for any .
Proof: The result is proved by induction. Applying Theorem A.1 for the unconstrained model associated to the reward function , we have
[TABLE]
Since (by recalling the Slater condition), we have implying the existence of such that . For , let us assume the existence of such that for . Therefore, we can combine Lemma 4.4 and Proposition A.2 to obtain
[TABLE]
However,
[TABLE]
implying the existence of such that for . This gives the result.
Below is the main result of this subsection. Roughly speaking, it states that in the framework of constrained control problems, the suprema of the expected total rewards over the set of randomized policies and over the set of stationary randomized policies coincide.
Theorem 4.6
Suppose Assumptions A, B and either Condition (W) or (S) hold. If the Slater condition is satisfied, then
[TABLE]
Proof: Applying Proposition 4.5, there exists satisfying the Slater condition, that is, for . Now, combining Lemma 4.4 and Proposition A.2, we obtain the result.
5 Main results
In this section, we present the main results of this paper showing that the original control problem is equivalent to the convex program introduced in Definition 3.7 for a weakly or strongly continuous transition kernel.
The case of Condition (W)
Theorem 5.1
Suppose Assumptions A, B and Condition (W) hold. If the Slater condition is satisfied, then
[TABLE]
where is defined in (2). Moreover, if is an optimal solution to the convex program then the stationary randomized policy induced by is optimal for the constrained control problem, that is,
[TABLE]
Proof: Theorem 4.6 states that
[TABLE]
However, from Lemma 4.2, we have
[TABLE]
Now, consider . By using Theorem 4.3, for given either or for implying that and also the reverse inequality
[TABLE]
showing the first part of the result.
Now if is an optimal solution to the convex program then for any and $\eta^{\hat{\Phi}}(r)=\sup\big\{\eta^{\Phi}(r):\Phi\in\boldsymbol{\mathcal{K}}_{p}\text{ and }\eta^{\Phi}(c_{i})\geq\theta^{*}_{i}\text{ for }i\in\mathbb{N}_{q}\big\}$. Therefore, the stationary randomized policy satisfies by using Theorem 4.3. Now, by using the first part of the result (see equation (27)) it follows that $\mathcal{J}_{\nu}(r,\varphi_{\hat{\Phi}})\geq\sup\big\{\mathcal{J}_{\nu}(r,\pi):\pi\in\Pi_{\theta^{*}}\big\}$, giving the last part of the result.
Remark 5.2
As mentioned in the introduction, the previous result has the advantage of providing a convex programming formulation for constrained MDPs under the ETR criterion with signed reward functions and satisfying condition (W), a setting that has not so far been addressed in the literature. In [6], the authors do not really analyse a convex program but study a related optimization problem where the MDPs under consideration satisfy condition (W). However, the approach proposed there strongly relies on the positiveness of the cost functions and cannot be generalized to the framework of signed cost functions.
The case of condition (S)
Theorem 5.3
Suppose Assumptions 3 and Condition (S) hold. If the Slater condition is satisfied, then
[TABLE]
where is defined in (2) for given by (3). Moreover, if is an optimal solution to the convex program then the stationary randomized policy induced by is optimal for the constrained control problem introduced in Definition 2.1.
Proof: Up to the definition of whose existence is established in Lemma 3.2, the proof of this result is identical to that of Theorem 5.1.
Remark 5.4
In [8], the authors do not really analyse a convex program but study a related optimization problem where the MDPs under consideration satisfy condition (S), assuming that the transition kernel is absolutely continuous with respect to a reference probability measure uniformly in the state and action variables. The previous result shows that, under condition (S), this absolute continuity assumption is not needed if it is replaced by a Slater-type condition.
6 Example
In this section, we provide an example with one constraint to illustrate our results and compare them with reference [8]. The results obtained in [6] cannot be used for this model because the constraint function takes both positive and negative values. We will show that one of the conditions of [8] is not satisfied while the approach developed in the present paper can be applied. This example shows that there is a gap between the initial optimization problem and the mathematical program associated with the measures satisfying the characteristic equation, that is,
[TABLE]
This means that the characteristic equation generates measures that do not correspond to any occupation measure of the process. Measures of this type are called phantom solutions of the characteristic equation in [7]. The interesting point is that, at the same time, we may have
[TABLE]
This means that the set $\big\{\eta^{\Phi}:\Phi\in\boldsymbol{\mathcal{K}}_{p}\big\}$, which is incidentally a subset of $\big\{\mu\in\boldsymbol{\mathcal{M}}(\mathbf{X}):\mu_{\mathbf{X}}=\nu+\mu Q\big\}$, may generate fewer phantom solutions.
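The notion of a phantom solution can be illustrated numerically in a finite setting. The following is a minimal sketch on a hypothetical two-state toy model (it is not the model studied in this section, and the names `NU`, `Q`, `occupation_measure` and `residual` are introduced only for illustration). It computes the occupation measure of a stationary policy by summing the state-action distributions over time, and then checks that both this measure and a measure charging an unreachable state satisfy the finite-state form of the characteristic equation $\mu_{\mathbf{X}}=\nu+\mu Q$.

```python
# Hypothetical toy model: transient states 0 and 1. Absorption is implicit:
# the mass missing from Q[(x, a)] goes to a cemetery state that is not tracked.
NU = {0: 1.0, 1: 0.0}          # initial distribution: start in state 0
Q = {
    (0, "a"): {0: 0.5},        # stay in 0 w.p. 1/2, otherwise absorbed
    (1, "loop"): {1: 1.0},     # state 1 is never reached from state 0
    (1, "exit"): {},           # absorbed immediately
}


def occupation_measure(nu, q, policy, n_steps=200):
    """Sum over time of the state-action distributions under a stationary policy."""
    dist = dict(nu)
    mu = {xa: 0.0 for xa in q}
    for _ in range(n_steps):
        nxt = {x: 0.0 for x in nu}
        for x, mass_x in dist.items():
            for a, p in policy[x].items():
                mass = mass_x * p
                mu[(x, a)] += mass
                for y, prob in q[(x, a)].items():
                    nxt[y] += mass * prob
        dist = nxt
    return mu


def residual(nu, q, mu):
    """Largest violation of m(y) = nu(y) + sum_{x,a} mu(x,a) Q(y | x,a)."""
    worst = 0.0
    for y in nu:
        marginal = sum(v for (x, _), v in mu.items() if x == y)
        inflow = sum(v * q[xa].get(y, 0.0) for xa, v in mu.items())
        worst = max(worst, abs(marginal - nu[y] - inflow))
    return worst


policy = {0: {"a": 1.0}, 1: {"exit": 1.0}}
mu_true = occupation_measure(NU, Q, policy)

# Add mass on the unreachable pair (1, "loop"): the characteristic equation
# still holds, but no policy started from NU produces this measure.
mu_phantom = dict(mu_true)
mu_phantom[(1, "loop")] = 3.0

print(residual(NU, Q, mu_true), residual(NU, Q, mu_phantom))
```

In this toy model the process started from state 0 never visits state 1, so every occupation measure gives zero mass to state 1. The measure `mu_phantom` nevertheless has a zero residual, which is the finite-state analogue of the phenomenon described above.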
Two different values of the constraint limit will be studied. For the first value, it will be shown that the approach proposed in the present paper can be applied, implying that the value of the original control problem coincides with the value of the convex program. When the value of the constraint limit is changed, the Slater condition is no longer satisfied. However, it is interesting to observe that in this latter case the values of the original control problem and its associated convex program still coincide. It thus appears that the Slater condition is not a necessary condition for establishing the correspondence between the constrained control problem and its associated convex program.
We consider the control model
[TABLE]
where and the action set is given by . For , ; and . The stochastic kernel is given by for and , for and finally, . The one-step reward function is given by for ; and . The one-step constraint function is given by for ; and . The initial distribution satisfies . The constraint limit is given by . Two cases are studied: and .
Let satisfying the characteristic equation and so, ; for ; and finally, and for showing that for , . Therefore,
[TABLE]
since . This implies that Assumption (A2) in [8] is not satisfied and therefore, the approach developed there cannot be applied.
The stochastic kernel on given defined by for and satisfies Assumption 3.
The probability associated to and given by (2) satisfies for . As a consequence, for any and . Moreover, since satisfies the characteristic equation, it follows that and for and . Thus,
[TABLE]
and similarly,
[TABLE]
where . Clearly, we have and for any showing that Assumption (B.1) is satisfied.
Now, let (respectively, ) be the deterministic stationary policy given by for (respectively, if and ). It is easy to see that the occupation measure is given by ; ; ; for any and for and the occupation measure satisfies for any ; and . It follows easily and . Observe also that and . Clearly, the reward takes values in the interval when the policy ranges over and the constraint takes values in . Therefore, Assumption (B.2) is satisfied.
Finally, Condition (W) is obviously satisfied for this model.
Remark that for any , the stationary randomized policy given by , and for yields and .
The case where From the previous discussion, we have
[TABLE]
where is the stationary randomized policy given by , for . Moreover,
[TABLE]
Therefore, the values of the original control problem and the convex program agree as claimed by Theorem 5.1 since the Slater condition holds.
Observe that the optimal value of the convex program is achieved for where is an optimal solution to the convex program . Since , the stationary policy induced by is given by and for . This optimal policy corresponds to as determined above.
The case where We have for this value of the constraint limit,
[TABLE]
where is the stationary randomized policy given by , , for .
However, we cannot apply the results of the present paper because in this case the Slater condition is not satisfied. Indeed, for any , . Nevertheless, the values of the original control problem and the convex program still agree since
[TABLE]
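The two regimes discussed above (Slater condition satisfied, then violated when the constraint limit is changed) can be mimicked on a much smaller hypothetical model. The sketch below is not the model of this section: it assumes a single transient state with two actions, `"go"` (remain in the state with probability `STAY`, otherwise absorbed) and `"stop"` (absorbed immediately), with signed reward and constraint values chosen only for illustration. It sweeps the mixing parameter `p` of a stationary randomized policy, tests a Slater-type condition by looking for a strictly feasible `p`, and returns the best feasible value.

```python
STAY = 0.5                           # assumed probability of staying after "go"
R = {"go": 1.0, "stop": 0.0}         # hypothetical one-step reward
C = {"go": -1.0, "stop": 1.0}        # hypothetical signed one-step constraint function


def total(f, p):
    """Expected total value of f when "go" is chosen with probability p."""
    visits = 1.0 / (1.0 - p * STAY)  # expected number of visits to the state
    return visits * (p * f["go"] + (1.0 - p) * f["stop"])


def analyse(theta, grid=1000):
    """Return (Slater condition holds?, best feasible p, best value) on a grid."""
    ps = [k / grid for k in range(grid + 1)]
    slater = any(total(C, p) > theta for p in ps)
    feasible = [p for p in ps if total(C, p) >= theta]
    best = max(feasible, key=lambda p: total(R, p))
    return slater, best, total(R, best)


for theta in (0.0, 1.0):
    print(theta, analyse(theta))
```

With the constraint limit `0.0` a strictly feasible policy exists, whereas with the constraint limit `1.0` the only feasible policy meets the constraint with equality, so the Slater-type condition fails while the constrained problem still has a well-defined value. The sweep only ranges over stationary randomized policies of the toy model and does not by itself verify the equality with a convex program.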
Appendix A Appendix
In this appendix, let be an integer in . Consider the functions and for . We will first present a slightly different version of a result derived by M. Schäl in [17, Theorem 1], which is used repeatedly in this paper. The only difference is that we consider here the expected total reward criterion, while in [17] Schäl deals with the conditional version of that performance criterion. We will also establish a technical result that is used in Section 4.2 to show that, in the framework of control problems with constraints, the supremum of the expected total rewards over the set of randomized policies is equal to the supremum of the expected total rewards over the set of stationary randomized policies.
To use Theorem 1 in [17], we need to introduce the following two sets of conditions:
1. For any , is compact.
2. For any and , is continuous on .
3. For any , is upper-semicontinuous on .
4. For any , for are upper-semicontinuous on .
or
1. For any , the action set is compact and the multifunction from to defined by is upper-semicontinuous.
2. For any , is continuous on .
3. The function is upper-semicontinuous on .
4. The functions for are upper-semicontinuous on .
Theorem A.1
Suppose or for any and either conditions - or - are satisfied. Then
[TABLE]
Proof: The proof of this result is essentially the same as that of Theorem 1 in [17]. The only difference is that we consider here the expected total reward criterion, while in [17] Schäl deals with the conditional version of that performance criterion. By adapting the arguments developed in [17], we easily obtain the result.
Proposition A.2
Consider . Assume $\sup\big\{\mu(h^{+}+g_{i}^{+}):\mu\in\boldsymbol{\mathcal{O}}\cup\{\eta^{\Phi}:\Phi\in\boldsymbol{\mathcal{K}}_{p}\}\big\}<+\infty$; and for . Suppose also that Assumption 3 and either conditions - or - are satisfied. If there exists satisfying for any then
[TABLE]
Proof: Let be either or . Clearly for any , in and . Let us define . is clearly a non-empty convex subset of . Define the function on by
[TABLE]
By hypothesis, takes values in for any . Observe that is a proper concave function on . Indeed, consider and in and . For any , there exist for satisfying and for . Clearly, we have $\big(\beta\mu_{1,\epsilon}+(1-\beta)\mu_{2,\epsilon}\big)(g_{i})\geq\beta\theta_{1,i}+(1-\beta)\theta_{2,i}$ for any . Therefore,
[TABLE]
showing that is a proper concave function on . Now, is in the interior of , and so is continuous at by Proposition 2.17 in [3] and therefore, we can apply Proposition 2.36 in [3] to claim the existence of such that, for all ,
[TABLE]
Remark that since for all . Now, fix an arbitrary . Then and so,
[TABLE]
Therefore,
[TABLE]
For any , there exists with for any such that implying
[TABLE]
since . Together with (31), this shows
[TABLE]
Now, we have for ,
[TABLE]
implying
[TABLE]
and so with (32) we obtain
[TABLE]
Therefore, with
[TABLE]
and with
[TABLE]
Now, for we have $\sup\Big\{\eta^{\Phi}\Big(\big(h-\langle\lambda,g\rangle\big)^{+}\Big):\Phi\in\boldsymbol{\mathcal{K}}_{p}\Big\}<+\infty$ by hypothesis and we obtain from Lemma 4.2 and Theorem 4.3 that
[TABLE]
and also,
[TABLE]
Therefore, combining equations (34)-(36) we obtain that
[TABLE]
Moreover, Theorem A.1 can be applied to show that
[TABLE]
Combining equations (33), (37) and (38), we obtain the result.
- [1] C. Aliprantis and K. Border. Infinite dimensional analysis: A hitchhiker's guide. Springer, Berlin, third edition, 2006.
- [2] E. Altman. Constrained Markov decision processes. Stochastic Modeling. Chapman & Hall/CRC, Boca Raton, FL, 1999.
- [3] V. Barbu and T. Precupanu. Convexity and optimization in Banach spaces. Springer Monographs in Mathematics. Springer, Dordrecht, fourth edition, 2012.
- [4] V. Borkar. A convex analytic approach to Markov decision processes. Probab. Theory Related Fields, 78(4):583–602, 1988.
- [5] V. Borkar. Convex analytic methods in Markov decision processes. In Handbook of Markov decision processes, volume 40 of Internat. Ser. Oper. Res. Management Sci., pages 347–375. Kluwer Acad. Publ., Boston, MA, 2002.
- [6] F. Dufour, M. Horiguchi, and A. Piunovskiy. The expected total cost criterion for Markov decision processes under constraints: a convex analytic approach. Advances in Applied Probability, 44(3):774–793, 2012.
- [7] F. Dufour and A. Piunovskiy. Multiobjective stopping problem for discrete-time Markov processes: convex analytic approach. J. Appl. Probab., 47(4):947–966, 2010.
- [8] F. Dufour and A. Piunovskiy. The expected total cost criterion for Markov decision processes under constraints. Advances in Applied Probability, 45(3):837–859, 2013.
