A convex programming approach for discrete-time Markov decision processes under the expected total reward criterion

F. Dufour (CQFD); Alexandre Genadot (CQFD)

arXiv:1903.08853·math.PR·May 10, 2019·SIAM J. Control. Optim.


TL;DR

This paper introduces a convex programming approach for constrained discrete-time Markov decision processes with Borel spaces, establishing the equivalence of optimal values and policies under the expected total reward criterion.

Contribution

It formulates a convex programming model for constrained MDPs with Borel spaces and shows that, whenever the convex program has an optimal solution, a stationary randomized policy is optimal for the MDP.

Findings

01

Convex programming formulation matches the constrained MDP's optimal value.

02

If the convex program has an optimal solution, then a stationary randomized policy is optimal for the MDP.

03

Supremum of expected total rewards over randomized policies equals that over stationary policies.


Equations

$$\mu_{X}(\cdot)=\nu(\cdot)+\int_{X\times A}Q(\cdot\mid x,a)\,\mu(dx,da)$$

$$Qv(y):=\int_{X}v^{+}(x)\,Q(dx\mid y)-\int_{X}v^{-}(x)\,Q(dx\mid y)$$

$$\big(\mathbf{X},\mathbf{A},\{\mathbf{A}(x):x\in\mathbf{X}\},Q,r,c,\theta^{*},\nu\big)$$

$$\mathbf{K}:=\{(x,a)\in\mathbf{X}\times\mathbf{A}:a\in\mathbf{A}(x)\}$$

$$\omega=(y_{1},b_{1},\ldots,y_{t},b_{t},\ldots)\in\Omega\quad\text{we have}\quad X_{t}(\omega)=y_{t}\text{ and }A_{t}(\omega)=b_{t}$$

$$\mathbb{P}_{\nu}^{\pi}(X_{1}\in B)=\nu(B)\quad\text{for }B\in\boldsymbol{\mathfrak{B}}(\mathbf{X})$$

$$\mathbb{P}_{\nu}^{\pi}(X_{t+1}\in C\mid\sigma\{X_{1},\ldots,X_{t},A_{t}\})=Q(C\mid X_{t},A_{t})\quad\text{for }C\in\boldsymbol{\mathfrak{B}}(\mathbf{X})$$

$$\mathbb{P}_{\nu}^{\pi}(A_{t}\in D\mid\sigma\{X_{1},\ldots,X_{t-1},A_{t-1},X_{t}\})=\pi_{t}(D\mid X_{1},\ldots,X_{t-1},A_{t-1},X_{t})\quad\text{for }D\in\boldsymbol{\mathfrak{B}}(\mathbf{A})$$

$$\mu^{\pi}(\Gamma)=\sum_{t=1}^{\infty}\mathbb{P}_{\nu}^{\pi}\big((X_{t},A_{t})\in\Gamma\big)$$

$$\mathcal{J}_{\nu}(h,\pi)=\sum_{t=1}^{\infty}\mathbb{E}_{\nu}^{\pi}\big[h^{+}(X_{t},A_{t})\big]-\sum_{t=1}^{\infty}\mathbb{E}_{\nu}^{\pi}\big[h^{-}(X_{t},A_{t})\big]$$

$$\mathcal{J}_{\nu}(h,\pi)=\mu^{\pi}(h)$$

$$p(dx)=\sum_{k\in\mathbb{N}}\frac{1}{2^{k+1}}\,\nu P^{k}(dx)$$

$$P(dy\mid x)=\sum_{k\in\mathbb{N}^{*}}\frac{1}{2^{k}}\,Q(dy\mid x,a_{k}(x))$$

$$P(dy\mid x)=\sum_{k\in\mathbb{N}^{*}}\frac{1}{2^{k}}\,Q(dy\mid x,\xi_{k}(x))$$

$$\eta^{\Phi}(dx,da)=\mathcal{I}_{\infty}(x)\,\varphi^{\infty}(da\mid x)\,p(dx)+\varphi^{*}(da\mid x)\,p(dx)$$

$$\varphi^{\infty}(\mathbf{A}\mid x)+\varphi^{*}(\mathbf{A}\mid x)>0$$

$$\varphi^{\infty}(\mathbf{A}(x)^{c}\mid x)+\varphi^{*}(\mathbf{A}(x)^{c}\mid x)=0$$

$$\eta_{\mathbf{X}}^{\Phi}=\nu+\eta^{\Phi}Q$$

$$\eta^{\alpha\Phi_{1}+(1-\alpha)\Phi_{2}}=\alpha\,\eta^{\Phi_{1}}+(1-\alpha)\,\eta^{\Phi_{2}}$$

$$\varphi_{\Phi}(da\mid x)=I_{E_{\Phi}^{c}}(x)\,\frac{\varphi^{\infty}(da\mid x)}{\varphi^{\infty}(\mathbf{A}\mid x)}+I_{E_{\Phi}}(x)\,\frac{\varphi^{*}(da\mid x)}{\varphi^{*}(\mathbf{A}\mid x)}$$

$$E_{\Phi}=\{x\in\mathbf{X}:\varphi^{\infty}(\mathbf{A}\mid x)=0\}$$

$$\sup\big\{\eta^{\Phi}(r):\Phi\in\boldsymbol{\mathcal{K}}_{p}\text{ and }\eta^{\Phi}(c_{i})\geq\theta^{*}_{i}\text{ for }i\in\mathbb{N}_{q}\big\}$$

$$\eta^{\hat{\Phi}}(r)=\sup\big\{\eta^{\Phi}(r):\Phi\in\boldsymbol{\mathcal{K}}_{p}\text{ and }\eta^{\Phi}(c_{i})\geq\theta^{*}_{i}\text{ for }i\in\mathbb{N}_{q}\big\}$$

$$\eta^{\alpha\Phi_{1}+(1-\alpha)\Phi_{2}}(h)=\alpha\,\eta^{\Phi_{1}}(h)+(1-\alpha)\,\eta^{\Phi_{2}}(h)$$

$$\mu_{X}(dx)\ll p(dx)$$

$$\lim_{k\to\infty}\mu_{k}(\Lambda)=\mu_{X}(\Lambda)$$

$$\mu_{k+1}(\Lambda)=\nu(\Lambda)+\int_{X}\int_{A}Q(\Lambda\mid x,a)\,\varphi_{k}(da\mid x)\,\mu_{k}(dx)$$

$$\int_{X}\int_{A}Q(\cdot\mid x,a)\,\varphi_{k}(da\mid x)\,\mu_{k}(dx)\ll\int_{X}P(\cdot\mid x)\,p(dx)$$

$$\mu^{\pi}=\eta^{\Phi}$$

$$\mu^{\pi}_{t}(\Gamma)=\sum_{k=1}^{t}\mathbb{P}_{\nu}^{\pi}\big((X_{k},A_{k})\in\Gamma\big)$$

Full text

A convex programming approach for discrete-time Markov decision processes under the expected total reward criterion

F. Dufour

Institut Polytechnique de Bordeaux

INRIA Bordeaux Sud Ouest, Team: CQFD

IMB, Institut de Mathématiques de Bordeaux, Université de Bordeaux, France


A. Genadot

IMB, Institut de Mathématiques de Bordeaux, Université de Bordeaux, France

INRIA Bordeaux Sud Ouest, Team: CQFD


Abstract

In this work, we study discrete-time Markov decision processes (MDPs) under constraints with Borel state and action spaces and where all the performance functions have the same form of the expected total reward (ETR) criterion over the infinite time horizon. One of our objectives is to propose a convex programming formulation for this type of MDPs. It will be shown that the values of the constrained control problem and the associated convex program coincide and that if there exists an optimal solution to the convex program then there exists a stationary randomized policy which is optimal for the MDP. It will also be shown that in the framework of constrained control problems, the supremum of the expected total rewards over the set of randomized policies is equal to the supremum of the expected total rewards over the set of stationary randomized policies. We consider standard hypotheses such as the so-called continuity-compactness conditions and a Slater-type condition. Our assumptions are weak enough to deal with cases that have not yet been addressed in the literature. An example is presented to illustrate our results with respect to those of the literature.

Keywords: Markov decision process, expected total reward criterion, occupation measure, constraints, convex program.

AMS 2010 Subject Classification: 90C40, 60J10, 90C90.

1 Introduction

We consider a discrete-time Markov decision process with constraints when all the objectives have the same form of the expected total reward over the infinite time horizon. Markov decision processes are a general family of controlled stochastic processes, which are suitable for the modeling of sequential decision-making problems under uncertainty. They arise in many applications, such as engineering, medicine, biology, operations research, management science, economics, among others.

Markov decision processes (MDPs) under the expected total reward (ETR) criterion have been extensively studied using a variety of approaches; see e.g. [9] for a comprehensive survey on that subject and also [15, Chapter 2] for an analysis of that topic through examples.

When dealing with constraints, the linear/convex programming approach (also called the convex analytic method, see, e.g. [4, 5]) has proved to be a very powerful technique for solving MDPs. It has been extensively studied in the literature and we refer the interested reader to the following works [2, 4, 5, 10, 14] and the references therein to get an overview of this technique. The convex programming approach can be applied to a large class of control problems including for example, the finite-horizon and the infinite-horizon discounted-reward problems; see, e.g., [5] for further examples of performance functions. For such criteria, the key idea is to reformulate the original dynamic control problem as an infinite dimensional static optimization problem over a space of finite measures given by the occupation measures of the controlled process. However, it must be emphasized that the expected total reward criterion is an exception where the convex programming formulation may not be suitable except for very specific models. As mentioned in [5, p. 357-358] and [12, p. 92-93], the ETR criterion is very demanding from a technical point of view and yields some important technical difficulties which are basically of two types:

a) The first issue is directly related to the question of how to properly formulate a convex program associated with an MDP under the ETR criterion. Indeed, as described in [5], the classical and natural approach to formulate a convex program associated with an MDP is to consider as underlying vector space the set of signed finite measures and as variables the occupation measures of the process. However, in the context of the ETR criterion, this approach fails since the occupation measures are not necessarily finite and may take the value infinity. Therefore, the space of finite signed measures is not the appropriate vector space to define the convex program.

b) An important issue is related to the so-called characteristic equation satisfied by the occupation measures of the process, which is of the form:

$$\mu_{X}(\cdot)=\nu(\cdot)+\int_{X\times A}Q(\cdot|x,a)\,\mu(dx,da)$$

where $X$ and $A$ are respectively the state and action spaces, $\nu$ is the initial distribution, $Q$ is the transition probability function of the MDP and $\mu_{X}$ is the marginal of the measure $\mu$ on $X$. Indeed, a solution $\mu$ to this equation may not correspond to any occupation measure of the controlled process. This difficulty makes the analysis of the ETR criterion by the convex programming approach very involved.

The objective of the current paper is to propose a suitable convex program for MDPs under the ETR criterion. Our purpose is also to show that the value of the constrained control problem corresponds to the value of an associated convex program and that if there exists an optimal solution to the associated convex program then there exists a stationary randomized policy which is optimal for the MDP. We consider standard assumptions, the so-called continuity-compactness conditions introduced by Schäl in [16, 17]. These assumptions are of two types, namely conditions (S) and (W). Roughly speaking condition (S) requires the transition kernel to be strongly continuous whereas condition (W) refers to the case where the transition kernel is weakly continuous, see, e.g., [17, p. 367-368] for a precise statement of these assumptions. We also suppose the existence of a policy in the interior of the set of admissible policies. This is the so-called Slater condition. Conditions (W) and (S) do not play the same role in the sense that when working with condition (W) instead of condition (S) we have to consider an additional hypothesis requiring the transition kernel of the model to be absolutely continuous with respect to a Markov kernel uniformly in the action variables. Our approach differs from that classically considered in the literature in the sense that the variables of the convex program are not given by the occupation measures of the controlled process but defined on the positive cone of the vector space given by the pair of finite signed stochastic kernels on the action space given the state space.

When compared to the literature, our results appear complementary and our assumptions are rather weak. References dealing with the ETR criterion by means of the convex programming formulation are very scarce. As in our work, the results in [6, 8] are concerned with general Borel state and action spaces. However, it is important to observe that the approach proposed in [6, 8] does not correspond to a linear/convex programming formulation of an MDP under the ETR criterion. Indeed, the underlying variables of the optimization problem under consideration are given by measures that may take the value infinity and therefore, this set does not enjoy the structure of a standard vector space. This technical issue aside, the results of the current paper differ significantly from those obtained in [6, 8]. The approach developed in [6] deals with models satisfying condition (W) and strongly relies on the positiveness of the cost functions. It must be emphasized that the general framework of signed cost functions cannot be addressed with the technique presented in [6]. In [8], the model under consideration satisfies condition (S) and it was assumed that the transition kernel is absolutely continuous with respect to a reference probability measure uniformly in the state and action variables. In the present work, we show that this assumption is not needed under condition (S). It must also be observed that the approach developed in [8] for signed cost functions cannot be applied under condition (W). In [2, Chapter 8], the model is transient or absorbing and is restricted to discrete state and action spaces. Here, we do not require the MDP to be transient or absorbing. Another advantage of our approach is that it provides a convex programming formulation for constrained MDPs under the ETR criterion with signed reward functions and satisfying condition (W). In this context, such a formulation has not so far been investigated in the literature. It should also be mentioned that in our work we impose the so-called Slater condition, which is not required in [2, 6, 8]. However, this condition is rather weak and it is a standard assumption in convex optimization problems with constraints, see e.g. [3].

The rest of the paper is organized as follows. In Section 2, we present the control problem that will be considered throughout this work. The assumptions and the convex programming formulation of a constrained discrete-time MDP under the ETR criterion are introduced in Section 3. Important properties of the convex program as well as the constrained control problem are established in Section 4. Our main results are presented in Section 5, showing that the original control problem is equivalent to the convex program. Section 6 is dedicated to the presentation of an example illustrating our results. Finally, a technical result used in Section 4 is derived in an appendix.

2 Description of the control problem

The main goal of this section is to introduce the notation, the parameters defining the model, and to present the construction of the controlled process.

2.1 Notation and terminology

The following basic notation will be used in what follows.

The set of integers is denoted by $\mathbb{Z}$ and $\mathbb{N}$ corresponds to the non-negative integers, that is, $\mathbb{N}=\{0,1,2,\ldots\}$ . The set of real numbers is given by $\mathbb{R}$ . For any subset $\mathbb{D}$ of $\mathbb{R}$ , $\mathbb{D}^{*}$ denotes $\mathbb{D}\setminus\{0\}$ and $\mathbb{D}_{+}=\{d\in\mathbb{D}:d\geq 0\}$ . We write $\mathbb{N}_{p}$ for $\{1,\ldots,p\}$ with $p\in\mathbb{N}^{*}$ , $\overline{\mathbb{R}}$ is the set of extended real numbers, that is, $\mathbb{R}\cup\{-\infty,+\infty\}$ and $\overline{\mathbb{R}}_{+}=\mathbb{R}_{+}\cup\{+\infty\}$ . Given $x$ and $y$ in the Euclidean space $\mathbb{R}^{n}$ , let $\langle x,y\rangle$ be the usual inner product of $x$ and $y$ . By $|x|=\langle x,x\rangle^{1/2}$ we will denote the norm of $x\in\mathbb{R}^{n}$ . Let $\mathbf{0}_{n}$ be the element of $\mathbb{R}^{n}$ with all components equal to zero. If $\theta_{1}$ and $\theta_{2}$ are in $\mathbb{R}^{n}$ , we shall write $\theta_{1}\geq\theta_{2}$ when all the components of $\theta_{1}$ are greater than or equal to the corresponding components of $\theta_{2}$ .

Let $X$ be a metric space and denote by $\boldsymbol{\mathfrak{B}}(X)$ its associated Borel $\sigma$ -algebra. We use the symbol $f^{+}$ (respectively $f^{-}$ ) to denote the positive part (respectively, negative part) of a function $f:X\rightarrow\overline{\mathbb{R}}$ . The function $\mathcal{I}_{\infty}$ is the function whose values are constant and equal to $+\infty$ . If $X$ is a metric space, $\boldsymbol{\mathscr{M}}(X)$ denotes the set of real-valued measurable functions defined on $X$ . Furthermore, $\boldsymbol{\mathcal{C}}(X)$ is the space of real-valued bounded continuous functions defined on $X$ . The term measure will always refer to a countably additive, $\overline{\mathbb{R}}_{+}$ -valued set function. The set of measures defined on $(X,\boldsymbol{\mathfrak{B}}(X))$ is denoted by $\boldsymbol{\mathcal{M}}(X)$ and the set of probability measures on $(X,\boldsymbol{\mathfrak{B}}(X))$ by $\boldsymbol{\mathcal{P}}(X)$ . For $\mu\in\boldsymbol{\mathcal{M}}(X)$ and a positive function $h$ in $\boldsymbol{\mathscr{M}}(X)$ , $\mu(h)=\int_{X}h(x)\mu(dx)$ and for $g\in\boldsymbol{\mathscr{M}}(X)$ , $\mu(g)$ is defined by $\mu(g^{+})-\mu(g^{-})$ where by convention $(+\infty)-(+\infty)=-\infty$ . Consider two metric spaces $X$ and $Y$ . If $\mu$ is a measure on $X\times Y$ then $\mu_{X}$ denotes the marginal of the measure $\mu$ on $X$ . A kernel $K$ on $X$ given $Y$ is a $\overline{\mathbb{R}}_{+}$ -valued mapping defined on $\boldsymbol{\mathfrak{B}}(X)\times Y$ such that for any $y\in Y$ , $K(\cdot|y)\in\boldsymbol{\mathcal{M}}(X)$ and for any $\Lambda\in\boldsymbol{\mathfrak{B}}(X)$ , $K(\Lambda|\cdot)$ is a measurable function defined on $Y$ . A kernel $K$ on $X$ given $Y$ is said to be finite if $K(X|y)\in\mathbb{R}_{+}$ for any $y\in Y$ . The set of finite kernels on $X$ given $Y$ is denoted $\boldsymbol{\mathcal{K}}(X|Y)$ . 
A stochastic (or Markov) kernel $K$ on $X$ given $Y$ is a kernel in $\boldsymbol{\mathcal{K}}(X|Y)$ satisfying $K(X|y)=1$ for any $y\in Y$ . The set of stochastic kernels on $X$ given $Y$ will be denoted by $\boldsymbol{\mathcal{P}}(X|Y)$ . Let $Q$ be a stochastic kernel on $X$ given $Y$ , then, for a function $v:X\rightarrow\overline{\mathbb{R}}$ , we define $Qv:Y\rightarrow\overline{\mathbb{R}}$ as

$$Qv(y):=\int_{X}v^{+}(x)\,Q(dx|y)-\int_{X}v^{-}(x)\,Q(dx|y),$$

provided that $v$ is quasi-integrable with respect to the probability measure $Q(\cdot|y)$ for any $y\in Y$ . For a measure $\mu$ on $Y$ , we denote by $\mu Q$ the measure $\displaystyle\int_{Y}Q(\cdot|y)\mu(dy)$ on $X$ .
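On finite sets the conventions of this subsection can be made concrete. The following Python sketch is only an illustration under assumptions that are not in the paper: the spaces are finite, a measure is a dictionary of non-negative weights (possibly infinite), and the helper names `times`, `integrate`, `apply_kernel` and `push_measure` are hypothetical. It implements $\mu(g)=\mu(g^{+})-\mu(g^{-})$ with the convention $(+\infty)-(+\infty)=-\infty$, the function $Qv$, and the measure $\mu Q$.

```python
import math

INF = math.inf

def times(weight, value):
    """Product of a measure weight and a non-negative value, with 0 * inf = 0."""
    if weight == 0 or value == 0:
        return 0.0
    return weight * value

def integrate(mu, g):
    """mu(g) = mu(g+) - mu(g-), with the convention (+inf) - (+inf) = -inf."""
    pos = sum(times(w, max(g[x], 0.0)) for x, w in mu.items())
    neg = sum(times(w, max(-g[x], 0.0)) for x, w in mu.items())
    if neg == INF:  # covers both (finite) - inf and inf - inf
        return -INF
    return pos - neg

def apply_kernel(Q, v):
    """(Qv)(y): integral of v against the measure Q(.|y), for each y."""
    return {y: integrate(Q[y], v) for y in Q}

def push_measure(mu, Q):
    """(mu Q)(x) = sum over y of Q(x|y) mu(y), a measure on X."""
    out = {}
    for y, w in mu.items():
        for x, q in Q[y].items():
            out[x] = out.get(x, 0.0) + times(w, q)
    return out

# Hypothetical data: a stochastic kernel on X = {"a", "b"} given Y = {"y1", "y2"}.
Q = {"y1": {"a": 0.5, "b": 0.5}, "y2": {"b": 1.0}}
v = {"a": 2.0, "b": -4.0}
mu = {"y1": 1.0, "y2": INF}  # a measure on Y with infinite mass at "y2"

print(apply_kernel(Q, v))     # Qv as a function on Y
print(push_measure(mu, Q))    # mu Q as a measure on X
```

Representing infinite mass explicitly, rather than relying on floating-point `inf - inf` (which yields `nan`), is the design choice that lets the sketch follow the stated convention.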

2.2 The control model.

Let us consider the stationary model

$$\big(\mathbf{X},\mathbf{A},\{\mathbf{A}(x):x\in\mathbf{X}\},Q,r,c,\theta^{*},\nu\big)$$

consisting of:

(a) A Borel space $\mathbf{X}$ (that is, a Borel subset of a complete and separable metric space), which is the state space.

(b) A Borel space $\mathbf{A}$, representing the control or action set.

(c) A family $\{\mathbf{A}(x):x\in\mathbf{X}\}$ of non-empty measurable subsets of $\mathbf{A}$, where $\mathbf{A}(x)$ is the set of feasible controls or actions when the system is in state $x\in\mathbf{X}$. We suppose that

$$\mathbf{K}:=\{(x,a)\in\mathbf{X}\times\mathbf{A}:a\in\mathbf{A}(x)\}$$

is a measurable subset of $\mathbf{X}\times\mathbf{A}$. There exists a measurable map $\vartheta:\mathbf{X}\rightarrow\mathbf{A}$ with $\vartheta(x)\in\mathbf{A}(x)$. For notational convenience, we introduce recursively the set $\mathbf{H}_{t}$ of histories up to time $t\in\mathbb{N}^{*}$ by defining $\mathbf{H}_{1}=\mathbf{X}$ and $\mathbf{H}_{t}=\mathbf{K}^{t-1}\times\mathbf{X}$ for $t\geq 2$.

(d) A stochastic kernel $Q$ on $\mathbf{X}$ given $\mathbf{K}$, which stands for the transition probability function.

(e) The one-step reward function is given by a measurable function $r:\mathbf{K}\rightarrow\mathbb{R}$.

(f) For $i\in\mathbb{N}_{q}$, the measurable mappings $c_{i}:\mathbf{K}\rightarrow\mathbb{R}$ are the one-step constraint functions.

(g) The constraint limits are real numbers given by $\theta^{*}=\big\{\theta^{*}_{i}\big\}_{i\in\mathbb{N}_{q}}$.

(h) Finally, the initial distribution is $\nu\in\boldsymbol{\mathcal{P}}(\mathbf{X})$.

A control policy (a policy, for short) is a sequence $\pi=\{\pi_{t}\}_{t\in\mathbb{N}^{*}}$ of stochastic kernels $\pi_{t}$ on $\mathbf{A}$ given $\mathbf{H}_{t}$ such that $\pi_{t}(\mathbf{A}(x_{t})|h_{t})=1$ for any $h_{t}=(x_{1},a_{1},\ldots,x_{t})\in\mathbf{H}_{t}$. Let $\Pi$ be the set of all policies. A policy $\pi=\{\pi_{t}\}_{t\in\mathbb{N}^{*}}\in\Pi$ is called a stationary randomized policy if there exists a stochastic kernel $\boldsymbol{\varphi}$ on $\mathbf{A}$ given $\mathbf{X}$ satisfying $\boldsymbol{\varphi}(\mathbf{A}(x)|x)=1$ for any $x\in\mathbf{X}$ and $\pi_{t}(\cdot|h_{t})=\boldsymbol{\varphi}(\cdot|x_{t})$ for any $h_{t}=(x_{1},a_{1},\ldots,x_{t})\in\mathbf{H}_{t}$ and $t\in\mathbb{N}^{*}$. In such a case, we will write $\boldsymbol{\varphi}$ instead of $\pi$ to emphasize that the corresponding stationary randomized policy $\pi$ is generated by $\boldsymbol{\varphi}$. Let $\Pi_{s}$ be the set of all stationary randomized policies.

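To state the optimal control problem we are concerned with, we introduce the canonical space $(\Omega,\mathcal{F})$ consisting of the set of sample paths $\Omega=(\mathbf{X}\times\mathbf{A})^{\infty}$ and the associated product $\sigma$-algebra $\mathcal{F}$. The projections from $\Omega$ to the state space and the action space at time $t$ are denoted by $X_{t}$ and $A_{t}$. That is, for

$$\omega=(y_{1},b_{1},\ldots,y_{t},b_{t},\ldots)\in\Omega\quad\text{we have}\quad X_{t}(\omega)=y_{t}\text{ and }A_{t}(\omega)=b_{t}$$

for $t\in\mathbb{N}^{*}$. Consequently, $\{X_{t}\}_{t\in\mathbb{N}^{*}}$ is the state process and $\{A_{t}\}_{t\in\mathbb{N}^{*}}$ is the control process. It is a well-known result that for every policy $\pi\in\Pi$ and any initial probability measure $\nu$ on $(\mathbf{X},\boldsymbol{\mathfrak{B}}(\mathbf{X}))$ there exists a unique probability measure $\mathbb{P}_{\nu}^{\pi}$ on $(\Omega,\mathcal{F})$ such that $\mathbb{P}_{\nu}^{\pi}(\mathbf{K}^{\infty})=1$ and

$$\mathbb{P}_{\nu}^{\pi}(X_{1}\in B)=\nu(B)\quad\text{for }B\in\boldsymbol{\mathfrak{B}}(\mathbf{X}),$$

$$\mathbb{P}_{\nu}^{\pi}(X_{t+1}\in C\,|\,\sigma\{X_{1},\ldots,X_{t},A_{t}\})=Q(C|X_{t},A_{t})\quad\text{for }C\in\boldsymbol{\mathfrak{B}}(\mathbf{X}),$$

$$\mathbb{P}_{\nu}^{\pi}(A_{t}\in D\,|\,\sigma\{X_{1},\ldots,X_{t-1},A_{t-1},X_{t}\})=\pi_{t}(D|X_{1},\ldots,X_{t-1},A_{t-1},X_{t})\quad\text{for }D\in\boldsymbol{\mathfrak{B}}(\mathbf{A}),$$

$\mathbb{P}_{\nu}^{\pi}$-a.s., for any $t\in\mathbb{N}^{*}$.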

The expectation with respect to $\mathbb{P}_{\nu}^{\pi}$ is denoted by $\mathbb{E}_{\nu}^{\pi}$ . The so-called occupation measure generated by a policy $\pi\in\Pi$ , denoted by $\mu^{\pi}$ , is defined by

$$\mu^{\pi}(\Gamma)=\sum_{t=1}^{\infty}\mathbb{P}_{\nu}^{\pi}\big((X_{t},A_{t})\in\Gamma\big)$$

for any $\Gamma\in\boldsymbol{\mathfrak{B}}(\mathbf{X}\times\mathbf{A})$ . Denote by $\boldsymbol{\mathcal{O}}$ (respectively, $\boldsymbol{\mathcal{O}}_{s}$ ) the set of occupation measures generated by randomized (respectively, stationary) policies.

Statement of the control problem.

For $h\in\boldsymbol{\mathscr{M}}(\mathbf{K})$ and $\pi\in\Pi$ , define $\mathcal{J}_{\nu}(h,\pi)$ by

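$$\mathcal{J}_{\nu}(h,\pi)=\sum_{t=1}^{\infty}\mathbb{E}_{\nu}^{\pi}\big[h^{+}(X_{t},A_{t})\big]-\sum_{t=1}^{\infty}\mathbb{E}_{\nu}^{\pi}\big[h^{-}(X_{t},A_{t})\big]$$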

where by convention $(+\infty)-(+\infty)=-\infty$ . In fact, assumptions will be introduced in the next section to avoid dealing with such cases. Observe that $\mathcal{J}_{\nu}(h,\pi)$ can be written equivalently in terms of the occupation measure generated by the policy $\pi\in\Pi$ as follows

$$\mathcal{J}_{\nu}(h,\pi)=\mu^{\pi}(h).$$

In this paper, we will repeatedly use this equality without mentioning it.
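To make the occupation measure and the identity $\mathcal{J}_{\nu}(h,\pi)=\mu^{\pi}(h)$ tangible, here is a minimal Python sketch on a finite model. Everything in it is an assumption made for illustration and does not come from the paper: the two-state model (a live state `"s"` and an absorbing state `"d"`), the stationary policy, and the helper names `truncated_occupation` and `total_reward`. It accumulates the truncated sums $\sum_{k=1}^{t}\mathbb{P}_{\nu}^{\pi}\big((X_{k},A_{k})\in\cdot\big)$ by propagating the law of $X_{k}$ forward in time.

```python
def truncated_occupation(nu, phi, Q, horizon):
    """mu_t(x, a) = sum_{k=1..t} P(X_k = x, A_k = a) under the stationary policy phi."""
    mu = {}
    dist = dict(nu)  # law of X_k, starting with k = 1
    for _ in range(horizon):
        nxt = {}
        for x, px in dist.items():
            for a, pa in phi[x].items():
                mass = px * pa  # P(X_k = x, A_k = a)
                mu[(x, a)] = mu.get((x, a), 0.0) + mass
                for y, q in Q[(x, a)].items():
                    nxt[y] = nxt.get(y, 0.0) + mass * q
        dist = nxt
    return mu

def total_reward(mu, h):
    """mu(h) = sum over (x, a) of h(x, a) * mu(x, a), for a finite measure mu."""
    return sum(h[xa] * m for xa, m in mu.items())

# Hypothetical two-state model: "s" is a live state, "d" is absorbing.
nu = {"s": 1.0}
phi = {"s": {"go": 1.0}, "d": {"go": 1.0}}
Q = {("s", "go"): {"s": 0.5, "d": 0.5}, ("d", "go"): {"d": 1.0}}
r = {("s", "go"): 1.0, ("d", "go"): 0.0}

mu_50 = truncated_occupation(nu, phi, Q, 50)
print(total_reward(mu_50, r))  # close to 2, the expected total reward
print(mu_50[("d", "go")])      # close to 48: the mass at "d" keeps growing with the horizon
```

In this toy model the reward is finite although the occupation measure puts unbounded mass on the absorbing state as the horizon grows, which is the kind of non-finite occupation measure the ETR criterion has to accommodate.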

Definition 2.1

A policy $\pi\in\Pi$ is said to be admissible if $\mathcal{J}_{\nu}(c_{i},\pi)\geq\theta^{*}_{i}$ for $i\in\mathbb{N}_{q}$ . The set of admissible policies will be denoted by $\Pi_{\theta^{*}}$ . The optimal control problem we consider consists in maximizing the expected reward $\mathcal{J}_{\nu}(r,\pi)$ over the set of admissible policies $\pi\in\Pi_{\theta^{*}}$ . The value associated to this constrained control problem is given by $\sup\big{\{}\mathcal{J}_{\nu}(r,\pi):\pi\in\Pi_{\theta^{*}}\big{\}}$ . A policy $\hat{\pi}\in\Pi$ is optimal if $\hat{\pi}\in\Pi_{\theta^{*}}$ and $\mathcal{J}_{\nu}(r,\hat{\pi})=\sup\big{\{}\mathcal{J}_{\nu}(r,\pi):\pi\in\Pi_{\theta^{*}}\big{\}}$ .

3 Assumptions and the convex programming formulation

The objective of this section is both to list the assumptions we will use in this work and to introduce the convex program associated with the control problem we presented in the previous section. In this work, we deal with MDPs satisfying the so-called Conditions (W) or (S) which are standard hypotheses of the literature, see for example [16].

Condition (W)

(W1)

For any $x\in\mathbf{X}$ , the action set $\mathbf{A}(x)$ is compact and the multifunction from $\mathbf{X}$ to $\mathbf{A}$ defined by $x\rightarrow\mathbf{A}(x)$ is upper-semicontinuous.

(W2)

For any $f\in\boldsymbol{\mathcal{C}}(\mathbf{X})$ , $Qf$ is continuous on $\mathbf{K}$ .

(W3)

The reward $r$ and the constraint $c_{i}$ for $i\in\mathbb{N}_{q}$ are upper-semicontinuous on $\mathbf{K}$ .

Condition (S)

(S1)

For any $x\in\mathbf{X}$ , $\mathbf{A}(x)$ is compact.

(S2)

For any $x\in\mathbf{X}$ and $\Lambda\in\boldsymbol{\mathfrak{B}}(\mathbf{X})$ , $Q(\Lambda|x,\cdot)$ is continuous on $\mathbf{A}(x)$ .

(S3)

For any $x\in\mathbf{X}$ , the reward $r(x,\cdot)$ and the constraint $c_{i}(x,\cdot)$ for $i\in\mathbb{N}_{q}$ are upper-semicontinuous on $\mathbf{A}(x)$ .

In order to introduce the convex program associated with an MDP under the ETR criterion, we need to make some hypotheses. First, it is assumed that the transition kernel $Q$ of the MDP under consideration is absolutely continuous with respect to a Markov kernel $P$ (see Assumption A). This hypothesis is rather weak and is satisfied in a large number of practical cases, as discussed in the remark below.

Assumption A.

There exists $P\in\boldsymbol{\mathcal{P}}(\mathbf{X}|\mathbf{X})$ satisfying $Q(\cdot|x,a)\ll P(\cdot|x)$ for any $(x,a)\in\mathbf{K}$ . Associated to the kernel $P$ , $p$ will denote the probability measure on $\mathbf{X}$ defined by

$$p(dx)=\sum_{k\in\mathbb{N}}\frac{1}{2^{k+1}}\,\nu P^{k}(dx).\tag{2}$$

Remark 3.1

1. In Lemma 3.2 below, it is shown that under Conditions (S1) and (S2), Assumption A is satisfied.

2. If the sets of feasible actions are countable, that is, $\mathbf{A}(x)=\{a_{k}(x)\}_{k\in\mathbb{N}^{*}}$ where for any $k\in\mathbb{N}^{*}$, $a_{k}$ is a measurable function from $\mathbf{X}$ to $\mathbf{A}$, then Assumption A is satisfied for $P$ defined by

$$P(dy|x)=\sum_{k\in\mathbb{N}^{*}}\frac{1}{2^{k}}\,Q(dy|x,a_{k}(x)),$$

for any $x\in\mathbf{X}$.

3. If $Q(\cdot|x,a)\ll q(\cdot)$ for any $(x,a)\in\mathbf{K}$, for some reference probability measure $q$ on $\mathbf{X}$, then clearly Assumption A is satisfied. This condition corresponds to the main hypothesis used in [8]. It is of course less general than Assumption A but it is naturally satisfied for a large class of practical systems. Indeed, in many applications, the evolution of an MDP is specified by a discrete-time equation of the form $x_{t+1}=F(x_{t},a_{t})+\xi_{t}$ where $F$ is an $\mathbb{R}^{n}$-valued measurable mapping defined on $\mathbb{R}^{n}\times A$ and $(\xi_{t})_{t\in\mathbb{N}^{*}}$ is an independent and identically distributed sequence of random variables with density $\alpha$ with respect to the Lebesgue measure on $\boldsymbol{\mathfrak{B}}(\mathbb{R}^{n})$. By using the change of variable formula, we obtain that $\displaystyle Q(A|x,a)=\int_{A}\alpha(y-F(x,a))dy$, showing that $Q(\cdot|x,a)\ll q(\cdot)$ for any $(x,a)\in\mathbf{K}$ is satisfied for $q$ defined for example by the standard normal distribution on $\boldsymbol{\mathfrak{B}}(\mathbb{R}^{n})$.

Observe also that when $\mathbf{X}$ is finite or countable, $Q(\cdot|x,a)\ll q(\cdot)$ for any $(x,a)\in\mathbf{K}$ is satisfied when $q$ is given for example by a geometric distribution.
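The construction of a dominating kernel for countable action sets, together with the reference measure $p$ of Assumption A, can be sketched on a finite model as follows. This is an illustration under assumptions not made in the paper: states and actions are finite, each finite action list is turned into a sequence $\{a_{k}(x)\}_{k\in\mathbb{N}^{*}}$ by repeating its last element (so the geometric weights still sum to one), the series defining $p$ is truncated, and the names `dominating_kernel`, `reference_measure` and the toy model are hypothetical.

```python
from fractions import Fraction

def dominating_kernel(Q, actions):
    """P(.|x) = sum_k 2^-k Q(.|x, a_k(x)), enumerating A(x) with the last action repeated."""
    P = {}
    for x, acts in actions.items():
        row = {}
        for k, a in enumerate(acts, start=1):
            # The last action absorbs the tail sum_{j >= k} 2^-j = 2^-(k-1).
            w = Fraction(1, 2 ** (k - 1)) if k == len(acts) else Fraction(1, 2 ** k)
            for y, q in Q[(x, a)].items():
                row[y] = row.get(y, 0) + w * q
        P[x] = row
    return P

def reference_measure(nu, P, terms):
    """Partial sum of p = sum_{k >= 0} 2^-(k+1) nu P^k, truncated after `terms` terms."""
    p, dist = {}, dict(nu)
    for k in range(terms):
        for x, m in dist.items():
            p[x] = p.get(x, 0) + Fraction(1, 2 ** (k + 1)) * m
        nxt = {}
        for x, m in dist.items():
            for y, q in P[x].items():
                nxt[y] = nxt.get(y, 0) + m * q
        dist = nxt
    return p

# Hypothetical model: from state 0, action "l" leads to 1 and action "r" leads to 2.
actions = {0: ["l", "r"], 1: ["l"], 2: ["l"]}
Q = {(0, "l"): {1: 1}, (0, "r"): {2: 1}, (1, "l"): {1: 1}, (2, "l"): {2: 1}}
P = dominating_kernel(Q, actions)
p = reference_measure({0: 1}, P, terms=3)

print(P[0])             # mixes the two successor laws of state 0
print(sum(p.values()))  # 7/8: the truncated series misses mass 2^-3
```

Exact rational arithmetic is used so that the normalisation of $P$ and the missing tail mass of the truncated series can be read off without rounding error. In this finite setting, absolute continuity $Q(\cdot|x,a)\ll P(\cdot|x)$ reduces to an inclusion of supports.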

Lemma 3.2

Conditions (S1) and (S2) imply Assumption A, that is, $Q\ll P$ with $P\in\boldsymbol{\mathcal{P}}(\mathbf{X}|\mathbf{X})$ given by

$$P(dy|x)=\sum_{k\in\mathbb{N}^{*}}\frac{1}{2^{k}}\,Q(dy|x,\xi_{k}(x))\tag{3}$$

where $\{\xi_{k}\}_{k\in\mathbb{N}^{*}}$ is a sequence of measurable selectors from the multifunction defined from $\mathbf{X}$ to $\mathbf{A}$ by $x\rightarrow\mathbf{A}(x)$ and satisfying $\mathbf{A}(x)=\overline{\{\xi_{n}(x):n\in\mathbb{N}^{*}\}}$ for any $x\in\mathbf{X}$ .

Proof: The multifunction $\boldsymbol{\mathfrak{A}}$ from $\mathbf{X}$ to $\mathbf{A}$ defined by $x\rightarrow\mathbf{A}(x)$ is by assumption Borel measurable and so, weakly measurable. From (S1), Corollary 18.15 in [1] gives the existence of a sequence $\{\xi_{n}\}_{n\in\mathbb{N}^{*}}$ of measurable selectors from the multifunction $\boldsymbol{\mathfrak{A}}$ satisfying $\mathbf{A}(x)=\overline{\{\xi_{n}(x):n\in\mathbb{N}^{*}\}}$ for any $x\in\mathbf{X}$ . Now by using (S2), we obtain that $Q(dy|x,a)\ll P(dy|x)$ for any $(x,a)\in\mathbf{K}$ for the Markov kernel $P$ defined by (3). $\Box$

Remark 3.3

The previous proof is an extension of an argument used in the proof of Theorem 1 in [13, p. 183].

In the next definition, we introduce the set of feasible variables. It will be shown below that it is a convex subset of the vector space of finite signed kernels on $\mathbf{A}$ given $\mathbf{X}$ .

Definition 3.4

Suppose Assumption A holds and let $p$ be the measure introduced in (2).

•

For $\Phi=(\varphi^{\infty},\varphi^{*})\in\boldsymbol{\mathcal{K}}(\mathbf{A}|\mathbf{X})^{2}$, $\eta^{\Phi}$ will denote the measure in $\boldsymbol{\mathcal{M}}(\mathbf{X}\times\mathbf{A})$ given by

$$\eta^{\Phi}(dx,da)=\mathcal{I}_{\infty}(x)\,\varphi^{\infty}(da|x)\,p(dx)+\varphi^{*}(da|x)\,p(dx),$$

recalling that $\mathcal{I}_{\infty}$ is the constant function equal to $+\infty$.

•

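Consider $\boldsymbol{\mathcal{K}}_{p}$ as the set of $\Phi=(\varphi^{\infty},\varphi^{*})\in\boldsymbol{\mathcal{K}}(\mathbf{A}|\mathbf{X})^{2}$ satisfying

$$\varphi^{\infty}(\mathbf{A}|x)+\varphi^{*}(\mathbf{A}|x)>0,\qquad\varphi^{\infty}(\mathbf{A}(x)^{c}|x)+\varphi^{*}(\mathbf{A}(x)^{c}|x)=0,$$

and

$$\eta_{\mathbf{X}}^{\Phi}=\nu+\eta^{\Phi}Q.$$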

Any $\Phi\in\boldsymbol{\mathcal{K}}_{p}$ induces a measure $\eta^{\Phi}$ that will be called the $\boldsymbol{\mathcal{K}}_{p}$ -measure generated by $\Phi$ . $\boldsymbol{\mathcal{K}}_{p}$ is called the set of feasible variables.

Remark 3.5

Observe first that $\alpha\Phi_{1}+(1-\alpha)\Phi_{2}\in\boldsymbol{\mathcal{K}}_{p}$ and in particular,

$$\eta^{\alpha\Phi_{1}+(1-\alpha)\Phi_{2}}=\alpha\,\eta^{\Phi_{1}}+(1-\alpha)\,\eta^{\Phi_{2}},\tag{5}$$

for any $\alpha\in[0,1]$ and $(\Phi_{1},\Phi_{2})\in\boldsymbol{\mathcal{K}}_{p}^{2}$. Therefore, $\boldsymbol{\mathcal{K}}_{p}$ is a convex subset of the vector space of pairs of finite signed kernels on $\mathbf{A}$ given $\mathbf{X}$.

Definition 3.6

Let $\Phi=(\varphi^{\infty},\varphi^{*})\in\boldsymbol{\mathcal{K}}_{p}$ . Introduce the kernel $\varphi_{\Phi}$ on $\mathbf{A}$ given $\mathbf{X}$ defined by

[TABLE]

where

[TABLE]

Observe that $\varphi_{\Phi}$ is a stochastic kernel satisfying $\varphi_{\Phi}(\mathbf{A}(x)|x)=1$ for any $x\in\mathbf{X}$ . The stationary randomized policy $\varphi_{\Phi}$ will be called the policy induced by $\Phi$ .
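To make the normalisation behind the induced policy concrete, here is a minimal finite-state sketch. It assumes a reading that is consistent with the identities used later in the proof of Theorem 4.3: where $\varphi^{\infty}(\mathbf{A}|x)=0$ the policy is $\varphi^{*}(\cdot|x)$ normalised, and elsewhere it is $\varphi^{\infty}(\cdot|x)$ normalised. The function name `induced_policy`, the state labels and the weights are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def induced_policy(phi_inf, phi_star):
    """Normalise a finite pair (phi_inf, phi_star) into a stochastic kernel.

    Both arguments map each state x to a dict {action: nonnegative weight}.
    Assumed rule: use phi_star where phi_inf has zero total mass at x,
    and phi_inf otherwise.
    """
    policy = {}
    for x in phi_star:
        mass_inf = sum(phi_inf[x].values())
        source = phi_star[x] if mass_inf == 0 else phi_inf[x]
        total = sum(source.values())
        if total <= 0:
            # Feasible variables are required to have positive total mass.
            raise ValueError(f"no positive mass at state {x!r}")
        policy[x] = {a: Fraction(w) / total for a, w in source.items()}
    return policy


# Hypothetical toy data: two states, weights chosen only for illustration.
phi_inf = {"s": {"a": 0, "b": 0}, "t": {"a": 3}}
phi_star = {"s": {"a": 1, "b": 1}, "t": {"a": 0}}
policy = induced_policy(phi_inf, phi_star)
```

In this toy instance the two actions are equally likely at state `s`, and the single action is chosen with probability one at state `t`, so each row of `policy` sums to one.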

We will also need the following technical hypothesis:

Assumption B.
(B.1)

$\displaystyle\sup\big{\{}\eta^{\Phi}(r^{+}):\Phi\in\boldsymbol{\mathcal{K}}_{p}\big{\}}<+\infty$ and $\displaystyle\sup\big{\{}\eta^{\Phi}(c^{+}_{i}):\Phi\in\boldsymbol{\mathcal{K}}_{p}\big{\}}<+\infty$ for any $i\in\mathbb{N}_{q}$ .

(B.2)

$\mu(r^{-})<+\infty$ and $\mu(c^{-}_{i})<+\infty$ for any $\mu\in\boldsymbol{\mathcal{O}}$ , $i\in\mathbb{N}_{q}$ .

This hypothesis is comparable to Assumption (A2) introduced in [8, p. 847]. Assumption (B.1) essentially requires that the values of the unconstrained convex programs associated with a reward function given by either $r$ or $c_{i}$ for $i\in\mathbb{N}_{q}$ are different from $+\infty$ , while Assumption (B.2) ensures that the performance criteria associated with the reward $r$ and the constraints $c_{i}$ for $i\in\mathbb{N}_{q}$ are not equal to $-\infty$ . In particular, Assumption (B.1) will be used to introduce the convex program.

Definition 3.7

Suppose Assumptions 3 and (B.1) hold. The convex program, denoted by $\boldsymbol{\mathcal{KP}}_{p}$ , consists in maximizing $\eta^{\Phi}(r)$ over $\Phi\in\boldsymbol{\mathcal{K}}_{p}$ subject to $\eta^{\Phi}(c_{i})\geq\theta^{*}_{i}$ for any $i\in\mathbb{N}_{q}$ . The value of the convex program is given by

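$$\sup\big\{\eta^{\Phi}(r):\Phi\in\boldsymbol{\mathcal{K}}_{p}\text{ and }\eta^{\Phi}(c_{i})\geq\theta^{*}_{i}\text{ for }i\in\mathbb{N}_{q}\big\}. \tag{8}$$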

A variable $\hat{\Phi}\in\boldsymbol{\mathcal{K}}_{p}$ is said to be an optimal solution to the convex program $\boldsymbol{\mathcal{KP}}_{p}$ if

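$$\eta^{\hat{\Phi}}(r)=\sup\big\{\eta^{\Phi}(r):\Phi\in\boldsymbol{\mathcal{K}}_{p}\text{ and }\eta^{\Phi}(c_{i})\geq\theta^{*}_{i}\text{ for }i\in\mathbb{N}_{q}\big\}$$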

and $\eta^{\hat{\Phi}}(c_{i})\geq\theta^{*}_{i}$ for any $i\in\mathbb{N}_{q}$ .

Remark 3.8

Let $h$ be a function given by either $r$ or $c_{i}$ for $i\in\mathbb{N}_{q}$ . From Assumption (B.1), it follows that $\alpha\eta^{\Phi_{1}}(h)+(1-\alpha)\eta^{\Phi_{2}}(h)$ is well defined for any $\alpha\in[0,1]$ and $(\Phi_{1},\Phi_{2})\in\boldsymbol{\mathcal{K}}_{p}^{2}$ . Therefore, we obtain from equation (5) that

[TABLE]

for any $\alpha\in[0,1]$ and $(\Phi_{1},\Phi_{2})\in\boldsymbol{\mathcal{K}}_{p}^{2}$ . This implies that the mathematical program defined in (8) is indeed a convex program. In [3, p. 153], a convex program is written in terms of an infimum. The $\boldsymbol{\mathcal{KP}}_{p}$ program introduced in Definition 3.7 can be equivalently written in terms of an infimum by changing the sign of the reward function. We prefer to keep this setting to deal with an MDP under a reward optimization criterion.

Finally, we introduce an additional standard hypothesis:

The Slater condition

There exists $\mu^{*}\in\boldsymbol{\mathcal{O}}$ such that $\theta^{*}_{i}<\mu^{*}(c_{i})$ for any $i\in\mathbb{N}_{q}$ .

4 Preliminary results

The main goal of this section is to establish several properties of the constrained control problem as well as properties of the convex program.

4.1 Properties of the convex program

In this subsection, we will show in Lemma 4.2 that for any stationary randomized policy $\pi\in\Pi_{s}$ there exists $\Phi\in\boldsymbol{\mathcal{K}}_{p}$ such that the $\boldsymbol{\mathcal{K}}_{p}$ -measure generated by $\Phi$ is equal to the occupation measure generated by the stationary randomized policy $\pi$ . An important result, which is a cornerstone of the paper, is presented at the end of this subsection. It can be roughly stated as follows: for any feasible variable $\Phi\in\boldsymbol{\mathcal{K}}_{p}$ of the convex program, the reward $\mathcal{J}_{\nu}(h,\varphi_{\Phi})$ associated with the stationary randomized policy $\varphi_{\Phi}\in\Pi_{s}$ is greater than or equal to $\eta^{\Phi}(h)$ for specific functions $h$ that will be discussed in Theorem 4.3. To get these results, we first need to establish that the occupation measures of the controlled process have a special structure, that is, the marginal on $\mathbf{X}$ of any occupation measure is absolutely continuous with respect to the probability measure $p$ introduced in Assumption 3.

Lemma 4.1

Suppose Assumption 3 holds. Then for any $\mu\in\boldsymbol{\mathcal{O}}$ ,

$$\mu_{\mathbf{X}}\ll p,$$

where $p\in\boldsymbol{\mathcal{P}}(\mathbf{X})$ is defined in (2).

Proof: For any $\mu\in\boldsymbol{\mathcal{O}}$ , Lemma 9.4.3 in [11] readily yields an increasing sequence of finite measures $\{\mu_{k}\}_{k\in\mathbb{N}^{*}}$ on $\mathbf{X}$ and a sequence of stochastic kernels $\{\varphi_{k}\}_{k\in\mathbb{N}^{*}}$ on $\mathbf{A}$ given $\mathbf{X}$ satisfying $\varphi_{k}(\mathbf{A}(x)|x)=1$ and

[TABLE]

and

[TABLE]

for $\Lambda\in\boldsymbol{\mathfrak{B}}(\mathbf{X})$ , $k\in\mathbb{N}^{*}$ and $\mu_{1}=\nu$ . Let us show by induction that $\mu_{k}\ll p$ for any $k\in\mathbb{N}^{*}$ . We have clearly $\mu_{1}\ll p$ . Assume that $\mu_{k}\ll p$ . Observe that $\displaystyle\int_{\mathbf{A}}Q(\cdot|x,a)\varphi_{k}(da|x)\ll P(\cdot|x)$ for any $x\in\mathbf{X}$ implying that

[TABLE]

and so, combining (2) and (11) we have $\mu_{k+1}\ll p$ . We obtain the result by using (10). $\Box$

As a consequence, we can show that the set of the $\boldsymbol{\mathcal{K}}_{p}$ -measures contains the occupation measures generated by the stationary randomized policies.

Lemma 4.2

Suppose Assumption 3 holds. For any $\pi\in\Pi_{s}$ , there exists $\Phi\in\boldsymbol{\mathcal{K}}_{p}$ such that

$$\mu^{\pi}=\eta^{\Phi}.$$

Proof: Let $\pi\in\Pi_{s}$ . Clearly, the increasing sequence $\{\mu^{\pi}_{t}\}_{t\in\mathbb{N}^{*}}$ of finite measures defined on $\mathbf{X}\times\mathbf{A}$ by

[TABLE]

for any $\Gamma\in\boldsymbol{\mathfrak{B}}(\mathbf{X}\times\mathbf{A})$ converges to $\mu^{\pi}$ . From Lemma 4.1, there exists a sequence of increasing measurable $\mathbb{R}_{+}$ -valued functions $\{\mathcal{D}_{t}\}_{t\in\mathbb{N}^{*}}$ defined on $\mathbf{X}$ such that $\displaystyle\sum_{k=1}^{t}\mathbb{P}_{\nu}^{\pi}(X_{k}\in\Lambda)=\int_{\Lambda}\mathcal{D}_{t}(x)p(dx)$ for $\Lambda\in\boldsymbol{\mathfrak{B}}(\mathbf{X})$ and so, $\mu^{\pi}_{t}(dx,da)=\mathcal{D}_{t}(x)\pi(da|x)p(dx)$ . Therefore,

[TABLE]

where $\mathcal{D}(x)=\lim_{t\rightarrow\infty}\mathcal{D}_{t}(x)$ . Consequently, $\Phi=(\varphi^{\infty},\varphi^{*})$ defined by $\varphi^{\infty}(da|x)=I_{\{\mathcal{D}(x)=\infty\}}\pi(da|x)$ and $\varphi^{*}(da|x)=\mathcal{D}(x)I_{\{\mathcal{D}(x)<\infty\}}\pi(da|x)$ belongs to $\boldsymbol{\mathcal{K}}_{p}$ since $\mu^{\pi}_{\mathbf{X}}=\nu+\mu^{\pi}Q$ . $\Box$
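The two facts used in this proof, that the partial sums $\mu^{\pi}_{t}$ increase to the occupation measure and that the limit satisfies $\mu^{\pi}_{\mathbf{X}}=\nu+\mu^{\pi}Q$ , can be checked numerically on a toy chain. The sketch below is only an illustration under stated assumptions: the two-state substochastic matrix `P` (the transition law under a fixed stationary policy, restricted to transient states) and the initial law `nu` are hypothetical and do not come from the paper.

```python
def occupation_partial_sums(nu, P, horizon):
    """Return the list of partial sums sum_{k=1}^{t} P(X_k = .) for t = 1..horizon."""
    n = len(nu)
    dist = list(nu)            # law of X_k restricted to the transient states
    total = [0.0] * n
    history = []
    for _ in range(horizon):
        total = [total[j] + dist[j] for j in range(n)]
        history.append(list(total))
        # One-step propagation: dist <- dist P
        dist = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
    return history


# Hypothetical toy chain; the missing row mass is the absorption probability.
nu = [1.0, 0.0]
P = [[0.25, 0.5],
     [1.0 / 3.0, 0.0]]

history = occupation_partial_sums(nu, P, 200)
mu = history[-1]

# Residual of the characteristic equation mu = nu + mu P on the transient states.
residual = [mu[j] - nu[j] - sum(mu[i] * P[i][j] for i in range(2)) for j in range(2)]
```

For this toy chain the linear system $\mu=\nu+\mu P$ can be solved by hand and gives $\mu=(12/7,\,6/7)$ , which the iteration reproduces up to rounding.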

The following result is, in a way, a converse of the previous one. It is a key result in our work. Roughly speaking, it states that for any feasible variable $\Phi\in\boldsymbol{\mathcal{K}}_{p}$ of the convex program, the reward $\mathcal{J}_{\nu}(h,\varphi_{\Phi})$ associated with the stationary randomized policy $\varphi_{\Phi}\in\Pi_{s}$ is greater than or equal to $\eta^{\Phi}(h)$ for specific functions $h$ described below.

Theorem 4.3

Suppose that Assumption 3 holds. For any $\Phi\in\boldsymbol{\mathcal{K}}_{p}$ , there exists $\varphi_{\Phi}\in\Pi_{s}$ such that

$$\mathcal{J}_{\nu}(h,\varphi_{\Phi})\geq\eta^{\Phi}(h)$$

for any $h\in\boldsymbol{\mathscr{M}}(\mathbf{K})$ satisfying $\displaystyle\sup\big{\{}\eta^{\Phi}(h^{+}):\Phi\in\boldsymbol{\mathcal{K}}_{p}\big{\}}<+\infty$ .

Proof: For $h\in\boldsymbol{\mathscr{M}}(\mathbf{K})$ satisfying $\displaystyle\sup\big{\{}\eta^{\Phi}(h^{+}):\Phi\in\boldsymbol{\mathcal{K}}_{p}\big{\}}<+\infty$ , let us prove the result by showing that

$$\mu^{\varphi_{\Phi}}(h)\geq\eta^{\Phi}(h), \tag{12}$$

where $\varphi_{\Phi}$ is the stationary randomized policy induced by $\Phi$ (see (6)). There is no loss of generality to assume that $\eta^{\Phi}(h)>-\infty$ and so we have $\eta^{\Phi}(|h|)<\infty$ . We are going to proceed by contradiction to get (12). More precisely, if $\mu^{\varphi_{\Phi}}(h)<\eta^{\Phi}(h)$ then we will introduce a sequence $\{\Psi_{k}\}_{k\in\mathbb{N}}$ in $\boldsymbol{\mathcal{K}}_{p}$ satisfying $\displaystyle\lim_{k\rightarrow\infty}\eta^{\Psi_{k}}(h)=+\infty$ contradicting the hypothesis. The proof is divided into two steps. We will first introduce $\{\Psi_{k}\}_{k\in\mathbb{N}}$ and show that $\Psi_{k}\in\boldsymbol{\mathcal{K}}_{p}$ for any $k\in\mathbb{N}$ . In a second step, it will be proven that $\displaystyle\lim_{k\rightarrow\infty}\eta^{\Psi_{k}}(h)=+\infty$ showing the result.

First step: construction of a sequence $\{\Psi_{k}\}_{k\in\mathbb{N}}$ in $\boldsymbol{\mathcal{K}}_{p}$ .

Let $\mu^{\varphi_{\Phi}}$ be the occupation measure induced by the stationary randomized policy $\varphi_{\Phi}$ . As in the proof of Lemma 4.2, there exists a measurable $\overline{\mathbb{R}}_{+}$ -valued function $\mathcal{D}_{\varphi_{\Phi}}$ defined on $\mathbf{X}$ satisfying

[TABLE]

For $k\in\mathbb{N}$ , consider $\Psi_{k}=(\psi^{\infty},\psi^{*}_{k})$ where $\psi^{\infty}\in\boldsymbol{\mathcal{K}}(\mathbf{A}|\mathbf{X})$ is given by

[TABLE]

and $\psi^{*}_{k}$ is a signed kernel on $\mathbf{A}$ given $\mathbf{X}$ defined by

[TABLE]

Observe that in the previous definition, $\varphi^{*}(\mathbf{A}|x)-\mathcal{D}_{\varphi_{\Phi}}(x)$ is well defined since $\varphi^{*}\in\boldsymbol{\mathcal{K}}(\mathbf{A}|\mathbf{X})$ . To get the result, we will proceed in two steps. First we will show that $\varphi^{*}(\mathbf{A}|\cdot)\geq\mathcal{D}_{\varphi_{\Phi}}(\cdot)$ on $\boldsymbol{\mathcal{E}}_{\Phi}$ implying that $\psi^{*}_{k}\in\boldsymbol{\mathcal{K}}(\mathbf{A}|\mathbf{X})$ and so, $\Psi_{k}\in\boldsymbol{\mathcal{K}}(\mathbf{A}|\mathbf{X})^{2}$ for any $k\in\mathbb{N}$ . In a second step, we will prove that $\Psi_{k}\in\boldsymbol{\mathcal{K}}_{p}$ .

$\bullet$ Let us show that $\Psi_{k}\in\boldsymbol{\mathcal{K}}(\mathbf{A}|\mathbf{X})^{2}$ .

From (4), $\eta^{\Phi}(dx,da)=\mathcal{I}_{\infty}(x)\varphi^{\infty}(da|x)p(dx)+\varphi^{*}(da|x)p(dx)$ and so, by using (7)

[TABLE]

where by convention $0\times\infty=0$ . Recalling the Definition of $\varphi_{\Phi}$ (see equation (6)), we easily obtain $I_{\boldsymbol{\mathcal{E}}_{\Phi}}(x)\varphi^{*}(da|x)=I_{\boldsymbol{\mathcal{E}}_{\Phi}}(x)\varphi^{*}(\mathbf{A}|x)\varphi_{\Phi}(da|x)$ and $I_{\boldsymbol{\mathcal{E}}_{\Phi}^{c}}(x)\mathcal{I}_{\infty}(x)\varphi^{\infty}(da|x)=I_{\boldsymbol{\mathcal{E}}_{\Phi}^{c}}(x)\mathcal{I}_{\infty}(x)\varphi_{\Phi}(da|x)$ and so, we get

[TABLE]

Therefore,

[TABLE]

Since $\eta^{\Phi}_{\mathbf{X}}=\nu+\eta^{\Phi}Q$ , we have by using (16)

[TABLE]

and with (17) it follows

[TABLE]

However, $\mu^{\varphi_{\Phi}}_{\mathbf{X}}$ is the minimal solution to the equation $\beta=\nu+\beta Q^{\varphi_{\Phi}}$ and so, $\mu^{\varphi_{\Phi}}_{\mathbf{X}}\leq\eta^{\Phi}_{\mathbf{X}}$ . Combining equations (13) and (17), we obtain $\Big{[}I_{\boldsymbol{\mathcal{E}}_{\Phi}}(\cdot)\varphi^{*}(\mathbf{A}|\cdot)+I_{\boldsymbol{\mathcal{E}}_{\Phi}^{c}}(\cdot)\mathcal{I}_{\infty}(\cdot)\Big{]}\geq\mathcal{D}_{\varphi_{\Phi}}(\cdot)$ $p-a.s.$ . Consequently, $\mathcal{D}_{\varphi_{\Phi}}(\cdot)\leq\varphi^{*}(\mathbf{A}|\cdot)$ $p-a.s.$ on $\boldsymbol{\mathcal{E}}_{\Phi}$ and according to the definition of $\mathcal{D}_{\varphi_{\Phi}}(\cdot)$ (see equation (13)), there is no loss of generality to claim

[TABLE]

Therefore, $\psi^{*}_{k}\in\boldsymbol{\mathcal{K}}(\mathbf{A}|\mathbf{X})$ and so, $\Psi_{k}\in\boldsymbol{\mathcal{K}}(\mathbf{A}|\mathbf{X})^{2}$ for any $k\in\mathbb{N}$ .

$\bullet$ Let us show that $\Psi_{k}\in\boldsymbol{\mathcal{K}}_{p}$ .

Recalling the definition $\Psi_{k}$ (see equations (14)-(15)), we have $\psi^{\infty}(\mathbf{A}(x)^{c}|x)+\psi^{*}_{k}(\mathbf{A}(x)^{c}|x)=0$ and $\psi^{\infty}(\mathbf{A}|x)+\psi^{*}_{k}(\mathbf{A}|x)\geq I_{\boldsymbol{\mathcal{E}}_{\Phi}}(x)\varphi^{*}(\mathbf{A}|x)+I_{\boldsymbol{\mathcal{E}}_{\Phi}^{c}}(x)>0$ for any $x\in\mathbf{X}$ . The only point which remains to prove is that $\eta^{\Psi_{k}}(dx,da)=\mathcal{I}_{\infty}(x)\psi^{\infty}(da|x)p(dx)+\psi^{*}_{k}(da|x)p(dx)$ satisfies

[TABLE]

Combining the definition of $\Psi_{k}$ (see equations (14)-(15)) and the expression of $\eta^{\Phi}$ (see equation (16)), we obtain

[TABLE]

where $\gamma\in\boldsymbol{\mathcal{M}}(\mathbf{X}\times\mathbf{A})$ is given by

[TABLE]

To show that (20) holds, we will consider two cases.

a) Firstly, we will show that equation (20) is satisfied on $\boldsymbol{\mathfrak{B}}(\boldsymbol{\mathcal{E}}_{\Phi})$ . For that, let us consider $\Lambda\in\boldsymbol{\mathfrak{B}}(\boldsymbol{\mathcal{E}}_{\Phi})$ . From (21), we have $\eta^{\Psi_{k}}_{\mathbf{X}}(\Lambda)=\eta^{\Phi}_{\mathbf{X}}(\Lambda)+k\gamma_{\mathbf{X}}(\Lambda)$ . However, $\eta^{\Phi}_{\mathbf{X}}(\Lambda)=\nu(\Lambda)+\eta^{\Phi}Q(\Lambda)$ showing that $\eta^{\Psi_{k}}_{\mathbf{X}}(\Lambda)=\nu(\Lambda)+\eta^{\Phi}Q(\Lambda)+k\gamma_{\mathbf{X}}(\Lambda)$ . If we show that $\gamma_{\mathbf{X}}(\Lambda)=\gamma Q(\Lambda)$ then $\eta^{\Psi_{k}}_{\mathbf{X}}(\Lambda)=\nu(\Lambda)+\eta^{\Psi_{k}}Q(\Lambda)$ implying that (20) holds on $\boldsymbol{\mathfrak{B}}(\boldsymbol{\mathcal{E}}_{\Phi})$ . To see that $\gamma_{\mathbf{X}}(\Lambda)=\gamma Q(\Lambda)$ , observe from (22) that

[TABLE]

Assuming that $\eta^{\Phi}_{\mathbf{X}}(\Lambda)<\infty$ and combining (13), (17) and the previous equation we have

[TABLE]

Now, we obtain by using (18) and the fact that $\eta^{\Phi}_{\mathbf{X}}(\Lambda)<\infty$

[TABLE]

implying also

[TABLE]

Now, combining (18) and (24)

[TABLE]

Recalling that $\mu^{\varphi_{\Phi}}_{\mathbf{X}}=\nu+\mu^{\varphi_{\Phi}}_{\mathbf{X}}Q^{\varphi_{\Phi}}$ , we have with (13) and (25)

[TABLE]

The two previous equations give

[TABLE]

From (23) and (26)

[TABLE]

Recalling the definition of $\gamma$ (see (22)) we get $\gamma_{\mathbf{X}}(\Lambda)=\gamma Q(\Lambda)$ for $\Lambda\in\boldsymbol{\mathfrak{B}}(\boldsymbol{\mathcal{E}}_{\Phi})$ with $\eta^{\Phi}_{\mathbf{X}}(\Lambda)<\infty$ . However, equation (17) implies that $\eta^{\Phi}_{\mathbf{X}}$ is $\sigma$ -finite on $\boldsymbol{\mathcal{E}}_{\Phi}$ and combining (16) and (22), we have $\gamma_{\mathbf{X}}\leq\eta^{\Phi}_{\mathbf{X}}$ . Therefore, it follows that $\gamma_{\mathbf{X}}(\Lambda)=\gamma Q(\Lambda)$ for any $\Lambda$ in $\boldsymbol{\mathfrak{B}}(\boldsymbol{\mathcal{E}}_{\Phi})$ , and so (20) holds on $\boldsymbol{\mathfrak{B}}(\boldsymbol{\mathcal{E}}_{\Phi})$ .

b) Secondly, we will show that equation (20) is satisfied on $\boldsymbol{\mathfrak{B}}(\boldsymbol{\mathcal{E}}_{\Phi}^{c})$ . For that, let $\Lambda\in\boldsymbol{\mathfrak{B}}(\boldsymbol{\mathcal{E}}_{\Phi}^{c})$ . It is important to observe from (17) that in this case $\eta^{\Phi}_{\mathbf{X}}(\Lambda)=0$ or $+\infty$ . Therefore, we obtain on one hand $\eta^{\Psi_{k}}_{\mathbf{X}}(\Lambda)=\eta^{\Phi}_{\mathbf{X}}(\Lambda)+k\gamma_{\mathbf{X}}(\Lambda)=\eta^{\Phi}_{\mathbf{X}}(\Lambda)$ by recalling (21) and using the fact that $\gamma_{\mathbf{X}}\leq\eta^{\Phi}_{\mathbf{X}}$ and on the other hand $\eta^{\Phi}_{\mathbf{X}}(\Lambda)=\eta^{\Phi}_{\mathbf{X}}(\Lambda)+k\gamma Q(\Lambda)$ since by (22)

[TABLE]

where the last inequality comes from (18). Therefore, $\eta^{\Psi_{k}}_{\mathbf{X}}(\Lambda)=\eta^{\Phi}_{\mathbf{X}}(\Lambda)+k\gamma Q(\Lambda)=\nu(\Lambda)+\eta^{\Psi_{k}}Q(\Lambda)$ showing that (20) holds on $\boldsymbol{\mathfrak{B}}(\boldsymbol{\mathcal{E}}_{\Phi}^{c})$ .

Finally, equation (20) is satisfied and as a consequence $\Psi_{k}\in\boldsymbol{\mathcal{K}}_{p}$ for any $k\in\mathbb{N}$ .

Second step: $\displaystyle\lim_{k\rightarrow\infty}\eta^{\Psi_{k}}(h)=+\infty$ .

Recalling that $\eta^{\Phi}(|h|)<\infty$ , we get from (16)

[TABLE]

implying also

[TABLE]

Therefore, combining (13), (16), (22) and the two previous equations we obtain easily that

[TABLE]

If $\eta^{\Phi}(h)>\mu^{\varphi_{\Phi}}(h)$ then $\gamma(h)>0$ and $\displaystyle\lim_{k\rightarrow\infty}\eta^{\Psi_{k}}(h)=\eta^{\Phi}(h)+\lim_{k\rightarrow\infty}k\gamma(h)=+\infty$ giving the result. $\Box$

4.2 Properties of the constrained control problem

The main objective of this subsection is to show that in the framework of constrained control problems, the supremum of the expected total rewards over the set of randomized policies is equal to the supremum of the expected total rewards over the set of stationary randomized policies. Our results use Theorem A.1 presented in the Appendix which is a slight modification of Theorem 1 in Schäl [17] who has established a stronger version of this type of result but in the unconstrained case. To use Schäl’s results, we need to impose Conditions (W) or (S) and in addition, to deal with the constrained case, we need to impose a Slater-type condition.

The next technical Lemma shows that, roughly speaking, under Assumption (B.1), the unconstrained control problems associated to a reward function given by either $r$ or $c_{i}$ for $i\in\mathbb{N}_{q}$ are different from $+\infty$ .

Lemma 4.4

Suppose Assumptions 3 and (B.1) and either Conditions (W) or (S) hold. Then,

[TABLE]

for $i\in\mathbb{N}_{q}$ .

Proof: The idea is to apply Theorem A.1 to the unconstrained models associated to the reward functions given by one of the following mappings: $r^{+}$ and $c^{+}_{i}$ for $i\in\mathbb{N}_{q}$ . Clearly, the Convergence Assumption and the Continuity and Compactness Assumptions in [17, p. 367] are satisfied. Therefore, we have by using Theorem A.1

[TABLE]

for any function $h$ given by either $r^{+}$ or $c^{+}_{i}$ for $i\in\mathbb{N}_{q}$ . Now, from Assumption 3 we can apply Lemma 4.2 to have

[TABLE]

Recalling Assumption (B.1) we obtain the result. $\Box$

The next result shows that if the Slater condition is satisfied for an arbitrary policy then there exists a stationary randomized policy satisfying the same type of condition.

Proposition 4.5

Suppose Assumptions 3 and B and either Condition (W) or (S) hold. If the Slater condition is satisfied, then there exists $\widetilde{\mu}\in\boldsymbol{\mathcal{O}}_{s}$ satisfying $\theta^{*}_{i}<\widetilde{\mu}(c_{i})$ for any $i\in\mathbb{N}_{q}$ .

Proof: The result is proved by induction. Applying Theorem A.1 for the unconstrained model associated to the reward function $c_{1}$ , we have

[TABLE]

Since $\mu^{*}(c_{1})>\theta_{1}^{*}$ (by recalling the Slater condition), we have $\sup\{\mu(c_{1}):\mu\in\boldsymbol{\mathcal{O}}_{s}\}>\theta_{1}^{*}$ implying the existence of $\mu_{1}\in\boldsymbol{\mathcal{O}}_{s}$ such that $\mu_{1}(c_{1})>\theta_{1}^{*}$ . For $n\in\mathbb{N}_{q-1}$ , let us assume the existence of $\mu_{n}\in\boldsymbol{\mathcal{O}}_{s}$ such that $\mu_{n}(c_{i})>\theta_{i}^{*}$ for $i\in\mathbb{N}_{n}$ . Therefore, we can combine Lemma 4.4 and Proposition A.2 to obtain

[TABLE]

However,

[TABLE]

implying the existence of $\mu_{n+1}\in\boldsymbol{\mathcal{O}}_{s}$ such that $\mu_{n+1}(c_{i})>\theta_{i}^{*}$ for $i\in\mathbb{N}_{n+1}$ . This gives the result. $\Box$

Below is the main result of this subsection. Roughly speaking, it states that in the framework of constrained control problems, the suprema of the expected total rewards over the set of randomized policies and over the set of stationary randomized policies coincide.

Theorem 4.6

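Suppose Assumptions 3 and B and either Condition (W) or (S) hold. If the Slater condition is satisfied, then

$$\sup\big\{\mathcal{J}_{\nu}(r,\pi):\pi\in\Pi_{\theta^{*}}\big\}=\sup\big\{\mathcal{J}_{\nu}(r,\pi):\pi\in\Pi_{s}\mathop{\cap}\Pi_{\theta^{*}}\big\}.$$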

Proof: Applying Proposition 4.5, there exists $\widetilde{\mu}\in\boldsymbol{\mathcal{O}}_{s}$ satisfying the Slater condition, that is, $\widetilde{\mu}(c_{i})>\theta^{*}_{i}$ for $i\in\mathbb{N}_{q}$ . Now, combining Lemma 4.4 and Proposition A.2, we obtain the result. $\Box$

5 Main results

In this section, we present the main results of this paper showing that the original control problem is equivalent to the convex program introduced in Definition 3.7 for a weakly or strongly continuous transition kernel.

The case of Condition (W)

Theorem 5.1

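Suppose Assumptions 3 and B and Condition (W) hold. If the Slater condition is satisfied, then

$$\sup\big\{\mathcal{J}_{\nu}(r,\pi):\pi\in\Pi_{\theta^{*}}\big\}=\sup\big\{\eta^{\Phi}(r):\Phi\in\boldsymbol{\mathcal{K}}_{p}\text{ and }\eta^{\Phi}(c_{i})\geq\theta^{*}_{i}\text{ for }i\in\mathbb{N}_{q}\big\} \tag{27}$$

where $p\in\boldsymbol{\mathcal{P}}(\mathbf{X})$ is defined in (2). Moreover, if $\hat{\Phi}$ is an optimal solution to the convex program $\boldsymbol{\mathcal{KP}}_{p}$ then the stationary randomized policy $\varphi_{\hat{\Phi}}$ induced by $\hat{\Phi}$ is optimal for the constrained control problem, that is,

$$\mathcal{J}_{\nu}(r,\varphi_{\hat{\Phi}})=\sup\big\{\mathcal{J}_{\nu}(r,\pi):\pi\in\Pi_{\theta^{*}}\big\}.$$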

Proof: Theorem 4.6 states that

[TABLE]

However, from Lemma 4.2, we have

[TABLE]

Now, consider $\Phi\in\boldsymbol{\mathcal{K}}_{p}$ . By using Theorem 4.3, $\mathcal{J}_{\nu}(h,\varphi_{\Phi})\geq\eta^{\Phi}(h)$ for $h$ given by either $r$ or $c_{i}$ for $i\in\mathbb{N}_{q}$ implying that $\varphi_{\Phi}\in\Pi_{s}\mathop{\cap}\Pi_{\theta^{*}}$ and also the reverse inequality

[TABLE]

showing the first part of the result.

Now if $\hat{\Phi}\in\boldsymbol{\mathcal{K}}_{p}$ is an optimal solution to the convex program $\boldsymbol{\mathcal{KP}}_{p}$ then $\eta^{\hat{\Phi}}(c_{i})\geq\theta^{*}_{i}$ for any $i\in\mathbb{N}_{q}$ and $\eta^{\hat{\Phi}}(r)=\sup\big{\{}\eta^{\Phi}(r):\Phi\in\boldsymbol{\mathcal{K}}_{p}\text{ and }\eta^{\Phi}(c_{i})\geq\theta^{*}_{i}\text{ for }i\in\mathbb{N}_{q}\big{\}}$ . Therefore, the stationary randomized policy $\varphi_{\hat{\Phi}}\in\Pi_{\theta^{*}}$ satisfies $\mathcal{J}_{\nu}(r,\varphi_{\hat{\Phi}})\geq\eta^{\hat{\Phi}}(r)$ by using Theorem 4.3. Now, by using the first part of the result (see equation (27)) it follows that $\mathcal{J}_{\nu}(r,\varphi_{\hat{\Phi}})\geq\sup\big{\{}\mathcal{J}_{\nu}(r,\pi):\pi\in\Pi_{\theta^{*}}\big{\}}$ giving the last part of the result. $\Box$

Remark 5.2

As mentioned in the introduction, the previous result has the advantage of proposing a convex programming formulation for constrained MDPs under the ETR criterion with signed reward functions and satisfying condition (W), which has not so far been addressed in the literature. In [6], the authors do not really analyse a convex program, but study a related optimization problem where the MDPs under consideration satisfy condition (W); however, the proposed approach strongly relies on the positiveness of the cost functions and cannot be generalized to the framework of signed cost functions.

The case of condition (S)

Theorem 5.3

Suppose Assumptions 3 and Condition (S) hold. If the Slater condition is satisfied, then

[TABLE]

where $p$ is defined in (2) for $P$ given by (3). Moreover, if $\hat{\Phi}$ is an optimal solution to the convex program then the stationary randomized policy $\varphi_{\hat{\Phi}}$ induced by $\hat{\Phi}$ is optimal for the constrained control problem introduced in Definition 2.1.

Proof: Up to the definition of $p$ whose existence is established in Lemma 3.2, the proof of this result is identical to that of Theorem 5.1. $\Box$

Remark 5.4

In [8], the authors do not really analyse a convex program but study a related optimization problem where the MDPs under consideration satisfy condition (S) by assuming that the transition kernel is absolutely continuous with respect to a reference probability measure uniformly in the state and action variables. In the previous result, we show that this assumption is not needed under condition (S) if this hypothesis is replaced by a Slater-type condition.

6 Example

In this section, we provide an example with one constraint to illustrate our results and compare them with reference [8]. The results obtained in [6] cannot be used for this model because the constraint function takes positive and negative values. We will show that one of the conditions of [8] is not satisfied while the approach developed in the present paper can be applied. This example shows that there is a gap between the initial optimization problem and the mathematical program associated with the measures satisfying the characteristic equation, that is,

[TABLE]

It means that the characteristic equation $\mu_{\mathbf{X}}=\nu+\mu Q$ generates measures that do not correspond to any occupation measure of the process. Measures of this type are called phantom solutions of the characteristic equation in [7]. The interesting point is that at the same time, we may have

[TABLE]

This means that the set $\big{\{}\eta^{\Phi}:\Phi\in\boldsymbol{\mathcal{K}}_{p}\big{\}}$ , which is a subset of $\big{\{}\mu\in\boldsymbol{\mathcal{M}}(\mathbf{X}):\mu_{\mathbf{X}}=\nu+\mu Q\big{\}}$ , may generate fewer phantom solutions.

Two different values of the constraint limit $\theta^{\ast}_{1}$ will be studied. For the first value of the constraint limit, it will be shown that the approach proposed in the present paper can be applied, implying that the value of the original control problem coincides with the value of the convex program $\boldsymbol{\mathcal{KP}}_{p}$ . When changing the value of the constraint limit, the Slater condition will not be satisfied. However, it is interesting to observe that in this latter case, the values of the original control problem and its associated convex program $\boldsymbol{\mathcal{KP}}_{p}$ still coincide although the Slater condition is not fulfilled. It appears that the Slater condition is not a necessary condition to establish the correspondence between the constrained control problem and its associated convex program $\boldsymbol{\mathcal{KP}}_{p}$ .

We consider the control model

[TABLE]

where $\mathbf{X}=\mathbb{Z}\cup\{\Delta\}$ and the action set is given by $\mathbf{A}=\{a,b\}$ . For $x\neq 1$ , $\mathbf{A}(x)=\{a\}$ ; $\mathbf{A}(1)=\{a,b\}$ and $\mathbf{A}(\Delta)=\{a\}$ . The stochastic kernel $Q$ is given by $Q(x+1|x,a)=1$ for $x\leq 0$ and $Q(y|x,a)=(1/2)I_{\{x+1\}}(y)+(1/2)I_{\{x+2\}}(y)$ , for $x\geq 1$ and finally, $Q(\Delta|1,b)=Q(\Delta|\Delta,a)=1$ . The one-step reward function is given by $r(x,a)=(1/2)^{|x|}$ for $x\neq 1$ ; $r(1,a)=r(1,b)=1/2$ and $r(\Delta,a)=0$ . The one-step constraint function is given by $c_{1}(x)=(-1/2)^{|x|}$ for $x\neq 1$ ; $c_{1}(1,a)=-1/18$ and $c_{1}(1,b)=1$ . The initial distribution $\nu$ satisfies $\nu(\{1\})=\nu(\{\Delta\})=1/2$ . The constraint limit is given by $\theta^{*}_{1}$ . Two cases are studied: $\theta^{*}_{1}=1/4$ and $\theta^{*}_{1}=1/2$ .

Let $\mu\in\boldsymbol{\mathcal{M}}(\mathbf{X})$ satisfy the characteristic equation $\mu_{\mathbf{X}}=\nu+\mu Q$ . Then $\mu(\Delta,a)=+\infty$ ; $\mu(x,a)=\mu(0,a)$ for $x\leq 0$ ; $\mu(1,a)+\mu(1,b)=1/2+\mu(0,a)$ and finally, $\mu(2,a)=(1/2)\mu(1,a)$ and $\mu(x,a)=(1/2)\mu(x-1,a)+(1/2)\mu(x-2,a)$ for $x\geq 3$ , showing that for $x\geq 2$ , $\mu(x,a)=(1/6)[4-(-1/2)^{x-2}]\mu(1,a)$ . Therefore,

[TABLE]

since $\displaystyle\mu(r)=\sum_{x\neq 1}(1/2)^{|x|}\mu(x,a)+(1/2)[\mu(1,a)+\mu(1,b)]$ . This implies that Assumption (A2) in [8] is not satisfied and therefore, the approach developed there cannot be applied.
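The relations just listed leave the common value $\mu(0,a)$ on the nonpositive states free. The sketch below, in exact rational arithmetic, builds such a solution on a finite window, checks the closed form for $x\geq 2$ , and evaluates the truncated reward for several values of $\mu(0,a)$ . It is only an illustration: the window $[-30,30]$ , the function names and the tested values of $\mu(0,a)$ are assumptions made here, and the state $\Delta$ is omitted because $r(\Delta,a)=0$ .

```python
from fractions import Fraction

HALF = Fraction(1, 2)


def solution_on_window(c, m1a, x_min=-30, x_max=30):
    """Values mu(x, a) on [x_min, x_max] when mu(0, a) = c and mu(1, a) = m1a."""
    mu = {x: Fraction(c) for x in range(x_min, 1)}   # constant on x <= 0
    mu[1] = Fraction(m1a)
    mu[2] = HALF * mu[1]
    for x in range(3, x_max + 1):
        mu[x] = HALF * mu[x - 1] + HALF * mu[x - 2]
    return mu


def windowed_reward(c, m1a):
    """Truncated mu(r); mu(1, b) is fixed by mu(1, a) + mu(1, b) = 1/2 + mu(0, a)."""
    mu = solution_on_window(c, m1a)
    m1b = HALF + Fraction(c) - Fraction(m1a)
    outside_one = sum(HALF ** abs(x) * m for x, m in mu.items() if x != 1)
    return outside_one + HALF * (mu[1] + m1b)


# Closed form mu(x, a) = (1/6) [4 - (-1/2)^(x-2)] mu(1, a), checked with mu(1, a) = 1.
reference = solution_on_window(0, 1)
closed_form_ok = all(
    reference[x] == Fraction(1, 6) * (4 - Fraction(-1, 2) ** (x - 2))
    for x in range(2, 31)
)

# Truncated reward for increasing values of the free parameter mu(0, a).
rewards = [windowed_reward(c, 0) for c in (0, 10, 100)]
```

On this window the truncated reward grows linearly with the free parameter $\mu(0,a)$ , so it can be made as large as desired among the solutions of the characteristic equation described above.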

The stochastic kernel $P$ on $\mathbf{X}$ given $\mathbf{X}$ defined by $P(x|y)=Q(x|y,a)$ for $y\in\{\Delta\}\cup\mathbb{Z}\setminus\{1\}$ and $P(2|1)=P(3|1)=P(\Delta|1)=1/3$ satisfies Assumption 3.

The probability $p$ associated to $P$ and given by (2) satisfies $p(x)=0$ for $x\leq 0$ . As a consequence, $\eta^{\Phi}(x,a)=0$ for any $x\leq 0$ and $\Phi\in\boldsymbol{\mathcal{K}}_{p}$ . Moreover, since $\eta^{\Phi}$ satisfies the characteristic equation, it follows that $\eta^{\Phi}(1,a)+\eta^{\Phi}(1,b)=1/2$ and $\eta^{\Phi}(x,a)=(1/6)[4-(-1/2)^{x-2}]\eta^{\Phi}(1,a)$ for $x\geq 2$ and $\eta^{\Phi}(\Delta,a)=+\infty$ . Thus,

[TABLE]

and similarly,

[TABLE]

where $\eta^{\Phi}(1,a)\in[0,1/2]$ . Clearly, we have $\eta^{\Phi}(r^{+})<+\infty$ and $\eta^{\Phi}(c^{+}_{1})<+\infty$ for any $\Phi\in\boldsymbol{\mathcal{K}}_{p}$ showing that Assumption (B.1) is satisfied.

Now, let $\pi_{a}$ (respectively, $\pi_{b}$ ) be the deterministic stationary policy given by $\pi_{a}(\{a\}|x)=1$ for $x\in\mathbb{Z}\mathop{\cup}\{\Delta\}$ (respectively, $\pi_{b}(\{a\}|x)=1$ if $x\in\mathbb{Z}\mathop{\cup}\{\Delta\}\setminus\{1\}$ and $\pi_{b}(\{b\}|1)=1$ ). It is easy to see that the occupation measure $\mu^{\pi_{a}}$ is given by $\mu^{\pi_{a}}(1,a)=1/2$ ; $\mu^{\pi_{a}}(1,b)=0$ ; $\mu^{\pi_{a}}(\Delta,a)=+\infty$ ; $\mu^{\pi_{a}}(x,a)=0$ for any $x\leq 0$ and $\mu^{\pi_{a}}(x,a)=(1/12)[4-(-1/2)^{x-2}]$ for $x\geq 2$ , while the occupation measure $\mu^{\pi_{b}}$ satisfies $\mu^{\pi_{b}}(x,a)=0$ for any $x\in\mathbb{Z}$ ; $\mu^{\pi_{b}}(1,b)=1/2$ and $\mu^{\pi_{b}}(\Delta,a)=+\infty$ . It follows easily that $\mu^{\pi_{a}}(r)=\sum_{x\geq 2}(1/2)^{x}(1/12)[4-(-1/2)^{x-2}]+1/4=2/5$ and $\mu^{\pi_{b}}(r)=r(1,b)\mu^{\pi_{b}}(1,b)=1/4$ . Observe also that $\mu^{\pi_{a}}(c_{1})=-1/36+\sum_{x\geq 2}(-1/2)^{x}(1/12)[4-(-1/2)^{x-2}]=0$ and $\mu^{\pi_{b}}(c_{1})=1/2$ . Clearly, the reward $\mathcal{J}_{\nu}(r,\pi)$ takes values in the interval $[\mathcal{J}_{\nu}(r,\pi_{b}),\mathcal{J}_{\nu}(r,\pi_{a})]$ when the policy $\pi$ ranges over $\Pi$ and the constraint $\mathcal{J}_{\nu}(c_{1},\pi)$ takes values in $[\mathcal{J}_{\nu}(c_{1},\pi_{a}),\mathcal{J}_{\nu}(c_{1},\pi_{b})]$ . Therefore, Assumption (B.2) is satisfied.
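The values $2/5$ , $1/4$ , $0$ and $1/2$ can be reproduced numerically. The following sketch rebuilds $\mu^{\pi_{a}}$ on the states $x\geq 1$ from the recursion of the model, compares it with the closed form, and sums the series up to a cutoff. The cutoff `X_MAX = 200` and the variable names are choices made here for illustration; the contribution of $\Delta$ is left out because $r(\Delta,a)=0$ and, as assumed here, the constraint function contributes nothing there.

```python
X_MAX = 200  # truncation level; the neglected tail is of order (1/2)**200


def occupation_pi_a(x_max):
    """Occupation measure of pi_a at (x, a) for 1 <= x <= x_max, via the recursion."""
    mu = {1: 0.5, 2: 0.25}   # mu(1, a) = nu({1}) and mu(2, a) = mu(1, a) / 2
    for x in range(3, x_max + 1):
        mu[x] = 0.5 * mu[x - 1] + 0.5 * mu[x - 2]
    return mu


mu_a = occupation_pi_a(X_MAX)

# Largest deviation from the closed form (1/12) [4 - (-1/2)^(x-2)].
closed_form_gap = max(
    abs(mu_a[x] - (4 - (-0.5) ** (x - 2)) / 12) for x in range(2, X_MAX + 1)
)

# Reward and constraint under pi_a: state 1 uses r(1, a) = 1/2 and c_1(1, a) = -1/18.
reward_a = 0.5 * mu_a[1] + sum(0.5 ** x * mu_a[x] for x in range(2, X_MAX + 1))
constraint_a = (-1 / 18) * mu_a[1] + sum(
    (-0.5) ** x * mu_a[x] for x in range(2, X_MAX + 1)
)

# Under pi_b only (1, b) carries finite mass: r(1, b) = 1/2 and c_1(1, b) = 1.
reward_b = 0.5 * 0.5
constraint_b = 1.0 * 0.5
```

Up to floating-point rounding, `reward_a` is $2/5$ and `constraint_a` is $0$ , in agreement with the values stated above.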

Finally, Condition (W) is obviously satisfied for this model.

Remark that for any $\alpha\in[0,1]$ , the stationary randomized policy given by $\pi(\{a\}|1)=\alpha$ , $\pi(\{b\}|1)=1-\alpha$ and $\pi(\{a\}|x)=1$ for $x\in\mathbb{Z}\setminus\{1\}$ yields $\mathcal{J}_{\nu}(r,\pi)=(1-\alpha)\mathcal{J}_{\nu}(r,\pi_{b})+\alpha\mathcal{J}_{\nu}(r,\pi_{a})$ and $\mathcal{J}_{\nu}(c_{1},\pi)=(1-\alpha)\mathcal{J}_{\nu}(c_{1},\pi_{b})+\alpha\mathcal{J}_{\nu}(c_{1},\pi_{a})$ .

The case where $\theta^{*}_{1}=1/4.$ From the previous discussion, we have

[TABLE]

where $\pi^{*}$ is the stationary randomized policy given by $\pi^{*}(\{a\}|1)=\pi^{*}(\{b\}|1)=1/2$ , $\pi^{*}(\{a\}|x)=1$ for $x\in\mathbb{Z}\setminus\{1\}$ . Moreover,

[TABLE]

Therefore, the values of the original control problem and the convex program $\boldsymbol{\mathcal{KP}}_{p}$ agree as claimed by Theorem 5.1 since the Slater condition holds.

Observe that the optimal value of the convex program $\boldsymbol{\mathcal{KP}}_{p}$ is achieved for $\eta^{\hat{\Phi}}(1,a)=1/4$ where $\hat{\Phi}\in\boldsymbol{\mathcal{K}}_{p}$ is an optimal solution to the convex program $\boldsymbol{\mathcal{KP}}_{p}$ . Since $p(1)=1/4$ , the stationary policy $\varphi_{\hat{\Phi}}$ induced by $\hat{\Phi}$ is given by $\varphi_{\hat{\Phi}}(\{a\}|1)=\varphi_{\hat{\Phi}}(\{b\}|1)=1/2$ and $\varphi_{\hat{\Phi}}(\{a\}|\Delta)=\varphi_{\hat{\Phi}}(\{a\}|x)=1$ for $x\in\mathbb{Z}\setminus\{1\}$ . This optimal policy corresponds to $\pi^{*}$ as determined above.

The case where $\theta^{*}_{1}=1/2.$ We have for this value of the constraint limit,

[TABLE]

where $\pi^{*}$ is the stationary randomized policy given by $\pi^{*}(\{a\}|1)=0$ , $\pi^{*}(\{b\}|1)=1$ , $\pi^{*}(\{a\}|x)=1$ for $x\in\mathbb{Z}\setminus\{1\}$ .

However, we cannot apply the results of the present paper because in this case the Slater condition is not satisfied. Indeed, for any $\pi\in\Pi$ , $\mathcal{J}_{\nu}(c_{1},\pi)\leq 1/2$ . Nevertheless, the values of the original control problem and the convex program $\boldsymbol{\mathcal{KP}}_{p}$ still agree since

[TABLE]
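Both cases can be cross-checked with a few lines of exact arithmetic. The sketch below restricts attention to the one-parameter family of stationary randomized policies that play $a$ at state $1$ with probability $\alpha$ , for which reward and constraint are affine in $\alpha$ , and uses the values $\mathcal{J}_{\nu}(r,\pi_{a})=2/5$ , $\mathcal{J}_{\nu}(r,\pi_{b})=1/4$ , $\mathcal{J}_{\nu}(c_{1},\pi_{a})=0$ and $\mathcal{J}_{\nu}(c_{1},\pi_{b})=1/2$ computed earlier. The helper names `best_mixture` and `strictly_feasible` are hypothetical.

```python
from fractions import Fraction

R_A, C_A = Fraction(2, 5), Fraction(0)       # reward and constraint of pi_a
R_B, C_B = Fraction(1, 4), Fraction(1, 2)    # reward and constraint of pi_b


def best_mixture(theta):
    """Maximise alpha*R_A + (1-alpha)*R_B subject to alpha*C_A + (1-alpha)*C_B >= theta.

    The reward increases and the constraint decreases with alpha, so the
    optimum is the largest feasible alpha in [0, 1]. Returns None if infeasible.
    """
    theta = Fraction(theta)
    alpha = min(Fraction(1), (C_B - theta) / (C_B - C_A))
    if alpha < 0:
        return None
    value = alpha * R_A + (1 - alpha) * R_B
    constraint = alpha * C_A + (1 - alpha) * C_B
    return alpha, value, constraint


def strictly_feasible(theta):
    """Slater-type check within the family: can the constraint exceed theta strictly?"""
    return max(C_A, C_B) > Fraction(theta)


case_quarter = best_mixture(Fraction(1, 4))
case_half = best_mixture(Fraction(1, 2))
```

Within this family, $\theta^{*}_{1}=1/4$ gives $\alpha=1/2$ with reward $13/40$ , and $\theta^{*}_{1}=1/2$ forces $\alpha=0$ with reward $1/4$ ; strict feasibility holds in the first case only.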

Appendix A Appendix

In this appendix, let $m$ be an integer in $\mathbb{N}^{*}$ . Consider the functions $h\in\boldsymbol{\mathscr{M}}(\mathbf{K})$ and $g_{i}\in\boldsymbol{\mathscr{M}}(\mathbf{K})$ for $i\in\mathbb{N}_{m}$ . We will first present a slightly different version of a result derived by M. Schäl in [17, Theorem 1]. The only difference is that we consider here the expected total reward criterion, while in [17] Schäl deals with the conditional version of that performance criterion. We will use it repeatedly in this paper. In this section we will also establish a technical result that is used in Section 4.2 to show that in the framework of control problems with constraints, the supremum of the expected total rewards over the set of randomized policies is equal to the supremum of the expected total rewards over the set of stationary randomized policies.

To use Theorem 1 in [17], we need to introduce the following two sets of conditions:

$\mathbf{(\boldsymbol{\mathcal{S}}1)}$

For any $x\in\mathbf{X}$ , $\mathbf{A}(x)$ is compact.

$\mathbf{(\boldsymbol{\mathcal{S}}2)}$

For any $x\in\mathbf{X}$ and $\Lambda\in\boldsymbol{\mathfrak{B}}(\mathbf{X})$ , $Q(\Lambda|x,\cdot)$ is continuous on $\mathbf{A}(x)$ .

$\mathbf{(\boldsymbol{\mathcal{S}}3)}$

For any $x\in\mathbf{X}$ , $h(x,\cdot)$ is upper-semicontinuous on $\mathbf{A}(x)$ .

$\mathbf{(\boldsymbol{\mathcal{S}}4)}$

For any $x\in\mathbf{X}$ , $g_{i}(x,\cdot)$ for $i\in\mathbb{N}_{m}$ are upper-semicontinuous on $\mathbf{A}(x)$ .

or

$\mathbf{(\boldsymbol{\mathcal{W}}1)}$

For any $x\in\mathbf{X}$ , the action set $\mathbf{A}(x)$ is compact and the multifunction from $\mathbf{X}$ to $\mathbf{A}$ defined by $x\rightarrow\mathbf{A}(x)$ is upper-semicontinuous. 2. $\mathbf{(\boldsymbol{\mathcal{W}}2)}$

For any $f\in\boldsymbol{\mathcal{C}}(\mathbf{X})$ , $Qf$ is continuous on $\mathbf{K}$ . 3. $\mathbf{(\boldsymbol{\mathcal{W}}3)}$

The function $h$ is upper-semicontinuous on $\mathbf{K}$ . 4. $\mathbf{(\boldsymbol{\mathcal{W}}4)}$

The functions $g_{i}$ for $i\in\mathbb{N}_{m}$ are upper-semicontinuous on $\mathbf{K}$ .

Theorem A.1

Suppose that $\mu(h^{+})<+\infty$ or $\mu(h^{-})<+\infty$ for any $\mu\in\boldsymbol{\mathcal{O}}$ and that either conditions $(\mathcal{S}1)$-$(\mathcal{S}3)$ or $(\mathcal{W}1)$-$(\mathcal{W}3)$ are satisfied. Then

[TABLE]

Proof: The proof is essentially the same as that of Theorem 1 in [17], the only difference being that we consider here the expected total reward criterion, whereas Schäl deals in [17] with the conditional version of that performance criterion. Adapting the arguments developed in [17] yields the result. $\Box$

Proposition A.2

Consider $\tilde{\theta}\in\mathbb{R}^{m}$. Assume that $\sup\big{\{}\mu(h^{+}+g_{i}^{+}):\mu\in\boldsymbol{\mathcal{O}}\mathop{\cup}\{\eta^{\Phi}:\Phi\in\boldsymbol{\mathcal{K}}_{p}\}\big{\}}<+\infty$, and that $\mu(h^{-})<+\infty$ and $\mu(g_{i}^{-})<+\infty$ for any $\mu\in\boldsymbol{\mathcal{O}}\mathop{\cup}\{\eta^{\Phi}:\Phi\in\boldsymbol{\mathcal{K}}_{p}\}$. Suppose also that Assumption 3 and either conditions $(\mathcal{S}1)$-$(\mathcal{S}4)$ or $(\mathcal{W}1)$-$(\mathcal{W}4)$ are satisfied. If there exists $\widetilde{\mu}\in\boldsymbol{\mathcal{O}}_{s}$ satisfying $\tilde{\theta}_{i}<\widetilde{\mu}(g_{i})$ for any $i\in\mathbb{N}_{m}$, then

[TABLE]

Proof: Let $\boldsymbol{\mathfrak{R}}$ be either $\boldsymbol{\mathcal{O}}$ or $\{\eta^{\Phi}:\Phi\in\boldsymbol{\mathcal{K}}_{p}\}$. Clearly, $\beta\mu_{1}+(1-\beta)\mu_{2}\in\boldsymbol{\mathfrak{R}}$ for any $\mu_{1}$, $\mu_{2}$ in $\boldsymbol{\mathfrak{R}}$ and $\beta\in[0,1]$. Let us define $\displaystyle\mathcal{C}=\mathop{\cup}_{\mu\in\boldsymbol{\mathfrak{R}}}\{\theta\in\mathbb{R}^{m}:\mu(g_{i})\geq\theta_{i}\text{ for }i\in\mathbb{N}_{m}\}$. The set $\mathcal{C}$ is clearly a non-empty convex subset of $\mathbb{R}^{m}$. Define the function $\mathcal{V}$ on $\mathcal{C}$ by

[TABLE]
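Consistent with the way $\mathcal{V}$ is used in the remainder of the proof (selection of $\epsilon$-optimal measures $\mu$ with $\mu(g_{i})\geq\theta_{i}$ and $\mu(h)\geq\mathcal{V}(\theta)-\epsilon$), the definition can be read as the following sketch. This is an assumption drawn from the surrounding argument, not necessarily the authors' exact display:

$$
\mathcal{V}(\theta)=\sup\big\{\mu(h):\ \mu\in\boldsymbol{\mathfrak{R}},\ \mu(g_{i})\geq\theta_{i}\ \text{ for } i\in\mathbb{N}_{m}\big\},\qquad \theta\in\mathcal{C}.
$$

With this reading, $\mathcal{V}$ is non-increasing in each coordinate of $\theta$, since lowering $\theta_{i}$ enlarges the set of admissible measures.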

By hypothesis, $\mathcal{V}$ takes values in $\mathbb{R}$ for any $\theta\in\mathcal{C}$. Observe that $\mathcal{V}$ is a proper concave function on $\mathcal{C}$. Indeed, consider $\theta_{1}=(\theta_{1,1},\ldots,\theta_{1,m})$ and $\theta_{2}=(\theta_{2,1},\ldots,\theta_{2,m})$ in $\mathcal{C}$ and $\beta\in[0,1]$. For any $\epsilon>0$, there exist $\mu_{j,\epsilon}\in\boldsymbol{\mathfrak{R}}$ for $j=1,2$ satisfying $\mu_{j,\epsilon}(g_{i})\geq\theta_{j,i}$ for $i\in\mathbb{N}_{m}$ and $\mu_{j,\epsilon}(h)\geq\mathcal{V}(\theta_{j})-\epsilon/2$. Clearly, we have $\big{(}\beta\mu_{1,\epsilon}+(1-\beta)\mu_{2,\epsilon}\big{)}(g_{i})\geq\beta\theta_{1,i}+(1-\beta)\theta_{2,i}$ for any $i\in\mathbb{N}_{m}$. Therefore,

[TABLE]

showing that $\mathcal{V}$ is a proper concave function on $\mathcal{C}$. Now, $\tilde{\theta}$ is in the interior of $\mathcal{C}$, so $\mathcal{V}$ is continuous at $\tilde{\theta}$ by Proposition 2.17 in [3]. We can therefore apply Proposition 2.36 in [3] to claim the existence of $\tilde{\lambda}\in\mathbb{R}^{m}$ such that, for all $\theta\in\mathcal{C}$,

[TABLE]
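The two steps above can be sketched as follows. The symbol $\mu_{\beta}$ is introduced here for illustration only, and the sign convention of the second display is an assumption chosen to be compatible with $\tilde{\lambda}\leq\mathbf{0}_{m}$ and with the terms $h-\langle\lambda,g\rangle$ used later in the proof. Since $\mu_{\beta}=\beta\mu_{1,\epsilon}+(1-\beta)\mu_{2,\epsilon}$ belongs to $\boldsymbol{\mathfrak{R}}$ and is admissible for $\beta\theta_{1}+(1-\beta)\theta_{2}$,

$$
\mathcal{V}\big(\beta\theta_{1}+(1-\beta)\theta_{2}\big)\ \geq\ \mu_{\beta}(h)\ =\ \beta\mu_{1,\epsilon}(h)+(1-\beta)\mu_{2,\epsilon}(h)\ \geq\ \beta\mathcal{V}(\theta_{1})+(1-\beta)\mathcal{V}(\theta_{2})-\frac{\epsilon}{2},
$$

and letting $\epsilon\to 0$ gives concavity. A supergradient $\tilde{\lambda}$ of $\mathcal{V}$ at $\tilde{\theta}$ would then satisfy

$$
\mathcal{V}(\theta)\ \leq\ \mathcal{V}(\tilde{\theta})+\langle\tilde{\lambda},\theta-\tilde{\theta}\rangle,\qquad \theta\in\mathcal{C}.
$$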

Note that $\tilde{\lambda}\leq\mathbf{0}_{m}$ since $\mathcal{V}(\theta)\geq\mathcal{V}(\tilde{\theta})$ for all $\theta\leq\tilde{\theta}$. Now, fix an arbitrary $\mu\in\boldsymbol{\mathfrak{R}}$. Then $(\mu(g_{1}),\ldots,\mu(g_{m}))\in\mathcal{C}$ and so,

[TABLE]

Therefore,

[TABLE]

For any $\epsilon>0$, there exists $\mu_{\epsilon}\in\boldsymbol{\mathfrak{R}}$ with $\mu_{\epsilon}(g_{i})\geq\tilde{\theta}_{i}$ for any $i\in\mathbb{N}_{m}$ such that $\mu_{\epsilon}(h)\geq\mathcal{V}(\tilde{\theta})-\epsilon$, implying

[TABLE]

since $\tilde{\lambda}\leq\mathbf{0}_{m}$ . Together with (31), this shows

[TABLE]
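To make this step explicit, here is a sketch under two assumptions: the supergradient inequality has the form $\mathcal{V}(\theta)\leq\mathcal{V}(\tilde{\theta})+\langle\tilde{\lambda},\theta-\tilde{\theta}\rangle$, and $\mu(g)$ is shorthand (introduced here) for the vector $(\mu(g_{1}),\ldots,\mu(g_{m}))$. For any $\mu\in\boldsymbol{\mathfrak{R}}$ we have $\mu(h)\leq\mathcal{V}(\mu(g))\leq\mathcal{V}(\tilde{\theta})+\langle\tilde{\lambda},\mu(g)-\tilde{\theta}\rangle$, which gives the upper bound $\mu\big(h-\langle\tilde{\lambda},g\rangle\big)+\langle\tilde{\lambda},\tilde{\theta}\rangle\leq\mathcal{V}(\tilde{\theta})$. Conversely, since $\tilde{\lambda}\leq\mathbf{0}_{m}$ and $\mu_{\epsilon}(g_{i})\geq\tilde{\theta}_{i}$, the inner product $\langle\tilde{\lambda},\mu_{\epsilon}(g)-\tilde{\theta}\rangle$ is non-positive, hence

$$
\mu_{\epsilon}\big(h-\langle\tilde{\lambda},g\rangle\big)+\langle\tilde{\lambda},\tilde{\theta}\rangle
\ =\ \mu_{\epsilon}(h)-\langle\tilde{\lambda},\mu_{\epsilon}(g)-\tilde{\theta}\rangle
\ \geq\ \mu_{\epsilon}(h)\ \geq\ \mathcal{V}(\tilde{\theta})-\epsilon .
$$

Letting $\epsilon\to 0$ and combining both bounds yields

$$
\mathcal{V}(\tilde{\theta})\ =\ \sup_{\mu\in\boldsymbol{\mathfrak{R}}}\mu\big(h-\langle\tilde{\lambda},g\rangle\big)+\langle\tilde{\lambda},\tilde{\theta}\rangle .
$$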

Now, we have for $\lambda\leq\mathbf{0}_{m}$ ,

[TABLE]

implying

[TABLE]

and so with (32) we obtain

[TABLE]
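The weak-duality step behind this conclusion can be sketched as follows, again writing $\mu(g)=(\mu(g_{1}),\ldots,\mu(g_{m}))$ as an assumed shorthand. For $\lambda\leq\mathbf{0}_{m}$ and any $\mu\in\boldsymbol{\mathfrak{R}}$ with $\mu(g_{i})\geq\tilde{\theta}_{i}$ for $i\in\mathbb{N}_{m}$, the term $-\langle\lambda,\mu(g)-\tilde{\theta}\rangle$ is non-negative, so

$$
\mu(h)\ \leq\ \mu(h)-\langle\lambda,\mu(g)-\tilde{\theta}\rangle
\ =\ \mu\big(h-\langle\lambda,g\rangle\big)+\langle\lambda,\tilde{\theta}\rangle
\ \leq\ \sup_{\mu'\in\boldsymbol{\mathfrak{R}}}\mu'\big(h-\langle\lambda,g\rangle\big)+\langle\lambda,\tilde{\theta}\rangle .
$$

Taking the supremum over admissible $\mu$ on the left and the infimum over $\lambda\leq\mathbf{0}_{m}$ on the right, and using that equality holds at $\lambda=\tilde{\lambda}$, one would expect

$$
\mathcal{V}(\tilde{\theta})\ =\ \inf_{\lambda\leq\mathbf{0}_{m}}\Big\{\sup_{\mu\in\boldsymbol{\mathfrak{R}}}\mu\big(h-\langle\lambda,g\rangle\big)+\langle\lambda,\tilde{\theta}\rangle\Big\},
$$

with the infimum attained at $\tilde{\lambda}$.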

Therefore, with $\boldsymbol{\mathfrak{R}}=\boldsymbol{\mathcal{O}}$

[TABLE]

and with $\boldsymbol{\mathfrak{R}}=\{\eta^{\Phi}:\Phi\in\boldsymbol{\mathcal{K}}_{p}\}$

[TABLE]

Now, for $\lambda\leq\mathbf{0}_{m}$ we have $\sup\Big{\{}\eta^{\Phi}\Big{(}\big{(}h-\langle\lambda,g\rangle\big{)}^{+}\Big{)}:\Phi\in\boldsymbol{\mathcal{K}}_{p}\Big{\}}<+\infty$ by hypothesis and we obtain from Lemma 4.2 and Theorem 4.3 that

[TABLE]

and also,

[TABLE]

Therefore, combining equations (34)-(36) we obtain that

[TABLE]

Moreover, Theorem A.1 can be applied to show that

[TABLE]

Combining equations (33), (37) and (38), we obtain the result. $\Box$

Bibliography

[1] C. Aliprantis and K. Border. Infinite dimensional analysis. Springer, Berlin, third edition, 2006. A hitchhiker’s guide.
[2] E. Altman. Constrained Markov decision processes. Stochastic Modeling. Chapman & Hall/CRC, Boca Raton, FL, 1999.
[3] V. Barbu and T. Precupanu. Convexity and optimization in Banach spaces. Springer Monographs in Mathematics. Springer, Dordrecht, fourth edition, 2012.
[4] V. Borkar. A convex analytic approach to Markov decision processes. Probab. Theory Related Fields, 78(4):583–602, 1988.
[5] V. Borkar. Convex analytic methods in Markov decision processes. In Handbook of Markov decision processes, volume 40 of Internat. Ser. Oper. Res. Management Sci., pages 347–375. Kluwer Acad. Publ., Boston, MA, 2002.
[6] F. Dufour, M. Horiguchi, and A. Piunovskiy. The expected total cost criterion for Markov decision processes under constraints: a convex analytic approach. Advances in Applied Probability, 44(3):774–793, 2012.
[7] F. Dufour and A. Piunovskiy. Multiobjective stopping problem for discrete-time Markov processes: convex analytic approach. J. Appl. Probab., 47(4):947–966, 2010.
[8] F. Dufour and A. Piunovskiy. The expected total cost criterion for Markov decision processes under constraints. Advances in Applied Probability, 45(3):837–859, 2013.