On Linear Programming for Constrained and Unconstrained Average-Cost Markov Decision Processes with Countable Action Spaces and Strictly Unbounded Costs
Huizhen Yu

Abstract
We consider the linear programming approach for constrained and unconstrained Markov decision processes (MDPs) under the long-run average cost criterion, where the class of MDPs in our study have Borel state spaces and discrete countable action spaces. Under a strict unboundedness condition on the one-stage costs and a recently introduced majorization condition on the state transition stochastic kernel, we study infinite-dimensional linear programs for the average-cost MDPs and prove the absence of a duality gap and other optimality results. Our results do not require a lower-semicontinuous MDP model. Thus, they can be applied to countable action space MDPs where the dynamics and one-stage costs are discontinuous in the state variable. Our proofs make use of the continuity property of Borel measurable functions asserted by Lusin's theorem.
Huizhen Yu, RLAI Lab, Department of Computing Science, University of Alberta, Canada
**Keywords:** Markov decision processes; Borel state space; countable action space; average cost; constraints; minimum pair; majorization condition; infinite-dimensional linear programs; duality
1 Introduction
We consider discrete-time Markov decision processes (MDPs) with the long-run average cost criterion. Our focus will be on the linear programming (LP) approach, for a class of unconstrained and constrained MDPs that have Borel state spaces, discrete countable action spaces, and unbounded one-stage costs.
LP methods for average-cost MDPs have a long history and an extensive literature. For MDPs with finite state and action spaces, see e.g., [9, 24, 25, 28]; for countable state spaces and countable or compact action spaces, see [6, 7, 26, 27, 31]; and for Borel state and action spaces, see [15, 17, 18, 21, 30, 38]. The interested reader may also consult the books [1, 13, 19, 20, 35] and their references. The third group of results deal with uncountably infinite state spaces and are most closely related to our work. In particular, using the theory of infinite-dimensional LP (Anderson and Nash [2]), Hernández-Lerma and Lasserre [18] (see also [20, Chap. 12], [21]) formulated a general LP framework for Borel space average-cost MDPs. They studied the relations between the values of the primal/dual linear programs and the minimum average cost of an MDP, proved the absence of a duality gap under certain continuity conditions on the MDP model, and related the solutions of the programs to stationary optimal policies and average cost optimality equations (ACOE) of the MDP. Much earlier than [18], Yamada [38] considered linear programs for a special class of geometrically ergodic MDPs with compact Euclidean state/action spaces and proved duality results for these problems. Building on the work [18], Hernández-Lerma and González-Hernández [15] provided additional results and generalizations. Extensions of the LP method to constrained average-cost problems were studied by Kurano et al. [30] for compact spaces and by Hernández-Lerma et al. [17] for non-compact spaces.
Another line of research that is closely related to our work, as well as to the prior work on LP mentioned above, is the minimum pair approach for average-cost Borel space MDPs ([14, 29], [19, Chap. 5.7]; see also the related convex analytic approach [6]). With this approach, one considers minimizing the average cost over all policies and initial distributions, and the interest is in the existence of an optimal pair of policy and initial distribution with the following structure. The policy is stationary, and the associated initial distribution is an invariant probability measure of the Markov chain induced by the policy. In this paper we shall call a pair with such a structure a “stationary pair” and if it attains the minimum average cost, a “stationary minimum pair.” The feasibility and solvability of the primal linear programs studied in the prior work mentioned earlier in fact depend on the existence of such pairs. Conversely, a stationary minimum pair, when it exists, can be found by solving a linear program in the space of invariant probability measures induced by stationary policies, thus providing a way to find a stationary optimal policy for a subset of states.¹ In some cases, with further ergodicity and regularity conditions, one can also extend the policy to an optimal one over the entire state space and establish stronger optimality, including sample-path optimality, of the policy [14, 29, 32, 37].
¹ For finite state and action MDPs, Denardo [8] seems to be the first to recognize the relation between the solution of a certain linear program and a stationary minimum pair, and he proposed to find a stationary average-cost optimal policy in a multichain MDP by repeatedly solving those linear programs on subproblems with smaller state spaces. This procedure is not applicable in general when the state space is uncountably infinite, since the “chain structure” of an MDP in this case can be complicated and hard to analyze. For some results on LP for “multichain” Borel space MDPs, see [15].
Our work builds upon earlier research on the LP and minimum pair methods for average-cost MDPs mentioned above. In those prior results the action space is more general than the countable action space we deal with in this paper. However, except for [38], all of those results assume a lower-semicontinuous MDP model. Namely, they require the one-stage cost functions to be lower semicontinuous and the state transition stochastic kernels to be (weakly) continuous ([38] involves different continuity conditions; see Remark 3.2 for details). Our work does not require this assumption.
We recently introduced in [40] a majorization condition on the state transition stochastic kernel to deal with Borel space MDPs that do not satisfy such continuity conditions. For the case of countable action spaces (with the discrete topology), we obtained the existence of a stationary minimum pair and other average-cost optimality results analogous to those for lower-semicontinuous MDPs given by [14, 29, 32, 37]. The purpose of the majorization condition is to make use of Lusin’s theorem on the continuity of Borel measurable functions [11, Thm. 7.5.2]. Roughly speaking, we require the existence of finite Borel measures on the state space that can majorize certain sub-stochastic kernels created from the state transition stochastic kernel, at all admissible state-action pairs (see Assumption 2.1(M)). We then use those majorizing finite measures in combination with Lusin’s theorem to extract arbitrarily large (according to a given finite measure) sets on which certain Borel measurable functions involved in our analysis have desired continuity properties. With this technique, we are able to avoid the lower-semicontinuous model assumption and obtain results in [40] that can be applied to MDPs with discontinuous dynamics and one-stage costs, although the application range is currently limited to the case of countable action spaces.
The purpose of this work is to further analyze the implications of the majorization condition and Lusin’s theorem in the LP context, for both unconstrained and constrained MDPs. The main contributions of this paper are as follows.
- (i) For unconstrained average-cost MDPs, under the strictly unbounded cost condition and the majorization condition (cf. Assumption 2.1), we prove there is no duality gap between the primal and dual linear programs in an LP formulation (see Theorem 3.1).
- (ii) For constrained average-cost MDPs, under conditions similar to those in (i), we first prove the existence of a stationary optimal pair and a stationary lexicographically optimal pair (which are analogous to stationary minimum pairs for unconstrained MDPs), and we then prove the absence of a duality gap for an LP formulation (see Theorems 4.1 and 4.2, respectively).
In addition, we also discuss the maximizing sequences of dual linear programs and their relation with certain versions of ACOE (see Prop. 3.1 for unconstrained MDPs and Props. 4.1, 4.2 for constrained MDPs). Our results for unconstrained (resp., constrained) MDPs given in this paper can be compared with some of the prior results in [20, Chap. 12] and [21] (resp., [17] and [30]) for lower-semicontinuous models.
While this paper focuses on the average cost criterion, the analysis we give, with minor changes, can also be applied to constrained (or multi-objective) discounted-cost MDPs similar to those studied in [12, 16, 23], for finding constrained optimal or Pareto optimal policies (for a given initial distribution) using the LP approach, in the case of countable action spaces. In a separate recent work [39] based on similar ideas, we introduced another majorization condition for MDPs where both the state and action spaces are Borel, and used the majorization condition instead of the commonly required continuity/compactness conditions to prove the average cost optimality inequalities via the vanishing discount factor approach.
The rest of this paper is organized as follows. In Section 2 we give background materials about the average-cost MDP model, some prior optimality results for the minimum pair approach, and an overview of linear programs in topological vector spaces. In Section 3 we present our LP formulation and duality results for unconstrained MDPs. We then extend these results to constrained MDPs in Section 4. Proofs for the theorems in Sections 3 and 4 are given in Section 5.
2 Preliminaries
We start with some notations and basic definitions. For a topological space $X$, $\mathcal{B}(X)$ denotes the Borel $\sigma$-algebra on $X$, and $\mathcal{P}(X)$ denotes the set of probability measures on $\mathcal{B}(X)$. We will refer to nonnegative or signed measures on $\mathcal{B}(X)$ as Borel measures. A Borel space (a.k.a. standard Borel space) is a separable metrizable space that is homeomorphic to a Borel subset of some Polish space (i.e., a separable and completely metrizable space) [3, Chap. 7]. Let $X$ and $Y$ be Borel spaces. A Borel measurable stochastic kernel on $Y$ given $X$ is a Borel measurable function from $X$ into $\mathcal{P}(Y)$, where the space $\mathcal{P}(Y)$ is endowed with the topology of weak convergence. We denote the stochastic kernel by $q(dy \mid x)$. When it is continuous on $X$, we call it a continuous stochastic kernel (it is also called weakly continuous or weak Feller in the literature). For the space $\mathcal{P}(X)$ or more generally, the space of finite Borel measures on $X$, besides the topology of weak convergence just mentioned, we shall also consider other topologies in the next section when these spaces appear in infinite-dimensional linear programs.
We now introduce average-cost MDPs and the minimum pair approach, after which we will briefly review infinite-dimensional linear programs in topological vector spaces.
2.1 MDP Model, Average Cost Criterion, and Minimum Pair Approach
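We consider an MDP with state space $\mathbb{X}$ and action space $\mathbb{A}$, where $\mathbb{X}$ is a Borel space and $\mathbb{A}$ is a countable space endowed with the discrete topology. The control constraint is specified by a set-valued map $A: \mathbb{X} \to 2^{\mathbb{A}}$. In particular, each state $x \in \mathbb{X}$ is associated with a nonempty set $A(x) \subset \mathbb{A}$ of admissible actions, and the graph of the map $A(\cdot)$,
$$\Gamma := \big\{ (x,a) \,\big|\, x \in \mathbb{X},\ a \in A(x) \big\},$$
is assumed to be a Borel subset of $\mathbb{X} \times \mathbb{A}$. If an action $a \in A(x)$ is taken at state $x$, a one-stage cost $c(x,a)$ is incurred, followed by a probabilistic state transition. We assume that the state transition is governed by a Borel measurable stochastic kernel $q(dy \mid x, a)$ on $\mathbb{X}$ given $\mathbb{X} \times \mathbb{A}$, and that the one-stage cost function $c: \mathbb{X} \times \mathbb{A} \to [0, +\infty]$ is nonnegative and Borel measurable, real-valued on $\Gamma$, and taking the value $+\infty$ outside $\Gamma$.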
A policy is a sequence of stochastic kernels on $\mathbb{A}$ that specify how to take actions at each stage, given the history up to that stage. More precisely, for infinite-horizon average cost problems that we consider, a Borel measurable policy is an infinite sequence $\pi = (\mu_0, \mu_1, \ldots)$ where for each $n \geq 0$, $\mu_n\big(da_n \mid x_0, a_0, \ldots, a_{n-1}, x_n\big)$ is a Borel measurable stochastic kernel on $\mathbb{A}$ given $(\mathbb{X} \times \mathbb{A})^n \times \mathbb{X}$ and obeys the control constraint of the MDP:
$$\mu_n\big(A(x_n) \mid x_0, a_0, \ldots, a_{n-1}, x_n\big) = 1, \qquad \forall\, (x_0, a_0, \ldots, a_{n-1}, x_n) \in (\mathbb{X} \times \mathbb{A})^n \times \mathbb{X}.$$
Such a policy is called nonrandomized if in the above every measure $\mu_n(da_n \mid x_0, a_0, \ldots, a_{n-1}, x_n)$ on $\mathbb{A}$ is a Dirac measure, and it is called stationary if the function $(x_0, a_0, \ldots, a_{n-1}, x_n) \mapsto \mu_n(da_n \mid x_0, a_0, \ldots, a_{n-1}, x_n)$ depends only on the state $x_n$, in the same way for every $n \geq 0$. In the stationary case, we can write the policy as $\pi = (\mu, \mu, \ldots)$ for a Borel measurable stochastic kernel $\mu(da \mid x)$ on $\mathbb{A}$ given $\mathbb{X}$ that obeys the control constraint of the MDP, and we will simply designate this policy by $\mu$.
Let $\Pi$ denote the space of Borel measurable policies, and let $\Pi_s$ be the subset of all stationary policies in $\Pi$. Given that the action space is countable, $\Pi$ and $\Pi_s$ are nonempty (see e.g., [40, Sect. 2]), and the Borel measurable policies will be adequate for our purpose; henceforth, we shall simply call them policies. We also note that although $\mathbb{A}$ is countable, in the above and throughout the paper, we write probability measures on $\mathbb{A}$ using the general notation for probability measures on a possibly uncountably infinite space, for notational simplicity.
2.1.1 Average Cost Criterion and Minimum Pair
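In an MDP, a policy $\pi \in \Pi$ and an initial (state) distribution $\zeta \in \mathcal{P}(\mathbb{X})$ induce a stochastic process $\{(x_n, a_n)\}_{n \geq 0}$ on the infinite product of state and action spaces, $(\mathbb{X} \times \mathbb{A})^\infty$. The probability measure for this process is uniquely determined by the initial distribution $\zeta$, the sequence of stochastic kernels in $\pi$, and the state transition stochastic kernel $q(dy \mid x, a)$ [3, Prop. 7.28]. We denote this probability measure by $\mathbf{P}^\pi_\zeta$ and the corresponding expectation operator by $\mathbf{E}^\pi_\zeta$. The long-run expected average cost of the policy $\pi$ for the initial distribution $\zeta$ is defined by
$$J(\pi, \zeta) := \limsup_{n \to \infty} \frac{1}{n}\, \mathbf{E}^\pi_\zeta \Big[ \sum_{k=0}^{n-1} c(x_k, a_k) \Big].$$
We shall also refer to $J(\pi, \zeta)$ as the average cost of the pair $(\pi, \zeta)$. With the minimum pair approach, we consider the average costs of all policy and initial distribution pairs, and among these pairs, we are especially interested in the types of pairs defined below.
Let $\rho^*$ be the minimum average cost over all policies and initial distributions:
$$\rho^* := \inf_{\zeta \in \mathcal{P}(\mathbb{X})}\ \inf_{\pi \in \Pi} J(\pi, \zeta).$$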
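To make the average cost criterion concrete, the following minimal sketch evaluates the finite-horizon average of expected one-stage costs for a fixed stationary policy on a finite chain. Everything in it is an illustrative assumption rather than part of the paper: the two-state transition matrix `P`, the per-state cost vector `c`, the initial distribution `zeta`, and the use of a large finite horizon in place of the limsup.

```python
def average_cost(P, c, zeta, n):
    """Finite-horizon average (1/n) * sum_{k<n} E[c(x_k)] for a finite chain.

    P    : row-stochastic transition matrix under a fixed stationary policy
    c    : expected one-stage cost in each state under that policy
    zeta : initial state distribution
    n    : horizon (a large n only approximates the limsup in the definition)
    """
    dist = list(zeta)
    total = 0.0
    states = range(len(P))
    for _ in range(n):
        # Expected cost incurred at the current stage.
        total += sum(dist[x] * c[x] for x in states)
        # Push the state distribution one step forward.
        dist = [sum(dist[x] * P[x][y] for x in states) for y in states]
    return total / n


# Hypothetical two-state example (numbers chosen only for illustration).
P = [[0.9, 0.1], [0.5, 0.5]]
c = [1.0, 4.0]
zeta = [1.0, 0.0]
J_n = average_cost(P, c, zeta, 5000)
```

In this toy chain the invariant distribution is (5/6, 1/6), so the stationary average cost is 1.5, and `J_n` approaches that value as the horizon grows.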
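**Definition 2.1.**
A pair $(\pi^*, \zeta^*) \in \Pi \times \mathcal{P}(\mathbb{X})$ with $J(\pi^*, \zeta^*) = \rho^*$ is called a minimum pair.
**Definition 2.2 (stationary pair and stationary minimum pair).**
- (a) For a stationary policy $\mu \in \Pi_s$ and an initial distribution $p \in \mathcal{P}(\mathbb{X})$, if $p$ is an invariant probability measure of the Markov chain induced by $\mu$ on $\mathbb{X}$, we call $(\mu, p)$ a stationary pair. The set of all stationary pairs is denoted by $\Delta_s$.
- (b) If $(\mu, p) \in \Delta_s$ is a minimum pair, we call it a stationary minimum pair.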
*Remark 2.1.*
Various terminologies are used in the literature for what we call a stationary pair $(\mu, p)$. In the references [17, 18, 20], the policy $\mu$ is called a “stable policy” if $J(\mu, p) < \infty$. In the reference [7], the probability measure $\gamma(d(x,a)) = \mu(da \mid x)\, p(dx)$ is called an “ergodic occupation measure”; we will discuss such measures in Section 3.1. ∎
2.1.2 Model Assumptions and Existence of Stationary Minimum Pair
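We now impose additional conditions on the MDP model. For a set $B$ in some space, let $B^c$ denote its complement; for a set $B \subset \mathbb{X} \times \mathbb{A}$, let $\text{proj}_{\mathbb{X}}(B)$ denote the projection of $B$ on $\mathbb{X}$. Recall that $\Gamma = \{(x,a) \mid x \in \mathbb{X},\ a \in A(x)\}$.
**Assumption 2.1.**
- (G) For some $\pi \in \Pi$ and $\zeta \in \mathcal{P}(\mathbb{X})$, the average cost $J(\pi, \zeta) < \infty$.
- (SU) There exists a nondecreasing sequence of compact sets $\Gamma_j \uparrow \Gamma$ such that
$$\lim_{j \to \infty}\ \inf_{(x,a) \in \Gamma_j^c} c(x,a) = +\infty.$$
- (M) For each compact set $K \in \{\text{proj}_{\mathbb{X}}(\Gamma_j)\}$, there exist an open set $O \supset K$, a closed set $D \subset \mathbb{X}$, and a finite measure $\nu$ on $\mathcal{B}(\mathbb{X})$ (all of which can depend on $K$) such that
$$q\big( (O \setminus D) \cap B \mid x, a \big) \leq \nu(B), \qquad \forall\, B \in \mathcal{B}(\mathbb{X}),\ (x,a) \in \Gamma,$$
where the closed set $D$ (possibly empty) is such that restricted to $D \times \mathbb{A}$, the state transition stochastic kernel $q(dy \mid x, a)$ is continuous and the one-stage cost function $c$ is lower semicontinuous.²
² Since $\mathbb{A}$ is discrete, the continuity condition here means that for each action $a \in \mathbb{A}$, $q(dy \mid \cdot, a)$ and $c(\cdot, a)$ are continuous and lower semicontinuous, respectively, on the set $D$.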
The first two conditions in this assumption are standard: (G) excludes vacuous problems, and (SU) defines the case of strictly unbounded one-stage costs. They were used in, e.g., [14, 20, 32, 37] to derive average-cost optimality and LP duality results for lower-semicontinuous MDP models with strictly unbounded costs.
When the function $c$ is lower semicontinuous, (SU) is equivalent to $c$ being inf-compact on $\Gamma$, i.e., the level set $\{(x,a) \in \Gamma \mid c(x,a) \leq r\}$ is compact for all $r \geq 0$. In our case, these sets need not be closed and instead, (SU) is equivalent to these level sets having compact closures. Note also that the set $\Gamma$ is $\sigma$-compact under (SU) and, since $\mathbb{X} = \text{proj}_{\mathbb{X}}(\Gamma)$, the space $\mathbb{X}$ thus must also be $\sigma$-compact.
Condition (M) was introduced in our recent work [40]. We use the majorization property required in (M) instead of the lower-semicontinuity model conditions commonly required in the literature. The set $D$ in (M) is introduced to separate a “continuous part” of the model from the rest, in order to sharpen (M), although this condition can also be used with $D = \emptyset$. Condition (M) seems natural for problems where the probability measures $q(dy \mid x, a)$ have densities on $\mathbb{X}$ with respect to (w.r.t.) a common $\sigma$-finite reference measure and those density functions are bounded uniformly from above. For instance, if $\mathbb{X}$ is a Euclidean space and the reference measure is the Lebesgue measure, we can take $\nu$ in (M) to be a multiple of the Lebesgue measure restricted to a bounded open set $O$ that contains $K$. See [40, Example 3.2 and Remark 3.3] for more specific examples that illustrate situations where (M) is naturally satisfied or cannot be satisfied.
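The bounded-density case just described can be checked in one line. The sketch below assumes, purely for illustration, that (M) asks for $q((O \setminus D) \cap B \mid x, a) \leq \nu(B)$ for all Borel sets $B$ and all admissible $(x,a)$, that each $q(dy \mid x,a)$ has a density $f_{x,a}$ w.r.t. the Lebesgue measure $\lambda$, that $f_{x,a} \leq C$ on a bounded open set $O \supset K$ for a constant $C$ independent of $(x,a)$, and that $D = \emptyset$. The symbols $f_{x,a}$, $C$ and $\lambda$ are introduced here and do not come from the text.

```latex
\begin{align*}
q\big((O \setminus D) \cap B \mid x, a\big)
  &= \int_{O \cap B} f_{x,a}(y)\, \lambda(dy)
     && \text{(density w.r.t. } \lambda,\ D = \emptyset\text{)} \\
  &\le C\, \lambda(O \cap B) \;=:\; \nu(B)
     && \text{(uniform bound } f_{x,a} \le C \text{ on } O\text{)}, \\
\nu(\mathbb{X}) &= C\, \lambda(O) < \infty
     && \text{(} O \text{ is bounded, so } \nu \text{ is a finite measure).}
\end{align*}
```

Under these assumptions the same measure $\nu$ works for every admissible state-action pair, which is the uniformity the majorization property requires.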
Under the preceding assumption, the following results are proved in [40] by making use of Lusin’s theorem (see [40, Thm. 3.5] for sample-path and other optimality properties of a stationary minimum pair). They are analogous to the prior results for lower-semicontinuous MDPs [14, 20, 29], and they will serve as the starting point for the analyses we present in this paper.
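**Theorem 2.1 (optimality of stationary pairs [40, Prop. 3.2, Thm. 3.3]).**
Under Assumption 2.1, the following hold:
- (i) For any pair $(\pi, \zeta) \in \Pi \times \mathcal{P}(\mathbb{X})$ with $J(\pi, \zeta) < \infty$, there exists a stationary pair $(\mu, p) \in \Delta_s$ with $J(\mu, p) \leq J(\pi, \zeta)$.
- (ii) There exists a stationary minimum pair $(\mu^*, p^*)$.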
2.2 Linear Programs in Topological Vector Spaces
We now give a brief overview of topological vector spaces over the real field and infinite-dimensional linear programs in such spaces. The reader is referred to the books [2, 36] for in-depth studies of these subjects, and to the book [20, Chap. 12.2] for a more detailed introduction than ours. Here we shall focus on a few basic concepts and results that will be needed in this paper.
Let $X$ and $Y$ be two (real) vector spaces, and let $\mathit{0}$ denote the element zero for both spaces. The pair $(X, Y)$ is called a dual pair if there is a bilinear form $\langle \cdot, \cdot \rangle : X \times Y \to \mathbb{R}$ such that
- for each $x \neq \mathit{0}$ in $X$, there exists some $y \in Y$ with $\langle x, y \rangle \neq 0$,
- for each $y \neq \mathit{0}$ in $Y$, there exists some $x \in X$ with $\langle x, y \rangle \neq 0$.
For a dual pair $(X, Y)$, the coarsest topology on $X$ under which the function $x \mapsto \langle x, y \rangle$ is continuous for every $y \in Y$ is called the weak topology on $X$ determined by $Y$, and denoted by $\sigma(X, Y)$. By symmetry, $(Y, X)$ is also a dual pair and $\sigma(Y, X)$, the weak topology on $Y$ determined by $X$, is likewise defined.
We recall that a topological vector space is a vector space with a topology that is compatible with its algebraic structure (namely, with that topology, the addition and multiplication operations are continuous; see [36, Chap. I.3]). When endowed with the weak topologies given above, each space in a dual pair is a topological vector space that is separated (i.e., a Hausdorff space) and locally convex (i.e., every point in the space has a base of convex neighborhoods) [36, Chap. II.3]. Convergence in $X$ under the weak topology $\sigma(X, Y)$ can be characterized as follows: a net $\{x_i\}_{i \in I}$ in $X$ converges to $x \in X$ if and only if (iff)
$$\langle x_i, y \rangle \to \langle x, y \rangle, \qquad \forall\, y \in Y.$$
We consider equality-constrained linear programs and their dual linear programs in topological vector spaces. The definitions of these programs involve several objects, which we introduce first:
- two dual pairs of vector spaces $(X, Y)$ and $(Z, W)$, with each space endowed with its respective weak topology;
- a linear mapping $L : X \to Z$ that is required to be weakly continuous (i.e., $L$ is continuous under the topology $\sigma(X, Y)$ for $X$ and the topology $\sigma(Z, W)$ for $Z$);
- a convex cone $K$ in $X$ and its dual cone $K^*$ in $Y$ defined as
$$K^* := \big\{ y \in Y \,\big|\, \langle x, y \rangle \geq 0,\ \forall\, x \in K \big\}.$$
The convex cones $K$ and $K^*$ induce a partial ordering “$\geq$” on $X$ and $Y$, respectively:
$$x \geq x' \iff x - x' \in K; \qquad y \geq y' \iff y - y' \in K^*.$$
The linear mapping $L$ appears in the constraints of a linear program designated as the primal program (P). Associated with $L$ is another linear mapping $L^*$ on the space $W$, called the adjoint or transpose of $L$, that maps each $w \in W$ to a linear form on $X$ and is defined by the identity relation (where $Lx$ stands for $L(x)$):
$$\langle x, L^* w \rangle := \langle L x, w \rangle, \qquad \forall\, x \in X,\ w \in W.$$
An important property of and is given by the following proposition:
**Proposition 2.1 ([36, Chap. II, Prop. 12 and its corollary]).**
A linear mapping $L : X \to Z$ is weakly continuous if and only if $L^*(W) \subset Y$. If $L$ is weakly continuous, so is $L^*$.
This proposition gives a convenient way to verify whether a linear mapping $L$ is weakly continuous or not. When $L$ is weakly continuous, with the weakly continuous mapping $L^*$, one can define the dual of the primal linear program.
Let $b \in Z$ and $c \in Y$. Consider the following equality-constrained primal linear program (P) in the space $X$ and its dual linear program (P$^*$) in the space $W$ (cf. [2, Chap. 3.3]):
$$(\mathrm{P}) \quad \text{minimize } \langle x, c \rangle \quad \text{subject to } L x = b,\ x \in K;$$
$$(\mathrm{P}^*) \quad \text{maximize } \langle b, w \rangle \quad \text{subject to } c - L^* w \in K^*,\ w \in W.$$
Similarities between these programs and standard finite-dimensional linear programs can be seen by writing the constraints $x \in K$ and $c - L^* w \in K^*$ equivalently as $x \geq \mathit{0}$ and $L^* w \leq c$, respectively.
If the program (P) or (P$^*$) has a feasible solution, it is said to be consistent; if it admits an optimal solution, it is said to be solvable. Let $\inf(\mathrm{P})$ and $\sup(\mathrm{P}^*)$ denote the values of (P) and (P$^*$), respectively. The elementary duality theory (cf. [2, Chap. 3.3]) asserts that if (P) and (P$^*$) are both consistent, then
$$\sup(\mathrm{P}^*) \leq \inf(\mathrm{P}).$$
If the equality $\sup(\mathrm{P}^*) = \inf(\mathrm{P})$ holds, we say there is no duality gap.
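The elementary inequality between the two values follows directly from the definition of the adjoint. The short derivation below is a sketch under assumed notation: (P) is taken to minimize $\langle x, c \rangle$ subject to $Lx = b$, $x \in K$, and (P$^*$) to maximize $\langle b, w \rangle$ subject to $c - L^* w \in K^*$, with $x$ and $w$ denoting arbitrary feasible solutions of the two programs.

```latex
\begin{align*}
\langle b, w \rangle
  &= \langle L x, w \rangle
     && \text{(primal feasibility: } L x = b\text{)} \\
  &= \langle x, L^* w \rangle
     && \text{(definition of the adjoint } L^*\text{)} \\
  &= \langle x, c \rangle - \langle x,\, c - L^* w \rangle \\
  &\le \langle x, c \rangle
     && \text{(} x \in K,\ c - L^* w \in K^* \text{, so } \langle x,\, c - L^* w \rangle \ge 0\text{).}
\end{align*}
```

Taking the supremum over feasible $w$ and the infimum over feasible $x$ gives $\sup(\mathrm{P}^*) \leq \inf(\mathrm{P})$; a duality gap is the case where this inequality is strict.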
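There are several sufficient conditions for the absence of a duality gap. For our purpose, one duality theorem, Theorem 2.2 below from [2], will be the most important. It characterizes the relation between the value of (P$^*$) and the subvalue of (P), which is defined as follows.
Consider the set $H \subset Z \times \mathbb{R}$ defined by
$$H := \big\{ \big( L x,\ \langle x, c \rangle + r \big) \,\big|\, x \in K,\ r \geq 0 \big\}. \tag{2.4}$$
Let $\overline{H}$ denote the closure of $H$ in the weak topology $\sigma(Z \times \mathbb{R},\, W \times \mathbb{R})$ (corresponding to the dual pair $(Z \times \mathbb{R},\, W \times \mathbb{R})$ with the bilinear form $\langle (z, r), (w, r') \rangle = \langle z, w \rangle + r r'$). We call (P) subconsistent if there exists some $r$ with $(b, r) \in \overline{H}$. When (P) is subconsistent, the subvalue of (P) is defined by
$$\underline{\rho} := \inf \big\{ r \,\big|\, (b, r) \in \overline{H} \big\}.$$
For comparison, note that $\inf(\mathrm{P}) = \inf\big\{ r \,\big|\, (b, r) \in H \big\}$. Note also that if $\underline{\rho}$ is the subvalue of (P), then by the definition of the closure $\overline{H}$, $(b, \underline{\rho}) \in \overline{H}$ and there exists some net $\{(x_i, r_i)\}_{i \in I}$ with $x_i \in K$, $r_i \geq 0$ for all $i$, such that $L x_i \to b$ and $\langle x_i, c \rangle + r_i \to \underline{\rho}$, where $x_i$ need not be feasible for (P).
**Theorem 2.2 (subconsistency and duality [2, Thm. 3.3]).**
(P) is subconsistent with a finite subvalue $\underline{\rho}$ if and only if (P$^*$) is consistent with a finite value $\sup(\mathrm{P}^*) = \underline{\rho}$.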
We will apply this theorem in analyzing the duality relationship between the primal and dual linear programs for average-cost MDPs.
3 Linear Programming for Average-Cost MDPs
In this section we study the LP approach for the average-cost MDP under Assumption 2.1. Roughly speaking, the primal linear program (P) is formulated to find a stationary minimum pair among the stationary pairs of the average cost MDP; this is viable since under Assumption 2.1, the set of stationary pairs is nonempty and a stationary minimum pair exists (cf. Theorem 2.1). The dual linear program (P$^*$) is then determined by the primal program and the two dual pairs of vector spaces involved in the formulation (cf. Section 2.2). We present the LP formulation and our main duality results in Sections 3.1 and 3.2, respectively. (The proofs of the theorems are given in Section 5.)
Our formulation of the primal linear program is the same as that given by the prior work [20, Chap. 12.3]. But our dual program formulation is different; it avoids a condition on the state transition stochastic kernel used in [20, Chap. 12.3], without affecting the desired duality result (cf. Remark 3.3). This LP formulation we present is one instance of a general class of formulations discussed in the prior work [18, Sect. 4]; however, for the sake of completeness, we will give a detailed account of it using the terminologies introduced in Section 2.2.
Regarding notations, in what follows, $\mathbb{R}_+$ denotes the set of nonnegative numbers. For $X = \mathbb{X}$ or $\Gamma$, $\mathbb{M}(X)$ denotes the space of finite signed Borel measures on $X$, and $\mathbb{F}(X)$ the set of real-valued Borel measurable functions on $X$. We write $\mathbb{M}^+(X)$ or $\mathbb{F}^+(X)$ for the subset of those nonnegative elements in $\mathbb{M}(X)$ or $\mathbb{F}(X)$, and we will use similar notations for the subspaces of $\mathbb{M}(X)$ or $\mathbb{F}(X)$.
For the one-stage cost function $c$, we will also need to work with its restriction to the set $\Gamma$ of state and admissible action pairs (on which $c$ is finite as we recall). For notational simplicity, we shall use the same notation $c$ or $c(\cdot)$ for the restriction of $c$ to $\Gamma$, and the context will make it clear which function is involved in the discussion. Likewise, for a Borel measure $\gamma$ on $\Gamma$, sometimes we will also need to work with its extension to the whole state-action space $\mathbb{X} \times \mathbb{A}$, which is simply a Borel measure concentrated on $\Gamma$, and conversely, if $\gamma$ is a Borel measure on $\mathbb{X} \times \mathbb{A}$ concentrated on $\Gamma$, sometimes we will need to consider its restriction to $\Gamma$. In such cases, for notational simplicity, we will use the same notation for both measures.
3.1 Primal and Dual Linear Programs
For a Borel measure $\gamma$ on $\Gamma$, let $\hat{\gamma}$ denote the marginal of $\gamma$ on $\mathbb{X}$. To define minimization problems on stationary pairs in an MDP, let us first explain a well-known (many-to-one) correspondence between a stationary pair $(\mu, p) \in \Delta_s$ and a Borel probability measure $\gamma$ on $\Gamma$ that satisfies
$$\hat{\gamma}(B) = \int_\Gamma q(B \mid x, a)\, \gamma(d(x,a)), \qquad \forall\, B \in \mathcal{B}(\mathbb{X}). \tag{3.1}$$
The correspondence is essentially given by
$$\gamma(d(x,a)) = \mu(da \mid x)\, p(dx) \tag{3.2}$$
and has the property that
$$J(\mu, p) = \int_\Gamma c \, d\gamma. \tag{3.3}$$
Indeed, for $(\mu, p) \in \Delta_s$, as $p$ is an invariant probability measure on $\mathbb{X}$ induced by $\mu$, we have
$$p(B) = \int_{\mathbb{X}} \int_{\mathbb{A}} q(B \mid x, a)\, \mu(da \mid x)\, p(dx), \qquad \forall\, B \in \mathcal{B}(\mathbb{X}). \tag{3.4}$$
This is the same as (3.1) for the probability measure $\gamma$ given by (3.2), since the marginal of $\gamma$ is $p$ and $\mu$ obeys the control constraint of the MDP. The equality (3.3) follows from the definition of the average cost and the stationarity of the Markov chain under $\mu$ when the initial distribution is $p$. Conversely, given a probability measure $\gamma$ satisfying (3.1), by [3, Cor. 7.27.2], we can decompose $\gamma$ as in (3.2) with $p = \hat{\gamma}$ and $\mu$ being a Borel measurable stochastic kernel on $\mathbb{A}$ given $\mathbb{X}$ that obeys the control constraint of the MDP. Then, since $\gamma$ satisfies (3.1), the pair $(\mu, p)$ with $p = \hat{\gamma}$ satisfies (3.4), which means that $p$ is invariant for the Markov chain induced by $\mu$ and hence $(\mu, p)$ is a stationary pair. The policy $\mu$ here is in general not unique; however, by stationarity, every $\mu$ from this decomposition of $\gamma$ has the same average cost (3.3).
Due to this correspondence between $(\mu, p)$ and $\gamma$, finding a stationary minimum pair can be expressed as an optimization problem in which one minimizes $\int c \, d\gamma$ over the set of probability measures $\gamma$ that satisfy (3.1) (a.k.a. the set of “ergodic occupation measures” [7]).
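The correspondence between stationary pairs and occupation measures can be checked numerically in the finite case. The sketch below is only an illustration under assumptions of our own: a hypothetical two-state, two-action MDP (`q`, `c`), a randomized stationary policy `mu`, and power iteration to obtain an invariant distribution `p`. It builds the measure with mass `p[x] * mu[x][a]` on each pair, verifies that its state marginal equals the one-step flow through the kernel, and recovers the policy by dividing by the marginal.

```python
# Hypothetical finite MDP (all numbers are illustrative, not from the paper).
q = [[[0.9, 0.1], [0.2, 0.8]],   # q[x][a][y]: probability of moving from x to y under a
     [[0.5, 0.5], [0.1, 0.9]]]
c = [[1.0, 3.0], [4.0, 2.0]]     # c[x][a]: one-stage cost
mu = [[0.5, 0.5], [1.0, 0.0]]    # mu[x][a]: randomized stationary policy
S, A = range(2), range(2)

# Markov chain induced on the state space by the stationary policy mu.
P = [[sum(mu[x][a] * q[x][a][y] for a in A) for y in S] for x in S]

# Invariant distribution p via power iteration (fine for this aperiodic chain).
p = [0.5, 0.5]
for _ in range(200):
    p = [sum(p[x] * P[x][y] for x in S) for y in S]

# Occupation measure gamma(x, a) = mu(a | x) p(x).
gamma = [[p[x] * mu[x][a] for a in A] for x in S]

# State marginal of gamma, and the mass that the kernel sends into each state.
marginal = [sum(gamma[x]) for x in S]
flow = [sum(q[x][a][y] * gamma[x][a] for x in S for a in A) for y in S]
residual = max(abs(marginal[y] - flow[y]) for y in S)

# Average cost of the stationary pair as the integral of c against gamma.
avg_cost = sum(gamma[x][a] * c[x][a] for x in S for a in A)

# Converse direction: recover a policy from gamma by disintegration.
mu_back = [[gamma[x][a] / marginal[x] for a in A] for x in S]
```

Here `residual` is zero up to rounding, which is the finite-state form of the invariance constraint, and `mu_back` reproduces `mu` because every state has positive invariant mass in this example.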
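Before expressing this optimization problem as a linear program, we also need to restrict attention to those stationary pairs that have finite average costs, so that $+\infty$ does not appear in the objective function and the constraints. The following definitions are introduced for this purpose. Consider a positive weight function $w$,
$$w(x, a) := 1 + c(x, a), \qquad (x, a) \in \Gamma.$$
Let $\mathbb{M}_w(\Gamma)$ be the set of finite, signed Borel measures on $\Gamma$ w.r.t. which the function $w$ is integrable:
$$\mathbb{M}_w(\Gamma) := \Big\{ \gamma \in \mathbb{M}(\Gamma) \,\Big|\, \int_\Gamma w \, d|\gamma| < \infty \Big\},$$
where $|\gamma|$ denotes the total variation of $\gamma$. Let $\mathbb{F}_w(\Gamma)$ be the set of Borel measurable functions $\phi$ on $\Gamma$ such that
$$\sup_{(x,a) \in \Gamma} \frac{|\phi(x,a)|}{w(x,a)} < \infty.$$
Then every $\phi \in \mathbb{F}_w(\Gamma)$ is integrable w.r.t. all $\gamma \in \mathbb{M}_w(\Gamma)$. By (3.3) and the definition of $w$, if a stationary pair has finite average cost, then the corresponding probability measure $\gamma \in \mathbb{M}_w(\Gamma)$.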
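We are now ready to define the primal and dual linear programs for the average-cost MDP. Let us specialize the programs (P) and (P$^*$) defined in Section 2.2, by identifying the objects involved in those programs as follows:
- The dual pair $(X, Y) = \big( \mathbb{M}_w(\Gamma), \mathbb{F}_w(\Gamma) \big)$, with the bilinear form
$$\langle \gamma, \phi \rangle := \int_\Gamma \phi \, d\gamma.$$
- The dual pair $(Z, W) = \big( \mathbb{R} \times \mathbb{M}(\mathbb{X}),\ \mathbb{R} \times \mathbb{F}_b(\mathbb{X}) \big)$, where $\mathbb{M}(\mathbb{X})$ is the set of finite signed Borel measures on $\mathbb{X}$ as defined earlier, $\mathbb{F}_b(\mathbb{X})$ is the set of bounded Borel measurable functions on $\mathbb{X}$, and the bilinear form on $\big( \mathbb{R} \times \mathbb{M}(\mathbb{X}) \big) \times \big( \mathbb{R} \times \mathbb{F}_b(\mathbb{X}) \big)$ is defined as
$$\big\langle (r, \nu), (\rho, h) \big\rangle := r \rho + \int_{\mathbb{X}} h \, d\nu.$$
- The convex cone $K = \mathbb{M}_w^+(\Gamma)$, the subset of nonnegative measures in $\mathbb{M}_w(\Gamma)$. The dual cone of $K$ is $K^* = \mathbb{F}_w^+(\Gamma)$, the subset of nonnegative functions in $\mathbb{F}_w(\Gamma)$.
- The objective function of the primal program (P) is $\langle \gamma, c \rangle$, and the feasible set of (P) is defined by the following constraints:
$$\gamma(\Gamma) = 1, \qquad \hat{\gamma}(B) - \int_\Gamma q(B \mid x, a)\, \gamma(d(x,a)) = 0 \ \ \forall\, B \in \mathcal{B}(\mathbb{X}), \qquad \gamma \in \mathbb{M}_w^+(\Gamma), \tag{3.5}$$
where $\hat{\gamma}$ is the marginal of $\gamma$ on $\mathbb{X}$, as we recall. In other words, in accordance with the earlier discussion, the feasible solutions of (P) correspond to those stationary pairs with finite average costs, and the objective is to minimize the average cost over them. In the form of (P) discussed in Section 2.2, the two equality constraints in (3.5) can be written as
$$L \gamma = (1, \mathit{0}).$$
Here $\mathit{0}$ is the trivial measure on $\mathbb{X}$ (i.e., $\mathit{0}(B) = 0$ for all $B \in \mathcal{B}(\mathbb{X})$), and the linear mapping $L : \mathbb{M}_w(\Gamma) \to \mathbb{R} \times \mathbb{M}(\mathbb{X})$ is defined as $L \gamma := (L_0 \gamma, L_1 \gamma)$ with $L_0 \gamma := \gamma(\Gamma)$ and, for $B \in \mathcal{B}(\mathbb{X})$,
$$(L_1 \gamma)(B) := \hat{\gamma}(B) - \int_\Gamma q(B \mid x, a)\, \gamma(d(x,a)).$$
- From the identity $\big\langle \gamma, L^*(\rho, h) \big\rangle = \big\langle L \gamma, (\rho, h) \big\rangle$, the adjoint $L^*$ of $L$ is given by the linear mapping that maps each $(\rho, h) \in \mathbb{R} \times \mathbb{F}_b(\mathbb{X})$ to the function
$$L^*(\rho, h)(x, a) = \rho + h(x) - \int_{\mathbb{X}} h(y)\, q(dy \mid x, a), \qquad (x, a) \in \Gamma.$$
Since $L^*\big( \mathbb{R} \times \mathbb{F}_b(\mathbb{X}) \big) \subset \mathbb{F}_w(\Gamma)$, both $L$ and $L^*$ are weakly continuous ([36, Chap. II, Prop. 12 and its corollary]; see also Prop. 2.1). The inequality constraint in the program (P$^*$) is
$$c - L^*(\rho, h) \in \mathbb{F}_w^+(\Gamma).$$
We can write this constraint as $L^*(\rho, h) \leq c$ or more explicitly, as
$$\rho + h(x) - \int_{\mathbb{X}} h(y)\, q(dy \mid x, a) \leq c(x, a), \qquad \forall\, (x, a) \in \Gamma.$$
The objective function of the dual program (P$^*$) is $\big\langle b, (\rho, h) \big\rangle = \big\langle (1, \mathit{0}), (\rho, h) \big\rangle = \rho$.
Expressed in the form introduced in Section 2.2, the primal and dual linear programs for the average-cost MDP are:
$$(\mathrm{P}) \quad \text{minimize } \langle \gamma, c \rangle \quad \text{subject to } L \gamma = (1, \mathit{0}),\ \gamma \in \mathbb{M}_w^+(\Gamma); \tag{3.10}$$
and
$$(\mathrm{P}^*) \quad \text{maximize } \rho \quad \text{subject to } L^*(\rho, h) \leq c,\ \rho \in \mathbb{R},\ h \in \mathbb{F}_b(\mathbb{X}). \tag{3.11}$$
As mentioned earlier, our formulation of (P$^*$) is different from the one given in the book [20, Chap. 12.3]. We will explain the difference and the reason for it in detail in the next subsection (see Remark 3.3).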
A few properties of (P) and () are easy to see. From the relation between stationary pairs and feasible solutions of the primal program (P), it is clear that under Assumption 2.1, the existence of a stationary minimum pair (cf. Theorem 2.1(ii)) ensures that (P) is both consistent and solvable. Moreover, the proof of Theorem 2.1(ii) (cf. [40]) shows that due to the strict unbounedness of the one-stage costs, if is a sequence of feasible solutions of (P) with (such a sequence is called a minimizing sequence of (P)), then any subsequence of contains a further subsequence that converges to an optimal solution of (P) in the topology of weak convergence (of probability measures). The consistency of the dual program () is trivial: since , a feasible solution is given by and . We then have 0\leq\sup(\text{\text{P}^{*}})\leq\inf(\text{P})=\rho^{*} under Assumption 2.1.
Next, we will address the duality between (P) and (). We will also examine a connection between () and the ACOE for the MDP, through a maximizing sequence of (). Such a sequence is defined as a sequence of feasible solutions of () with the property that \rho_{n}\uparrow\sup(\text{\text{P}^{*}}).
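As a sanity check on the relation 0\leq\sup(\text{P}^{*})\leq\inf(\text{P}), the following minimal sketch works through a hypothetical two-state, two-action MDP. All transition probabilities and costs are made up for illustration and do not come from the paper, and the finite setting is far simpler than the Borel setting treated here. The sketch assumes the finite specialization of the two programs: (P) minimizes the average cost over stationary pairs, and a pair (\rho,h) is feasible for (P*) when \rho+h(x)-\sum_{y}q(y\mid x,a)h(y)\leq c(x,a) for all (x,a). Since every deterministic policy in this toy model induces an irreducible chain, the primal value is found by enumerating those policies.

```python
from fractions import Fraction as F
from itertools import product

STATES = (0, 1)
ACTIONS = (0, 1)

# Hypothetical model data (illustration only, not from the paper).
# q[(x, a)] = (P(next state = 0), P(next state = 1)); c[(x, a)] = one-stage cost.
q = {
    (0, 0): (F(9, 10), F(1, 10)),
    (0, 1): (F(1, 2), F(1, 2)),
    (1, 0): (F(1, 5), F(4, 5)),
    (1, 1): (F(3, 5), F(2, 5)),
}
c = {(0, 0): F(1), (0, 1): F(3), (1, 0): F(4), (1, 1): F(6)}


def stationary_distribution(policy):
    """Invariant distribution of the two-state chain induced by a deterministic policy."""
    a = q[(0, policy[0])][1]  # P(0 -> 1)
    b = q[(1, policy[1])][0]  # P(1 -> 0)
    return (b / (a + b), a / (a + b))


def average_cost(policy):
    """Average cost of the stationary pair (policy, invariant distribution)."""
    p = stationary_distribution(policy)
    return sum(p[x] * c[(x, policy[x])] for x in STATES)


def primal_value():
    """Minimum average cost over deterministic stationary policies."""
    best = min(product(ACTIONS, repeat=len(STATES)), key=average_cost)
    return average_cost(best), best


def dual_feasible(rho, h):
    """Check rho + h(x) - sum_y q(y|x,a) h(y) <= c(x,a) for every (x, a)."""
    return all(
        rho + h[x] - sum(q[(x, a)][y] * h[y] for y in STATES) <= c[(x, a)]
        for x in STATES
        for a in ACTIONS
    )


rho_star, best_policy = primal_value()

# Three dual-feasible candidates: the trivial one, a constant-h one, and one
# whose objective value rho reaches the primal value (no gap in this toy model).
candidates = [
    (F(0), (F(0), F(0))),
    (F(1), (F(0), F(0))),
    (F(12, 7), (F(0), F(50, 7))),
]
for rho, h in candidates:
    assert dual_feasible(rho, h) and rho <= rho_star
```

Exact rational arithmetic is used so that the inequalities are checked without rounding error. The last candidate attains the primal value, which is the finite-model analogue of the absence of a duality gap.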
3.2 Optimality Results and Discussion
Our main result of this section is the absence of a duality gap stated in part (ii) of the following theorem. It can be compared with the prior result of [20, Chap. 12.3, Thm. 12.3.4] for average-cost lower-semicontinuous MDPs. In our case, without lower-semicontinuity model assumptions, we will use Lusin’s theorem together with the majorization property in Assumption 2.1(M) to prove it.
Theorem 3.1 (consistency and absence of a duality gap).
Under Assumption 2.1, the linear programs (P) and (P*) in (3.10)-(3.11) satisfy the following:
- (i) (P) is consistent and solvable, and (P*) is consistent.
- (ii) There is no duality gap: \inf(\text{P})=\sup(\text{P}^{*})=\rho^{*}.
Remark 3.1 (about the proof of Theorem 3.1).
Besides the differences in assumptions as mentioned above, another difference between our proof of the absence of a duality gap and the proof given in the prior work [20, Chap. 12.3C] is the following. The approach of the latter proof is to show that the set H defined by (2.4) is weakly closed (i.e., H=\overline{H}). This is a sufficient condition for the absence of a duality gap, but it requires one to show that every point of \overline{H} is in H. Our proof uses the duality between the subvalue of (P) and the value of (P*) asserted in [2, Thm. 3.3] (cf. Theorem 2.2). With this it suffices to show that a single point of \overline{H}, namely, the point \big{(}b,\underline{\rho}\big{)}=\big{(}(1,{\it 0}),\underline{\rho}\big{)}, is in H. Thus our proof is simpler in this respect.
We can also prove that H is weakly closed under our assumptions. This requires some minor changes in the proof arguments used in [40], which we will also use to prove Theorem 3.1 (in particular, we only need to change slightly the finite measures used when applying Lusin’s theorem). Nonetheless, it will take some space to explain the details of those changes, and this is another reason that we choose to use the duality theorem [2, Thm. 3.3] instead in our proof. ∎
Remark 3.2 (comparison with a duality result in [38]).
For compact Euclidean state and action spaces, Yamada proved an LP duality result [38, Thm. 3] under certain continuity and ergodicity conditions on the MDP. His continuity conditions are different from the lower-semicontinuous model assumption we mentioned, but they can be related to our model assumptions. So let us explain in more detail how our assumptions and duality result compare with his. Among others, Yamada assumed that is continuous in for each fixed , and has a density w.r.t. the Lebesgue measure, where is continuous in for each fixed [38, Condition (A2)]. In our case, since the action space has the discrete topology, trivially, and are continuous in for each fixed , so there are similarities to Yamada’s conditions. Our majorization condition (M) is, however, entirely different from Yamada’s geometric ergodicity condition [38, Conditions (A1), (A4)], in which he required the density function to be bounded away from zero uniformly for all . Using this condition together with the continuity and other assumptions, he proved the absence of a duality gap [38, Thm. 3]. Both his conditions and his proof arguments are very different from ours. ∎
Remark 3.3 (about the formulation of (P*) and its solvability).
In defining (P*), we have chosen the space \mathbb{F}_{b}(\mathbb{X}) of bounded Borel measurable functions to form the dual pair with the space \mathbb{M}(\mathbb{X}) of finite Borel measures. With this choice, (P*) is in general not solvable (i.e., an optimal solution may not exist), since the inequality
[TABLE]
need not admit a bounded solution h.
As mentioned earlier, our LP formulation is only an instance of the class of formulations discussed in [18, Sect. 4]. A different dual program (P*) is studied in [20, Chap. 12.3]. It involves, instead of \big{(}\mathbb{R}\times\mathbb{M}(\mathbb{X}),\mathbb{R}\times\mathbb{F}_{b}(\mathbb{X})\big{)}, the dual pair \big{(}\mathbb{R}\times\mathbb{M}_{w_{0}}(\mathbb{X}),\mathbb{R}\times\mathbb{F}_{w_{0}}(\mathbb{X})\big{)}, where the two spaces \mathbb{M}_{w_{0}}(\mathbb{X}) and \mathbb{F}_{w_{0}}(\mathbb{X}) are defined similarly to \mathbb{M}_{w}(\Gamma) and \mathbb{F}_{w}(\Gamma), respectively: with , ,
[TABLE]
This choice leaves more room for (P*) to admit an optimal solution. However, a disadvantage is that to ensure the weak continuity of the linear mapping L, an additional condition on the state transition stochastic kernel is required (cf. [20, Chap. 12.3A, Assumption 12.3.1]): for some constant ,
[TABLE]
Yet, since the costs are strictly unbounded, this condition (3.12) is neither needed for the existence of a minimum pair, nor needed for the absence of a duality gap between (P) and (P*).
Also, the use of the dual pair \big{(}\mathbb{R}\times\mathbb{M}_{w_{0}}(\mathbb{X}),\mathbb{R}\times\mathbb{F}_{w_{0}}(\mathbb{X})\big{)} alone cannot guarantee that (P*) has an optimal solution, for which one would still need to make additional assumptions about the functions in a maximizing sequence for (P*) (cf. [20, Chap. 12.4B, Thm. 12.4.2]). This makes it less appealing to us to have the dual pair \big{(}\mathbb{R}\times\mathbb{M}_{w_{0}}(\mathbb{X}),\mathbb{R}\times\mathbb{F}_{w_{0}}(\mathbb{X})\big{)} with its extra condition (3.12) in the LP formulation.
For these reasons, we have formulated (P*) differently. Accordingly, we treat the result on ACOE given in the next proposition not as the property of a dual optimal solution, which may not exist, but as a potential consequence of the results from the LP approach. ∎
As just noted, the dual program (P*) in our formulation need not admit an optimal solution. However, because there is no duality gap, one can still obtain a version of ACOE for the MDP from a maximizing sequence \{(\rho_{n},h_{n})\} of (P*), under certain conditions on \{h_{n}\}, using essentially the same arguments as those for [20, Chap. 12.4B, Thm. 12.4.2(c)]. We include the result in the proposition below, for the sake of completeness. The first part of its condition is satisfied under Assumption 2.1 (Theorem 3.1); the second part of its condition specifies the additional conditions on \{h_{n}\} we need. The ACOE (3.14) in the conclusion holds for “almost all” (a.a.) states and in general, it need not hold for all states (see e.g., [40, Example 3.1]).
Proposition 3.1 (ACOE for -a.a. states).
Let \{(\rho_{n},h_{n})\} be a maximizing sequence of the dual program (P*), and let . Suppose that:
- (i) a stationary minimum pair exists and \inf(\text{P})=\sup(\text{P}^{*})=\rho^{*}<+\infty;
- (ii) the functions h_{n} satisfy that
[TABLE]
Then is finite everywhere,
[TABLE]
and for -a.a. ,
[TABLE]
Remark 3.4.
We discuss briefly a relation between the above ACOE and nonrandomized stationary optimal policies for the average-cost MDP. Firstly, one can find a subset with and a Borel measurable function with for all , such that is absorbing w.r.t. and attains the minimum in the ACOE (3.14) on :
[TABLE]
More specifically, to find such and , consider the set with on which (3.14)-(3.15) hold, and the Markov chain induced by the policy and the initial distribution . Since is an invariant probability measure of this Markov chain, one can construct a set with that is absorbing under (see the proof of [22, Lem. 2.2.3(c)] or [33, Prop. 4.2.3(ii)]). Next, based on the relations (3.14)-(3.15) on , the desired function can be found: this can be done either directly in the special case of a countable action space we have here, or, more generally, by using the Blackwell and Ryll-Nardzewski selection theorem [5, Thm. 2] as discussed in [18, Remark 4.6].
Secondly, for and satisfying (3.16), one can apply standard arguments to show that under certain conditions, the nonrandomized stationary policy is average-cost optimal for all initial states . In particular, if , it is straightforward to show that is optimal on . In more general cases of , the optimality of on can be established by imposing further conditions to ensure that for all , \mathbb{E}_{x}^{f}\big{[}|h^{*}(x_{n})|\big{]}<\infty for and \liminf_{n\to\infty}n^{-1}\mathbb{E}_{x}^{f}\big{[}h^{*}(x_{n})\big{]}\geq 0. (For derivation details, see e.g., the related discussions in [18, Sect. 3] and [19, Chap. 5.2] on canonical triplets.) ∎
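To make the link between the ACOE and a nonrandomized stationary policy concrete, here is a minimal sketch on a hypothetical two-state, two-action MDP (all numbers are invented for illustration; nothing here is taken from the paper). It uses policy iteration, a standard finite-model technique that is not the method of this paper, to produce a triple (\rho,h,f), and then checks that \rho+h(x)=\min_{a}\{c(x,a)+\sum_{y}q(y\mid x,a)h(y)\} with the minimum attained at a=f(x). In this finite irreducible example the equation holds at every state, whereas the proposition above only asserts it for almost all states.

```python
from fractions import Fraction as F

STATES = (0, 1)
ACTIONS = (0, 1)

# Hypothetical model data (illustration only, not from the paper).
q = {
    (0, 0): (F(9, 10), F(1, 10)),
    (0, 1): (F(1, 2), F(1, 2)),
    (1, 0): (F(1, 5), F(4, 5)),
    (1, 1): (F(3, 5), F(2, 5)),
}
c = {(0, 0): F(1), (0, 1): F(3), (1, 0): F(4), (1, 1): F(6)}


def q_value(x, a, h):
    """Right-hand side of the ACOE for a fixed action: c(x,a) + sum_y q(y|x,a) h(y)."""
    return c[(x, a)] + sum(q[(x, a)][y] * h[y] for y in STATES)


def evaluate(policy):
    """Solve rho + h(x) = c(x,f(x)) + sum_y q(y|x,f(x)) h(y), normalized by h(0) = 0."""
    a = q[(0, policy[0])][1]  # P(0 -> 1) under the policy
    b = q[(1, policy[1])][0]  # P(1 -> 0) under the policy
    c0, c1 = c[(0, policy[0])], c[(1, policy[1])]
    h1 = (c1 - c0) / (a + b)
    return c0 + a * h1, (F(0), h1)


def policy_iteration(policy=(0, 0)):
    """Alternate evaluation and greedy improvement until the policy is stable.

    Ties are broken by taking the first minimizing action; the toy data has no ties.
    """
    while True:
        rho, h = evaluate(policy)
        improved = tuple(
            min(ACTIONS, key=lambda a: q_value(x, a, h)) for x in STATES
        )
        if improved == policy:
            return rho, h, policy
        policy = improved


rho, h, f = policy_iteration()

# ACOE check: equality at every state, with the minimum attained by f.
for x in STATES:
    assert rho + h[x] == min(q_value(x, a, h) for a in ACTIONS)
    assert rho + h[x] == q_value(x, f[x], h)
```

The selector f plays the role of the measurable minimizer discussed in the remark; with finitely many actions it can be read off directly, with no selection theorem needed.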
4 Extension to Constrained Average-Cost MDPs
In this section, we extend our results for an unconstrained average-cost MDP to a constrained one. Let the state and action spaces and the state transition stochastic kernel of the MDP be the same as before. Consider multiple one-stage cost functions on : . We assume that these functions are nonnegative and Borel measurable, finite on , and taking the value outside . The goal is to minimize the average cost w.r.t. , while keeping the average costs w.r.t. within given limits.
More specifically, let be prescribed upper limits on the average costs in the constraints. For a policy and initial distribution , let denote the average cost of this pair w.r.t. , . Define the feasible set of policy and initial distribution pairs by
[TABLE]
Define the optimal average cost of this constrained problem to be
[TABLE]
As before, within the feasible set , we are especially interested in those stationary pairs. Analogous to the minimum pairs and stationary minimum pairs for an unconstrained MDP, let us define optimal pairs and stationary optimal pairs for the constrained MDP. (What we call optimal pairs are called “constrained optimal pairs” in the prior work [30].)
Definition 4.1 (optimal pairs).
- (a) We call an optimal pair for the constrained MDP if
[TABLE]
- (b) We call an optimal pair lexicographically optimal if for each , either for all , or for some ,
[TABLE]
Definition 4.2 (stationary optimal pairs).
If a stationary pair is (lexicographically) optimal for the constrained MDP, we call it a stationary (lexicographically) optimal pair.
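The ordering behind lexicographic optimality can be illustrated with a small sketch. It assumes the usual reading of the definition: among the optimal pairs, one compares the vectors of average costs component by component, and the first component in which two vectors differ decides. The function name and the cost vectors below are hypothetical, and each pair is represented only by its tuple of average costs (J_0, J_1, ..., J_d).

```python
def lexicographically_optimal(cost_vectors, limits):
    """Pick the lexicographically smallest cost vector among the optimal ones.

    cost_vectors: list of tuples (J0, J1, ..., Jd), one per candidate pair.
    limits: tuple (kappa_1, ..., kappa_d) of upper limits for J1, ..., Jd.
    Returns None if no candidate is feasible.
    """
    feasible = [
        v for v in cost_vectors
        if all(v[i + 1] <= limits[i] for i in range(len(limits)))
    ]
    if not feasible:
        return None
    best_j0 = min(v[0] for v in feasible)           # optimal constrained cost
    optimal = [v for v in feasible if v[0] == best_j0]
    return min(optimal)                             # tuples compare lexicographically


# Hypothetical average-cost vectors of five candidate pairs, with d = 2 constraints.
candidates = [(1, 3, 2), (1, 2, 5), (1, 2, 4), (0, 9, 0), (2, 0, 0)]
choice = lexicographically_optimal(candidates, limits=(4, 6))
```

Here the vector (0, 9, 0) is excluded because it violates the first limit, and the remaining optimal vectors are tied on J_0, so the tie is broken by J_1 and then by J_2.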
In what follows, we first adapt the strict unboundedness condition (SU) and the majorization condition (M) to accommodate multiple one-stage cost functions in the constrained MDP, and under those modified conditions we show that stationary optimal pairs exist (Section 4.1). We then formulate primal/dual linear programs for the constrained MDP and present duality results that are analogous to the ones for unconstrained problems (Section 4.2). The proofs of the theorems of this section are collected in Section 5.2.
4.1 Model Assumptions and Existence of Stationary Optimal Pairs
We impose the following conditions on the constrained MDP model:
Assumption 4.1.
- (G) The feasible set is nonempty.
- (SU) There exists a nondecreasing sequence of compact sets such that for some ,
[TABLE]
- (M) For each compact set , there exist an open set , a closed set , and a finite measure on (all of which can depend on ) such that
[TABLE]
where the closed set (possibly empty) is such that restricted to , the state transition stochastic kernel is continuous and all the one-stage cost functions , , are lower semicontinuous.
This assumption is similar to Assumption 2.1 for the unconstrained problem. Condition (G) is to exclude vacuous problems. Condition (SU) is the same as that considered in [17] for the constrained MDP, and it differs from Assumption 2.1(SU) in that here we require some one-stage cost function in the constrained problem to be strictly unbounded. Condition (M) is almost identical to Assumption 2.1(M) except that here the closed set must be such that on it, every one-stage cost function in the constrained problem is lower semicontinuous in the state variable. As before, having a nonempty set in the majorization condition (M) sharpens this condition by allowing us to treat a “continuous” part of the model separately from the rest.
Theorem 4.1 below extends our earlier results for MDPs [40, Prop. 3.2, Thm. 3.3] (cf. Theorem 2.1) to constrained MDPs. In particular, its part (i) can be compared with Theorem 2.1(i), and its parts (ii)-(iii) with Theorem 2.1(ii). The proof will only be outlined in Section 5.2, as it is mostly based on the arguments given in [40]—roughly speaking, the present majorization condition allows us to apply the reasoning in [40] to every one-stage cost function in the constrained MDP.
Parts (i)-(ii) of this theorem are also comparable with the results of [17, Thm. 3.2] and [30, the solvability part of Lem. 2.3] for constrained lower-semicontinuous MDPs. Part (iii) concerns lexicographically optimal solutions of the constrained MDP, which can be related to solutions for multi-objective MDPs similar to those discussed in [23].
Theorem 4.1 (optimality of stationary pairs).
Under Assumption 4.1, the following hold:
- (i) For any pair , there exists a stationary pair with
[TABLE]
- (ii) There exists a stationary optimal pair .
- (iii) There exists a stationary lexicographically optimal pair .
Remark 4.1.
It is known that even in a finite-state-and-action MDP, for a given initial state or distribution, there need not exist a stationary optimal policy for the constrained average cost problem. See [25, Sect. 4, p. 284] for an interesting counterexample (involving a multichain MDP) that is due to Derman [10]. The difference between this known fact and the existence of a stationary optimal pair in Theorem 4.1 is that in the constrained MDP here, the initial distribution is not given and there is freedom of choosing it to optimize the average costs. ∎
Remark 4.2* (pathwise average costs of ).*
Suppose that in part (ii) or (iii) of Theorem 4.1, the policy induces on a positive Harris recurrent Markov chain (see e.g., [33, Chap. 10.1] for definition). Then, by the ergodic properties of such Markov chains and by the same proof of [40, Thm. 3.5(b)], we have that for all initial distributions , -almost surely,
[TABLE]
In other words, almost surely, on each sample path, the pathwise average costs of the policy w.r.t. , , are also within the prescribed limits , while its pathwise average cost w.r.t. equals as well. ∎
4.2 Linear Programming Formulation and Optimality Results
Similarly to the unconstrained case, for the constrained MDP, the primal linear program (P) is formulated to minimize the average cost over feasible stationary pairs, by utilizing the correspondence between a stationary pair and a probability measure that satisfies (3.1) discussed at the beginning of Section 3.2. Under Assumption 4.1, the existence of a stationary optimal pair given by Theorem 4.1 ensures that such a pair can be obtained by solving the primal program (P). The dual linear program (P*) is, as before, determined by (P) and two dual pairs of vector spaces we choose.
We now define precisely (P) and (P*) for the constrained MDP, by identifying the spaces and linear mappings involved in the general LP formulation given in Section 2.2. To define the primal linear program (P), we consider the dual pair of vector spaces
[TABLE]
where the weight function is given by
[TABLE]
The bilinear form associated with this dual pair is defined as the sum of the bilinear forms associated with the two dual pairs, \big{(}\mathbb{M}_{w}(\Gamma),\,\mathbb{F}_{w}(\Gamma)\big{)} and ; i.e.,
[TABLE]
for , , and (with denoting their th components).
The feasible set of (P) corresponds to the subset of stationary pairs that are feasible for the constrained MDP, and it is defined by the following constraints:
[TABLE]
and
[TABLE]
Note that if is a probability measure associated with some stationary pair via (3.2), then is feasible for (P); in particular, , so . The objective of (P) is to minimize the average cost . We can state the primal program (P) in the form introduced in Section 2.2 as follows:
[TABLE]
where the linear mapping is given by with
[TABLE]
for and .
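To define the dual linear program (P*), we consider the dual pair of vector spaces
[TABLE]
with the bilinear form defined as the sum of the bilinear forms for the three dual pairs, (\mathbb{R},\mathbb{R}), \big{(}\mathbb{M}(\mathbb{X}),\,\mathbb{F}_{b}(\mathbb{X})\big{)}, and (\mathbb{R}^{d},\mathbb{R}^{d}), similar to (4.2). From the definition of L, the adjoint mapping L^{*} can be identified: it is the linear mapping on \mathbb{R}\times\mathbb{F}_{b}(\mathbb{X})\times\mathbb{R}^{d} given by
[TABLE]
for (\rho,h,\beta)\in\mathbb{R}\times\mathbb{F}_{b}(\mathbb{X})\times\mathbb{R}^{d}. Clearly, L^{*}\big{(}\mathbb{R}\times\mathbb{F}_{b}(\mathbb{X})\times\mathbb{R}^{d}\big{)}\subset\mathbb{F}_{w}(\Gamma)\times\mathbb{R}^{d}, so both linear mappings L and L^{*} are weakly continuous ([36, Chap. II, Prop. 12 and its corollary]; cf. Prop. 2.1). The objective function of (P*) is
[TABLE]
Let us now state the dual program (P*) in the form introduced in Section 2.2: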
[TABLE]
Note that the inequality constraint in (4.9) is the same as the cone constraint -L^{*}(\rho,h,\beta)+\big{(}c_{0},\,0\big{)}\in\mathbb{F}^{+}_{w}(\Gamma)\times\mathbb{R}_{+}^{d} (cf. Section 2.2), and it can be expressed more explicitly as
[TABLE]
The next theorem about the primal/dual programs (P) and (P*) is an extension of Theorem 3.1 to the constrained MDP. The solvability of (P) is a consequence of the existence of a stationary optimal pair given in Theorem 4.1(ii). Moreover, the proof of Theorem 4.1(ii) also shows that any minimizing sequence of (P) has a subsequence that converges, in the topology of weak convergence of probability measures, to an optimal solution of (P).
The absence of a duality gap is the main result of this section. Its proof, outlined in Section 5.2, uses essentially the same proof arguments for Theorem 3.1(ii), which handle the discontinuous MDP models by making use of Lusin’s theorem together with the majorization property in Assumption 4.1(M).
Theorem 4.2 (consistency and absence of a duality gap).
Under Assumption 4.1, the following hold for the linear programs (P) and (P*) given in (4.3) and (4.9):
- (i) (P) is consistent and solvable, and (P*) is consistent.
- (ii) There is no duality gap: \inf(\text{P})=\sup(\text{P}^{*})=\rho^{*}_{c}.
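The statement can be checked by hand on a degenerate example. The sketch below uses a hypothetical MDP with a single state and two actions, one constraint (d = 1), and invented costs; none of this comes from the paper. It also assumes a particular finite-model reading of the dual program: maximize \rho+\kappa\beta subject to \beta\leq 0 and \rho+\beta c_{1}(a)\leq c_{0}(a) for each action a (with one state, the h-terms cancel). These sign conventions are an assumption made for the illustration.

```python
from fractions import Fraction as F

# Hypothetical one-state MDP with actions "A" and "B" (illustration only).
c0 = {"A": F(1), "B": F(0)}   # cost to be minimized
c1 = {"A": F(0), "B": F(1)}   # cost appearing in the constraint
kappa = F(1, 2)               # upper limit on the average c1-cost


def primal_optimum(grid=100):
    """Scan randomized stationary policies: play B with probability p = k/grid.

    With a single state, the occupation measure is gamma = (1 - p, p).
    Returns (optimal average c0-cost, optimal p) over the grid.
    """
    best = None
    for k in range(grid + 1):
        p = F(k, grid)
        gamma = {"A": 1 - p, "B": p}
        if sum(gamma[a] * c1[a] for a in gamma) <= kappa:      # feasibility
            value = sum(gamma[a] * c0[a] for a in gamma)
            if best is None or value < best[0]:
                best = (value, p)
    return best


def dual_feasible(rho, beta):
    """Assumed dual constraints: beta <= 0 and rho + beta * c1(a) <= c0(a) for all a."""
    return beta <= 0 and all(rho + beta * c1[a] <= c0[a] for a in c0)


def dual_value(rho, beta):
    """Assumed dual objective rho + kappa * beta."""
    return rho + kappa * beta


primal_value, p_opt = primal_optimum()
gap = primal_value - dual_value(F(1), F(-1))
```

In this example the optimal stationary policy must randomize between the two actions, and the dual pair (\rho,\beta)=(1,-1) is feasible with the same objective value, so the computed gap is zero.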
This theorem is comparable with the prior results [17, Thm. 4.4] and [30, Lem. 2.3] on the LP approach for constrained lower-semicontinuous MDPs ([30] considers compact spaces, and [17] non-compact spaces). Besides the differences in model assumptions, our formulation of the dual program (P*) also differs from that in [17]. The main difference lies in the choice of the spaces \mathbb{M}(\mathbb{X}) and \mathbb{F}_{b}(\mathbb{X}) for (P*). As in the unconstrained case, our motivation for this choice is to avoid an extra condition on the state transition stochastic kernel used in [17], which is the same condition (3.12) from [20, Chap. 12.3] that we discussed earlier in Remark 3.3. For the same reason as explained in Remark 3.3, the dual program (P*) as we formulated above need not admit an optimal solution.
For completeness, in the rest of this section, we discuss some solution properties of the dual program (P*) and derive a version of ACOE for the constrained MDP. Consider a maximizing sequence \{(\rho_{n},h_{n},\beta_{n})\} of (P*), i.e., feasible solutions of (P*) with \rho_{n}+\langle\kappa,\,\beta_{n}\rangle\uparrow\sup(\text{P}^{*}). We first examine the boundedness property of \{\beta_{n}\}. Denote by the th component of . Let us separate the constraints of the MDP into two categories:
[TABLE]
When , consists of all those such that w.r.t. , every feasible pair in has the same maximally allowed average cost .
Proposition 4.1.
Suppose Assumption 4.1 holds. Let \{(\rho_{n},h_{n},\beta_{n})\} be a maximizing sequence of the dual program (P*). Then the following hold:
- (i) The sequence is bounded for every .
- (ii) For , if for some stationary optimal pair of the constrained MDP.
- (iii) Suppose there exists such that
[TABLE]
Then the sequence \{\beta_{n}\} is bounded.
Remark 4.3.
An optimal solution of (P) corresponds to a stationary optimal pair with for (this follows from the correspondence relationship explained at the beginning of Section 3.2). So Prop. 4.1(ii) entails the complementarity relation for an optimal solution of (P), if we define as follows: if this limit exists, and assign an arbitrary number otherwise. Proposition 4.1(iii) gives a sufficient condition under which the -components of are also bounded—note that this condition involves non-feasible policy and initial distribution pairs and is different from the Slater condition , . One exceptional case where Prop. 4.1 is inapplicable is when . ∎
When \{\beta_{n}\} is bounded, as when the condition of Prop. 4.1(iii) holds, we can choose a subsequence of the maximizing sequence so that the corresponding subsequence of \{\beta_{n}\} converges. The subsequence is obviously also a maximizing sequence for (P*). Then, with additional assumptions on the functions h_{n}, we can derive an optimality equation for the constrained MDP that is analogous to the ACOE (3.14) in Prop. 3.1 for the unconstrained MDP. We state this result in the next proposition. It is comparable with the result of [17, Thm. 5.2(b)] for constrained lower-semicontinuous MDPs; in the latter reference, (4.14) is called the “constrained optimality equation.”
Proposition 4.2 (ACOE for -a.a. states in the constrained MDP).
Let \{(\rho_{n},h_{n},\beta_{n})\} be a maximizing sequence of the dual program (P*), and let . Suppose that:
- (i) a stationary optimal pair exists and \inf(\text{P})=\sup(\text{P}^{*})=\rho^{*}_{c}<+\infty;
- (ii) the functions h_{n} satisfy that
[TABLE]
- (iii) the sequence \{\beta_{n}\} converges to some finite .
Then is finite everywhere and with
[TABLE]
we have
[TABLE]
and for -a.a. ,
[TABLE]
5 Proofs
This section collects the proofs of the theorems given in Sections 3 and 4.
5.1 Proofs for Section 3
Let us first recall a few definitions and facts about probability measures on a metrizable space . Let denote the set of real-valued, bounded continuous functions on . By definition, a sequence of probability measures converges weakly to some , denoted , if for all . If is a family of probability measures in such that for any , there is a compact set with for all , we say that is tight.
By Prohorov’s theorem [4, Thm. 6.1], any sequence in a tight family has a further subsequence that converges weakly to a probability measure in . We will use this fact many times in our proofs, for some family that satisfies . By the strict unboundedness condition on given in Assumption 2.1(SU), such a family must be tight (as can be seen easily from condition (SU) and the definition of tightness).
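The tightness claim rests on a Markov-type inequality: if a probability measure has expected cost at most M, then the mass it places where the cost is at least n cannot exceed M/n. The minimal sketch below checks this bound on a hypothetical family of distributions over the nonnegative integers, with the identity as a stand-in for a strictly unbounded cost and the finite sets {x : cost(x) < n} playing the role of the compact sets in condition (SU). The family, the cost, and the truncation length are all invented for illustration.

```python
from fractions import Fraction as F


def geometric(r, n_terms=60):
    """Truncated, renormalized geometric pmf on {0, ..., n_terms - 1} with ratio r."""
    weights = [r ** k for k in range(n_terms)]
    total = sum(weights)
    return [w / total for w in weights]


def cost(x):
    """Stand-in for a strictly unbounded cost: it tends to infinity off finite sets."""
    return F(x)


def mean_cost(pmf):
    """Expected cost under the pmf."""
    return sum(p * cost(x) for x, p in enumerate(pmf))


def mass_outside(pmf, level):
    """Mass of {x : cost(x) >= level}, i.e. outside the 'compact' set {cost < level}."""
    return sum(p for x, p in enumerate(pmf) if cost(x) >= level)


# A family of distributions whose expected costs are uniformly bounded by M.
family = [geometric(F(k, 10)) for k in range(1, 8)]
M = max(mean_cost(pmf) for pmf in family)

# Uniform tail bound: every member puts mass at most M / level outside {cost < level}.
tail_bound_holds = all(
    mass_outside(pmf, level) <= M / level
    for pmf in family
    for level in (1, 5, 20, 50)
)
```

Because the bound M/level does not depend on the member of the family and tends to zero as the level grows, a single large finite set captures most of the mass of every member at once, which is the definition of tightness.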
5.1.1 Proof of Theorem 3.1
The consistency of (P) and (P*) and the solvability of (P) were already discussed in Section 3.1, where we also showed that under Assumption 2.1, 0\leq\sup(\text{P}^{*})\leq\inf(\text{P})=\rho^{*}.
We now prove that there is no duality gap between (P) and (P*). Our approach is to use [2, Thm. 3.3] (cf. Theorem 2.2 in Section 2.2), which asserts the equality between the subvalue of (P) and the value of (P*) when they are finite. Specifically, recall from Section 2.2 that the subvalue of (P) is defined as
[TABLE]
where the set H is given by
[TABLE]
and \overline{H} is the closure of H in the weak topology \sigma\big{(}\mathbb{R}\times\mathbb{M}(\mathbb{X})\times\mathbb{R},\,\mathbb{R}\times\mathbb{F}_{b}(\mathbb{X})\times\mathbb{R}\big{)}. Since (P) and (P*) are consistent, \sup(\text{P}^{*}) is finite and equals the subvalue \underline{\rho} by [2, Thm. 3.3] (cf. Theorem 2.2). So, to show \inf(\text{P})=\sup(\text{P}^{*}), we need to prove \inf(\text{P})=\underline{\rho}. In what follows, we will prove that
[TABLE]
by constructing a stationary pair whose average cost is no greater than \underline{\rho}. This will give us \inf(\text{P})=\underline{\rho} (since it implies \inf(\text{P})\leq\underline{\rho}, whereas \underline{\rho}\leq\inf(\text{P}) always holds). The proof will proceed in four steps, with the first three steps making preparations for the last one.
Step (i): From the definition of \underline{\rho}, it follows that \big{(}(1,{\it 0}),\underline{\rho}\big{)}\in\overline{H} and moreover, there exist a directed set and a net in H with
[TABLE]
in the \sigma\big{(}\mathbb{R}\times\mathbb{M}(\mathbb{X})\times\mathbb{R},\,\mathbb{R}\times\mathbb{F}_{b}(\mathbb{X})\times\mathbb{R}\big{)} topology. This means that
[TABLE]
In view of (5.2), there exists such that for all , . Then, since all are nonnegative measures and , by restricting attention to , and considering the normalized measures instead of , we can redefine the net in the above so that every is a probability measure on :
[TABLE]
Step (ii): Next, from the net , we will extract a sequence of probability measures with the property that the convergence in (5.3) holds for a countable subset of the functions in . We start by defining this subset. It consists of two countable families of functions, and . The set involves continuous bounded functions that will be used to determine if two probability measures on are equal. The set involves indicator functions of certain sets in that will be important in the subsequent proof to handle the discontinuities in the MDP model by using Lusin’s theorem and the majorization property in Assumption 2.1(M). The construction of will use the arguments we used in the proof of [40, Thm. 3.5(a)]. The precise definitions of these two sets are as follows.
Recall that is the set of (real-valued) bounded continuous functions on . Since is metrizable, by [34, Chap. II, Thm. 6.6], there exists a countable set
[TABLE]
such that in , a sequence of probability measures if and only if
[TABLE]
Then by [11, Prop. 11.3.2], for any ,
[TABLE]
The countable set is the first family of functions we will need.
We now define the other countable family of indicator functions mentioned earlier. The definition of this set involves some new notations and Lusin’s theorem.
Let denote the set of all positive integers. For , define the truncated one-stage cost function on (later, a technical argument in Step (iv) of our proof will involve these functions). For each , corresponding to the compact set in Assumption 2.1(SU), let be the open set, the closed set, and the finite measure, respectively, in Assumption 2.1(M) for . Let , the projection of on . Then the set is compact, and since is countable and discrete, this means that the set is finite.
Lemma 5.1.
For each and , there exist closed subsets B^{1}_{j,m,\ell} and B^{2}_{j,\ell} of \mathbb{X} such that the following hold:
- (i) \nu_{j}\big{(}\mathbb{X}\setminus B^{1}_{j,m,\ell}\big{)}\leq\ell^{-1} and \nu_{j}\big{(}\mathbb{X}\setminus B^{2}_{j,\ell}\big{)}\leq\ell^{-1};
- (ii) restricted to the set , the function is continuous, and restricted to the set , the state transition stochastic kernel is continuous.
Proof.
This lemma is a consequence of Lusin’s theorem (see [11, Thm. 7.5.2]), which asserts that if is a Borel measurable function from a topological space into a separable metric space and is a closed regular finite Borel measure on , then for any , there is a closed set such that and the restriction of to is continuous.
We apply this theorem with and for each in the lemma. Since is a metrizable topological space, every finite Borel measure is closed regular by [11, Thm. 7.1.3], and therefore, the finite measure in the lemma meets the condition in Lusin’s theorem.
For each , to find the desired closed set , we apply Lusin’s theorem with , , and , and with the function for each action . This gives us, for each , a closed set such that and restricted to , is continuous. Then the closed set has the desired property that \nu_{j}\big{(}\mathbb{X}\setminus B^{1}_{j,m,\ell}\big{)}\leq\ell^{-1} and restricted to , is continuous.
For each , the desired closed set is constructed similarly, by applying Lusin’s theorem to the state transition stochastic kernel , which is a -valued Borel measurable function on . Specifically, we let , , , and . (Since is separable and metrizable, by [3, Prop. 7.20], is also a separable metrizable space and hence meets the condition for the space in Lusin’s theorem.) We apply Lusin’s theorem to for each to obtain a closed set such that and restricted to , is continuous. We then let the desired set . ∎
We group , in the preceding proof into two countable collections and :
[TABLE]
Let denote the indicator function for a set . Finally, define a countable set of indicator functions on by
[TABLE]
Note that the sets in (5.6) are open sets (since is open and are closed); this fact will be useful later.
We now extract a desirable sequence from the net :
Lemma 5.2.
There exists a sequence such that
[TABLE]
Proof.
Let us order the functions in the countable set as . Choose any and let for . For each , by (5.3)-(5.4), there exists , such that for all ,
[TABLE]
Let . The resulting sequence satisfies (5.7)-(5.8). ∎
Step (iii): Henceforth, we work with the sequence of probability measures given by Lemma 5.2. The relation (5.8) together with Assumption 2.1(SU) implies that is a tight family of probability measures on . So by Prohorov’s theorem [4, Thm. 6.1], it has a subsequence that converges weakly to some probability measure on . To simplify notation, let us use the same notation to denote the convergent subsequence. Thus .
By [3, Cor. 7.27.2], the probability measure can be decomposed into its marginal on and a stochastic kernel on given that obeys the control constraint of the MDP; i.e.,
[TABLE]
This gives us a stationary policy . Before we investigate the property of the pair in the next step, we need the following majorization property, which will be used to deal with the discontinuities in the MDP model:
Lemma 5.3.
For every ,
[TABLE]
Proof.
For , let and since the indicator function , we have, by (5.7) in Lemma 5.2, that
[TABLE]
We also have, by Assumption 2.1(M),
[TABLE]
Hence for all ; consequently, .
Now (since ) and is an open set (since is open and are closed). Therefore, by [11, Thm. 11.1.1] and the first part of the proof, . ∎
Step (iv): We are now ready to prove that \big{(}(1,{\it 0}),\underline{\rho}\big{)}\in H.
Lemma 5.4.
The pair is a stationary pair with .
Proof outline.
We will only outline the proof, because the arguments for this lemma are essentially the same as those we used in an earlier work to prove the existence and pathwise optimality properties of stationary pairs [40, Sect. 4.1 and Sect. 4.3.1]. By Lemma 5.2, it suffices to prove the inequality
[TABLE]
and to prove that for all ,
[TABLE]
To see the sufficiency of (5.9) and (5.10), note that (5.10), together with (5.7) in Lemma 5.2 and the fact for all (since ), will imply that
[TABLE]
In turn, this will imply that is identical to the probability measure (cf. (5.5)), thus proving that is an invariant probability measure for the Markov chain induced by the policy and hence is a stationary pair. Then the first relation (5.9) will give us the desired inequality .
Proving (5.9): The proof of (5.9) is essentially the same as that given in [40, Sect. 4, proofs of Lems. 4.3, 4.9]. Below, we sketch the main proof arguments (see the proofs in [40] for the details of each step):
To show (5.9), it suffices to show that for each ,
[TABLE]
(In the above, the probability measures and are extended from to , and is the truncated one-stage cost function , as we recall.) 2. 2.
Fix . To prove (5.11), consider arbitrarily small , for some arbitrarily large . Assumption 2.1(SU) together with (5.8) in Lemma 5.2 allows us to choose large enough so that for the compact set in Assumption 2.1(SU), we have for all and . This in turn allows us to bound and by , an negligible term when we take . Consequently, to prove (5.11), we can focus on the integrals of on the compact set and on bounding the difference
[TABLE] 3. 3.
We now handle the term (5.12)—this is where we apply Lusin’s theorem and the majorization property given in Assumption 2.1(M). Corresponding to , let us choose the element (cf. the definition of the set given in Step (ii)). By the definition of the set (cf. Lemma 5.1 in Step (ii)), the function is continuous on the closed set , where , and . We handle the continuous part of separately from the rest of . Specifically, we first consider the restriction of to the closed set , which is a lower semicontinuous function on in view of the property of given in Assumption 2.1(M). We apply the Tietze–Urysohn extension theorem [11, Thm. 2.6.4] to extend this function to a function on the entire space that is nonnegative, lower semicontinuous, and also bounded above by . Since , by [19, Prop. E.2],
[TABLE]
We then handle the difference between and . These two functions differ only outside the set . By using the fact and (cf. Assumption 2.1(M)), the majorization property given in Lemma 5.3, and the bounds , from Step 2, we can calculate that
[TABLE]
4. Finally, putting all the pieces together gives us the inequality
[TABLE]
By letting so that , the desired relation (5.11) follows for all and this implies (5.9).
Proving (5.10): The proof of (5.10) is similar to the above and essentially the same as that given in [40, Sect. 4, proofs of Lems. 4.4, 4.10]. We outline the main arguments below (see [40] for detailed derivations):
1. Consider an arbitrary . Let , for some arbitrarily large . Proceed as in Step 2 of the proof of (5.9) to choose large enough so that for the compact set in Assumption 2.1(SU), we have for all and .
2. Define a function on . Corresponding to the chosen and , choose the element and let . By the definition of the set (cf. Lemma 5.1 in Step (ii)), and on the closed set , is continuous. Then, since is also continuous on the closed set (cf. Assumption 2.1(M)) and is a bounded continuous function, we have, by [3, Prop. 7.30], that the function is continuous on the closed set . We now treat the continuous part of separately: by the Tietze–Urysohn extension theorem [11, Thm. 2.6.4], the restriction of to can be extended to a bounded continuous function on the entire space , with . Since , we have
[TABLE]
We then handle the difference between and . These two functions differ only outside the set . By using the fact and (cf. Assumption 2.1(M)), the majorization property given in Lemma 5.3, and the bounds , from Step 1, we can calculate that
[TABLE]
3. Finally, putting all the pieces together gives us the bound
[TABLE]
By letting so that , the desired relation (5.10) follows.
The lemma now follows from (5.9)-(5.10), as discussed earlier. ∎
By Lemma 5.4, $\big((1, \mathit{0}),\, \underline{\rho}\big) = \big(L\bar{\gamma},\, \langle \bar{\gamma}, c \rangle + \bar{r}\big)$ for . Thus $\big((1, \mathit{0}),\, \underline{\rho}\big) \in H$ and consequently, . This completes the proof of Theorem 3.1.
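As a finite-state sanity check of the "no duality gap" conclusion, the following minimal sketch compares the primal and dual values on a toy model. It is not the paper's infinite-dimensional setting: the two-state, two-action model, its transition probabilities, and its costs are all hypothetical, and the primal problem is solved by enumerating deterministic stationary policies rather than by an LP solver.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical 2-state, 2-action model (all numbers are illustrative assumptions).
# q[(x, a)] = transition probabilities to states (0, 1); c[(x, a)] = one-stage cost.
q = {
    (0, 0): (F(1, 2), F(1, 2)),
    (0, 1): (F(0), F(1)),
    (1, 0): (F(1, 2), F(1, 2)),
    (1, 1): (F(1), F(0)),
}
c = {(0, 0): F(2), (0, 1): F(0), (1, 0): F(4), (1, 1): F(6)}


def invariant_measure(policy):
    """Invariant distribution of the two-state chain induced by a deterministic policy."""
    a = q[(0, policy[0])][1]  # probability of moving 0 -> 1
    b = q[(1, policy[1])][0]  # probability of moving 1 -> 0
    return (b / (a + b), a / (a + b))


def average_cost(policy):
    """Average cost of the stationary pair (policy, invariant measure)."""
    p = invariant_measure(policy)
    return sum(p[x] * c[(x, policy[x])] for x in (0, 1))


# Primal side: minimize the average cost over stationary pairs (here, by enumeration).
policies = list(product((0, 1), repeat=2))
best_policy = min(policies, key=average_cost)
primal_value = average_cost(best_policy)

# Dual side: build (rho, h) from the best policy with h(0) = 0, then check
# rho + h(x) <= c(x, a) + sum_y q(y | x, a) h(y) for every state-action pair.
rho = primal_value
h1 = (rho - c[(0, best_policy[0])]) / q[(0, best_policy[0])][1]
h = (F(0), h1)
dual_feasible = all(
    rho + h[x] <= c[(x, a)] + sum(q[(x, a)][y] * h[y] for y in (0, 1))
    for (x, a) in q
)

print(primal_value, dual_feasible)  # prints "8/3 True"
```

Here the dual-feasible pair attains the primal value, so the two values coincide in this toy model. Exact rational arithmetic is used so that the feasibility check is not affected by rounding.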
5.1.2 Proof of Prop. 3.1
The proof is similar to that of [20, Chap. 12.4B, Thm. 12.4.2(c)]. Since is a maximizing sequence of (), for all , is feasible for ():
[TABLE]
By assumption and for each , . The latter implies
[TABLE]
by Fatou’s lemma. So, letting and taking limit superior on both sides of (5.13), we obtain
[TABLE]
which is the desired inequality and also shows that is finite everywhere.
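For reference, limit-superior arguments of this kind typically rely on the "reverse" form of Fatou's lemma stated below. The symbols $f_n$, $g$, and $\mu$ are generic placeholders introduced here, and identifying them with the dual functions and the transition kernel in (5.13) is an assumption about how the lemma is applied.

```latex
% Reverse Fatou lemma (standard measure theory; generic symbols).
% Let (f_n) be measurable functions, \mu a measure, and g a \mu-integrable
% function with f_n \le g for all n. Then
\[
  \limsup_{n \to \infty} \int f_n \, d\mu
  \;\le\;
  \int \limsup_{n \to \infty} f_n \, d\mu .
\]
% With \mu = q(\cdot \mid x, a) (assumed identification), this allows the limit
% superior to pass inside the integral on the right-hand side of a dual
% feasibility inequality.
```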
Next, we prove the ACOE for -a.a. states. Since is a stationary minimum pair and by assumption, we have
[TABLE]
and hence
[TABLE]
This together with (5.14) implies that for -a.a. ,
[TABLE]
which in turn implies that for -a.a. ,
[TABLE]
Then, by (5.14), equality must hold in (5.15), and this gives the desired ACOE (3.14) and (3.15). The proof of Prop. 3.1 is now complete.
5.2 Proofs for Section 4
5.2.1 Proof of Theorem 4.1 (Outline)
The proof of Theorem 4.1 is similar to that of Theorem 2.1 on stationary minimum pairs for an unconstrained MDP. The latter proof is given in our prior work [40, Sect. 4.1, proofs of Prop. 3.2 and Thm. 3.3], and its main arguments have already been explained earlier in the proof of Lemma 5.4. So we will only outline the proof of Theorem 4.1, in order to avoid repetition. We will first state some of our prior results for unconstrained MDPs. We will then directly apply them to the present case of constrained MDPs.
In [40, Sect. 4.1] we considered two kinds of sequences . In the first case, are the occupancy measures of a policy , for an initial distribution that satisfies :
[TABLE]
In the second case, corresponds to a sequence of stationary pairs that satisfy :
[TABLE]
In both cases, , which, together with the strict unboundedness condition in Assumption 2.1(SU), implies that (i) is tight and for the compact sets in Assumption 2.1(SU), as , uniformly in ; and (ii) a weakly convergent subsequence can be extracted from any subsequence of : . For both cases, the limiting probability measure is proved to have the following properties, by using (i)-(ii) and the majorization condition in Assumption 2.1(M):
- (a)
corresponds to a stationary pair (i.e., ).
- (b)
The average cost of the pair satisfies
[TABLE]
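The tightness claim in (i) rests on a Markov-type inequality: a uniform bound on the average cost limits how much mass can sit where the cost is large. The sketch below checks this on a toy family of measures. The state space {0, 1, 2, ...}, the cost function, the compact sets K_j = {0, ..., j}, and the two-point measures are all hypothetical choices made for illustration.

```python
from fractions import Fraction as F


def cost(x):
    # Hypothetical strictly unbounded cost on the states {0, 1, 2, ...}.
    return F(x)


def measure(n):
    """Two-point probability measure: mass 1 - 1/n at state 0 and 1/n at state n."""
    return {0: 1 - F(1, n), n: F(1, n)}


def mean_cost(mu):
    return sum(mass * cost(x) for x, mass in mu.items())


def mass_outside(mu, j):
    """Mass that mu puts outside the compact set K_j = {0, ..., j}."""
    return sum(mass for x, mass in mu.items() if x > j)


family = [measure(n) for n in range(1, 201)]
M = max(mean_cost(mu) for mu in family)  # uniform bound on the average cost

# Markov-type bound: mu(K_j^c) <= M / inf_{x not in K_j} cost(x) = M / (j + 1).
bound_holds = all(
    mass_outside(mu, j) <= M / cost(j + 1)
    for mu in family
    for j in range(0, 50)
)

print(M, bound_holds)  # prints "1 True"
```

The support of these measures drifts to infinity, yet the mass outside K_j is at most 1/(j + 1) uniformly over the family, which is the tightness property in miniature.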
We now explain how we can apply these results to prove Theorem 4.1 for the constrained MDP. To prove Theorem 4.1(i), we consider defined by (5.16) for a pair . By the feasibility of , its average costs are all finite:
[TABLE]
Since at least one of the one-stage cost functions is strictly unbounded by Assumption 4.1(SU), this implies that is a tight family of probability measures on and for the compact sets in Assumption 4.1(SU), the convergence as is uniform in . We then proceed as in the unconstrained case to obtain, from a weakly convergent subsequence of , the limiting probability measure . Next, using the majorization condition in Assumption 4.1(M), it follows as before that has the property (a) given above and gives us a stationary pair . Moreover, because Assumption 4.1(M) is the same as Assumption 2.1(M) holding for every one-stage cost function in the constrained MDP, (5.18) in the property (b) above now holds with the function replaced by every ; that is
[TABLE]
Since , it follows that
[TABLE]
This proves Theorem 4.1(i).
To prove Theorem 4.1(ii), which asserts the existence of a stationary optimal pair, we consider a sequence of stationary pairs with (there exists such a sequence by the part (i) just proved). Let be defined as in (5.17). Then, since for all , we have
[TABLE]
Since at least one of the functions is strictly unbounded under our assumption, as in the proof of the part (i), we can extract a weakly convergent subsequence of and from its limiting probability measure , we can obtain a stationary pair such that for all ,
[TABLE]
Since and is feasible for the constrained problem, (5.19) implies
[TABLE]
Hence is a stationary optimal pair for the constrained MDP.
We now prove Theorem 4.1(iii), which asserts the existence of a stationary lexicographically optimal pair. First, let us define recursively sets and scalars as follows: Let
[TABLE]
and for , let
[TABLE]
Then and consists of all the lexicographically optimal pairs. So, to prove Theorem 4.1(iii), we need to show . By Theorem 4.1(ii) just proved, . Let us prove by induction that for all .
Assume that for some , . Then is well-defined, and there exists a sequence of policy and initial distribution pairs with
[TABLE]
By Theorem 4.1(i) proved earlier, for each , there is a stationary pair with
[TABLE]
This together with the fact implies that . Consider now the sequence of stationary pairs thus constructed. Exactly the same proof arguments for establishing the part (ii) can be applied here, and they yield that there exists a stationary pair that satisfies (5.19). Therefore,
[TABLE]
and consequently, . This proves ; then, by induction, . Hence there is a stationary lexicographically optimal pair for the constrained MDP.
This completes the proof of Theorem 4.1.
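The recursive construction used for part (iii) can be sketched on a finite set of candidates, where every minimum is attained automatically. The pair names and cost vectors below are hypothetical, and the stage indexing is an assumption about how the sets in the proof are defined.

```python
# Hypothetical average-cost vectors (J_0, J_1, J_2) of four feasible pairs.
costs = {
    "pair_a": (2, 5, 1),
    "pair_b": (2, 3, 4),
    "pair_c": (3, 0, 0),
    "pair_d": (2, 3, 2),
}


def lexicographic_stages(costs):
    """Filter successively: stage k keeps the minimizers of cost k among stage k-1."""
    survivors = set(costs)
    stages = []
    num_costs = len(next(iter(costs.values())))
    for k in range(num_costs):
        best_k = min(costs[name][k] for name in survivors)
        survivors = {name for name in survivors if costs[name][k] == best_k}
        stages.append((best_k, sorted(survivors)))
    return stages


stages = lexicographic_stages(costs)
for k, (value, members) in enumerate(stages):
    print(k, value, members)
# prints:
# 0 2 ['pair_a', 'pair_b', 'pair_d']
# 1 3 ['pair_b', 'pair_d']
# 2 2 ['pair_d']
```

In this finite example nonemptiness of each stage is trivial. The point of the induction in the proof is that the same nested sets remain nonempty in the general setting, where attainment of each infimum has to be established.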
5.2.2 Proof of Theorem 4.2 (Outline)
The consistency and solvability of (P) follow from Theorem 4.1(i)-(ii). The consistency of (P$^*$) is trivial (e.g., let , ). Thus, $0 \leq \sup(\text{P}^*) \leq \inf(\text{P}) = \rho_c^*$.
We now prove the absence of a duality gap. This proof is similar to that of Theorem 3.1(ii) for the unconstrained MDP case. Since the value of (P$^*$) is finite, by [2, Thm. 3.3] (cf. Theorem 2.2), the value of (P$^*$) equals the subvalue of (P). Therefore, proving that there is no duality gap amounts to proving . For this, it suffices to show
[TABLE]
where the set is as defined in (2.4) and, for the case here, is given by
[TABLE]
Recall that by definition the subvalue $\underline{\rho} = \inf\big\{ r \mid \big((1, \mathit{0}, \kappa), r\big) \in \overline{H} \big\}$ (cf. Section 2.2).
To prove $\big((1, \mathit{0}, \kappa), \underline{\rho}\big) \in H$, we will construct a stationary pair with , and the proof proceeds in four steps as in the proof of Theorem 3.1(ii). Let us outline these steps, explaining briefly some minor changes in the details of the arguments.
Step (i): From the definition of , it follows that $\big((1, \mathit{0}, \kappa), \underline{\rho}\big) \in \overline{H}$ and there exist a directed set and a net in such that
[TABLE]
As before, in view of (5.21) and the fact , by redefining the net if necessary, we may assume that every in the above is a probability measure on .
Step (ii): Similarly to Lemma 5.2, we extract a sequence such that
[TABLE]
where and in (5.25) are two chosen countable subsets of , the properties of which are needed in the subsequent two steps of our proof. In particular, the set is the countable set of bounded continuous functions with the property (5.5), the same set as defined in the proof of Theorem 3.1(ii). The countable set is also defined by the equation (5.6) in that proof:
[TABLE]
However, while the set is defined in the same way as before, we define the set slightly differently here, to take into account the multiple one-stage cost functions in the constrained MDP. Specifically, in the definition of (cf. Lemma 5.1 and the definitions preceding this lemma), we make the following changes. We now use the sets and finite measures involved in Assumption 4.1(M) instead of Assumption 2.1(M). We choose the sets for each such that besides the property in Lemma 5.1(i), we have that restricted to , all the truncated one-stage cost functions, , , are continuous (where ). This is possible by Lusin’s theorem (since we have only a finite number of these cost functions, we can apply Lusin’s theorem to each one of them and then combine the results).
Step (iii): This step is the same as before. The relations (5.26)-(5.27) together with Assumption 4.1(SU) imply that is a tight family of probability measures and therefore has a weakly convergent subsequence . Consider the corresponding subsequence ; for notational simplicity, we will drop the subscript by redefining to be this subsequence. Now, denote the limit of by , and decompose as , where is the marginal of on and is a stationary policy. Then, using Assumption 4.1(M) instead of Assumption 2.1(M), we have that Lemma 5.3 holds as before, which gives us the desired majorization properties for and that we will need in the next, last step.
Step (iv): This step is almost the same as before, except that we apply those arguments in the proof of (5.9) to every cost function , , in the present constrained problem. Then, similar to Lemma 5.4, we obtain that the pair is a stationary pair and satisfies that
[TABLE]
Combining this with (5.26) and (5.27) (recall also ), we obtain
[TABLE]
Therefore, if we let
[TABLE]
then
[TABLE]
This implies (since it implies , whereas ). Hence there is no duality gap between (P) and (P$^*$).
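The absence of a duality gap for a constrained problem can be checked by hand in a degenerate case. The sketch below uses a hypothetical one-state model with two actions, so that stationary policies are just mixing probabilities and the function part of the dual variable is a constant that cancels. The costs, the constraint level, the search grids, and the Lagrangian form of the dual objective are assumptions made for illustration and may not match the paper's (P) and its dual term by term.

```python
from fractions import Fraction as F

# Hypothetical one-state model with two actions: c0 is the objective cost,
# c1 the constrained cost, and kappa the constraint level (all assumed values).
c0 = {"a": F(1), "b": F(3)}
c1 = {"a": F(4), "b": F(0)}
kappa = F(2)

# Search grid on [0, 1], used both for mixing probabilities and multipliers.
grid = [F(k, 100) for k in range(0, 101)]


def J0(p):
    """Average objective cost when action "a" is played with probability p."""
    return p * c0["a"] + (1 - p) * c0["b"]


def J1(p):
    """Average constrained cost when action "a" is played with probability p."""
    return p * c1["a"] + (1 - p) * c1["b"]


# Primal: minimize J0 over the mixing probabilities that satisfy J1 <= kappa.
primal_value = min(J0(p) for p in grid if J1(p) <= kappa)


def dual_objective(beta):
    """Largest rho with rho <= c0(a) + beta * c1(a) for all a, minus beta * kappa."""
    return min(c0[a] + beta * c1[a] for a in c0) - beta * kappa


# Dual: maximize over multipliers beta >= 0 (restricted here to the grid).
dual_value = max(dual_objective(beta) for beta in grid)

print(primal_value, dual_value)  # prints "2 2"
```

The primal optimum mixes the two actions with probability 1/2 each, and the dual optimum is attained at the multiplier 1/2, so the two values agree.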
5.2.3 Proofs of Props. 4.1 and 4.2
Proof of Prop. 4.1.
(i) Consider any and some pair with . By Theorem 4.1(i), there exists a stationary pair with for all . Then .
Now for each , since is feasible for (P$^*$), we have from (4.10)-(4.11) that and for all ,
[TABLE]
and therefore, by adding to both sides,
[TABLE]
Integrate both sides of (5.28) w.r.t. the probability measure . Notice that since is a stationary pair. We thus obtain
[TABLE]
Take . Since is a maximizing sequence for (P$^*$), by Theorem 4.2(ii). It then follows from (5.29) that
[TABLE]
where we used the fact and for all to derive (5.31). Since , (5.31) implies . Hence the sequence is bounded.
(ii) In this case, suppose is such that . Then and (5.31) holds with and with its left-hand side equal to . This yields .
(iii) In this case, by assumption there is some pair satisfying
[TABLE]
As in part (i), let us consider a stationary pair with for all . Such a pair exists by Theorem 4.1(i), since we can apply this theorem with a different feasible set instead of and in we can use as the upper limits on the average costs w.r.t. for , for instance.
The average costs of this stationary pair thus satisfy
[TABLE]
We also have, as in part (i), that (5.30) holds for this pair . Now, as we proved in part (i), is bounded for every . This together with the second relation in (5.32) implies that the term
[TABLE]
is finite. From (5.30), we have the inequality
[TABLE]
In (5.33), since the term on the left-hand side and the first term on the right-hand side are both finite, the second term on the right-hand side must satisfy
[TABLE]
Then, since , in view of the first relation in (5.32), the preceding inequality implies that must be bounded for every . Combining this with the result of part (i), we obtain that for every , the sequence is bounded. Hence is bounded. ∎
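The mechanism behind part (iii), that a strictly feasible pair forces near-optimal multipliers to stay bounded, can be illustrated on a hypothetical one-state model. The costs, the constraint level, the Lagrangian form of the dual objective, and the explicit bound below are assumptions made for illustration and are not taken from the paper's formulas.

```python
from fractions import Fraction as F

# Hypothetical one-state model: objective cost c0, constrained cost c1, level kappa.
c0 = {"a": F(1), "b": F(3)}
c1 = {"a": F(4), "b": F(0)}
kappa = F(2)


def dual_objective(beta):
    """Largest rho with rho <= c0(a) + beta * c1(a) for all a, minus beta * kappa."""
    return min(c0[a] + beta * c1[a] for a in c0) - beta * kappa


# Slater-type pair: always play "b", whose constrained cost 0 is strictly below kappa.
J0_slater, J1_slater = c0["b"], c1["b"]


def multiplier_bound(value):
    """Upper bound on beta implied by dual_objective(beta) >= value."""
    return (J0_slater - value) / (kappa - J1_slater)


betas = [F(k, 10) for k in range(0, 51)]  # multipliers in [0, 5]
threshold = F(3, 2)                       # a level below the optimal dual value
near_optimal = [b for b in betas if dual_objective(b) >= threshold]
bound_respected = all(b <= multiplier_bound(threshold) for b in near_optimal)

print(max(near_optimal), multiplier_bound(threshold), bound_respected)
# prints "7/10 3/4 True"
```

The bound follows from evaluating the dual constraint at the strictly feasible pair: a dual value of at least `threshold` leaves at most `J0_slater - threshold` to be absorbed by the multiplier times the slack `kappa - J1_slater`.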
Proof of Prop. 4.2.
The proof arguments are similar to those of [17, Thm. 5.2(b)] for constrained MDPs and those of Prop. 3.1 for unconstrained MDPs. By the feasibility of for (P$^*$), we have the inequality (5.28); that is, for each ,
[TABLE]
Let . Since and by assumption, we obtain
[TABLE]
For each , it follows from the assumption and Fatou’s lemma that
[TABLE]
Combining the preceding two relations gives us the desired inequality (4.13):
[TABLE]
which also shows that is finite everywhere.
Next, corresponding to the stationary optimal pair , let and integrate both sides of (5.34) w.r.t. the probability measure . As in the proof of Prop. 3.1, here the integrability is ensured by our assumption and the invariance property of , which also imply that $-\infty < \int_{\mathbb{X}} h^{*}(x)\, dp^{*} = \int_{\Gamma} \int_{\mathbb{X}} h^{*}(y)\, q(dy \mid x, a)\, \gamma^{*}\big(d(x,a)\big) < +\infty$. We thus obtain
[TABLE]
But and the second term in the right-hand side above is nonpositive, so equality must hold in the above inequality. This result can be equivalently expressed as
[TABLE]
Similarly to the proof of Prop. 3.1, the preceding equality together with the inequality (5.34) implies that for -a.a. ,
[TABLE]
This gives the desired ACOE (4.14) and (4.15). ∎
Acknowledgments
The author would like to thank Professor Eugene Feinberg and the anonymous reviewer for their comments that helped her improve the paper, and Dr. Martha Steenstrup for reading parts of the paper and giving her advice on improving the presentation. This research was supported by grants from DeepMind, Alberta Machine Intelligence Institute (AMII), and Alberta Innovates—Technology Futures (AITF).
References
- [1] Altman, E. (1999). Constrained Markov Decision Processes. Chapman and Hall/CRC, Boca Raton, FL.
- [2] Anderson, E. J. and Nash, P. (1987). Linear Programming in Infinite-Dimensional Spaces. John Wiley & Sons, Chichester, UK.
- [3] Bertsekas, D. P. and Shreve, S. E. (1978). Stochastic Optimal Control: The Discrete Time Case. Academic Press, New York.
- [4] Billingsley, P. (1968). Convergence of Probability Measures. John Wiley & Sons, New York.
- [5] Blackwell, D. and Ryll-Nardzewski, C. (1963). Non-existence of everywhere proper conditional distributions. Ann. Math. Statist., 34:223–225.
- [6] Borkar, V. S. (1988). A convex analytic approach to MDPs. Probab. Th. Rel. Fields, 78:583–602.
- [7] Borkar, V. S. (1994). Ergodic control of Markov chains with constraints–the general case. SIAM J. Control Optim., 32:176–186.
- [8] Denardo, E. V. (1970). On linear programming in a Markov decision problem. Manag. Sci., 16:281–288.
