‘Controlled’ versions of the Collatz–Wielandt and Donsker–Varadhan formulae
Ari Arapostathis, Vivek S. Borkar

Abstract
This is an overview of the work of the authors and their collaborators on the characterization of risk sensitive costs and rewards in terms of an abstract Collatz-Wielandt formula and in case of rewards, also a controlled version of the Donsker-Varadhan formula. For the finite state and action case, this leads to useful linear and dynamic programming formulations in the reducible case.
Ari Arapostathis∗
∗Department of Electrical and Computer Engineering, The University of Texas at Austin, EER 7.824, Austin, TX 78712
and
Vivek S. Borkar‡
‡Department of Electrical Engineering, Indian Institute of Technology, Powai, Mumbai 400076, India
Key words and phrases:
Risk-sensitive criterion, Donsker–Varadhan functional, Collatz–Wielandt formula, principal eigenvalue
2000 Mathematics Subject Classification:
Primary 60J60, Secondary 60J25, 35K59, 35P15, 60F10
1. Introduction
This short article is an overview of the work of the authors and their collaborators on a somewhat novel perspective on the infinite-horizon risk-sensitive control problem, which aims to optimize the asymptotic growth rate of a mean exponentiated total reward or, respectively, cost. The viewpoint taken here is based on the fact that the dynamic programming principle for this problem essentially reduces it to an eigenvalue problem seeking the principal eigenvalue and eigenvector of a monotone, positively 1-homogeneous operator. This allows us to exploit the existing generalized Perron–Frobenius (or Krein–Rutman) theory, which leads to some explicit expressions for the optimal growth rate. The first is the abstract Collatz–Wielandt formula, which can be shown to hold for both cost minimization and reward maximization problems, though we have not exhausted all the cases in our work. The second is a variational formula for the principal eigenvalue that generalizes the Donsker–Varadhan formula for the same in the linear case. This seems workable only for the reward maximization problem.
We first consider the discrete time case, based on the results of [Ananth], in the next two sections, followed by the results for reflected diffusions in a bounded domain, based on [ABK], in section 4. We then sketch, in section 5, the very recent and highly nontrivial extensions to diffusions on the whole space developed in [AriAnup] and [AABK]. Finally, we recall in section 6 some developments in the simple finite state-action setup from [CDC], where the aforementioned development allows us to derive the dynamic programming equations for risk-sensitive reward processes in the reducible case. Section 7 concludes by highlighting some future directions.
2. Discrete time problems
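The celebrated Courant–Fischer formula for the principal eigenvalue $\lambda_{\max}$ of a positive definite symmetric matrix $A\in\mathbb{R}^{d\times d}$ is
$$\lambda_{\max} = \max_{x\ne 0}\ \frac{x^{\mathsf T} A x}{x^{\mathsf T} x}.$$
Consider an irreducible nonnegative matrix $Q=[q_{ij}]\in\mathbb{R}^{d\times d}$. The Perron–Frobenius theorem guarantees a positive principal eigenvalue $\lambda$ with an associated positive eigenvector for $Q$. Is there a counterpart of the Courant–Fischer formula for this eigenvalue?
The answer is a resounding ‘YES’! It is the Collatz–Wielandt formula for the principal eigenvalue $\lambda$ of an irreducible nonnegative matrix $Q$, stated as (see [Meyer], Chapter 8):
$$\lambda = \max_{x\ge 0,\ x\ne 0}\ \min_{i\,:\,x_i>0}\ \frac{(Qx)_i}{x_i} = \min_{x>0}\ \max_{1\le i\le d}\ \frac{(Qx)_i}{x_i}.$$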
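As a numerical illustration of the Collatz–Wielandt formula, the following Python sketch checks on a small example that a strictly positive test vector $x$ brackets the principal eigenvalue, $\min_i (Qx)_i/x_i \le \lambda \le \max_i (Qx)_i/x_i$, and that power iteration closes the bracket. The matrix `Q`, the helper names, and the use of power iteration (which assumes that `Q` is primitive, a stronger condition than irreducibility) are illustrative assumptions and not part of the original development.

```python
def matvec(Q, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(q * xj for q, xj in zip(row, x)) for row in Q]

def collatz_wielandt_bounds(Q, x):
    """Return (min_i (Qx)_i/x_i, max_i (Qx)_i/x_i) for a strictly positive x."""
    ratios = [yi / xi for yi, xi in zip(matvec(Q, x), x)]
    return min(ratios), max(ratios)

def principal_eigenvalue(Q, iters=500):
    """Power iteration from the all-ones vector; assumes Q is primitive."""
    x = [1.0] * len(Q)
    for _ in range(iters):
        y = matvec(Q, x)
        s = sum(y)
        x = [yi / s for yi in y]
    lower, upper = collatz_wielandt_bounds(Q, x)
    return 0.5 * (lower + upper), x

# Hypothetical irreducible (in fact primitive) nonnegative matrix.
Q = [[0.0, 2.0, 1.0],
     [1.0, 0.0, 3.0],
     [2.0, 1.0, 0.0]]

lam, v = principal_eigenvalue(Q)
lo, hi = collatz_wielandt_bounds(Q, [1.0, 2.0, 3.0])
print(f"bracket from x = (1, 2, 3): [{lo:.4f}, {hi:.4f}], eigenvalue ~ {lam:.6f}")
```

For the test vector $(1,2,3)$ the bracket is $[4/3, 7]$. Taking $x$ to be the Perron eigenvector makes both bounds coincide with $\lambda$.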
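An alternative characterization can be given as follows. Write the irreducible nonnegative matrix above as
$$Q = [q_{ij}] = \Gamma P,$$
with $\Gamma = \mathrm{diag}(\gamma_1,\dots,\gamma_d)$, $\gamma_i := \sum_j q_{ij}$, and $P = [p(j\mid i)]$ a stochastic matrix. In other words, we have pulled out the row sums of $Q$ into a diagonal matrix $\Gamma$ so that what is left is a stochastic matrix $P$. Also define
$$c(i) := \log\gamma_i,\qquad 1\le i\le d,$$
and let $\mathcal{G}$ denote the set of pairs $(\pi,\tilde P)$, where $\tilde P = [\tilde p(j\mid i)]$ is a stochastic matrix and $\pi$ is a probability vector invariant under $\tilde P$.
Then the following representation holds for the principal eigenvalue $\lambda$ of $Q$ [Dembo]:
$$\log\lambda = \sup_{(\pi,\tilde P)\in\mathcal{G}}\ \sum_{i}\pi(i)\Bigl(c(i) - D\bigl(\tilde p(\cdot\mid i)\,\big\|\,p(\cdot\mid i)\bigr)\Bigr),$$
where $D(\cdot\,\|\,\cdot)$ denotes the Kullback–Leibler divergence or relative entropy. This is the finite state counterpart of the Donsker–Varadhan formula [DoVa] for the principal eigenvalue of a nonnegative matrix.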
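The finite state variational representation can be checked numerically. The sketch below assumes the standard form of the formula, $\log\lambda = \sup \sum_i \pi(i)\bigl(c(i) - D(\tilde p(\cdot\mid i)\,\|\,p(\cdot\mid i))\bigr)$, with $c(i)$ the logarithm of the $i$-th row sum and the supremum over stochastic matrices $\tilde P$ with invariant distribution $\pi$. It evaluates the objective at the 'twisted' kernel $\tilde p(j\mid i) = q_{ij}v_j/(\lambda v_i)$, where the value should equal $\log\lambda$, and at the original kernel $P$, where it can only be smaller. The example matrix and all function names are hypothetical.

```python
import math

def matvec(Q, x):
    return [sum(q * xj for q, xj in zip(row, x)) for row in Q]

def perron_eigenpair(Q, iters=2000):
    """Power iteration; assumes Q is primitive (an assumption of this sketch)."""
    x = [1.0] * len(Q)
    for _ in range(iters):
        y = matvec(Q, x)
        s = sum(y)
        x = [yi / s for yi in y]
    return sum(matvec(Q, x)) / sum(x), x

def stationary_distribution(P, iters=5000):
    """Invariant probability vector of a primitive stochastic matrix P."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def dv_objective(pi, P_tilde, P, c):
    """sum_i pi(i) * (c(i) - KL(P_tilde(.|i) || P(.|i))), with 0 log 0 = 0.

    Assumes P_tilde(j|i) > 0 only where P(j|i) > 0.
    """
    total = 0.0
    for i, weight in enumerate(pi):
        kl = sum(pt * math.log(pt / p)
                 for pt, p in zip(P_tilde[i], P[i]) if pt > 0.0)
        total += weight * (c[i] - kl)
    return total

# Hypothetical irreducible nonnegative matrix, split as Q = Gamma * P.
Q = [[0.0, 2.0, 1.0],
     [1.0, 0.0, 3.0],
     [2.0, 1.0, 0.0]]
row_sums = [sum(row) for row in Q]
P = [[q / s for q in row] for row, s in zip(Q, row_sums)]
c = [math.log(s) for s in row_sums]

lam, v = perron_eigenpair(Q)
n = len(Q)
# Twisted kernel built from the Perron eigenvector; its rows sum to one.
P_star = [[Q[i][j] * v[j] / (lam * v[i]) for j in range(n)] for i in range(n)]

value_at_twisted = dv_objective(stationary_distribution(P_star), P_star, P, c)
value_at_original = dv_objective(stationary_distribution(P), P, P, c)
print(math.log(lam), value_at_twisted, value_at_original)
```

The gap between the two printed objective values is the price of not twisting the kernel toward the Perron eigenvector.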
As is well known, the infinite dimensional generalization of the Perron–Frobenius theorem is given by the Krein–Rutman theorem [Krein, Pagter]. There are also nonlinear variants of it. Let
- (1) $\mathcal{X}$ be a Banach space with a ‘positive cone’ $K$ such that $K-K$ is dense in $\mathcal{X}$,
- (2) $T\colon\mathcal{X}\to\mathcal{X}$ be a compact, order preserving (i.e., $f\ge g \implies Tf\ge Tg$), strictly increasing (i.e., $f>g \implies Tf>Tg$), strongly positive (i.e., it maps nonzero elements of $K$ to its interior), positively 1-homogeneous (i.e., $T(af)=aTf$ for all $a>0$) operator.
A nonlinear variant of the Krein–Rutman theorem [Ogiwara] then asserts that, under some technical hypotheses, a unique positive principal eigenvalue and a corresponding unique (up to a scalar multiple) positive eigenvector for $T$ exist.
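Our interest is in the following nonlinear scenario arising in risk-sensitive control: Consider
- a controlled Markov chain $\{X_n\}$ on a compact metric state space $S$;
- an associated control process $\{Z_n\}$ in a compact metric control space $U$;
- a per stage reward function $r\colon S\times U\times S\to\mathbb{R}$ such that $r\in C(S\times U\times S)$;
- a controlled transition kernel $p(dy\mid x,u)$ with full support, such that for all Borel $A\subset S$,
$$P\bigl(X_{n+1}\in A \mid X_m, Z_m,\ m\le n\bigr) = p(A\mid X_n, Z_n). \tag{1}$$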
This is called the controlled Markov property and the controls for which this holds are said to be admissible. The maps
[TABLE]
are assumed to be equicontinuous.
The control problem is to maximize the asymptotic growth rate of the exponential reward:
[TABLE]
The second supremum in this definition is over all admissible controls. We allow relaxed (i.e., probability measure valued) controls taking values in , in which case (1) gets replaced by
[TABLE]
Define
[TABLE]
This is a compact, order preserving, strictly increasing, strongly positive, positively 1-homogeneous operator.
Using the nonlinear variant of the Krein–Rutman theorem stated above, this leads to an abstract Collatz-Wielandt formula [Ananth]:
Theorem 1.
There exist such that and
[TABLE]
Also, is the optimal reward for the risk-sensitive control problem.
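For finite state and action sets, the nonlinear eigenvalue problem behind Theorem 1 can be explored with a normalized 'multiplicative value iteration'. The sketch below assumes the operator has the finite form $(Tf)(i) = \max_u \sum_j p(j\mid i,u)\,e^{r(i,u,j)} f(j)$, which is our reading of the setting rather than a formula quoted from the text, and it uses the Collatz–Wielandt ratios $\min_i (Tf)_i/f_i$ and $\max_i (Tf)_i/f_i$ as lower and upper bounds on the eigenvalue. The transition probabilities `p`, rewards `r`, and function names are hypothetical. Convergence is only expected here because every transition probability is strictly positive.

```python
import math

def bellman_operator(p, r, f):
    """(Tf)(i) = max_u sum_j p(j|i,u) * exp(r(i,u,j)) * f(j)."""
    n = len(p)
    return [max(sum(p[i][u][j] * math.exp(r[i][u][j]) * f[j] for j in range(n))
                for u in range(len(p[i])))
            for i in range(n)]

def nonlinear_eigenpair(p, r, iters=5000, tol=1e-14):
    """Normalized iteration f <- Tf / max(Tf), tracking Collatz-Wielandt bounds."""
    f = [1.0] * len(p)
    lower, upper = 0.0, math.inf
    for _ in range(iters):
        Tf = bellman_operator(p, r, f)
        ratios = [t / fi for t, fi in zip(Tf, f)]
        lower, upper = min(ratios), max(ratios)
        scale = max(Tf)
        f = [t / scale for t in Tf]
        if upper - lower <= tol * upper:
            break
    return 0.5 * (lower + upper), f, (lower, upper)

# Hypothetical model: p[i][u][j] = P(next = j | state = i, action = u),
# r[i][u][j] = reward for the transition i -> j under action u.
p = [[[0.9, 0.1], [0.4, 0.6]],
     [[0.5, 0.5], [0.2, 0.8]]]
r = [[[1.0, 0.0], [0.5, 0.8]],
     [[0.2, 0.3], [0.0, 0.6]]]

rho, f, (lower, upper) = nonlinear_eigenpair(p, r)
print("eigenvalue:", rho, "growth rate:", math.log(rho))
```

Under these assumptions `math.log(rho)` plays the role of the optimal asymptotic growth rate of the exponentiated reward.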
3. Variational Formula
We now state a variational formula for the principal eigenvalue [Ananth]. Let denote the set of probability measures
[TABLE]
which disintegrate as
[TABLE]
such that is invariant under the transition kernel
[TABLE]
These are the so called ‘ergodic occupation measures’ for discrete time control problems.
Theorem 2.
Under the above hypotheses,
[TABLE]
This can be viewed as a controlled version of the Donsker–Varadhan formula. The hypotheses above can be relaxed to:
- (1) Range$(r) = [-\infty,\infty)$ with $e^{r}$ continuous;
- (2) the controlled transition kernel need not have full support.
The formula is then the same as before. The difference is that under the previous, stronger set of conditions, the supremum over the initial state in the definition of the optimal growth rate was redundant, whereas now it is no longer so. The extension proceeds via an approximation argument that approximates the given transition kernel by a sequence of transition kernels for which our original hypotheses hold.
We thus have an equivalent concave maximization problem, in fact a linear program, as opposed to a ‘team’ problem one would obtain from the usual ‘log transformation’ as in, e.g., [Flem]. Furthermore, if denotes the asymptotic growth rate for a randomized Markov control , then it can be shown that , implying the sufficiency of randomized Markov controls.
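The sufficiency of Markov controls can be illustrated in a small finite model by brute force. The sketch below enumerates all deterministic stationary policies, computes the Perron eigenvalue of each policy's matrix $Q_u(i,j) = p(j\mid i,u(i))\,e^{r(i,u(i),j)}$, and then checks that the best policy's eigenpair also solves the nonlinear eigenvalue equation with the maximum over actions. The model data, the function names, and the restriction to deterministic policies with strictly positive transition probabilities are assumptions of this example, not statements from the text.

```python
import itertools
import math

def perron_eigenpair(Q, iters=2000):
    """Power iteration; assumes all entries of Q are positive."""
    n = len(Q)
    x = [1.0] * n
    for _ in range(iters):
        y = [sum(Q[i][j] * x[j] for j in range(n)) for i in range(n)]
        s = max(y)
        x = [yi / s for yi in y]
    y = [sum(Q[i][j] * x[j] for j in range(n)) for i in range(n)]
    return sum(y) / sum(x), x

def policy_matrix(p, r, policy):
    """Q_u(i, j) = p(j | i, u(i)) * exp(r(i, u(i), j)) for a stationary policy u."""
    n = len(p)
    return [[p[i][policy[i]][j] * math.exp(r[i][policy[i]][j]) for j in range(n)]
            for i in range(n)]

def best_stationary_policy(p, r):
    """Enumerate deterministic stationary policies; keep the largest eigenvalue."""
    action_sets = [range(len(p[i])) for i in range(len(p))]
    best = None
    for policy in itertools.product(*action_sets):
        lam, v = perron_eigenpair(policy_matrix(p, r, policy))
        if best is None or lam > best[0]:
            best = (lam, policy, v)
    return best

def bellman_residual(p, r, lam, v):
    """max_i | max_u sum_j p(j|i,u) e^{r(i,u,j)} v(j) - lam * v(i) |."""
    n = len(p)
    Tv = [max(sum(p[i][u][j] * math.exp(r[i][u][j]) * v[j] for j in range(n))
              for u in range(len(p[i])))
          for i in range(n)]
    return max(abs(t - lam * vi) for t, vi in zip(Tv, v))

# Hypothetical two-state, two-action model with full-support transitions.
p = [[[0.9, 0.1], [0.4, 0.6]],
     [[0.5, 0.5], [0.2, 0.8]]]
r = [[[1.0, 0.0], [0.5, 0.8]],
     [[0.2, 0.3], [0.0, 0.6]]]

lam_star, policy_star, v_star = best_stationary_policy(p, r)
residual = bellman_residual(p, r, lam_star, v_star)
print("best policy:", policy_star, "growth rate:", math.log(lam_star))
```

Enumeration is exponential in the number of states, so it serves only as a check on small instances.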
Some applications worth noting are [Ananth]:
- (1) Growth rate of the number of directed paths in a graph. This requires $-\infty$ as a possible reward to account for the absence of edges.
- (2) Portfolio optimization in the framework of [Bielecki].
- (3) Problem of minimizing the exit rate from a domain.
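The first application can be made concrete with a small uncontrolled example. The sketch below uses one possible encoding, which is our assumption and not necessarily the construction in [Ananth]: a uniform transition kernel on $d$ vertices, reward $0$ along edges and $-\infty$ along non-edges, so that the number of directed paths with $n$ edges equals $d^{\,n}$ times the expected exponentiated reward. The growth rate of the path count is then $\log d$ plus the risk-sensitive growth rate. The graph and function names are hypothetical.

```python
import math

def count_paths(adj, n):
    """Exact number of directed paths with n edges, by integer dynamic programming."""
    d = len(adj)
    ending_at = [1] * d  # paths with zero edges, indexed by their end vertex
    for _ in range(n):
        ending_at = [sum(ending_at[i] * adj[i][j] for i in range(d))
                     for j in range(d)]
    return sum(ending_at)

def spectral_radius(Q, iters=2000):
    """Power iteration; assumes Q is nonnegative and primitive."""
    n = len(Q)
    x = [1.0] * n
    for _ in range(iters):
        y = [sum(Q[i][j] * x[j] for j in range(n)) for i in range(n)]
        s = sum(y)
        x = [yi / s for yi in y]
    return sum(sum(Q[i][j] * x[j] for j in range(n)) for i in range(n)) / sum(x)

# Hypothetical graph: vertex 0 has a self-loop and an edge to 1; vertex 1 returns to 0.
adj = [[1, 1],
       [1, 0]]
d = len(adj)

# Reward 0 on edges and -infinity on non-edges; exp(-inf) evaluates to 0.0.
r = [[0.0 if adj[i][j] else -math.inf for j in range(d)] for i in range(d)]
Q = [[(1.0 / d) * math.exp(r[i][j]) for j in range(d)] for i in range(d)]

risk_sensitive_rate = math.log(spectral_radius(Q))
path_growth_rate = math.log(d) + risk_sensitive_rate
empirical_rate = math.log(count_paths(adj, 200)) / 200
print(path_growth_rate, empirical_rate)
```

For this graph the path counts follow the Fibonacci recursion, so the growth rate is the logarithm of the golden ratio.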
4. Reflected diffusions
Analogous results hold for reflected diffusions in a compact domain with smooth boundary. These are described by the stochastic differential equation
[TABLE]
for . Here:
- (1) is an open connected and bounded set with boundary ;
- (2) is a standard -dimensional Wiener process;
- (3) the control lives in a metrizable compact action space and is non-anticipative, i.e., for , is independent of ;
- (4) is continuous, and is Lipschitz uniformly in ;
- (5) is and uniformly non-degenerate;
- (6) where is the unit outward normal on .
In contrast to the preceding section, we first consider the cost minimization problem to highlight the differences with the reward maximization problem. Unlike the classical cost/reward criteria such as discounted and average cost/reward, the risk-sensitive cost and reward problems are not rendered equivalent by a mere sign flip, and the differences are stark. For cost minimization, the control problem is to minimize
[TABLE]
where is continuous.
The corresponding ‘Nisio semigroup’ is defined as follows. For , let
[TABLE]
Then is a semigroup of strongly continuous, bounded Lipschitz, monotone, superadditive, positively 1-homogeneous, strongly positive operators with infinitesimal generator defined by
[TABLE]
Let
[TABLE]
As in the discrete case, the nonlinear Krein–Rutman theorem then leads to: There exists a unique pair satisfying such that
[TABLE]
This solves
[TABLE]
The abstract Collatz-Wielandt formula for this problem is
[TABLE]
In the uncontrolled case, the first formula above is the convex dual of the Donsker–Varadhan formula for the principal eigenvalue of :
[TABLE]
where
[TABLE]
For the risk-sensitive reward problem, the same abstract Collatz–Wielandt formula holds, except that the definition of the operator now has a ‘sup’ in place of the ‘inf’. But as in the discrete time case, one can go a step further and have a variational formulation. Let
[TABLE]
and
[TABLE]
with
[TABLE]
for . Recall the definition of an ‘ergodic occupation measure’ [ABG]. For a stochastic differential equation as in (2), but with the drift replaced with , and taking values in some compact metrizable space, it is the time-$t$ marginal of a stationary state-control process $\bigl(X_{t},v(X_{t}),w(X_{t})\bigr)$, perforce independent of $t$. Thus, in the case the parameter lives in a compact space, by a standard characterization of ergodic occupation measures (ibid.), is precisely the set thereof for controlled diffusions whose (controlled) extended generator is . This however is not necessarily the case if lives in . An example to keep in mind is the one-dimensional stochastic differential equation
[TABLE]
It is straightforward to verify that the standard Gaussian density satisfies the Fokker–Planck equation. However, the diffusion is not even regular, so it does not have an invariant probability measure. Therefore, we refer to as the set of infinitesimal ergodic occupation measures. The variational formula for this model is
[TABLE]
This result is from [AABK].
An analogous abstract Collatz–Wielandt formula for the risk-sensitive cost minimization problem was derived in [ABK]. We have not derived a corresponding variational formula. Even if one were to do so, it is clear that it would be a ‘sup-inf / inf-sup’ formula rather than a pure maximization problem. This is already known through a different route: it forms the basis of the approach initiated by [Flem] and followed by many, in which the Hamilton–Jacobi–Bellman equation for the risk-sensitive cost minimization problem is converted to an Isaacs equation for an ergodic-payoff zero-sum stochastic differential game. The aforementioned expression then is simply the value of this game. Going by pure analogy, for the reward maximization problem, one would expect this route to yield a stochastic team problem wherein the two agents seek to maximize a common payoff, but non-cooperatively, i.e., without either of them having knowledge of the other's decision. What this translates into is that under the corresponding ergodic occupation measure, the two control actions are conditionally independent given the state. The set of such measures is non-convex. What we have achieved instead is a single concave programming problem, which is a significant simplification from the point of view of developing computational schemes for the problem. This also brings to the fore the difference between reward maximization and cost minimization in risk-sensitive control.
5. Diffusions on the whole space
Here we consider a controlled diffusion in of the form
[TABLE]
where
- (1) is a standard -dimensional Brownian motion;
- (2) the control lives in a metrizable compact action space and is non-anticipative, i.e., for , is independent of ;
- (3) is continuous and locally Lipschitz continuous in uniformly in ;
- (4) is locally Lipschitz continuous and locally nondegenerate;
- (5) and have at most affine growth in .
Without loss of generality, we may take to be adapted to the increasing -fields generated by . Then these hypotheses guarantee the existence of a unique weak solution for any admissible control ([ABG], Chapter 2).
As before, we let be a continuous running reward function, which is locally Lipschitz in uniformly in , and is also bounded from above in . We define the optimal risk-sensitive value by
[TABLE]
where the supremum is over all admissible controls.
Consider the extremal operator
[TABLE]
for . The generalized principal eigenvalue of is defined by
[TABLE]
where denotes the local Sobolev space of functions on whose generalized derivatives up to order are in , equipped with its natural semi-norms. We assume that is negative and bounded from above away from zero on the complement of some compact set. This is always satisfied if is an inf-compact function, that is the sublevel sets are compact (or empty) in for each , or if is a positive function vanishing at infinity and the process is recurrent under some stationary Markov control. Then there exists a unique positive normalized as which solves . In other words, the eigenvalue is simple. Let . As shown in [AABK], the function
[TABLE]
is an infinitesimal relative entropy rate.
We let , and use the single variable . Let denote the set of probability measures on the Borel -algebra of , and denote the set of infinitesimal ergodic occupation measures for the operator in (4) defined for , which here can be written as
[TABLE]
where is the class of functions in which have compact support. Recall the definition in Section 4. We also define
[TABLE]
The following is a summary of the main results in [AABK, Section 4].
Theorem 3.
We have
[TABLE]
Suppose that the diffusion matrix is bounded and uniformly elliptic, and either is inf-compact, or has subquadratic growth, or is bounded. Then , and may be replaced by in the variational formula above. If, in addition, is bounded, then
[TABLE]
We continue with the Collatz–Wielandt formula in for the risk-sensitive cost minimization problem. This is studied in [AriAnup]. Here, we have a running cost which is bounded from below in , and is locally Lipschitz in uniformly in . The assumptions on and are as stated in the beginning of the section, except that we may replace the affine growth assumption with the more general condition
[TABLE]
for some constant . The risk-sensitive optimal value is defined by
[TABLE]
The operator here is as in (3) but for , and we let the generalized principal eigenvalue be defined as in (5).
The running cost does not have any structural properties that penalize unstable behavior such as near-monotonicity or inf-compactness, so uniform ergodicity for the controlled process needs to be assumed. Let
[TABLE]
We consider the following hypothesis.
Assumption 1.
The following hold.
- (i) There exists an inf-compact function , and a positive function , satisfying , such that
[TABLE]
for some constant and a compact set .
- (ii) The function is inf-compact for some .
As noted in [ABS-19], the Foster–Lyapunov equation in (6) cannot in general be satisfied for diffusions with bounded and . Therefore, to treat this case, we consider an alternate set of conditions.
Assumption 2.
The following hold.
- (i) There exists a positive function , satisfying , constants and , and a compact set such that
[TABLE]
- (ii) .
Let denote the class of continuous functions that grow slower than , that is, as . We quote the following result from [ABS-19].
Theorem 4.
Grant either Assumption 1, or 2. Then
[TABLE]
where denotes the set of positive functions in .
We should remark here that the class of test functions in the first representation formula in (7) cannot, in general, be enlarged to .
It is also interesting to consider the substitution . Then (7) transforms to
[TABLE]
with
[TABLE]
This underscores the discussion in the last paragraph of section 4.
6. Finite state and action space
For discrete time problems with finite state and action spaces (i.e., in sections 2-3), one can go significantly further for the reward maximization problem. We recall below some results in this context from [CDC].
Consider a controlled Markov chain on with state-dependent action space at state given by:
[TABLE]
where
[TABLE]
This is isomorphic to . Let
[TABLE]
The (controlled) transition probabilities of are
[TABLE]
Define the per stage reward by:
[TABLE]
Let denote the -valued control process. Consider the problem: Maximize the long run average reward
[TABLE]
Define the corresponding ergodic occupation measure by
[TABLE]
where is an invariant probability distribution (not necessarily unique) under the transition kernel
[TABLE]
Let denote the set of such . The above average reward control problem is equivalent to the linear program:
P0 Maximize
[TABLE]
over .
Recall that is specified by linear constraints and its extreme points correspond to stationary Markov policies ([BorkarMC], Chapter V). The maximum will be attained at an extreme point of corresponding to a stationary Markov policy. This LP can be simplified as:
Maximize
[TABLE]
over
[TABLE]
The dual LP is:
Minimize subject to
[TABLE]
The proof goes through finite approximations. Note that the LP has infinitely many constraints. However, it does pave the way for the corresponding dynamic programming principle. The dynamic programming formulation equivalent to the above LP turns out to be as follows:
[TABLE]
for all , where is the Argmax in (). Once again, the proof goes through finite approximations. The maximization over in () can be explicitly performed using the ‘Gibbs variational principle’ from statistical mechanics. For fixed the maximum is attained at
[TABLE]
Substitute back, setting
[TABLE]
and exponentiate both sides of (). This leads to the multiplicative dynamic programming equations for infinite horizon risk-sensitive reward in the general degenerate case:
[TABLE]
for all , where is the Argmax in (). This is the analog of the Howard–Kallenberg results for ergodic or ‘average reward’ control ([Puterman], Chapter 9). Observe the occurrence of the ‘twisted kernel’, which sets it apart from the average reward case.
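The Gibbs variational principle invoked above can be verified numerically in its standard finite form: for a probability vector $p$ and a function $g$, the maximum of $\sum_j q_j g_j - D(q\,\|\,p)$ over probability vectors $q$ equals $\log\sum_j p_j e^{g_j}$ and is attained at $q^*_j \propto p_j e^{g_j}$. This exponential tilting is what produces the 'twisted kernel' when applied row by row to a transition matrix. The following Python sketch uses hypothetical data and names, and compares the closed-form maximizer against randomly drawn competitors.

```python
import math
import random

def gibbs_objective(q, p, g):
    """sum_j q_j * g_j - KL(q || p), with the convention 0 log 0 = 0."""
    return sum(qj * (gj - math.log(qj / pj))
               for qj, pj, gj in zip(q, p, g) if qj > 0.0)

def gibbs_maximizer(p, g):
    """Return the tilted distribution q*_j ~ p_j * exp(g_j) and log sum_j p_j e^{g_j}."""
    weights = [pj * math.exp(gj) for pj, gj in zip(p, g)]
    z = sum(weights)
    return [w / z for w in weights], math.log(z)

# Hypothetical reference distribution and 'reward-to-go' values.
p = [0.5, 0.3, 0.2]
g = [0.1, -0.4, 1.2]
q_star, log_z = gibbs_maximizer(p, g)

# Random competitors drawn with a fixed seed for reproducibility.
rng = random.Random(0)
trial_values = []
for _ in range(1000):
    raw = [rng.random() for _ in p]
    total = sum(raw)
    trial_values.append(gibbs_objective([x / total for x in raw], p, g))

print(gibbs_objective(q_star, p, g), log_z, max(trial_values))
```

Exponentiating the optimal value turns the additive equation into a multiplicative one, which is the step taken in the derivation above.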
7. Future directions
There are several directions left uncharted in this broad problem area. Some of them are listed below.
- (1) There are some in-between cases that need to be analyzed, e.g., controlled Markov chains with countably infinite state space. Under the strong ‘Doeblin condition’, the abstract Collatz–Wielandt formula has been derived for these in [Cavazos]. This needs to be extended to more general cases.
- (2) The counterpart of the dynamic programming equations derived for reducible risk-sensitive reward processes can also be expected to hold for risk-sensitive cost problems and is yet to be established.
- (3) Concrete computational schemes based on approximate concave maximization problems are another direction worth pursuing.
Acknowledgements
The work of A.A. was supported in part by the National Science Foundation through grant DMS-1715210, and in part by the Army Research Office through grant W911NF-17-1-001. The work of V.S.B. was supported by a J. C. Bose Fellowship from the Government of India.
References
