‘Controlled’ versions of the Collatz–Wielandt and Donsker–Varadhan formulae

Ari Arapostathis; Vivek S. Borkar

arXiv:1903.10714·math.OC·March 27, 2019


TL;DR

This paper reviews how risk-sensitive costs and rewards can be characterized using abstract Collatz-Wielandt and Donsker-Varadhan formulas, providing linear and dynamic programming tools for finite state-action systems.

Contribution

It introduces controlled versions of these formulas, enabling new linear and dynamic programming approaches for risk-sensitive decision-making in finite systems.

Findings

01. Provides a unified framework for risk-sensitive costs and rewards

02. Derives linear programming formulations for finite state-action systems

03. Introduces a controlled Donsker–Varadhan formula for rewards

Abstract

This is an overview of the work of the authors and their collaborators on the characterization of risk sensitive costs and rewards in terms of an abstract Collatz-Wielandt formula and in case of rewards, also a controlled version of the Donsker-Varadhan formula. For the finite state and action case, this leads to useful linear and dynamic programming formulations in the reducible case.

Equations

$$\lambda\,=\,\max_{0\neq x\in\mathbb{R}^{d}}\,\frac{x^{\mathsf{T}}Ax}{x^{\mathsf{T}}x}\,.$$

$$\lambda\,=\,\inf_{x=[x_{1},\dots,x_{d}]^{\mathsf{T}},\;x_{i}>0\ \forall i}\;\max_{i\colon x_{i}>0}\,\biggl(\frac{(Qx)_{i}}{x_{i}}\biggr)\,.$$

$Q$

$\varGamma$

$p(i,j)$

$P$

$${\mathscr{G}}_{0}\,\coloneqq\,\bigl\{(\pi,\tilde{P})\colon\pi\text{\ is a stationary probability for the stochastic matrix\ }\tilde{P}=\bm{[}\tilde{p}(j|i)\bm{]}\bigr\}\,.$$

$$\log\lambda\,=\,\sup_{(\pi,\tilde{P})\,\in\,{\mathscr{G}}_{0}}\left(\sum_{i}\pi(i)\bigl[\kappa_{i}-D\bigl(\tilde{p}(\cdot\,|\,i)\|\,p(\cdot\,|\,i)\bigr)\bigr]\right),$$

$$P(X_{n+1}\in A\mid X_{m},Z_{m},\,m\leq n)\,=\,p(A\mid X_{n},Z_{n})\,.$$

$$(x,u)\,\mapsto\,\int f(y)\,p(\mathrm{d}y\mid x,u)\,,\quad f\in C(S)\,,\ \|f\|\leq 1\,,$$

$$\lambda\,\coloneqq\,\sup_{x\in S}\,\sup_{\{Z_{m}\}}\,\liminf_{N\uparrow\infty}\frac{1}{N}\log\operatorname{\mathrm{E}}\left[\mathrm{e}^{\sum_{m=0}^{N-1}r(X_{m},Z_{m},X_{m+1})}\Bigm|X_{0}=x\right].$$

$$P(X_{n+1}\in A\mid X_{m},\mu_{m},\,m\leq n)\,=\,\int p(A\mid X_{n},z)\,\mu_{n}(\mathrm{d}z)\,,\quad n\geq 0\,.$$

$$Tf(x)\,\coloneqq\,\sup_{\phi\colon S\mapsto{\mathcal{P}}(U)\ \text{measurable}}\ \iint p(\mathrm{d}y\mid x,u)\,\phi(\mathrm{d}u\mid x)\,\mathrm{e}^{r(x,u,y)}f(y)\,.$$

$$\rho\,=\,\sup_{f\in\mathrm{int}(C^{+}(S))}\ \inf_{\mu\in\mathcal{M}^{+}(S)}\ \frac{\int Tf\,\mathrm{d}\mu}{\int f\,\mathrm{d}\mu}\,.$$

$$\eta(\mathrm{d}x,\mathrm{d}u,\mathrm{d}y)\,\in\,{\mathcal{P}}(S\times U\times S)$$

$$\eta(\mathrm{d}x,\mathrm{d}u,\mathrm{d}y)\,=\,\eta_{0}(\mathrm{d}x)\,\eta_{1}(\mathrm{d}u\mid x)\,\eta_{2}(\mathrm{d}y\mid x,u)\,,$$

$$\int_{U}\eta_{2}(\mathrm{d}y\mid x,u)\,\eta_{1}(\mathrm{d}u\mid x)\,.$$

$$\log\rho\,=\,\sup_{\eta\in{\mathscr{G}}}\biggl(\iint\eta_{0}(\mathrm{d}x)\eta_{1}(\mathrm{d}u\,|\,x)\biggl[\int r(x,u,y)\eta_{2}(\mathrm{d}y\,|\,x,u)-D\bigl(\eta_{2}(\mathrm{d}y\,|\,x,u)\|\,p(\mathrm{d}y\,|\,x,u)\bigr)\biggr]\biggr).$$

$\mathrm{d}X(t)$

$\mathrm{d}\xi(t)$

$$\lim_{t\uparrow\infty}\,\frac{1}{t}\,\log\operatorname{\mathrm{E}}\Bigl[\mathrm{e}^{\int_{0}^{t}r(X_{s},U_{s})\,\mathrm{d}s}\Bigr]\,,$$

$$S_{t}f(x)\,\coloneqq\,\inf_{\{U_{t}\}_{t\geq 0}}\,\operatorname{\mathrm{E}}_{x}\Bigl[\mathrm{e}^{\int_{0}^{t}r(X_{s},U_{s})\,\mathrm{d}s}f(X_{t})\Bigr]\,.$$

$${\mathcal{G}}f(x)\,\coloneqq\,\frac{1}{2}\operatorname{tr}\left(\upsigma(x)\upsigma^{\mathsf{T}}(x)\nabla^{2}f(x)\right)+\min_{u\in\mathbb{U}}\,\Bigl[\langle b(x,u)\,,\nabla f(x)\rangle+r(x,u)f(x)\Bigr]\,.$$

$$C^{2}_{\gamma,+}(\bar{Q})\,\coloneqq\,\bigl\{f\colon\bar{Q}\mapsto[0,\infty)\colon f\in C^{2}(\bar{Q}),\ \langle\nabla f(x),\gamma(x)\rangle=0\text{\ for\ }x\in\partial Q\bigr\}\,.$$

$$S_{t}\varphi\,=\,\mathrm{e}^{\rho t}\varphi\,.$$

$${\mathcal{G}}\varphi(x)\,=\,\rho\,\varphi(x)\,,\ x\in Q\,,\quad\text{and}\quad\langle\nabla\varphi(x),\gamma(x)\rangle=0\,,\ x\in\partial Q\,.$$

$$\rho\,=\,\sup_{f\in C^{2}_{\gamma,+}(\bar{Q}),\;f>0}\ \inf_{\nu\in{\mathcal{P}}(\bar{Q})}\ \int_{\bar{Q}}\frac{{\mathcal{G}}f}{f}\,\mathrm{d}\nu\,.$$

$$\rho\,=\,\sup_{\nu\in{\mathcal{P}}(\bar{Q})}\,\biggl(\int_{\bar{Q}}r(x)\,\nu(\mathrm{d}x)-I(\nu)\biggr)\,,$$

$$I(\nu)\,\coloneqq\,\inf_{f\in C^{2}_{\gamma,+}(\bar{Q}),\;f>0}\ \int_{\bar{Q}}\Bigl(\frac{{\mathcal{G}}f}{f}\Bigr)\,\mathrm{d}\nu\,.$$

$$R(x,u,w)\,\coloneqq\,r(x,u)-\frac{1}{2}\lvert\upsigma^{\mathsf{T}}(x)w\rvert^{2}\,,\quad(x,u,w)\in\bar{Q}\times\mathbb{U}\times\mathbb{R}^{d}\,,$$


Full text

‘Controlled’ versions of the Collatz–Wielandt and Donsker–Varadhan formulae

Ari Arapostathis∗

∗Department of Electrical and Computer Engineering, The University of Texas at Austin, EER 7.824, Austin, TX 78712

and

Vivek S. Borkar‡

‡Department of Electrical Engineering, Indian Institute of Technology, Powai, Mumbai 400076, India


Key words and phrases:

Risk-sensitive control, risk-sensitive criterion, Donsker–Varadhan functional, Collatz–Wielandt formula, principal eigenvalue

2000 Mathematics Subject Classification:

Primary 60J60, Secondary 60J25, 35K59, 35P15, 60F10

1. Introduction

This short article is an overview of the work of the authors and their collaborators on a somewhat novel perspective on the risk-sensitive control problem on an infinite time horizon, which aims to optimize the asymptotic growth rate of a mean exponentiated total reward (respectively, cost). The viewpoint taken here is based on the fact that the dynamic programming principle for this problem essentially reduces it to an eigenvalue problem seeking the principal eigenvalue and eigenvector of a monotone, positively $1$-homogeneous operator. This allows us to exploit the existing generalized Perron–Frobenius (or Krein–Rutman) theory, which leads to some explicit expressions for the optimal growth rate. The first is the abstract Collatz–Wielandt formula, which can be shown to hold for both cost minimization and reward maximization problems, though we have not exhausted all the cases in our work. The second is a variational formula for the principal eigenvalue that generalizes the Donsker–Varadhan formula for the same in the linear case. This seems workable only for the reward maximization problem.

We first consider the discrete time case based on the results of [Ananth] in the next two sections, followed by those for reflected diffusions in a bounded domain, based on [ABK], in section 4. We then sketch, in section 5, the very recent and highly nontrivial extensions to diffusions on the whole space developed in [AriAnup] and [AABK]. Finally, we recall in section 6 some developments in the simple finite state-action setup from [CDC], where the aforementioned development allows us to derive the dynamic programming equations for risk-sensitive reward processes in the reducible case. Section 7 concludes by highlighting some future directions.

2. Discrete time problems

The celebrated Courant–Fisher formula for the principal eigenvalue of a positive definite symmetric matrix $A\in\mathbb{R}^{d\times d}$ is

$$\lambda\,=\,\max_{0\neq x\in\mathbb{R}^{d}}\,\frac{x^{\mathsf{T}}Ax}{x^{\mathsf{T}}x}\,.$$

Consider an irreducible nonnegative matrix $Q\in\mathbb{R}^{d\times d}$ . The Perron–Frobenius theorem guarantees a positive principal eigenvalue with an associated positive eigenvector for $Q$ . Is there a counterpart of the Courant–Fisher formula for this eigenvalue?

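The answer is a resounding ‘YES’! It is the Collatz–Wielandt formula for the principal eigenvalue of an irreducible nonnegative matrix $Q=\bm{[}q(i,j)\bm{]}\in\mathbb{R}^{d\times d}$, stated as (see [Meyer], Chapter 8):

$$\lambda\,=\,\inf_{x=[x_{1},\dots,x_{d}]^{\mathsf{T}},\;x_{i}>0\ \forall i}\;\max_{i\colon x_{i}>0}\,\biggl(\frac{(Qx)_{i}}{x_{i}}\biggr)\,.$$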
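The Collatz–Wielandt characterization is easy to probe numerically: for an irreducible nonnegative $Q$ and any strictly positive vector $x$, the principal eigenvalue lies between $\min_i (Qx)_i/x_i$ and $\max_i (Qx)_i/x_i$, and the two bounds meet at the Perron eigenvector. The following is a minimal sketch in plain Python; the matrix, the test vector and the function names are hypothetical and chosen only for illustration.

```python
def collatz_wielandt_bounds(Q, x):
    """Return (min_i (Qx)_i/x_i, max_i (Qx)_i/x_i) for a strictly positive vector x."""
    Qx = [sum(q * xj for q, xj in zip(row, x)) for row in Q]
    ratios = [qx / xi for qx, xi in zip(Qx, x)]
    return min(ratios), max(ratios)


def perron_eigenvalue(Q, iters=500):
    """Power iteration; at the limit vector the two bounds above coincide."""
    x = [1.0] * len(Q)
    for _ in range(iters):
        Qx = [sum(q * xj for q, xj in zip(row, x)) for row in Q]
        total = sum(Qx)
        x = [v / total for v in Qx]
    lower, upper = collatz_wielandt_bounds(Q, x)
    return 0.5 * (lower + upper), x


# Hypothetical irreducible (in fact primitive) nonnegative matrix.
Q = [[0.0, 2.0, 1.0],
     [1.0, 0.0, 3.0],
     [2.0, 1.0, 1.0]]

lam, v = perron_eigenvalue(Q)
lo, hi = collatz_wielandt_bounds(Q, [1.0, 2.0, 3.0])  # arbitrary positive test vector
print(lo, lam, hi)  # lo <= lam <= hi
```

Taking $x$ to be the vector of ones recovers the familiar bound of the principal eigenvalue by the smallest and largest row sums.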

An alternative characterization can be given as follows. Write

$$Q\,=\,\varGamma P$$

with $P$ a stochastic matrix. In other words, we have pulled out the row sums $\{\kappa_{i}\}$ of $Q$ into a diagonal matrix $\varGamma$ so that what is left is a stochastic matrix $P$ . Also define

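$${\mathscr{G}}_{0}\,\coloneqq\,\bigl\{(\pi,\tilde{P})\colon\pi\text{\ is a stationary probability for the stochastic matrix\ }\tilde{P}=\bm{[}\tilde{p}(j|i)\bm{]}\bigr\}\,.$$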

Then the following representation holds [Dembo]:

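$$\log\lambda\,=\,\sup_{(\pi,\tilde{P})\,\in\,{\mathscr{G}}_{0}}\left(\sum_{i}\pi(i)\bigl[\kappa_{i}-D\bigl(\tilde{p}(\cdot\,|\,i)\|\,p(\cdot\,|\,i)\bigr)\bigr]\right),$$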

where $D(\cdot\,\|\,\cdot)$ denotes the Kullback–Leibler divergence or relative entropy. This is the finite state counterpart of the Donsker–Varadhan formula [DoVa] for the principal eigenvalue of a nonnegative matrix.
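A small numerical check of the finite state Donsker–Varadhan representation can make it concrete. The sketch below is an illustration under stated assumptions, not the paper's construction: it uses a hypothetical strictly positive matrix $Q$, writes $c_i$ for its $i$-th row sum, sets $p(j|i)=q(i,j)/c_i$, and takes the per-state reward to be $\log c_i$ (the reading of $\kappa_i$ under which both sides of the identity are logarithmic). The supremum is evaluated at the ‘twisted’ kernel $\tilde p(j|i)=q(i,j)v_j/(\lambda v_i)$ built from the Perron eigenvector $v$, which is the standard maximizer.

```python
import math


def matvec(M, x):
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]


def right_perron(Q, iters=500):
    """Principal eigenvalue and positive right eigenvector by power iteration."""
    v = [1.0] * len(Q)
    for _ in range(iters):
        w = matvec(Q, v)
        s = sum(w)
        v = [wi / s for wi in w]
    return sum(matvec(Q, v)) / sum(v), v


def stationary(P, iters=2000):
    """Stationary distribution pi = pi P of a positive stochastic matrix."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def dv_objective(Q, P_tilde):
    """sum_i pi(i) [ log c_i - D(p~(.|i) || p(.|i)) ], with c_i the i-th row sum of Q."""
    n = len(Q)
    pi = stationary(P_tilde)
    value = 0.0
    for i in range(n):
        c = sum(Q[i])
        kl = sum(P_tilde[i][j] * math.log(P_tilde[i][j] / (Q[i][j] / c))
                 for j in range(n) if P_tilde[i][j] > 0.0)
        value += pi[i] * (math.log(c) - kl)
    return value


# Hypothetical strictly positive matrix, for illustration only.
Q = [[1.0, 2.0, 1.0],
     [1.0, 0.5, 3.0],
     [2.0, 1.0, 1.0]]
n = len(Q)
lam, v = right_perron(Q)

P = [[Q[i][j] / sum(Q[i]) for j in range(n)] for i in range(n)]                  # p(j|i)
P_star = [[Q[i][j] * v[j] / (lam * v[i]) for j in range(n)] for i in range(n)]   # twisted kernel

print(math.log(lam), dv_objective(Q, P_star), dv_objective(Q, P))
```

The first two printed numbers should agree up to rounding, while the untwisted choice $\tilde P = P$ gives only a lower bound on $\log\lambda$.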

As is well known, the infinite dimensional generalization of the Perron–Frobenius theorem is given by the Krein–Rutman theorem [Krein, Pagter]. There are also nonlinear variants of it. Let

(1) $B$ be a Banach space with a ‘positive cone’ $K$ such that $K-K$ is dense in $B$,

(2) $T\colon B\mapsto B$ be a compact, order preserving (i.e., $f\geq g\Longrightarrow Tf\geq Tg$), strictly increasing (i.e., $f>g\Longrightarrow Tf>Tg$), strongly positive (i.e., maps nonzero elements of $K$ to its interior), positively $1$-homogeneous (i.e., $T(af)=aTf$ for all $a>0$) operator.

A nonlinear variant of the Krein–Rutman theorem [Ogiwara] then asserts that under some technical hypotheses, a unique positive principal eigenvalue and a corresponding unique (up to a scalar multiple) positive eigenvector for $T$ exist.

Our interest is in the following nonlinear scenario arising in risk-sensitive control: Consider

• a controlled Markov chain $\{X_{n}\}$ on a compact metric state space $S$;

• an associated control process $\{Z_{n}\}$ in a compact metric control space $U$;

• a per stage reward function $r\colon S\times U\times S\mapsto\mathbb{R}$ such that $r\in C(S\times U\times S)$;

• a controlled transition kernel $p(\mathrm{d}y\,|\,x,u)$ with full support, such that for all Borel $A\subset S$,

$$P(X_{n+1}\in A\mid X_{m},Z_{m},\,m\leq n)\,=\,p(A\mid X_{n},Z_{n})\,.\tag{1}$$

This is called the controlled Markov property and the controls for which this holds are said to be admissible. The maps

$$(x,u)\,\mapsto\,\int f(y)\,p(\mathrm{d}y\mid x,u)\,,\quad f\in C(S)\,,\ \|f\|\leq 1\,,$$

are assumed to be equicontinuous.

The control problem is to maximize the asymptotic growth rate of the exponential reward:

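$$\lambda\,\coloneqq\,\sup_{x\in S}\,\sup_{\{Z_{m}\}}\,\liminf_{N\uparrow\infty}\frac{1}{N}\log\operatorname{\mathrm{E}}\left[\mathrm{e}^{\sum_{m=0}^{N-1}r(X_{m},Z_{m},X_{m+1})}\Bigm|X_{0}=x\right].$$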

The second supremum in this definition is over all admissible controls. We allow relaxed (i.e., probability measure valued) controls $\{\mu_{n}\}$ taking values in ${\mathcal{P}}(U)$, in which case (1) gets replaced by

$$P(X_{n+1}\in A\mid X_{m},\mu_{m},\,m\leq n)\,=\,\int p(A\mid X_{n},z)\,\mu_{n}(\mathrm{d}z)\,,\quad n\geq 0\,.$$

Define

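$$Tf(x)\,\coloneqq\,\sup_{\phi\colon S\mapsto{\mathcal{P}}(U)\ \text{measurable}}\ \iint p(\mathrm{d}y\mid x,u)\,\phi(\mathrm{d}u\mid x)\,\mathrm{e}^{r(x,u,y)}f(y)\,.$$

This is a compact, order preserving, strictly increasing, strongly positive, positively $1$-homogeneous operator.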

Using the nonlinear variant of the Krein–Rutman theorem stated above, this leads to an abstract Collatz-Wielandt formula [Ananth]:

Theorem 1.

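There exist $\rho>0$ and $\psi\in\mathrm{int}(C^{+}(S))$ such that $T\psi=\rho\psi$ and

$$\rho\,=\,\sup_{f\in\mathrm{int}(C^{+}(S))}\ \inf_{\mu\in\mathcal{M}^{+}(S)}\ \frac{\int Tf\,\mathrm{d}\mu}{\int f\,\mathrm{d}\mu}\,.$$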

Also, $\log\rho$ is the optimal reward for the risk-sensitive control problem.
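For finite $S$ and $U$ the operator $T$ reduces to $(Tf)(i)=\max_{u}\sum_{j}p(j\,|\,i,u)\,\mathrm{e}^{r(i,u,j)}f(j)$, since the supremum over randomized $\phi$ is attained at a point mass. The sketch below, with hypothetical data and function names, runs the normalized iteration $f\leftarrow Tf/\|Tf\|$ and reports the Collatz–Wielandt brackets $\min_i Tf(i)/f(i)\le\rho\le\max_i Tf(i)/f(i)$. It assumes strictly positive kernels, mirroring the full support hypothesis, and is meant as an illustration rather than as the method of [Ananth].

```python
import math


def apply_T(P, r, f):
    """(Tf)(i) = max_u sum_j p(j|i,u) exp(r(i,u,j)) f(j) for finite S and U."""
    n = len(f)
    return [max(sum(P[u][i][j] * math.exp(r[u][i][j]) * f[j] for j in range(n))
                for u in range(len(P)))
            for i in range(n)]


def principal_eigenpair(P, r, iters=300):
    """Normalized iteration f <- Tf / ||Tf||, returning brackets on rho and the iterate."""
    f = [1.0] * len(P[0])
    for _ in range(iters):
        Tf = apply_T(P, r, f)
        s = sum(Tf)
        f = [t / s for t in Tf]
    ratios = [t / fi for t, fi in zip(apply_T(P, r, f), f)]
    return min(ratios), max(ratios), f


# Hypothetical data: 3 states, 2 actions, strictly positive kernels (full support).
P = [
    [[0.5, 0.3, 0.2], [0.2, 0.6, 0.2], [0.3, 0.3, 0.4]],   # p(.|i, action 0)
    [[0.1, 0.1, 0.8], [0.4, 0.4, 0.2], [0.6, 0.2, 0.2]],   # p(.|i, action 1)
]
r = [
    [[0.2, 0.0, -0.1], [0.1, 0.3, 0.0], [0.0, -0.2, 0.1]],  # r(i, 0, j)
    [[-0.3, 0.4, 0.2], [0.0, 0.1, 0.5], [0.3, 0.0, -0.1]],  # r(i, 1, j)
]

lower, upper, psi = principal_eigenpair(P, r)
print("log rho lies in", (math.log(lower), math.log(upper)))
```

Restricting to a single action, e.g. `principal_eigenpair([P[0]], [r[0]])`, gives the Perron eigenvalue of the corresponding linear operator, which can be no larger than $\rho$ because $T$ dominates it pointwise.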

3. Variational Formula

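We now state a variational formula for the principal eigenvalue [Ananth]. Let ${\mathscr{G}}$ denote the set of probability measures

$$\eta(\mathrm{d}x,\mathrm{d}u,\mathrm{d}y)\,\in\,{\mathcal{P}}(S\times U\times S)$$

which disintegrate as

$$\eta(\mathrm{d}x,\mathrm{d}u,\mathrm{d}y)\,=\,\eta_{0}(\mathrm{d}x)\,\eta_{1}(\mathrm{d}u\mid x)\,\eta_{2}(\mathrm{d}y\mid x,u)\,,$$

such that $\eta_{0}$ is invariant under the transition kernel

$$\int_{U}\eta_{2}(\mathrm{d}y\mid x,u)\,\eta_{1}(\mathrm{d}u\mid x)\,.$$

These are the so-called ‘ergodic occupation measures’ for discrete time control problems.

Theorem 2.

Under the above hypotheses,

$$\log\rho\,=\,\sup_{\eta\in{\mathscr{G}}}\biggl(\iint\eta_{0}(\mathrm{d}x)\eta_{1}(\mathrm{d}u\,|\,x)\biggl[\int r(x,u,y)\eta_{2}(\mathrm{d}y\,|\,x,u)-D\bigl(\eta_{2}(\mathrm{d}y\,|\,x,u)\|\,p(\mathrm{d}y\,|\,x,u)\bigr)\biggr]\biggr).$$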

This can be viewed as a controlled version of the Donsker–Varadhan formula. The hypotheses above can be relaxed to:

(1) Range$(r)=[-\infty,\infty)$ with $\mathrm{e}^{r}\in C(S\times U\times S)$;

(2) $p(\mathrm{d}y\,|\,x,u)$ need not have full support.

The formula then is the same as before. The difference is that under the previous, stronger set of conditions, the supremum over $x\in S$ in the definition of $\lambda$ was redundant, whereas it no longer is. The extension proceeds via an approximation argument in which the given transition kernel is approximated by a sequence of transition kernels for which our original hypotheses hold.

We thus have an equivalent concave maximization problem, in fact a linear program, as opposed to a ‘team’ problem one would obtain from the usual ‘log transformation’ as in, e.g., [Flem]. Furthermore, if $\rho(\varphi)$ denotes the asymptotic growth rate for a randomized Markov control $\varphi$ , then it can be shown that $\rho=\max_{\varphi}\rho(\varphi)$ , implying the sufficiency of randomized Markov controls.
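For finite $S$ and $U$ the variational formula of Theorem 2 can be sanity-checked directly. The sketch below, on hypothetical data, first solves $T\psi=\rho\psi$ by a normalized iteration, reads off a maximizing action $u_i$ in each state, and then evaluates the objective at the measure with $\eta_{1}=\delta_{u_i}$ and twisted kernel $\eta_{2}(j\,|\,i)=p(j\,|\,i,u_i)\,\mathrm{e}^{r(i,u_i,j)}\psi_j/(\rho\,\psi_i)$. That this choice attains the supremum is a standard computation assumed here for illustration; all function names are introduced only for this example.

```python
import math

# Hypothetical finite model: 3 states, 2 actions, strictly positive kernels.
P = [
    [[0.5, 0.3, 0.2], [0.2, 0.6, 0.2], [0.3, 0.3, 0.4]],
    [[0.1, 0.1, 0.8], [0.4, 0.4, 0.2], [0.6, 0.2, 0.2]],
]
r = [
    [[0.2, 0.0, -0.1], [0.1, 0.3, 0.0], [0.0, -0.2, 0.1]],
    [[-0.3, 0.4, 0.2], [0.0, 0.1, 0.5], [0.3, 0.0, -0.1]],
]
n, m = 3, 2


def q(u, i, j):
    """Kernel of the multiplicative operator: p(j|i,u) * exp(r(i,u,j))."""
    return P[u][i][j] * math.exp(r[u][i][j])


def solve_eigenpair(iters=300):
    """Normalized iteration for T psi = rho psi, plus a maximizing action per state."""
    psi = [1.0] * n
    for _ in range(iters):
        Tpsi = [max(sum(q(u, i, j) * psi[j] for j in range(n)) for u in range(m))
                for i in range(n)]
        s = sum(Tpsi)
        psi = [t / s for t in Tpsi]
    vals = [[sum(q(u, i, j) * psi[j] for j in range(n)) for u in range(m)]
            for i in range(n)]
    policy = [max(range(m), key=lambda u: vals[i][u]) for i in range(n)]
    rho = sum(vals[i][policy[i]] for i in range(n)) / sum(psi)
    return rho, psi, policy


def objective(policy, eta2, iters=2000):
    """sum_i eta0(i) [ sum_j eta2(j|i) r(i,u_i,j) - D(eta2(.|i) || p(.|i,u_i)) ]."""
    eta0 = [1.0 / n] * n
    for _ in range(iters):  # eta0 converges to the invariant law of the kernel eta2
        eta0 = [sum(eta0[i] * eta2[i][j] for i in range(n)) for j in range(n)]
    total = 0.0
    for i in range(n):
        u = policy[i]
        for j in range(n):
            if eta2[i][j] > 0.0:
                total += eta0[i] * eta2[i][j] * (
                    r[u][i][j] - math.log(eta2[i][j] / P[u][i][j]))
    return total


rho, psi, policy = solve_eigenpair()
twisted = [[q(policy[i], i, j) * psi[j] / (rho * psi[i]) for j in range(n)]
           for i in range(n)]
untwisted = [P[policy[i]][i] for i in range(n)]  # eta2 = p: ordinary average reward
print(math.log(rho), objective(policy, twisted), objective(policy, untwisted))
```

The untwisted choice gives the ordinary long run average reward of the same policy, which is a lower bound on $\log\rho$; the gap is what the relative entropy term buys.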

Some applications worth noting are [Ananth]:

(1) Growth rate of the number of directed paths in a graph. This requires $-\infty$ as a possible reward to account for the absence of edges.

(2) Portfolio optimization in the framework of [Bielecki].

(3) Problem of minimizing the exit rate from a domain.
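The first application can be illustrated in the simplest uncontrolled setting: with $\mathrm{e}^{r}$ equal to the 0/1 adjacency matrix (so $r=-\infty$ on missing edges), the number of directed paths of length $N$ grows like $\lambda^{N}$, where $\lambda$ is the principal eigenvalue of the adjacency matrix. The following sketch uses a hypothetical two-vertex graph whose path counts are Fibonacci numbers; it is an illustration only and omits the control.

```python
import math


def count_paths(adj, length):
    """Number of directed paths with `length` edges, via repeated vector-matrix products."""
    n = len(adj)
    counts = [1] * n  # counts[j]: number of paths of the current length ending at vertex j
    for _ in range(length):
        counts = [sum(counts[i] * adj[i][j] for i in range(n)) for j in range(n)]
    return sum(counts)


# Hypothetical graph on two vertices with edges 0->0, 0->1, 1->0 (no edge 1->1).
adj = [[1, 1],
       [1, 0]]

N = 60
growth = math.log(count_paths(adj, N + 1)) - math.log(count_paths(adj, N))
golden = (1 + math.sqrt(5)) / 2  # principal eigenvalue of adj
print(growth, math.log(golden))
```

Using the log-ratio of consecutive counts rather than $\frac{1}{N}\log$ of a single count is a design choice made here because it converges much faster for small $N$.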

4. Reflected diffusions

Analogous results hold for reflected diffusions in a compact domain with smooth boundary. These are described by the stochastic differential equation

[TABLE]

for $t\geq 0$ . Here:

(1) $Q$ is an open connected and bounded set with $C^{3}$ boundary $\partial Q$;

(2) $\{W_{t}\}_{t\geq 0}$ is a standard $d$-dimensional Wiener process;

(3) the control $\{U_{t}\}_{t\geq 0}$ lives in a metrizable compact action space $\mathbb{U}$ and is non-anticipative, i.e., for $t>s$, $W(t)-W(s)$ is independent of $X_{0};W_{y},U_{y},y\leq s$;

(4) $b$ is continuous, and $x\mapsto b(x,u)$ is Lipschitz uniformly in $u$;

(5) $\upsigma$ is $C^{1,\beta_{0}}$ and uniformly non-degenerate;

(6) $\gamma_{i}(x)=\upsigma(x)\upsigma(x)^{\mathsf{T}}\eta(x)$ where $\eta(x)$ is the unit outward normal on $\partial Q$.

In contrast to the preceding section, we first consider the cost minimization problem to highlight the differences with the reward maximization problem. Unlike the classical cost/reward criteria such as discounted and average cost/reward, the risk-sensitive cost and reward problems are not rendered equivalent by a mere sign flip, and the differences are stark. For cost minimization, the control problem is to minimize

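$$\lim_{t\uparrow\infty}\,\frac{1}{t}\,\log\operatorname{\mathrm{E}}\Bigl[\mathrm{e}^{\int_{0}^{t}r(X_{s},U_{s})\,\mathrm{d}s}\Bigr]\,,$$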

where $r$ is continuous.

The corresponding ‘Nisio semigroup’ is defined as follows. For $t\geq 0$ , let

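$$S_{t}f(x)\,\coloneqq\,\inf_{\{U_{t}\}_{t\geq 0}}\,\operatorname{\mathrm{E}}_{x}\Bigl[\mathrm{e}^{\int_{0}^{t}r(X_{s},U_{s})\,\mathrm{d}s}f(X_{t})\Bigr]\,.$$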

Then $S_{t}\colon C(\bar{Q})\mapsto C(\bar{Q})$ is a semigroup of strongly continuous, bounded Lipschitz, monotone, superadditive, positively 1-homogeneous, strongly positive operators with infinitesimal generator ${\mathcal{G}}$ defined by

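$${\mathcal{G}}f(x)\,\coloneqq\,\frac{1}{2}\operatorname{tr}\left(\upsigma(x)\upsigma^{\mathsf{T}}(x)\nabla^{2}f(x)\right)+\min_{u\in\mathbb{U}}\,\Bigl[\langle b(x,u)\,,\nabla f(x)\rangle+r(x,u)f(x)\Bigr]\,.\tag{3}$$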

Let

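$$C^{2}_{\gamma,+}(\bar{Q})\,\coloneqq\,\bigl\{f\colon\bar{Q}\mapsto[0,\infty)\colon f\in C^{2}(\bar{Q}),\ \langle\nabla f(x),\gamma(x)\rangle=0\text{\ for\ }x\in\partial Q\bigr\}\,.$$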

As in the discrete case, the nonlinear Krein–Rutman theorem then leads to: There exists a unique pair $(\rho,\varphi)\in\mathbb{R}\times C^{2}_{\gamma,+}(\bar{Q})$ satisfying $\|\varphi\|_{0,\bar{Q}}=1$ such that

$$S_{t}\varphi\,=\,\mathrm{e}^{\rho t}\varphi\,.$$

This solves

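$${\mathcal{G}}\varphi(x)\,=\,\rho\,\varphi(x)\,,\ x\in Q\,,\quad\text{and}\quad\langle\nabla\varphi(x),\gamma(x)\rangle=0\,,\ x\in\partial Q\,.$$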

The abstract Collatz-Wielandt formula for this problem is

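$$\rho\,=\,\sup_{f\in C^{2}_{\gamma,+}(\bar{Q}),\;f>0}\ \inf_{\nu\in{\mathcal{P}}(\bar{Q})}\ \int_{\bar{Q}}\frac{{\mathcal{G}}f}{f}\,\mathrm{d}\nu\,.$$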

In the uncontrolled case, the first formula above is the convex dual of the Donsker–Varadhan formula for the principal eigenvalue of ${\mathcal{G}}$ :

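$$\rho\,=\,\sup_{\nu\in{\mathcal{P}}(\bar{Q})}\,\biggl(\int_{\bar{Q}}r(x)\,\nu(\mathrm{d}x)-I(\nu)\biggr)\,,$$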

where

[TABLE]

For the risk-sensitive reward problem, the same abstract Collatz-Wielandt formula holds, except that the definition of the operator ${\mathcal{G}}$ now has a ‘ $\max$ ’ in place of the ‘ $\min$ ’. But as in the discrete time case, one can go a step further and have a variational formulation. Let

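$$R(x,u,w)\,\coloneqq\,r(x,u)-\frac{1}{2}\lvert\upsigma^{\mathsf{T}}(x)w\rvert^{2}\,,\quad(x,u,w)\in\bar{Q}\times\mathbb{U}\times\mathbb{R}^{d}\,,$$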

and

[TABLE]

with

[TABLE]

for $f\in C^{2}(Q)\cap C(\bar{Q})$ . Recall the definition of an ‘ergodic occupation measure’ [ABG]. For a stochastic differential equation as in (2), but with the drift $b$ replaced with $b(x,u)+\upsigma(x)\upsigma^{\mathsf{T}}(x)w$ , and $w$ taking values in some compact metrizable space, it is the time- $t$ marginal of a stationary state-control process $\bigl{(}X_{t},v(X_{t}),w(X_{t})\bigr{)}$ , perforce independent of $t$ . Thus, in the case the parameter $w$ lives in a compact space, by a standard characterization of ergodic occupation measures (ibid.), ${\mathcal{M}}$ is precisely the set thereof for controlled diffusions whose (controlled) extended generator is $\mathcal{A}$ . This however is not necessarily the case if $w$ lives in $\mathbb{R}^{d}$ . An example to keep in mind is the one-dimensional stochastic differential equation

[TABLE]

It is straightforward to verify that the standard Gaussian density satisfies the Fokker–Planck equation. However, the diffusion is not even regular, so it does not have an invariant probability measure. Therefore, we refer to ${\mathcal{M}}$ as the set of infinitesimal ergodic occupation measures. The variational formula for this model is

[TABLE]

This result is from [AABK].

An analogous abstract Collatz–Wielandt formula for the risk-sensitive cost minimization problem was derived in [ABK]. We have not derived a corresponding variational formula. Even if one were to do so, it is clear that it will be a ‘sup-inf / inf-sup’ formula rather than a pure maximization problem. This is already known through a different route: it forms the basis of the approach initiated by [Flem] and followed by many, in which the Hamilton–Jacobi–Bellman equation for the risk-sensitive cost minimization problem is converted to an Isaacs equation for an ergodic payoff zero sum stochastic differential game. The aforementioned expression then is simply the value of this game. Going by pure analogy, for the reward maximization problem, one would expect this route to yield a stochastic team problem wherein the two agents seek to maximize a common payoff, but non-cooperatively, i.e., without either of them having knowledge of the other agent’s decision. What this translates into is that under the corresponding ergodic occupation measure, the two control actions are conditionally independent given the state. The set of such measures is non-convex. What we have achieved instead is a single concave programming problem, which is a significant simplification from the point of view of developing computational schemes for the problem. This also brings to the fore the difference between reward maximization and cost minimization in risk-sensitive control.

5. Diffusions on the whole space

Here we consider a controlled diffusion in $\mathbb{R}^{d}$ of the form

[TABLE]

where

(1) $W$ is a standard $d$-dimensional Brownian motion;

(2) the control $U_{t}$ lives in a metrizable compact action space $\mathbb{U}$ and is non-anticipative, i.e., for $t>s$, $W(t)-W(s)$ is independent of $X_{0};W_{y},U_{y},y\leq s$;

(3) $b(x,u)$ is continuous and locally Lipschitz continuous in $x$ uniformly in $u\in\mathbb{U}$;

(4) $\upsigma$ is locally Lipschitz continuous and locally nondegenerate;

(5) $b$ and $\upsigma$ have at most affine growth in $x$.

Without loss of generality, we may take $U_{t}$ to be adapted to the increasing $\sigma$ -fields generated by $\{X_{t},t\geq 0\}$ . Then these hypotheses guarantee the existence of a unique weak solution for any admissible control $\{U_{t}\}_{t\geq 0}$ ([ABG], Chapter 2).

As before, we let $r(x,u)$ be a continuous running reward function, which is locally Lipschitz in $x$ uniformly in $u$ , and is also bounded from above in $\mathbb{R}^{d}$ . We define the optimal risk-sensitive value $J^{*}$ by

[TABLE]

where the supremum is over all admissible controls.

Consider the extremal operator

[TABLE]

for $f\in{C}^{2}(\mathbb{R}^{d})$ . The generalized principal eigenvalue of $\widehat{\mathcal{G}}$ is defined by

[TABLE]

where ${\mathscr{W}}_{\text{loc}}^{2,d}(\mathbb{R}^{d})$ denotes the local Sobolev space of functions on $\mathbb{R}^{d}$ whose generalized derivatives up to order $2$ are in $L_{\text{loc}}^{d}(\mathbb{R}^{d})$ , equipped with its natural semi-norms. We assume that $r-\lambda_{*}$ is negative and bounded from above away from zero on the complement of some compact set. This is always satisfied if $-r$ is an inf-compact function, that is the sublevel sets $\{-r\leq c\}$ are compact (or empty) in $\mathbb{R}^{d}\times\mathbb{U}$ for each $c\in\mathbb{R}$ , or if $r$ is a positive function vanishing at infinity and the process $\{X_{t}\}_{t\geq 0}$ is recurrent under some stationary Markov control. Then there exists a unique positive $\Phi_{\mspace{-2.0mu}*}\in{C}^{2}(\mathbb{R}^{d})$ normalized as $\Phi_{\mspace{-2.0mu}*}(0)=1$ which solves $\widehat{\mathcal{G}}\Phi_{\mspace{-2.0mu}*}=\lambda_{*}\Phi_{\mspace{-2.0mu}*}$ . In other words, the eigenvalue $\lambda_{*}=\lambda_{*}(\widehat{\mathcal{G}})$ is simple. Let ${\varphi_{\mspace{-2.0mu}*}}\coloneqq\log\Phi_{\mspace{-2.0mu}*}$ . As shown in [AABK], the function

[TABLE]

is an infinitesimal relative entropy rate.

We let ${\mathcal{Z}}\coloneqq\mathbb{R}^{d}\times\mathbb{U}\times\mathbb{R}^{d}$, and use the single variable $z=(x,u,w)\in{\mathcal{Z}}$. Let ${\mathcal{P}}({\mathcal{Z}})$ denote the set of probability measures on the Borel $\sigma$-algebra of ${\mathcal{Z}}$, and ${\mathcal{M}}_{\mathcal{A}}$ denote the set of infinitesimal ergodic occupation measures for the operator $\mathcal{A}$ in (4) defined for $f\in{C}^{2}(\mathbb{R}^{d})$, which here can be written as

[TABLE]

where ${C}^{2}_{c}(\mathbb{R}^{d})$ is the class of functions in ${C}^{2}(\mathbb{R}^{d})$ which have compact support. Recall the definition $R(x,u,w)\coloneqq r(x,u)-\frac{1}{2}\lvert\upsigma^{\mathsf{T}}(x)w\rvert^{2}$ in Section 4. We also define

[TABLE]

The following is a summary of the main results in [AABK, Section 4].

Theorem 3.

We have

[TABLE]

Suppose that the diffusion matrix $a$ is bounded and uniformly elliptic, and either $-r$ is inf-compact, or $\langle b,x\rangle^{-}$ has subquadratic growth, or $\frac{\lvert b\rvert^{2}}{1+\lvert r\rvert}$ is bounded. Then ${\mathcal{M}}_{\mathcal{A}}\cap{{\mathcal{P}}_{\mspace{-3.0mu}\circ}}({\mathcal{Z}})\subset{{\mathcal{P}}_{\mspace{-3.0mu}*}}({\mathcal{Z}})$ , and ${{\mathcal{P}}_{\mspace{-3.0mu}*}}({\mathcal{Z}})$ may be replaced by ${\mathcal{P}}({\mathcal{Z}})$ in the variational formula above. If, in addition, $\frac{{\mathcal{H}}}{1+\lvert{\varphi_{\mspace{-2.0mu}*}}\rvert}$ is bounded, then

[TABLE]

We continue with the Collatz–Wielandt formula in $\mathbb{R}^{d}$ for the risk-sensitive cost minimization problem. This is studied in [AriAnup]. Here, we have a running cost $r(x,u)$ which is bounded from below in $\mathbb{R}^{d}\times\mathbb{U}$ , and is locally Lipschitz in $x$ uniformly in $u$ . The assumptions on $b$ and $\upsigma$ are as stated in the beginning of the section, except that we may replace the affine growth assumption with the more general condition

[TABLE]

for some constant $C_{0}>0$ . The risk-sensitive optimal value $\Lambda^{*}$ is defined by

[TABLE]

The operator ${\mathcal{G}}$ here is as in (3) but for $f\in{C}^{2}(\mathbb{R}^{d})$ , and we let the generalized principal eigenvalue $\lambda_{*}({\mathcal{G}})$ be defined as in (5).

The running cost does not have any structural properties that penalize unstable behavior such as near-monotonicity or inf-compactness, so uniform ergodicity for the controlled process needs to be assumed. Let

[TABLE]

We consider the following hypothesis.

Assumption 1.

The following hold.

(i) There exists an inf-compact function $\ell\in{C}(\mathbb{R}^{d})$, and a positive function $\mathscr{V}\in{\mathscr{W}}_{\text{loc}}^{2,d}(\mathbb{R}^{d})$, satisfying $\inf_{\mathbb{R}^{d}}\mathscr{V}>0$, such that

[TABLE]

for some constant $\kappa_{1}$ and a compact set $\mathcal{K}$.

(ii) The function $x\mapsto\beta\ell(x)-\max_{u\in\mathbb{U}}\,r(x,u)$ is inf-compact for some $\beta\in(0,1)$.

As noted in [ABS-19], the Foster–Lyapunov equation in (6) cannot in general be satisfied for diffusions with bounded $a$ and $b$ . Therefore, to treat this case, we consider an alternate set of conditions.

Assumption 2.

The following hold.

(i) There exists a positive function $\mathscr{V}\in{\mathscr{W}}_{\text{loc}}^{2,d}(\mathbb{R}^{d})$, satisfying $\inf_{\mathbb{R}^{d}}\mathscr{V}>0$, constants $\kappa_{1}$ and $\gamma>0$, and a compact set $\mathcal{K}$ such that

[TABLE]

(ii) $\lVert r^{-}\rVert_{\infty}+\limsup_{\lvert x\rvert\to\infty}\,\max_{u\in\mathbb{U}}\,r(x,u)<\gamma$.

Let ${\mathfrak{o}}(\mathscr{V})$ denote the class of continuous functions $f$ that grow slower than $\mathscr{V}$ , that is, $\frac{\lvert f(x)\rvert}{\mathscr{V}(x)}\to 0$ as $\lvert x\rvert\to\infty$ . We quote the following result from [ABS-19].

Theorem 4.

Grant either Assumption 1, or 2. Then

[TABLE]

where ${C}^{2,+}(\mathbb{R}^{d})$ denotes the set of positive functions in ${C}^{2}(\mathbb{R}^{d})$ .

We should remark here that the class of test functions $f$ in the first representation formula in (7) cannot, in general, be enlarged to ${C}^{2,+}(\mathbb{R}^{d})$ .

It is also interesting to consider the substitution $f=e^{\psi}$ . Then (7) transforms to

[TABLE]

with

[TABLE]

This underscores the discussion in the last paragraph of section 4.
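To see why the substitution produces a min-max (Isaacs-type) structure for the cost problem, the following standard computation may help. It is a sketch under the assumptions that ${\mathcal{G}}$ is the operator with the ‘$\min$’ over $u\in\mathbb{U}$ from section 4, that $\psi\in C^{2}$, and with the shorthand $a\coloneqq\upsigma\upsigma^{\mathsf{T}}$ introduced here for brevity; it is not a reproduction of the paper's own displays.

```latex
\begin{align*}
\nabla e^{\psi} &= e^{\psi}\,\nabla\psi, \qquad
\nabla^{2} e^{\psi} = e^{\psi}\bigl(\nabla^{2}\psi + \nabla\psi\,\nabla\psi^{\mathsf{T}}\bigr), \\
\frac{\mathcal{G}e^{\psi}}{e^{\psi}}
  &= \tfrac12\operatorname{tr}\bigl(a\,\nabla^{2}\psi\bigr)
   + \tfrac12\bigl\lvert\upsigma^{\mathsf{T}}\nabla\psi\bigr\rvert^{2}
   + \min_{u\in\mathbb{U}}\bigl[\langle b(x,u),\nabla\psi\rangle + r(x,u)\bigr], \\
\tfrac12\bigl\lvert\upsigma^{\mathsf{T}}\nabla\psi\bigr\rvert^{2}
  &= \max_{w\in\mathbb{R}^{d}}\Bigl[\langle a\,w,\nabla\psi\rangle
   - \tfrac12\bigl\lvert\upsigma^{\mathsf{T}}w\bigr\rvert^{2}\Bigr]
   \qquad(\text{attained at } w=\nabla\psi), \\
\frac{\mathcal{G}e^{\psi}}{e^{\psi}}
  &= \tfrac12\operatorname{tr}\bigl(a\,\nabla^{2}\psi\bigr)
   + \min_{u\in\mathbb{U}}\,\max_{w\in\mathbb{R}^{d}}
     \Bigl[\langle b(x,u)+a\,w,\nabla\psi\rangle + r(x,u)
     - \tfrac12\bigl\lvert\upsigma^{\mathsf{T}}w\bigr\rvert^{2}\Bigr].
\end{align*}
```

The bracket in the last line is the drift $b(x,u)+\upsigma(x)\upsigma^{\mathsf{T}}(x)w$ paired with $\nabla\psi$, plus the running payoff $R(x,u,w)=r(x,u)-\frac{1}{2}\lvert\upsigma^{\mathsf{T}}(x)w\rvert^{2}$ of section 4. For cost minimization the auxiliary variable $w$ acts as a maximizing opponent of the controller, whereas with ‘$\max$’ in place of ‘$\min$’ both optimizations point in the same direction.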

6. Finite state and action space

For discrete time problems with finite state and action spaces (i.e., $|S|,|U|<\infty$ in sections 2-3), one can go significantly further for the reward maximization problem. We recall below some results in this context from [CDC].

Consider a controlled Markov chain $\{Y_{n}\}$ on $S$ with state-dependent action space at state $i$ given by:

[TABLE]

where

[TABLE]

This is isomorphic to $\mathcal{P}(S)$ . Let

[TABLE]

The (controlled) transition probabilities of $\{Y_{n}\}$ are

[TABLE]

Define the per stage reward $\tilde{r}\colon K\times S\mapsto\mathbb{R}$ by:

[TABLE]

Let $\{(Z_{n},Q_{n}),n\geq 0\}$ denote the $\tilde{U}_{Y_{n}}$ -valued control process. Consider the problem: Maximize the long run average reward

[TABLE]

Define the corresponding ergodic occupation measure $\gamma\in\mathcal{P}(K\times S)$ by

[TABLE]

where $\gamma_{1}$ is an invariant probability distribution (not necessarily unique) under the transition kernel

[TABLE]

Let $\mathcal{E}$ denote the set of such $\gamma$ . The above average reward control problem is equivalent to the linear program:

P0 Maximize

[TABLE]

over $\mathcal{E}$ .

Recall that $\mathcal{E}$ is specified by linear constraints and its extreme points correspond to stationary Markov policies ([BorkarMC], Chapter V). The maximum will be attained at an extreme point of $\mathcal{E}$ corresponding to a stationary Markov policy. This LP can be simplified as:

Maximize

[TABLE]

over

[TABLE]

The dual LP is:

Minimize $\breve{\lambda}$ subject to

[TABLE]

The proof goes through finite approximations. Note that the LP has infinitely many constraints. However, it does pave the way for the corresponding dynamic programming principle. The dynamic programming formulation equivalent to the above LP turns out to be as follows:

[TABLE]

for all $i\in S$ , where $B_{i}$ is the Argmax in ( $\dagger$ ). Once again, the proof goes through finite approximations. The maximization over $q$ in ( $\dagger$ ) can be explicitly performed using the ‘Gibbs variational principle’ from statistical mechanics. For fixed $i,u,$ the maximum is attained at

[TABLE]

Substitute back, setting

[TABLE]

and exponentiate both sides of ( $\dagger$ ). This leads to the multiplicative dynamic programming equations for infinite horizon risk-sensitive reward in the general degenerate case:

[TABLE]

for all $i\in S$ , where $D_{i}$ is the Argmax in ( $\dagger\dagger$ ). This is the analog of the Howard–Kallenberg results for ergodic or ‘average reward’ control ([Puterman], Chapter 9). Observe the occurrence of the ‘twisted kernel’, which sets it apart from the average reward case.
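The ‘Gibbs variational principle’ invoked above is, in its generic form, the identity $\max_{q}\sum_{j}q_{j}\bigl[a_{j}-\log(q_{j}/p_{j})\bigr]=\log\sum_{j}p_{j}\mathrm{e}^{a_{j}}$, with the maximum attained at the exponentially tilted distribution $q^{*}_{j}\propto p_{j}\mathrm{e}^{a_{j}}$. The sketch below checks this generic identity on hypothetical numbers; it does not reproduce the specific equation ($\dagger$), and the names `p`, `a`, and the two functions are introduced only for illustration.

```python
import math


def gibbs_objective(q, p, a):
    """sum_j q_j * (a_j - log(q_j / p_j)): expected reward minus relative entropy."""
    return sum(qj * (aj - math.log(qj / pj)) for qj, pj, aj in zip(q, p, a) if qj > 0.0)


def gibbs_maximizer(p, a):
    """Tilted ('twisted') distribution q*_j proportional to p_j * exp(a_j), and log Z."""
    weights = [pj * math.exp(aj) for pj, aj in zip(p, a)]
    z = sum(weights)
    return [w / z for w in weights], math.log(z)


# Hypothetical numbers, for illustration only.
p = [0.2, 0.5, 0.3]      # reference distribution, e.g. one row of a transition matrix
a = [1.0, -0.5, 0.3]     # per-outcome payoff

q_star, log_z = gibbs_maximizer(p, a)
print(gibbs_objective(q_star, p, a), log_z)
print(gibbs_objective(p, p, a), gibbs_objective([0.6, 0.1, 0.3], p, a))
```

Exponentiating the optimal value turns the additive ‘reward minus entropy’ form into the multiplicative sum $\sum_{j}p_{j}\mathrm{e}^{a_{j}}$, which is the mechanism behind the passage from the additive to the multiplicative dynamic programming equations.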

7. Future directions

There are several directions left uncharted in this broad problem area. Some of them are listed below.

(1) There are some in-between cases that need to be analyzed, e.g., controlled Markov chains with countably infinite state space. Under the strong ‘Doeblin condition’, the abstract Collatz–Wielandt formula has been derived for these in [Cavazos]. This needs to be extended to more general cases.

(2) The counterpart of the dynamic programming equations derived for reducible risk-sensitive reward processes can also be expected to hold for risk-sensitive cost problems and is yet to be established.

(3) Concrete computational schemes based on approximate concave maximization problems are another direction worth pursuing.

Acknowledgements

The work of A.A. was supported in part by the National Science Foundation through grant DMS-1715210, and in part by the Army Research Office through grant W911NF-17-1-001. The work of V.S.B. was supported by a J. C. Bose Fellowship from the Government of India.
