A Variational Characterization of Rényi Divergences

Venkat Anantharam

arXiv:1701.07796·cs.IT·January 27, 2017


TL;DR

This paper presents a new variational characterization of Rényi divergences between probability distributions and Markov chains, linking them to relative entropies and extending existing formulas.

Contribution

It develops a novel variational formula for Rényi divergences using relative entropies, applicable to both probability distributions and Markov chains.

Findings

1. Derived a variational formula for Rényi divergences between distributions.

2. Extended the variational characterization to stationary finite state Markov chains.

3. Connected the results with Varadhan's variational formula for spectral radius.

Abstract

Atar, Chowdhary and Dupuis have recently exhibited a variational formula for exponential integrals of bounded measurable functions in terms of Rényi divergences. We develop a variational characterization of the Rényi divergences between two probability distributions on a measurable space in terms of relative entropies. When combined with the elementary variational formula for exponential integrals of bounded measurable functions in terms of relative entropy, this yields the variational formula of Atar, Chowdhary and Dupuis as a corollary. We also develop an analogous variational characterization of the Rényi divergence rates between two stationary finite state Markov chains in terms of relative entropy rates. When combined with Varadhan's variational characterization of the spectral radius of square matrices with nonnegative entries in terms of relative entropy, this yields an…

Equations

D(\nu\|\theta):=\begin{cases}\int_{S}\left(\log\frac{d\nu}{d\theta}\right)d\nu, & \mbox{if $\nu\preceq\theta$,}\\ \infty, & \mbox{if $\nu\npreceq\theta$.}\end{cases}

\log\int_{S}e^{g}\,d\mu=\sup_{\theta\in\mathcal{P}(S)}\left(\int_{S}g\,d\theta-D(\theta\|\mu)\right)=\sup_{\theta\in\mathcal{P}(S)\,:\,\theta\preceq\mu}\left(\int_{S}g\,d\theta-D(\theta\|\mu)\right).

R_{\alpha}(\nu\|\theta):=\begin{cases}\infty & \mbox{if $\alpha>1$ and $\nu\npreceq\theta$,}\\ \frac{1}{\alpha(\alpha-1)}\log\int_{\{\nu^{\prime}\theta^{\prime}>0\}}\left(\frac{\nu^{\prime}}{\theta^{\prime}}\right)^{\alpha}d\theta & \mbox{otherwise,}\end{cases}

R_{\alpha}(\nu\|\theta):=R_{1-\alpha}(\theta\|\nu).

R_{\alpha}(\nu\|\theta)=\frac{1}{\alpha(\alpha-1)}\log\int_{S}(\nu^{\prime})^{\alpha}(\theta^{\prime})^{1-\alpha}d\eta,\quad\mbox{for all $\alpha\in\mathbb{R}\backslash\{0,1\}$.}

\int_{\{\nu^{\prime}\theta^{\prime}>0\}}\left(\frac{\nu^{\prime}}{\theta^{\prime}}\right)^{\alpha}d\theta

R_{\alpha}(\nu\|\theta)=\sup_{\mu\in\mathcal{P}(S)\,:\,\mu\preceq\nu}\left(\frac{1}{\alpha}D(\mu\|\theta)-\frac{1}{\alpha-1}D(\mu\|\nu)\right),

R_{\alpha}(\nu\|\theta)=\inf_{\mu\in\mathcal{P}(S)\,:\,\mu\preceq\nu,\,\mu\preceq\theta}\left(\frac{1}{\alpha}D(\mu\|\theta)-\frac{1}{\alpha-1}D(\mu\|\nu)\right),

R_{\alpha}(\nu\|\theta)=\sup_{\mu\in\mathcal{P}(S)\,:\,\mu\preceq\theta}\left(\frac{1}{\alpha}D(\mu\|\theta)-\frac{1}{\alpha-1}D(\mu\|\nu)\right).

\Lambda_{\alpha}(\nu\|\theta)=\sup_{\mu\in\mathcal{P}(S)\,:\,\mu\preceq\nu\mbox{ or }\mu\preceq\theta}\left((\alpha-1)D(\mu\|\theta)-\alpha D(\mu\|\nu)\right),

\frac{1}{\alpha-1}\log\int_{S}e^{(\alpha-1)g}d\nu=\inf_{\theta\in\mathcal{P}(S)}\left(\frac{1}{\alpha}\log\int_{S}e^{\alpha g}d\theta+R_{\alpha}(\nu\|\theta)\right),

\frac{1}{\alpha}\log\int_{S}e^{\alpha g}d\theta=\sup_{\nu\in\mathcal{P}(S)}\left(\frac{1}{\alpha-1}\log\int_{S}e^{(\alpha-1)g}d\nu-R_{\alpha}(\nu\|\theta)\right).

-\frac{1}{\beta}\log\int_{S}e^{\beta h}d\nu=\inf_{\theta\in\mathcal{P}(S)}\left(\frac{1}{1-\beta}\log\int_{S}e^{(\beta-1)h}d\theta+R_{1-\beta}(\nu\|\theta)\right),

\frac{1}{\beta}\log\int_{S}e^{\beta h}d\nu=\sup_{\theta\in\mathcal{P}(S)}\left(\frac{1}{\beta-1}\log\int_{S}e^{(\beta-1)h}d\theta-R_{\beta}(\theta\|\nu)\right),

\frac{1}{\alpha}\log\int_{S}e^{\alpha g}d\theta\geq\frac{1}{\alpha-1}\log\int_{S}e^{(\alpha-1)g}d\nu-R_{\alpha}(\nu\|\theta).

\mu_{K}^{\prime}:=\frac{1}{Z_{K}}(\nu^{\prime})^{\alpha}(\theta^{\prime})^{1-\alpha}1\left((\nu^{\prime})^{\alpha}(\theta^{\prime})^{1-\alpha}\leq K\right)

Z_{K}:=\int_{\{(\nu^{\prime})^{\alpha}(\theta^{\prime})^{1-\alpha}\leq K\}}(\nu^{\prime})^{\alpha}(\theta^{\prime})^{1-\alpha}d\eta,

\frac{1}{\alpha}D(\mu_{K}\|\theta)-\frac{1}{\alpha-1}D(\mu_{K}\|\nu)

=\frac{1}{\alpha}\int_{\{\mu_{K}^{\prime}>0\}}\left(\log\frac{\mu_{K}^{\prime}}{\theta^{\prime}}\right)d\mu_{K}-\frac{1}{\alpha-1}\int_{\{\mu_{K}^{\prime}>0\}}\left(\log\frac{\mu_{K}^{\prime}}{\nu^{\prime}}\right)d\mu_{K}

=\frac{1}{\alpha}\int_{\{\mu_{K}^{\prime}>0\}}\left(\log\frac{(\nu^{\prime})^{\alpha}}{Z_{K}(\theta^{\prime})^{\alpha}}\right)d\mu_{K}-\frac{1}{\alpha-1}\int_{\{\mu_{K}^{\prime}>0\}}\left(\log\frac{(\theta^{\prime})^{1-\alpha}}{Z_{K}(\nu^{\prime})^{1-\alpha}}\right)d\mu_{K}

=\frac{1}{\alpha(\alpha-1)}\log Z_{K},

R_{\alpha}(\nu\|\theta)=\frac{1}{\alpha(\alpha-1)}\log\int_{\{\nu^{\prime}\theta^{\prime}>0\}}\left(\frac{\nu^{\prime}}{\theta^{\prime}}\right)^{\alpha}d\theta=\frac{1}{\alpha(\alpha-1)}\log\int_{S}(\nu^{\prime})^{\alpha}(\theta^{\prime})^{1-\alpha}d\eta.

R_{\alpha}(\nu\|\theta)\geq\frac{1}{\alpha}D(\mu\|\theta)-\frac{1}{\alpha-1}D(\mu\|\nu).

(\alpha-1)\int_{\{\nu^{\prime}\theta^{\prime}\mu^{\prime}>0\}}\log\frac{\mu^{\prime}}{\theta^{\prime}}d\mu-\alpha\int_{\{\nu^{\prime}\theta^{\prime}\mu^{\prime}>0\}}\log\frac{\mu^{\prime}}{\nu^{\prime}}d\mu=\int_{\{\nu^{\prime}\theta^{\prime}\mu^{\prime}>0\}}\log\frac{(\nu^{\prime})^{\alpha}(\theta^{\prime})^{1-\alpha}}{\mu^{\prime}}d\mu.

\alpha(\alpha-1)R_{\alpha}(\nu\|\theta)=\log\int_{\{\nu^{\prime}\theta^{\prime}>0\}}\left(\frac{\nu^{\prime}}{\theta^{\prime}}\right)^{\alpha}d\theta\geq\log\int_{\{\nu^{\prime}\theta^{\prime}\mu^{\prime}>0\}}\left(\frac{\nu^{\prime}}{\theta^{\prime}}\right)^{\alpha}\frac{\theta^{\prime}}{\mu^{\prime}}d\mu,

R_{\alpha}(\nu\|\theta):=\frac{1}{\alpha(\alpha-1)}\log\int_{\{\nu^{\prime}\theta^{\prime}>0\}}\left(\frac{\nu^{\prime}}{\theta^{\prime}}\right)^{\alpha}d\theta=\infty.

\frac{1}{\alpha}D(\mu\|\theta)-\frac{1}{\alpha-1}D(\mu\|\nu)

=\frac{1}{\alpha}\int_{\{\mu^{\prime}>0\}}\left(\log\frac{\mu^{\prime}}{\theta^{\prime}}\right)d\mu-\frac{1}{\alpha-1}\int_{\{\mu^{\prime}>0\}}\left(\log\frac{\mu^{\prime}}{\nu^{\prime}}\right)d\mu

=\frac{1}{\alpha}\int_{\{\mu^{\prime}>0\}}\left(\log\frac{(\nu^{\prime})^{\alpha}}{Z(\theta^{\prime})^{\alpha}}\right)d\mu-\frac{1}{\alpha-1}\int_{\{\mu^{\prime}>0\}}\left(\log\frac{(\theta^{\prime})^{1-\alpha}}{Z(\nu^{\prime})^{1-\alpha}}\right)d\mu

=\frac{1}{\alpha(\alpha-1)}\log Z,

R_{\alpha}(\nu\|\theta)\leq\frac{1}{\alpha}D(\mu\|\theta)-\frac{1}{\alpha-1}D(\mu\|\nu).

\alpha(1-\alpha)\left(\frac{1}{\alpha}D(\mu\|\theta)-\frac{1}{\alpha-1}D(\mu\|\nu)\right)


Full text

A VARIATIONAL CHARACTERIZATION OF RÉNYI DIVERGENCES

VENKAT ANANTHARAM

EECS Department, University of California, Berkeley, CA 94720, USA. Research supported in part by the National Science Foundation grants ECCS-1343398, CNS-1527846, CCF-1618145, the NSF Science & Technology Center grant CCF-0939370 (Science of Information), and the William and Flora Hewlett Foundation Center for Long Term Cybersecurity at Berkeley.

ABSTRACT: Atar, Chowdhary and Dupuis have recently exhibited a variational formula for exponential integrals of bounded measurable functions in terms of Rényi divergences. We develop a variational characterization of the Rényi divergences between two probability distributions on a measurable space in terms of relative entropies. When combined with the elementary variational formula for exponential integrals of bounded measurable functions in terms of relative entropy, this yields the variational formula of Atar, Chowdhary and Dupuis as a corollary.

We also develop an analogous variational characterization of the Rényi divergence rates between two stationary finite state Markov chains in terms of relative entropy rates. When combined with Varadhan’s variational characterization of the spectral radius of square matrices with nonnegative entries in terms of relative entropy, this yields an analog of the variational formula of Atar, Chowdhary and Dupuis in the framework of finite state Markov chains.

Key words: Markov chains; Relative entropy; Rényi divergence; Variational formulas.

1 Introduction

Evaluating how far away a given probability distribution is from another can be done in many ways. The Kullback-Leibler divergence or relative entropy, which is closely tied to Shannon’s notion of entropy, is one such measure prominent in statistical applications. It belongs to a larger family of divergences, the so-called Rényi divergences, which are closely tied to Rényi’s notion of entropy. Rényi divergences also have numerous applications in problems of interest in statistics and information theory, see [6] for a survey of some of their basic properties and some indication of their applications. The Rényi divergences, with a minor change in scaling relative to the definition in [6], are the topic of this article. We treat the Rényi divergences as parametrized by a real number $\alpha\in\mathbb{R}$ , $\alpha\neq 0$ , $\alpha\neq 1$ .

We were prompted to write this document by reading a recent paper of Atar, Chowdhary and Dupuis [3], which provides a variational formula for exponential integrals of bounded measurable functions in terms of Rényi divergences. We show that the variational characterization in [3] is a simple consequence of a variational characterization for Rényi divergences in terms of relative entropies, which we also develop. For the case of probability distributions on a finite set, and in the range $\alpha>0$ , $\alpha\neq 1$ , our variational characterization for Rényi divergences was developed by Shayevitz, [12] and [13, Thm. 1]. More recently, for mutually absolutely continuous probability distributions on a measurable space, in the case $\alpha>0$ , $\alpha\neq 1$ , parts of this variational characterization appear in a paper of Sason, see [10, Lem. 4 and Cor. 2]. The ability to derive the variational formula of [3] from inequalities for the Rényi divergences in terms of relative entropies, in the case $\alpha>1$ , is also remarked on in a recent paper of Liu, Courtade, Cuff, and Verdú [8, Sec. II-A]. To the best of our knowledge, however, a full treatment of this variational characterization of Rényi divergences in terms of relative entropies, covering an arbitrary pair of probability distributions on a measurable space and all possible values for $\alpha$ , does not appear to be in the literature and so it seems worth writing down. It is also worth noting how easily the full variational formula of [3], in all cases, falls out of this variational characterization of Rényi divergences.

Section 2 presents the notational conventions and the definitions of the main quantities used in this document in the i.i.d. case. The main result in the i.i.d. case, Theorem 1, is stated in Section 3. The result of [3] that prompted this paper is presented in Section 4, and is derived there as a consequence of Theorem 1 and the elementary variational formula for exponential integrals in (2). Theorem 1 itself is proved in Section 5.

We then turn to a development of analogs of the preceding results in the case of stationary finite state Markov chains. Section 6 makes the necessary definitions and gathers some standard facts about the asymptotic properties of iterated powers of a square matrix with nonnegative entries, which we need for our discussion. It also contains the analog of the elementary variational formula in the context of finite state Markov chains, in (6), which is Varadhan’s variational characterization in terms of relative entropy of the spectral radius of square matrices with nonnegative entries. The main results in the case of stationary finite state Markov chains are stated in Section 7. These are Theorem 2, which gives a variational characterization of each Rényi divergence rate between two stationary finite state Markov chains in terms of relative entropy rates, and Theorem 3, which gives an analog of the variational formula of [3] in the context of finite state Markov chains. A proof of Theorem 3 assuming the truth of Theorem 2, and using (6), is also provided in this section. The proof of Theorem 2 is provided in Section 8. We end the paper in Section 9 with some thoughts about directions for future work.

In order to maintain the flow of the main exposition, the details of several proofs are relegated to appendices.

2 Setup

Let $(S,\mathcal{F})$ be a measurable space. $\mathcal{B}(S)$ denotes the set of bounded measurable real-valued functions and $\mathcal{P}(S)$ the set of probability measures on $(S,\mathcal{F})$ . For $\nu,\theta\in\mathcal{P}(S)$ , $\nu\preceq\theta$ is notation for $\nu$ being absolutely continuous with respect to $\theta$ , see [4, pg. 442] for the definition. If $\nu\preceq\theta$ , then $\frac{d\nu}{d\theta}$ denotes the Radon-Nikodym derivative of $\nu$ with respect to $\theta$ ; any two choices of Radon-Nikodym derivative differ only on a $\theta$ -null set, see [4, Thm. 32.2]. The relative entropy $D(\nu\|\theta)$ of $\nu$ with respect to $\theta$ is defined by

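$$D(\nu\|\theta):=\begin{cases}\int_{S}\left(\log\frac{d\nu}{d\theta}\right)d\nu, & \mbox{if $\nu\preceq\theta$,}\\ \infty, & \mbox{if $\nu\npreceq\theta$.}\end{cases}\tag{1}$$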

From the convexity of the $x\log x$ function for nonnegative $x$ , one can check that $D(\nu\|\theta)\geq 0$ .

Here, and in the rest of the paper, $:=$ is notation for equality by definition. Logarithms can be assumed to be to the natural base. For two measurable functions $f$ and $g$ on $(S,\mathcal{F})$ , not necessarily bounded, and $\eta\in\mathcal{P}(S)$ , $f=_{\eta}g$ denotes equality of $f$ and $g$ except possibly on an $\eta$ -null set. Similarly, for $C,D\in\mathcal{F}$ , $C=_{\eta}D$ denotes equality of $C$ and $D$ up to $\eta$ -null sets and $C\subseteq_{\eta}D$ denotes the containment of $C$ in $D$ up to $\eta$ -null sets.

The variational characterization in (2) below of exponential integrals of bounded measurable functions is elementary. For any $\mu\in\mathcal{P}(S)$ and $g\in\mathcal{B}(S)$ we have

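$$\log\int_{S}e^{g}\,d\mu=\sup_{\theta\in\mathcal{P}(S)}\left(\int_{S}g\,d\theta-D(\theta\|\mu)\right)=\sup_{\theta\in\mathcal{P}(S)\,:\,\theta\preceq\mu}\left(\int_{S}g\,d\theta-D(\theta\|\mu)\right).\tag{2}$$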

We provide a proof in Appendix A.
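On a finite set the elementary formula (2) can be checked numerically. The following is a minimal sketch, assuming $S$ has five points, that probability vectors stand in for measures, and that the supremum is attained at the exponentially tilted measure $\theta^{*}\propto e^{g}\mu$ (a standard fact, not taken from the text above). The function names `relative_entropy` and `objective` are illustrative.

```python
import numpy as np

def relative_entropy(nu, theta):
    """D(nu || theta) on a finite set; +inf unless nu is absolutely continuous w.r.t. theta."""
    nu, theta = np.asarray(nu, dtype=float), np.asarray(theta, dtype=float)
    if np.any((theta == 0) & (nu > 0)):
        return np.inf
    s = nu > 0  # points with nu = 0 contribute nothing (0 log 0 = 0)
    return float(np.sum(nu[s] * np.log(nu[s] / theta[s])))

def objective(g, theta, mu):
    """The quantity inside the supremum in (2): integral of g d(theta) minus D(theta || mu)."""
    return float(np.dot(g, theta)) - relative_entropy(theta, mu)

rng = np.random.default_rng(0)
mu = rng.dirichlet(np.ones(5))   # reference measure (assumed example data)
g = rng.normal(size=5)           # a bounded function on S (assumed example data)

lhs = float(np.log(np.sum(np.exp(g) * mu)))
theta_star = np.exp(g) * mu / np.sum(np.exp(g) * mu)   # tilted measure
attained = objective(g, theta_star, mu)
best_random = max(objective(g, rng.dirichlet(np.ones(5)), mu) for _ in range(1000))
print(lhs, attained, best_random)
```

The value at `theta_star` should agree with `lhs` up to rounding, while randomly drawn $\theta$ should never exceed it.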

For any $\alpha\in\mathbb{R}\backslash\{0,1\}$ , and $\nu,\theta\in\mathcal{P}(S)$ , the Rényi divergence $R_{\alpha}(\nu\|\theta)$ is defined as in eqn. (2.1) of [3], by first defining it for $\alpha>0$ , $\alpha\neq 1$ , by

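$$R_{\alpha}(\nu\|\theta):=\begin{cases}\infty & \mbox{if $\alpha>1$ and $\nu\npreceq\theta$,}\\ \frac{1}{\alpha(\alpha-1)}\log\int_{\{\nu^{\prime}\theta^{\prime}>0\}}\left(\frac{\nu^{\prime}}{\theta^{\prime}}\right)^{\alpha}d\theta & \mbox{otherwise,}\end{cases}\tag{3}$$

where $\nu^{\prime}:=\frac{d\nu}{d\eta}$ and $\theta^{\prime}:=\frac{d\theta}{d\eta}$, where $\eta\in\mathcal{P}(S)$ is an arbitrary probability distribution such that $\nu\preceq\eta$ and $\theta\preceq\eta$. It is straightforward to check that every choice of $\eta$, subject to the absolute continuity conditions, results in the same value of the Rényi divergence. Then, for $\alpha<0$, we use the definition

$$R_{\alpha}(\nu\|\theta):=R_{1-\alpha}(\theta\|\nu).\tag{4}$$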

Remark 1.

Even though the definition of $R_{\alpha}(\nu\|\theta)$ is broken up into cases above, a single formula would work, if suitably interpreted. One could write

$$R_{\alpha}(\nu\|\theta)=\frac{1}{\alpha(\alpha-1)}\log\int_{S}(\nu^{\prime})^{\alpha}(\theta^{\prime})^{1-\alpha}d\eta,\quad\mbox{for all $\alpha\in\mathbb{R}\backslash\{0,1\}$.}$$

In this formula, if $\eta(\nu^{\prime}>0,\theta^{\prime}=0)>0$ and $\alpha>1$, then because $(\nu^{\prime})^{\alpha}(\theta^{\prime})^{1-\alpha}=\frac{(\nu^{\prime})^{\alpha}}{(\theta^{\prime})^{\alpha-1}}=\infty$ on this event, we are forced to interpret $R_{\alpha}(\nu\|\theta)$ as being $\infty$. A similar argument forces us to interpret $R_{\alpha}(\nu\|\theta)$ as $\infty$ if $\eta(\nu^{\prime}=0,\theta^{\prime}>0)>0$ and $\alpha<0$. Rather than requiring of the reader the mental gymnastics needed to keep track of such interpretations, we prefer to break the discussion up into cases.

Remark 2.

It is clear that $R_{\alpha}(\nu\|\theta)\geq 0$ (possibly $\infty$ ) if $\alpha>1$ or $\alpha<0$ . For $0<\alpha<1$ , an application of Hölder’s inequality with $p:=\frac{1}{\alpha}$ and $q:=\frac{1}{1-\alpha}$ (so $\frac{1}{p}+\frac{1}{q}=1$ ) gives

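$$\int_{\{\nu^{\prime}\theta^{\prime}>0\}}\left(\frac{\nu^{\prime}}{\theta^{\prime}}\right)^{\alpha}d\theta=\int_{\{\nu^{\prime}\theta^{\prime}>0\}}(\nu^{\prime})^{\alpha}(\theta^{\prime})^{1-\alpha}d\eta\leq\left(\int_{S}\nu^{\prime}d\eta\right)^{\alpha}\left(\int_{S}\theta^{\prime}d\eta\right)^{1-\alpha}=1.$$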

Hence we also have $R_{\alpha}(\nu\|\theta)\geq 0$ (possibly $\infty$ ) if $0<\alpha<1$ . Note in particular that if $\eta(\nu^{\prime}\theta^{\prime}>0)=0$ , then $R_{\alpha}(\nu\|\theta)=\infty$ for all $\alpha\in\mathbb{R}\backslash\{0,1\}$ .
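The case analysis in the definition of $R_{\alpha}$ translates directly into code on a finite set. The sketch below assumes probability vectors in place of measures, so that the vectors themselves play the role of the densities $\nu^{\prime}$ and $\theta^{\prime}$ (the value does not depend on the dominating measure). The function name `renyi` and the example vectors are illustrative.

```python
import numpy as np

def renyi(alpha, nu, theta):
    """R_alpha(nu || theta) on a finite set, with the 1/(alpha(alpha-1)) scaling used here."""
    if alpha in (0.0, 1.0):
        raise ValueError("alpha must differ from 0 and 1")
    nu, theta = np.asarray(nu, dtype=float), np.asarray(theta, dtype=float)
    if alpha < 0:
        # Definition for negative alpha: swap the arguments and use 1 - alpha > 1.
        return renyi(1.0 - alpha, theta, nu)
    if alpha > 1 and np.any((theta == 0) & (nu > 0)):
        return np.inf  # nu is not absolutely continuous w.r.t. theta
    s = (nu > 0) & (theta > 0)
    total = float(np.sum((nu[s] / theta[s]) ** alpha * theta[s]))
    if total == 0.0:
        # Only possible for 0 < alpha < 1 with disjoint supports:
        # log 0 = -inf divided by alpha(alpha-1) < 0 gives +inf.
        return np.inf
    return float(np.log(total) / (alpha * (alpha - 1.0)))

nu = [0.5, 0.5, 0.0]
theta = [0.25, 0.25, 0.5]
print(renyi(2.0, nu, theta))    # finite, since nu is dominated by theta
print(renyi(0.5, nu, theta))    # finite and nonnegative
print(renyi(-1.0, nu, theta))   # infinite, since theta is not dominated by nu
```

For these vectors the first value can be worked out by hand as $\frac{1}{2}\log 2$, and the last call illustrates how the case $\alpha<0$ inherits the absolute continuity requirement with the roles of $\nu$ and $\theta$ exchanged.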

3 Statement of the main result in the i.i.d. case

Our main result in the i.i.d. case is the following variational characterization of Rényi divergence.

Theorem 1.

Let $\alpha\in\mathbb{R}\backslash\{0,1\}$ and $\nu,\theta\in\mathcal{P}(S)$ . Then, if $\alpha>1$ , we have

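$$R_{\alpha}(\nu\|\theta)=\sup_{\mu\in\mathcal{P}(S)\,:\,\mu\preceq\nu}\left(\frac{1}{\alpha}D(\mu\|\theta)-\frac{1}{\alpha-1}D(\mu\|\nu)\right),\tag{5}$$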

while, if $0<\alpha<1$ , we have

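$$R_{\alpha}(\nu\|\theta)=\inf_{\mu\in\mathcal{P}(S)\,:\,\mu\preceq\nu,\,\mu\preceq\theta}\left(\frac{1}{\alpha}D(\mu\|\theta)-\frac{1}{\alpha-1}D(\mu\|\nu)\right),\tag{6}$$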

and, if $\alpha<0$ , we have

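$$R_{\alpha}(\nu\|\theta)=\sup_{\mu\in\mathcal{P}(S)\,:\,\mu\preceq\theta}\left(\frac{1}{\alpha}D(\mu\|\theta)-\frac{1}{\alpha-1}D(\mu\|\nu)\right).\tag{7}$$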

Further, when $0<\alpha<1$ , one can find $\mu\in\mathcal{P}(S)$ , $\mu\preceq\nu$ , $\mu\preceq\theta$ , achieving the infimum on the RHS of (6), whenever $\{\mu\in\mathcal{P}(S)~{}:~{}\mu\preceq\nu,\mu\preceq\theta\}$ is nonempty. $\Box$
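Theorem 1 can be illustrated numerically on a finite set. The following is a minimal sketch, assuming $\nu$ and $\theta$ have full support on four points, so that all absolute continuity conditions hold automatically and the single formula of Remark 1 applies for every $\alpha$. It evaluates the objective at the tilted measure $\mu\propto\nu^{\alpha}\theta^{1-\alpha}$ and at random $\mu$. All function names are illustrative.

```python
import numpy as np

def kl(mu, theta):
    """Relative entropy D(mu || theta); full support is assumed."""
    return float(np.sum(mu * np.log(mu / theta)))

def renyi(alpha, nu, theta):
    """R_alpha(nu || theta) via the single formula of Remark 1 (full support assumed)."""
    return float(np.log(np.sum(nu**alpha * theta**(1 - alpha))) / (alpha * (alpha - 1)))

def objective(alpha, mu, nu, theta):
    """The expression optimized over mu in Theorem 1."""
    return kl(mu, theta) / alpha - kl(mu, nu) / (alpha - 1)

def tilted(alpha, nu, theta):
    """Candidate optimizer: mu proportional to nu^alpha * theta^(1 - alpha)."""
    w = nu**alpha * theta**(1 - alpha)
    return w / w.sum()

rng = np.random.default_rng(1)
nu, theta = rng.dirichlet(np.ones(4)), rng.dirichlet(np.ones(4))

results = {}
for alpha in (2.0, 0.3, -1.5):
    target = renyi(alpha, nu, theta)
    at_tilted = objective(alpha, tilted(alpha, nu, theta), nu, theta)
    samples = [objective(alpha, rng.dirichlet(np.ones(4)), nu, theta) for _ in range(2000)]
    results[alpha] = (target, at_tilted, min(samples), max(samples))
    print(alpha, results[alpha])
```

For $\alpha>1$ and $\alpha<0$ the random samples should stay below the Rényi divergence (a supremum), while for $0<\alpha<1$ they should stay above it (an infimum). In all three cases the tilted measure should reproduce the divergence up to rounding.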

Remark 3.

The case-by-case structure of this result is partly a consequence of the normalization chosen for the Rényi divergences (which is necessary to make the Rényi divergence nonnegative) and partly a consequence of the need to apply the correct absolute continuity conditions. If it is considered desirable to write a single formula covering all cases, this can be done by considering $\Lambda_{\alpha}(\nu\|\theta):=\alpha(\alpha-1)R_{\alpha}(\nu\|\theta)$, for $\alpha\in\mathbb{R}\backslash\{0,1\}$. Then one has the single formula

$$\Lambda_{\alpha}(\nu\|\theta)=\sup_{\mu\in\mathcal{P}(S)\,:\,\mu\preceq\nu\mbox{ or }\mu\preceq\theta}\left((\alpha-1)D(\mu\|\theta)-\alpha D(\mu\|\nu)\right),$$

for all $\alpha\in\mathbb{R}\backslash\{0,1\}$. Note, however, that the set over which the supremum is being taken need not be convex in general. Restricting the supremum to this set is essential to avoid encountering expressions of the form $\infty-\infty$.

4 Discussion

Atar, Chowdhary and Dupuis [3] have recently established a variational formula for exponential integrals of bounded measurable functions. This is established in two forms. For any $\alpha\in\mathbb{R}\backslash\{0,1\}$ , $\nu\in\mathcal{P}(S)$ , and $g\in\mathcal{B}(S)$ , eqn. (2.6) of [3] states that

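$$\frac{1}{\alpha-1}\log\int_{S}e^{(\alpha-1)g}d\nu=\inf_{\theta\in\mathcal{P}(S)}\left(\frac{1}{\alpha}\log\int_{S}e^{\alpha g}d\theta+R_{\alpha}(\nu\|\theta)\right),\tag{8}$$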

while eqn. (2.7) of [3] states that for any $\alpha\in\mathbb{R}\backslash\{0,1\}$ , $\theta\in\mathcal{P}(S)$ , and $g\in\mathcal{B}(S)$ we have

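$$\frac{1}{\alpha}\log\int_{S}e^{\alpha g}d\theta=\sup_{\nu\in\mathcal{P}(S)}\left(\frac{1}{\alpha-1}\log\int_{S}e^{(\alpha-1)g}d\nu-R_{\alpha}(\nu\|\theta)\right).\tag{9}$$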

It is straightforward to exhibit the equivalence of these two forms. For instance, assuming (8), let $\beta:=1-\alpha$ and $h:=-g$ , and conclude that for all $\beta\in\mathbb{R}\backslash\{0,1\}$ , $\nu\in\mathcal{P}(S)$ , and $h\in\mathcal{B}(S)$ we have

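$$-\frac{1}{\beta}\log\int_{S}e^{\beta h}d\nu=\inf_{\theta\in\mathcal{P}(S)}\left(\frac{1}{1-\beta}\log\int_{S}e^{(\beta-1)h}d\theta+R_{1-\beta}(\nu\|\theta)\right),$$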

or equivalently that

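$$\frac{1}{\beta}\log\int_{S}e^{\beta h}d\nu=\sup_{\theta\in\mathcal{P}(S)}\left(\frac{1}{\beta-1}\log\int_{S}e^{(\beta-1)h}d\theta-R_{\beta}(\theta\|\nu)\right),$$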

which is (9). One can similarly go in the opposite direction. We will therefore focus only on the form in (9). As observed in Remark 2.3 of [3], taking the limit as $\alpha\to 1$ in (9) recovers the elementary variational formula for exponential integrals of bounded measurable functions in (2).

The structure of Theorem 1 is motivated by the variational characterization in (9). We will now demonstrate that Theorem 1 is at least as strong as (9) by deriving (9) from Theorem 1 and the elementary variational formula (2).

First of all, we show that for any $\alpha\in\mathbb{R}\backslash\{0,1\}$ , $\theta\in\mathcal{P}(S)$ , and $g\in\mathcal{B}(S)$ one can find $\nu\in\mathcal{P}(S)$ achieving the supremum in (9). This proof does not depend on Theorem 1 and (2). In fact, the supremum is achieved by the choice $\frac{1}{Z}e^{-g}d\nu=d\theta$ , where $Z$ is the normalization factor, and it is elementary to prove this. For completeness, a proof is included in Appendix B.

It remains to prove that for any $\alpha\in\mathbb{R}\backslash\{0,1\}$ , $g\in\mathcal{B}(S)$ , and $\theta,\nu\in\mathcal{P}(S)$ , we have

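$$\frac{1}{\alpha}\log\int_{S}e^{\alpha g}d\theta\geq\frac{1}{\alpha-1}\log\int_{S}e^{(\alpha-1)g}d\nu-R_{\alpha}(\nu\|\theta).\tag{10}$$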

Assuming the truth of Theorem 1, and using (2), this is proved in Appendix C.
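The two forms (8) and (9) of the variational formula of [3] can be checked numerically on a finite set. This is a minimal sketch, assuming full-support probability vectors on five points. It uses the optimizer described above, $d\nu\propto e^{g}d\theta$ for the supremum in (9), and equivalently $d\theta\propto e^{-g}d\nu$ for the infimum in (8). The helper names `log_mgf`, `rhs8` and `rhs9` are illustrative.

```python
import numpy as np

def renyi(alpha, nu, theta):
    """R_alpha(nu || theta) for full-support vectors, 1/(alpha(alpha-1)) scaling."""
    return float(np.log(np.sum(nu**alpha * theta**(1 - alpha))) / (alpha * (alpha - 1)))

def log_mgf(c, g, p):
    """log of the integral of exp(c * g) with respect to p."""
    return float(np.log(np.sum(np.exp(c * g) * p)))

def rhs9(alpha, g, nu, theta):
    """Expression inside the supremum over nu in (9)."""
    return log_mgf(alpha - 1, g, nu) / (alpha - 1) - renyi(alpha, nu, theta)

def rhs8(alpha, g, nu, theta):
    """Expression inside the infimum over theta in (8)."""
    return log_mgf(alpha, g, theta) / alpha + renyi(alpha, nu, theta)

rng = np.random.default_rng(2)
g = rng.normal(size=5)
base = rng.dirichlet(np.ones(5))   # plays theta in (9) and nu in (8)

checks = {}
for alpha in (2.0, 0.3, -1.5):
    lhs9 = log_mgf(alpha, g, base) / alpha
    nu_star = np.exp(g) * base / np.sum(np.exp(g) * base)
    sup_random = max(rhs9(alpha, g, rng.dirichlet(np.ones(5)), base) for _ in range(1000))

    lhs8 = log_mgf(alpha - 1, g, base) / (alpha - 1)
    theta_star = np.exp(-g) * base / np.sum(np.exp(-g) * base)
    inf_random = min(rhs8(alpha, g, base, rng.dirichlet(np.ones(5))) for _ in range(1000))

    checks[alpha] = (lhs9, rhs9(alpha, g, nu_star, base), sup_random,
                     lhs8, rhs8(alpha, g, base, theta_star), inf_random)
    print(alpha, checks[alpha])
```

In each case the value at the tilted optimizer should match the left-hand side, and random competitors should respect the inequality (10) and its rearrangement.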

5 Proof of Theorem 1

We now prove Theorem 1.

Consider first the case $\alpha>1$ . Suppose $\nu\npreceq\theta$ . Then the LHS of (5) is $\infty$ . Also, in this case, we can choose $\mu\in\mathcal{P}(S)$ such that $\mu\preceq\nu$ but $\mu\npreceq\theta$ , which makes the RHS of (5) also equal to $\infty$ . Thus we may assume that $\nu\preceq\theta$ . Given $K>0$ sufficiently large, define $\mu_{K}\in\mathcal{P}(S)$ by

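$$\mu_{K}^{\prime}:=\frac{1}{Z_{K}}(\nu^{\prime})^{\alpha}(\theta^{\prime})^{1-\alpha}1\left((\nu^{\prime})^{\alpha}(\theta^{\prime})^{1-\alpha}\leq K\right),$$

where $\eta\in\mathcal{P}(S)$ is chosen such that $\theta\preceq\eta$ (so that also $\nu\preceq\eta$), and we define $\nu^{\prime}:=\frac{d\nu}{d\eta}$, $\theta^{\prime}:=\frac{d\theta}{d\eta}$, and $\mu^{\prime}_{K}:=\frac{d\mu_{K}}{d\eta}$. Further,

$$Z_{K}:=\int_{\{(\nu^{\prime})^{\alpha}(\theta^{\prime})^{1-\alpha}\leq K\}}(\nu^{\prime})^{\alpha}(\theta^{\prime})^{1-\alpha}d\eta,$$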

and $K$ sufficiently large means that $Z_{K}>0$ . We note that $\mu_{K}\preceq\nu$ (and so $\mu_{K}\preceq\theta$ ). Then

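$$\begin{aligned}\frac{1}{\alpha}D(\mu_{K}\|\theta)-\frac{1}{\alpha-1}D(\mu_{K}\|\nu)&=\frac{1}{\alpha}\int_{\{\mu_{K}^{\prime}>0\}}\left(\log\frac{\mu_{K}^{\prime}}{\theta^{\prime}}\right)d\mu_{K}-\frac{1}{\alpha-1}\int_{\{\mu_{K}^{\prime}>0\}}\left(\log\frac{\mu_{K}^{\prime}}{\nu^{\prime}}\right)d\mu_{K}\\&=\frac{1}{\alpha}\int_{\{\mu_{K}^{\prime}>0\}}\left(\log\frac{(\nu^{\prime})^{\alpha}}{Z_{K}(\theta^{\prime})^{\alpha}}\right)d\mu_{K}-\frac{1}{\alpha-1}\int_{\{\mu_{K}^{\prime}>0\}}\left(\log\frac{(\theta^{\prime})^{1-\alpha}}{Z_{K}(\nu^{\prime})^{1-\alpha}}\right)d\mu_{K}\\&=\frac{1}{\alpha(\alpha-1)}\log Z_{K},\end{aligned}$$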

which, as $K\to\infty$ , converges to

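$$R_{\alpha}(\nu\|\theta)=\frac{1}{\alpha(\alpha-1)}\log\int_{\{\nu^{\prime}\theta^{\prime}>0\}}\left(\frac{\nu^{\prime}}{\theta^{\prime}}\right)^{\alpha}d\theta=\frac{1}{\alpha(\alpha-1)}\log\int_{S}(\nu^{\prime})^{\alpha}(\theta^{\prime})^{1-\alpha}d\eta.$$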

It remains to show that, in the case $\alpha>1$ , for all $\nu,\theta\in\mathcal{P}(S)$ such that $\nu\preceq\theta$ , we have, for all $\mu\in\mathcal{P}(S)$ such that $\mu\preceq\nu$ , the inequality

$$R_{\alpha}(\nu\|\theta)\geq\frac{1}{\alpha}D(\mu\|\theta)-\frac{1}{\alpha-1}D(\mu\|\nu).\tag{11}$$

Pick $\eta\in\mathcal{P}(S)$ such that $\theta\preceq\eta$ (so we also have $\nu\preceq\eta$ and $\mu\preceq\eta$ ), and let $\nu^{\prime}:=\frac{d\nu}{d\eta}$ , $\theta^{\prime}:=\frac{d\theta}{d\eta}$ , and $\mu^{\prime}:=\frac{d\mu}{d\eta}$ . Multiplying the RHS of (11) by $\alpha(\alpha-1)$ gives

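$$(\alpha-1)\int_{\{\nu^{\prime}\theta^{\prime}\mu^{\prime}>0\}}\log\frac{\mu^{\prime}}{\theta^{\prime}}d\mu-\alpha\int_{\{\nu^{\prime}\theta^{\prime}\mu^{\prime}>0\}}\log\frac{\mu^{\prime}}{\nu^{\prime}}d\mu=\int_{\{\nu^{\prime}\theta^{\prime}\mu^{\prime}>0\}}\log\frac{(\nu^{\prime})^{\alpha}(\theta^{\prime})^{1-\alpha}}{\mu^{\prime}}d\mu.$$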

On the other hand, we have

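$$\alpha(\alpha-1)R_{\alpha}(\nu\|\theta)=\log\int_{\{\nu^{\prime}\theta^{\prime}>0\}}\left(\frac{\nu^{\prime}}{\theta^{\prime}}\right)^{\alpha}d\theta\geq\log\int_{\{\nu^{\prime}\theta^{\prime}\mu^{\prime}>0\}}\left(\frac{\nu^{\prime}}{\theta^{\prime}}\right)^{\alpha}\frac{\theta^{\prime}}{\mu^{\prime}}d\mu,$$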

so (11) follows from the concavity of the logarithm.

Next, consider the case when $0<\alpha<1$ . Pick $\eta\in\mathcal{P}(S)$ such that $\nu\preceq\eta$ and $\theta\preceq\eta$ , and let $\nu^{\prime}:=\frac{d\nu}{d\eta}$ and $\theta^{\prime}:=\frac{d\theta}{d\eta}$ . If $\{\nu^{\prime}\theta^{\prime}>0\}=_{\eta}\emptyset$ , then $\int_{\{\nu^{\prime}\theta^{\prime}>0\}}(\frac{\nu^{\prime}}{\theta^{\prime}})^{\alpha}d\theta=0$ , and so

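$$R_{\alpha}(\nu\|\theta):=\frac{1}{\alpha(\alpha-1)}\log\int_{\{\nu^{\prime}\theta^{\prime}>0\}}\left(\frac{\nu^{\prime}}{\theta^{\prime}}\right)^{\alpha}d\theta=\infty.$$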

But we also have $\{\mu\in\mathcal{P}(S)~{}:~{}\mu\preceq\nu,\mu\preceq\theta\}=\emptyset$ , so the RHS of (6) equals $\infty$ . We may therefore assume that $\eta(\nu^{\prime}\theta^{\prime}>0)>0$ . Now, an application of Hölder’s inequality with $p:=\frac{1}{\alpha}$ and $q:=\frac{1}{1-\alpha}$ (so $\frac{1}{p}+\frac{1}{q}=1$ ) gives

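$$\int_{\{\nu^{\prime}\theta^{\prime}>0\}}\left(\frac{\nu^{\prime}}{\theta^{\prime}}\right)^{\alpha}d\theta=\int_{\{\nu^{\prime}\theta^{\prime}>0\}}(\nu^{\prime})^{\alpha}(\theta^{\prime})^{1-\alpha}d\eta\leq\left(\int_{S}\nu^{\prime}d\eta\right)^{\alpha}\left(\int_{S}\theta^{\prime}d\eta\right)^{1-\alpha}=1.$$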

Let $\mu\in\mathcal{P}(S)$ be defined by $\mu^{\prime}:=\frac{1}{Z}(\nu^{\prime})^{\alpha}(\theta^{\prime})^{1-\alpha}$ , where $Z:=\int_{S}(\nu^{\prime})^{\alpha}(\theta^{\prime})^{1-\alpha}d\eta$ . Note that $R_{\alpha}(\nu\|\theta)=\frac{1}{\alpha(\alpha-1)}\log Z$ . We have $\mu\preceq\nu$ and $\mu\preceq\theta$ , as required on the RHS of (6). Now,

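$$\begin{aligned}\frac{1}{\alpha}D(\mu\|\theta)-\frac{1}{\alpha-1}D(\mu\|\nu)&=\frac{1}{\alpha}\int_{\{\mu^{\prime}>0\}}\left(\log\frac{\mu^{\prime}}{\theta^{\prime}}\right)d\mu-\frac{1}{\alpha-1}\int_{\{\mu^{\prime}>0\}}\left(\log\frac{\mu^{\prime}}{\nu^{\prime}}\right)d\mu\\&=\frac{1}{\alpha}\int_{\{\mu^{\prime}>0\}}\left(\log\frac{(\nu^{\prime})^{\alpha}}{Z(\theta^{\prime})^{\alpha}}\right)d\mu-\frac{1}{\alpha-1}\int_{\{\mu^{\prime}>0\}}\left(\log\frac{(\theta^{\prime})^{1-\alpha}}{Z(\nu^{\prime})^{1-\alpha}}\right)d\mu\\&=\frac{1}{\alpha(\alpha-1)}\log Z,\end{aligned}$$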

which equals $R_{\alpha}(\nu\|\theta)$ . It remains to show that, in the case $0<\alpha<1$ , for all $\nu,\theta\in\mathcal{P}(S)$ such that $\eta(\nu^{\prime}\theta^{\prime}>0)>0$ , we have, for all $\mu\in\mathcal{P}(S)$ such that $\mu\preceq\nu$ and $\mu\preceq\theta$ , the inequality

$$R_{\alpha}(\nu\|\theta)\leq\frac{1}{\alpha}D(\mu\|\theta)-\frac{1}{\alpha-1}D(\mu\|\nu).\tag{12}$$

To see this, note that

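$$\begin{aligned}\alpha(1-\alpha)\left(\frac{1}{\alpha}D(\mu\|\theta)-\frac{1}{\alpha-1}D(\mu\|\nu)\right)&=(1-\alpha)D(\mu\|\theta)+\alpha D(\mu\|\nu)\\&=\int_{\{\mu^{\prime}>0\}}f\left(\frac{(\nu^{\prime})^{\alpha}(\theta^{\prime})^{1-\alpha}}{\mu^{\prime}}\right)d\mu\\&\geq f\left(\int_{\{\mu^{\prime}>0\}}(\nu^{\prime})^{\alpha}(\theta^{\prime})^{1-\alpha}d\eta\right)\\&\geq f\left(\int_{S}(\nu^{\prime})^{\alpha}(\theta^{\prime})^{1-\alpha}d\eta\right)=\alpha(1-\alpha)R_{\alpha}(\nu\|\theta),\end{aligned}$$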

where $f(\cdot)$ is the negative logarithm function, which is decreasing and convex. This establishes (12). Note that we have also established the claim in Theorem 1 that when $0<\alpha<1$ one can find $\mu$ realizing the infimum in (6) whenever $\{\mu\in\mathcal{P}(S)~{}:~{}\mu\preceq\nu,\mu\preceq\theta\}$ is nonempty.

It remains to consider the case where $\alpha<0$ . Let $\beta:=1-\alpha$ . Then $\beta>1$ . By definition $R_{\alpha}(\nu\|\theta)=R_{\beta}(\theta\|\nu)$ . However, we have already proved that

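$$R_{\beta}(\theta\|\nu)=\sup_{\mu\in\mathcal{P}(S)\,:\,\mu\preceq\theta}\left(\frac{1}{\beta}D(\mu\|\nu)-\frac{1}{\beta-1}D(\mu\|\theta)\right).$$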

This reads

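$$R_{\alpha}(\nu\|\theta)=\sup_{\mu\in\mathcal{P}(S)\,:\,\mu\preceq\theta}\left(\frac{1}{\alpha}D(\mu\|\theta)-\frac{1}{\alpha-1}D(\mu\|\nu)\right),$$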

which establishes (7) in this case also and completes the proof of Theorem 1.

6 Rényi divergence rate between stationary finite state Markov chains

In this section we set the stage to present analogs of the preceding results involving the Rényi divergence rates between two stationary finite state Markov chains. Extensions to general state space Markov processes in both discrete and continuous time of a form similar to those we will present for stationary finite state Markov chains no doubt exist, under suitable conditions on the transition kernel, but may be considered topics for future work.

From this point onwards in this document we take $S=\{1,\ldots,d\}$ and $\mathcal{F}$ to be comprised of all the subsets of $S$ . Let $\mathcal{M}(S\times S)$ denote the set of Markov probability distributions on $(S\times S,\mathcal{F}\times\mathcal{F})$ , where $\nu\in\mathcal{M}(S\times S)$ if $\nu(i,j)\geq 0$ for all $(i,j)\in S\times S$ , $\sum_{i,j\in S}\nu(i,j)=1$ , and $\nu(k,*)=\nu(*,k)$ for all $k\in S$ , where $\nu(k,*):=\sum_{j\in S}\nu(k,j)$ and $\nu(*,k):=\sum_{i\in S}\nu(i,k)$ . Here $\mathcal{F}\times\mathcal{F}$ is comprised of all the subsets of $S\times S$ .

Given $\nu\in\mathcal{M}(S\times S)$, let $S_{\nu}:=\{k~{}:~{}\nu(k,*)>0\}$. $S_{\nu}$ is a subset of $S$, and is called the support of $\nu$. For $i\in S_{\nu}$ and $j\in S$, we define $\nu(j|i):=\frac{\nu(i,j)}{\nu(i,*)}$. Note that $\nu(j|i)=0$ if $i\in S_{\nu}$ and $j\notin S_{\nu}$, and $\sum_{j\in S}\nu(j|i)=1$. For $i\notin S_{\nu}$, we define $\nu(j|i)=0$ for all $j$. This may seem strange, but is an important notational convention for the equations we are going to write. Note that $\sum_{j\in S}\nu(j|i)=0$ for $i\notin S_{\nu}$.
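The definitions of $\mathcal{M}(S\times S)$, the support $S_{\nu}$, and the conditional $\nu(j|i)$ can be made concrete as follows. This is a minimal sketch, assuming a two-state transition matrix embedded in a three-point state space so that one state lies outside the support. The function names `pair_measure` and `conditional` are illustrative, and building $\nu(i,j)=\pi(i)P(i,j)$ from a stationary distribution $\pi$ is one standard way to produce an element of $\mathcal{M}(S\times S)$.

```python
import numpy as np

def pair_measure(P):
    """nu(i, j) = pi(i) P(i, j), with pi the stationary distribution of the stochastic matrix P."""
    w, v = np.linalg.eig(P.T)
    pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])  # left eigenvector for eigenvalue 1
    pi = pi / pi.sum()
    return pi[:, None] * P

def conditional(nu):
    """nu(j|i) = nu(i, j) / nu(i, *), with rows outside the support S_nu set to zero."""
    row = nu.sum(axis=1)
    out = np.zeros_like(nu)
    s = row > 0
    out[s] = nu[s] / row[s, None]
    return out

# Two-state chain (assumed example) embedded in S = {1, 2, 3}; state 3 is outside the support.
P2 = np.array([[0.9, 0.1],
               [0.4, 0.6]])
nu = np.zeros((3, 3))
nu[:2, :2] = pair_measure(P2)

cond = conditional(nu)
print(nu.sum(axis=1), nu.sum(axis=0))   # both marginals agree
print(cond.sum(axis=1))                 # rows in the support sum to 1, the other row to 0
```

The zero row of `cond` is exactly the convention $\nu(j|i)=0$ for $i\notin S_{\nu}$ described above.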

Given $\nu,\theta\in\mathcal{M}(S\times S)$ we say $\nu$ is absolutely continuous with respect to $\theta$ , denoted $\nu\preceq\theta$ , if $\theta(i,j)=0\Rightarrow\nu(i,j)=0$ for all $(i,j)\in S\times S$ . The relative entropy $D(\nu\|\theta)$ of $\nu$ with respect to $\theta$ is defined by

[TABLE]

It can be checked that $D(\nu\|\theta)\geq 0$ .

We need certain basic facts about the asymptotic properties of iterated powers of square matrices with nonnegative entries. We will state these facts in narrative form. Proofs can be extracted from several books that provide standard treatments of the theory of nonnegative matrices or finite state Markov chains, see e.g. [11, Chap. 1].

Let $M=\left[m_{ij}\right]$ be a $d\times d$ matrix with nonnegative entries. Then the limit

[TABLE]

exists, where $m^{(n)}(i,j)$ denotes the $(i,j)$ entry of $M^{n}$ . We can associate to $M$ a directed graph on the vertex set $\{1,\dots,d\}$ , where we have a directed edge from $i$ to $j$ iff $m_{ij}>0$ . This graph may have self loops. Then $\rho(M)=-\infty$ iff this directed graph does not have a directed cycle. Otherwise $\rho(M)$ is finite. We call $\rho(M)$ the growth rate of $M$ .

Suppose $\rho(M)$ is finite. We say $\mu\in\mathcal{M}(S\times S)$ is absolutely continuous with respect to $M$ if $\mu(i,j)>0\Rightarrow m(i,j)>0$ for all $i,j\in S$. Let $\mu_{1},\mu_{2}\in\mathcal{M}(S\times S)$ be absolutely continuous with respect to $M$. Then so is $\frac{1}{2}(\mu_{1}+\mu_{2})$. Thus there is a maximum element $\mu\in\mathcal{M}(S\times S)$ that is absolutely continuous with respect to $M$, in the sense that every other $\nu\in\mathcal{M}(S\times S)$ that is absolutely continuous with respect to $M$ satisfies $\nu\preceq\mu$. This maximum element need not be unique. Pick any such maximum element, call it $\tau$. Let $M^{\prime}:=\left[m(i,j)1(i,j\in S_{\tau})\right]$. Then $\rho(M^{\prime})=\rho(M)$.

Let $\mu\in\mathcal{M}(S\times S)$, which we also think of as a nonnegative $d\times d$ matrix. The support of $\mu$ can be uniquely written as a disjoint union of subsets, called classes, $S_{\mu}=\bigsqcup_{k=1}^{l}C_{k}$, for some $l\geq 1$, such that $\mu(i,j)=0$ if $i,j\in S_{\mu}$ are in distinct classes, and such that, for each $1\leq k\leq l$, if we consider the restriction of the directed graph associated to $\mu$ to the vertices in the class $C_{k}$, then this directed graph is irreducible, in the sense that there is a directed path in the graph between any pair of vertices in $C_{k}$.

Given $\mu\in\mathcal{M}(S\times S)$ and a $d\times d$ matrix $M$ with nonnegative entries, we say $M$ is compatible with $\mu$ if $m(i,j)>0\Leftrightarrow\mu(i,j)>0$. Let $S_{\mu}=\bigsqcup_{k=1}^{l}C_{k}$ be the decomposition of the support of $\mu$ into classes. For each $1\leq k\leq l$, the restriction of $M$ to the coordinates in $C_{k}$ defines a $|C_{k}|\times|C_{k}|$ irreducible matrix with nonnegative entries. This matrix has an associated Perron-Frobenius eigenvalue, which we denote by $\lambda_{k}(M)$. We have $\lambda_{k}(M)>0$ for all $1\leq k\leq l$. We have $\rho(M)=\log\max_{1\leq k\leq l}\lambda_{k}(M)$. Also, for each $1\leq k\leq l$, the restriction of $M$ to the coordinates in $C_{k}$ has a left eigenvector associated to the eigenvalue $\lambda_{k}(M)$, which has all its coordinates strictly positive and is unique up to scaling, and also a right eigenvector associated to the eigenvalue $\lambda_{k}(M)$, which has all its coordinates strictly positive and is unique up to scaling.
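The identity $\rho(M)=\log\max_{k}\lambda_{k}(M)$ can be illustrated numerically. The sketch below rests on an assumption that is not confirmed by the text above: that the growth rate is the limit of $\frac{1}{n}\log\sum_{i,j}m^{(n)}(i,j)$. The example matrix has two classes, $\{1,2\}$ and $\{3\}$, and the function names are illustrative.

```python
import numpy as np

def growth_rate(M):
    """log of the spectral radius of a nonnegative matrix, or -inf if the graph has no cycle."""
    d = M.shape[0]
    if not np.any(np.linalg.matrix_power(M, d)):
        return -np.inf  # M^d = 0 exactly when the associated directed graph is acyclic
    return float(np.log(np.max(np.abs(np.linalg.eigvals(M)))))

def empirical_growth_rate(M, n):
    """(1/n) log of the sum of the entries of M^n (assumed form of the defining limit)."""
    return float(np.log(np.linalg.matrix_power(M, n).sum()) / n)

# Block-diagonal example: class {1, 2} has Perron-Frobenius eigenvalue 3, class {3} has 0.5.
M = np.array([[1.0, 2.0, 0.0],
              [3.0, 0.0, 0.0],
              [0.0, 0.0, 0.5]])
N = np.array([[0.0, 1.0],
              [0.0, 0.0]])   # single edge, no directed cycle

print(growth_rate(M), empirical_growth_rate(M, 200))
print(growth_rate(N))
```

The finite-$n$ value differs from the limit by a term of order $1/n$, so only approximate agreement should be expected.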

Given $\nu\in\mathcal{M}(S\times S)$ , what we mean by the stationary Markov chain defined by $\nu$ is the following: for each $n\geq 1$ define a probability distribution $\nu_{n}$ on $(S^{n},\mathcal{F}_{n})$ , where $\mathcal{F}_{n}$ is comprised of all subsets of $S^{n}$ , by setting

[TABLE]

It is straightforward to check that for all $n\geq 2$ and $\nu,\theta\in\mathcal{M}(S\times S)$ we have

[TABLE]

The following fact, which will be very useful later, is easy to verify from the definitions. It holds for all $\nu,\theta\in\mathcal{M}(S\times S)$ .

[TABLE]

where on the RHS of this definition the notation $D(\nu_{n}\|\theta_{n})$ refers to the relative entropy between probability distributions on $(S^{n},\mathcal{F}_{n})$.

We are now in a position where we can state the analog for stationary finite state Markov chains of the elementary variational formula (2). Let $G=\left[g(i,j)\right]\in\mathbb{R}^{d\times d}$ and $\mu\in\mathcal{M}(S\times S)$ . We have the following variational characterization of the growth rate of the exponential integral of $G$ along the stationary Markov chain defined by $\mu$ .

[TABLE]

The proof is in Appendix D. The result is standard, being Varadhan’s characterization of the spectral radius of nonnegative matrices, see e.g. [5, Exer. 3.1.19].

We are also in a position to define the Rényi divergence rates between two stationary finite state Markov chains. This definition is classical, see e.g. the paper of Rached, Alajaji, and Campbell [9], which also considers the nonstationary case, and the references therein. Given $\nu,\theta\in\mathcal{M}(S\times S)$ and $\alpha\in\mathbb{R}\backslash\{0,1\}$ , we define the Rényi divergence rate of $\nu$ with respect to $\theta$ , denoted $R_{\alpha}(\nu\|\theta)$ , by

[TABLE]

where on the RHS of this definition the notation $R_{\alpha}(\nu_{n}\|\theta_{n})$ refers to the Rényi divergence between probability distributions on $(S^{n},\mathcal{F}_{n})$ defined as in (3) and (4). The proofs of the existence of the limit in (18) as well as of the properties of the Rényi divergence rate of interest to us, which are stated in the following proposition, are in Appendix E.

Proposition 1.

Given $\nu,\theta\in\mathcal{M}(S\times S)$ , the Rényi divergence rate, as defined in (18), satisfies the following properties:

[TABLE]

and

[TABLE]

7 Main results in the Markov case

Our first main result in the Markov case is the following variational characterization of the Rényi divergence rate, which is a direct analog of Theorem 1.

Theorem 2.

Let $\alpha\in\mathbb{R}\backslash\{0,1\}$ and $\nu,\theta\in\mathcal{M}(S\times S)$ . Then, if $\alpha>1$ , we have

[TABLE]

while, if $0<\alpha<1$ , we have

[TABLE]

and, if $\alpha<0$ , we have

[TABLE]

Further, one can find $\mu\in\mathcal{M}(S\times S)$ achieving the extremum on the RHS in all three cases, except in the case where $0<\alpha<1$ and $\{\mu\in\mathcal{M}(S\times S)~{}:~{}\mu\preceq\nu,\mu\preceq\theta\}$ is empty. $\Box$

Our second main result in the Markov case is the following analog of the variational formula of [3].

Theorem 3.

For any $\alpha\in\mathbb{R}\backslash\{0,1\}$ , $\nu\in\mathcal{M}(S\times S)$ , and $G=\left[g(i,j)\right]\in\mathbb{R}^{d\times d}$ , we have

[TABLE]

and for any $\alpha\in\mathbb{R}\backslash\{0,1\}$ , $\theta\in\mathcal{M}(S\times S)$ , and $G=\left[g(i,j)\right]\in\mathbb{R}^{d\times d}$ we have

[TABLE]

It is straightforward to exhibit the equivalence of the claims in (22) and (23). This is done in Appendix F. It therefore suffices to focus only on the form in (23). It is straightforward to show that for each $\theta\in\mathcal{M}(S\times S)$ and $G\in\mathbb{R}^{d\times d}$ , one can find $\nu\in\mathcal{M}(S\times S)$ achieving the supremum on the RHS of (23). Appendix F also contains a demonstration of this fact. A proof of Theorem 3, assuming the truth of Theorem 2, and using (6), is also provided in Appendix F.

8 Proof of Theorem 2

Suppose $\alpha>1$ . If $\nu\npreceq\theta$ , taking $\mu=\nu$ on the RHS of (19) makes the RHS equal $\infty$ , which is also the value of the LHS. We may therefore assume that $\nu\preceq\theta$ .

Let $M:=\left[\nu(j|i)^{\alpha}\theta(j|i)^{1-\alpha}\right]$ . This matrix is compatible with $\nu$ . Let $S_{\nu}=\cupdot_{k=1}^{l}C_{k}$ be the decomposition of the support of $\nu$ into classes. We may choose the indexing of the classes in such a way that $\rho(M)=\log\lambda_{1}(M)$ .

Let $u$ be a $1\times d$ row vector whose entries are zero in the coordinates that are not in $C_{1}$ , while its restriction to $C_{1}$ is a nonzero left eigenvector of the restriction of $M$ to $C_{1}$ . All the entries of $u$ in the coordinates in $C_{1}$ are strictly positive. Similarly, let $w$ be a $d\times 1$ column vector whose entries are zero in the coordinates that are not in $C_{1}$ , while its restriction to $C_{1}$ is a nonzero right eigenvector of the restriction of $M$ to $C_{1}$ . All the entries of $w$ in the coordinates in $C_{1}$ will be strictly positive. For $i,j\in S$ , we define

[TABLE]

where $Z:=\sum_{i,j\in S}u(i)\nu(j|i)^{\alpha}\theta(j|i)^{1-\alpha}w(j)$ , which is strictly positive. Note that $\mu\in\mathcal{M}(S\times S)$ and $\mu\preceq\nu$ . We also have, for all $i\in S$ ,

[TABLE]

so we get

[TABLE]

where we have used the fact that $S_{\mu}=C_{1}$ .

Multiplying the RHS of (19) by $\alpha(\alpha-1)$ for this choice of $\mu$ gives

[TABLE]

which also equals $\alpha(\alpha-1)$ times the LHS of (19). This establishes the existence of $\mu\in\mathcal{M}(S\times S)$ satisfying $\mu\preceq\nu$ and achieving equality in (19).
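The construction of the optimizing $\mu$ can be checked numerically. The sketch below assumes that, for $\alpha>1$, the quantity being maximized in (19), multiplied by $\alpha(\alpha-1)$, is $(\alpha-1)D(\mu\|\theta)-\alpha D(\mu\|\nu)$, and that the relative entropy rate has the closed form $\sum_{i,j}\mu(i,j)\log\frac{\mu(j|i)}{\nu(j|i)}$ when all entries are positive. Both are assumptions made for illustration, as are the random transition matrices and the helper names.

```python
import numpy as np

def stationary_pair_measure(P):
    """Return nu(i, j) = pi(i) P(i, j), with pi the stationary distribution of P."""
    vals, vecs = np.linalg.eig(P.T)
    pi = np.abs(np.real(vecs[:, np.argmax(np.real(vals))]))
    pi /= pi.sum()
    return pi[:, None] * P

def relative_entropy_rate(mu, nu):
    """sum_{i,j} mu(i,j) log(mu(j|i) / nu(j|i)); assumes strictly positive entries."""
    mu_c = mu / mu.sum(axis=1, keepdims=True)
    nu_c = nu / nu.sum(axis=1, keepdims=True)
    return float(np.sum(mu * np.log(mu_c / nu_c)))

def perron(M):
    """Perron eigenvalue and positive left/right eigenvectors of a positive matrix."""
    vals, right = np.linalg.eig(M)
    k = int(np.argmax(np.real(vals)))
    vals_l, left = np.linalg.eig(M.T)
    kl = int(np.argmax(np.real(vals_l)))
    return float(np.real(vals[k])), np.abs(np.real(left[:, kl])), np.abs(np.real(right[:, k]))

rng = np.random.default_rng(0)
d, alpha = 3, 2.5
P = rng.random((d, d)) + 0.1
P /= P.sum(axis=1, keepdims=True)          # nu(j|i)
Q = rng.random((d, d)) + 0.1
Q /= Q.sum(axis=1, keepdims=True)          # theta(j|i)
nu, theta = stationary_pair_measure(P), stationary_pair_measure(Q)

M = P**alpha * Q**(1 - alpha)
lam, u, w = perron(M)
mu = u[:, None] * M * w[None, :]
mu /= mu.sum()                              # mu(i,j) = u(i) M(i,j) w(j) / Z

lhs = np.log(lam)                           # alpha (alpha - 1) R_alpha(nu || theta)
rhs = (alpha - 1) * relative_entropy_rate(mu, theta) - alpha * relative_entropy_rate(mu, nu)
print(lhs, rhs)
```

In this example all entries are positive, so there is a single class and `mu` has full support; the two printed numbers agree up to floating-point error, and the row and column marginals of `mu` coincide.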

It remains to check that for all $\mu\in\mathcal{M}(S\times S)$ satisfying $\mu\preceq\nu$ we have the inequality

[TABLE]

But, in view of (15), in (5) applied to probability distributions on $(S^{n},\mathcal{F}_{n})$ , for $n\geq 2$ , we have already proved that

[TABLE]

Dividing by $n$ , letting $n\to\infty$ , and appealing to (16) establishes (24).

Next, consider the case where $0<\alpha<1$ . If the directed graph associated to the matrix $M^{\prime}:=\left[\nu(j|i)^{\alpha}\theta(j|i)^{1-\alpha}\right]$ has no cycles, then $R_{\alpha}(\nu\|\theta)=\infty$ , and $\{\mu\in\mathcal{M}(S\times S)~{}:~{}\mu\preceq\nu,\mu\preceq\theta\}=\emptyset$ , so the RHS of (20) is also $\infty$ , and so (20) holds in this case. We may therefore assume that $\{\mu\in\mathcal{M}(S\times S)~{}:~{}\mu\preceq\nu,\mu\preceq\theta\}$ is nonempty. Pick any $\tau\in\mathcal{M}(S\times S)$ that is a maximum element among all the elements of $\mathcal{M}(S\times S)$ that are absolutely continuous with respect to $M^{\prime}$ . Let $M:=\left[\nu(j|i)^{\alpha}\theta(j|i)^{1-\alpha}1(i,j\in S_{\tau})\right]$ . Then $\rho(M^{\prime})=\rho(M)$ . Further, $M$ is compatible with $\tau$ .

Let $S_{\tau}=\cupdot_{k=1}^{l}C_{k}$ be the decomposition of the support of $\tau$ into classes. We may choose the indexing of the classes in such a way that $\rho(M)=\log\lambda_{1}(M)$ .

Let $u$ be a $1\times d$ row vector whose entries are zero in the coordinates that are not in $C_{1}$ , while its restriction to $C_{1}$ is a nonzero left eigenvector of the restriction of $M$ to $C_{1}$ . All the entries of $u$ in the coordinates in $C_{1}$ are strictly positive. Similarly, let $w$ be a $d\times 1$ column vector whose entries are zero in the coordinates that are not in $C_{1}$ , while its restriction to $C_{1}$ is a nonzero right eigenvector of the restriction of $M$ to $C_{1}$ . All the entries of $w$ in the coordinates in $C_{1}$ will be strictly positive. For $i,j\in S$ , we define

[TABLE]

where $Z:=\sum_{i,j\in S}u(i)\nu(j|i)^{\alpha}\theta(j|i)^{1-\alpha}w(j)$ , which is strictly positive. Note that $\mu\in\mathcal{M}(S\times S)$ and $\mu\preceq\tau$ , so $\mu\preceq\nu$ and $\mu\preceq\theta$ . We also have, for all $i\in S$ ,

[TABLE]

so we get

[TABLE]

where we have used the fact that $S_{\mu}=C_{1}$ .

Multiplying the RHS of (20) by $\alpha(1-\alpha)$ for this choice of $\mu$ gives

[TABLE]

which also equals $\alpha(1-\alpha)$ times the LHS of (20). This establishes the existence of $\mu\in\mathcal{M}(S\times S)$ satisfying $\mu\preceq\nu$ and $\mu\preceq\theta$ and achieving equality in (20).

It remains to check that for all $\mu\in\mathcal{M}(S\times S)$ satisfying $\mu\preceq\nu$ and $\mu\preceq\theta$ we have the inequality

[TABLE]

But, in view of (15), in (6) applied to probability distributions on $(S^{n},\mathcal{F}_{n})$ , for $n\geq 2$ , we have already proved that

[TABLE]

Dividing by $n$ , letting $n\to\infty$ , and appealing to (16) establishes (25).
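The inequality direction for $0<\alpha<1$ can be probed numerically by sampling candidate measures. The sketch below assumes that the objective in (20) is $\frac{1}{\alpha}D(\mu\|\theta)+\frac{1}{1-\alpha}D(\mu\|\nu)$ and uses the closed form of the relative entropy rate for strictly positive chains; the random chains, the sample size, and all helper names are illustrative assumptions.

```python
import numpy as np

def random_transition_matrix(rng, d):
    """Strictly positive row-stochastic matrix."""
    P = rng.random((d, d)) + 0.1
    return P / P.sum(axis=1, keepdims=True)

def stationary_pair_measure(P):
    """Return mu(i, j) = pi(i) P(i, j), with pi the stationary distribution of P."""
    vals, vecs = np.linalg.eig(P.T)
    pi = np.abs(np.real(vecs[:, np.argmax(np.real(vals))]))
    pi /= pi.sum()
    return pi[:, None] * P

def relative_entropy_rate(mu, nu):
    """sum_{i,j} mu(i,j) log(mu(j|i) / nu(j|i)); assumes strictly positive entries."""
    mu_c = mu / mu.sum(axis=1, keepdims=True)
    nu_c = nu / nu.sum(axis=1, keepdims=True)
    return float(np.sum(mu * np.log(mu_c / nu_c)))

rng = np.random.default_rng(3)
d, alpha = 3, 0.3
P, Q = random_transition_matrix(rng, d), random_transition_matrix(rng, d)
nu, theta = stationary_pair_measure(P), stationary_pair_measure(Q)

# Renyi divergence rate from the Perron eigenvalue of [nu(j|i)^alpha theta(j|i)^(1-alpha)].
M = P**alpha * Q**(1 - alpha)
lam = float(np.max(np.real(np.linalg.eigvals(M))))
renyi_rate = float(np.log(lam) / (alpha * (alpha - 1)))

def objective(mu):
    return relative_entropy_rate(mu, theta) / alpha + relative_entropy_rate(mu, nu) / (1 - alpha)

# Evaluate the objective at randomly drawn stationary pair measures.
values = [objective(stationary_pair_measure(random_transition_matrix(rng, d)))
          for _ in range(500)]
gap = min(values) - renyi_rate
print(renyi_rate, min(values), gap)
```

Every sampled candidate gives a value at least as large as the Rényi divergence rate, consistent with the rate being an infimum over $\mu$ in this range of $\alpha$. Random sampling only illustrates the inequality and does not locate the minimizer.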

It remains to consider the case $\alpha<0$ . Let $\beta:=1-\alpha$ . Then $\beta>1$ . By definition $R_{\alpha}(\nu\|\theta)=R_{\beta}(\theta\|\nu)$ . However, we have already proved that

[TABLE]

This reads

[TABLE]

which establishes (21) in this case also and completes the proof of Theorem 2.

9 Concluding remarks

We have given a variational characterization of Rényi divergence between two arbitrary probability distributions on an arbitrary measurable space in terms of relative entropies, for all values of the parameter defining the Rényi divergence. We also gave a variational characterization of the Rényi divergence rate between two stationary finite state Markov chains in terms of relative entropy rates, for all values of the parameter defining the Rényi divergence rate. A consequence of the latter development was an analog of the variational formula of [3] for stationary finite state Markov chains.

While we restricted ourselves to stationary finite state Markov chains in the latter discussion, it is to be expected that there will be versions of this variational characterization of Rényi divergence rate in a much broader setting involving Markov or $k$ -th order Markov processes in discrete time, and also in continuous time. It would also be interesting to consider to what extent such a variational characterization might generalize to the Rényi divergence rates between an arbitrary pair of stationary processes, assuming the existence of the defining limit to start with, since even the understanding of the relative entropy rate at this level of generality is somewhat limited [7].

Acknowledgments

Thanks to Vivek Borkar and Payam Delgosha for their comments on an earlier draft of this document.

Appendix A Proof of the elementary variational formula in (2)

The second equality in (2) follows from the fact that $D(\theta\|\mu)=\infty$ if $\theta\npreceq\mu$ .

Given $\mu\in\mathcal{P}(S)$ and $g\in\mathcal{B}(S)$ , define $\theta\in\mathcal{P}(S)$ by $d\theta=\frac{1}{Z}e^{g}d\mu$ , where $Z:=\int_{S}e^{g}d\mu$ . Note that $\theta\preceq\mu$ . Then

[TABLE]

which also equals the LHS of (2).

It remains to show that for all $\theta\preceq\mu$ we have

[TABLE]

Let $\theta^{\prime}:=\frac{d\theta}{d\mu}$ . We have

[TABLE]

where the second step is justified by the concavity of the logarithm. This completes the proof. $\Box$
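On a finite set the argument above can be verified directly. The following sketch assumes (2) has the standard form $\log\int_{S}e^{g}d\mu=\sup_{\theta}\left(\int_{S}g\,d\theta-D(\theta\|\mu)\right)$, and checks both that the tilted measure $d\theta=\frac{1}{Z}e^{g}d\mu$ attains the value and that randomly drawn $\theta$ never exceed it. The five-point space and the random draws are illustrative choices.

```python
import numpy as np

def relative_entropy(theta, mu):
    """D(theta || mu) on a finite set, assuming theta is absolutely continuous w.r.t. mu."""
    mask = theta > 0
    return float(np.sum(theta[mask] * np.log(theta[mask] / mu[mask])))

rng = np.random.default_rng(1)
mu = rng.random(5)
mu /= mu.sum()                       # reference distribution with full support
g = rng.normal(size=5)               # bounded function on S

log_Z = float(np.log(np.sum(np.exp(g) * mu)))

# Tilted measure d(theta) = e^g d(mu) / Z attains the supremum.
theta_star = np.exp(g) * mu / np.exp(log_Z)
value_at_opt = float(theta_star @ g) - relative_entropy(theta_star, mu)

# Any other probability vector gives a value no larger than log Z.
candidates = rng.dirichlet(np.ones(5), size=1000)
values = [float(t @ g) - relative_entropy(t, mu) for t in candidates]
print(log_Z, value_at_opt, max(values))
```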

Appendix B Proof that the supremum in (9) is achieved

Given $\theta\in\mathcal{P}(S)$ and $g\in\mathcal{B}(S)$ , let $\nu\in\mathcal{P}(S)$ be defined by $\frac{1}{Z}e^{-g}d\nu=d\theta$ , where $Z:=\frac{1}{\int e^{g}d\theta}$ . Note that $\nu$ and $\theta$ are mutually absolutely continuous.

Thus, for all $\alpha>0$ , $\alpha\neq 1$ , we have

[TABLE]

On the other hand

[TABLE]

which is the same.

Suppose now that $\alpha<0$ . Let $\beta:=1-\alpha$ . Then $\beta>1$ . For any $\theta\in\mathcal{P}(S)$ and $g\in\mathcal{B}(S)$ , let $\nu\in\mathcal{P}(S)$ be defined by $\frac{1}{Z}e^{-g}d\nu=d\theta$ . Then $\frac{1}{W}e^{-h}d\theta=d\nu$ , where $h:=-g$ and $W=\frac{1}{\int_{S}e^{h}d\nu}=\frac{1}{Z}$ . We have then already proved that

[TABLE]

which completes the proof. $\Box$

Appendix C Proof of (10)

Consider first the case $\alpha>1$ . We may then assume that $\nu\preceq\theta$ , since otherwise the right hand side of (10) is $-\infty$ . From (2), we have, for all $\mu\in\mathcal{P}(S)$ such that $\mu\preceq\nu$ that

[TABLE]

From (5) we have

[TABLE]

which means that

[TABLE]

Taking the supremum over $\mu\preceq\nu$ on the RHS of the preceding equation and using (2) gives

[TABLE]

which was to be shown.

Next, suppose $0<\alpha<1$ . Given $g\in\mathcal{B}(S)$ and $\nu,\theta\in\mathcal{P}(S)$ , if $\{\nu^{\prime}\theta^{\prime}>0\}=_{\eta}\emptyset$ for some (and hence every) $\eta\in\mathcal{P}(S)$ such that $\nu\preceq\eta$ and $\theta\preceq\eta$ (where $\nu^{\prime}:=\frac{d\nu}{d\eta}$ and $\theta^{\prime}:=\frac{d\theta}{d\eta}$ ), then $R_{\alpha}(\nu\|\theta)=\infty$ , and so (10) is true. Otherwise, we can find $\mu\in\mathcal{P}(S)$ such that $\mu\preceq\nu$ and $\mu\preceq\theta$ . We know from the elementary variational formula (2) that for every $\mu\in\mathcal{P}(S)$ we have

[TABLE]

and

[TABLE]

where $h:=-g$ . Hence

[TABLE]

But, from Theorem 1, we know that there exists $\mu\in\mathcal{P}(S)$ for which the RHS of the preceding equation equals $-R_{\alpha}(\nu\|\theta)$ . This shows that

[TABLE]

which establishes (10) in this case.

It remains to consider the case $\alpha<0$ . Let $\beta:=1-\alpha$ , so $\beta>1$ . We have already proved that

[TABLE]

where $h:=-g$ . Observing that $R_{\beta}(\theta\|\nu)=R_{\alpha}(\nu\|\theta)$ , this can be rewritten as

[TABLE]

which is (10) in this case, and completes the proof. $\Box$

Appendix D Proof of (6)

The second equality in (6) follows from the fact that $D(\theta\|\mu)=\infty$ if $\theta\npreceq\mu$ .

Given $\mu\in\mathcal{M}(S\times S)$ and $G=\left[g(i,j)\right]\in\mathbb{R}^{d\times d}$ , the matrix $M:=\left[e^{g(i,j)}\mu(j|i)\right]$ has nonnegative entries and is compatible with $\mu$ , so $\rho(M)$ , i.e. the LHS of (6), is finite. Let $S_{\mu}=\cupdot_{k=1}^{l}C_{k}$ be the decomposition of the support of $\mu$ into classes. We may choose the indexing of the classes in such a way that $\rho(M)=\log\lambda_{1}(M)$ .

Let $u$ be a $1\times d$ row vector whose entries are zero in the coordinates that are not in $C_{1}$ , while its restriction to $C_{1}$ is a nonzero left eigenvector of the restriction of $M$ to $C_{1}$ . Note that all the entries of $u$ in the coordinates in $C_{1}$ are strictly positive. Similarly, let $w$ be a $d\times 1$ column vector whose entries are zero in the coordinates that are not in $C_{1}$ , while its restriction to $C_{1}$ is a nonzero right eigenvector of the restriction of $M$ to $C_{1}$ . All the entries of $w$ in the coordinates in $C_{1}$ will be strictly positive. For $i,j\in S$ , we define

[TABLE]

where $Z:=\sum_{i,j\in S}u(i)e^{g(i,j)}\mu(j|i)w(j)$ , which is strictly positive. Note that $\theta\in\mathcal{M}(S\times S)$ and $\theta\preceq\mu$ . We also have, for all $i\in S$ ,

[TABLE]

so we get

[TABLE]

where we have used the fact that $S_{\theta}=C_{1}$ .

We may now compute

[TABLE]

which also equals the LHS of (6). This establishes that for each $\mu\in\mathcal{M}(S\times S)$ and $G=\left[g(i,j)\right]\in\mathbb{R}^{d\times d}$ there exists $\theta\in\mathcal{M}(S\times S)$ achieving equality in (6).

It remains to show that for all $\theta\in\mathcal{M}(S\times S)$ such that $\theta\preceq\mu$ we have

[TABLE]

But, using (2) applied to the probability distribution $\mu_{n}$ on $(S^{n},\mathcal{F}_{n})$ , for $n\geq 2$ , with $g(i_{1},\ldots,i_{n}):=\sum_{k=1}^{n-1}g(i_{k},i_{k+1})$ , we have already proved that

[TABLE]

Divide both sides by $n$ and take the limit as $n\to\infty$ . Appealing to (16) and the definition of the growth rate in (14) proves (26). This completes the proof of (6). $\Box$
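Both halves of this proof can be illustrated numerically. The sketch below assumes that the growth rate in (14) is $\lim_{n\to\infty}\frac{1}{n}\log\sum\mu_{n}(i_{1},\ldots,i_{n})\exp\left(\sum_{k=1}^{n-1}g(i_{k},i_{k+1})\right)$ and that the relative entropy rate has its usual closed form for strictly positive chains; the random chain, the matrix `G`, and the helper names are illustrative.

```python
import numpy as np

def perron(M):
    """Perron eigenvalue and positive left/right eigenvectors of a positive matrix."""
    vals, right = np.linalg.eig(M)
    k = int(np.argmax(np.real(vals)))
    vals_l, left = np.linalg.eig(M.T)
    kl = int(np.argmax(np.real(vals_l)))
    return float(np.real(vals[k])), np.abs(np.real(left[:, kl])), np.abs(np.real(right[:, k]))

rng = np.random.default_rng(2)
d = 3
P = rng.random((d, d)) + 0.1
P /= P.sum(axis=1, keepdims=True)     # mu(j|i)
_, pi, _ = perron(P)
pi /= pi.sum()                        # stationary marginal: pi P = pi
G = rng.normal(size=(d, d))           # g(i, j)

M = np.exp(G) * P                     # [e^{g(i,j)} mu(j|i)]
lam, u, w = perron(M)

def log_exponential_integral(n):
    """log of pi^T M^(n-1) 1, accumulated with renormalization to avoid overflow."""
    v, log_total = pi.copy(), 0.0
    for _ in range(n - 1):
        v = v @ M
        s = v.sum()
        log_total += np.log(s)
        v /= s
    return float(log_total)

# Candidate optimizer theta(i,j) = u(i) e^{g(i,j)} mu(j|i) w(j) / Z.
theta = u[:, None] * M * w[None, :]
theta /= theta.sum()
theta_c = theta / theta.sum(axis=1, keepdims=True)
value = float(np.sum(theta * G) - np.sum(theta * np.log(theta_c / P)))

n = 5000
print(np.log(lam), value, log_exponential_integral(n) / n)
```

The value of the objective at `theta` matches $\log\lambda_{1}(M)$ up to rounding, while the finite-$n$ normalized log-integral approaches it at rate $O(1/n)$.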

Appendix E Proof of the existence of the limit in (18), and of Proposition 1

Suppose $\alpha>1$ and $\nu\npreceq\theta$ . Then $\nu_{n}\npreceq\theta_{n}$ for all $n\geq 2$ and so the limit on the RHS of (18) exists and equals $\infty$ , as claimed in Proposition 1.

If $\alpha>1$ and $\nu\preceq\theta$ , then $\nu_{n}\preceq\theta_{n}$ for all $n\geq 2$ , and so

[TABLE]

This is also the formula for $R_{\alpha}(\nu_{n}\|\theta_{n})$ when $0<\alpha<1$ , irrespective of whether $\nu\preceq\theta$ or not. It follows from the definition of the growth rate in (14) that the limit on the RHS of (18) exists and equals $\frac{1}{\alpha(\alpha-1)}\rho(\left[\nu(j|i)^{\alpha}\theta(j|i)^{1-\alpha}\right])$ , as claimed in Proposition 1.
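The convergence of $\frac{1}{n}R_{\alpha}(\nu_{n}\|\theta_{n})$ to $\frac{1}{\alpha(\alpha-1)}\rho(\left[\nu(j|i)^{\alpha}\theta(j|i)^{1-\alpha}\right])$ can be observed on a small example. The sketch below assumes $R_{\alpha}(\nu_{n}\|\theta_{n})=\frac{1}{\alpha(\alpha-1)}\log\sum\nu_{n}^{\alpha}\theta_{n}^{1-\alpha}$ with the sum over $S^{n}$, and the usual product form of $\nu_{n}$ and $\theta_{n}$; the two-state chains and the helper names are hypothetical.

```python
import itertools
import numpy as np

def stationary_distribution(P):
    """Left eigenvector of P for eigenvalue 1, normalized to a probability vector."""
    vals, vecs = np.linalg.eig(P.T)
    v = np.abs(np.real(vecs[:, np.argmax(np.real(vals))]))
    return v / v.sum()

def path_probability(path, pi, P):
    p = pi[path[0]]
    for a, b in zip(path, path[1:]):
        p *= P[a, b]
    return float(p)

alpha = 0.5
P = np.array([[0.9, 0.1], [0.4, 0.6]])     # nu(j|i)
Q = np.array([[0.5, 0.5], [0.2, 0.8]])     # theta(j|i)
pi_nu, pi_theta = stationary_distribution(P), stationary_distribution(Q)

M = P**alpha * Q**(1 - alpha)
a = pi_nu**alpha * pi_theta**(1 - alpha)

def renyi_n(n):
    """R_alpha(nu_n || theta_n) via a^T M^(n-1) 1, with renormalization."""
    v, log_total = a / a.sum(), float(np.log(a.sum()))
    for _ in range(n - 1):
        v = v @ M
        s = v.sum()
        log_total += float(np.log(s))
        v /= s
    return log_total / (alpha * (alpha - 1))

def renyi_n_bruteforce(n):
    """Same quantity by explicit enumeration of all paths in S^n."""
    total = sum(path_probability(p, pi_nu, P)**alpha * path_probability(p, pi_theta, Q)**(1 - alpha)
                for p in itertools.product(range(2), repeat=n))
    return float(np.log(total)) / (alpha * (alpha - 1))

rate = float(np.log(np.max(np.abs(np.linalg.eigvals(M)))) / (alpha * (alpha - 1)))
print(renyi_n(4), renyi_n_bruteforce(4), renyi_n(5000) / 5000, rate)
```

The matrix-product evaluation agrees with brute-force enumeration for small $n$, and the normalized divergence approaches the spectral expression as $n$ grows.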

Finally, suppose $\alpha<0$ . Let $\beta:=1-\alpha$ . Then we have $\beta>1$ . We have therefore already proved that $\lim_{n\to\infty}\frac{1}{n}R_{\beta}(\theta_{n}\|\nu_{n})$ exists and equals $R_{1-\alpha}(\theta\|\nu)$ , as given in Proposition 1. But $R_{\beta}(\theta_{n}\|\nu_{n})$ equals $R_{\alpha}(\nu_{n}\|\theta_{n})$ . Therefore the limit on the RHS of (18) exists, and since this is what we call $R_{\alpha}(\nu\|\theta)$ it must be the case that $R_{\alpha}(\nu\|\theta)$ equals $R_{1-\alpha}(\theta\|\nu)$ , as claimed in Proposition 1. This completes the proof. $\Box$

Appendix F Proof of Theorem 3 assuming the truth of Theorem 2 and using (6), and proofs of the two claims about (23)

We first verify the truth of the two claims about (23) which were made just after the statement of Theorem 3.

To exhibit the equivalence of the two forms (22) and (23) appearing in Theorem 3, assume, for instance, the truth of (22). Let $\beta:=1-\alpha$ and $H=\left[h(i,j)\right]=-G$ , and conclude that for all $\beta\in\mathbb{R}\backslash\{0,1\}$ , $\nu\in\mathcal{M}(S\times S)$ , and $H\in\mathbb{R}^{d\times d}$ we have

[TABLE]

or equivalently that

[TABLE]

which is (23). One can similarly go in the opposite direction.

To verify that the supremum on the RHS of (23) is achieved, given $\theta\in\mathcal{M}(S\times S)$ , $G=\left[g(i,j)\right]\in\mathbb{R}^{d\times d}$ , and $\alpha\in\mathbb{R}\backslash\{0,1\}$ , observe that $N:=\left[e^{\alpha g(i,j)}\theta(j|i)\right]$ is compatible with $\theta$ . Let $S_{\theta}=\cupdot_{k=1}^{l}C_{k}$ be the decomposition of the support of $\theta$ into classes. We may choose the indexing of the classes in such a way that $\rho(N)=\log\lambda_{1}(N)$ .

Let $M:=\left[e^{g(i,j)}\theta(j|i)\right]$ . Observe that $M$ is also compatible with $\theta$ . Let $u$ be a $1\times d$ row vector whose entries are zero in the coordinates that are not in $C_{1}$ , while its restriction to $C_{1}$ is a nonzero left eigenvector of the restriction of $M$ to $C_{1}$ . All the entries of $u$ in the coordinates in $C_{1}$ are strictly positive. Similarly, let $w$ be a $d\times 1$ column vector whose entries are zero in the coordinates that are not in $C_{1}$ , while its restriction to $C_{1}$ is a nonzero right eigenvector of the restriction of $M$ to $C_{1}$ . All the entries of $w$ in the coordinates in $C_{1}$ will be strictly positive. For $i,j\in S$ , we define

[TABLE]

where $Z:=\sum_{i,j\in S}u(i)e^{g(i,j)}\theta(j|i)w(j)$ , which is strictly positive. Note that $\nu\in\mathcal{M}(S\times S)$ and $\nu\preceq\theta$ . We also have, for all $i\in S$ ,

[TABLE]

so we get

[TABLE]

where we have used the fact that $S_{\nu}=C_{1}$ .

We now note that

[TABLE]

Then we have

[TABLE]

Here the first step can be seen by observing that the $w(i)^{\alpha}$ terms for $i\in C_{1}$ cancel each other out by successive cancellation in the definition of the growth rate as a limit. Equality in the second step depends on the fact that we have chosen $C_{1}$ such that $\rho(N)=\log\lambda_{1}(N)$ .

We also note that

[TABLE]

so we have

[TABLE]

Here the first step can be seen by observing that the $w(i)$ terms for $i\in C_{1}$ cancel each other out by successive cancellation in the definition of the growth rate as a limit, and equality in the second step depends on the fact that we have chosen $C_{1}$ such that $\rho(N)=\log\lambda_{1}(N)$ .

Since $\nu\preceq\theta$ , we have

[TABLE]

Multiplying (27) through by $\frac{1}{\alpha(\alpha-1)}$ and using (28) gives

[TABLE]

which demonstrates that $\nu$ works to show what was claimed.

In order to prove Theorem 3, it remains to show that for every $\theta,\nu\in\mathcal{M}(S\times S)$ , $G=\left[g(i,j)\right]\in\mathbb{R}^{d\times d}$ , and $\alpha\in\mathbb{R}\backslash\{0,1\}$ , we have

[TABLE]

We prove this, assuming the truth of Theorem 2, using (6). The proof is almost a verbatim copy of that in Appendix C, except that we are now dealing with the case of stationary finite state Markov chains rather than with the i.i.d. case.

Consider first the case $\alpha>1$ . We may then assume that $\nu\preceq\theta$ , since otherwise the right hand side of (29) is $-\infty$ . From (6), we have, for all $\mu\in\mathcal{M}(S\times S)$ such that $\mu\preceq\nu$ that

[TABLE]

From (19) we have

[TABLE]

which means that

[TABLE]

Taking the supremum over $\mu\preceq\nu$ on the RHS of the preceding equation and using (6) gives

[TABLE]

which was to be shown.

Next, suppose $0<\alpha<1$ . There is no $\mu\in\mathcal{M}(S\times S)$ such that $\mu\preceq\nu$ and $\mu\preceq\theta$ precisely when the directed graph associated to $\left[\nu(i,j)^{\alpha}\theta(i,j)^{1-\alpha}\right]$ has no cycles, and in this case $R_{\alpha}(\nu\|\theta)=\infty$ , so (29) is true. Therefore, we may assume that we can find $\mu\in\mathcal{M}(S\times S)$ such that $\mu\preceq\nu$ and $\mu\preceq\theta$ . We know from (6) that for every $\mu\in\mathcal{M}(S\times S)$ we have

[TABLE]

and

[TABLE]

where $h:=-g$ . Hence

[TABLE]

But, from Theorem 2, we know that there exists $\mu\in\mathcal{M}(S\times S)$ for which the RHS of the preceding equation equals $-R_{\alpha}(\nu\|\theta)$ . This shows that

[TABLE]

which establishes (29) in this case.

It remains to consider the case $\alpha<0$ . Let $\beta:=1-\alpha$ , so $\beta>1$ . We have already proved that

[TABLE]

where $h:=-g$ . Observing that $R_{\beta}(\theta\|\nu)=R_{\alpha}(\nu\|\theta)$ , this can be rewritten as

[TABLE]

which is (29) in this case, and completes the proof. $\Box$

Bibliography

[1]
[2]
[3] Rami Atar, Kenny Chowdhary, and Paul Dupuis. “Robust Bounds on Risk-Sensitive Functionals via Rényi Divergence”, SIAM/ASA Journal on Uncertainty Quantification, Vol. 3, pp. 18-33, 2015.
[4] Patrick Billingsley. Probability and Measure. Second Edition, John Wiley & Sons Inc., New York, 1986.
[5] Amir Dembo and Ofer Zeitouni. Large Deviations Techniques and Applications. Second Edition. Applications of Mathematics, Stochastic Modelling and Applied Probability, Vol. 38, Springer-Verlag, New York, 1998.
[6] Tim van Erven and Peter Harremoës. “Rényi Divergence and Kullback-Leibler Divergence”, IEEE Transactions on Information Theory, Vol. 60, No. 7, pp. 3797-3820, 2014.
[7] Robert M. Gray. Entropy and Information Theory. Second Edition, Springer Science + Business Media, New York, 2011.
[8] Jingbo Liu, Thomas A. Courtade, Paul Cuff, and Sergio Verdú. “Brascamp-Lieb Inequality and its Reverse: An Information Theoretic View”, Proceedings of the 2016 IEEE International Symposium on Information Theory, IEEE Press, pp. 1048-1052, 2016.
