A Variational Characterization of Rényi Divergences
Venkat Anantharam

TL;DR
This paper presents a new variational characterization of Rényi divergences between probability distributions and Markov chains, linking them to relative entropies and extending existing formulas.
Contribution
It develops a novel variational formula for Rényi divergences using relative entropies, applicable to both probability distributions and Markov chains.
Findings
Derived a variational formula for Rényi divergences between distributions.
Extended the variational characterization to stationary finite state Markov chains.
Connected the results with Varadhan's variational formula for spectral radius.
A VARIATIONAL CHARACTERIZATION OF RÉNYI DIVERGENCES
VENKAT ANANTHARAM
EECS Department, University of California, Berkeley, CA 94720, USA. Research supported in part by the National Science Foundation grants ECCS-1343398, CNS-1527846, CCF-1618145, the NSF Science & Technology Center grant CCF-0939370 (Science of Information), and the William and Flora Hewlett Foundation Center for Long Term Cybersecurity at Berkeley.
ABSTRACT: Atar, Chowdhary and Dupuis have recently exhibited a variational formula for exponential integrals of bounded measurable functions in terms of Rényi divergences. We develop a variational characterization of the Rényi divergences between two probability distributions on a measurable space in terms of relative entropies. When combined with the elementary variational formula for exponential integrals of bounded measurable functions in terms of relative entropy, this yields the variational formula of Atar, Chowdhary and Dupuis as a corollary.
We also develop an analogous variational characterization of the Rényi divergence rates between two stationary finite state Markov chains in terms of relative entropy rates. When combined with Varadhan’s variational characterization of the spectral radius of square matrices with nonnegative entries in terms of relative entropy, this yields an analog of the variational formula of Atar, Chowdhary and Dupuis in the framework of finite state Markov chains.
Key words: Markov chains; Relative entropy; Rényi divergence; Variational formulas.
1 Introduction
Evaluating how far away a given probability distribution is from another can be done in many ways. The Kullback-Leibler divergence or relative entropy, which is closely tied to Shannon’s notion of entropy, is one such measure prominent in statistical applications. It belongs to a larger family of divergences, the so-called Rényi divergences, which are closely tied to Rényi’s notion of entropy. Rényi divergences also have numerous applications in problems of interest in statistics and information theory, see [6] for a survey of some of their basic properties and some indication of their applications. The Rényi divergences, with a minor change in scaling relative to the definition in [6], are the topic of this article. We treat the Rényi divergences as parametrized by a real number $\alpha$, $\alpha \neq 0$, $\alpha \neq 1$.
We were prompted to write this document by reading a recent paper of Atar, Chowdhary and Dupuis [3], which provides a variational formula for exponential integrals of bounded measurable functions in terms of Rényi divergences. We show that the variational characterization in [3] is a simple consequence of a variational characterization for Rényi divergences in terms of relative entropies, which we also develop. For the case of probability distributions on a finite set, and in the range , , our variational characterization for Rényi divergences was developed by Shayevitz, [12] and [13, Thm. 1]. More recently, for mutually absolutely continuous probability distributions on a measurable space, in the case , , parts of this variational characterization appear in a paper of Sason, see [10, Lem. 4 and Cor. 2]. The ability to derive the variational formula of [3] from inequalities for the Rényi divergences in terms of relative entropies, in the case , is also remarked on in a recent paper of Liu, Courtade, Cuff, and Verdú [8, Sec. II-A]. To the best of our knowledge, however, a full treatment of this variational characterization of Rényi divergences in terms of relative entropies, covering an arbitrary pair of probability distributions on a measurable space and all possible values for , does not appear to be in the literature and so it seems worth writing down. It is also worth noting how easily the full variational formula of [3], in all cases, falls out of this variational characterization of Rényi divergences.
Section 2 presents the notational conventions and the definitions of the main quantities used in this document in the i.i.d. case. The main result in the i.i.d. case, Theorem 1, is stated in Section 3. The result of [3] that prompted this paper is presented in Section 4, and is derived there as a consequence of Theorem 1 and the elementary variational formula for exponential integrals in (2). Theorem 1 itself is proved in Section 5.
We then turn to a development of analogs of the preceding results in the case of stationary finite state Markov chains. Section 6 makes the necessary definitions and gathers some standard facts about the asymptotic properties of iterated powers of a square matrix with nonnegative entries, which we need for our discussion. It also contains the analog of the elementary variational formula in the context of finite state Markov chains, in (6), which is Varadhan’s variational characterization in terms of relative entropy of the spectral radius of square matrices with nonnegative entries. The main results in the case of stationary finite state Markov chains are stated in Section 7. These are Theorem 2, which gives a variational characterization of each Rényi divergence rate between two stationary finite state Markov chains in terms of relative entropy rates, and Theorem 3, which gives an analog of the variational formula of [3] in the context of finite state Markov chains. A proof of Theorem 3 assuming the truth of Theorem 2, and using (6), is also provided in this section. The proof of Theorem 2 is provided in Section 8. We end the paper in Section 9 with some thoughts about directions for future work.
In order to maintain the flow of the main exposition, the details of several proofs are relegated to appendices.
2 Setup
Let be a measurable space. denotes the set of bounded measurable real-valued functions and the set of probability measures on . For , is notation for being absolutely continuous with respect to , see [4, pg. 442] for the definition. If , then denotes the Radon-Nikodym derivative of with respect to ; any two choices of Radon-Nikodym derivative differ only on a -null set, see [4, Thm. 32.2]. The relative entropy of with respect to is defined by
[TABLE]
From the convexity of the function for nonnegative , one can check that .
Here, and in the rest of the paper, is notation for equality by definition. Logarithms can be assumed to be to the natural base. For two measurable functions and on , not necessarily bounded, and , denotes equality of and except possibly on an -null set. Similarly, for , denotes equality of and up to -null sets and denotes the containment of in up to -null sets.
The variational characterization in (2) below of exponential integrals of bounded measurable functions is elementary. For any and we have
[TABLE]
We provide a proof in Appendix A.
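The elementary variational formula can be checked numerically on a finite set. The following is a minimal sketch, assuming that (2) is the standard statement that the logarithm of the exponential integral of a bounded function equals the supremum, over probability distributions, of the integral of the function minus the relative entropy with respect to the reference distribution. The names `theta`, `g`, `nu_star` and the random test data are illustrative choices, not notation taken from the paper.

```python
import numpy as np

def kl(nu, theta):
    """Relative entropy D(nu || theta) on a finite set, natural logarithm."""
    nu, theta = np.asarray(nu, float), np.asarray(theta, float)
    if np.any((theta == 0) & (nu > 0)):
        return np.inf  # nu is not absolutely continuous with respect to theta
    m = nu > 0
    return float(np.sum(nu[m] * np.log(nu[m] / theta[m])))

rng = np.random.default_rng(0)
theta = rng.dirichlet(np.ones(5))   # reference distribution (hypothetical data)
g = rng.normal(size=5)              # a bounded function on the 5-point set

# Left-hand side: log of the exponential integral of g under theta.
log_exp_integral = float(np.log(np.sum(np.exp(g) * theta)))

# Exponentially tilted distribution: density w.r.t. theta proportional to exp(g).
nu_star = np.exp(g) * theta / np.sum(np.exp(g) * theta)
value_at_tilt = float(nu_star @ g - kl(nu_star, theta))

# Any other candidate distribution should give a value that is no larger.
candidates = rng.dirichlet(np.ones(5), size=1000)
best_random = max(float(nu @ g - kl(nu, theta)) for nu in candidates)

print(log_exp_integral, value_at_tilt, best_random)
```

In this sketch the tilted distribution attains the supremum up to rounding, while the randomly drawn candidates stay below it.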
For any , and , the Rényi divergence is defined as in eqn. (2.1) of [3], by first defining it for , , by
[TABLE]
where and , where is an arbitrary probability distribution such that and . It is straightforward to check that every choice of , subject to the absolute continuity conditions, results in the same value of the Rényi divergence. Then, for , we use the definition
[TABLE]
Remark 1.
Even though the definition of is broken up into cases above, a single formula would work, if suitably interpreted. One could write
[TABLE]
In this formula, if and , then because on this event, we are forced to interpret as being . A similar argument forces us to interpret as if and . Rather than requiring of the reader the mental gymnastics needed to keep track of such interpretations, we prefer to break the discussion up into cases.
Remark 2.
It is clear that (possibly ) if or . For , an application of Hölder’s inequality with and (so ) gives
[TABLE]
Hence we also have (possibly ) if . Note in particular that if , then for all .
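For readers who want to experiment, here is a small sketch of the Rényi divergence on a finite set. It assumes the scaling of [3], namely the logarithm of the sum of nu^alpha * theta^(1 - alpha) divided by alpha * (alpha - 1), and it assumes strictly positive distributions so that no absolute continuity issue arises. The function name `renyi` and the test data are hypothetical. The sketch checks nonnegativity and the symmetry that swaps the two arguments while replacing alpha by 1 - alpha, which is how the negative range of the parameter is handled.

```python
import numpy as np

def renyi(nu, theta, alpha):
    """Renyi divergence with the assumed 1 / (alpha (alpha - 1)) scaling.

    Both distributions are assumed strictly positive on a finite set.
    """
    if alpha in (0, 1):
        raise ValueError("alpha must differ from 0 and 1")
    nu, theta = np.asarray(nu, float), np.asarray(theta, float)
    s = np.sum(nu**alpha * theta**(1.0 - alpha))
    return float(np.log(s) / (alpha * (alpha - 1.0)))

rng = np.random.default_rng(1)
nu = rng.dirichlet(np.ones(4))      # hypothetical data
theta = rng.dirichlet(np.ones(4))   # hypothetical data

alphas = [-2.0, -0.5, 0.3, 0.7, 1.5, 3.0]
values = [renyi(nu, theta, a) for a in alphas]
swapped = [renyi(theta, nu, 1.0 - a) for a in alphas]

for a, v, w in zip(alphas, values, swapped):
    print(f"alpha={a:5.2f}  R(nu||theta)={v:.6f}  R_(1-alpha)(theta||nu)={w:.6f}")
```

Under these assumptions every printed value is nonnegative and the two columns agree.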
3 Statement of the main result in the i.i.d. case
Our main result in the i.i.d. case is the following variational characterization of Rényi divergence.
Theorem 1.
Let and . Then, if , we have
[TABLE]
while, if , we have
[TABLE]
and, if , we have
[TABLE]
Further, when , one can find , , , achieving the infimum on the RHS of (6), whenever is nonempty.
Remark 3.
The case by case structure of this result is partly a consequence of the normalization chosen for the Rényi divergences (which is necessary to make Rényi divergence nonnegative) and partly a consequence of the need to apply the correct absolute continuity conditions. If it is considered desirable to write a single formula covering all cases, this can be done by considering , for . Then one has the single formula
[TABLE]
for all . Note, however, that the set over which the supremum is being taken need not be convex in general. This is essential to avoid encountering expressions of the form .
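The variational characterization of Theorem 1 can be illustrated numerically on a finite set. The sketch below makes several assumptions that are not taken verbatim from the paper: the 1 / (alpha (alpha - 1)) scaling of [3]; strictly positive distributions; and an objective of the form D(mu || theta) / alpha - D(mu || nu) / (alpha - 1), which is a supremum over mu when alpha > 1 or alpha < 0 and an infimum when 0 < alpha < 1. The candidate optimizer, proportional to nu^alpha * theta^(1 - alpha), and all names are illustrative.

```python
import numpy as np

def kl(mu, p):
    """D(mu || p) for strictly positive distributions on a finite set."""
    return float(np.sum(mu * np.log(mu / p)))

def renyi(nu, theta, alpha):
    """Renyi divergence with the assumed 1 / (alpha (alpha - 1)) scaling."""
    s = np.sum(nu**alpha * theta**(1 - alpha))
    return float(np.log(s) / (alpha * (alpha - 1)))

def objective(mu, nu, theta, alpha):
    """Weighted difference of relative entropies (assumed form)."""
    return kl(mu, theta) / alpha - kl(mu, nu) / (alpha - 1)

rng = np.random.default_rng(2)
nu, theta = rng.dirichlet(np.ones(4)), rng.dirichlet(np.ones(4))
trials = rng.dirichlet(np.ones(4), size=2000)   # random candidate mu's

results = {}
for alpha in (-1.5, 0.4, 2.5):
    # Candidate optimizer: normalized geometric mixture of nu and theta.
    mu_star = nu**alpha * theta**(1 - alpha)
    mu_star /= mu_star.sum()
    vals = np.array([objective(mu, nu, theta, alpha) for mu in trials])
    results[alpha] = (renyi(nu, theta, alpha),
                      objective(mu_star, nu, theta, alpha),
                      float(vals.min()), float(vals.max()))

for alpha, (r, at_opt, lo, hi) in results.items():
    print(f"alpha={alpha}: R={r:.6f} at_opt={at_opt:.6f} min={lo:.6f} max={hi:.6f}")
```

In this sketch the objective evaluated at the geometric mixture matches the Rényi divergence, and the random candidates fall on the expected side of it in each parameter range.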
4 Discussion
Atar, Chowdhary and Dupuis [3] have recently established a variational formula for exponential integrals of bounded measurable functions. This is established in two forms. For any , , and , eqn. (2.6) of [3] states that
[TABLE]
while eqn. (2.7) of [3] states that for any , , and we have
[TABLE]
It is straightforward to exhibit the equivalence of these two forms. For instance, assuming (8), let and , and conclude that for all , , and we have
[TABLE]
or equivalently that
[TABLE]
which is (9). One can similarly go in the opposite direction. We will therefore focus only on the form in (9). As observed in Remark 2.3 of [3], taking the limit as in (9) recovers the elementary variational formula for exponential integrals of bounded measurable functions in (2).
The structure of Theorem 1 is motivated by the variational characterization in (9). We will now demonstrate that Theorem 1 is at least as strong as (9) by deriving (9) from Theorem 1 and the elementary variational formula (2).
First of all, we show that for any , , and one can find achieving the supremum in (9). This proof does not depend on Theorem 1 and (2). In fact, the supremum is achieved by the choice , where is the normalization factor, and it is elementary to prove this. For completeness, a proof is included in Appendix B.
It remains to prove that for any , , and , we have
[TABLE]
Assuming the truth of Theorem 1, and using (2), this is proved in Appendix C.
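A numerical check can make the shape of such a variational formula concrete. The sketch below verifies, on a finite set, one arrangement of the identity that follows from combining a Theorem 1 type characterization with the elementary formula (2): (1/alpha) log sum exp(alpha g) theta equals the supremum over nu of (1/(alpha - 1)) log sum exp((alpha - 1) g) nu minus the Rényi divergence of nu with respect to theta. The exact arrangement of (9) in the paper may differ; the scaling 1 / (alpha (alpha - 1)), the candidate optimizer with density proportional to exp(g), and all names are assumptions made for illustration.

```python
import numpy as np

def renyi(nu, theta, alpha):
    """Renyi divergence with the assumed 1 / (alpha (alpha - 1)) scaling."""
    s = np.sum(nu**alpha * theta**(1 - alpha))
    return float(np.log(s) / (alpha * (alpha - 1)))

def scaled_log_integral(g, p, c):
    """(1/c) * log of the integral of exp(c * g) under the distribution p."""
    return float(np.log(np.sum(np.exp(c * g) * p)) / c)

rng = np.random.default_rng(3)
theta = rng.dirichlet(np.ones(4))               # hypothetical data
g = rng.normal(size=4)                          # hypothetical bounded function
trials = rng.dirichlet(np.ones(4), size=2000)   # random candidate nu's

# Candidate optimizer: density with respect to theta proportional to exp(g).
nu_star = np.exp(g) * theta / np.sum(np.exp(g) * theta)

checks = {}
for alpha in (-1.5, 0.4, 2.5):
    lhs = scaled_log_integral(g, theta, alpha)

    def rhs(nu, alpha=alpha):
        return scaled_log_integral(g, nu, alpha - 1) - renyi(nu, theta, alpha)

    checks[alpha] = (lhs, rhs(nu_star), max(rhs(nu) for nu in trials))

for alpha, (lhs, at_opt, best) in checks.items():
    print(f"alpha={alpha}: lhs={lhs:.6f} at_opt={at_opt:.6f} best_random={best:.6f}")
```

Under these assumptions the tilted distribution attains the left-hand side and no random candidate exceeds it.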
5 Proof of Theorem 1
We now prove Theorem 1.
Consider first the case . Suppose . Then the LHS of (5) is . Also, in this case, we can choose such that but , which makes the RHS of (5) also equal to . Thus we may assume that . Given sufficiently large, define by
[TABLE]
where is chosen such that , and we define , , and . Further,
[TABLE]
and sufficiently large means that . We note that (and so ). Then
[TABLE]
which, as , converges to
[TABLE]
It remains to show that, in the case , for all such that , we have, for all such that , the inequality
[TABLE]
Pick such that (so we also have and ), and let , , and . Multiplying the RHS of (11) by gives
[TABLE]
On the other hand, we have
[TABLE]
so (11) follows from the concavity of the logarithm.
Next, consider the case when . Pick such that and , and let and . If , then , and so
[TABLE]
But we also have , so the RHS of (6) equals . We may therefore assume that . Now, an application of Hölder’s inequality with and (so ) gives
[TABLE]
Let be defined by , where . Note that . We have and , as required on the RHS of (6). Now,
[TABLE]
which equals . It remains to show that, in the case , for all such that , we have, for all such that and , the inequality
[TABLE]
To see this, note that
[TABLE]
where is the negative logarithm function, which is decreasing and convex. This establishes (12). Note that we have also established the claim in Theorem 1 that when one can find realizing the infimum in (6) whenever is nonempty.
It remains to consider the case where . Let . Then . By definition . However, we have already proved that
[TABLE]
This reads
[TABLE]
which establishes (7) in this case also and completes the proof of Theorem 1.
6 Rényi divergence rate between stationary finite state Markov chains
In this section we set the stage to present analogs of the preceding results involving the Rényi divergence rates between two stationary finite state Markov chains. Extensions to general state space Markov processes in both discrete and continuous time of a form similar to those we will present for stationary finite state Markov chains no doubt exist, under suitable conditions on the transition kernel, but may be considered topics for future work.
From this point onwards in this document we take and to be comprised of all the subsets of . Let denote the set of Markov probability distributions on , where if for all , , and for all , where and . Here is comprised of all the subsets of .
Given , let . is a subset of , and is called the support of . For and , we define . Note that if and , and . For , we define for all . This may seem strange, but is an important notational convention for the equations we are going to write. Note that for .
Given we say is absolutely continuous with respect to , denoted , if for all . The relative entropy of with respect to is defined by
[TABLE]
It can be checked that .
We need certain basic facts about the asymptotic properties of iterated powers of square matrices with nonnegative entries. We will state these facts in narrative form. Proofs can be extracted from several books that provide standard treatments of the theory of nonnegative matrices or finite state Markov chains, see e.g. [11, Chap. 1].
Let be a matrix with nonnegative entries. Then the limit
[TABLE]
exists, where denotes the entry of . We can associate to a directed graph on the vertex set , where we have a directed edge from to iff . This graph may have self loops. Then iff this directed graph does not have a directed cycle. Otherwise is finite. We call the growth rate of .
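The growth rate can be explored numerically. The following sketch assumes that the limit in question is the normalized logarithm of the sum of all entries of the n-th power of the matrix, in which case it coincides with the logarithm of the spectral radius, and it is minus infinity exactly when the associated directed graph has no directed cycle. The function names and the example matrices are hypothetical.

```python
import numpy as np

def has_cycle(M):
    """True iff the directed graph with an edge i -> j when M[i, j] > 0 has a cycle."""
    n = M.shape[0]
    A = (np.asarray(M) > 0).astype(float)
    # With n vertices, a walk of length n exists iff there is a directed cycle.
    return bool(np.linalg.matrix_power(A, n).any())

def growth_rate(M):
    """Log of the spectral radius, or -inf when there is no directed cycle."""
    M = np.asarray(M, float)
    if not has_cycle(M):
        return -np.inf
    return float(np.log(np.max(np.abs(np.linalg.eigvals(M)))))

def growth_rate_by_powers(M, n=5000):
    """(1/n) * log of the sum of the entries of M^n, computed with rescaling."""
    M = np.asarray(M, float)
    x = np.ones(M.shape[0])
    log_total = 0.0
    for _ in range(n):
        x = M @ x
        s = x.sum()
        if s == 0.0:
            return -np.inf      # every walk of this length has zero weight
        log_total += np.log(s)  # accumulate the log of the normalizing factors
        x /= s
    return log_total / n

M_pos = np.array([[0.5, 1.0], [2.0, 0.1]])   # every pair of states is connected
M_nil = np.array([[0.0, 1.0], [0.0, 0.0]])   # single edge, no directed cycle

print(growth_rate(M_pos), growth_rate_by_powers(M_pos))
print(growth_rate(M_nil), growth_rate_by_powers(M_nil))
```

The rescaling inside the loop avoids overflow or underflow while keeping track of the logarithm of the entry sum exactly.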
Suppose is finite. We say is absolutely continuous with respect to if for all . Let be absolutely continuous with respect to . Then so is . Thus there is a maximum element that is absolutely continuous with respect to , in the sense that every other that is absolutely continuous with respect to satisfies . This maximum element need not be unique. Pick any such maximum element, call it . Let . Then .
Let , which we also think of as a nonnegative matrix. The support of can be uniquely written as a disjoint union of subsets, called classes, , for some , such that if are in distinct classes, and such that, for each , if we consider the restriction of the directed graph associated to to the vertices in the class , then this directed graph is irreducible, in the sense that there is a directed path in the graph between any pair of vertices in .
Given and a matrix with nonnegative entries, we say is compatible with if . Let be the decomposition of the support of into classes. For each , the restriction of to the coordinates in defines a irreducible matrix with nonnegative entries. This matrix has an associated Perron-Frobenius eigenvalue, which we denote by . We have for all . We have . Also, for each , the restriction of to the coordinates in has a left eigenvector associated to the eigenvalue , which has all its coordinates strictly positive and is unique up to scaling, and also a right eigenvector associated to the eigenvalue , which has all its coordinates strictly positive and is unique up to scaling.
Given , what we mean by the stationary Markov chain defined by is the following: for each define a probability distribution on , where is comprised of all subsets of , by setting
[TABLE]
It is straightforward to check that for all and we have
[TABLE]
The following fact, which will be very useful later, is easy to verify from the definitions. It holds for all .
[TABLE]
where on the RHS of this definition the notation refers to the relative entropy between probability distributions on .
We are now in a position where we can state the analog for stationary finite state Markov chains of the elementary variational formula (2). Let and . We have the following variational characterization of the growth rate of the exponential integral of along the stationary Markov chain defined by .
[TABLE]
The proof is in Appendix D. The result is standard, being Varadhan’s characterization of the spectral radius of nonnegative matrices, see e.g. [5, Exer. 3.1.19].
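Varadhan's characterization can be checked numerically for a small chain. The sketch below works under assumptions that go beyond the text: the reference chain is given directly by a strictly positive transition matrix `T`, the function `g` is defined on pairs of states, the matrix of interest has entries T[i, j] * exp(g[i, j]), and the relative entropy rate of a stationary pair distribution Q is the Q-average of the log ratio of transition probabilities. The optimizer is built from left and right Perron eigenvectors in the way the proofs in this document describe in words; all names are illustrative.

```python
import numpy as np

def perron(M):
    """Perron root and a nonnegative eigenvector of a positive matrix."""
    w, V = np.linalg.eig(M)
    k = int(np.argmax(np.abs(w)))
    return float(w[k].real), np.abs(V[:, k].real)

def joint_from_transition(R):
    """Stationary pair distribution Q[i, j] = pi[i] * R[i, j]."""
    _, pi = perron(R.T)
    return (pi / pi.sum())[:, None] * R

def rel_entropy_rate(Q, T):
    """Sum over (i, j) of Q[i, j] * log(Q(j | i) / T[i, j])."""
    cond = Q / Q.sum(axis=1, keepdims=True)
    return float(np.sum(Q * np.log(cond / T)))

rng = np.random.default_rng(4)
d = 4
T = rng.dirichlet(np.ones(d), size=d)   # hypothetical positive transition matrix
g = rng.normal(size=(d, d))             # hypothetical function on pairs of states

M = T * np.exp(g)
rho, v = perron(M)       # right Perron eigenvector
_, u = perron(M.T)       # left Perron eigenvector
Q_star = u[:, None] * M * v[None, :] / (rho * (u @ v))

def objective(Q):
    return float(np.sum(Q * g) - rel_entropy_rate(Q, T))

log_rho = float(np.log(rho))
value_at_Q_star = objective(Q_star)
random_chains = rng.dirichlet(np.ones(d), size=(500, d))
best_random = max(objective(joint_from_transition(R)) for R in random_chains)

print(log_rho, value_at_Q_star, best_random)
```

In this sketch `Q_star` has equal row and column marginals, attains the logarithm of the spectral radius, and dominates every randomly drawn stationary chain.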
We are also in a position to define the Rényi divergence rates between two stationary finite state Markov chains. This definition is classical, see e.g. the paper of Rached, Alajaji, and Campbell [9], which also considers the nonstationary case, and the references therein. Given and , we define the Rényi divergence rate of with respect to , denoted , by
[TABLE]
where on the RHS of this definition the notation refers to the Rényi divergence between probability distributions on defined as in (3) and (4). The proofs of the existence of the limit in (18) as well as of the properties of the Rényi divergence rate of interest to us, which are stated in the following proposition, are in Appendix E.
Proposition 1.
Given , the Rényi divergence rate, as defined in (18), satisfies the following properties:
[TABLE]
and
[TABLE]
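The Rényi divergence rate can be illustrated for two small chains. The sketch below assumes strictly positive transition matrices, the 1 / (alpha (alpha - 1)) scaling of [3], and the classical expression of the rate through the spectral radius of the matrix with entries T_nu[i, j]^alpha * T_theta[i, j]^(1 - alpha). It compares that expression with the normalized Rényi divergence between the laws of long finite paths of the two stationary chains. All names and the random data are hypothetical.

```python
import numpy as np

def stationary(T):
    """Stationary distribution of a positive transition matrix."""
    w, V = np.linalg.eig(T.T)
    pi = np.abs(V[:, int(np.argmax(np.abs(w)))].real)
    return pi / pi.sum()

def renyi_rate(T_nu, T_theta, alpha):
    """Assumed closed form: log spectral radius divided by alpha (alpha - 1)."""
    A = T_nu**alpha * T_theta**(1 - alpha)
    rho = float(np.max(np.abs(np.linalg.eigvals(A))))
    return float(np.log(rho) / (alpha * (alpha - 1)))

def renyi_per_step(T_nu, T_theta, alpha, n):
    """(1/n) times the Renyi divergence between the laws of n-step paths."""
    A = T_nu**alpha * T_theta**(1 - alpha)
    x = np.ones(A.shape[0])
    log_total = 0.0
    for _ in range(n):
        x = A @ x
        s = x.sum()
        log_total += np.log(s)   # rescale to avoid overflow or underflow
        x /= s
    a = stationary(T_nu)**alpha * stationary(T_theta)**(1 - alpha)
    log_total += np.log(a @ x)   # weight by the initial distributions
    return float(log_total / (n * alpha * (alpha - 1)))

rng = np.random.default_rng(5)
T_nu = rng.dirichlet(np.ones(3), size=3)      # hypothetical chain
T_theta = rng.dirichlet(np.ones(3), size=3)   # hypothetical chain

rates = {a: (renyi_rate(T_nu, T_theta, a), renyi_per_step(T_nu, T_theta, a, 20000))
         for a in (-1.5, 0.4, 2.5)}
for a, (closed, approx) in rates.items():
    print(f"alpha={a}: closed_form={closed:.5f} finite_n={approx:.5f}")
```

The finite-path value approaches the closed form at a rate of order 1/n, so the two printed columns agree only approximately.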
7 Main results in the Markov case
Our first main result in the Markov case is the following variational characterization of the Rényi divergence rate, which is a direct analog of Theorem 1.
Theorem 2.
Let and . Then, if , we have
[TABLE]
while, if , we have
[TABLE]
and, if , we have
[TABLE]
Further, one can find achieving the extremum on the RHS in all three cases, except in the case where and is empty.
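A numerical sketch of a Theorem 2 type statement, under assumptions not taken verbatim from the paper: strictly positive transition matrices `T_nu` and `T_theta`, the 1 / (alpha (alpha - 1)) scaling, a divergence rate given by the log spectral radius of the matrix with entries T_nu^alpha * T_theta^(1 - alpha), and an objective equal to the relative entropy rate with respect to theta divided by alpha minus the relative entropy rate with respect to nu divided by alpha - 1. The candidate optimizer is built from Perron eigenvectors; all names are illustrative.

```python
import numpy as np

def perron(M):
    """Perron root and a nonnegative eigenvector of a positive matrix."""
    w, V = np.linalg.eig(M)
    k = int(np.argmax(np.abs(w)))
    return float(w[k].real), np.abs(V[:, k].real)

def joint_from_transition(R):
    """Stationary pair distribution built from a transition matrix."""
    _, pi = perron(R.T)
    return (pi / pi.sum())[:, None] * R

def rel_entropy_rate(Q, T):
    """Sum over (i, j) of Q[i, j] * log(Q(j | i) / T[i, j])."""
    cond = Q / Q.sum(axis=1, keepdims=True)
    return float(np.sum(Q * np.log(cond / T)))

def objective(Q, T_nu, T_theta, alpha):
    return (rel_entropy_rate(Q, T_theta) / alpha
            - rel_entropy_rate(Q, T_nu) / (alpha - 1))

rng = np.random.default_rng(6)
d = 3
T_nu = rng.dirichlet(np.ones(d), size=d)      # hypothetical chain
T_theta = rng.dirichlet(np.ones(d), size=d)   # hypothetical chain
random_Q = [joint_from_transition(R) for R in rng.dirichlet(np.ones(d), size=(500, d))]

markov_checks = {}
for alpha in (-1.5, 0.4, 2.5):
    A = T_nu**alpha * T_theta**(1 - alpha)
    rho, v = perron(A)
    _, u = perron(A.T)
    Q_star = u[:, None] * A * v[None, :] / (rho * (u @ v))
    rate = float(np.log(rho) / (alpha * (alpha - 1)))
    vals = [objective(Q, T_nu, T_theta, alpha) for Q in random_Q]
    markov_checks[alpha] = (rate, objective(Q_star, T_nu, T_theta, alpha),
                            min(vals), max(vals))

for alpha, (rate, at_opt, lo, hi) in markov_checks.items():
    print(f"alpha={alpha}: rate={rate:.6f} at_opt={at_opt:.6f} min={lo:.6f} max={hi:.6f}")
```

Under these assumptions the eigenvector-based chain attains the rate, with random chains bounded above by it when alpha > 1 or alpha < 0 and bounded below by it when 0 < alpha < 1.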
Our second main result in the Markov case is the following analog of the variational formula of [3].
Theorem 3.
For any , , and , we have
[TABLE]
and for any , , and we have
[TABLE]
It is straightforward to exhibit the equivalence of the claims in (22) and (23). This is done in Appendix F. It therefore suffices to focus only on the form in (23). It is straightforward to show that for each and , one can find achieving the supremum on the RHS of (23). Appendix F also contains a demonstration of this fact. A proof of Theorem 3, assuming the truth of Theorem 2, and using (6), is also provided in Appendix F.
8 Proof of Theorem 2
Suppose . If , taking on the RHS of (19) makes the RHS equal , which is also the value of the LHS. We may therefore assume that .
Let . This matrix is compatible with . Let be the decomposition of the support of into classes. We may choose the indexing of the classes in such a way that .
Let be a row vector whose entries are zero in the coordinates that are not in , while its restriction to is a nonzero left eigenvector of the restriction of to . All the entries of in the coordinates in are strictly positive. Similarly, let be a column vector whose entries are zero in the coordinates that are not in , while its restriction to is a nonzero right eigenvector of the restriction of to . All the entries of in the coordinates in will be strictly positive. For , we define
[TABLE]
where , which is strictly positive. Note that and . We also have, for all ,
[TABLE]
so we get
[TABLE]
where we have used the fact that .
Multiplying the RHS of (19) by for this choice of gives
[TABLE]
which also equals times the LHS of (19). This establishes the existence of satisfying and achieving equality in (19).
It remains to check that for all satisfying we have the inequality
[TABLE]
But, in view of (15), in (5) applied to probability distributions on , for , we have already proved that
[TABLE]
Dividing by , letting , and appealing to (16) establishes (24).
Next, consider the case where . If the directed graph associated to the matrix has no cycles, then , and , so the RHS of (20) is also , and so (20) holds in this case. We may therefore assume that is nonempty. Pick any that is a maximum element among all the elements of that are absolutely continuous with respect to . Let . Then . Further, is compatible with .
Let be the decomposition of the support of into classes. We may choose the indexing of the classes in such a way that .
Let be a row vector whose entries are zero in the coordinates that are not in , while its restriction to is a nonzero left eigenvector of the restriction of to . All the entries of in the coordinates in are strictly positive. Similarly, let be a column vector whose entries are zero in the coordinates that are not in , while its restriction to is a nonzero right eigenvector of the restriction of to . All the entries of in the coordinates in will be strictly positive. For , we define
[TABLE]
where , which is strictly positive. Note that and , so and . We also have, for all ,
[TABLE]
so we get
[TABLE]
where we have used the fact that .
Multiplying the RHS of (20) by for this choice of gives
[TABLE]
which also equals times the LHS of (20). This establishes the existence of satisfying and and achieving equality in (20).
It remains to check that for all satisfying and we have the inequality
[TABLE]
But, in view of (15), in (6) applied to probability distributions on , for , we have already proved that
[TABLE]
Dividing by , letting , and appealing to (16) establishes (25).
It remains to consider the case . Let . Then . By definition . However, we have already proved that
[TABLE]
This reads
[TABLE]
which establishes (21) in this case also and completes the proof of Theorem 2.
9 Concluding remarks
We have given a variational characterization of Rényi divergence between two arbitrary probability distributions on an arbitrary measurable space in terms of relative entropies, for all values of the parameter defining the Rényi divergence. We also gave a variational characterization of the Rényi divergence rate between two stationary finite state Markov chains in terms of relative entropy rates, for all values of the parameter defining the Rényi divergence rate. A consequence of the latter development was an analog of the variational formula of [3] for stationary finite state Markov chains.
While we restricted ourselves to stationary finite state Markov chains in the latter discussion, it is to be expected that there will be versions of this variational characterization of Rényi divergence rate in a much broader setting involving Markov or -th order Markov processes in discrete time, and also in continuous time. It would also be interesting to consider to what extent such a variational characterization might generalize to the Rényi divergence rates between an arbitrary pair of stationary processes, assuming the existence of the defining limit to start with, since even the understanding of the relative entropy rate at this level of generality is somewhat limited [7].
Acknowledgments
Thanks to Vivek Borkar and Payam Delgosha for their comments on an earlier draft of this document.
Appendix A Proof of the elementary variational formula in (2)
The second equality in (2) follows from the fact that if .
Given and , define by , where . Note that . Then
[TABLE]
which also equals of the LHS of (2).
It remains to show that for all we have
[TABLE]
Let . We have
[TABLE]
where the second step is justified by the concavity of the logarithm. This completes the proof.
Appendix B Proof that the supremum in (9) is achieved
Given and , let be defined by , where . Note that and are mutually absolutely continuous.
Thus, for all , , we have
[TABLE]
On the other hand
[TABLE]
which is the same.
Suppose now that . Let . Then . For any and , let be defined by . Then , where and . We have then already proved that
[TABLE]
which completes the proof.
Appendix C Proof of (10)
Consider first the case . We may then assume that , since otherwise the right hand side of (10) is . From (2), we have, for all such that that
[TABLE]
From (5) we have
[TABLE]
which means that
[TABLE]
Taking the supremum over on the RHS of the preceding equation and using (2) gives
[TABLE]
which was to be shown.
Next, suppose . Given and , if for some (and hence every) such that and (where and ), then , and so (10) is true. Otherwise, we can find such that and . We know from the elementary variational formula (2) that for every we have
[TABLE]
and
[TABLE]
where . Hence
[TABLE]
But, from Theorem 1, we know that there exists for which the RHS of the preceding equation equals . This shows that
[TABLE]
which establishes (10) in this case.
It remains to consider the case . Let , so . We have already proved that
[TABLE]
where . Observing that , this can be rewritten as
[TABLE]
which is (10) in this case, and completes the proof.
Appendix D Proof of (6)
The second equality in (6) follows from the fact that if .
Given and , the matrix has nonnegative entries and is compatible with , so , i.e. the LHS of (6), is finite. Let be the decomposition of the support of into classes. We may choose the indexing of the classes in such a way that .
Let be a row vector whose entries are zero in the coordinates that are not in , while its restriction to is a nonzero left eigenvector of the restriction of to . Note that all the entries of in the coordinates in are strictly positive. Similarly, let be a column vector whose entries are zero in the coordinates that are not in , while its restriction to is a nonzero right eigenvector of the restriction of to . All the entries of in the coordinates in will be strictly positive. For , we define
[TABLE]
where , which is strictly positive. Note that and . We also have, for all ,
[TABLE]
so we get
[TABLE]
where we have used the fact that .
We may now compute
[TABLE]
which also equals of the LHS of (6). This establishes that for each and there exists achieving equality in (6).
It remains to show that for all such that we have
[TABLE]
But, using (2) applied to the probability distribution on , for , with , we have already proved that
[TABLE]
Divide both sides by and take the limit as . Appealing to (16) and the definition of the growth rate in (14) proves (26). This completes the proof of (6).
Appendix E Proof of the existence of the limit in (18), and of Proposition 1
Suppose and . Then for all and so the limit on the RHS of (18) exists and equals , as claimed in Proposition 1.
If and , then for all , and so
[TABLE]
This is also the formula for when , irrespective of whether or not. It follows from the definition of the growth rate in (14) that the limit on the RHS of (18) exists and equals , as claimed in Proposition 1.
Finally, suppose . Let . Then we have . We have therefore already proved that exists and equals , as given in Proposition 1. But equals . Therefore the limit on the RHS of (18) exists, and since this is what we call it must be the case that equals , as claimed in Proposition 1. This completes the proof.
Appendix F Proof of Theorem 3 assuming the truth of Theorem 2 and using (6), and proofs of the two claims about (23)
We first verify the truth of the two claims about (23) which were made just after the statement of Theorem 3.
To exhibit the equivalence of the two forms (22) and (23) appearing in Theorem 3, assume, for instance, the truth of (22). Let and , and conclude that for all , , and we have
[TABLE]
or equivalently that
[TABLE]
which is (23). One can similarly go in the opposite direction.
To verify that the supremum on the RHS of (23) is achieved, given , , and , observe that is compatible with . Let be the decomposition of the support of into classes. We may choose the indexing of the classes in such a way that .
Let . Observe that is also compatible with . Let be a row vector whose entries are zero in the coordinates that are not in , while its restriction to is a nonzero left eigenvector of the restriction of to . All the entries of in the coordinates in are strictly positive. Similarly, let be a column vector whose entries are zero in the coordinates that are not in , while its restriction to is a nonzero right eigenvector of the restriction of to . All the entries of in the coordinates in will be strictly positive. For , we define
[TABLE]
where , which is strictly positive. Note that and . We also have, for all ,
[TABLE]
so we get
[TABLE]
where we have used the fact that .
We now note that
[TABLE]
Then we have
[TABLE]
Here the first step can be seen by observing that the terms for cancel each other out by successive cancellation in the definition of the growth rate as a limit. Equality in the second step depends on the fact that we have chosen such that .
We also note that
[TABLE]
so we have
[TABLE]
Here the first step can be seen by observing that the terms for cancel each other out by successive cancellation in the definition of the growth rate as a limit, and equality in the second step depends on the fact that we have chosen such that .
Since , we have
[TABLE]
Multiplying (27) through by and using (28) gives
[TABLE]
which demonstrates that works to show what was claimed.
In order to prove Theorem 3, it remains to show that for every , , and , we have
[TABLE]
We prove this, assuming the truth of Theorem 2, using (6). The proof is almost a verbatim copy of that in Appendix C, except that we are now dealing with the case of stationary finite state Markov chains rather than with the i.i.d. case.
Consider first the case . We may then assume that , since otherwise the right hand side of (29) is . From (6), we have, for all such that that
[TABLE]
From (19) we have
[TABLE]
which means that
[TABLE]
Taking the supremum over on the RHS of the preceding equation and using (6) gives
[TABLE]
which was to be shown.
Next, suppose . There is no such that and precisely when the directed graph associated to has no cycles, and in this case , so (29) is true. Therefore, we may assume that we can find such that and . We know from (6) that for every we have
[TABLE]
and
[TABLE]
where . Hence
[TABLE]
But, from Theorem 2, we know that there exists for which the RHS of the preceding equation equals . This shows that
[TABLE]
which establishes (29) in this case.
It remains to consider the case . Let , so . We have already proved that
[TABLE]
where . Observing that , this can be rewritten as
[TABLE]
which is (29) in this case, and completes the proof.
- [1]
- [2]
- [3] Rami Atar, Kenny Chowdhary, and Paul Dupuis. “Robust Bounds on Risk-Sensitive Functionals via Rényi Divergence”, SIAM/ASA Journal on Uncertainty Quantification, Vol. 3, pp. 18-33, 2015.
- [4] Patrick Billingsley. Probability and Measure. Second Edition, John Wiley & Sons Inc., New York, 1986.
- [5] Amir Dembo and Ofer Zeitouni. Large Deviations Techniques and Applications. Second Edition. Applications of Mathematics, Stochastic Modelling and Applied Probability, Vol. 38, Springer-Verlag, New York, 1998.
- [6] Tim van Erven and Peter Harremoës. “Rényi Divergence and Kullback-Leibler Divergence”, IEEE Transactions on Information Theory, Vol. 60, No. 7, pp. 3797-3820, 2014.
- [7] Robert M. Gray. Entropy and Information Theory. Second Edition, Springer Science + Business Media, New York, 2011.
- [8] Jingbo Liu, Thomas A. Courtade, Paul Cuff, and Sergio Verdú. “Brascamp-Lieb Inequality and its Reverse: An Information Theoretic View”, Proceedings of the 2016 IEEE International Symposium on Information Theory, IEEE Press, pp. 1048-1052, 2016.
