Concentration of Markov chains with bounded moments
Assaf Naor, Shravas Rao, Oded Regev

TL;DR
This paper extends concentration inequalities for finite state Markov chains to cases where the function has bounded moments rather than being bounded, providing dimension-independent bounds and answering a question by Kargin.
Contribution
It introduces new concentration inequalities assuming only bounded moments of the function, generalizing Gillman's bounds and addressing an open question by Kargin.
Findings
Derived moment-based concentration inequalities for Markov chains
Generalized bounds to $L_p$-valued functions, including Hilbert spaces
Provided dimension-independent concentration bounds
Abstract
Let $\{W_t\}_{t=1}^{\infty}$ be a finite state stationary Markov chain, and suppose that $f$ is a real-valued function on the state space. If $f$ is bounded, then Gillman's expander Chernoff bound (1993) provides concentration estimates for the random variable $f(W_1)+\cdots+f(W_n)$ that depend on the spectral gap of the Markov chain and the assumed bound on $f$. Here we obtain analogous inequalities assuming only that the $q$'th moment of $f$ is bounded for some $q\ge 2$. Our proof relies on reasoning that differs substantially from the proofs of Gillman's theorem that are available in the literature, and it generalizes to yield dimension-independent bounds for mappings $f$ that take values in an $L_p(\mu)$ for some $p\ge 2$, thus answering (even in the Hilbertian special case $p=2$) a question of Kargin (2007).
Assaf Naor Mathematics Department, Princeton University. Supported by the Packard Foundation and the Simons Foundation. The research that is presented here was conducted under the auspices of the Simons Algorithms and Geometry (A&G) Think Tank.
Shravas Rao Courant Institute of Mathematical Sciences, New York University. This material is based upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE-1342536.
Oded Regev Courant Institute of Mathematical Sciences, New York University. Supported by the Simons Collaboration on Algorithms and Geometry and by the National Science Foundation under Grant No. CCF-1814524. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.
1 Introduction
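For $N\in\mathbb{N}$, write $[N]=\{1,\ldots,N\}$ and let $\triangle^{N-1}\stackrel{\mathrm{def}}{=}\big\{\pi=(\pi_1,\ldots,\pi_N)\in[0,1]^N:\ \sum_{i=1}^N\pi_i=1\big\}$ be the simplex of probability measures on $[N]$. Given $\pi\in\triangle^{N-1}$, denote by $E_\pi$ the $N$-by-$N$ matrix all of whose rows equal $\pi$, i.e., $(E_\pi)_{ij}=\pi_j$ for every $i,j\in[N]$.
Given $\pi\in\triangle^{N-1}$, a stochastic matrix $A$ is $\pi$-stationary if $\pi A=\pi$, i.e., $\sum_{i=1}^N\pi_iA_{ij}=\pi_j$ for all $j\in[N]$. We then define $\lambda(A)$ to be the norm of $A-E_\pi$ as an operator from $L_2(\pi)$ to $L_2(\pi)$, i.e.,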
[TABLE]
Note that if is diagonalizable over the Hilbert space , then we have , where are the eigenvalues of . This would occur if were -reversible, i.e., for all , in which case would be a self-adjoint operator on ; the reversible setting is the main case of interest in the ensuing discussion, but reversibility is not needed for our proofs.
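As a numerical illustration of this definition, the following minimal Python sketch estimates the spectral parameter of a small chain. It assumes, as the surrounding text indicates, that the quantity in question is the $L_2(\pi)\to L_2(\pi)$ operator norm of the transition matrix minus the matrix whose rows all equal $\pi$. The function name `l2_pi_norm_gap` and the use of power iteration are illustrative choices, not part of the paper. The sketch relies on the standard fact that the $L_2(\pi)$ operator norm of a matrix $M$ equals the Euclidean operator norm of $D^{1/2}MD^{-1/2}$, where $D$ is the diagonal matrix with diagonal $\pi$ (assumed strictly positive).

```python
import math

def l2_pi_norm_gap(A, pi, iters=500):
    """Estimate the L2(pi) -> L2(pi) norm of A - (matrix with all rows pi).

    Hypothetical helper for illustration; assumes every pi[i] > 0 and that
    the arbitrary start vector is not orthogonal to the top singular vector.
    """
    n = len(pi)
    s = [math.sqrt(p) for p in pi]
    # B = D^{1/2} (A - 1 pi^T) D^{-1/2}; its Euclidean norm is the L2(pi) norm.
    B = [[s[i] * (A[i][j] - pi[j]) / s[j] for j in range(n)] for i in range(n)]
    v = [1.0 + 0.1 * i for i in range(n)]  # arbitrary start vector
    for _ in range(iters):
        w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]  # B v
        u = [sum(B[i][j] * w[i] for i in range(n)) for j in range(n)]  # B^T B v
        size = math.sqrt(sum(x * x for x in u))
        if size == 0.0:
            return 0.0
        v = [x / size for x in u]
    w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
    return math.sqrt(sum(x * x for x in w))

# Two-state example: eigenvalues of A are 1 and 1 - 0.1 - 0.3 = 0.6.
A = [[0.9, 0.1], [0.3, 0.7]]
pi = [0.75, 0.25]  # stationary: pi A = pi
lam = l2_pi_norm_gap(A, pi)
```

Every two-state chain is reversible, so in this example the computed value should agree (up to rounding) with the second-largest eigenvalue in absolute value, namely $0.6$.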
Let be a Markov chain with state space and transition matrix . One says that is stationary if is -stationary for . Write .
Theorem 1.1.
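Suppose that $W=\{W_t\}_{t=1}^{\infty}$ is a stationary Markov chain whose state space is $[N]$ and with $\lambda=\lambda(W)<1$. Then, every $f:[N]\to\mathbb{R}$ satisfies the following inequality for every $n\in\mathbb{N}$ and every $q\ge 2$.
$$\bigg(\mathbb{E}\bigg[\Big|\frac{f(W_1)+\cdots+f(W_n)}{n}-\mathbb{E}\big[f(W_1)\big]\Big|^q\bigg]\bigg)^{\frac1q}\lesssim\sqrt{\frac{q}{(1-\lambda)n}}\cdot\Big(\mathbb{E}\big[|f(W_1)|^q\big]\Big)^{\frac1q}.\tag{1}$$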
The (standard) asymptotic notation that appears in (1) (as well as throughout the ensuing discussion) means the following. Given two quantities $Q,Q'>0$, the notation $Q\lesssim Q'$ stands for the assertion that there exists a universal constant $C>0$ for which $Q\le CQ'$; this is also denoted by $Q'\gtrsim Q$.
The conclusion (1) of Theorem 1.1 with the random variables replaced by i.i.d. random variables coincides with the classical Marcinkiewicz–Zygmund inequality [MZ37]. Our contribution here is therefore to generalize this statement to random variables that are (images of) stationary Markov chains with a spectral gap; the i.i.d. setting is the special case of Theorem 1.1. The bound (1) is optimal; see Remark 4 below. A variant of Theorem 1.1 when appears in Remark 3 below.
The precursor (and inspiration) of Theorem 1.1 is the following theorem of Gillman [Gil93, Gil98].
Theorem 1.2.
Suppose that is a stationary Markov chain whose state space is and with . Then, every satisfies the following inequality for every and every .
[TABLE]
Note that Theorem 1.2 is typically stated in the literature as the following concentration inequality, which is commonly called the expander Chernoff bound.
[TABLE]
where is a universal constant. The equivalence of (2) and (3) is standard; is checked by applying Markov’s inequality and optimizing over , and follows by straightforward integration (both implications appear in Proposition 2.5.2 of the textbook [Ver18]). The same use of Markov’s inequality shows mutatis mutandis that Theorem 1.1 implies the following concentration phenomenon.
Corollary 1.3.
There is a universal constant with the following property. Suppose that is a stationary Markov chain whose state space is and with . Then, every satisfies the following inequality for every , every and every .
[TABLE]
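The passage from a moment bound to a tail bound via Markov's inequality, as used for (2) and (3), can be sketched numerically as follows. The sketch assumes a moment bound of the form $\|Y\|_q\le C\sqrt{q/((1-\lambda)n)}$ for all $q\ge 2$ (the bounded case, normalized so that $\|f\|_\infty\le 1$), where $Y$ is the centered empirical average and $C$ is a placeholder for the unspecified universal constant. The function name and the default $C=1$ are hypothetical. Markov's inequality gives $\Pr[|Y|\ge u]\le(\|Y\|_q/u)^q$, and minimizing the right-hand side over $q$ yields a sub-Gaussian tail.

```python
import math

def tail_bound_from_moments(u, n, lam, C=1.0):
    """Bound on P(|Y| >= u) derived from ||Y||_q <= C*sqrt(q/((1-lam)*n)), q >= 2.

    C stands in for an unspecified universal constant (an assumption of this sketch).
    """
    K = u * u * (1.0 - lam) * n / (C * C)
    if K <= 0.0:
        return 1.0  # no information for u = 0
    # Markov: P(|Y| >= u) <= (||Y||_q / u)^q <= (q / K)^(q/2).
    # Over q > 0 the right-hand side is minimized at q = K / e.
    q = max(2.0, K / math.e)  # the moment bound is only assumed for q >= 2
    return min(1.0, (q / K) ** (q / 2.0))

bound = tail_bound_from_moments(u=0.5, n=1000, lam=0.5)
```

Whenever $K/e\ge 2$, with $K=u^2(1-\lambda)n/C^2$, the returned value equals $\exp\big(-u^2(1-\lambda)n/(2eC^2)\big)$, which has the shape of the expander Chernoff bound.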
Remark 1.
Kloeckner investigated in [Klo19] the question of obtaining concentration bounds such as (3) with the norm replaced by other norms of . As discussed in [Klo19, Remark 2.2], the results of [Klo19] hold in a setting that imposes structural hypotheses on the aforementioned norm of the “observable” which notably excludes its norm (which appears in the right-hand side of the bound (1) that we prove here), but it is noted in [Klo19, Remark 2.2] that “classically one only makes moment assumptions on the observable.” Corollary 1.3 addresses this question, though note that [Klo19] also covers settings that are not treated here.
The new bound (1) that we obtain differs from Gillman’s estimate (2) only in the replacement of the worst-case bound on in the right-hand side of (2) by an average-case bound. Rather than being merely a quantitative enhancement, this improvement has conceptual significance which we achieve through a reasoning that differs substantially from the proof of (3) in [Gil93, Gil98], as well as the several other proofs of (3) and its variants that appeared in the literature [Din95, Kah97, Lez98, LP04, Kar07, Wag08, CLLM12, Pau15, GLSS18, FJS18, Klo19] (our approach was recently used in [RR17, Rao19]).
Assuming a bound on the ’th moment of is the appropriate setting for bounding the ’th moment of . This compatibility of the left-hand side of (1) and the right-hand side of (1) allows the resulting inequality to tensorize so as to yield dimension-independent vector-valued statements. Specifically, for any measure space , if , then by applying (1) to the real-valued mapping for each , and then integrating the (’th power of) the resulting point-wise inequality, we see that (under the assumptions of Theorem 1.1),
[TABLE]
The following Hilbertian statement is a consequence of (4) that deserves to be stated separately.
Corollary 1.4.
Suppose that is a stationary Markov chain whose state space is and with . Let be a Hilbert space. The following bound holds for all , and .
[TABLE]
Corollary 1.4 is nothing more than (4) applied to an isometric copy of in , which is known to exist by [Ban32, Chapter 12] (see also the exposition in, e.g., the textbook [AK16, Proposition 6.4.12]).
Since , the following corollary is a consequence of Corollary 1.4 through the usual application of Markov’s inequality and then an optimization over .
Corollary 1.5 (Hilbert space-valued expander Chernoff bound).
There is a universal constant with the following property. Suppose that is a stationary Markov chain whose state space is and with . Let be a Hilbert space. If , then for all and we have
[TABLE]
Remark 2.
Kargin studied [Kar07] the vector-valued setting of Gillman’s theorem for functions that take values in the $d$-dimensional Euclidean space $\mathbb{R}^d$. The statement that is obtained in [Kar07] is the same as that of Corollary 1.5, except that it is dimension-dependent; specifically, the implicit constant in (6) grows to $\infty$ exponentially with $d$. Thus, the main new feature of Corollary 1.5 is that it is dimension-independent. Obtaining such a bound was a main question that [Kar07] left open; see [Kar07, Section 4].
Observe that estimates such as (4) can be interpreted as bounds on the operator norm of a certain linear operator between vector-valued -spaces. Specifically, suppose that is a Banach space. Let be a stationary Markov chain whose state space is and with . Denote (as before) the stationary measure of by and let the transition matrix of be . For each denote the associated probability measure on the trajectories of length by . Thus, is the probability measure on that is given by if , and for ,
[TABLE]
Define a linear operator by setting for ,
[TABLE]
Here, and in what follows, we are using standard notation for vector-valued Lebesgue–Bochner spaces, though throughout we will need to consider only finitely supported measures, in which case measurability issues do not need to be discussed. So, if is a probability space with , then the Banach space is the vector space of all mappings , equipped with the norm
[TABLE]
The validity of (4) under the assumptions of Theorem 1.1 is the same as the operator norm bound
[TABLE]
In the same vein, Corollary 1.4 is (under the same assumptions) the same as
[TABLE]
By Calderón’s vector-valued extension [Cal64] of the Riesz–Thorin [Rie27, Tho48] interpolation theorem (see the monograph [BL76] for background on complex interpolation; the specific statement that we are using here is a combination of Theorem 4.1.2 and Theorem 5.1.2 in [BL76]), it follows from (8) and (9) that for every we have
[TABLE]
We record this conclusion as the following generalization of Corollary 1.4 and Corollary 1.5.
Corollary 1.6.
Suppose that and that is a measure space. Let be a stationary Markov chain whose state space is and with . If , then for all and ,
[TABLE]
Consequently, by the usual combination of (10) with Markov’s inequality, followed by optimization over , there exists a universal constant such that
[TABLE]
Remark 3.
By convexity we have , since it is evident from (7) that the operator in question is the difference of two averaging operators. By interpolating this (trivial) estimate with the case of Theorem 1.1 using the (scalar-valued) Riesz–Thorin interpolation theorem as above, we arrive at the following variant of Theorem 1.1 in the range , which holds under the same assumptions.
[TABLE]
Observe that when the Markov chain is reversible, the case of (1) is a quadratic inequality that could be directly verified in a straightforward manner by expanding both sides in an orthonormal eigenbasis of the transition matrix of . The more substantial content of Theorem 1.1 is therefore the case , which does not lend itself to such linear-algebraic reasoning.
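For a concrete instance of the quadratic case mentioned in Remark 3, the following Python sketch computes the second moment of the empirical average exactly, by enumerating all trajectories of a small reversible two-state chain, and compares it with $\frac{2}{(1-\lambda)n}$ times the second moment of a mean-zero $f$ under the stationary measure. The chain, the function `f`, and the helper name are illustrative assumptions. The constant $2$ comes from the elementary estimate $\sum_{s,t=1}^n\lambda^{|s-t|}\le n\frac{1+\lambda}{1-\lambda}$ for reversible chains and is not claimed to be the constant in the paper.

```python
import itertools

def second_moment_of_average(A, pi, f, n):
    """Exact E[((f(x_1)+...+f(x_n))/n)^2] for a stationary chain, by enumeration."""
    total = 0.0
    for path in itertools.product(range(len(pi)), repeat=n):
        prob = pi[path[0]]                 # start from the stationary measure
        for s, t in zip(path, path[1:]):
            prob *= A[s][t]                # Markov transitions
        avg = sum(f[x] for x in path) / n
        total += prob * avg * avg
    return total

A = [[0.9, 0.1], [0.3, 0.7]]   # reversible two-state chain, eigenvalues 1 and 0.6
pi = [0.75, 0.25]              # its stationary distribution
f = [1.0, -3.0]                # mean zero under pi
lam, n = 0.6, 8
lhs = second_moment_of_average(A, pi, f, n)
rhs = 2.0 / ((1.0 - lam) * n) * sum(p * x * x for p, x in zip(pi, f))
print(lhs <= rhs)
```

In this example the left-hand side is approximately $1.154$ and the right-hand side is $1.875$, so the comparison holds with room to spare.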
Remark 4.
Both (1) and (12) are sharp (up to the implicit universal constant factors) for large enough . This is seen by examining the following family of Markov chains. For every consider the two-state Markov chain whose transition matrix equals
[TABLE]
where is the -by- identity matrix and . Then and .
The optimality of (1) is exhibited by taking and that is given by . In this case, it is elementary to check that if , then both sides of (1) are within universal constant multiples of each other. Next, the optimality of (12) is exhibited by considering that is given by and . In this case, it is elementary to check that if , then for small enough both sides of (12) are within universal constant multiples of each other. The routine computations that verify these assertions are omitted.
Remark 5.
The above discussion raises the question of understanding what is required from a Banach space so that the “Gillman phenomenon” for stationary Markov chains (or variants thereof) would hold for -valued mappings. The present work obtains the first examples (notably, Hilbert space) of such theorems in infinite dimensions (equivalently, dimension-independent bounds). However, much more remains to be understood here. This matter is pursued in the forthcoming work [Nao19], where it is explained how it relates to central themes in Banach space theory. Further infinite dimensional statements are derived in [Nao19], including a treatment of (10) in the range which is not covered in Corollary 1.6, through an approach that is entirely different from our reasoning here.
We end the Introduction by noting that the above results have an equivalent dual formulation that is worthwhile to work out explicitly. Given a Banach space , the operator that is given in (7) has norm from to if and only if its adjoint has norm from to , where . This leads to the following dual formulation of Corollary 1.6, whose derivation is a mechanical unravelling of the definitions (the straightforward details are omitted).
Corollary 1.7 (adjoint of (10)).
Let be a stationary Markov chain whose state space is and with . Fix and with . For every measure space and ,
[TABLE]
2 Proof of Theorem 1.1
Suppose from now on that we are in the setting of Theorem 1.1. We will write for simplicity and . We will also let be the transition matrix of .
It suffices to prove (1) when satisfies . Indeed, this could be then applied to the centered function to yield the estimate
[TABLE]
where the last step is the triangle inequality in . So, assume from now on that . It will be convenient to define by setting for all . The assumption on becomes . Below, we will denote the diagonal matrix whose diagonal is by , i.e.,
[TABLE]
Lemma 2.1.
For every we have
[TABLE]
Proof.
Let be the set of all those vectors in that satisfy . Observe that by the Markov property and stationarity, for every we have the following identity.
[TABLE]
So, by expanding the ’th power of and arranging the indices in increasing order,
[TABLE]
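The identity behind Lemma 2.1 can be checked numerically on a small example. The sketch below assumes the standard form of that identity: for times $t_1\le\dots\le t_k$, the expectation of the product of the values of $f$ at those times, under a stationary chain with transition matrix $A$ and stationary measure $\pi$, equals $\pi DA^{t_2-t_1}D\cdots A^{t_k-t_{k-1}}D\mathbf{1}$, where $D$ is the diagonal matrix with diagonal $f$ and $\mathbf{1}$ is the all-ones vector. The names `mixed_moment_matrix` and `mixed_moment_bruteforce`, as well as the example chain, are hypothetical.

```python
import itertools

def mixed_moment_matrix(A, pi, f, times):
    """pi D A^(t2-t1) D ... A^(tk-t(k-1)) D 1, evaluated by sweeping a row vector."""
    n = len(pi)
    row = [pi[i] * f[i] for i in range(n)]          # pi D
    for prev, cur in zip(times, times[1:]):
        for _ in range(cur - prev):                 # multiply by A^(cur - prev)
            row = [sum(row[i] * A[i][j] for i in range(n)) for j in range(n)]
        row = [row[j] * f[j] for j in range(n)]     # multiply by D
    return sum(row)                                 # pair with the all-ones vector

def mixed_moment_bruteforce(A, pi, f, times):
    """The same expectation, by summing over all trajectories of length max(times)."""
    n, length = len(pi), max(times)
    total = 0.0
    for path in itertools.product(range(n), repeat=length):
        prob = pi[path[0]]
        for s, t in zip(path, path[1:]):
            prob *= A[s][t]
        value = 1.0
        for t in times:
            value *= f[path[t - 1]]                 # times are 1-indexed
        total += prob * value
    return total

A = [[0.9, 0.1], [0.3, 0.7]]
pi = [0.75, 0.25]              # stationary: pi A = pi
f = [1.0, -3.0]
times = (2, 3, 5)              # sorted in increasing order
gap = abs(mixed_moment_matrix(A, pi, f, times) - mixed_moment_bruteforce(A, pi, f, times))
```

The matrix formula starts from $\pi$ at time $t_1$, while the brute-force sum starts from $\pi$ at time $1$; the two agree because the chain is stationary. Sorting the times first is exactly the commutative rearrangement that the proof relies on.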
Remark 6.
It is worthwhile to note in passing that while the proof of Lemma 2.1 relies on what may seem to be innocuous identities, the crucial step that rearranged the factors so that their indices are increasing is inherently commutative, and this is what obstructs the direct use of the ensuing proof for matrix-valued functions, namely the setting of [WX08, GLSS18]; alternative routes are taken in [GLSS18, Nao19] but it would be interesting to investigate if a more careful reasoning along the lines of the present work could be used to treat the setting of functions that take values in Schatten–von Neumann trace classes.
Towards bounding from above each of the terms from Lemma 2.1, we record the following iterative application of Hölder’s inequality and the definition of operator norms.
Lemma 2.2.
Fix and . Then, for every we have
[TABLE]
Proof.
Suppose that satisfy . We claim that
[TABLE]
where are defined by . The proof of (15) is by induction on .
The case is tautological. For the induction step, since , by Hölder’s inequality,
[TABLE]
By the definition of the operator norm we have,
[TABLE]
Now (15) follows by combining (16) and (17) with the inductive hypothesis.
Choose and . So,
[TABLE]
and . Hence, with this specific setting of the parameters the bound (15) becomes
[TABLE]
It remains to note that since we have , and therefore . ∎
Fix . Throughout what follows, it will be notationally convenient to consider each Boolean vector as an infinite vector in whose entries vanish on , namely we use the convention for and . Let be all those Boolean vectors of length with no two consecutive [math]s, and with , i.e.,
[TABLE]
For each and that satisfy , we define a quantity in the following way. Consider the consecutive run of s in to which belongs, and let and be the first and last indices of this run, respectively. Formally,
[TABLE]
With this notation, write
[TABLE]
Lemma 2.3.
For every ,
[TABLE]
Proof.
For each , write and . Observe that
[TABLE]
Indeed, if , then either , in which case , or for some , in which case , where both identities are equivalent to the assumption . Now,
[TABLE]
Fix and let be all of the indices at which vanishes. Define by setting
[TABLE]
and
[TABLE]
for . Using the fact that for every , we have the following identity.
[TABLE]
Consequently,
[TABLE]
Next, by Lemma 2.2 with and we have
[TABLE]
In the same vein, for every ,
[TABLE]
and also
[TABLE]
We therefore have
[TABLE]
By substituting (24) into (23) and then substituting the resulting estimate into (22), we arrive at (20). ∎
In light of Lemma 2.1, the following lemma is highly relevant to our goal of proving Theorem 1.1.
Lemma 2.4.
Suppose that satisfies . Then,
[TABLE]
Proof.
Fix and denote for every . Then,
[TABLE]
where the last step of (26) is an application of Lemma 2.3.
Fixing , note that since is stochastic and the columns of are constant, and also since is -stationary. Consequently . So, for every ,
[TABLE]
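The algebraic identity used in this step can be sanity-checked numerically. The sketch below assumes it reads $(A-E)^k=A^k-E$ for every integer $k\ge 1$, where $A$ is the transition matrix and $E$ denotes the matrix all of whose rows equal the stationary measure $\pi$; this follows by induction from $AE=EA=E^2=E$. The helper names and the example chain are hypothetical.

```python
def matmul(X, Y):
    """Product of two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matpow(X, k):
    """k-th power of a square matrix by repeated multiplication (k >= 0)."""
    n = len(X)
    R = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        R = matmul(R, X)
    return R

def max_gap(A, pi, k):
    """Largest entrywise difference between (A - E)^k and A^k - E."""
    n = len(pi)
    E = [list(pi) for _ in range(n)]  # every row equals pi
    left = matpow([[A[i][j] - E[i][j] for j in range(n)] for i in range(n)], k)
    Ak = matpow(A, k)
    return max(abs(left[i][j] - (Ak[i][j] - E[i][j]))
               for i in range(n) for j in range(n))

A = [[0.9, 0.1], [0.3, 0.7]]
pi = [0.75, 0.25]  # stationary: pi A = pi
gaps = [max_gap(A, pi, k) for k in range(1, 7)]
```

All entries of `gaps` should be zero up to floating-point rounding. The identity fails for $k=0$, since the identity matrix differs from $I-E$.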
By definition, . As and are averaging operators, by convexity and the triangle inequality for all . By the Riesz–Thorin interpolation theorem [Rie27, Tho48] (see e.g. Chapter IV in the textbook [Kat04]), this implies that
[TABLE]
A substitution of (28) into (27), followed by a substitution of the resulting bound into (26) shows that in order to prove the desired inequality (25) it suffices to establish the following estimate.
[TABLE]
where for every and such that , we denote
[TABLE]
Fix some . Denote and . Thus and by the definition of we have . With this notation, we have the following bound.
[TABLE]
By the elementary inequality , which holds for every , it follows from this that
[TABLE]
where the last step follows from a straightforward application of Stirling’s formula. Consider the function that is given by . Then . Hence, is increasing on the interval . But , by the assumption on in the statement of Lemma 2.4. Hence , and therefore
[TABLE]
We will show next that
[TABLE]
In combination with (31) and (32), this would imply the desired inequality (29) because .
For each with (i.e., the consecutive run of s in to which belongs is of length at most ), we have and therefore its contribution to the product in (33) is at most . So, (33) holds if there are no runs of s in of length greater than . Otherwise, there is exactly one run of s in of length , and its contribution to the product in (33) equals
[TABLE]
where the last step follows from Stirling’s formula. This proves our goal (33). ∎
Completion of the proof of Theorem 1.1.
By the triangle inequality in (and stationarity) we have
[TABLE]
This bound implies the desired estimate (1) when , so we may assume from now on that . Let be the largest integer such that . Then, , so the conclusion of Lemma 2.4 holds for both and . By Lemma 2.1 (and Stirling’s formula), this gives
[TABLE]
and similarly
[TABLE]
As in (14), it follows from these bounds (which we derived under the assumption ) that the norm of the operator that is given in (7) is bounded by a universal constant multiple of both from to and from to . Since , another application of the Riesz–Thorin theorem gives that the norm of from to is also bounded by a universal constant multiple of . This is precisely the desired bound (1). ∎
References
- [AK16] F. Albiac and N. J. Kalton. Topics in Banach space theory, volume 233 of Graduate Texts in Mathematics. Springer, [Cham], second edition, 2016. With a foreword by Gilles Godefroy.
- [Ban32] S. Banach. Théorie des opérations linéaires, volume 1. PWN - Panstwowe Wydawnictwo Naukowe, Warszawa, 1932.
- [BL76] J. Bergh and J. Löfström. Interpolation spaces. An introduction. Springer-Verlag, Berlin-New York, 1976. Grundlehren der Mathematischen Wissenschaften, No. 223.
- [Cal64] A.-P. Calderón. Intermediate spaces and interpolation, the complex method. Studia Math., 24:113–190, 1964.
- [CLLM12] K. Chung, H. Lam, Z. Liu, and M. Mitzenmacher. Chernoff-Hoeffding bounds for Markov chains: Generalized and simplified. In STACS, pages 124–135. 2012. arXiv:1201.0559.
- [Din95] I. H. Dinwoodie. A probability inequality for the occupation measure of a reversible Markov chain. Ann. Appl. Probab., 5(1):37–43, 1995.
- [FJS18] J. Fan, B. Jiang, and Q. Sun. Hoeffding’s lemma for Markov chains and its applications to statistical learning, 2018.
- [Gil93] D. Gillman. A Chernoff bound for random walks on expander graphs. In 34th Annual Symposium on Foundations of Computer Science (Palo Alto, CA, 1993), pages 680–691. IEEE Comput. Soc. Press, Los Alamitos, CA, 1993. doi:10.1109/SFCS.1993.366819.
