Modified log-Sobolev inequalities and two-level concentration
Holger Sambale, Arthur Sinulis

TL;DR
This paper establishes that modified log-Sobolev inequalities lead to two-level concentration inequalities, applicable in continuous and discrete settings, and demonstrates their use in proving Talagrand's inequality and analyzing fluctuations of statistics.
Contribution
It introduces a general framework connecting modified log-Sobolev inequalities to two-level concentration and applies it to symmetric groups and hypercube slices.
Findings
Derived two-level concentration inequalities from mLSI.
Proved Talagrand's convex distance inequality using mLSI.
Obtained fluctuation orders consistent with CLTs for known statistics.
| function | invariance | mean | limit theorem |
|---|---|---|---|
| H | bi-invariant | | |
| D | right invariant | | CLT |
| | right invariant | | CLT |
| I | right invariant | | CLT |
Holger Sambale¹ and Arthur Sinulis¹
¹Faculty of Mathematics, Bielefeld University, Bielefeld, Germany
{hsambale, asinulis}@math.uni-bielefeld.de
Abstract.
We consider a generic modified logarithmic Sobolev inequality (mLSI) of the form $\mathrm{Ent}_{\mu}(e^f) \le \tfrac{\rho}{2} \mathbb{E}_\mu e^f \Gamma(f)^2$ for some difference operator $\Gamma$, and show how it implies two-level concentration inequalities akin to the Hanson–Wright or Bernstein inequality. This can be applied to the continuous (e. g. the sphere or bounded perturbations of product measures) as well as the discrete setting (the symmetric group, finite measures satisfying an approximate tensorization property, …).
Moreover, we use modified logarithmic Sobolev inequalities on the symmetric group and for slices of the hypercube to prove Talagrand’s convex distance inequality, and provide concentration inequalities for locally Lipschitz functions on $S_n$. Some examples of known statistics are worked out, for which we obtain the correct order of fluctuations, which is consistent with central limit theorems.
Key words and phrases:
Bernstein inequality, concentration of measure phenomenon, convex distance inequality, Hanson–Wright inequality, modified logarithmic Sobolev inequality, symmetric group
This research was supported by the German Research Foundation (DFG) via CRC 1283 “Taming uncertainty and profiting from randomness and low regularity in analysis, stochastics and their applications”.
1. Introduction
Concentration and one-sided deviation inequalities have become an indispensable tool of probability theory and its applications. A question that arises frequently is to bound the fluctuations of a function of many random variables (or, equivalently, a function on a product space) around its mean, and oftentimes it is possible to prove sub-Gaussian tail decay of the form
[TABLE]
for some , and all . There are various ways to establish sub-Gaussian estimates, such as the martingale method, the entropy method and an information-theoretic approach, and we refer to the monograph [BLM13] for further details.
On the other hand, in some situations it is not possible to prove sub-Gaussian tails, and a suitable replacement might be Bernstein-type
[TABLE]
or Hanson–Wright-type inequalities
[TABLE]
As both inequalities show two different levels of tail decay (the Gaussian one for and an exponential one for ), we use the terminology of Adamczak (see [ABW17, AKPS19]) and call inequalities of this type two-level deviation inequalities. If a similar estimate holds for as well, we refer to these as two-level concentration inequalities.
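To make the two regimes concrete, the following minimal sketch evaluates a generic two-level tail of the shape $\exp(-\min(t^2/a, t/b))$. This shape, the function names and the parameter values are illustrative assumptions and not the exact constants of any result in this note. The Gaussian term is active for $t \le a/b$ and the exponential term takes over beyond that point.

```python
import math


def two_level_exponent(t, a, b):
    """Exponent min(t^2 / a, t / b) of a generic two-level tail bound."""
    return min(t * t / a, t / b)


def regime(t, a, b):
    """Return which term is active: Gaussian for t <= a / b, exponential beyond."""
    return "gaussian" if t * t / a <= t / b else "exponential"


if __name__ == "__main__":
    a, b = 4.0, 0.5  # hypothetical parameters; the crossover sits at t = a / b = 8
    for t in (1.0, 4.0, 8.0, 16.0):
        bound = math.exp(-two_level_exponent(t, a, b))
        print(t, regime(t, a, b), bound)
```

For small $t$ the bound behaves like a Gaussian tail with variance proxy of order $a$, while for large $t$ it only decays exponentially at rate $1/b$.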
The purpose of this note is to give a unified treatment of some of the existing literature on two-level deviation and concentration inequalities by showing that these are implied by a modified logarithmic Sobolev inequality (mLSI for short). We prove a general theorem providing two-level deviation and concentration inequalities in various frameworks. In particular, in Section 2, we get back and partially improve a number of earlier results like [BCG17] and [GS20].
We work in a general framework which was introduced in [BG99]. Consider a probability space and let denote the expectation of a random variable with respect to . An operator on a class of bounded, measurable functions is called a difference operator, if
(1) for all , is a non-negative measurable function,
(2) for all and we have and .
At first reading, one can think of in the setting . However, we want to stress that we do not require to satisfy a chain rule, and does not need to be an operator in the language of functional analysis.
We say that satisfies a for some , if for all we have
[TABLE]
where () is the entropy functional. This functional inequality is well-known in the theory of concentration of measure and has been used in various works, see [BG99] and the references therein. It is well-known that if satisfies a , we have for any function such that ,
[TABLE]
which is a classical first order concentration of measure result yielding sub-Gaussian concentration (cf. (3.5)). It is not hard to see that the same holds for if for all . Our first goal is to establish second order analogues of (1.2).
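As a small illustration of the entropy functional, the sketch below evaluates it on a finite probability space, assuming the standard definition $\mathrm{Ent}(g) = \mathbb{E}[g \log g] - \mathbb{E}[g] \log \mathbb{E}[g]$ for positive $g$. The helper names and the two-point example are hypothetical. The code checks two well-known properties: the entropy is non-negative (by Jensen's inequality) and homogeneous of degree one.

```python
import math


def expectation(values, probs):
    """Expectation of a function on a finite probability space."""
    return sum(v * p for v, p in zip(values, probs))


def entropy(g, probs):
    """Ent(g) = E[g log g] - E[g] log E[g] for a strictly positive function g."""
    mean_g = expectation(g, probs)
    mean_g_log_g = expectation([v * math.log(v) for v in g], probs)
    return mean_g_log_g - mean_g * math.log(mean_g)


if __name__ == "__main__":
    probs = [0.5, 0.5]          # hypothetical two-point space
    g = [1.0, 3.0]
    print(entropy(g, probs))                    # strictly positive
    print(entropy([2.0 * v for v in g], probs))  # twice the value above
```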
1.1. Two-level concentration inequalities
Our first set of results consists of two-level deviation inequalities for probability measures satisfying a modified logarithmic Sobolev inequality.
Theorem 1.1.
Assume that satisfies a for some difference operator and . Let be two measurable functions such that and is sub-Gaussian, i. e. for some , and
[TABLE]
Then for all it holds
[TABLE]
If moreover for all , we have
[TABLE]
One possible way to show sub-Gaussian concentration for in presence of a is by the Herbst argument. This leads to the following corollary.
Corollary 1.2.
Assume that satisfies a for some difference operator and . Let be two measurable functions such that and . Then for all we have
[TABLE]
If, again, for all , then the same bound holds for .
By elementary means (cf. (3.1)), the constant can be replaced by any . It is also possible to modify our proofs in order to apply [KZ18, Lemma 1.3], which leads to an inequality of the form
[TABLE]
for some absolute constant (the same one as in [KZ18]). However, this is at the cost of a weaker denominator in the Gaussian term as compared to (1.4), and so we choose to present it in the form of Theorem 1.1.
If the difference operator satisfies a chain rule-type condition, we obtain the following result, which in particular improves some of the constants above:
Proposition 1.3.
Assume that satisfies a for some and some difference operator which satisfies for all positive functions . Let be such that and . For any it holds
[TABLE]
If satisfies for any , the same bound holds for .
We will see a number of examples of such difference operators throughout this paper. Obviously, one example is the usual gradient, but also many difference operators involving a positive part satisfy the property in question.
In all the above results, a possible choice of is usually given by , resulting in in the denominator of the Gaussian term. In this case, the second condition reads as , which can be understood as a condition on an iterated (and thus second order) difference of .
In fact, Theorem 1.1 can be understood as a Bernstein-type concentration inequality. Indeed, it is easy to see that for all and we have
[TABLE]
This leads to the following corollary.
Corollary 1.4.
In the situation of Theorem 1.1, for all we have
[TABLE]
If for all , then the same bound holds with replaced by .
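The passage between two-level and Bernstein-type bounds rests on an elementary comparison of the exponents $\min(t^2/a, t/b)$ and $t^2/(a+bt)$. The sketch below numerically checks one standard version of this comparison, namely $t^2/(a+bt) \le \min(t^2/a, t/b) \le 2t^2/(a+bt)$ for $a, b, t > 0$. The function names, the parameter values and the grid are illustrative assumptions.

```python
def bernstein_exponent(t, a, b):
    """Exponent t^2 / (a + b t) of a Bernstein-type bound."""
    return t * t / (a + b * t)


def two_level_exponent(t, a, b):
    """Exponent min(t^2 / a, t / b) of a two-level bound."""
    return min(t * t / a, t / b)


def check_sandwich(a, b, ts, tol=1e-12):
    """Check t^2/(a+bt) <= min(t^2/a, t/b) <= 2 t^2/(a+bt) on a grid of t > 0."""
    for t in ts:
        lower = bernstein_exponent(t, a, b)
        middle = two_level_exponent(t, a, b)
        if not (lower <= middle + tol and middle <= 2.0 * lower + tol):
            return False
    return True


if __name__ == "__main__":
    grid = [0.1 * k for k in range(1, 400)]
    print(check_sandwich(4.0, 0.5, grid))
```

In other words, the two forms of the exponent agree up to a factor of at most 2, so either can be used at the cost of absolute constants.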
Let us remark that the use of modified LSIs allows us to prove results for some classes of measures we could not address in previous work (e. g. [GSS18b]), such as weakly dependent measures which might not have a finite number of atoms.
Next, we show similar deviation inequalities for an important class of functions, namely self-bounded functions. In our framework, for a difference operator we say that is a self-bounded function, if
[TABLE]
for some constants . For a product measure , there are various sources that provide deviation or concentration inequalities for self-bounded functions, see e. g. [BLM00, Theorem 2.1], [Rio01, Théorème 3.1], [BLM03, Theorem 5], [BBLM05, Corollary 1], [Cha05, Theorem 3.9], [MR06, Theorem 1] and [BLM09, Theorem 1]. As many of the proofs rely on the entropy method, it is not hard to adapt them to obtain Bernstein-type deviation inequalities only requiring an mLSI, which includes many more types of measures also allowing for dependencies:
Proposition 1.5.
Assume that satisfies a and let be a self-bounded function. Then for all we have
[TABLE]
If, additionally, for all , then for all it holds
[TABLE]
As we show in Proposition 2.18, product measures always satisfy an mLSI with respect to a certain -type difference operator, which was also used in the works mentioned above. This is a well-known fact and was first proven in [Mas00].
1.2. The symmetric group
One example we especially discuss in this note is the symmetric group equipped with the uniform measure. To this end, we need some notations. We write the group operation on as for , and denote by the transposition of and . We define two difference operators (on ) via
[TABLE]
For our results, we will need that the symmetric group satisfies modified logarithmic Sobolev inequalities with respect to the two difference operators defined above:
Proposition 1.6.
Let be the symmetric group equipped with the uniform measure. Then a and a hold.
To formulate our next result, let us recall the notion of observable diameter. In the context of equipped with any metric , we define it by
[TABLE]
For some metrics, this expression can be simplified. We say that a metric is right invariant, if for any we have , and left invariant if . It is bi-invariant, if it is right and left invariant. Assuming that is left (or right) invariant, we have
[TABLE]
We call a function locally Lipschitz with respect to , if for all and we have .
Theorem 1.7.
Let be the symmetric group equipped with a metric and be the uniform distribution on . Assume that is locally Lipschitz with respect to . For all it holds
[TABLE]
As a consequence, we have
[TABLE]
For example, Theorem 1.7 can easily recover concentration inequalities for locally Lipschitz functions with respect to the normalized Hamming distance $d_{H}(\sigma,\pi)=n^{-1}\sum_{i=1}^{n}\mathbb{1}_{\sigma(i)\neq\pi(i)}$. In this case, . We work out further examples in Subsection 2.1.
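The normalized Hamming distance is easy to experiment with on small symmetric groups. The following sketch represents permutations as tuples on $\{0,\ldots,n-1\}$ (an implementation choice, as are all function names) and computes the largest value of $d_H(\sigma, \sigma\tau_{ij})$ over all $\sigma$ and all transpositions. Since $\sigma\tau_{ij}$ differs from $\sigma$ in exactly the two positions $i$ and $j$, this value is $2/n$.

```python
from itertools import combinations, permutations


def hamming(sigma, pi):
    """Normalized Hamming distance: fraction of positions where sigma and pi differ."""
    return sum(1 for s, p in zip(sigma, pi) if s != p) / len(sigma)


def compose(sigma, tau):
    """Product (sigma tau)(i) = sigma(tau(i)) of two permutations given as tuples."""
    return tuple(sigma[tau[i]] for i in range(len(tau)))


def transposition(n, i, j):
    """Transposition of i and j as a permutation of {0, ..., n-1}."""
    t = list(range(n))
    t[i], t[j] = t[j], t[i]
    return tuple(t)


def max_transposition_jump(n):
    """Largest d_H(sigma, sigma tau_ij) over all sigma in S_n and all i < j."""
    return max(
        hamming(sigma, compose(sigma, transposition(n, i, j)))
        for sigma in permutations(range(n))
        for i, j in combinations(range(n), 2)
    )


if __name__ == "__main__":
    print(max_transposition_jump(4))
```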
Finally, we give a proof of Talagrand’s famous concentration inequality for the convex distance for random permutations by similar means as used in the proofs of the results above. To this end, recall that for any measurable space and any , we may define the convex distance of to some measurable set by
[TABLE]
where
[TABLE]
Proposition 1.8.
For any it holds
[TABLE]
As compared to Talagrand’s original formulation (see [Tal95, Theorem 5.1]), (1.8) has a weaker absolute constant 144 instead of 16. It is possible to improve our own constant a bit by invoking slightly more subtle estimates but we do not seem to arrive at 16. For product measures, an inequality similar to (1.8) was deduced in [Tal95], a form of which with a weaker constant was proven in [BLM09] with the help of the entropy method. This was extended to weakly dependent random variables in [Pau14]. However, it does not seem possible to adjust the method therein to the case of the symmetric group, and so we are not aware of any proof of either of the inequalities for the symmetric group using the entropy method. In [Sam17] the author has proven the convex distance inequality for the symmetric group using weak transport inequalities.
It is possible to prove a weaker version of (1.8) with a somewhat better constant:
Proposition 1.9.
Let be the symmetric group and be the uniform distribution on . For any set with and all we have
[TABLE]
In fact, (1.8) implies (1.9) with a constant of 144 instead of 64.
1.3. Slices of the hypercube
Finally, let us discuss another model for which we are able to prove a convex distance inequality similar to (1.8). Given two natural numbers such that , consider the corresponding slice of the hypercube , and denote by the uniform measure on . On , we define the difference operators
[TABLE]
Here, switches the -th and the -th coordinate of the configuration . Up to the scaling of , is the generator of the so-called Bernoulli–Laplace model.
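For intuition, the slice and the switch operation can be enumerated directly for small parameters. The sketch below assumes the usual description of the slice as the set of all $x \in \{0,1\}^n$ with exactly $r$ ones; the names `slice_of_hypercube` and `switch` are hypothetical. It confirms that switching two coordinates never leaves the slice.

```python
from itertools import combinations


def slice_of_hypercube(n, r):
    """All configurations x in {0,1}^n with exactly r coordinates equal to 1."""
    points = []
    for ones in combinations(range(n), r):
        x = [0] * n
        for k in ones:
            x[k] = 1
        points.append(tuple(x))
    return points


def switch(x, i, j):
    """Switch the i-th and the j-th coordinate of the configuration x."""
    y = list(x)
    y[i], y[j] = y[j], y[i]
    return tuple(y)


if __name__ == "__main__":
    n, r = 5, 2
    points = set(slice_of_hypercube(n, r))
    print(len(points))
    # Every switch maps the slice into itself.
    print(all(switch(x, i, j) in points
              for x in points
              for i, j in combinations(range(n), 2)))
```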
As in the previous section, a modified logarithmic Sobolev inequality holds:
Proposition 1.10.
For as above, a and a hold.
Using this, we may establish a convex distance inequality by means of the entropy method again:
Proposition 1.11.
For any it holds
[TABLE]
1.4. Outline
In Section 2 we provide various applications and concentration inequalities. This includes examples of functions on the symmetric group (Section 2.1), concentration inequalities for multilinear polynomials in -valued random variables (Section 2.2), as well as consequences of Theorem 1.1 for the Euclidean sphere and measures on satisfying a logarithmic Sobolev inequality (Section 2.3) and for probability measures (on general spaces) satisfying an mLSI with respect to some “ difference operator” (see Section 2.4). Moreover, in Section 2.5 we recover and extend the classical Bernstein inequality for independent random variables (up to constants).
Section 3 contains all the proofs, both of the results mentioned in this section as well as in Section 2.
2. Applications
Let us now describe various situations which give rise to mLSIs with respect to “natural” difference operators, and show some consequences of the main results.
2.1. Symmetric group
The aim of this subsection is to show how the results from Section 1 can be used to easily obtain concentration inequalities for functions on the symmetric group. In particular, we calculate many examples of statistics for which central limit theorems were proven, and show that the variance proxy of the sub-Gaussian estimate and the true variance agree (up to a constant independent of the dimension). This provides non-asymptotic concentration results, which are consistent with the limit theorems.
First, let us introduce the following natural metrics on :
[TABLE]
Table 1 collects some basic properties of , , and .
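The invariance properties listed in Table 1 can be verified by brute force on small symmetric groups. The sketch below assumes the standard definitions of the four metrics (normalized Hamming distance, Spearman's footrule $\sum_i |\sigma(i)-\pi(i)|$, Spearman's rank correlation $\sum_i (\sigma(i)-\pi(i))^2$, and Kendall's $\tau$ as the number of discordant pairs); these formulas and all function names are assumptions made for illustration. Right invariance means $d(\sigma\eta, \pi\eta) = d(\sigma,\pi)$, left invariance means $d(\eta\sigma, \eta\pi) = d(\sigma,\pi)$.

```python
from itertools import combinations, permutations


def compose(s, t):
    """Product (s t)(i) = s(t(i)) of permutations given as tuples on {0,...,n-1}."""
    return tuple(s[t[i]] for i in range(len(t)))


def hamming(s, p):
    return sum(a != b for a, b in zip(s, p)) / len(s)


def footrule(s, p):
    return sum(abs(a - b) for a, b in zip(s, p))


def spearman_rank(s, p):
    return sum((a - b) ** 2 for a, b in zip(s, p))


def kendall(s, p):
    """Number of pairs i < j that s and p order differently."""
    return sum(1 for i, j in combinations(range(len(s)), 2)
               if (s[i] - s[j]) * (p[i] - p[j]) < 0)


def is_right_invariant(d, n):
    perms = list(permutations(range(n)))
    return all(d(compose(s, e), compose(p, e)) == d(s, p)
               for s in perms for p in perms for e in perms)


def is_left_invariant(d, n):
    perms = list(permutations(range(n)))
    return all(d(compose(e, s), compose(e, p)) == d(s, p)
               for s in perms for p in perms for e in perms)


if __name__ == "__main__":
    for name, d in [("Hamming", hamming), ("footrule", footrule),
                    ("rank correlation", spearman_rank), ("Kendall", kendall)]:
        print(name, is_right_invariant(d, 4), is_left_invariant(d, 4))
```

On $S_4$ all four metrics pass the right invariance check, while only the Hamming distance also passes the left invariance check.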
Example 2.1.
In this example, we calculate the observable diameters of the metrics on the symmetric group introduced above. By Theorem 1.7, this yields concentration properties for (locally) Lipschitz functions.
(1) For the Hamming distance it is clear that , which implies . So, Theorem 1.7 recovers a concentration result from [Mau79].
The resulting variance estimate is not always sharp; for example, if we consider the function , the variance is and not of order . On the other hand, the function is a locally Lipschitz function with respect to , which converges weakly to a Poisson random variable. As a consequence, there cannot be an -independent sub-Gaussian estimate in the class of all locally Lipschitz functions.
(2) If we define for a distance on by the induced norm
[TABLE]
this yields . Consequently, recalling that
[TABLE]
for any , we have
[TABLE]
The case gives Spearman’s footrule and Spearman’s rank correlation.
(3) Considering Kendall’s $\tau$, we can readily see that for two indices and any it holds , since can be brought to by first taking to its place, and then . So, as above, this leads to
[TABLE]
(4) In a more general setting, let be a faithful, unitary representation of and let be a unitarily invariant norm on . Then defines a bi-invariant metric on , and in this case we have
[TABLE]
Example 2.2.
Define the random variable . We have
[TABLE]
If we define the matrix via , then the right hand side is (up to the factor ) the squared Hilbert–Schmidt norm of . It is clear that , and one can also easily see that it is invariant under right multiplication with any transposition . As any permutation can be written as a product of transpositions, we can evaluate it at the identity element. Consequently,
[TABLE]
Using (1.2), this leads to the concentration inequality
[TABLE]
Actually, the term is natural, as the variance of is of order (see the table above). Incorporating the variance of into the inequality above leads to
[TABLE]
which yields the correct tail behavior.
Example 2.3.
Let us consider the -Lipschitz function . For any we have by (1.7), and Example 2.1 (3)
[TABLE]
which is consistent with the central limit theorem for .
Example 2.4.
We define the number of ascents $f(\sigma)=\sum_{j=1}^{n-1}\mathbb{1}_{\sigma(j+1)>\sigma(j)}$. It can be easily shown that for any the number of ascents is not sensitive to transpositions in the sense that . Consequently, this leads to , implying the concentration inequality
[TABLE]
again using (1.2). Alternatively, this also follows from Example 2.1 (1). Again, the variance term of order is of the right order, as in [CKSS72] the authors have shown a central limit theorem for the number of ascents. More precisely, the sequence converges to a standard normal distribution. The above calculations lead to
[TABLE]
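The insensitivity of the number of ascents to transpositions can be checked exhaustively for small $n$. In the sketch below, right multiplication by $\tau_{ij}$ is implemented as swapping the entries at positions $i$ and $j$ (an assumption about the convention in use), and the helper names are hypothetical. A direct case analysis shows that such a swap changes the number of ascents by at most 2, which the enumeration confirms.

```python
from itertools import combinations, permutations


def ascents(sigma):
    """Number of indices j with sigma(j+1) > sigma(j)."""
    return sum(1 for j in range(len(sigma) - 1) if sigma[j + 1] > sigma[j])


def swap_positions(sigma, i, j):
    """Permutation sigma tau_ij: exchange the entries at positions i and j."""
    s = list(sigma)
    s[i], s[j] = s[j], s[i]
    return tuple(s)


def max_transposition_jump(n):
    """Largest |f(sigma) - f(sigma tau_ij)| over all sigma in S_n and all i < j."""
    return max(
        abs(ascents(s) - ascents(swap_positions(s, i, j)))
        for s in permutations(range(n))
        for i, j in combinations(range(n), 2)
    )


if __name__ == "__main__":
    print(max_transposition_jump(4))
    print(max_transposition_jump(5))
```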
Example 2.5.
A closely related statistic is given by the sum of the ascents defined as . A short calculation shows
[TABLE]
Indeed, if we let , then
[TABLE]
Now each of the terms , is less than , and the same holds true for the two other sums. Therefore this yields
[TABLE]
[Cla09] has calculated the variance of the sum of ascents, and it is of order , which is in good accordance with the concentration inequality (again, up to the factor).
Example 2.6.
Given a matrix of real numbers satisfying , define . By elementary computations one can show , i. e. is self-bounding. As a consequence, Proposition 1.5 leads to
[TABLE]
Concentration inequalities for have been proven using the exchangeable pair approach in [Cha05, Proposition 3.10] (see also [Cha07, Theorem 1.1]), with the denominator being .
For example, if is the identity matrix, is the number of fixed points of a random permutation, which satisfies for all . In this case, converges to a Poisson distribution with mean as (see e. g. [Dia88]).
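The fixed point statistic can be examined exactly for small $n$ by enumeration. The sketch below (with hypothetical helper names) computes the exact distribution of the number of fixed points of a uniform random permutation using rational arithmetic. It illustrates the well-known facts that the mean and the variance both equal 1 for $n \ge 2$, matching the Poisson limit with mean 1.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial


def fixed_point_distribution(n):
    """Exact law of the number of fixed points of a uniform permutation in S_n."""
    counts = [0] * (n + 1)
    for sigma in permutations(range(n)):
        counts[sum(1 for i, v in enumerate(sigma) if i == v)] += 1
    total = factorial(n)
    return [Fraction(c, total) for c in counts]


def mean_and_variance(dist):
    """Mean and variance of a distribution on {0, 1, ..., len(dist) - 1}."""
    mean = sum(k * p for k, p in enumerate(dist))
    second_moment = sum(k * k * p for k, p in enumerate(dist))
    return mean, second_moment - mean * mean


if __name__ == "__main__":
    dist = fixed_point_distribution(5)
    print(dist)
    print(mean_and_variance(dist))
```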
Example 2.7.
Finally, consider the random variable , where $g(\sigma)=\sum_{i=1}^{n-1}\mathbb{1}_{\sigma(i+1)<\sigma(i)}$ is the number of descents. In [CD17] the authors calculated the expectation and variance of and proved a central limit theorem. As in the above example one can easily see that , as well as , where denotes the inverse map. Since holds true for any functions , we also have , implying for any
[TABLE]
Again, the variance is of order , so that it is consistent with the CLT.
2.2. Multilinear polynomials in -random variables
The aim of this section is to show Bernstein-type concentration inequalities for a class of polynomials in independent random variables with values in . The functions we consider are constructed as follows: Let be a weighted hypergraph, such that every consists of at most vertices, assume that are independent, -valued random variables, and set
[TABLE]
Define the maximum first order partial derivative as
[TABLE]
Proposition 2.8.
Let be independent, -valued random variables and given as in (2.1). Assume that and for all . We have for any
[TABLE]
Furthermore, for it holds
[TABLE]
A slight modification of the proof of Proposition 2.8 also allows one to prove deviation inequalities for suprema of such homogeneous polynomials. For example, this can be used to prove the following concentration inequalities for maxima or norms of linear forms.
Proposition 2.9.
Let be independent, -valued random variables, and define . For any we have
[TABLE]
In particular, for any it holds
[TABLE]
One possible application of Proposition 2.8 is to understand the finite concentration properties of the so-called d-runs on the line.
Proposition 2.10.
Let , be independent, identically distributed random variables with values in and mean . Define the random variable , where the indices are to be understood modulo . For any it holds
[TABLE]
In [RR09, Theorem 4.1], the authors prove a CLT for the -runs on the line for Bernoulli random variables with success probability , by normalizing by . This is also the reason for the choice in inequality (2.4). In other words, under the assumption as , Proposition 2.10 yields sub-Gaussian tails for . This is in good accordance with the aforementioned CLT.
Moreover, note that in this example, our methodology leads to better results than the usual bounded difference inequality. Indeed, the latter only yields
[TABLE]
suggesting an (inaccurate) normalization of by .
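For concreteness, the $d$-runs statistic can be computed as follows. The sketch assumes the usual form $\sum_{i} X_i X_{i+1} \cdots X_{i+d-1}$ with indices taken modulo $n$; the function name and the sample inputs are hypothetical. For $\{0,1\}$-valued inputs it counts the cyclic windows of length $d$ that consist only of ones.

```python
def d_runs(x, d):
    """Sum over i of x[i] * x[i+1] * ... * x[i+d-1], indices taken modulo len(x)."""
    n = len(x)
    total = 0
    for i in range(n):
        product = 1
        for j in range(d):
            product *= x[(i + j) % n]
        total += product
    return total


if __name__ == "__main__":
    print(d_runs([1, 1, 1, 0, 0, 0], 3))  # one window of three consecutive ones
    print(d_runs([1, 1, 0, 1, 1, 0], 2))  # two windows of two consecutive ones
    print(d_runs([1] * 6, 3))             # every cyclic window qualifies
```

Changing a single coordinate affects exactly the $d$ windows that contain it, which is the quantity entering a bounded difference argument.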
Example 2.11.
If is the Erdős–Rényi model with parameter , for any fixed graph with vertices and edges, the subgraph counting statistic can be written in the form (2.1) with , and . Furthermore, it is easy to see that for the maximum degree , so that Proposition 2.8 yields
[TABLE]
For example, this gives nontrivial bounds in the triangle case whenever as . This bound is suboptimal, as the optimal decay is known to be , see [Cha12, DK12]. However, it is better than the bound obtained by the bounded differences inequality. In general, if we consider subgraph counting statistics for some subgraph with vertices and edges on an Erdős–Rényi model , the bounded difference inequality yields the estimate
[TABLE]
Thus, to obtain non-trivial estimates in the limit , one has to assume that . With the above inequality, this can be weakened to .
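To see how a subgraph count fits the polynomial framework, the triangle count can be written as a sum over vertex triples of the product of three edge indicators. The sketch below implements this for an explicit edge list; the representation of edges as vertex pairs and the function name are illustrative assumptions. In the random graph setting the edge indicators would be independent $\{0,1\}$-valued random variables.

```python
from itertools import combinations


def triangle_count(n, edges):
    """Number of triangles, written as a degree-3 polynomial in edge indicators."""
    present = {frozenset(e) for e in edges}

    def indicator(u, v):
        # Edge indicator X_{uv} in {0, 1}.
        return 1 if frozenset((u, v)) in present else 0

    return sum(
        indicator(a, b) * indicator(b, c) * indicator(a, c)
        for a, b, c in combinations(range(n), 3)
    )


if __name__ == "__main__":
    complete_graph = list(combinations(range(4), 2))
    print(triangle_count(4, complete_graph))
    print(triangle_count(4, [(0, 1), (1, 2), (2, 3)]))
```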
2.3. Derivations
If satisfies the chain rule, i. e. for all differentiable and such that we have , then (1.1) is equivalent to the usual logarithmic Sobolev inequality (in short: )
[TABLE]
Using this, one can derive second order concentration inequalities similar to the ones given in [BCG17] from Proposition 1.3. Let be the unit sphere equipped with the uniform measure . It is known that for
[TABLE]
holds for all Lipschitz functions and the spherical gradient (see [BCG17, Formula (3.1)] for the logarithmic Sobolev inequality, from which the modified one follows as above). To state our next result, we introduce the following notation (which we will stick to for the rest of this paper): if is an matrix, we denote by its Hilbert–Schmidt and by its operator norm.
Proposition 2.12.
Consider equipped with the uniform measure and let be a function satisfying . For any
[TABLE]
This follows immediately from Proposition 1.3 and the inequality proven in [BCG17, Lemma 3.1]. Now, if is and orthogonal to all affine functions (in ), [BCG17, Proposition 5.1] shows So, if we additionally have , the estimate
[TABLE]
follows.
In a similar manner, one may address open subsets of equipped with some probability measure satisfying a logarithmic Sobolev inequality (with respect to the usual gradient ). This situation has been sketched in [BCG17, Remark 5.3] and was discussed in more detail in [GS20]. Here we easily obtain the following result:
Proposition 2.13.
Let be an open set, equipped with a probability measure which satisfies a , and let be a function satisfying . For any
[TABLE]
For the proof it only remains to note that , cf. [GS20, Lemma 7.2]. As above, if we require the first order partial derivatives to be centered (which translates into orthogonality to linear functions if is the standard Gaussian measure, for instance), a simple application of the Poincaré inequality yields In particular, we have the following corollary which immediately follows from Proposition 1.3 and the Poincaré inequality.
Corollary 2.14.
Let be an open set, equipped with a probability measure satisfying a , and be a function with
[TABLE]
For any we have
[TABLE]
Thus, if we recenter a function and its derivatives, the two conditions on the Hessian ensure two-level concentration inequalities. For functions of independent Gaussian vectors, two-level concentration inequalities have been studied in [Wol13] using the Hoeffding decomposition instead of a recentering of the partial derivatives.
Note that (2.6) and Corollary 2.14 do not only recover [BCG17, Theorem 1.1] and [GS20, Theorem 1.4], but even strengthen these results by providing two-level bounds. To illustrate this, we discuss one of the examples from [GS20] in more detail.
Example 2.15 (Eigenvalues of Wigner matrices).
Let be a family of independent real-valued random variables whose distributions all satisfy a for a fixed . Putting for , we define the random matrix . Then, by a simple argument using the Hoffman–Wielandt theorem, the joint distribution of its ordered eigenvalues on (in fact, a.s.) satisfies a with constant (see for instance [BG10]).
Now consider a -smooth function with first order (partial) derivatives in and second order derivatives bounded by some constant . Considering a quadratic statistic and recentering according to Corollary 2.14, we shall study
[TABLE]
where denote partial derivatives. For instance, if , we have . Simple calculations show that as well as . Here, by we denote suitable absolute constants which may vary from line to line. Following [GS20, Proposition 8.5], this leads to the exponential moment bound
[TABLE]
By Chebyshev’s inequality, for all , thus yielding subexponential fluctuations of order .
By contrast, Corollary 2.14 leads to
[TABLE]
which is much better for large . In particular, the fluctuations in the subexponential regime are of order now. This can be interpreted as an extension of the self-normalizing property of linear eigenvalue statistics to a second order situation on the level of fluctuations (cf. the discussion of [GS20, Proposition 8.5]). Note that in [GS20], a comparable result could be achieved for the special case of only.
2.4. Weakly dependent measures
To continue the discussion of the previous section for a larger class of measures, we will now consider applications of Theorem 1.1 for functions of weakly dependent random variables (which, in our case, essentially means that a certain mLSI with respect to a suitable difference operator is satisfied). Throughout this section, we shall consider probability measures on a product of Polish spaces . For a vector and we let , and for we write . Now we define difference operators on via
[TABLE]
Here, the suprema over (and ) are to be understood with respect to the support of . Clearly, and . Moreover, we need a second order version of the difference operator . To this end, for any , define
[TABLE]
and let be the matrix (“Hessian”) with zero diagonal and entries on the off-diagonal.
We now have the following second order result in presence of a :
Proposition 2.16.
Let be a probability measure on a product of Polish spaces satisfying a , and let be a bounded measurable function. If , we have for any
[TABLE]
On the other hand, if for all , we have for all
[TABLE]
Proposition 2.16 implies many second order results from previous articles. For instance, it is well-known (and we will check again below) that any product probability measure satisfies a . Therefore, from (2.7) it is easily possible to obtain results similar to [GS20, Theorem 1.2]. To see this, it suffices to note that for functions with Hoeffding decomposition , one may apply [GS20, Proposition 5.2] to upper bound by . Unlike in [GS20], Proposition 2.16 yields two-level (or Bernstein-type) inequalities, which can be regarded as an advantage of the present approach.
Similarly, we may retrieve (and sharpen) some of the results from further articles like e. g. [GSS18a] for . On the other hand, it seems that requiring modified logarithmic Sobolev inequalities instead of usual logarithmic Sobolev inequalities extends the class of measures to which our results apply, in particular in non-independent situations. We will discuss the property and provide some sufficient conditions in more detail below.
For some classes of functions, we can obtain variants of Proposition 2.16 which are especially adapted to the properties of the functions under consideration. In particular, we may show deviation inequalities for suprema of quadratic forms in the spirit of [KZ18] for the weakly dependent case.
Proposition 2.17.
Let be supported in and satisfy a . Let be a countable class of symmetric matrices, bounded in operator norm and with zeroes on its diagonal. Define , and . We have for any
[TABLE]
Note that while in general, we only obtain deviation inequalities here, for a single symmetric matrix with zeroes on its diagonal and the quadratic form similar arguments as in the proof of Proposition 2.16 do lead to concentration inequalities for .
If is a product measure, the result of Proposition 2.17 is well-known and has been proven various times, see for example [Tal96, Theorem 1.2] for concentration inequalities in Rademacher random variables, [Led97, Theorem 3.1] for the upper tail inequalities and random variables satisfying , [BLM03, Theorem 17] for the upper bound and Rademacher random variables and [BBLM05, Corollary 4]. More recent results include [HKZ12, RV13, Ada15, AKPS19, KZ18, GSS18b].
To understand which classes of measures may be addressed by Propositions 2.16 and 2.17, let us study the property in more detail. First, we show that it is implied by another functional inequality. Assume that a probability measure on a product of Polish spaces satisfies
[TABLE]
where denotes the regular conditional probability. This functional inequality is (also) known as a modified logarithmic Sobolev inequality in the framework of Markov processes, and it is equivalent to exponential decay of the relative entropy along the Glauber semigroup, see for example [BT06] or [CMT15].
Proposition 2.18.
If satisfies (2.10), then a and a hold. Consequently, for any and any we have
[TABLE]
The same is true for with replaced by . This especially holds for product measures with .
Here, choosing or respectively leads to the exponential inequalities
[TABLE]
The first inequality might be considered as a generalization of [Mas00, Lemma 8], which in turn is based on arguments in [Led97, Theorem 1.2]. The second inequality involving is well-known in the case of the discrete cube, cf. [BG99, Corollary 2.4] with a better constant. On the other hand, the proof presented herein is remarkably short and does not rely on some special properties of the measure , but can be derived under (2.10).
Proposition 2.18 implies [BLM03, Theorem 2], as product measures satisfy (2.10) with . Indeed, taking the logarithms on both sides of (2.11) gives for any and
[TABLE]
It remains to choose some fixed and set .
The property (2.10) is satisfied for a large class containing non-product measures. Note that a sufficient condition (due to Jensen’s inequality) for (2.10) is the approximate tensorization property
[TABLE]
Establishing (2.12) is subject to ongoing research, and we especially want to highlight two possible approaches.
The first one is akin to the perturbation argument of Holley and Stroock as outlined in [HS87] (see also [Roy07, Proposition 3.1.18] for a similar reasoning). Assume that , where is a measurable function, is some product measure and . If we require to be bounded, we clearly have for its (maximal) oscillation . Under these assumptions, satisfies (2.12) with .
Furthermore, under weak dependence conditions on the local specifications of some measure on a product space , (2.12) was proven in [Mar13, Mar15, CMT15].
2.5. Bernstein inequality
As a final application, let us demonstrate how to recover the classical Bernstein inequality for independent bounded random variables (up to constants) by means of Theorem 1.1. In fact, as in some previous works, we may remove the boundedness assumption.
There are various extensions of Bernstein’s inequality to unbounded random variables. For instance, [Ada08, Theorem 4] proves deviation inequalities for empirical processes in independent random variables with finite norm for some , which in particular includes concentration inequalities for sums of random variables with finite norm. Moreover, [BLM13, Theorem 2.10] requires a certain control of the moments of the random variables, which is in essence a condition on the norms. Thirdly, [Ver18, Theorem 2.8.1] provides a Bernstein inequality for random variables with bounded norms. However, note that the Gaussian term in the last two mentioned works is a sum of the norm instead of the variance. By our methods, we obtain a version of Bernstein’s inequality for sub-Gaussian random variables with the variance of the sum in the Gaussian term, with a reasonable constant.
Theorem 2.19.
There exists an absolute constant such that the following holds. For any set of independent random variables satisfying , we have for any
[TABLE]
In particular, if almost surely for all and some , then for all it holds
[TABLE]
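As a numerical sanity check of the bounded case, the following sketch compares the exact two-sided tail of a Rademacher sum with the classical textbook form of Bernstein's inequality, 2 exp(-t^2 / (2 (sigma^2 + M t / 3))). This is the standard form with its usual constants, not necessarily the constants of Theorem 2.19; the function names and the choice of Rademacher summands are illustrative assumptions.

```python
import math


def rademacher_two_sided_tail(n: int, t: float) -> float:
    """Exact P(|S_n| >= t), where S_n is a sum of n independent +/-1 signs."""
    # S_n = 2*K - n with K ~ Binomial(n, 1/2), so the tail is a finite sum.
    favourable = sum(math.comb(n, k) for k in range(n + 1) if abs(2 * k - n) >= t)
    return favourable / 2 ** n


def classical_bernstein_bound(t: float, variance: float, bound: float) -> float:
    """Textbook bound 2*exp(-t^2 / (2*(variance + bound*t/3))), capped at 1."""
    return min(1.0, 2.0 * math.exp(-t * t / (2.0 * (variance + bound * t / 3.0))))


if __name__ == "__main__":
    n = 100  # number of summands; Var(S_n) = n and |X_i| <= 1 (assumed setup)
    for t in (10, 20, 40, 80):
        exact = rademacher_two_sided_tail(n, t)
        bound = classical_bernstein_bound(t, variance=n, bound=1.0)
        print(f"t={t:3d}  exact tail={exact:.3e}  Bernstein bound={bound:.3e}")
```

For small t the bound behaves like a Gaussian tail with the variance of the sum, while for large t the linear term M t / 3 dominates; these are the two levels that the remarks below refer to.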
We conclude with three remarks on Theorem 2.19. Firstly, note that it is not possible to prove an inequality
[TABLE]
for some absolute constant in the class of all sub-Gaussian random variables. This can be easily seen in the case and by choosing for . Thus, to obtain a sub-Gaussian tail with the variance parameter, one has to limit the range of for which one can expect sub-Gaussian behaviour.
Secondly, one cannot replace by in (2.13), i.e., there cannot be an inequality of the form
[TABLE]
This, again, follows by choosing for , . In this case, the sum converges (weakly) to a Poisson random variable, whereas the sub-Gaussian range extends to for , giving a contradiction.
Thirdly, it is well known that the norm of the maximum of random variables (bounded by some constant, say ) grows at most logarithmically in the dimension. For example, if we consider i.i.d. random variables with unit variance, we have the sub-Gaussian estimate for of order (at least) .
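The Poisson limit invoked in the second remark can be checked numerically. The sketch below assumes the summands are Bernoulli variables with success probability lam / n (the exact choice is not visible in the text above, so this is an assumption), in which case the sum is Binomial(n, lam / n). It computes the total variation distance to the Poisson(lam) law and compares it with Le Cam's bound lam^2 / n. All function names are illustrative.

```python
import math


def log_binomial_pmf(n: int, p: float, k: int) -> float:
    """log P(Bin(n, p) = k), computed via lgamma to avoid overflow."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def log_poisson_pmf(lam: float, k: int) -> float:
    """log P(Poisson(lam) = k)."""
    return -lam + k * math.log(lam) - math.lgamma(k + 1)


def tv_binomial_poisson(n: int, lam: float) -> float:
    """Total variation distance between Bin(n, lam/n) and Poisson(lam), 0 < lam < n."""
    p = lam / n
    l1_distance = 0.0
    poisson_mass = 0.0
    for k in range(n + 1):
        b = math.exp(log_binomial_pmf(n, p, k))
        q = math.exp(log_poisson_pmf(lam, k))
        l1_distance += abs(b - q)
        poisson_mass += q
    # Poisson mass above n has no binomial counterpart and counts in full.
    l1_distance += max(0.0, 1.0 - poisson_mass)
    return 0.5 * l1_distance


if __name__ == "__main__":
    lam = 2.0  # assumed intensity, for illustration only
    for n in (10, 100, 1000):
        print(n, tv_binomial_poisson(n, lam), lam * lam / n)
```

The distance shrinks as n grows, so the law of the sum does not become Gaussian in this regime, which is consistent with the contradiction described in the remark.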
3. Proofs and auxiliary results
We begin by proving Theorem 1.1. Before we start, let us recall [BG99, Theorem 2.1], relating the exponential moments of to those of .
Theorem 3.1.
Assume that satisfies (1.1) with constant . Then for any and any we have
[TABLE]
Note that formally, Theorem 3.1 and our own results like Theorem 1.1 are valid for bounded functions only, since was defined on a subset of bounded functions. However, it is not hard to see that our proofs can usually be extended to a suitable larger class of functions . One possible approach is first to truncate the random variable under consideration, and then prove bounds which are independent of the truncation level. As this is somewhat situational and depends on the difference operator , we stick to the boundedness assumption for the sake of a clearer presentation of the arguments. Nevertheless, we can prove Theorem 1.1 under the assumption that can be suitably defined for the function at hand, and that for some sub-Gaussian function.
Furthermore, we need an elementary inequality to adjust the constants in concentration or deviation inequalities: for any two constants we have for all and
[TABLE]
whenever the left hand side is smaller than or equal to .
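Since the displayed inequality is not reproduced here, the following sketch checks one standard inequality of this type, stated as an assumption: for constants c1 > c2 > 1 and a rate > 0, c1 exp(-rate r) <= c2 exp(-(log c2 / log c1) rate r) whenever the left hand side is at most 1. The names `c1`, `c2`, `rate` and both function names are hypothetical.

```python
import math


def tail_original(c1: float, rate: float, r: float) -> float:
    """Tail bound c1 * exp(-rate * r) with the larger prefactor c1."""
    return c1 * math.exp(-rate * r)


def tail_adjusted(c1: float, c2: float, rate: float, r: float) -> float:
    """Bound with the smaller prefactor c2 and a proportionally smaller rate."""
    theta = math.log(c2) / math.log(c1)  # lies in (0, 1) when c1 > c2 > 1
    return c2 * math.exp(-theta * rate * r)


def adjustment_is_valid(c1: float, c2: float, rate: float, r: float) -> bool:
    """True if the claimed comparison holds, or if it is vacuous (lhs > 1)."""
    lhs = tail_original(c1, rate, r)
    if lhs > 1.0:
        return True
    # Small relative tolerance: equality occurs exactly when lhs == 1.
    return lhs <= tail_adjusted(c1, c2, rate, r) * (1.0 + 1e-12)


if __name__ == "__main__":
    print(all(adjustment_is_valid(6.0, 2.0, 0.5, 0.1 * k) for k in range(500)))
```

The reason is elementary: taking logarithms, the claim reduces to rate * r >= log c1, which is exactly the condition that the left hand side is at most 1.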
Proof of Theorem 1.1.
Assume that , which can always be achieved by defining a new difference operator . The general inequality follows by straightforward modifications from the case.
Making use of Theorem 3.1 in the first and for any in the second inequality, we obtain for all
[TABLE]
The sub-Gaussian condition (1.3) leads to
[TABLE]
whenever . Consequently, for all we obtain by Markov’s inequality
[TABLE]
Now we distinguish the two cases and . In the first case, set (which implies and thus is in the range) to obtain
[TABLE]
using the monotonicity of . In the second case, we simply set (implying ) and observe that
[TABLE]
Combining (3.3) and (3.4) finishes the proof of (1.4).
Finally, (1.5) follows by considering instead of , which yields
[TABLE]
The constant can be adjusted using (3.1). ∎
Proof of Corollary 1.2.
Using the , by applying Theorem 3.1 to , Markov’s inequality and optimizing it can be shown that for all
[TABLE]
Here, to obtain the factor in the denominator, one has to let in Theorem 3.1. Thus, the corollary follows easily from Theorem 1.1. ∎
Proof of Proposition 1.3.
We assume which can be done by rescaling.
First, observe that [BG99, equation (2.4)] holds for any positive function , since the inequality is sufficient to apply the argument given therein. Thus, for any positive function satisfying it holds for
[TABLE]
So, by applying Theorem 3.1 (with ) we have
[TABLE]
which can also be applied to and instead of and , for . Thus, by Markov’s inequality, for any
[TABLE]
The claim follows by putting and noting that if , we have . ∎
Proof of Proposition 1.5.
Choosing in Theorem 3.1, applying the inequality to and using the monotonicity leads to
[TABLE]
Thus for , by Jensen’s inequality (applied to the concave function ) we have
[TABLE]
Finally, Markov’s inequality and [BLM03, Lemma 11] yield the first inequality.
To see the second inequality, note that for any such that , by Theorem 3.1 and concavity of , it holds
[TABLE]
Finally, applying the estimates from the first part we obtain
[TABLE]
The concentration inequality follows as in the first part. ∎
Proof of Proposition 1.6.
Using and rewriting [GQ03, Theorem 1] we obtain for any
[TABLE]
Now, the inequality and the fact that is an automorphism of leads to the . The follows in the same manner from the inequality . ∎
Proof of Theorem 1.7.
By Proposition 1.6 and Theorem 3.1 we have for any , any and any the inequality
[TABLE]
If is locally Lipschitz with respect to , an easy calculation shows that we can upper bound , so that from the above inequality in combination with we get
[TABLE]
The sub-Gaussian estimate follows by Markov’s inequality and the variance bound from integration by parts. ∎
In order to prove Proposition 1.8, we first need to establish the following lemma:
Lemma 3.2.
Let be a non-negative function such that
- (1) ,
- (2) for all .
Then for all we have
[TABLE]
Especially we have
[TABLE]
In particular, this holds for , where is any set.
Proof of Lemma 3.2.
Rewriting [GQ03, Theorem 1], we have that for any positive function ,
[TABLE]
Using this, we obtain for any
[TABLE]
where . By a Taylor expansion it can easily be seen that for all , so that (recall that by we have , and due to the positive part)
[TABLE]
Chebyshev’s association inequality yields
[TABLE]
In other terms, if we set , we have
[TABLE]
which by the fundamental theorem of calculus implies for all
[TABLE]
So, for any , by Markov’s inequality and setting
[TABLE]
The second part follows by nonnegativity and .
It remains to show that satisfies the two conditions of this lemma. To this end, we first need to show that . Writing , it is well known (see [BLM03]) that we have
[TABLE]
where is the set of all probability measures on . To estimate , one has to compare and . To this end, for any fixed, let be parameters for which the value is attained, and let be a minimizer of . This leads to
[TABLE]
Using this and the non-negativity of , we have
[TABLE]
To show the second property, we proceed similarly to [BLM09, Proof of Lemma 1]. By (3.7) and the Cauchy–Schwarz inequality, we have
[TABLE]
Assuming without loss of generality that , choose such that the value of is attained. It follows that
[TABLE]
which finishes the proof. ∎
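The proof of Lemma 3.2 invokes Chebyshev's association inequality. In its basic form it states that for a real random variable X, a non-decreasing function f and a non-increasing function g, E[f(X) g(X)] <= E[f(X)] E[g(X)]. The sketch below verifies this exactly on a small finite distribution; the distribution and the two test functions are arbitrary illustrative choices, not the ones from the proof.

```python
import math


def expectation(values, weights, func):
    """E[func(X)] for X with P(X = values[i]) = weights[i]."""
    return sum(w * func(x) for x, w in zip(values, weights))


def association_gap(values, weights, f, g):
    """E[f(X) g(X)] - E[f(X)] E[g(X)], i.e. the covariance of f(X) and g(X)."""
    e_fg = expectation(values, weights, lambda x: f(x) * g(x))
    return e_fg - expectation(values, weights, f) * expectation(values, weights, g)


if __name__ == "__main__":
    values = list(range(10))
    raw = [k + 1 for k in values]          # unnormalized weights 1, ..., 10
    weights = [r / sum(raw) for r in raw]  # probabilities summing to 1
    increasing = lambda x: x * x           # non-decreasing on {0, ..., 9}
    decreasing = lambda x: math.exp(-x)    # non-increasing
    # Opposite monotonicity: the gap is non-positive.
    print(association_gap(values, weights, increasing, decreasing))
    # Same monotonicity: the gap is non-negative.
    print(association_gap(values, weights, increasing, increasing))
```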
The proof of Proposition 1.8 is now easily completed:
Proof of Proposition 1.8.
The difference operator satisfies for all positive functions , as well as an . Moreover, as seen in the proof of Lemma 3.2, we have . Thus, by (3.6) it holds for
[TABLE]
Furthermore, Lemma 3.2 shows that
[TABLE]
So, for we have
[TABLE]
∎
Proof of Proposition 1.9.
Again, the proof mimics the proof given for independent random variables in [BLM03]. As stated in Proposition 1.6, the uniform measure on satisfies a with respect to
[TABLE]
Writing , we have as seen in the proof of Lemma 3.2. Hence, by similar arguments as in the proof of Theorem 1.1 we have for any
[TABLE]
implying the sub-Gaussian estimate . Fix a set satisfying . As a implies a Poincaré inequality (see [BT06, Proposition 3.5] or [DS96]), we also have (by Chebyshev’s inequality)
[TABLE]
which evaluated at yields . Thus, for any it holds
[TABLE]
where the last inequality follows from for any and (3.1). For the inequality (3.9) holds trivially. ∎
The proofs of the results for slices of the hypercube work in a very similar way.
Proof of Proposition 1.10.
It follows from [GQ03, Theorem 1] that we have for any
[TABLE]
From here, we may proceed as in the proof of Proposition 1.6. ∎
For the proof of Proposition 1.11, we need to establish the following analogue of Lemma 3.2:
Lemma 3.3.
Let be a non-negative function such that
- (1) ,
- (2) for all .
Then for all we have
[TABLE]
Especially we have
[TABLE]
In particular, this holds for , where is any set.
Proof of Lemma 3.3.
Rewriting [GQ03, Theorem 1], we have that for any positive function ,
[TABLE]
From here, we may mimic the proof of Lemma 3.2.
Last, we need to show that satisfies the two conditions of this lemma. As compared to the proof of Lemma 3.2, some of the constants will change because of the different normalization of the difference operators. However, we may argue similarly and show that . Using this and the non-negativity of yields
[TABLE]
Finally, by arguing as above it is easily seen that . ∎
Proof of Proposition 1.11.
As the difference operator satisfies for all positive functions , as well as an , it remains to change the proof of Proposition 1.8 in view of the different constants appearing in Lemma 3.3. As noted in the proof of Lemma 3.3, we have . Thus, by (3.6) it holds for
[TABLE]
Furthermore, Lemma 3.3 shows that
[TABLE]
So, for we have
[TABLE]
∎
Finally, we present the proofs of Section 2.
Proof of Proposition 2.8.
We show that is weakly -self bounding in the language of [BLM09]. To see this, for any let . Now we have
[TABLE]
Here, the first inequality follows from and the last one is a consequence of Euler’s homogeneous function theorem and the fact that all quantities involved are positive. Consequently, [BLM09, Theorem 1] yields for any
[TABLE]
For the lower bound, apply [BLM09, Theorem 1] to which satisfies for all and and is weakly -self bounding. ∎
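The proof above relies on Euler's homogeneous function theorem: if f is differentiable and d-homogeneous, then the sum over i of x_i times the i-th partial derivative of f at x equals d f(x). The following sketch checks this numerically with central finite differences for two homogeneous polynomials with positive coefficients. The polynomials and all names are illustrative assumptions, not the functions treated in Proposition 2.8.

```python
from itertools import combinations


def pair_sum(x):
    """2-homogeneous polynomial sum_{i<j} x_i * x_j (positive coefficients)."""
    return sum(a * b for a, b in combinations(x, 2))


def cube_sum(x):
    """3-homogeneous polynomial sum_i x_i ** 3."""
    return sum(v ** 3 for v in x)


def euler_sum(f, x, h=1e-5):
    """sum_i x_i * d_i f(x), with partial derivatives from central differences."""
    total = 0.0
    for i, xi in enumerate(x):
        up = list(x)
        up[i] = xi + h
        down = list(x)
        down[i] = xi - h
        total += xi * (f(up) - f(down)) / (2.0 * h)
    return total


if __name__ == "__main__":
    point = [0.3, 1.2, 2.0, 0.7]  # arbitrary point with positive coordinates
    print(euler_sum(pair_sum, point), 2 * pair_sum(point))
    print(euler_sum(cube_sum, point), 3 * cube_sum(point))
```

When all coordinates and all partial derivatives are non-negative, each summand x_i times the i-th partial derivative is non-negative, so any single one of them is bounded by d f(x); this is the kind of estimate used for the self-bounding property.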
Proof of Proposition 2.9.
The first part follows as above. As for the second part, if we choose for some this leads to
[TABLE]
for the Hölder conjugate , which is due to the nonnegativity of the and the dual formulation of the norm in . ∎
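The dual formulation used in this proof says that, for 1 < p < infinity with Hölder conjugate q = p / (p - 1), the l^p norm of x equals the supremum of the inner products of x with vectors y of l^q norm at most 1. For non-negative x the supremum is attained at y_i = x_i^(p-1) / ||x||_p^(p-1). The sketch below checks this for one illustrative vector and compares against randomly drawn competitors; the vector, the exponent and the function names are assumptions made for the example.

```python
import random


def lp_norm(x, p):
    """The l^p norm (sum_i |x_i|^p)^(1/p)."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


def inner(x, y):
    """Euclidean inner product of two vectors of equal length."""
    return sum(a * b for a, b in zip(x, y))


def dual_maximizer(x, p):
    """For non-negative x != 0: the unit l^q vector y with <x, y> = ||x||_p."""
    scale = lp_norm(x, p) ** (p - 1)
    return [v ** (p - 1) / scale for v in x]


if __name__ == "__main__":
    p = 3.0
    q = p / (p - 1.0)  # Hölder conjugate exponent
    x = [0.5, 2.0, 1.0, 0.0, 3.0]
    y = dual_maximizer(x, p)
    print(lp_norm(y, q), inner(x, y), lp_norm(x, p))
    rng = random.Random(0)
    for _ in range(5):
        z = [rng.random() for _ in x]
        z = [v / lp_norm(z, q) for v in z]  # normalize to unit l^q norm
        print(inner(x, z) <= lp_norm(x, p))  # Hölder's inequality
```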
Proof of Proposition 2.10.
Clearly, is -homogeneous and has positive weights in the sense of (2.1), if we set and , . Furthermore, the partial derivatives can be easily bounded: For any fixed there are exactly terms which depend on , and the product is bounded by . Consequently, . Thus, Proposition 2.8 yields for all
[TABLE]
The assertion now follows, if we note that . ∎
Let us now prove the results from Section 2.4. To this end, we first need to establish some basic properties of modified logarithmic Sobolev inequalities with respect to the difference operators we use.
Lemma 3.4.
Let be a probability measure on a product of Polish spaces which satisfies a . Then, also satisfies a .
Proof.
Let be a probability space and a measurable function on it. Then,
[TABLE]
Applying this to and for any yields
[TABLE]
which finishes the proof. ∎
Also note that by monotonicity a implies an , and the same holds for and . Moreover, we recall the duality formula .
Proof of Proposition 2.16.
First, (2.7) follows by applying Theorem 1.1 to and noting that for all . To see that is sub-Gaussian with parameter and , note that by Lemma 3.4, satisfies a , so that we can use (3.5).
The same arguments are valid for and respectively. Here, we additionally use the estimate (cf. [GSS18b, Lemma 3.2]). ∎
Proof of Proposition 2.17.
Let us bound . Choose the matrix maximizing and use the monotonicity of to obtain
[TABLE]
Furthermore, we have for some maximizer of and for
[TABLE]
Here, the suprema of and are taken over the -dimensional sphere. We can now apply Corollary 1.2 to , , and to finish the proof. ∎
Proof of Proposition 2.18.
The idea of the proof of the s is already present in [BG07]. Let be any probability space. For any function we have due to the inequality (for all )
[TABLE]
Applying this to and and using (2.10) yields
[TABLE]
To see that also satisfies a , it remains to apply Lemma 3.4. The exponential inequalities are a consequence of Theorem 3.1. ∎
Proof of Theorem 2.19.
Write . Let us assume that for all , from which the general case follows easily using the inequality
[TABLE]
Since the are independent, it follows from Proposition 2.18 that their joint distribution satisfies a , and we can calculate
[TABLE]
To apply Theorem 1.1, it remains to show that we may set and . This is seen by noting that
[TABLE]
where the last step follows from [KZ18, Lemma 1.4], as is a convex and -Lipschitz function. Note that although [KZ18, Lemma 1.4] is formulated for , one can easily find an estimate for all , by first multiplying the right hand side by , and then adjusting the constant in the exponential. ∎
Recall that as discussed above, the application of Theorem 3.1 is only possible for bounded functions, so that an additional truncation step needs to be done. Instead of applying Theorem 3.1 to , it is applied to the sum of the random variables for for a suitable truncation level . As the right hand side of equation (2.13) can be chosen to be independent of , the theorem follows for unbounded random variables by letting .
References
- [Ada08] Radosław Adamczak “A tail inequality for suprema of unbounded empirical processes with applications to Markov chains” In Electron. J. Probab. 13, 2008, no. 34, pp. 1000–1034. DOI: 10.1214/EJP.v13-521
- [Ada15] Radosław Adamczak “A note on the Hanson-Wright inequality for random vectors with dependencies” In Electron. Commun. Probab. 20, 2015, no. 72, 13 pp. DOI: 10.1214/ECP.v20-3829
- [ABW17] Radosław Adamczak, Witold Bednorz and Paweł Wolff “Moment estimates implied by modified log-Sobolev inequalities” In ESAIM Probab. Stat. 21, 2017, pp. 467–494. DOI: 10.1051/ps/2016030
- [AKPS19] Radosław Adamczak, Michał Kotowski, Bartłomiej Polaczyk and Michał Strzelecki “A note on concentration for polynomials in the Ising model” In Electron. J. Probab. 24, 2019, no. 42, pp. 1–22. DOI: 10.1214/19-EJP280
- [BCG17] Sergey G. Bobkov, Gennadiy P. Chistyakov and Friedrich Götze “Second-order concentration on the sphere” In Commun. Contemp. Math. 19.5, 2017. DOI: 10.1142/S0219199716500589
- [BG99] Sergey G. Bobkov and Friedrich Götze “Exponential integrability and transportation cost related to logarithmic Sobolev inequalities” In J. Funct. Anal. 163.1, 1999, pp. 1–28. DOI: 10.1006/jfan.1998.3326
- [BG07] Sergey G. Bobkov and Friedrich Götze “Concentration inequalities and limit theorems for randomized sums” In Probab. Theory Related Fields 137.1-2, 2007, pp. 49–81. DOI: 10.1007/s00440-006-0500-9
- [BG10] Sergey G. Bobkov and Friedrich Götze “Concentration of empirical distribution functions with applications to non-i.i.d. models” In Bernoulli 16.4, 2010, pp. 1385–1414. DOI: 10.3150/10-BEJ254
