A Bernstein-type inequality for functions of bounded interaction
Andreas Maurer

TL;DR
This paper introduces a Bernstein-type concentration inequality for functions of independent variables with bounded interaction, extending classical bounds to more complex functions and improving results for U-statistics and regularized least squares.
Contribution
It provides a new distribution-dependent concentration inequality that generalizes Bernstein's inequality to functions with limited interaction among variables.
Findings
Sharper bounds for U-statistics
Improved generalization error estimates for regularized least squares
Extension of Bernstein's inequality to complex functions
Abstract
We give a distribution-dependent concentration inequality for functions of independent variables. The result extends Bernstein's inequality from sums to more general functions, whose variation in any argument does not depend too much on the other arguments. Applications sharpen existing bounds for U-statistics and the generalization error of regularized least squares.
Adalbertstr. 55, D-80799 Munich, Germany
am”at”andreas-maurer.eu
1 Introduction
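If $X_1,\dots,X_n$ are independent real random variables, with $X_i-E[X_i]\le b$ almost surely, and $f(x)=\sum_{i=1}^n x_i$, then Bernstein's inequality [2] asserts that for $t>0$
$$\Pr\{f(X)-E[f(X)]>t\}\le\exp\left(\frac{-t^2}{2\sum_{i=1}^n\sigma_i^2+2bt/3}\right),$$
where $\sigma_i^2$ is the respective variance of $X_i$. In this work we extend Bernstein's inequality to more general functions $f$.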
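As a numerical illustration of the classical statement, the following sketch evaluates the one-sided Bernstein bound in its standard form $\exp(-t^2/(2\sum_i\sigma_i^2+2bt/3))$ and compares it with a simulated tail. The uniform summands, the sample size and the helper names `bernstein_bound` and `empirical_tail` are assumptions chosen for illustration only; they do not come from the paper.

```python
import math
import random

def bernstein_bound(t, variances, b):
    """One-sided Bernstein bound exp(-t^2 / (2*sum(variances) + 2*b*t/3))."""
    total_variance = sum(variances)
    return math.exp(-t * t / (2.0 * total_variance + 2.0 * b * t / 3.0))

def empirical_tail(t, n, trials, rng):
    """Monte Carlo estimate of Pr{sum_i (X_i - E[X_i]) > t}, X_i ~ Uniform[0, 1]."""
    hits = 0
    for _ in range(trials):
        centered_sum = sum(rng.random() - 0.5 for _ in range(n))
        if centered_sum > t:
            hits += 1
    return hits / trials

rng = random.Random(0)          # fixed seed so the run is reproducible
n, t = 50, 4.0
variances = [1.0 / 12.0] * n    # variance of Uniform[0, 1]
b = 0.5                         # X_i - E[X_i] <= 1/2 almost surely
bound = bernstein_bound(t, variances, b)
estimate = empirical_tail(t, n, trials=20000, rng=rng)
print(f"Bernstein bound: {bound:.4f}, simulated tail: {estimate:.4f}")
```

The simulated tail should sit well below the bound, since the bound holds for every distribution with the same variances and the same upper bound on the centered summands.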
This extension requires two modifications. First, the variance is replaced by the Efron-Stein upper bound, or jackknife estimate, of the variance. Second, a correction term is added to the coefficient of $t$ in the denominator of the exponent. This correction term, which we call the interaction functional of $f$, vanishes for sums and represents the extent to which the variation of $f$ in any given argument depends on other arguments.
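To proceed we introduce some notation and conventions. Let $\Omega=\prod_{k=1}^n\Omega_k$ be some product of measurable spaces and let $\mathcal{A}$ be the algebra of all bounded, measurable real valued functions on $\Omega$. For fixed $k\in\{1,\dots,n\}$ and $y,y'\in\Omega_k$ define the substitution operator $S_y^k$ and the difference operator $D_{y,y'}^k$ on $\mathcal{A}$ by
$$(S_y^kf)(x)=f(x_1,\dots,x_{k-1},y,x_{k+1},\dots,x_n)$$
and $D_{y,y'}^k=S_y^k-S_{y'}^k$. Both $S_y^kf$ and $D_{y,y'}^kf$ are independent of $x_k$.
Let a probability measure $\mu_k$ be given on each $\Omega_k$ and let $\mu=\prod_{k=1}^n\mu_k$ be the product measure on $\Omega$. For $f\in\mathcal{A}$ the expectation and variance are defined as $E[f]=\int_\Omega f\,d\mu$ and $\sigma^2(f)=E[(f-E[f])^2]$. For $k\in\{1,\dots,n\}$ the conditional expectation $E_k$ and the conditional variance $\sigma_k^2$ are operators on $\mathcal{A}$, which act on a function $f\in\mathcal{A}$ as
$$E_kf=\int_{\Omega_k}S_y^kf\,d\mu_k(y)\quad\text{and}\quad\sigma_k^2(f)=\frac{1}{2}\int_{\Omega_k^2}\left(D_{y,y'}^kf\right)^2d\mu_k^2(y,y'),$$
where $\mu_k^2$ is the product measure $\mu_k\times\mu_k$ on $\Omega_k^2$. The sum of conditional variances (SCV) operator $\Sigma^2$ is defined as
$$\Sigma^2(f)=\sum_{k=1}^n\sigma_k^2(f).$$
This operator appears in the Efron-Stein inequality ([7],[15], see also Section 2.4) as
$$\sigma^2(f)\le E[\Sigma^2(f)],$$
which becomes an equality if $f$ is a sum of real valued functions on the $\Omega_k$. It also appears in the following exponential tail bound (see McDiarmid [11], Theorem 3.8, or [14], Theorem 11).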
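The Efron-Stein inequality and its equality case for sums can be checked exactly on a small finite product space. The sketch below is only an illustration: the three finite coordinate spaces, their weights, the two test functions and all helper names are assumptions made here, and the conditional variance is computed in the usual way by resampling one coordinate while the others stay fixed.

```python
import itertools

def expectation(f, supports, probs):
    """Exact E[f] under the product measure on a finite product space."""
    total = 0.0
    for idx in itertools.product(*(range(len(s)) for s in supports)):
        weight = 1.0
        for k, i in enumerate(idx):
            weight *= probs[k][i]
        x = tuple(supports[k][i] for k, i in enumerate(idx))
        total += weight * f(x)
    return total

def variance(f, supports, probs):
    """Exact variance of f under the product measure."""
    mean = expectation(f, supports, probs)
    return expectation(lambda x: (f(x) - mean) ** 2, supports, probs)

def sum_conditional_variances(f, supports, probs):
    """Return the function x -> sum over k of the conditional variance of f at x."""
    def scv(x):
        total = 0.0
        for k in range(len(supports)):
            # Values of f when only coordinate k is redrawn from its marginal.
            vals = [f(x[:k] + (y,) + x[k + 1:]) for y in supports[k]]
            mean_k = sum(p * v for p, v in zip(probs[k], vals))
            total += sum(p * (v - mean_k) ** 2 for p, v in zip(probs[k], vals))
        return total
    return scv

supports = [(0.0, 1.0), (0.0, 1.0, 2.0), (-1.0, 1.0)]
probs = [(0.3, 0.7), (0.2, 0.5, 0.3), (0.5, 0.5)]

def additive(x):
    return x[0] + 2.0 * x[1] - x[2]

def interacting(x):
    return x[0] * x[1] + x[1] * x[2]

for name, f in [("sum", additive), ("interacting", interacting)]:
    var_f = variance(f, supports, probs)
    mean_scv = expectation(sum_conditional_variances(f, supports, probs), supports, probs)
    print(f"{name}: variance = {var_f:.6f}, expected SCV = {mean_scv:.6f}")
```

For the additive function the two numbers agree up to rounding, while for the interacting function the expected SCV is an upper bound on the variance.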
Theorem 1
Suppose that $f\in\mathcal{A}$ satisfies $f-E_kf\le b$ for all $k\in\{1,\dots,n\}$. Then for all $t>0$
$$\Pr\{f-E[f]>t\}\le\exp\left(\frac{-t^2}{2\sup_{x\in\Omega}\Sigma^2(f)(x)+2bt/3}\right).$$
This inequality reduces to Bernstein's inequality if $f$ is a sum, but it suffers from the worst-case choice of the configuration $x$, for which the SCV $\Sigma^2(f)$ is evaluated. The supremum in $x$ is a hindrance to estimation of the variance term, and we would like to replace it by an expectation, just as in the Efron-Stein inequality.
This replacement is trivially possible when $f$ is a sum, because then $\Sigma^2(f)$ is constant. It turns out that it is also possible if $\Sigma^2(f)$ has the right properties of concentration about its mean, a surrogate of being constant, so to speak. To ensure this we control the interaction between the different arguments of $f$, in the sense that the variation in any argument must not depend too much on the other arguments.
Definition 2
The interaction functional $J$ is defined by
[TABLE]
The distribution-dependent interaction functional $J_\mu$ is defined by
[TABLE]
These quantities are related and bounded using the inequalities
[TABLE]
(see the end of section 2.3). For our applications below the last, simplest and crudest bound appears to be sufficient. The above functionals and bounds vanish for sums and are positive homogeneous of degree one. The following is our main result.
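The interaction functionals are built from second differences, which explains why they vanish for sums. The sketch below illustrates this with mixed second differences. It assumes that the substitution operator replaces one coordinate by a given value and that the difference operator is the difference of two substitutions; the helper names and the two test functions are hypothetical.

```python
def substitute(f, k, y):
    """Substitution operator: x -> f(x with coordinate k replaced by y)."""
    return lambda x: f(x[:k] + (y,) + x[k + 1:])

def difference(f, k, y, y_prime):
    """Difference operator: substitution of y minus substitution of y_prime."""
    f_y, f_yp = substitute(f, k, y), substitute(f, k, y_prime)
    return lambda x: f_y(x) - f_yp(x)

def mixed_second_difference(f, l, z, z_prime, k, y, y_prime):
    """Apply the difference operator in coordinate k, then in coordinate l != k."""
    return difference(difference(f, k, y, y_prime), l, z, z_prime)

def additive(x):
    return x[0] + x[1] + x[2]

def product(x):
    return x[0] * x[1] * x[2]

x = (0.5, 0.5, 0.5)
d_sum = mixed_second_difference(additive, 0, 1.0, 0.0, 1, 1.0, 0.0)(x)
d_prod = mixed_second_difference(product, 0, 1.0, 0.0, 1, 1.0, 0.0)(x)
scaled = mixed_second_difference(lambda u: 3.0 * product(u), 0, 1.0, 0.0, 1, 1.0, 0.0)(x)
print(d_sum, d_prod, scaled)
```

For the additive function the first difference no longer depends on the other coordinates, so the second difference is zero. For the product it is not, and multiplying the function by a constant multiplies the second difference by the same constant, in line with homogeneity of degree one.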
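Theorem 3
Suppose $f\in\mathcal{A}$ satisfies $f-E_kf\le b$ for all $k\in\{1,\dots,n\}$. Then for all $t>0$
$$\Pr\{f-E[f]>t\}\le\exp\left(\frac{-t^2}{2E[\Sigma^2(f)]+\left(2b/3+J_\mu(f)\right)t}\right).$$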
Remarks:
- If this is applied to sums of independent random variables (real valued functions defined on the $\Omega_k$), we recover Bernstein's inequality.
- Consider the case that , and a sequence of functions , such that (for example if is bounded) and such that the limit exists. Applying Theorem 3 to the sequence , and letting , we obtain the tail of a normal distribution with variance . In some cases, like U-statistics, this is known to be the correct limiting distribution (Hoeffding [8], Theorem 7.1).
- Although the distribution-dependent functional $J_\mu$ is potentially much smaller than $J$, in the applications considered so far it seems sufficient to consider $J$ or the above bounds thereof.
- Since $E[\Sigma^2(f)]\le\sup_{x\in\Omega}\Sigma^2(f)(x)$, the variance term above can never be larger than the variance term in Theorem 1, which in turn can never be larger than what we get from the bounded difference inequality (McDiarmid [11], Theorem 3.7, or Boucheron et al. [5], Theorem 6.5).
- If also $E_kf-f\le b$ for all $k$, then the result can be applied to $-f$ so as to obtain a two-sided inequality.
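The effect of the correction term can be explored numerically. The sketch below assumes that the bound has the shape described in the introduction, namely Bernstein's exponent with the variance replaced by the expected sum of conditional variances and the interaction term added to the coefficient of $t$, that is $\exp(-t^2/(2E[\Sigma^2(f)]+(2b/3+J)t))$. The function names and the numerical values are hypothetical.

```python
import math

def bernstein_type_bound(t, expected_scv, b, interaction):
    """Assumed form: exp(-t^2 / (2*E[SCV] + (2*b/3 + interaction) * t))."""
    denominator = 2.0 * expected_scv + (2.0 * b / 3.0 + interaction) * t
    return math.exp(-t * t / denominator)

def classical_bernstein(t, total_variance, b):
    """Classical one-sided Bernstein bound for sums."""
    return math.exp(-t * t / (2.0 * total_variance + 2.0 * b * t / 3.0))

t, variance_term, b = 3.0, 4.0, 1.0
for interaction in (0.0, 0.5, 2.0):
    value = bernstein_type_bound(t, variance_term, b, interaction)
    print(f"interaction = {interaction}: bound = {value:.4f}")
print(f"classical Bernstein: {classical_bernstein(t, variance_term, b):.4f}")
```

With a vanishing interaction term the two functions coincide, which mirrors the first remark. A larger interaction term weakens the bound mainly for large deviations, because it multiplies $t$ in the denominator.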
In Theorem 2.1 of [9] Christian Houdré bounds the bias in the Efron-Stein inequality in terms of iterated jackknife estimates of variance, which correspond to the expectations of higher order differences. The second of these iterates can be bounded in terms of the interaction functional and allows us to put the variance back into the inequality of Theorem 3.
Proposition 4
[TABLE]
See Section 2.4 for the proof. In combination with Theorem 3 we obtain the following corollary.
Corollary 5
Suppose and for all . Then for all
[TABLE]
We apply Theorem 3 in two seemingly very different situations.
For U-statistics with bounded, symmetric kernels it is surprisingly easy to bound the interaction functional, and an application of Theorem 3 leads to the following concentration result.
Theorem 6
If is a probability measure on and on and is a measurable, symmetric (permutation invariant) kernel with , and is defined by
[TABLE]
then for
[TABLE]
A similar bound given by Arcones ([1], Theorem 2) is
[TABLE]
For large , or deviation the bound in Theorem 6 is the smaller one of the two. Already for order it gives an improvement if . For order the crossover is already at , for order at .
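For readers who want to experiment with the statistic itself, here is a minimal sketch of a U-statistic of order $m$: the average of a symmetric kernel over all $m$-element subsets of the sample. The kernel $(x-y)^2/2$ and the sample values are assumptions chosen for illustration. With this kernel the U-statistic of order 2 is the unbiased sample variance, which gives a simple consistency check.

```python
import itertools
import math
import statistics

def u_statistic(sample, kernel, order):
    """Average of a symmetric kernel over all subsets of size `order`."""
    n = len(sample)
    total = sum(
        kernel(*(sample[i] for i in idx))
        for idx in itertools.combinations(range(n), order)
    )
    return total / math.comb(n, order)

def variance_kernel(x, y):
    """Symmetric kernel whose order-2 U-statistic is the sample variance."""
    return 0.5 * (x - y) ** 2

sample = [0.1, 0.4, 0.35, 0.8, 0.55, 0.2]   # values in [0, 1], so the kernel is bounded
u_value = u_statistic(sample, variance_kernel, order=2)
print(u_value, statistics.variance(sample))
```

The brute-force enumeration costs on the order of $\binom{n}{m}$ kernel evaluations, so it is meant for small examples only.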
In a completely different context Theorem 3 can be applied to sharpen a stability based generalization bound for regularized least squares (RLS).
Let be the unit ball in a separable, real Hilbert space, and let . Fix . For regularized least squares returns the vector
[TABLE]
Let be a vector of independent random variables with values in , where is identically distributed to . We can apply Theorem 3 to obtain tail bounds for the random variable , where the "true error" and the "empirical error" are defined on by
[TABLE]
We can prove the following result.
Theorem 7
There is an absolute constant such that for every
[TABLE]
Solving for with a fixed bound on the probability we obtain that with probability at least in
[TABLE]
It can be shown ([6]) that the expectation is of order , so for large sample sizes the generalization error is dominated by the variance term, which may be considerably smaller than the distribution-independent bound obtained from the bounded difference inequality as in [6] (it can never be larger because of Remark 4 above). Using techniques as in [13] this term can in principle be estimated from a sample and the estimate combined with the above to a purely data-dependent bound.
A major drawback here is the dependence on in the last term, because in practical applications the regularization parameter typically decreases with . The is likely due to a very crude method of bounding by differentiation. A more intelligent method might give .
It seems plausible that similar bounds exist for Tikhonov regularization with other, more general loss functions having appropriate properties.
The idea of using second differences (as in the definition of $J$) has been put to work by Houdré [9] to estimate the bias in the Efron-Stein inequality. The entropy method, which underlies our proof of Theorem 3, has been developed by a number of authors, notably Ledoux [10] and Boucheron, Lugosi and Massart [3]. The latter work also introduces the key idea of combining it with the decoupling method used below. Our proof follows a thermodynamic formulation of the entropy method as laid out in [14].
The next section gives a proof of Theorem 3. Then follow the applications to U-statistics and ridge regression.
2 Proof of Theorem 3
The proof of our main result, Theorem 3, uses the entropy method ([10], [3],[5]), from which the next section collects a set of tools. These results are taken from [14], which gives more detailed proofs and additional motivation. For the benefit of the reader, and to make the paper more self-contained, corresponding proofs are also given in a technical appendix.
2.1 Definitions and tools
and are as in the introduction, is the subalgebra of of those bounded, measurable functions on which are independent of the -th coordinate. For and define the expectation functional on by
[TABLE]
where . The entropy of at is given by
[TABLE]
where is the Kullback-Leibler divergence.
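The definitions above can be made concrete on a finite probability space. The sketch below assumes the usual thermodynamic conventions of the entropy method: the expectation functional at inverse temperature $\beta$ is $E_{\beta f}[g]=E[ge^{\beta f}]/E[e^{\beta f}]$, and the entropy is $\beta E_{\beta f}[f]-\ln E[e^{\beta f}]$. These formulas, the weights and the function names are assumptions made here; the check confirms that the entropy so defined equals the Kullback-Leibler divergence of the tilted measure from the base measure.

```python
import math

def thermal_expectation(g, f, mu, beta):
    """E[g * exp(beta f)] / E[exp(beta f)] on a finite probability space."""
    weights = [p * math.exp(beta * fx) for p, fx in zip(mu, f)]
    partition = sum(weights)
    return sum(w * gx for w, gx in zip(weights, g)) / partition

def entropy(f, mu, beta):
    """beta * E_{beta f}[f] - ln E[exp(beta f)]."""
    log_partition = math.log(sum(p * math.exp(beta * fx) for p, fx in zip(mu, f)))
    return beta * thermal_expectation(f, f, mu, beta) - log_partition

def kl_divergence(nu, mu):
    """Kullback-Leibler divergence KL(nu || mu) for finite distributions."""
    return sum(q * math.log(q / p) for q, p in zip(nu, mu) if q > 0.0)

mu = [0.2, 0.5, 0.3]      # base measure on a three-point space
f = [-1.0, 0.5, 2.0]      # values of f at the three points
beta = 0.7
partition = sum(p * math.exp(beta * fx) for p, fx in zip(mu, f))
tilted = [p * math.exp(beta * fx) / partition for p, fx in zip(mu, f)]
print(entropy(f, mu, beta), kl_divergence(tilted, mu))
```

At $\beta=0$ the tilted measure equals the base measure and the entropy vanishes.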
Lemma 8
(Theorem 1 in [14]) For any and we have
[TABLE]
and, for ,
[TABLE]
Define the real function by .
Lemma 9
(Lemma 10 in [14]) Let satisfy for all . Then for
[TABLE]
Bounding and using Lemma 8 quickly leads to a proof of Theorem 1. For Theorem 3 we need more tools.
Definition 10
The operator is defined by
[TABLE]
To clarify: is the member of defined by . It does not depend on , so .
Lemma 11
(Lemma 15 in [14], also Proposition 5 in [12]) We have, for , that
[TABLE]
We use this to derive the following property of weakly self-bounded functions, which, together with Proposition 17 below, gives the concentration property of alluded to in the introduction.
Lemma 12
Suppose that
[TABLE]
Then for
[TABLE]
Proof. Using Lemma 8 and Lemma 11 and the weak self-boundedness assumption (2) we have for that
[TABLE]
where the last identity follows from the fact that . Thus
[TABLE]
and rearranging this inequality for establishes the claim.
We also use the following decoupling technique: If and are two probability measures and is absolutely continuous w.r.t. then it is easy to show that
[TABLE]
Applying this inequality when is the measure we obtain the following
Lemma 13
We have for any that
[TABLE]
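The decoupling step rests on a variational inequality between two measures. The sketch below assumes it is the standard bound $E_\nu[g]\le KL(\nu,\mu)+\ln E_\mu[e^g]$ and tests it on randomly drawn finite distributions. The dimension, the seed and the helper names are hypothetical.

```python
import math
import random

def kl_divergence(nu, mu):
    """KL(nu || mu) for finite distributions with mu > 0 everywhere."""
    return sum(q * math.log(q / p) for q, p in zip(nu, mu) if q > 0.0)

def decoupling_sides(nu, mu, g):
    """Return (E_nu[g], KL(nu || mu) + ln E_mu[exp(g)])."""
    lhs = sum(q * gx for q, gx in zip(nu, g))
    log_moment = math.log(sum(p * math.exp(gx) for p, gx in zip(mu, g)))
    return lhs, kl_divergence(nu, mu) + log_moment

def random_distribution(size, rng):
    """Random strictly positive probability vector."""
    weights = [rng.random() + 1e-3 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(1)
results = []
for _ in range(200):
    mu = random_distribution(5, rng)
    nu = random_distribution(5, rng)
    g = [rng.uniform(-3.0, 3.0) for _ in range(5)]
    results.append(decoupling_sides(nu, mu, g))
print(all(lhs <= rhs + 1e-12 for lhs, rhs in results))
```

Equality is attained when $\nu$ is the measure with density proportional to $e^g$ with respect to $\mu$, which is why the bound is useful with tilted measures.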
2.2 A concentration inequality
We now use the tools of the previous section to prove an intermediate concentration inequality (Proposition 16) in the case that satisfies the self-bounding hypothesis of Lemma 12. In the next section we show that this condition is satisfied if is taken equal to the interaction functional , and together the two results then give Theorem 3.
We need two more auxiliary results. Recall the definition of the function .
Lemma 14
For any and we have
(i) and
(ii)
[TABLE]
Proof. If and then . In this case we have the two convergent power series representations
[TABLE]
Now by inspection and for
[TABLE]
so that for all non-negative . Term by term comparison of the two power series gives
[TABLE]
which is (ii) in the case that .
It also gives us for general that
[TABLE]
since . This proves (i).
(ii) is equivalent to
[TABLE]
To complete the proof it suffices by (5) to show that the right hand side above is, for fixed a non-decreasing function of . Let , and , so the expression in question becomes . Calculus gives
[TABLE]
But by assumption. Also by (i) and, using (6),
[TABLE]
The expression is therefore non-decreasing in .
Finally, we need an optimization lemma.
Lemma 15
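Let $C$ and $b$ denote two positive real numbers, $t>0$. Then
$$\inf_{\beta\in[0,1/b)}\left(-\beta t+\frac{C\beta^2}{1-b\beta}\right)\le\frac{-t^2}{2(2C+bt)}.$$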
The proof of this lemma can be found in [12] (Lemma 12).
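A quick numerical check makes the optimization step tangible. The sketch assumes the lemma has the form commonly used with the entropy method, $\inf_{\beta\in[0,1/b)}(-\beta t+C\beta^2/(1-b\beta))\le-t^2/(2(2C+bt))$, and evaluates the objective at the candidate $\beta=t/(2C+bt)$. Both the assumed statement and the choice of candidate are illustrative and are not taken from [12].

```python
def objective(beta, t, c, b):
    """-beta*t + c*beta^2 / (1 - b*beta), defined for 0 <= beta < 1/b."""
    return -beta * t + c * beta * beta / (1.0 - b * beta)

def closed_form(t, c, b):
    """Right-hand side of the assumed lemma: -t^2 / (2 * (2c + b t))."""
    return -t * t / (2.0 * (2.0 * c + b * t))

def witness_beta(t, c, b):
    """Candidate beta = t / (2c + b t), which always lies in [0, 1/b)."""
    return t / (2.0 * c + b * t)

for t, c, b in [(1.0, 1.0, 1.0), (5.0, 0.3, 2.0), (0.1, 10.0, 0.5)]:
    beta = witness_beta(t, c, b)
    print(t, c, b, objective(beta, t, c, b), closed_form(t, c, b))
```

At this candidate the objective equals the closed form, so the infimum over the interval can only be smaller.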
Proposition 16
Suppose that is such that , , and that
[TABLE]
with . Then for all
[TABLE]
Proof. By a simple limiting argument we may assume that . Now let . By Lemma 14 (i) and also . By Lemma 9
[TABLE]
where the second inequality follows from Lemma 13. Subtracting , multiplying by and using Lemma 12 together with the assumed self-boundedness of gives us
[TABLE]
which holds, since . Since we can divide by to rearrange and then use the definition of to obtain
[TABLE]
By Lemma 14 (ii) for
[TABLE]
and from Lemma 8
[TABLE]
where we used Lemma 15 in the last step.
2.3 Self-boundedness of the sum of conditional variances
We record some obvious, but potentially confusing properties of the substitution operator. For and the operator is a homomorphism of and the identity on . If it commutes with and with . Most importantly
[TABLE]
Note however that for we get and and , because , and map to .
Proposition 17
We have for any .
Proof. Fix . Below all members of are understood as evaluated on . For let be a minimizer in of (existence is assumed for simplicity, an approximate minimizer would also work), so that
[TABLE]
where we used the fact that , because . Then
[TABLE]
This step gave us a sum over , which is important, because it allows us to use the commutativity properties mentioned above. Then, using , we get
[TABLE]
by an application of Cauchy-Schwarz. Now, using , we can bound the last sum independent of by
[TABLE]
so that
[TABLE]
Theorem 3 for the case is obtained by substituting for in Proposition 16. The general case follows from rescaling and the homogeneity properties of and .
Of the inequalities in (1) only the first one is not completely obvious:
[TABLE]
In the last inequality we used the fact that the variance of a random variable is bounded by a quarter of the square of its range, so that for all .
2.4 The Bias in the Efron-Stein inequality
Since the published work of Houdré [9] assumes symmetric functions and iid data, we give an independent derivation.
Let be independent variables with distributed as in , and let be independent copies thereof. Denote and and
[TABLE]
We also write for but with the variable removed.
Let satisfy . Then, writing as a telescopic series, we get
[TABLE]
where the last identity is obtained by exchanging and . This gives the nice variance formula
[TABLE]
apparently due to Chatterjee. The Cauchy-Schwarz inequality then gives the Efron-Stein inequality
[TABLE]
Now we look at the bias in this inequality.
Theorem 18
With the above conventions we have
[TABLE]
The proof uses Chatterjee's formula (8) twice. First we establish a lemma, which itself already uses the Efron-Stein inequality.
Lemma 19
[TABLE]
Together with the Efron-Stein inequality (9) this gives the attractive chain of inequalities
[TABLE]
Proof of Lemma 19. By induction on . Recall the total variance formula
[TABLE]
With this gives the case . For we get
[TABLE]
where we used the Efron-Stein inequality (9). This is where independence comes in and gives us the case . Suppose now that the lemma holds for . Then
[TABLE]
where the first inequality follows from the induction hypothesis, and the second inequality follows from applying the case to the two random variables and .
Now we tackle the bias in the Efron-Stein inequality. The strategy is to first use Chatterjee's variance formula on each individual term on the right hand side of (9) and then sum the results.
The only difficulty here is notational because we now need more shadow variables. We deal with this problem by augmenting the vectors and to become dimensional.
Proof of Theorem 18. First fix an index and observe that depends on independent variables. We introduce variables which is iid to and an independent copy thereof, and consider correspondingly augmented vectors and with independent components. We also introduce functions defined by
[TABLE]
and . Then . Now we use Chatterjee’s formula (8) with replaced by and replaced by . We obtain
[TABLE]
Since does not depend on we have
[TABLE]
The last identity follows from the definition of the function . Since does not depend on we have
[TABLE]
Substituting these identities in (10), dividing by and summing over gives
[TABLE]
In the inequality we bounded the first term with Cauchy-Schwarz. The second term is equal to by Chatterjee’s formula (8), and the last term is bounded by using Lemma 19.
Proposition 4 is an immediate consequence of Theorem 18.
3 Application to U-statistics
In this section we prove Theorem 6, which simplifies with some notation. If is a set and , then denotes the set of all those subsets of which have cardinality . Also, if and , we use to denote the vector , where and the are increasingly ordered. For we use and to denote respectively the vectors and . With this notation
[TABLE]
We also need a combinatorial lemma.
Lemma 20
For
[TABLE]
Proof. Clearly
[TABLE]
Now
[TABLE]
Then we rewrite the numerator using
[TABLE]
to get
[TABLE]
Proof of Theorem 6. With reference to any given , and using the symmetry of ,
[TABLE]
This gives
[TABLE]
because takes values in an interval of diameter . This allows us to apply Theorem 3 with .
Next we bound the interaction functional . For , and and we get
[TABLE]
so that
[TABLE]
Theorem 3 then gives us
[TABLE]
To bound we will write as a sum of two sums, where the first sum is over disjoint pairs , and the second sum is over intersecting pairs. If and are disjoint, then, since all the are equal to ,
[TABLE]
On the other hand we can use Lemma 20 to bound the number of intersecting pairs and obtain
[TABLE]
Summing over , dividing by and inserting in (11) gives us
[TABLE]
Converting to a two sided bound gives the result.
Instead of Theorem 3 to obtain (11) we could have used Corollary 5 and appealed to known results about (as in [8]).
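The split into disjoint and intersecting pairs used in the proof can be checked by brute force for small parameters. The sketch below assumes the pairs are ordered pairs of $m$-element subsets of $\{1,\dots,n\}$; in that case the number of disjoint pairs is $\binom{n}{m}\binom{n-m}{m}$ and the remaining pairs intersect. The function name and the tested ranges are hypothetical, and the bound of Lemma 20 itself is not reproduced here.

```python
import itertools
import math

def count_pairs(n, m):
    """Count ordered pairs of m-subsets of range(n): (disjoint, intersecting)."""
    subsets = [frozenset(s) for s in itertools.combinations(range(n), m)]
    disjoint = sum(1 for a in subsets for b in subsets if not (a & b))
    return disjoint, len(subsets) ** 2 - disjoint

for n in range(4, 8):
    for m in range(1, n // 2 + 1):
        disjoint, intersecting = count_pairs(n, m)
        formula = math.comb(n, m) * math.comb(n - m, m)
        print(n, m, disjoint, intersecting, formula)
```

For fixed $m$ the share of intersecting pairs shrinks as $n$ grows, which is why the contribution of the second sum in the proof is of lower order.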
4 Application to ridge regression
In this section we prove Theorem 7. The key to the application of Theorem 3 is the following Lemma ( denoting the cone of nonnegative definite operators in ).
Lemma 21
Let and be both twice continuously differentiable, satisfying the conditions , , , , and for real numbers and . For define a function by
[TABLE]
Then is twice differentiable and
[TABLE]
Proof. A standard argument shows that (we use for the operator norm and for vectors in , depending on context) and that
[TABLE]
so
[TABLE]
Then
[TABLE]
This gives (13). Also, using the fact that the mixed partials vanish by assumption,
[TABLE]
which gives (14).
Proof of Theorem 7. It is well known and easily verified that is well defined and explicitly given by the formula
[TABLE]
where the positive semidefinite operator and the vector are given by
[TABLE]
Also we have
[TABLE]
from which we retain that and .
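The explicit solution of regularized least squares is easy to verify in finite dimensions. The sketch below assumes the objective $\frac{1}{n}\sum_i(\langle w,x_i\rangle-y_i)^2+\lambda\|w\|^2$, whose minimizer solves $(T+\lambda I)w=v$ with $T=\frac{1}{n}\sum_ix_ix_i^\top$ and $v=\frac{1}{n}\sum_iy_ix_i$. The objective, the two-dimensional data and the helper names are assumptions made for illustration; the check is that the gradient of the objective vanishes at the computed solution.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ridge(xs, ys, lam):
    """Minimizer of (1/n) sum (<w, x_i> - y_i)^2 + lam * |w|^2 (assumed objective)."""
    n, d = len(xs), len(xs[0])
    t = [[sum(x[i] * x[j] for x in xs) / n + (lam if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    v = [sum(y * x[i] for x, y in zip(xs, ys)) / n for i in range(d)]
    return solve(t, v)

def gradient(w, xs, ys, lam):
    """Gradient of the assumed objective at w."""
    n, d = len(xs), len(w)
    grad = [2.0 * lam * wi for wi in w]
    for x, y in zip(xs, ys):
        residual = sum(wi * xi for wi, xi in zip(w, x)) - y
        for i in range(d):
            grad[i] += 2.0 * residual * x[i] / n
    return grad

xs = [(0.6, 0.2), (-0.3, 0.5), (0.1, -0.7), (0.4, 0.4), (-0.5, -0.1)]  # points in the unit ball
ys = [0.5, -0.2, 0.3, 0.9, -0.6]
w = ridge(xs, ys, lam=0.1)
print(w, gradient(w, xs, ys, lam=0.1))
```

Because the objective is strictly convex for positive $\lambda$, a vanishing gradient identifies the unique minimizer.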
Now consider any sample and fix two indices with , and and . For we consider the behavior of ridge regression on the doubly modified sample ( is a convex subset of ). We write
[TABLE]
Then
[TABLE]
because and . Thus and similarly . Since it is clear that . Also
[TABLE]
similarly and again . We can then apply Lemma 21 and obtain
[TABLE]
where we used .
Now we define
[TABLE]
For the expected error we get
[TABLE]
and
[TABLE]
By a similar, somewhat more tedious, analysis there are absolute constants and , such that
[TABLE]
Now let . Then
[TABLE]
In particular . Also
[TABLE]
Substitution in the formula gives . Thus, from Theorem 3,
[TABLE]
5 Appendix: Proofs of the results in section 2.1
Throughout this appendix we adhere to the notation and definitions of section 2.1.
Proof of Lemma 8. Let . By l’Hospital’s rule we have . Furthermore
[TABLE]
Thus
[TABLE]
Combined with Markov’s inequality this gives the second assertion.
Conditional versions of and are obtained by replacing the unconditional expectations by the operator . Thus, for ,
[TABLE]
Then , and are members of . Observe that for any , a fact which will be frequently used in the sequel.
Lemma 22
Let be bounded measurable functions on . Then for any expectation
[TABLE]
Proof. Define an expectation functional by . The function is convex for positive , since . Thus, by Jensen’s inequality,
[TABLE]
The heart of the entropy method is the following theorem, which asserts the subadditivity of entropy.
Theorem 23
[TABLE]
Proof. Set and write as a telescopic product to get
[TABLE]
where we applied Lemma 22 to the expectation functional . From the definition of we then obtain
[TABLE]
We combine this with the following fluctuation representation of entropy.
Proposition 24
We have for
[TABLE]
Proof. Using and the fundamental theorem of calculus we obtain the formulas
[TABLE]
which we subtract to obtain
[TABLE]
The same argument gives the second inequality.
Combining Theorem 23 and Proposition 24 we obtain the following, very useful inequality (Theorem 7 in [14])
[TABLE]
which leads to a number of concentration inequalities, when used together with Lemma 8. The celebrated "bounded difference inequality" (see e.g. McDiarmid [11], Theorem 3.7), for example, is an almost immediate consequence. We will also use a simple variational bound on the conditional thermal variance:
[TABLE]
We need two applications of (16). Recall the definition of the real function .
Proof of Lemma 9. For any , letting in (17),
[TABLE]
Thus with (16)
[TABLE]
Recall the definition of the operator by
[TABLE]
Proof of Lemma 11. We abbreviate to . Replacing by in (17) we get
[TABLE]
We now claim that the right hand side above is a non-decreasing function of . To see this write and define a real function by . By a straightforward computation we obtain
[TABLE]
where the last inequality uses the well known fact that for and any expectation whenever is a nondecreasing function. This establishes the claim.
Using (16) it follows that
[TABLE]
where we used the identity .
- [1] Arcones, M. A. (1995). A Bernstein-type inequality for U-statistics and U-processes. Statistics & Probability Letters, 22(3), 239-247.
- [2] Bernstein, S. (1927). Theory of Probability. Moscow.
- [3] Boucheron, S., Lugosi, G., & Massart, P. (2003). Concentration inequalities using the entropy method. The Annals of Probability, 31(3).
- [4] Boucheron, S., Lugosi, G., & Massart, P. (2009). On concentration of self-bounding functions. Electronic Journal of Probability, 14, Paper no. 64, 1884-1899.
- [5] Boucheron, S., Lugosi, G., & Massart, P. (2013). Concentration Inequalities. Oxford University Press.
- [6] Bousquet, O., & Elisseeff, A. (2002). Stability and generalization. Journal of Machine Learning Research, 2(Mar), 499-526.
- [7] Efron, B., & Stein, C. (1981). The jackknife estimate of variance. The Annals of Statistics, 586-596.
- [8] Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. The Annals of Mathematical Statistics, 293-325.
