A Bernstein-type inequality for functions of bounded interaction

Andreas Maurer

arXiv:1701.06191·math.PR·May 12, 2017

TL;DR

This paper introduces a Bernstein-type concentration inequality for functions of independent variables with bounded interaction, extending classical bounds to more complex functions and improving results for U-statistics and regularized least squares.

Contribution

It provides a new distribution-dependent concentration inequality that generalizes Bernstein's inequality to functions with limited interaction among variables.

Findings

1. Sharper bounds for U-statistics
2. Improved generalization error estimates for regularized least squares
3. Extension of Bernstein's inequality to complex functions

Abstract

We give a distribution-dependent concentration inequality for functions of independent variables. The result extends Bernstein's inequality from sums to more general functions, whose variation in any argument does not depend too much on the other arguments. Applications sharpen existing bounds for U-statistics and the generalization error of regularized least squares.

Equations

$\Pr\left\{f(X_{1},...,X_{n})-E\left[f(X_{1},...,X_{n})\right]>t\right\}\leq\exp\left(\frac{-t^{2}}{2\sum_{k}\sigma_{k}^{2}+2t/3}\right),$

$(S_{y}^{k}f)(x_{1},...,x_{n})=f(x_{1},...,x_{k-1},y,x_{k+1},...,x_{n})$

$E_{k}f$

$\sigma_{k}^{2}(f)$

$\Sigma^{2}(f)=\sum_{k=1}^{n}\sigma_{k}^{2}(f).$

$\sigma^{2}(f)\leq E\left[\Sigma^{2}(f)\right],$

$\Pr\left\{f-Ef>t\right\}\leq\exp\left(\frac{-t^{2}}{2\sup_{\mathbf{x}\in\Omega}\Sigma^{2}(f)(\mathbf{x})+2bt/3}\right).$

$J(f)=\left(\sup_{\mathbf{x}\in\Omega}\sum_{k,l:k\neq l}\sup_{z,z^{\prime}\in\Omega_{l}}\sup_{y,y^{\prime}\in\Omega_{k}}\left(D_{z,z^{\prime}}^{l}D_{y,y^{\prime}}^{k}f\right)^{2}(\mathbf{x})\right)^{1/2}$ for $f\in\mathcal{A}(\Omega).$

$J_{\mu}(f)=2\left(\sup_{\mathbf{x}\in\Omega}\sum_{l}\sup_{z\in\Omega_{l}}\sum_{k:k\neq l}\sigma_{k}^{2}\left(f-S_{z}^{l}f\right)(\mathbf{x})\right)^{1/2}.$

$J_{\mu}(f)$

$\Pr\left\{f-Ef>t\right\}\leq\exp\left(\frac{-t^{2}}{2E\left[\Sigma^{2}(f)\right]+\left(2b/3+J_{\mu}(f)\right)t}\right).$

$E\left[\Sigma^{2}(f)\right]\leq\sigma^{2}(f)+\frac{1}{4}J^{2}(f).$

$\Pr\left\{f-Ef>t\right\}\leq\exp\left(\frac{-t^{2}}{2\sigma^{2}(f)+J^{2}(f)/2+\left(2b/3+J_{\mu}(f)\right)t}\right).$

$u(\mathbf{x})=\binom{n}{m}^{-1}\sum_{1\leq j_{1}<...<j_{m}\leq n}g(x_{j_{1}},...,x_{j_{m}}),$

$\Pr\left\{\left|u-Eu\right|>t\right\}\leq 2\exp\left(\frac{-nt^{2}}{2m^{2}\sigma_{y\sim\mu_{0}}^{2}\left(E_{x\sim\mu_{0}^{m-1}}\left[g(y,x)\right]\right)+\frac{m^{2}(m-1)^{2}}{n-m}+16m^{2}t/3}\right).$

$4\exp\left(\frac{-nt^{2}}{2m^{2}\sigma_{y\sim\mu_{0}}^{2}\left(E_{x\sim\mu_{0}^{m-1}}\left[g(y,x)\right]\right)+\left(2^{m+2}m^{m}(n-1)/n+\frac{2}{3}m^{-1}\right)t}\right).$

$w_{\mathbf{z}}=\arg\min_{w\in H}\frac{1}{n}\sum_{i=1}^{n}\left(\langle w,x_{i}\rangle-y_{i}\right)^{2}+\lambda\left\|w\right\|^{2}.$

$R(\mathbf{z})=E_{Z}\left(\langle w_{\mathbf{z}},X\rangle-Y\right)^{2}$ and $\hat{R}(\mathbf{z})=\frac{1}{n}\sum_{i=1}^{n}\left(\langle w_{\mathbf{z}},x_{i}\rangle-y_{i}\right)^{2}.$

$\Pr\left\{(R-\hat{R})-E(R-\hat{R})>t\right\}\leq\exp\left(\frac{-nt^{2}}{2nE\left[\Sigma^{2}(R-\hat{R})(X)\right]+c\lambda^{-3}t}\right).$

$(R-\hat{R})(Z)$

$E_{\beta f}\left[g\right]=Z_{\beta f}^{-1}E\left[ge^{\beta f}\right],\ g\in\mathcal{A}(\Omega),$

$S_{f}(\beta)=KL\left(Z_{\beta f}^{-1}e^{\beta f}d\mu,d\mu\right)=\beta E_{\beta f}\left[f\right]-\ln Z_{\beta f},$

$\ln E\left[e^{\beta(f-Ef)}\right]=\beta\int_{0}^{\beta}\frac{S_{f}(\gamma)}{\gamma^{2}}d\gamma$

$\Pr\left\{f-Ef>t\right\}\leq\exp\left(\beta\int_{0}^{\beta}\frac{S_{f}(\gamma)}{\gamma^{2}}d\gamma-\beta t\right).$

$S_{f}(\beta)\leq\psi(\beta)E_{\beta f}\left[\Sigma^{2}(f)\right].$

$Dg=\sum_{k}\left(g-\inf_{y\in\Omega_{k}}S_{y}^{k}g\right)^{2},$ for $g\in\mathcal{A}(\Omega).$

$S_{f}(\beta)\leq\frac{\beta^{2}}{2}E_{\beta f}\left[Df\right].$

$Df\leq a^{2}f.$

$\ln E\left[e^{\beta f}\right]\leq\frac{\beta Ef}{1-a^{2}\beta/2},$

$\ln E\left[e^{\beta(f-E[f])}\right]$

$\ln E\left[e^{\beta f}\right]\leq\frac{a^{2}\beta}{2}\ln Ee^{\beta f}+\beta Ef,$

Full text

A Bernstein-type inequality for functions of bounded interaction

Andreas Maurer

Adalbertstr. 55, D-80799 Munich, Germany

am”at”andreas-maurer.eu

Abstract

We give a distribution-dependent concentration inequality for functions of independent variables. The result extends Bernstein’s inequality from sums to more general functions, whose variation in any argument does not depend too much on the other arguments. Applications sharpen existing bounds for U-statistics and the generalization error of regularized least squares.

1 Introduction

If $X_{1},...,X_{n}$ are independent real random variables, with $X_{k}-EX_{k}\leq 1$ almost surely, and $f\left(X_{1},...,X_{n}\right)=\sum_{k}X_{k}$ , then Bernstein’s inequality [2] asserts that for $t>0$

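$$\Pr\left\{f\left(X_{1},...,X_{n}\right)-E\left[f\left(X_{1},...,X_{n}\right)\right]>t\right\}\leq\exp\left(\frac{-t^{2}}{2\sum_{k}\sigma_{k}^{2}+2t/3}\right),$$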

where $\sigma_{k}^{2}$ is the respective variance of $X_{k}$ . In this work we extend Bernstein’s inequality to more general functions $f$ .

This extension requires two modifications. First the variance $\sum_{k}\sigma_{k}^{2}$ is replaced by the Efron-Stein upper bound, or jackknife estimate, of the variance. Secondly a correction term $J\left(f\right)$ is added to the coefficient $2/3$ of $t$ in the denominator of the exponent. This correction term, which we call the interaction functional of $f$ , vanishes for sums and represents the extent to which the variation of $f$ in any given argument depends on other arguments.

To proceed we introduce some notation and conventions. Let $\Omega=\prod_{k=1}^{n}\Omega_{k}$ be some product of measurable spaces and let $\mathcal{A}\left(\Omega\right)$ be the algebra of all bounded, measurable real valued functions on $\Omega$ . For fixed $k\in\left\{1,...,n\right\}$ and $y,y^{\prime}\in\Omega_{k}$ define the substitution operator $S_{y}^{k}$ and the difference operator $D_{y,y^{\prime}}^{k}$ on $\mathcal{A}\left(\Omega\right)$ by

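$$\left(S_{y}^{k}f\right)\left(x_{1},...,x_{n}\right)=f\left(x_{1},...,x_{k-1},y,x_{k+1},...,x_{n}\right)$$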

and $D_{y,y^{\prime}}^{k}=S_{y}^{k}-S_{y^{\prime}}^{k}$ . Both $S_{y}^{k}f$ and $D_{y,y^{\prime}}^{k}f$ are independent of $x_{k}$ .

Let a probability measure $\mu_{k}$ be given on each $\Omega_{k}$ and let $\mu$ be the product measure $\mu=\prod\mu_{k}$ on $\Omega$ . For $f\in\mathcal{A}\left(\Omega\right)$ the expectation $Ef$ and variance $\sigma^{2}\left(f\right)$ are defined as $Ef=\int_{\Omega}fd\mu$ and $\sigma^{2}\left(f\right)=E\left[\left(f-Ef\right)^{2}\right]$ . For $k\in\left\{1,...,n\right\}$ the conditional expectation $E_{k}$ and the conditional variance $\sigma_{k}^{2}$ are operators on $\mathcal{A}\left(\Omega\right)$ , which act on a function $f\in\mathcal{A}\left(\Omega\right)$ as

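$$E_{k}f=E_{y\sim\mu_{k}}\left[S_{y}^{k}f\right]\quad\text{and}\quad\sigma_{k}^{2}\left(f\right)=E_{k}\left[\left(f-E_{k}f\right)^{2}\right]=\frac{1}{2}E_{\left(y,y^{\prime}\right)\sim\mu_{k}^{2}}\left[\left(D_{y,y^{\prime}}^{k}f\right)^{2}\right],$$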

where $\mu_{k}^{2}$ is the product measure $\mu_{k}\times\mu_{k}$ on $\Omega_{k}\times\Omega_{k}$ . The sum of conditional variances (SCV) operator $\Sigma^{2}\left(f\right):\mathcal{A}\left(\Omega\right)\rightarrow\mathcal{A}\left(\Omega\right)$ is defined as

$$\Sigma^{2}\left(f\right)=\sum_{k=1}^{n}\sigma_{k}^{2}\left(f\right).$$

This operator appears in the Efron-Stein inequality ([7],[15], see also Section 2.4) as

$$\sigma^{2}\left(f\right)\leq E\left[\Sigma^{2}\left(f\right)\right],$$

which becomes an equality if $f$ is a sum of real valued functions $X_{k}$ on $\Omega_{k}$ . It also appears in the following exponential tail bound (see McDiarmid [11], Theorem 3.8, or [14], Theorem 11).

Theorem 1

Suppose that $f\in\mathcal{A}\left(\Omega\right)$ satisfies $f-E_{k}f\leq b$ for all $k\in\left\{1,...,n\right\}$ . Then

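$$\Pr\left\{f-Ef>t\right\}\leq\exp\left(\frac{-t^{2}}{2\sup_{\mathbf{x}\in\Omega}\Sigma^{2}\left(f\right)\left(\mathbf{x}\right)+2bt/3}\right).$$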

This inequality reduces to Bernstein’s inequality if $f$ is a sum, but it suffers from the worst-case choice of the configuration $\mathbf{x}$ , for which $\Sigma^{2}\left(f\right)\left(\mathbf{x}\right)$ is evaluated. The supremum in $\mathbf{x}$ is a hindrance to estimation of the variance term, and we would like to replace it by an expectation, just as in the Efron-Stein inequality.
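To make these objects concrete, the following minimal sketch evaluates $\Sigma^{2}\left(f\right)$ by full enumeration on a small finite product space and compares the variance, the expectation of $\Sigma^{2}\left(f\right)$ and its supremum. The spaces, weights and test functions (`SPACES`, `PROBS`, `f_max`, `f_sum`) are illustrative assumptions, not taken from the paper.

```python
from itertools import product

# Illustrative finite product space: SPACES[k] lists Omega_k, PROBS[k] the weights of mu_k.
SPACES = [(0, 1), (0, 1, 2), (0, 1)]
PROBS = [(0.5, 0.5), (0.2, 0.3, 0.5), (0.7, 0.3)]


def expect(h):
    """E[h] under the product measure mu, by full enumeration."""
    total = 0.0
    for idx in product(*(range(len(s)) for s in SPACES)):
        weight = 1.0
        for k, i in enumerate(idx):
            weight *= PROBS[k][i]
        total += weight * h(tuple(SPACES[k][i] for k, i in enumerate(idx)))
    return total


def cond_var(f, k, x):
    """sigma_k^2(f)(x): variance of f when only coordinate k is redrawn from mu_k."""
    vals = [f(x[:k] + (y,) + x[k + 1:]) for y in SPACES[k]]
    mean = sum(p * v for p, v in zip(PROBS[k], vals))
    return sum(p * (v - mean) ** 2 for p, v in zip(PROBS[k], vals))


def scv(f, x):
    """Sigma^2(f)(x): the sum of conditional variances at the configuration x."""
    return sum(cond_var(f, k, x) for k in range(len(SPACES)))


def variance(f):
    m = expect(f)
    return expect(lambda x: (f(x) - m) ** 2)


def mean_scv(f):
    return expect(lambda x: scv(f, x))


def sup_scv(f):
    return max(scv(f, x) for x in product(*SPACES))


def f_max(x):
    return max(x)  # a non-additive function of the coordinates


def f_sum(x):
    return sum(x)  # a sum, for which Sigma^2(f) is constant


for name, f in (("max", f_max), ("sum", f_sum)):
    print(name, variance(f), mean_scv(f), sup_scv(f))
```

For the sum the three printed quantities coincide, while for the maximum they are ordered as variance, expectation of $\Sigma^{2}$, supremum of $\Sigma^{2}$, which is the gap the text proposes to exploit.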

This replacement is trivially possible when $f$ is a sum, because then $\Sigma^{2}\left(f\right)$ is constant. It turns out that it is also possible if $\Sigma^{2}\left(f\right)$ has the right properties of concentration about its mean, a surrogate for being constant, so to speak. To ensure this we control the interaction between the different arguments of $f$ , in the sense that the variation in any argument must not depend too much on the other arguments.

Definition 2

The interaction functional $J:\mathcal{A}\left(\Omega\right)\rightarrow\mathbb{R}_{0}^{+}$ is defined by

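$$J\left(f\right)=\left(\sup_{\mathbf{x}\in\Omega}\sum_{k,l:k\neq l}\sup_{z,z^{\prime}\in\Omega_{l}}\sup_{y,y^{\prime}\in\Omega_{k}}\left(D_{z,z^{\prime}}^{l}D_{y,y^{\prime}}^{k}f\right)^{2}\left(\mathbf{x}\right)\right)^{1/2}\text{ for }f\in\mathcal{A}\left(\Omega\right).$$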

The distribution-dependent interaction functional $J_{\mu}$ is defined by

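$$J_{\mu}\left(f\right)=2\left(\sup_{\mathbf{x}\in\Omega}\sum_{l}\sup_{z\in\Omega_{l}}\sum_{k:k\neq l}\sigma_{k}^{2}\left(f-S_{z}^{l}f\right)\left(\mathbf{x}\right)\right)^{1/2}.$$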

These quantities are related and bounded using the inequalities

[TABLE]

(see the end of section 2.3). For our applications below the last, simplest and crudest bound appears to be sufficient. The above functionals and bounds vanish for sums and are positive homogeneous of degree one. The following is our main result.

Theorem 3

Suppose $f\in\mathcal{A}\left(\Omega\right)$ satisfies $f-E_{k}f\leq b$ for all $k$ . Then for all $t>0$

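$$\Pr\left\{f-Ef>t\right\}\leq\exp\left(\frac{-t^{2}}{2E\left[\Sigma^{2}\left(f\right)\right]+\left(2b/3+J_{\mu}\left(f\right)\right)t}\right).$$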

Remarks:

1. If this is applied to sums of independent random variables (real valued functions $X_{k}$ defined on $\Omega_{k}$ ), we recover Bernstein’s inequality.
2. Consider the case that $\Omega_{k}=\Omega_{0}$ , $\mu_{k}=\mu_{0}$ and a sequence of functions $f_{n}\in\mathcal{A}\left(\Omega_{0}^{n}\right)$ , such that $J_{\mu}\left(f_{n}\right)/\sqrt{n}\rightarrow 0$ (for example if $J_{\mu}\left(f_{n}\right)$ is bounded) and such that the limit $\sigma^{2}=\lim_{n\rightarrow\infty}E\left[\Sigma^{2}\left(f_{n}\right)\right]/n$ exists. Applying Theorem 3 to the sequence $f_{n}/\sqrt{n}$ , and letting $n\rightarrow\infty$ , we obtain the tail of a normal distribution with variance $\sigma^{2}$ . In some cases, like U-statistics, this is known to be the correct limiting distribution (Hoeffding [8], Theorem 7.1).
3. Although the distribution-dependent functional $J_{\mu}$ is potentially much smaller than $J$ , in the applications considered so far it seems sufficient to consider $J$ or the above bounds thereof.
4. Since $E\left[\Sigma^{2}\left(f\right)\right]\leq\sup_{\mathbf{x}}\Sigma^{2}\left(f\right)\left(\mathbf{x}\right)\leq\sup_{\mathbf{x}}\left(1/4\right)\sum_{k}\sup_{y,y^{\prime}}\left(D_{y,y^{\prime}}^{k}\left(f\right)\right)^{2}\left(\mathbf{x}\right)$ , the variance term above can never be larger than the variance term in Theorem 1, which in turn can never be larger than what we get from the bounded difference inequality (McDiarmid [11], Theorem 3.7, or Boucheron et al [5], Theorem 6.5).
5. If also $f-E_{k}f\geq-b$ , then the result can be applied to $-f$ so as to obtain a two-sided inequality.
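The interaction functional $J$ of Definition 2 can be evaluated by brute force on small finite spaces, which makes two of its stated properties easy to check: it vanishes for sums and it is positive homogeneous of degree one. The sketch below assumes the reading of $J$ as the square root of the supremum over configurations of the summed squared second differences; the helper names (`sub`, `second_diff`, `interaction`) and the test functions are illustrative.

```python
from itertools import product


def sub(x, k, y):
    """Substitution operator: replace coordinate k of the tuple x by y."""
    return x[:k] + (y,) + x[k + 1:]


def second_diff(f, x, k, y, y2, l, z, z2):
    """(D^l_{z,z'} D^k_{y,y'} f)(x) for k != l."""
    def first_diff(v):
        return f(sub(v, k, y)) - f(sub(v, k, y2))  # (D^k_{y,y'} f)(v)
    return first_diff(sub(x, l, z)) - first_diff(sub(x, l, z2))


def interaction(f, spaces):
    """Brute-force J(f) on a finite product space given as a list of tuples."""
    n = len(spaces)
    best = 0.0
    for x in product(*spaces):
        total = 0.0
        for k in range(n):
            for l in range(n):
                if k == l:
                    continue
                total += max(
                    second_diff(f, x, k, y, y2, l, z, z2) ** 2
                    for y in spaces[k] for y2 in spaces[k]
                    for z in spaces[l] for z2 in spaces[l]
                )
        best = max(best, total)
    return best ** 0.5


spaces = [(0, 1), (0, 1)]


def f_sum(x):
    return x[0] + x[1]


def f_prod(x):
    return x[0] * x[1]


def f_prod_scaled(x):
    return 3 * x[0] * x[1]


print(interaction(f_sum, spaces))          # no interaction between arguments
print(interaction(f_prod, spaces))         # second differences are (y - y')(z - z')
print(interaction(f_prod_scaled, spaces))  # three times the previous value
```

For the product of two binary coordinates each ordered pair $\left(k,l\right)$ contributes a maximal squared second difference of 1, so the functional equals $\sqrt{2}$ in this toy case.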

In Theorem 2.1 of [9] Christian Houdré bounds the bias in the Efron-Stein inequality in terms of iterated jackknife estimates of variance, which correspond to the expectations of higher order differences. The second of these iterates can be bounded in terms of the interaction functional and allows us to put the variance $\sigma^{2}\left(f\right)$ back into the inequality of Theorem 3.

Proposition 4

$$E\left[\Sigma^{2}\left(f\right)\right]\leq\sigma^{2}\left(f\right)+\frac{1}{4}J^{2}\left(f\right).$$

See Section 2.4 for the proof. In combination with Theorem 3 we obtain the following corollary.

Corollary 5

Suppose $f\in\mathcal{A}\left(\Omega\right)$ and $f-E_{k}f\leq b$ for all $k$ . Then for all $t>0$

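$$\Pr\left\{f-Ef>t\right\}\leq\exp\left(\frac{-t^{2}}{2\sigma^{2}\left(f\right)+J^{2}\left(f\right)/2+\left(2b/3+J_{\mu}\left(f\right)\right)t}\right).$$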

We apply Theorem 3 in two seemingly very different situations.

For U-statistics with bounded, symmetric kernels it is surprisingly easy to bound the interaction functional, and an application of Theorem 3 leads to the following concentration result.

Theorem 6

If $\mu_{0}$ is a probability measure on $\mathcal{X}$ and $\mu=\mu_{0}^{n}$ on $\mathcal{X}^{n},$ and $g$ is a measurable, symmetric (permutation invariant) kernel $g:\mathcal{X}^{m}\rightarrow\left[-1,1\right]$ with $1<m<n$ , and $u\in\mathcal{A}\left(\mathcal{X}^{n}\right)$ is defined by

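$$u\left(\mathbf{x}\right)=\binom{n}{m}^{-1}\sum_{1\leq j_{1}<...<j_{m}\leq n}g\left(x_{j_{1}},...,x_{j_{m}}\right),$$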

then for $t>0$

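$$\Pr\left\{\left|u-Eu\right|>t\right\}\leq 2\exp\left(\frac{-nt^{2}}{2m^{2}\sigma_{y\sim\mu_{0}}^{2}\left(E_{x\sim\mu_{0}^{m-1}}\left[g\left(y,x\right)\right]\right)+\frac{m^{2}\left(m-1\right)^{2}}{n-m}+16m^{2}t/3}\right).$$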

A similar bound given by Arcones ([1], Theorem 2) is

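$$4\exp\left(\frac{-nt^{2}}{2m^{2}\sigma_{y\sim\mu_{0}}^{2}\left(E_{x\sim\mu_{0}^{m-1}}\left[g\left(y,x\right)\right]\right)+\left(2^{m+2}m^{m}\left(n-1\right)/n+\frac{2}{3}m^{-1}\right)t}\right).$$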

For large $m$ , $n$ or deviation $t$ the bound in Theorem 6 is the smaller one of the two. Already for order $m=2$ it gives an improvement if $\left(n-m\right)t\geq 0.12$ . For order $m=3$ the crossover is already at $\left(n-m\right)t\approx 6\times 10^{-2}$ , for order $m=4$ at $\left(n-m\right)t\approx 10^{-2}$ .
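A quick way to explore this comparison is to evaluate both right-hand sides numerically. The sketch below is hedged: the two expressions are transcribed from the page's equation list, the reading of the term $2/3\,m^{-1}$ as $2/(3m)$ is an assumption, and `var_proj` is a hypothetical name for the variance term $\sigma_{y\sim\mu_{0}}^{2}\left(E_{x\sim\mu_{0}^{m-1}}\left[g\left(y,x\right)\right]\right)$.

```python
import math


def theorem6_bound(t, n, m, var_proj):
    """Right-hand side of Theorem 6 as transcribed; var_proj is the variance term."""
    denom = 2 * m * m * var_proj + m * m * (m - 1) ** 2 / (n - m) + 16 * m * m * t / 3
    return 2 * math.exp(-n * t * t / denom)


def arcones_bound(t, n, m, var_proj):
    """Bound attributed to Arcones as transcribed; 2/3 m^{-1} is read as 2/(3m)."""
    denom = 2 * m * m * var_proj + (2 ** (m + 2) * m ** m * (n - 1) / n + 2 / (3 * m)) * t
    return 4 * math.exp(-n * t * t / denom)


# Illustrative parameter choice (not from the paper).
t, n, m, var_proj = 0.5, 1000, 2, 0.1
print(theorem6_bound(t, n, m, var_proj), arcones_bound(t, n, m, var_proj))
```

Scanning `t` for fixed `n` and `m` with these two functions gives a rough picture of where the bounds cross, subject to the transcription caveats above.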

In a completely different context Theorem 3 can be applied to sharpen a stability based generalization bound for regularized least squares (RLS).

Let $\mathbb{B}$ be the unit ball in a separable, real Hilbert space $H$ , and let $\mathcal{Z}=\mathbb{B}\times\left[-1,1\right]$ . Fix $\lambda\in\left(0,1\right)$ . For $\mathbf{z}=\left(\left(x_{1},y_{1}\right),...,\left(x_{n},y_{n}\right)\right)\in\mathcal{Z}^{n}$ regularized least squares returns the vector

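$$w_{\mathbf{z}}=\arg\min_{w\in H}\frac{1}{n}\sum_{i=1}^{n}\left(\left\langle w,x_{i}\right\rangle-y_{i}\right)^{2}+\lambda\left\|w\right\|^{2}.$$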

Let $\mathbf{Z}=\left(Z_{1},...,Z_{n}\right)$ be a vector of independent random variables with values in $\mathcal{Z}$ , where $Z_{i}$ is identically distributed to $Z=\left(X,Y\right)$ . We can apply Theorem 3 to obtain tail bounds for the random variable $R\left(\mathbf{Z}\right)-\hat{R}\left(\mathbf{Z}\right)$ , where the "true error" $R$ and the "empirical error" $\hat{R}$ are defined on $\mathcal{Z}^{n}$ by

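$$R\left(\mathbf{z}\right)=E_{Z}\left(\left\langle w_{\mathbf{z}},X\right\rangle-Y\right)^{2}\text{ and }\hat{R}\left(\mathbf{z}\right)=\frac{1}{n}\sum_{i=1}^{n}\left(\left\langle w_{\mathbf{z}},x_{i}\right\rangle-y_{i}\right)^{2}.$$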

We can prove the following result.

Theorem 7

There is an absolute constant $c$ such that for every $t>0$

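$$\Pr\left\{\left(R-\hat{R}\right)-E\left(R-\hat{R}\right)>t\right\}\leq\exp\left(\frac{-nt^{2}}{2nE\left[\Sigma^{2}\left(R-\hat{R}\right)\left(X\right)\right]+c\lambda^{-3}t}\right).$$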

Solving for $t$ with a fixed bound $\delta$ on the probability we obtain that with probability at least $1-\delta$ in $\mathbf{Z}$

[TABLE]

It can be shown ([6]) that the expectation $E\left[\left(R-\hat{R}\right)\left(\mathbf{Z}\right)\right]$ is of order $1/n$ , so for large sample sizes the generalization error $\left(R-\hat{R}\right)\left(\mathbf{Z}\right)$ is dominated by the variance term, which may be considerably smaller than the distribution-independent bound obtained from the bounded difference inequality as in [6] (it can never be larger because of Remark 4 above). Using techniques as in [13] this term can in principle be estimated from a sample and the estimate combined with the above to a purely data-dependent bound.
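The step of solving for $t$ follows the usual inversion of a Bernstein-type tail. The following is a sketch only, assuming the tail in Theorem 7 has the form $\exp\left(-nt^{2}/\left(2nV+c\lambda^{-3}t\right)\right)$ ; the shorthand $V=E\left[\Sigma^{2}\left(R-\hat{R}\right)\right]$ and $L=\ln\left(1/\delta\right)$ is introduced here for illustration, and the resulting expression may differ in form from the paper's own display.

```latex
\begin{align*}
&\text{Let } t_{\delta}=\sqrt{2VL}+\frac{c\lambda^{-3}L}{n}. \text{ Then}\\
&n t_{\delta}^{2}
  = 2nVL + 2c\lambda^{-3}L\sqrt{2VL} + \frac{c^{2}\lambda^{-6}L^{2}}{n}
  \;\geq\; 2nVL + c\lambda^{-3}L\sqrt{2VL} + \frac{c^{2}\lambda^{-6}L^{2}}{n}
  = L\left(2nV + c\lambda^{-3}t_{\delta}\right),\\
&\text{so that } \exp\left(\frac{-n t_{\delta}^{2}}{2nV + c\lambda^{-3}t_{\delta}}\right)\leq e^{-L}=\delta .\\
&\text{Hence, with probability at least } 1-\delta \text{ in } \mathbf{Z},\\
&\left(R-\hat{R}\right)\left(\mathbf{Z}\right)
  \leq E\left[\left(R-\hat{R}\right)\left(\mathbf{Z}\right)\right]
  + \sqrt{2\,E\left[\Sigma^{2}\left(R-\hat{R}\right)\right]\ln\left(1/\delta\right)}
  + \frac{c\lambda^{-3}\ln\left(1/\delta\right)}{n}.
\end{align*}
```

In this form the square-root term is the variance term referred to above, and the $\lambda^{-3}/n$ term is the one discussed in the next paragraph.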

A major drawback here is the dependence on $\lambda^{-3}$ in the last term, because in practical applications the regularization parameter $\lambda$ typically decreases with $n$ . The $\lambda^{-3}$ is likely due to a very crude method of bounding $J\left(f\right)$ by differentiation. A more intelligent method might give $\lambda^{-2}n^{-1}$ .

It seems plausible that similar bounds exist for Tychonov regularization with other more general loss functions having appropriate properties.

The idea of using second differences (as in the definition of $J$ ) has been put to work by Houdré [9] to estimate the bias in the Efron-Stein inequality. The entropy method, which underlies our proof of Theorem 3, has been developed by a number of authors, notably Ledoux [10] and Boucheron, Lugosi and Massart [3]. The latter work also introduces the key idea of combining it with the decoupling method used below. Our proof follows a thermodynamic formulation of the entropy method as laid out in [14].

The next section gives a proof of Theorem 3. Then follow the applications to U-statistics and ridge regression.

2 Proof of Theorem 3

The proof of our main result, Theorem 3, uses the entropy method ([10], [3],[5]), from which the next section collects a set of tools. These results are taken from [14], which gives more detailed proofs and additional motivation. For the benefit of the reader, and to make the paper more self-contained, corresponding proofs are also given in a technical appendix.

2.1 Definitions and tools

$\Omega$ and $\mathcal{A}\left(\Omega\right)$ are as in the introduction, $\mathcal{A}_{k}\left(\Omega\right)$ is the subalgebra of $\mathcal{A}\left(\Omega\right)$ of those bounded, measurable functions on $\Omega$ which are independent of the $k$ -th coordinate. For $f\in\mathcal{A}\left(\Omega\right)$ and $\beta\in\mathbb{R}$ define the expectation functional $E_{\beta f}$ on $\mathcal{A}\left(\Omega\right)$ by

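$$E_{\beta f}\left[g\right]=Z_{\beta f}^{-1}E\left[ge^{\beta f}\right],\quad g\in\mathcal{A}\left(\Omega\right),$$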

where $Z_{\beta f}=E\left[e^{\beta f}\right]$ . The entropy $S_{f}\left(\beta\right)$ of $f$ at $\beta$ is given by

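$$S_{f}\left(\beta\right)=KL\left(Z_{\beta f}^{-1}e^{\beta f}d\mu,d\mu\right)=\beta E_{\beta f}\left[f\right]-\ln Z_{\beta f},$$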

where $KL\left(\nu,\mu\right)$ is the Kullback-Leibler divergence.

Lemma 8

(Theorem 1 in [14]) For any $f\in\mathcal{A}\left(\Omega\right)$ and $\beta>0$ we have

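$$\ln E\left[e^{\beta\left(f-Ef\right)}\right]=\beta\int_{0}^{\beta}\frac{S_{f}\left(\gamma\right)}{\gamma^{2}}d\gamma$$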

and, for $t\geq 0$ ,

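$$\Pr\left\{f-Ef>t\right\}\leq\exp\left(\beta\int_{0}^{\beta}\frac{S_{f}\left(\gamma\right)}{\gamma^{2}}d\gamma-\beta t\right).$$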
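The identity in Lemma 8, $\ln E\left[e^{\beta\left(f-Ef\right)}\right]=\beta\int_{0}^{\beta}S_{f}\left(\gamma\right)\gamma^{-2}d\gamma$ , can be checked numerically on a toy example. The sketch below uses an illustrative three-point distribution (the lists `values` and `probs` are assumptions) and a plain midpoint rule for the integral.

```python
import math

values = [0.0, 1.0, 3.0]  # values of f on a three-point space (illustrative)
probs = [0.5, 0.3, 0.2]   # the measure mu (illustrative)


def entropy(beta):
    """S_f(beta) = beta * E_{beta f}[f] - ln Z_{beta f}."""
    z = sum(p * math.exp(beta * v) for p, v in zip(probs, values))
    tilted_mean = sum(p * v * math.exp(beta * v) for p, v in zip(probs, values)) / z
    return beta * tilted_mean - math.log(z)


def lhs(beta):
    """ln E[exp(beta (f - E f))]."""
    mean = sum(p * v for p, v in zip(probs, values))
    z = sum(p * math.exp(beta * v) for p, v in zip(probs, values))
    return math.log(z) - beta * mean


def rhs(beta, steps=20000):
    """beta * integral_0^beta S_f(gamma) / gamma^2 d gamma, by the midpoint rule."""
    h = beta / steps
    total = 0.0
    for i in range(steps):
        gamma = (i + 0.5) * h  # midpoints avoid the removable singularity at 0
        total += entropy(gamma) / gamma ** 2
    return beta * h * total


print(lhs(1.0), rhs(1.0))
```

The two printed numbers agree up to the quadrature error; the integrand stays bounded near zero because $S_{f}\left(\gamma\right)$ is of order $\gamma^{2}$ there.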

Define the real function $\psi$ by $\psi\left(t\right):=te^{t}-e^{t}+1$ .

Lemma 9

(Lemma 10 in [14]) Let $f\in\mathcal{A}\left(\Omega\right)$ satisfy $f-E_{k}f\leq 1$ for all $k\in\left\{1,...,n\right\}$ . Then for $\beta>0$

$$S_{f}\left(\beta\right)\leq\psi\left(\beta\right)E_{\beta f}\left[\Sigma^{2}\left(f\right)\right].$$

Bounding $E_{\beta f}\left[\Sigma^{2}\left(f\right)\right]\leq\sup_{\mathbf{x}}\Sigma^{2}\left(f\right)\left(\mathbf{x}\right)$ and using Lemma 8 quickly leads to a proof of Theorem 1. For Theorem 3 we need more tools.

Definition 10

The operator $D:\mathcal{A}\left(\Omega\right)\rightarrow\mathcal{A}\left(\Omega\right)$ is defined by

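$$Dg=\sum_{k}\left(g-\inf_{y\in\Omega_{k}}S_{y}^{k}g\right)^{2},\text{ for }g\in\mathcal{A}\left(\Omega\right).$$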

To clarify: $\inf_{y\in\Omega_{k}}S_{y}^{k}g$ is the member of $\mathcal{A}\left(\Omega\right)$ defined by $\left(\inf_{y\in\Omega_{k}}S_{y}^{k}g\right)\left(\mathbf{x}\right)=\inf_{y\in\Omega_{k}}\left(S_{y}^{k}g\right)\left(\mathbf{x}\right)$ . It does not depend on $x_{k}$ , so $\inf_{y\in\Omega_{k}}S_{y}^{k}g\in\mathcal{A}_{k}\left(\Omega\right)$ .

Lemma 11

(Lemma 15 in [14], also Proposition 5 in [12]) We have, for $\beta>0$ , that

$$S_{f}\left(\beta\right)\leq\frac{\beta^{2}}{2}E_{\beta f}\left[Df\right].$$

We use this to derive the following property of weakly self-bounded functions, which, together with Proposition 17 below, gives the concentration property of $\Sigma^{2}\left(f\right)$ alluded to in the introduction.

Lemma 12

Suppose that

$$Df\leq a^{2}f.\qquad(2)$$

Then for $\beta\in\left(0,2/a^{2}\right)$

$$\ln E\left[e^{\beta f}\right]\leq\frac{\beta Ef}{1-a^{2}\beta/2}.$$

Proof. Using Lemma 8 and Lemma 11 and the weak self-boundedness assumption (2) we have for $\beta>0$ that

[TABLE]

where the last identity follows from the fact that $E_{\gamma f}\left[f\right]=\left(d/d\gamma\right)\ln Ee^{\gamma f}$ . Thus

$$\ln E\left[e^{\beta f}\right]\leq\frac{a^{2}\beta}{2}\ln Ee^{\beta f}+\beta Ef,$$

and rearranging this inequality for $\beta\in\left(0,2/a^{2}\right)$ establishes the claim.
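As a reading aid, the chain of inequalities behind this proof can be reconstructed from the ingredients it names: Lemma 8, Lemma 11 and the assumption $Df\leq a^{2}f$ . The following is a sketch under that reading and not a verbatim copy of the original display.

```latex
\begin{align*}
\ln E\left[e^{\beta\left(f-E\left[f\right]\right)}\right]
 &= \beta\int_{0}^{\beta}\frac{S_{f}\left(\gamma\right)}{\gamma^{2}}\,d\gamma
    && \text{(Lemma 8)}\\
 &\leq \frac{\beta}{2}\int_{0}^{\beta}E_{\gamma f}\left[Df\right]d\gamma
    && \text{(Lemma 11)}\\
 &\leq \frac{a^{2}\beta}{2}\int_{0}^{\beta}E_{\gamma f}\left[f\right]d\gamma
    && \text{(assumption } Df\leq a^{2}f\text{)}\\
 &= \frac{a^{2}\beta}{2}\ln E\left[e^{\beta f}\right]
    && \text{(since } E_{\gamma f}\left[f\right]=\tfrac{d}{d\gamma}\ln Ee^{\gamma f}\text{)}.
\end{align*}
% Adding \beta E f to both sides and collecting \ln E[e^{\beta f}] gives
%   (1 - a^{2}\beta/2) \ln E[e^{\beta f}] \leq \beta E f,
% and dividing by 1 - a^{2}\beta/2 > 0 requires 0 < \beta < 2/a^{2}.
```

The restriction $\beta\in\left(0,2/a^{2}\right)$ is exactly what keeps the factor $1-a^{2}\beta/2$ positive in the final division.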

We also use the following decoupling technique: If $\mu$ and $\nu$ are two probability measures and $\nu$ is absolutely continuous w.r.t. $\mu$ then it is easy to show that

[TABLE]

Applying this inequality when $\nu$ is the measure $Z_{\beta f}^{-1}e^{\beta f}d\mu$ we obtain the following.

Lemma 13

We have for any $g\in\mathcal{A}\left(\Omega\right)$ that

[TABLE]
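The decoupling inequality referred to here is not reproduced on this page. A common form of such an inequality, assumed in the sketch below, is $E_{\nu}\left[g\right]\leq KL\left(\nu,\mu\right)+\ln E_{\mu}\left[e^{g}\right]$ , with equality when $\nu$ is proportional to $e^{g}d\mu$ . The example checks this assumed form on illustrative three-point measures (`mu`, `nu_any` and `g` are made up for the purpose).

```python
import math


def kl(nu, mu):
    """Kullback-Leibler divergence KL(nu, mu) for finite distributions."""
    return sum(q * math.log(q / p) for q, p in zip(nu, mu) if q > 0)


def decoupling_gap(nu, mu, g):
    """KL(nu, mu) + ln E_mu[e^g] - E_nu[g]; nonnegative under the assumed inequality."""
    log_mgf = math.log(sum(p * math.exp(v) for p, v in zip(mu, g)))
    return kl(nu, mu) + log_mgf - sum(q * v for q, v in zip(nu, g))


mu = [0.5, 0.3, 0.2]
g = [1.0, -2.0, 0.5]
nu_any = [0.1, 0.6, 0.3]

# Equality case: nu proportional to e^g d mu (the tilted measure).
z = sum(p * math.exp(v) for p, v in zip(mu, g))
nu_tilted = [p * math.exp(v) / z for p, v in zip(mu, g)]

print(decoupling_gap(nu_any, mu, g), decoupling_gap(nu_tilted, mu, g))
```

With $\nu=Z_{\beta f}^{-1}e^{\beta f}d\mu$ the divergence term becomes the entropy $S_{f}\left(\beta\right)$ , which is how the inequality enters the proof of Proposition 16.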

2.2 A concentration inequality

We now use the tools of the previous section to prove an intermediate concentration inequality (Proposition 16) in the case that $\Sigma^{2}\left(f\right)$ satisfies the self-bounding hypothesis of Lemma 12. In the next section we show that this condition is satisfied if $a$ is taken equal to the interaction functional $J_{\mu}\left(f\right)$ , and together the two results then give Theorem 3.

We need two more auxiliary results. Recall the definition of the function $\psi\left(t\right):=te^{t}-e^{t}+1$ .

Lemma 14

For any $a\geq 0$ and $0\leq\gamma<1/\left(1/3+a/2\right)$ we have

(i) $a\sqrt{\psi\left(\gamma\right)/2}<1$ and

(ii)

[TABLE]

Proof. If $0\leq\gamma<1/\left(1/3+a/2\right)$ and $a\geq 0$ then $\gamma<3$ . In this case we have the two convergent power series representations

[TABLE]

Now $b_{0}=c_{0}=1/2$ by inspection and for $n\geq 1$

[TABLE]

so that $b_{n}\geq c_{n}$ for all non-negative $n$ . Term by term comparison of the two power series gives

[TABLE]

which is (ii) in the case that $a=0$ .

It also gives us for general $a>0$ that

[TABLE]

since $\gamma<1/\left(1/3+a/2\right)\implies\gamma/\left(2\left(1-\gamma/3\right)\right)<a^{-1}$ . This proves (i).
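Part (i) and the power series comparison can be sanity-checked numerically. The sketch below assumes that the term-by-term comparison yields $\psi\left(\gamma\right)\leq\gamma^{2}/\left(2\left(1-\gamma/3\right)^{2}\right)$ for $0\leq\gamma<3$ , which is an inferred reading of the missing display, and checks it together with (i) on a grid. The function names are illustrative.

```python
import math


def psi(t):
    """psi(t) = t e^t - e^t + 1, written with expm1 to limit cancellation."""
    return max(t * math.exp(t) - math.expm1(t), 0.0)


def series_bound(g):
    """Assumed majorant g^2 / (2 (1 - g/3)^2), valid for 0 <= g < 3."""
    return g * g / (2.0 * (1.0 - g / 3.0) ** 2)


def check_series(points=299):
    grid = [0.01 * i for i in range(1, points + 1)]  # 0.01, ..., 2.99
    return all(psi(g) <= series_bound(g) + 1e-12 for g in grid)


def check_part_i(a, points=1000):
    """a * sqrt(psi(gamma) / 2) < 1 on a grid strictly inside [0, 1/(1/3 + a/2))."""
    limit = 1.0 / (1.0 / 3.0 + a / 2.0)
    grid = [limit * i / (points + 1) for i in range(1, points + 1)]
    return all(a * math.sqrt(psi(g) / 2.0) < 1.0 for g in grid)


print(check_series(), [check_part_i(a) for a in (0.0, 0.5, 1.0, 10.0, 100.0)])
```

A grid check of this kind does not replace the proof, but it guards against sign or exponent slips when the inequalities are reused.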

(ii) is equivalent to

[TABLE]

To complete the proof it suffices by (5) to show that the right hand side above is, for fixed $\gamma,$ a non-decreasing function of $a\in\left[0,2\left(1-\gamma/3\right)/\gamma\right)$ . Let $b:=\sqrt{\psi\left(\gamma\right)/2}$ , $c:=\left(1-\gamma/3\right)$ and $d:=\gamma/2$ , so the expression in question becomes $\left(1-ab\right)^{2}/\left(2\left(c-ad\right)^{2}\right)$ . Calculus gives

[TABLE]

But $c-ad=1-\left(1/3+a/2\right)\gamma>0$ by assumption. Also $1-ab>0$ by (i) and, using (6),

[TABLE]

The expression $\left(1-ab\right)^{2}/\left(2\left(c-ad\right)^{2}\right)$ is therefore non-decreasing in $a$ .

Finally we need an optimization lemma.

Lemma 15

Let $C$ and $b$ denote two positive real numbers, $t>0$ . Then

[TABLE]

The proof of this lemma can be found in [12] (Lemma 12).

Proposition 16

Suppose that $f\in\mathcal{A}\left(\Omega\right)$ is such that $\forall k$ , $f-E_{k}\left(f\right)\leq 1$ , and that

[TABLE]

with $a\geq 0$ . Then for all $t>0$

[TABLE]

Proof. By a simple limiting argument we may assume that $a>0$ . Now let $0<\gamma\leq\beta<1/\left(1/3+a/2\right)$ . By Lemma 14 (i) $\theta:=\left(1/a\right)\sqrt{2\psi\left(\gamma\right)}<2/a^{2}$ and also $\theta>\sqrt{\psi\left(\gamma\right)/2}\sqrt{2\psi\left(\gamma\right)}=\psi\left(\gamma\right)$ . By Lemma 9

[TABLE]

where the second inequality follows from Lemma 13. Subtracting $\theta^{-1}\psi\left(\gamma\right)S_{f}\left(\gamma\right)$ , multiplying by $\theta$ and using Lemma 12 together with the assumed self-boundedness of $\Sigma^{2}\left(f\right)$ gives us

[TABLE]

which holds, since $\theta<2/a^{2}$ . Since $\theta>\psi\left(\gamma\right)$ we can divide by $\theta-\psi\left(\gamma\right)$ to rearrange and then use the definition of $\theta$ to obtain

[TABLE]

By Lemma 14 (ii) for $\beta<1/\left(1/3+a/2\right)$

[TABLE]

and from Lemma 8

[TABLE]

where we used Lemma 15 in the last step.

2.3 Self-boundedness of the sum of conditional variances

We record some obvious, but potentially confusing properties of the substitution operator. For $k\in\left\{1,...,n\right\}$ and $y\in\Omega_{k}$ the operator $S_{y}^{k}$ is a homomorphism of $\mathcal{A}\left(\Omega\right)$ and the identity on $\mathcal{A}_{k}\left(\Omega\right)$ . If $l\neq k$ it commutes with $S_{z}^{l}$ and with $E_{l}$ . Most importantly

[TABLE]

Note however that for $l=k$ we get $S_{y}^{k}S_{z}^{k}=S_{z}^{k}$ and $S_{y}^{k}E_{k}=E_{k}$ and $S_{y}^{k}\sigma_{k}^{2}=\sigma_{k}^{2}$ , because $S_{z}^{k}$ , $E_{k}$ and $\sigma_{k}^{2}$ map to $\mathcal{A}_{k}\left(\Omega\right)$ .

Proposition 17

We have $D\left(\Sigma^{2}\left(f\right)\right)\leq J_{\mu}\left(f\right)^{2}\,\Sigma^{2}\left(f\right)$ for any $f\in\mathcal{A}\left(\Omega\right)$ .

Proof. Fix $\mathbf{x}\in\Omega$ . Below all members of $\mathcal{A}\left(\Omega\right)$ are understood as evaluated on $\mathbf{x}$ . For $l\in\left\{1,...,n\right\}$ let $z_{l}\in\Omega_{l}$ be a minimizer in $z$ of $S_{z}^{l}\Sigma^{2}\left(f\right)$ (existence is assumed for simplicity, an approximate minimizer would also work), so that

[TABLE]

where we used the fact that $S_{z_{l}}^{l}\sigma_{l}^{2}\left(f\right)=\sigma_{l}^{2}\left(f\right)$ , because $\sigma_{l}^{2}\left(f\right)\in\mathcal{A}_{l}\left(\Omega\right)$ . Then

[TABLE]

This step gave us a sum over $k\neq l$ , which is important, because it allows us to use the commutativity properties mentioned above. Then, using $2\sigma_{k}^{2}\left(f\right)=E_{\left(y,y^{\prime}\right)\sim\mu_{k}^{2}}\left(D_{y,y^{\prime}}^{k}f\right)^{2}$ , we get

[TABLE]

by an application of Cauchy-Schwarz. Now, using $\left(a+b\right)^{2}\leq 2a^{2}+2b^{2}$ , we can bound the last sum independent of $l$ by

[TABLE]

so that

[TABLE]

Theorem 3 for the case $b=1$ is obtained by substituting $J_{\mu}\left(f\right)$ for $a$ in Proposition 16. The general case follows from rescaling and the homogeneity properties of $\Sigma^{2}$ and $J_{\mu}$ .

Of the inequalities in (1) only the first one is not completely obvious:

[TABLE]

In the last inequality we used the fact that the variance of a random variable is bounded by a quarter of the square of its range, so that $\sigma_{k}^{2}\left(f\right)\leq\left(1/4\right)\sup_{y,y^{\prime}}\left(D_{y,y^{\prime}}^{k}f\right)^{2}$ for all $f\in\mathcal{A}\left(\Omega\right)$ .

2.4 The Bias in the Efron-Stein inequality

Since the published work of Houdré [9] assumes symmetric functions and iid data, we give an independent derivation.

Let $X_{1},...,X_{n}$ be independent variables with $X_{i}$ distributed as $\mu_{i}$ in $\Omega_{i}$ , and let $X_{1}^{\prime},...,X_{n}^{\prime}$ be independent copies thereof. Denote $X=\left(X_{1},...,X_{n}\right)$ and $X^{\prime}=\left(X_{1}^{\prime},...,X_{n}^{\prime}\right)$ and

[TABLE]

We also write $X^{\backslash i}$ for $X,$ but with the variable $X_{i}$ removed.

Let $f:\prod\Omega_{i}\rightarrow\mathbb{R}$ satisfy $E\left[f\right]=0$ . Then, writing $f\left(X\right)-f\left(X^{\prime}\right)$ as a telescopic series, we get

[TABLE]

where the last identity is obtained by exchanging $X_{k}$ and $X_{k}^{\prime}$ . This gives the nice variance formula

[TABLE]

apparently due to Chatterjee. The Cauchy-Schwarz inequality then gives the Efron-Stein inequality

[TABLE]

Now we look at the bias in this inequality.

Theorem 18

With the above conventions we have

[TABLE]

The proof uses Chatterjee’s formula (8) twice. First we establish a lemma, which itself already uses the Efron-Stein inequality.

Lemma 19

[TABLE]

Together with the Efron-Stein inequality (9) this gives the attractive chain of inequalities

[TABLE]

Proof of Lemma 19. By induction on $n$ . Recall the total variance formula

[TABLE]

With $f\left(X\right)=Z$ this gives the case $n=1$ . For $n=2$ we get

[TABLE]

where we used the Efron-Stein inequality (9). This is where independence comes in and gives us the case $n=2$ . Suppose now that the lemma holds for $n-1$ . Then

[TABLE]

where the first inequality follows from the induction hypothesis, and the second inequality follows from applying the case $n=2$ to the two random variables $\left(X_{1},...,X_{n-1}\right)$ and $X_{n}$ .

Now we tackle the bias in the Efron-Stein inequality. The strategy is to first use Chatterjee’s variance formula on each individual term on the right hand side of (9) and then sum the results.

The only difficulty here is notational because we now need more shadow variables. We deal with this problem by augmenting the vectors $X$ and $X^{\prime}$ to become $n+1$ dimensional.

Proof of Theorem 18. First fix an index $k$ and observe that $f\left(X\right)-f\left(X^{\left(k\right)}\right)$ depends on $n+1$ independent variables. We introduce a variable $X_{n+1}$ , which is independent of $X_{1},...,X_{n}$ and has the same distribution as $X_{k}$ , together with an independent copy $X_{n+1}^{\prime}$ thereof, and consider the correspondingly augmented vectors $X$ and $X^{\prime}$ with $n+1$ independent components. We also introduce functions $g_{k},\psi,\phi:\left(\prod_{i=1}^{n}\mathcal{X}_{i}\right)\times\mathcal{X}_{k}\rightarrow\mathbb{R}$ defined by

[TABLE]

and $g_{k}=\psi-\phi$ . Then $E\left[\left(f\left(X\right)-f\left(X^{\left(k\right)}\right)\right)^{2}\right]=E\left[g_{k}\left(X\right)^{2}\right]$ . Now we use Chatterjee’s formula (8) with $n$ replaced by $n+1$ and $f$ replaced by $g_{k}$ . We obtain

[TABLE]

Since $\psi$ does not depend on $x_{n+1}$ we have

[TABLE]

The last identity follows from the definition of the function $\phi$ . Since $\phi$ does not depend on $x_{k}$ we have

[TABLE]

Substituting these identities in (10), dividing by $4$ and summing over $k$ gives

[TABLE]

In the inequality we bounded the first term with Cauchy-Schwarz. The second term is equal to $\sigma^{2}\left(f\right)/2$ by Chatterjee’s formula (8), and the last term is bounded by $\sigma^{2}\left(f\right)/2$ using Lemma 19.

Proposition 4 is an immediate consequence of Theorem 18.

3 Application to U-statistics

In this section we prove Theorem 6. The proof is simplified by some additional notation. If $B$ is a set and $m\in\mathbb{N}$ , then $\mathcal{S}_{B}^{m}$ denotes the set of all those subsets of $B$ which have cardinality $m$ . Also, if $S\subseteq\left\{1,...,n\right\}$ and $x\in\mathcal{X}^{n}$ , we use $x_{S}$ to denote the vector $\left(x_{j_{1}},...,x_{j_{\left|S\right|}}\right)\in\mathcal{X}^{\left|S\right|}$ , where $\left\{j_{1},...,j_{\left|S\right|}\right\}=S$ and the $j_{k}$ are increasingly ordered. For $y,z\in\mathcal{X}$ we use $\left(y,x_{S}\right)$ and $\left(y,z,x_{S}\right)$ to denote respectively the vectors $\left(y,x_{j_{1}},...,x_{j_{\left|S\right|}}\right)\in\mathcal{X}^{\left|S\right|+1}$ and $\left(y,z,x_{j_{1}},...,x_{j_{\left|S\right|}}\right)\in\mathcal{X}^{\left|S\right|+2}$ . With this notation

[TABLE]

We also need a combinatorial lemma.

Lemma 20

For $n>m$

[TABLE]

Proof. Clearly

[TABLE]

Now

[TABLE]

Then we rewrite the numerator using

[TABLE]

to get

[TABLE]

Proof of Theorem 6. With reference to any given $k\in\left\{1,...,n\right\}$ , and using the symmetry of $g$ ,

[TABLE]

This gives

[TABLE]

because $g$ takes values in an interval of diameter $2$ . This allows us to apply Theorem 3 with $b=2m/n$ .
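The constant $b=2m/n$ reflects that a fraction $m/n$ of the $m$-element subsets contains any given index, and that each kernel value can move by at most 2. The sketch below checks the resulting bound on the effect of replacing a single argument, which in turn implies $u-E_{k}u\leq 2m/n$ . The kernel and the sample are illustrative assumptions.

```python
import random
from itertools import combinations
from math import comb


def u_statistic(x, g, m):
    """Average of the kernel g over all m-element subsets of the sample x."""
    n = len(x)
    return sum(g(*(x[j] for j in idx)) for idx in combinations(range(n), m)) / comb(n, m)


def kernel(a, b):
    """Illustrative symmetric kernel with values in [-1, 1] for a, b in [0, 1]."""
    return 1.0 - 2.0 * abs(a - b)


def max_single_change(x, m, trials, rng):
    """Largest observed |u(x) - u(x with one coordinate replaced)|."""
    base = u_statistic(x, kernel, m)
    worst = 0.0
    for k in range(len(x)):
        for _ in range(trials):
            y = x[:k] + [rng.random()] + x[k + 1:]
            worst = max(worst, abs(u_statistic(y, kernel, m) - base))
    return worst


rng = random.Random(0)
sample = [rng.random() for _ in range(8)]
worst = max_single_change(sample, m=2, trials=20, rng=rng)
print(worst, 2 * 2 / len(sample))  # observed change versus the bound 2m/n
```

The observed change is typically far below $2m/n$ , since the bound is attained only when every affected kernel value moves across its full range.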

Next we bound the interaction functional $J\left(u\right)$ . For $k\neq l$ , and $y,y^{\prime}\in\Omega_{k}$ and $z,z^{\prime}\in\Omega_{l}$ we get

[TABLE]

so that

[TABLE]

Theorem 3 then gives us

[TABLE]

To bound $E\left[\Sigma^{2}\left(u\right)\right]$ we will write $\sigma_{k}^{2}\left(u\right)$ as a sum of two sums, where the first sum is over disjoint pairs $\left(S,S^{\prime}\right)\in\left(\mathcal{S}_{\left\{1,...,n\right\}\backslash k}^{m-1}\right)^{2}$ , and the second sum is over intersecting pairs. If $S$ and $S^{\prime}\in\mathcal{S}_{\left\{1,...,n\right\}\backslash k}^{m-1}$ are disjoint, then, since all the $\mu_{k}$ are equal to $\mu_{0}$ ,

[TABLE]

On the other hand we can use Lemma 20 to bound the number of intersecting pairs and obtain

[TABLE]

Summing over $k$ , dividing by $2$ and inserting in (11) gives us

[TABLE]

Converting to a two sided bound gives the result.

Instead of using Theorem 3 to obtain (11), we could have used Corollary 5 and appealed to known results about $\sigma^{2}\left(u\right)$ (as in [8]).

4 Application to ridge regression

In this section we prove Theorem 7. The key to the application of Theorem 3 is the following Lemma ( $\mathcal{L}^{+}\left(H\right)$ denoting the cone of nonnegative definite operators in $H$ ).

Lemma 21

Let $G:\left(0,1\right)^{2}\rightarrow\mathcal{L}^{+}\left(H\right)$ and $g:\left(0,1\right)^{2}\rightarrow H$ be both twice continuously differentiable, satisfying the conditions $\frac{\partial^{2}}{\partial s\partial t}G=0$ , $\frac{\partial^{2}}{\partial s\partial t}g=0$ , $\left\|\frac{\partial}{\partial t}G\right\|\leq B_{1}$ , $\left\|\frac{\partial}{\partial s}G\right\|\leq B_{1}$ , $\left\|\frac{\partial}{\partial t}g\right\|\leq B_{2}$ and $\left\|\frac{\partial}{\partial s}g\right\|\leq B_{2}$ for real numbers $B_{1}$ and $B_{2}$ . For $\lambda>0$ define a function $w:\left(0,1\right)^{2}\rightarrow H$ by

[TABLE]

Then $w$ is twice differentiable and

[TABLE]

Proof. A standard argument shows that $\left\|\left(G+\lambda\right)^{-1}\right\|\leq\lambda^{-1}$ (we use $\left\|.\right\|$ for the operator norm and for vectors in $H$ , depending on context) and that

[TABLE]

so

[TABLE]

Then

[TABLE]

This gives (13). Also, using the fact that the mixed partials vanish by assumption,

[TABLE]

which gives (14).
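The proof rests on differentiating the resolvent $\left(G+\lambda\right)^{-1}$. As an illustration, the sketch below checks the standard identity $\frac{\partial}{\partial t}\left(G+\lambda\right)^{-1}=-\left(G+\lambda\right)^{-1}\left(\frac{\partial}{\partial t}G\right)\left(G+\lambda\right)^{-1}$ by finite differences on $2\times 2$ matrices. That this is the identity used in the proof is an assumption, and the affine family $G\left(t\right)=G_{0}+tG_{1}$ and all numerical values are hypothetical.

```python
def inv2(M):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul2(A, B):
    """Product of two 2x2 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

G0 = [[2.0, 0.5], [0.5, 1.0]]   # nonnegative definite
G1 = [[0.3, 0.1], [0.1, 0.2]]   # symmetric direction of change, also nonnegative definite

def resolvent(t, lam):
    """(G(t) + lam*I)^{-1} for the affine family G(t) = G0 + t*G1."""
    G = [[G0[i][j] + t * G1[i][j] + (lam if i == j else 0.0) for j in range(2)]
         for i in range(2)]
    return inv2(G)

lam, t, h = 0.5, 0.4, 1e-6
R = resolvent(t, lam)
# Central finite difference of the resolvent in t.
Rp, Rm = resolvent(t + h, lam), resolvent(t - h, lam)
numeric = [[(Rp[i][j] - Rm[i][j]) / (2 * h) for j in range(2)] for i in range(2)]
# Closed form: -(G + lam)^{-1} (dG/dt) (G + lam)^{-1}.
RG1R = matmul2(matmul2(R, G1), R)
analytic = [[-RG1R[i][j] for j in range(2)] for i in range(2)]
max_error = max(abs(numeric[i][j] - analytic[i][j]) for i in range(2) for j in range(2))
```

Combined with $\left\|\left(G+\lambda\right)^{-1}\right\|\leq\lambda^{-1}$, an identity of this form bounds the derivative of the resolvent by $\lambda^{-2}$ times the norm of the derivative of $G$.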

Proof of Theorem 7. It is well known and easily verified that $w_{\mathbf{z}}$ is well defined and explicitly given by the formula

[TABLE]

where the positive semidefinite operator $G_{\mathbf{z}}$ and the vector $g_{\mathbf{z}}=g$ are given by

[TABLE]

Also we have

[TABLE]

from which we retain that $\sum\left(\left\langle w_{\mathbf{z}},x_{i}\right\rangle-y_{i}\right)^{2}\leq n$ and $\left\|w_{\mathbf{z}}\right\|\leq\lambda^{-1/2}$ .
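Both retained bounds can be checked on a small example. The sketch assumes the usual ridge objective $\frac{1}{n}\sum_{i}\left(\left\langle w,x_{i}\right\rangle-y_{i}\right)^{2}+\lambda\left\|w\right\|^{2}$ with $\left\|x_{i}\right\|\leq 1$ and $\left|y_{i}\right|\leq 1$, and the closed form $w=\left(G+\lambda\right)^{-1}g$ with $G=\frac{1}{n}\sum_{i}x_{i}x_{i}^{\top}$ and $g=\frac{1}{n}\sum_{i}y_{i}x_{i}$. These normalizations are assumptions about the setting of Theorem 7, and the data are synthetic.

```python
import math
import random

def ridge_2d(xs, ys, lam):
    """Minimizer of (1/n) sum (<w, x_i> - y_i)^2 + lam * |w|^2 in R^2, via (G + lam) w = g."""
    n = len(xs)
    a = sum(x[0] * x[0] for x in xs) / n + lam
    b = sum(x[0] * x[1] for x in xs) / n
    d = sum(x[1] * x[1] for x in xs) / n + lam
    g0 = sum(y * x[0] for x, y in zip(xs, ys)) / n
    g1 = sum(y * x[1] for x, y in zip(xs, ys)) / n
    det = a * d - b * b
    # Explicit inverse of the symmetric matrix [[a, b], [b, d]] applied to (g0, g1).
    return ((d * g0 - b * g1) / det, (a * g1 - b * g0) / det)

random.seed(2)
n, lam = 50, 0.05
xs = []
while len(xs) < n:
    p = (random.uniform(-1, 1), random.uniform(-1, 1))
    if math.hypot(*p) <= 1.0:                    # keep inputs in the unit ball
        xs.append(p)
ys = [random.uniform(-1, 1) for _ in range(n)]   # labels in [-1, 1]

w = ridge_2d(xs, ys, lam)
residual_sum = sum((w[0] * x[0] + w[1] * x[1] - y) ** 2 for x, y in zip(xs, ys))
w_norm = math.hypot(*w)
```

Under these assumptions both bounds follow by comparing the objective at the minimizer with its value at $w=0$, which is at most $1$.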

Now consider any sample $\mathbf{z}\in\mathcal{Z}^{n}$ and fix two indices $1\leq k,l\leq n$ with $k\neq l$ , and $z_{l}^{\prime}=\left(x_{l}^{\prime},y_{l}^{\prime}\right),z_{k}^{\prime}=\left(x_{k}^{\prime},y_{k}^{\prime}\right),z_{l}^{\prime\prime}=\left(x_{l}^{\prime\prime},y_{l}^{\prime\prime}\right)\in\mathcal{Z}$ and $z_{k}^{\prime\prime}=\left(x_{k}^{\prime\prime},y_{k}^{\prime\prime}\right)\in\mathcal{Z}$ . For $\left(s,t\right)\in\left(0,1\right)^{2}$ we consider the behavior of ridge regression on the doubly modified sample $\mathbf{z}\left(s,t\right):=S_{z_{l}^{\prime}+s\left(z_{l}^{\prime\prime}-z_{l}^{\prime}\right)}^{l}S_{z_{k}^{\prime}+t\left(z_{k}^{\prime\prime}-z_{k}^{\prime}\right)}^{k}\mathbf{z}$ ( $\mathcal{Z}$ is a convex subset of $H\times\mathbb{R}$ ). We write

[TABLE]

Then

[TABLE]

because $\left\|x_{k}^{\prime\prime}-x_{k}^{\prime}\right\|\leq 2$ and $\left\|x_{k}^{\prime}+t\left(x_{k}^{\prime\prime}-x_{k}^{\prime}\right)\right\|\leq 1$. Thus $\left\|\left(\partial/\partial t\right)G\right\|\leq 4/n$ and similarly $\left\|\left(\partial/\partial s\right)G\right\|\leq 4/n$. Since $k\neq l$ it is clear that $\left(\partial^{2}/\left(\partial s\partial t\right)\right)G=0$. Also

[TABLE]

similarly $\left\|\left(\partial/\partial s\right)g\right\|\leq 4/n$ and again $\left(\partial^{2}/\left(\partial s\partial t\right)\right)g=0$. We can then apply Lemma 21 and obtain

[TABLE]

where we used $\left\|w\right\|\leq\lambda^{-1/2}$ .

Now we define

[TABLE]

For the expected error we get

[TABLE]

and

[TABLE]

A similar, somewhat more tedious, analysis shows that there are absolute constants $c_{1}$ and $c_{2}$ such that

[TABLE]

Now let $f\left(\mathbf{x}\right)=\left(n\lambda^{2}/c_{1}\right)\left(R\left(\mathbf{x}\right)-\hat{R}\left(\mathbf{x}\right)\right)$ . Then

[TABLE]

In particular $f-E_{k}f\leq 1$ . Also

[TABLE]

Substitution in the formula gives $J\left(f\right)\leq\left(c_{2}/c_{1}\right)\lambda^{-1}$ . Thus, from Theorem 3,

[TABLE]

5 Appendix: Proofs of the results in section 2.1

Throughout this appendix we adhere to the notation and definitions of section 2.1.

Proof of Lemma 8. Let $A_{f}\left(\beta\right)=\left(1/\beta\right)\ln Z_{\beta f}$ . By l’Hospital’s rule we have $\lim_{\beta\rightarrow 0}A_{f}\left(\beta\right)=E\left[f\right]$ . Furthermore

[TABLE]

Thus

[TABLE]

Combined with Markov’s inequality this gives the second assertion.
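The limit $\lim_{\beta\rightarrow 0}A_{f}\left(\beta\right)=E\left[f\right]$ and the Markov step can be illustrated on a finite probability space. The sketch assumes $Z_{\beta f}=E\left[e^{\beta f}\right]$, as is standard for the partition function; the three-point distribution and the helper names are hypothetical.

```python
import math

values = [0.0, 1.0, 3.0]   # values of f on a three-point space (hypothetical)
probs = [0.5, 0.3, 0.2]    # reference probabilities

def A(beta):
    """A_f(beta) = (1/beta) * ln E[exp(beta * f)]."""
    Z = sum(p * math.exp(beta * v) for p, v in zip(probs, values))
    return math.log(Z) / beta

mean_f = sum(p * v for p, v in zip(probs, values))

# A_f(beta) approaches E[f] as beta decreases to 0.
betas = [10.0 ** (-k) for k in range(1, 7)]
gaps = [A(b) - mean_f for b in betas]

def chernoff_bound(t, beta):
    """Markov's inequality applied to exp(beta * (f - E f)): a bound on Pr{f - E f > t}."""
    return math.exp(beta * (A(beta) - mean_f - t))

def tail(t):
    """Exact value of Pr{f - E f > t} on the three-point space."""
    return sum(p for p, v in zip(probs, values) if v - mean_f > t)
```

For any $\beta>0$ the value `chernoff_bound(t, beta)` dominates `tail(t)`; optimizing over $\beta$ is what turns a bound on $A_{f}$ into a concentration inequality.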

Conditional versions of $E_{\beta f}$ and $S_{f}\left(\beta\right)$ are obtained by replacing the unconditional expectations $E$ by the operator $E_{k}$ . Thus, for $f,g\in\mathcal{A}\left(\Omega\right)$ ,

[TABLE]

Then $E_{k,\beta f}\left[g\right]$ , $\sigma_{k,\beta f}^{2}\left[g\right]$ and $S_{k,f}\left(\beta\right)$ are members of $\mathcal{A}_{k}\left(\Omega\right)$ . Observe that $E_{k,\beta f}=E_{k,\beta f+f_{k}}$ for any $f_{k}\in\mathcal{A}_{k}\left(\Omega\right)$ , a fact which will be frequently used in the sequel.

Lemma 22

Let $h,g>0$ be bounded measurable functions on $\Omega$ . Then for any expectation $E$

[TABLE]

Proof. Define an expectation functional $E_{g}$ by $E_{g}\left[h\right]=E\left[gh\right]/E\left[g\right]$ . The function $\Phi\left(t\right)=t\ln t$ is convex for positive $t$ , since $\Phi^{\prime\prime}=1/t>0$ . Thus, by Jensen’s inequality,

[TABLE]
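Applied to $h/g$ under $E_{g}$ and multiplied by $E\left[g\right]$, this Jensen step yields the classical log-sum inequality $E\left[h\right]\ln\left(E\left[h\right]/E\left[g\right]\right)\leq E\left[h\ln\left(h/g\right)\right]$. The sketch below tests that inequality on random positive functions over a four-point space. Reading the lemma as this inequality is an assumption, and the distribution is hypothetical.

```python
import math
import random

def expectation(values, probs):
    """Expectation of a function on a finite probability space."""
    return sum(p * v for p, v in zip(probs, values))

random.seed(3)
probs = [0.1, 0.2, 0.3, 0.4]
worst_gap = float("inf")
for _ in range(1000):
    h = [random.uniform(0.1, 5.0) for _ in probs]
    g = [random.uniform(0.1, 5.0) for _ in probs]
    Eh, Eg = expectation(h, probs), expectation(g, probs)
    lhs = Eh * math.log(Eh / Eg)
    rhs = expectation([hi * math.log(hi / gi) for hi, gi in zip(h, g)], probs)
    # rhs - lhs should never be negative, up to rounding.
    worst_gap = min(worst_gap, rhs - lhs)
```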

The heart of the entropy method is the following theorem, which asserts the subadditivity of entropy.

Theorem 23

[TABLE]

Proof. Set $\rho=e^{\beta f}/Z_{\beta f}$ and write $\rho=\rho/E\left[\rho\right]$ as a telescopic product to get

[TABLE]

where we applied Lemma 22 to the expectation functional $E_{1}...E_{k-1}$ . From the definition of $\rho$ we then obtain

[TABLE]

We combine this with the following fluctuation representation of entropy.

Proposition 24

We have for $\beta>0$

[TABLE]

Proof. Using $\left(d/d\beta\right)E_{\beta f}\left[f\right]=\sigma_{\beta f}^{2}\left[f\right]$ and the fundamental theorem of calculus we obtain the formulas

[TABLE]

which we subtract to obtain

[TABLE]

The same argument gives the second inequality.
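The identity $\left(d/d\beta\right)E_{\beta f}\left[f\right]=\sigma_{\beta f}^{2}\left[f\right]$ used in the proof can be verified numerically. The sketch assumes the thermal expectation $E_{\beta f}\left[g\right]=E\left[ge^{\beta f}\right]/E\left[e^{\beta f}\right]$ and the thermal variance $\sigma_{\beta f}^{2}\left[g\right]=E_{\beta f}\left[g^{2}\right]-\left(E_{\beta f}\left[g\right]\right)^{2}$ as in section 2.1; the finite distribution is hypothetical.

```python
import math

values = [-1.0, 0.5, 2.0, 2.5]   # values of f on a four-point space (hypothetical)
probs = [0.4, 0.3, 0.2, 0.1]     # reference probabilities

def thermal_expectation(g, beta):
    """E_{beta f}[g] = E[g * exp(beta * f)] / E[exp(beta * f)]."""
    weights = [p * math.exp(beta * v) for p, v in zip(probs, values)]
    Z = sum(weights)
    return sum(w * gv for w, gv in zip(weights, g)) / Z

def thermal_variance(beta):
    """Variance of f under the tilted measure."""
    mean = thermal_expectation(values, beta)
    second = thermal_expectation([v * v for v in values], beta)
    return second - mean * mean

beta, h = 0.7, 1e-5
# Central finite difference of beta -> E_{beta f}[f].
numeric = (thermal_expectation(values, beta + h)
           - thermal_expectation(values, beta - h)) / (2 * h)
analytic = thermal_variance(beta)
```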

Combining Theorem 23 and Proposition 24 we obtain the following, very useful inequality (Theorem 7 in [14])

[TABLE]

which leads to a number of concentration inequalities, when used together with Lemma 8. The celebrated "bounded difference inequality" (see e.g. McDiarmid [11], Theorem 3.7), for example, is an almost immediate consequence. We will also use a simple variational bound on the conditional thermal variance:

[TABLE]

We need two applications of (16). Recall the definition of the real function $\psi\left(t\right):=te^{t}-e^{t}+1$ .
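Since $\psi\left(0\right)=0$ and $\psi^{\prime}\left(t\right)=te^{t}$, the function satisfies $\psi\left(t\right)=\int_{0}^{t}se^{s}ds$ and is nonnegative on the whole real line. The following sketch checks both facts numerically with a midpoint rule; the integration routine and the test points are illustrative choices.

```python
import math

def psi(t):
    """psi(t) = t*exp(t) - exp(t) + 1."""
    return t * math.exp(t) - math.exp(t) + 1.0

def integral_s_exp_s(t, steps=20000):
    """Midpoint rule for the integral of s*exp(s) from 0 to t (t may be negative)."""
    width = t / steps
    return sum((i + 0.5) * width * math.exp((i + 0.5) * width)
               for i in range(steps)) * width

# Triples (t, closed form, numerical integral) at a few test points.
checks = [(t, psi(t), integral_s_exp_s(t)) for t in (-2.0, -0.5, 0.5, 1.0, 2.0)]
```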

Proof of Lemma 9. For any $k\in\left\{1,...,n\right\},\beta>0$ , letting $f_{k}=E_{k}f$ in (17),

[TABLE]

Thus with (16)

[TABLE]

Recall the definition of the operator $D:\mathcal{A}\left(\Omega\right)\rightarrow\mathcal{A}\left(\Omega\right)$ by

[TABLE]

Proof of Lemma 11. We abbreviate $\inf_{y\in\Omega_{k}}S_{y}^{k}f$ to $\inf_{k}f$ . Replacing $f_{k}$ by $\inf_{k}f$ in (17) we get

[TABLE]

We now claim that the right hand side above is a non-decreasing function of $\beta$. To see this write $h=f-\inf_{k}f$ and define a real function $\xi$ by $\xi\left(t\right)=\left(\max\left\{t,0\right\}\right)^{2}$. By a straightforward computation we obtain

[TABLE]

where the last inequality uses the well known fact that, for $h\geq 0$, any expectation $E$ and any nondecreasing function $\xi$, we have $E\left[\xi\left(h\right)h\right]\geq E\left[\xi\left(h\right)\right]E\left[h\right]$. This establishes the claim.
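The correlation inequality invoked here (often attributed to Chebyshev) says that a nondecreasing function of $h$ is nonnegatively correlated with $h$. The sketch below tests it for $\xi\left(t\right)=\left(\max\left\{t,0\right\}\right)^{2}$ on random five-point distributions; the distributions and ranges are hypothetical.

```python
import random

def xi(t):
    """The nondecreasing function xi(t) = max(t, 0)^2."""
    return max(t, 0.0) ** 2

def expectation(values, probs):
    """Expectation of a function on a finite probability space."""
    return sum(p * v for p, v in zip(probs, values))

random.seed(4)
smallest_gap = float("inf")
for _ in range(1000):
    raw = [random.random() for _ in range(5)]
    total = sum(raw)
    probs = [r / total for r in raw]                  # random probability vector
    h = [random.uniform(0.0, 4.0) for _ in range(5)]  # nonnegative function h
    lhs = expectation([xi(v) * v for v in h], probs)
    rhs = expectation([xi(v) for v in h], probs) * expectation(h, probs)
    # lhs - rhs is the covariance of xi(h) and h, which should be nonnegative.
    smallest_gap = min(smallest_gap, lhs - rhs)
```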

Using (16) it follows that

[TABLE]

where we used the identity $E_{\beta f}E_{k,\beta f}=E_{\beta f}$ .

Bibliography

[1] Arcones, M. A. (1995). A Bernstein-type inequality for U-statistics and U-processes. Statistics & Probability Letters, 22(3), 239-247.
[2] S. Bernstein, Theory of Probability, Moscow, 1927.
[3] S. Boucheron, G. Lugosi, P. Massart, Concentration inequalities using the entropy method, Annals of Probability 31, Nr 3, 2003.
[4] S. Boucheron, G. Lugosi, P. Massart, On concentration of self-bounding functions, Electronic Journal of Probability Vol. 14 (2009), Paper no. 64, 1884–1899.
[5] S. Boucheron, G. Lugosi, P. Massart, Concentration Inequalities, Oxford University Press (2013).
[6] Bousquet, O., & Elisseeff, A. (2002). Stability and generalization. Journal of Machine Learning Research, 2(Mar), 499-526.
[7] Efron, B., & Stein, C. (1981). The jackknife estimate of variance. The Annals of Statistics, 586-596.
[8] Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. The Annals of Mathematical Statistics, 293-325.