Approximating high-dimensional infinite-order $U$-statistics: statistical and computational guarantees
Yanglei Song, Xiaohui Chen, Kengo Kato

TL;DR
This paper develops statistical and computational methods for approximating high-dimensional infinite-order U-statistics, enabling uncertainty quantification in ensemble methods like random forests with guarantees on accuracy and efficiency.
Contribution
It introduces non-asymptotic Gaussian approximation bounds and data-driven bootstrap methods for incomplete IOUS, addressing computational challenges in high dimensions.
Findings
Derived non-asymptotic Gaussian approximation error bounds.
Established statistical guarantees for bootstrap inference.
Provided computational efficiency results for incomplete IOUS.
Abstract
We study the problem of distributional approximations to high-dimensional non-degenerate $U$-statistics with random kernels of diverging orders. Infinite-order $U$-statistics (IOUS) are a useful tool for constructing simultaneous prediction intervals that quantify the uncertainty of ensemble methods such as subbagging and random forests. A major obstacle in using the IOUS is their computational intractability when the sample size and/or order are large. In this article, we derive non-asymptotic Gaussian approximation error bounds for an incomplete version of the IOUS with a random kernel. We also study data-driven inferential methods for the incomplete IOUS via bootstraps and develop their statistical and computational guarantees.
Yanglei Song
Xiaohui Chen
Kengo Kato
Department of Mathematics and Statistics, Queen’s University, 48 University Ave, Kingston, ON, Canada, K7L 3N6
Department of Statistics, University of Illinois at Urbana-Champaign, 725 S. Wright Street, Champaign, IL 61820
Department of Statistics and Data Science, Cornell University, 1194 Comstock Hall, Ithaca, NY 14853
University of Illinois at Urbana-Champaign and Cornell University
Keywords: Infinite-order $U$-statistics, incomplete $U$-statistics, Gaussian approximation, bootstrap, random forests, uncertainty quantification.
arXiv:1901.01163
1 Introduction
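Let $X_1, \dots, X_n$ be independent and identically distributed (i.i.d.) random variables taking values in a measurable space $(S, \mathcal{S})$ with common distribution $P$, and let $h: S^r \to \mathbb{R}^d$ be a symmetric and measurable function with respect to the product space $S^r$ equipped with the product $\sigma$-field $\mathcal{S}^r = \mathcal{S} \otimes \cdots \otimes \mathcal{S}$ ($r$ times). Assume $E[|h_j(X_1, \dots, X_r)|] < \infty$ for $1 \le j \le d$, and consider statistical inference on the mean vector $\theta = (\theta_1, \dots, \theta_d)^T = E[h(X_1, \dots, X_r)]$. A natural estimator for $\theta$ is the $U$-statistic with kernel $h$:
$$U_n := \frac{1}{|I_{n,r}|} \sum_{\iota \in I_{n,r}} h(X_{\iota_1}, \dots, X_{\iota_r}), \tag{1}$$
where $I_{n,r} := \{\iota = (\iota_1, \dots, \iota_r) : 1 \le \iota_1 < \cdots < \iota_r \le n\}$ is the set of all ordered $r$-tuples of $1, \dots, n$ and $|\cdot|$ denotes the set cardinality. The positive integer $r$ is called the order or degree of the kernel $h$ or the $U$-statistic $U_n$. We refer to [21] as an excellent monograph on $U$-statistics.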
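As a minimal sketch of this definition, the following Python snippet averages a kernel over all size-$r$ subsets by brute force. The names `complete_u_statistic` and `linear_kernel` are illustrative and do not come from the paper, and the enumeration is feasible only for very small sample sizes and orders.

```python
from itertools import combinations
from math import comb

def complete_u_statistic(data, r, kernel):
    """Average kernel(subsample) over all size-r subsets of data.

    kernel takes a tuple of r observations and returns a list of d floats.
    """
    n = len(data)
    if not 1 <= r <= n:
        raise ValueError("the order r must satisfy 1 <= r <= n")
    total = None
    for subset in combinations(data, r):
        value = kernel(subset)
        if total is None:
            total = [0.0] * len(value)
        total = [t + v for t, v in zip(total, value)]
    count = comb(n, r)  # number of subsets that were enumerated
    return [t / count for t in total]

def linear_kernel(subset):
    # d = 1: the sub-sample mean; its U-statistic is the full-sample mean.
    return [sum(subset) / len(subset)]

data = [1.0, 2.0, 4.0, 7.0, 11.0]
u_n = complete_u_statistic(data, r=3, kernel=linear_kernel)
```

With the linear kernel every observation appears in the same number of subsets, so the result coincides with the sample mean, which gives a quick sanity check of the enumeration.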
In the present paper, we are interested in the situation where the order $r$ may be nonnegligible relative to the sample size $n$, i.e., $r = r_n \to \infty$ as $n \to \infty$. $U$-statistics with divergent orders are called infinite-order $U$-statistics (IOUS) [14]. IOUS have attracted renewed interest in the recent statistics and machine learning literature in relation to uncertainty quantification for Breiman’s bagging [3] and random forests [4]. In such applications, the tree-based prediction rules can be thought of as $U$-statistics with deterministic and random kernels, respectively, and their order corresponds to the sub-sample size of the training data [23]. Statistically, the sub-sample size $r$ used to build each tree needs to increase with the total sample size $n$ to produce reliable predictions. As a leading example, we consider the construction of simultaneous prediction intervals for a version of random forests discussed in [23].
Example 1.1 (Simultaneous prediction intervals for random forests).
Consider a training dataset of size , , where is a vector of features and is a response. Let be a deterministic prediction rule that takes as input a sub-sample and outputs predictions on testing points in the feature space . Then in (1) are the overall predictions by averaging over all possible sub-samples of size .
For random forests [4, 23], the tree-based prediction rule may be constructed with additional randomness: in building a tree or multiple trees based on a sub-sample, the split at each node may only occur on a randomly selected subset of features. Thus, let be a collection of i.i.d. random variables taking values in a measurable space that are independent of the data , and that determine the potential splits for each sub-sample. Here, each captures the random mechanism in building a prediction function based on , but are assumed to be independent for different sub-samples. Further, let be an -measurable function that represents the random forest algorithm, such that . Then predictions of random forests are given by a $d$-dimensional $U$-statistic with random kernel :
[TABLE]
where the random kernel varies with .
Compared to $U$-statistics with fixed orders (i.e., $r$ being fixed), the analysis of IOUS brings nontrivial computational and statistical challenges due to increasing orders. First, even for a moderately large value of $r$, exact computation of all possible trees is intractable. For diverging $r$, it is not possible to compute in time polynomial in $n$. Second, the variance of the Hájek projection (i.e., the first-order term in the Hoeffding decomposition [19]) of tends to zero as $r \to \infty$. To wit, define a function by , and
[TABLE]
Then the Hájek projection of is given by . By the orthogonality of the projections, we have
[TABLE]
Thus the variances of the kernel and its associated Hájek projection have different magnitudes. In particular, if the variance of is bounded by a constant (which is often assumed for random forests, cf. [23]), then , which vanishes as $r$ diverges. Thus standard Gaussian approximation results in the literature are no longer applicable in our setting, since they require the componentwise variances to be bounded away from zero to avoid degeneracy, i.e., there is an absolute constant such that (cf. [11, 10, 6]).
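As a worked illustration of this gap in magnitudes, consider the scalar linear kernel. The symbols $\mu$ and $\sigma^2$ (the mean and variance of $X_1$) and $g$ (the conditional mean of the kernel given its first argument) are notation adopted for this example only.

$$
h(x_1, \dots, x_r) = \frac{1}{r} \sum_{i=1}^{r} x_i, \qquad g(x) := E[h(x, X_2, \dots, X_r)] = \frac{x}{r} + \frac{r-1}{r}\,\mu,
$$

$$
\mathrm{Var}\big(h(X_1, \dots, X_r)\big) = \frac{\sigma^2}{r}, \qquad \mathrm{Var}\big(g(X_1)\big) = \frac{\sigma^2}{r^2} = \frac{1}{r}\,\mathrm{Var}\big(h(X_1, \dots, X_r)\big).
$$

In this case the projection variance is smaller than the kernel variance by exactly a factor of $r$, so it vanishes as $r$ diverges even though the kernel variance stays bounded.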
In this work, our focus is to derive computationally tractable and statistically valid sub-sampling procedures for making inference on with a class of high-dimensional random kernels (i.e., large $d$) of diverging orders (i.e., increasing $r$). To break the computational bottleneck, we consider the incomplete version of by sampling (possibly much) fewer terms than . In particular, we consider the Bernoulli sampling scheme introduced in [8]. Given a positive integer , which represents our computational budget, define the sparsity design parameter , and let be a collection of i.i.d. Bernoulli random variables with success probability , that are independent of the data and . Consider the following incomplete $U$-statistic (on the data ) with random kernel and weights:
[TABLE]
Obviously, is an unbiased estimator of and it only involves computing terms, which on average is much smaller than if .
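The Bernoulli sampling scheme can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the selection probability is taken to be the budget divided by the number of size-$r$ subsets, the kept kernel values are averaged over the realized number of kept subsets, and the names `incomplete_u_statistic`, `random_kernel`, and `mean_kernel` are hypothetical. The sketch still enumerates all subsets to draw the indicators, so it only illustrates the estimator for small samples.

```python
import random
from itertools import combinations
from math import comb

def incomplete_u_statistic(data, r, random_kernel, budget, rng):
    """Bernoulli-sampled incomplete U-statistic for small n (enumerative sketch).

    Each size-r subset is kept independently with probability
    p = budget / C(n, r). For every kept subset an independent draw w
    is passed to the kernel to mimic a random kernel. The kept kernel
    values are averaged over the realized number of kept subsets
    (an assumed normalization).
    """
    n = len(data)
    p = min(1.0, budget / comb(n, r))
    total, kept = None, 0
    for subset in combinations(data, r):
        if rng.random() < p:              # Bernoulli(p) selection indicator
            w = rng.random()              # extra randomness, independent of the data
            value = random_kernel(subset, w)
            if total is None:
                total = [0.0] * len(value)
            total = [t + v for t, v in zip(total, value)]
            kept += 1
    if kept == 0:
        raise ValueError("no subset was selected; increase the budget")
    return [t / kept for t in total], kept

def mean_kernel(subset, w):
    # Deterministic special case: ignores w and returns the sub-sample mean.
    return [sum(subset) / len(subset)]

data = [1.0, 2.0, 4.0, 7.0, 11.0]
full_budget = comb(len(data), 3)          # p = 1 keeps every subset
estimate, kept = incomplete_u_statistic(data, 3, mean_kernel, full_budget, random.Random(0))
```

When the budget equals the total number of subsets, every subset is kept and the estimate reduces to the complete statistic. For large samples one would avoid the enumeration, for example by drawing the number of kept subsets first and then sampling that many subsets.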
When the kernel is both deterministic and of fixed order, finite sample bounds for the Gaussian and bootstrap approximations of (after a suitable normalization) are established in [8]. Roughly speaking, error bound analysis in [8] has two major steps: establish the Gaussian approximation to the Hájek projection, and bound the maximum norm of all higher-order degenerate terms. As discussed above, the first-order Hájek projection in the Hoeffding decomposition is asymptotically vanishing for the IOUS, and we must control the moments of an increasing number of degenerate terms, which makes the analysis of the incomplete IOUS with random kernels substantially more subtle.
In Section 2, we derive non-asymptotic Gaussian approximation error bounds for approximating the distribution of the incomplete IOUS with random kernels subject to sub-exponential moment conditions. Specifically, our rates of convergence for the Gaussian approximation of have the explicit dependence on all parameters , where is an upper bound for the norms of the random kernels (for precise statements, see conditions (C3), (C4), and (C3’) ahead). In particular, asymptotic validity of the Gaussian approximation can be achieved if . The order of will be application specific. As we shall verify in Section 4, under certain regularity conditions,
[TABLE]
It is worth noting that (4) is sharp in the sense that for the linear kernel , we have if . If further , and (i.e., the computational complexity is at least linear in sample size), then the order of is allowed to increase at the rate of for any . On the other hand, the dimension may grow exponentially fast in sample size (i.e., for some constant ) to maintain the asymptotic validity of Gaussian approximations while is still allowed to increase at a polynomial rate in .
The proof of our Gaussian approximation results for IOUS builds upon a number of recently developed technical tools such as Gaussian approximation results for sums of independent random vectors and $U$-statistics of fixed orders [11, 10, 6, 7], an anti-concentration inequality for Gaussian maxima [9], and an iterative conditioning argument for high-dimensional incomplete $U$-statistics (with fixed kernel and order) [8]. However, there are three technical innovations in our proof to accommodate the issues of diverging orders and randomness of the kernel. First, we use the iterative renormalization for each dimension of and also by its variance. This simple trick turns out to be the crux to avoid the lower bound assumption for Gaussian approximation in the literature [10, 8]. Second, we derive an order-explicit maximal inequality for the expected supremum of the remainder of the Hájek projection of the IOUS (cf. Section 5). This maximal inequality is new in the literature, and our main tools include a symmetrization inequality of [27] and the Bonami inequality [13, Theorem 3.2.2] for the Rademacher chaos, both with explicit dependence on . Third, we develop new tail probability inequalities for $U$-statistics with random kernels by leveraging the independence between and the data .
In Section 3, we derive computationally tractable and fully data-driven inferential methods of based on the incomplete IOUS when the sample size , the dimension , and the order , are all large. We consider a multiplier bootstrap procedure consisting of two partial bootstraps that are conditionally independent given and : one estimates the covariance matrix of the randomized kernel, and the other estimates the Hájek projection. The latter is usually computationally demanding, and we develop a divide and conquer algorithm to maintain the overall computational cost of our multiplier bootstrap procedure at most , where denotes the number of bootstrap iterations. Thus the computational cost of the bootstrap to approximate the sampling distribution for incomplete IOUS can be made independent of the order , even though diverges.
In Section 4, we discuss the key non-degeneracy condition (4) for deriving the validity of Gaussian and bootstrap approximations. We provide a general embedding scheme where a Cramér-Rao type lower bound can be established for the minimum of the projection variances. Specifically, the lower bound for only involves the sensitivity of under perturbation and the Fisher information of the embedded family, which in some cases remain constants as diverges. In non-parametric regressions, there is a natural embedding of the response variable into a location family such that the sensitivity and Fisher information can be explicitly computed.
1.1 Connections to the literature
For univariate $U$-statistics ($d = 1$), the asymptotic distributions are derived in the seminal paper [19] for the non-degenerate case. [14] introduced the notion of “infinite-order $U$-statistics” (IOUS) with diverging orders and established the central limit theorem for when . For univariate IOUS, asymptotic normality can be found in [2, Chapter 4.6], and Berry-Esseen type bounds were established by [16, 30, 31]. Further, [23] applied IOUS to construct a prediction interval for one test point. However, [23] does not address the issue that the variance of the Hájek projection is vanishing: the two conditions in Theorem 1 therein, and , are not compatible in view of our previous discussion. In practice, the size of a test set may be comparable to or even much larger than the size of a training set, and the current work is motivated by this consideration. Limit theorems for the related infinite-order $V$-statistics and the infinite-order $U$-processes were studied in [28, 18]. High-dimensional Gaussian approximation results and bootstrap methods were established in [11, 10] for sums of independent random vectors, and in [6, 8] for $U$-statistics. We refer readers to these references for an extensive literature review.
Incomplete $U$-statistics, first introduced in [1], can be viewed as a special case of weighted $U$-statistics. There is a large literature on limit theorems for weighted $U$-statistics; see [26, 24, 22, 25]. The asymptotic distributions of incomplete $U$-statistics (for fixed ) were derived in [5] and [20]; see also Section 4.3 in [21] for a review of incomplete $U$-statistics. Recently, incomplete $U$-statistics have gained renewed interest in the statistics and machine learning literature [12, 23]. To the best of our knowledge, the current paper is the first work that establishes distributional approximation theorems for incomplete IOUS with random kernels and increasing orders in high dimensions.
The remainder of the paper is organized as follows. We develop Gaussian approximation results for the above $U$-statistics in Section 2, and bootstrap methods for the variance of the approximating Gaussian distribution in Section 3. We apply the theoretical results to several examples in Section 4. We highlight a maximal inequality in Section 5, and present all other proofs in Appendix A.
1.2 Notation
We write if there exists a finite and positive absolute constant such that . We shall use to denote finite and positive absolute constants, whose value may differ from place to place. We denote by for .
For , let denote the largest integer that does not exceed , and . For , we write if for , and write for the hyperrectangle if . We denote by the collection of hyperrectangles in . Further, for , , is a vector in with component being . For a matrix , denote . For a diagonal matrix with positive diagonal entries, (resp. ) is the diagonal matrix, with -th diagonal entry being (resp. ).
For , let be a function defined by , and for any real-valued random variable , define . Further, we define a family of functions on indexed by . For , define . For , define , , and .
For a generic random variable , let and denote the conditional probability and expectation given , respectively. Further, we write “a.s.” for “almost surely” and “w.r.t.” for “with respect to”. Throughout the paper, we assume that , , , .
2 Gaussian approximations for IOUS
In this section, we shall derive non-asymptotic Gaussian approximation error bounds for: (i) the IOUS with random kernel in (2), which includes the IOUS with deterministic kernel in (1) as a special case, and (ii) the incomplete IOUS in (3) under the Bernoulli sampling scheme.
Recall that , , , and . Further, define
[TABLE]
Clearly, for , and thus . Define two diagonal matrices and such that
[TABLE]
Let and be two independent -dimensional zero mean Gaussian random vectors with variance and respectively. We may take and to be independent of any other random variables. Further, for any two zero mean -dimensional random vectors and ,
[TABLE]
where we recall that is the collection of hyperrectangles in .
Finally, in view of the discussions in the Introduction (Section 1) and to simplify presentation, we assume . Otherwise, the conclusions in this paper hold with replaced by .
2.1 IOUS with random kernel
We start with . Define for , , and ,
[TABLE]
We make the following assumptions: there exist and an absolute constant such that
[TABLE]
Clearly, if a.s. for , then the latter three conditions hold. Indeed, (C3) and (C4) follow immediately from the definition, and (C2) is due to the observation that .
Theorem 2.1.
Assume (C1-ND), (C2), (C3) and (C4) hold. Then
[TABLE]
where , and means up to a multiplicative constant that only depends on .
Proof.
See Section A.3. We highlight that a key step in establishing Theorem 2.1 is to control the expected supremum of the remainder of the Hájek projection of the complete IOUS with deterministic kernel (see Theorem 5.1). The Gaussian approximation result for IOUS then follows from Gaussian approximation results for sums of independent random vectors [10] and the anti-concentration inequality [9], by an argument similar to that in [8] with proper normalization.
Clearly, in the special case of non-random kernel, i.e., , (C4) trivially holds. Thus we have the following immediate result for the IOUS with deterministic kernel in (1).
Corollary 2.2.
Assume (C1-ND), (C2) and (C3) hold. Then
[TABLE]
where , and means up to a multiplicative constant that only depends on .
Remark 2.3 (Comparisons with existing results for ).
For the univariate IOUS with non-random kernels, asymptotic normality and its rate of convergence are well understood in the literature; see [2] for a survey of results in this direction. In [31], a Berry-Esseen bound is derived for symmetric statistics, which include IOUS (with non-random kernels) as a special case. In particular, applying Corollary 4.1 in [31] to IOUS, the rate of convergence to normality is of order for a bounded kernel, which implies that asymptotic normality requires (at least) . A related Berry-Esseen bound is given in [16]. In both papers, the rates of convergence are suboptimal. For elementary symmetric polynomials (which are $U$-statistics corresponding to the product kernel ), it is shown in [30] that the sharp rate of convergence to normality is of order , provided that , and . This result implies that asymptotic normality for the IOUS with the product kernel is achieved when . If , which holds under regularity conditions in Lemma 4.1, our Corollary 2.2 with implies that the rate of convergence for high-dimensional IOUS is (with suitably bounded moments). In particular, Gaussian approximation is asymptotically valid if and for any . Even though our result is valid for a smaller range of and the rate is slower than the optimal rate in the case , Corollary 2.2 does allow the dimension to grow sub-exponentially fast in sample size, which is a useful feature for high-dimensional statistical inference. In addition, to the best of our knowledge, the validity of the bootstrap procedures proposed in Section 3 to approximate the sampling distribution of IOUS (on hyperrectangles in ) is new in the literature.
2.2 Incomplete IOUS with random kernel
Now we consider , where we recall that is some given computational budget. We will assume the following conditions: for ,
[TABLE]
Clearly, (C4) and (C3’) imply (C3) up to a multiplicative constant. Further, (C3’) and (C5) hold if a.s. for .
Theorem 2.4.
Assume (C1-ND), (C2), (C4), (C3’) and (C5) hold. Then
[TABLE]
where , , , means up to a multiplicative constant that only depends on , and we recall that , and are independent.
Proof.
See Section A.4.4.
Remark 2.5.
If , then and . Since for any random variable and , we may assume without loss of generality that in the proof. When is fixed, , the kernel is deterministic, and there exists some absolute constant such that , the above theorem recovers Theorem 3.1 of [8].
Further, by first conditioning on , we have
[TABLE]
where for two square matrices, means is positive semi-definite. Thus the random kernel increases the variance of the approximating Gaussian distribution compared to the associated deterministic kernel .
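The variance inflation caused by a random kernel can be checked numerically in a toy scalar case. The sketch below assumes a hypothetical random kernel equal to a deterministic sub-sample mean plus independent standard Gaussian noise, so that its conditional mean given the data is the deterministic kernel. It is only an illustration of the law of total variance, not a random forest.

```python
import random
from statistics import variance

rng = random.Random(0)
r = 5
h_values, H_values = [], []
for _ in range(5000):
    x = [rng.gauss(0.0, 1.0) for _ in range(r)]
    w = rng.gauss(0.0, 1.0)        # kernel randomness, independent of x
    h = sum(x) / r                 # deterministic kernel: E[H | x]
    H = h + w                      # toy random kernel with conditional mean h
    h_values.append(h)
    H_values.append(H)

var_h = variance(h_values)         # population value is 1/r = 0.2
var_H = variance(H_values)         # population value is 1/r + 1 = 1.2
```

The excess of `var_H` over `var_h` estimates the expected conditional variance of the random kernel given the data, which is the scalar analogue of the positive semi-definite difference between the two covariance matrices.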
3 Bootstrap approximations
In Section 2.2, we have seen that the incomplete $U$-statistic with random kernel is approximated by a Gaussian distribution . However, the covariance term is typically unknown in practice. In this section, we will estimate and by bootstrap methods.
3.1 Bootstrap for
Let be the data involved in the definition of , and take a collection of independent random variables that is independent of the data . Define the following bootstrap distribution:
[TABLE]
The next theorem establishes the validity of .
Theorem 3.1.
Assume the conditions (C1-ND), (C2), (C4), (C3’) and (C5) hold. If
[TABLE]
for , , some constants and , then there exists a constant depending only on , and such that with probability at least ,
[TABLE]
Proof.
See Section A.5.1.
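A Gaussian multiplier bootstrap of this type can be sketched as follows. The exact bootstrap display is not reproduced here, so the form below is an assumption: the kernel values of the sampled subsets are centered at their average, weighted by independent standard normal multipliers, summed, and divided by the square root of the number of sampled subsets. The name `multiplier_bootstrap_draw` is hypothetical.

```python
import random
from math import sqrt

def multiplier_bootstrap_draw(values, rng):
    """One bootstrap draw from centered kernel values.

    values is a list of m kernel evaluations, each a list of d floats.
    Conditionally on values, the draw is a centered Gaussian vector whose
    covariance is the empirical covariance (divisor m) of the kernel values.
    """
    m, d = len(values), len(values[0])
    mean = [sum(v[j] for v in values) / m for j in range(d)]
    draw = [0.0] * d
    for v in values:
        xi = rng.gauss(0.0, 1.0)             # multiplier, independent of the data
        for j in range(d):
            draw[j] += xi * (v[j] - mean[j])
    return [x / sqrt(m) for x in draw]

rng = random.Random(1)
constant_values = [[1.0, 2.0]] * 4           # no variability across subsets
degenerate_draw = multiplier_bootstrap_draw(constant_values, rng)
varied_draw = multiplier_bootstrap_draw([[0.0, 1.0], [2.0, -1.0], [4.0, 3.0]], rng)
```

Repeating the draw many times with fresh multipliers yields a Monte Carlo approximation of the conditional Gaussian law, at a cost proportional to the number of sampled subsets times the dimension per draw.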
3.2 Bootstrap for the approximating Gaussian distribution
Let , and . Further, consider a collection of -measurable -valued random vectors , where is some “good” estimator of , and its form is specified later. We use the following quantity to measure the quality of as an estimator of
[TABLE]
Define and consider the following bootstrap distribution for :
[TABLE]
where is a collection of independent random variables that is independent of and .
Lemma 3.2.
Assume the conditions (C1-ND), (C2) and (C3’) hold. If
[TABLE]
for , some constants , and , then there exists a constant depending only on , and such that with probability at least ,
[TABLE]
where we recall that .
Proof.
See Subsection A.5.2.
Hereafter we consider a special case of the divide and conquer bootstrap algorithm in [8] to estimate . For each , partition the remaining indexes, , into disjoint subsets , each of size , where .
Now define for each and ,
[TABLE]
Finally, define
[TABLE]
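The divide and conquer idea can be sketched in code under explicit assumptions, since the displays above are not reproduced here: for a fixed index, the remaining indexes are split into consecutive disjoint blocks of size $r - 1$, the kernel is evaluated on the fixed observation together with each block, and the evaluations are averaged. The block size, the consecutive partition, the deterministic kernel, and the name `divide_and_conquer_g` are all illustrative choices.

```python
def divide_and_conquer_g(data, k, r, kernel):
    """Estimate the projection at data[k] by averaging over disjoint blocks.

    The indexes other than k are cut into K = (n - 1) // (r - 1) consecutive
    blocks of size r - 1; leftover indexes are ignored. Requires r >= 2.
    """
    n = len(data)
    block = r - 1
    others = [i for i in range(n) if i != k]
    K = len(others) // block
    if block < 1 or K < 1:
        raise ValueError("need r >= 2 and at least r observations")
    total = None
    for j in range(K):
        idx = others[j * block:(j + 1) * block]
        value = kernel((data[k],) + tuple(data[i] for i in idx))
        if total is None:
            total = [0.0] * len(value)
        total = [t + v for t, v in zip(total, value)]
    return [t / K for t in total]

def linear_kernel(subset):
    # Sub-sample mean, used only to make the example checkable by hand.
    return [sum(subset) / len(subset)]

data = [1.0, 2.0, 4.0, 7.0, 11.0]
g_hat_0 = divide_and_conquer_g(data, k=0, r=3, kernel=linear_kernel)
```

For the data above the two blocks are the index pairs (1, 2) and (3, 4), so the estimate is the average of 7/3 and 19/3. Each index requires roughly $n/r$ kernel evaluations on $r$ points, so the total work per index does not grow with the order.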
Theorem 3.3.
Assume the conditions (C1-ND), (C2), (C4), (C3’) and (C5) hold. If
[TABLE]
for , , some constants , , then for any , there exists a constant depending only on , , and such that with probability at least ,
[TABLE]
Proof.
See Subsection A.5.3.
3.3 Simultaneous confidence intervals
We first combine the Gaussian approximation result with the bootstrap result.
Corollary 3.4.
Assume (C1-ND), (C2), (C4), (C3’) and (C5) hold. Further, assume that for some constants , , (12) holds. Then there exists a constant depending only on , and such that with probability at least ,
[TABLE]
Proof.
It follows from Theorem 2.4 and Theorem 3.3 (with ).
In simultaneous confidence interval construction, it is sometimes desirable to normalize the variance of each dimension, so that if we use maximum-type statistics, the critical value is not dominated by terms with large variance. Define for ,
[TABLE]
which are the diagonal elements in the conditional covariance matrices of (10) and (7) respectively. Further, define a diagonal matrix with
[TABLE]
Corollary 3.5.
Assume the conditions in Corollary 3.4. Then there exists a constant depending only on , and such that with probability at least ,
[TABLE]
Consequently,
[TABLE]
Proof.
See Subsection A.5.4.
Remark 3.6.
From Corollary 3.5, we can immediately construct confidence intervals for in a data-dependent way. Specifically, let be a quantile of the conditional distribution of given . Then one way to construct simultaneous confidence intervals with confidence level is as follows: for , .
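The construction in this remark can be sketched generically. The code assumes that the bootstrap draws are already standardized coordinate by coordinate, and that `scales` holds the per-coordinate factors (estimated standard deviation combined with the sample-size normalization) that convert a critical value into a half-width. These inputs and the function names are hypothetical.

```python
from math import ceil

def empirical_quantile(samples, level):
    # Smallest order statistic whose empirical CDF value is at least level.
    ordered = sorted(samples)
    index = max(0, ceil(level * len(ordered)) - 1)
    return ordered[index]

def simultaneous_intervals(estimate, bootstrap_draws, scales, alpha):
    """Intervals estimate[j] +/- q * scales[j] with a common critical value q.

    q is the empirical (1 - alpha) quantile of the maximum absolute
    coordinate of each bootstrap draw.
    """
    max_stats = [max(abs(v) for v in draw) for draw in bootstrap_draws]
    q = empirical_quantile(max_stats, 1.0 - alpha)
    return [(e - q * s, e + q * s) for e, s in zip(estimate, scales)]

draws = [[0.5, -1.0], [2.0, 0.1], [-0.3, 0.2], [1.5, -1.5]]
intervals = simultaneous_intervals([0.0, 10.0], draws, [1.0, 2.0], alpha=0.25)
```

Using one critical value for the maximum over all coordinates is what makes the coverage simultaneous, and the per-coordinate scales keep coordinates with large variance from dominating the critical value.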
4 Applications
In many applications, does not admit an explicit form, and thus it is usually hard to compute in conditions (C1-ND) and (12) directly. When the kernel has special structures, we can establish a lower bound on with explicit dependence on , which can be applied to Example 1.1. We shall give additional examples in Sections 4.3 and 4.4 to illustrate the usefulness of $U$-statistics as a tool to estimate and make inference on certain statistical functionals of . In Section 4.3 for the expected maximum and log-mean functionals, we also establish a lower bound on with explicit dependence on . In Section 4.4 for the kernel density estimation problem, is assumed to be fixed, but we allow the diameter of the design points to diverge.
For simplicity of the presentation, in this section, we assume that all involved derivatives and integrals exist and are finite, and that the order of integrals and the order of integral and differentiation can be exchanged. These assumptions can be justified under standard smoothness and moment conditions. For illustration, we use in (C4) and (C3’).
4.1 Lower bound for
Suppose that the distribution of has a density function with respect to some -finite (reference) measure , i.e.,
[TABLE]
We first embed into a family of densities , where is an open neighborhood of . Such embeddings always exist and below are some examples for .
1. Location and scale family. If is the Lebesgue measure on , we may consider the following location or scaling families: for ,
[TABLE]
2. Exponential family. If for , then we may consider the exponential family:
[TABLE]
3. Additive noise model. Let be a -dimensional random vector independent of , whose distribution is absolutely continuous w.r.t. ; then has a density given by the convolution of those of and .
For , define the following perturbed expectation
[TABLE]
where denotes the expectation when have density . Further, define
[TABLE]
where denotes the gradient (or derivative when is a scalar) with respect to and denotes the covariance matrix when have the density . Thus is the score function and is the Fisher-information for a single observation.
Lemma 4.1.
If we assume is positive definite, then
[TABLE]
In particular, if there exists an absolute positive constant such that
[TABLE]
then .
Proof.
See Subsection A.6.
4.2 Simultaneous prediction intervals for random forests
Consider Example 1.1 and assume that has density w.r.t. the product measure on , i.e., for ,
[TABLE]
That is, the feature has the density w.r.t. some -finite measure on , and thus is allowed to have both continuous and discrete components. The response given has a conditional density w.r.t. the Lebesgue measure.
For many regression algorithms such as tree-based methods, if we fix the features and increase the responses of training samples by , the prediction at any test point will increase by , i.e., ,
[TABLE]
which implies that . Now we consider the embedding into the “location” family . Observe that
[TABLE]
which implies that . In addition,
[TABLE]
Thus if we assume that there exists such that
[TABLE]
then (13) reduces to . If further we assume that a.s. for some constant and each (this holds for example when the response is bounded a.s.), then the conditions (C2), (C3), (C4) and (C5) hold with . With these assumptions, the condition (12) in Corollary 3.5 simplifies as
[TABLE]
Thus if for some , , and , then Corollary 3.5 can be used to construct asymptotically valid simultaneous prediction intervals with the error of approximation decaying polynomially fast in .
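The shift property used in this subsection (adding a constant to all training responses adds the same constant to every prediction) can be checked on a simple local-averaging rule. The nearest-neighbor averager below is a stand-in chosen for brevity, not a tree or a random forest, and `knn_predict` is a hypothetical name. Its neighborhoods depend only on the features, which is what makes the property hold.

```python
def knn_predict(train, test_points, k):
    """Average the responses of the k training points nearest to each test point.

    train is a list of (feature, response) pairs with scalar features.
    """
    preds = []
    for t in test_points:
        nearest = sorted(train, key=lambda pair: abs(pair[0] - t))[:k]
        preds.append(sum(z for _, z in nearest) / k)
    return preds

train = [(0.1, 1.0), (0.4, 2.0), (0.5, 0.5), (0.9, 3.0)]
test_points = [0.3, 0.8]
shift = 2.5
base = knn_predict(train, test_points, k=2)
shifted = knn_predict([(y, z + shift) for y, z in train], test_points, k=2)
differences = [s - b for s, b in zip(shifted, base)]
```

Every entry of `differences` equals the shift up to floating-point rounding, which is the property that drives the location-family embedding of the response.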
Remark 4.2 (Fisher information in nonparametric regressions).
Let us take a closer look at the condition (14). Consider the nonparametric regression model
[TABLE]
where is a deterministic measurable function, and are i.i.d. with some density with respect to the Lebesgue measure. Then and thus
[TABLE]
where for the last equality, we first perform integration w.r.t. and apply a change-of-variable. Thus depends only on the density of the noise.
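For a location family generated by a noise density, the Fisher information is the integral of the squared derivative of the density divided by the density. The sketch below approximates this integral numerically for Gaussian noise, for which the closed form is the reciprocal of the variance. The integration range, step count, and function names are illustrative choices.

```python
from math import exp, pi, sqrt

def gaussian_density(e, s):
    return exp(-0.5 * (e / s) ** 2) / (s * sqrt(2.0 * pi))

def location_fisher_information(density, lo, hi, steps=20000, eps=1e-5):
    """Midpoint-rule approximation of the integral of f'(e)^2 / f(e) over [lo, hi].

    The derivative f' is approximated by a central difference with step eps.
    """
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        e = lo + (i + 0.5) * width
        f = density(e)
        if f > 0.0:
            df = (density(e + eps) - density(e - eps)) / (2.0 * eps)
            total += df * df / f * width
    return total

# Noise with standard deviation 2, so the closed form is 1 / 4.
info = location_fisher_information(lambda e: gaussian_density(e, 2.0), -20.0, 20.0)
```

The result does not involve the regression function or the feature distribution, in line with the observation that only the noise density matters.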
4.3 Expected maximum and log-mean functionals
Next we compute the lower bounds on for two additional statistical functionals.
Example 4.3.
Let and consider the following two kernels: for ,
[TABLE]
In the former case, we are interested in estimating the expectation for the coordinate-wise maxima of independent random vectors, . In the latter, we assume for and are interested in estimating . In both cases, the coordinates of can have arbitrary dependence, and we allow .
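The two kernels can be written down directly. The display defining them is not reproduced here, so the forms below are assumptions based on the description: the first takes the coordinate-wise maximum over the sub-sample, and the second takes the logarithm of the coordinate-wise sub-sample mean for positive data. The function names are hypothetical.

```python
from math import log

def coordinatewise_max_kernel(subsample):
    # subsample: list of r vectors of length d; returns the d coordinate-wise maxima.
    d = len(subsample[0])
    return [max(x[j] for x in subsample) for j in range(d)]

def log_mean_kernel(subsample):
    # Assumes every coordinate is strictly positive.
    r, d = len(subsample), len(subsample[0])
    return [log(sum(x[j] for x in subsample) / r) for j in range(d)]

max_values = coordinatewise_max_kernel([(1.0, 5.0), (3.0, 2.0)])
log_values = log_mean_kernel([(1.0, 2.0), (3.0, 2.0)])
```

Both kernels are symmetric in their arguments and act coordinate by coordinate, so they impose no restriction on the dependence between coordinates.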
Consider the first kernel in Example 4.3, where , and for . Assume has a density w.r.t. the Lebesgue measure on for , and we consider the following embedding . As in the previous example, for
[TABLE]
Thus, by Lemma 4.1, if we assume for some absolute positive constant
[TABLE]
we have . Further, if we assume that there exists a positive constant such that
[TABLE]
then by a maximal inequality (see, e.g., [29, Lemma 2.2.2]), . Then if we select , the conditions (C2), (C3) and (C5) hold. Further, (C4) trivially holds for non-random kernels. With the above assumptions and selection of , the condition (12) in Corollary 3.5 simplifies to .
Now consider the second kernel in Example 4.3, where and for . Assume has a density w.r.t. the Lebesgue measure on for , and consider the following embedding . As before, it is easy to see that for ,
[TABLE]
Thus if there exists a constant such that , then . Further, if there exists a constant such that
[TABLE]
then the conditions (C2), (C3), (C4) and (C5) hold with . With these assumptions, the condition (12) in Corollary 3.5 simplifies as .
4.4 Kernel density estimation
Example 4.4 (Kernel density estimation).
Let be a measurable function that is symmetric in its arguments, and be design points. [15, 17] used as a kernel density estimator (KDE) for the density of at the given design points with
[TABLE]
where is a bandwidth parameter, and is the density estimation kernel with , which should not be confused with the $U$-statistic kernel . For this example, we will assume fixed and the bandwidth , but allow the diameter of the design points, , to grow, where denotes the usual Euclidean norm.
Assume that given , has a density w.r.t. the Lebesgue measure on , i.e., for any . Then by definition, for ,
[TABLE]
For , denote
[TABLE]
As in [15], if and , then for any fixed . If there exists some such that for any and , under mild continuity assumptions (e.g., the equicontinuity of ), there exists an absolute constant such that for large . Then we can apply the result in [8], which does not allow to vanish.
In this work, we allow to vanish, and thus allow the diameter of the design points to grow as becomes large. Specifically, if we assume is bounded by some constant , we can select in conditions (C2), (C3), (C4) and (C5). Then the condition (12) in Corollary 3.5 simplifies as
[TABLE]
Thus if and , to apply Corollary 3.5, we require that for any .
Remark 4.5.
[15] considers the case , and shows the -convergence rate of the KDE. The same discussion applies here. [17] constructs confidence bands (without computational considerations and bootstrap results) for the density of , under the additional assumptions required to establish the convergence of empirical processes.
5 Maximal inequality
In this section, we derive an upper bound on the expected supremum of the remainder of the Hájek projection of the complete IOUS with deterministic kernel. This maximal inequality (with the explicit dependence on ) serves as a key step to establish the Gaussian approximation result for the incomplete IOUS with random kernel.
Theorem 5.1.
Assume (C3) holds. Then there exist constants , depending only on , such that if , then
[TABLE]
The proof of Theorem 5.1 is quite involved: we need to develop a number of technical tools such as the symmetrization inequality and Bonami inequality (i.e., exponential moment bound) for the Rademacher chaos, all with the explicit dependence on .
We start with some notation. Let be an independent copy of , and be i.i.d. Rademacher random variables, i.e., , that are independent of and . If all involved random variables are independent, we write (resp. ) for expectation only w.r.t. (resp. ).
For a given probability space , a measurable function on X and , we use the notation whenever the latter integral is well-defined, and denote the Dirac measure on , i.e., for any . For a measurable symmetric function on and , let denote the function on defined by
[TABLE]
whenever it is well defined. To prove Theorem 5.1, without loss of generality, we may assume
[TABLE]
since we can always consider instead. For , define
[TABLE]
Clearly is degenerate of order with respect to the distribution in the sense of (16) below. For any , and where , define
[TABLE]
Then
Further, the Hoeffding decomposition [19] for the -statistic (with ) is as follows:
[TABLE]
Finally, for any , define the envelope function
[TABLE]
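The Hoeffding projections and the resulting decomposition can be checked numerically in a toy setting. The sketch below is a minimal illustration under assumptions that are not from the text: the order is fixed at 2, the distribution has finite support so that integrals become matrix products, and the kernel `|x - y|` and the sample are arbitrary. It does not use the centering convention adopted above; the mean `theta` is subtracted explicitly.

```python
from itertools import combinations

import numpy as np

# Finite-support distribution P (assumed for illustration).
support = np.array([-1.0, 0.0, 2.0])
probs = np.array([0.5, 0.3, 0.2])

# Symmetric kernel of order 2, tabulated on the support: h(x, y) = |x - y|.
H = np.abs(support[:, None] - support[None, :])

theta = probs @ H @ probs                      # P^2 h
h1 = H @ probs                                 # x -> P h(x, .)
pi1 = h1 - theta                               # first Hoeffding projection
pi2 = H - h1[:, None] - h1[None, :] + theta    # second Hoeffding projection

# A fixed sample, stored as indices into `support`.
sample = np.array([0, 2, 1, 0, 0, 1, 2])
pairs = list(combinations(range(len(sample)), 2))

u_stat = np.mean([H[sample[i], sample[j]] for i, j in pairs])
linear_part = 2.0 * np.mean(pi1[sample])       # Hajek projection term
degenerate_part = np.mean([pi2[sample[i], sample[j]] for i, j in pairs])
```

Here `pi2 @ probs` vanishes, which is the degeneracy property, and `u_stat - theta` equals `linear_part + degenerate_part` exactly for any sample. The term `degenerate_part` plays the role of the remainder of the Hájek projection that Theorem 5.1 controls in the general, high-order case.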
5.1 Symmetrization inequality
For each integer , consider a symmetric kernel . We say that is degenerate of order with respect to the distribution if
[TABLE]
The following result is essentially due to [27, Section 3, Symmetrization inequality] in the -process setting. We provide a self-contained (and perhaps more transparent) proof for completeness.
Theorem 5.2 (Symmetrization inequality).
Assume (16) holds.
[TABLE]
Remark 5.3.
In Theorem 5.2, the symmetrization costs a multiplicative factor of for a degenerate kernel of order . The standard symmetrization argument for such degenerate -statistics (cf. [13, Theorem 3.5.3]) together with the decoupling inequalities (cf. [13, Theorem 3.1.1]) in the literature yield that
[TABLE]
where . Since , improving the constant to exponential growth in turns out to be crucial for obtaining the maximal inequality for the IOUS in Theorem 5.1. The super-exponential behavior of is mainly due to the step that applies the decoupling inequality in [13, Theorem 3.1.1], which is valid for any (measurable) symmetric kernel. If the kernel is degenerate of order , then symmetrization can be done directly, without the decoupling inequality (cf. the proof of Theorem 5.2 below).
Proof of Theorem 5.2.
Define a new sequence of random variables :
[TABLE]
Further, for each , define
[TABLE]
Due to degeneracy, we have
[TABLE]
where the first and third equalities follow from the definitions and Fubini's theorem, and the second follows from the degeneracy. To wit, on the event that for some ,
[TABLE]
The rest of the argument is standard: by Jensen’s inequality,
[TABLE]
Since and have the same distribution, taking expectation on both sides completes the proof.
5.2 Maximal inequality
We start with a lemma, whose proof is elementary and thus omitted. Recall the definition of in Subsection 1.2.
Lemma 5.4.
For any , is strictly increasing, convex, and . Further, for any ,
[TABLE]
and consequently
[TABLE]
Now we state the maximal inequality with explicit constants.
Lemma 5.5.
Fix . Consider a sequence of non-negative random variables , and assume that there exists some real number such that . Then
[TABLE]
Proof.
By monotonicity and convexity,
[TABLE]
Then the proof is complete by Lemma 5.4.
5.3 Exponential moment of Rademacher chaos
The goal is to establish an exponential moment bound (i.e., Bonami inequality) for Rademacher chaos of order . Based on the well-known hyper-contractivity of Rademacher chaos variables in the literature (cf. [13, Corollary 3.2.6]), our Lemma 5.6 below provides an exponential moment bound with an explicit dependence on the order.
Lemma 5.6 (Exponential moment of Rademacher chaos).
Fix , and let be a collection of real numbers. Consider the following homogeneous chaos of order :
[TABLE]
where are i.i.d. Rademacher random variables. Then
[TABLE]
Proof.
Denote , and thus . Observe that and . From [13, Theorem 3.2.2], we have for any
[TABLE]
Here, the first inequality clearly holds for , and we use [13, Theorem 3.2.2] for . Then using the fact that and by Lemma 5.4, we have
[TABLE]
Using the fact that , we have
[TABLE]
Since , we have
[TABLE]
which completes the proof.
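To make the object in Lemma 5.6 concrete, the sketch below builds a small homogeneous Rademacher chaos and computes its first two moments exactly by enumerating all sign vectors. It assumes the chaos sums over index sets of distinct indices; the helper `chaos_value`, the sizes `n = 5` and `k = 3`, and the randomly drawn coefficients are illustrative choices, not quantities from the text.

```python
from itertools import combinations, product
import math

import numpy as np


def chaos_value(coeffs, eps):
    """Evaluate the sum over index sets S of coeffs[S] * prod_{i in S} eps[i]."""
    return sum(a * math.prod(eps[i] for i in subset) for subset, a in coeffs.items())


n, k = 5, 3
rng = np.random.default_rng(0)
# Arbitrary real coefficients indexed by increasing k-tuples (illustrative values).
coeffs = {subset: float(rng.normal()) for subset in combinations(range(n), k)}

# Enumerate all 2^n equally likely sign vectors to obtain exact moments.
values = np.array([chaos_value(coeffs, eps) for eps in product((-1, 1), repeat=n)])

mean = values.mean()
second_moment = np.mean(values**2)
sum_of_squares = sum(a**2 for a in coeffs.values())
```

The chaos is centered and its second moment equals the sum of squared coefficients, because products of distinct Rademacher variables over different index sets are orthogonal. Hyper-contractivity then bounds higher moments in terms of this second moment, which is the input to the exponential moment bound.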
5.4 Proof of Theorem 5.1
We are now in a position to prove Theorem 5.1. Recall that we assume . First, for each and , define
[TABLE]
where is defined in (15), and are i.i.d. Rademacher random variables. Define
[TABLE]
By Jensen’s inequality and the fact that , we have for any ,
[TABLE]
Then by Lemma 5.6,
[TABLE]
Further, by Lemma 5.5 with , we have
[TABLE]
Then by Theorem 5.2 and Jensen’s inequality, we have
[TABLE]
Now we bound . By the definition of , condition (C3), Lemma 5.4 and Jensen’s inequality, we have
[TABLE]
Since , by Jensen’s inequality, we have . Then by the standard maximal inequality (e.g., see [29, Lemma 2.2.2]), there exists a constant , depending only on , such that for ,
[TABLE]
Thus we obtain that
[TABLE]
Observe that if , we have for any
[TABLE]
Further, for any , . Now, take , and in particular . Then
[TABLE]
For the first term, by geometric series formula,
[TABLE]
For the second term, since for any , , we have
[TABLE]
which completes the proof of Theorem 5.1.
Appendix A Proofs
A.1 Tail probabilities
In this section, we collect and prove some results regarding tail probabilities for sums of independent random vectors, -statistics, and -statistics with random kernels. For each type of statistic, we present two versions, one for non-negative random variables and the other for the general case.
These inequalities are used in bounding the effects due to sampling (Subsection A.4.3), and also in controlling the distance between the bootstrap covariance matrices and their targets (Section A.5).
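For orientation, the sketch below contrasts a complete $U$-statistic with an incomplete one obtained by keeping each index set according to an independent Bernoulli draw. It is a minimal illustration under stated assumptions: the function names, the toy vector-valued kernel, and the normalization by the number of selected subsets are hypothetical choices and are not taken from this section.

```python
from itertools import combinations

import numpy as np


def complete_ustat(x, r, kernel):
    """Average of the kernel over all size-r subsets of the sample."""
    return np.mean([kernel(s) for s in combinations(x, r)], axis=0)


def incomplete_ustat(x, r, kernel, p, rng):
    """Average of the kernel over subsets kept by independent Bernoulli(p) draws."""
    subsets = list(combinations(x, r))
    keep = rng.random(len(subsets)) < p  # Bernoulli(p) indicators
    if not keep.any():
        raise ValueError("no subset was selected; increase p")
    return np.mean([kernel(s) for s, z in zip(subsets, keep) if z], axis=0)


def kernel(s):
    """Illustrative symmetric vector-valued kernel: (mean, range) of the subset."""
    return np.array([np.mean(s), max(s) - min(s)])


rng = np.random.default_rng(1)
x = rng.normal(size=12)
full = complete_ustat(x, r=4, kernel=kernel)
sparse = incomplete_ustat(x, r=4, kernel=kernel, p=0.1, rng=rng)
same = incomplete_ustat(x, r=4, kernel=kernel, p=1.0, rng=rng)
```

With `p = 1.0` every subset is kept and the incomplete statistic coincides with the complete one. For smaller `p` only about a fraction `p` of the subsets is evaluated, and the tail bounds below quantify the extra randomness this sampling introduces. A practical implementation would sample subsets directly rather than enumerate all of them as this sketch does.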
A.1.1 Tail probabilities for sum of independent random vectors
In this subsection, are all integers.
Lemma A.1.
Let be independent -valued random vectors and . Assume that
[TABLE]
Then there exists some constant that only depends on such that
[TABLE]
Proof.
See Subsection A.7.1.
Lemma A.2.
Let be independent -valued random vectors and . Assume that
[TABLE]
Then there exists some constant that only depends on such that
[TABLE]
where .
Proof.
See Subsection A.7.2.
Lemma A.3.
Let be independent and identically distributed Bernoulli random variables with success probability , i.e., for . Further, let be deterministic vectors. Then there exists an absolute constant such that
[TABLE]
where and .
Proof.
See Subsection A.7.3.
A.1.2 Tail probabilities for -statistics
Lemma A.4.
Let be i.i.d. random variables taking values in and fix . Let be a measurable, symmetric function such that ,
[TABLE]
Define . Then there exists a constant that only depends on such that
[TABLE]
Clearly, we can replace by .
Proof.
See Subsection A.7.4.
Lemma A.5.
Let be i.i.d. random variables taking values in and fix . Let be a measurable, symmetric function such that
[TABLE]
Define and . Then there exists a constant that only depends on such that
[TABLE]
Clearly, we can replace by .
Proof.
See Subsection A.7.5.
A.1.3 Tail probabilities for -statistics with random kernel
Let be i.i.d. random variables taking values in and be i.i.d. random variables taking values in , that are independent of . In this subsection, we consider a measurable function that is symmetric in the first variables, and fix some . Further, define
[TABLE]
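The setup above can be pictured with a toy random kernel. The sketch below assumes that each index set receives its own independent auxiliary draw, and it uses a hypothetical kernel that returns the `w`-th smallest entry of the subset (symmetric in the data arguments), whose average over `w` is the subset mean. The names `random_kernel` and `averaged_kernel` and all numerical choices are illustrative only.

```python
from itertools import combinations

import numpy as np


def random_kernel(s, w):
    """H(x_1, ..., x_r, w): the w-th smallest entry, symmetric in x_1, ..., x_r."""
    return sorted(s)[w]


def averaged_kernel(s):
    """h(x_1, ..., x_r) = E[H(x_1, ..., x_r, W)] for W uniform on {0, ..., r-1}."""
    return float(np.mean(s))


rng = np.random.default_rng(2)
n, r = 10, 3
x = rng.normal(size=n)
subsets = list(combinations(x, r))

# One independent auxiliary draw per index set, independent of the sample.
w = rng.integers(0, r, size=len(subsets))

u_random = np.mean([random_kernel(s, wi) for s, wi in zip(subsets, w)])
u_averaged = np.mean([averaged_kernel(s) for s in subsets])
```

In this toy case `u_averaged` reduces to the sample mean, while `u_random` fluctuates around it because of the auxiliary randomness. The lemmas in this subsection bound that kind of deviation uniformly over coordinates.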
We first consider the non-negative random kernels.
Lemma A.6.
Consider Assume that for all , , and that there exists such that
[TABLE]
Then there exists some constant that only depends on such that with probability at least ,
[TABLE]
Proof.
See Subsection A.7.6.
Next, we consider centered random kernels.
Lemma A.7.
Consider Assume there exists such that for all ,
[TABLE]
Then there exists some constant that only depends on such that with probability at least ,
[TABLE]
Proof.
See Subsection A.7.7.
A.2 Additional lemmas
The following lemma concerns Gaussian approximation for sums of independent vectors. It replaces the condition in Proposition 2.1 of [10] by .
Lemma A.8.
Let be independent -valued random vectors. Assume that for some absolute constant , and ,
[TABLE]
Then there exists some constant that only depends on and such that
[TABLE]
where , , and .
Proof.
See Subsection A.8.
The following lemmas are elementary, but used repeatedly.
Lemma A.9.
Let . There exists a constant , depending only on , such that for any positive integers such that ,
[TABLE]
Proof.
Fix . If , . Thus there exists such that if , . For , the inequality holds with .
Lemma A.10.
Let . For any random variable ,
[TABLE]
Proof.
Observe that
[TABLE]
which implies that . The reverse direction is similar.
For , is not a norm, but the usual triangle inequality and maximal inequality hold up to a multiplicative constant.
Lemma A.11.
Fix .
(i) For any random variables and ,
[TABLE]
(ii) Let be a sequence of random variables such that for , and . Then there exists a constant depending only on such that
[TABLE]
Proof.
See Subsection A.8.
A.3 Proofs in Section 2.1
We first prove Corollary 2.2 and then prove Theorem 2.1.
Proof of Corollary 2.2.
Let be the constant in Theorem 5.1. Without loss of generality, we assume
[TABLE]
since and we can always consider instead. Recall that .
Fix any rectangle , where and . Define
[TABLE]
Denote
[TABLE]
Then by Theorem 5.1,
[TABLE]
For any , by Markov inequality and definition,
[TABLE]
Due to assumptions (C2), (C3) and Cauchy-Schwarz inequality,
[TABLE]
Then due to Lemma A.8, we have
[TABLE]
Further, by anti-concentration inequality [10, Lemma A.1],
[TABLE]
Finally, taking and due to convention (17), we have
[TABLE]
Likewise, we can show the lower inequality
[TABLE]
which completes the proof.
Proof of Theorem 2.1.
As before, without loss of generality, we assume
[TABLE]
for some sufficiently small . Define for each ,
[TABLE]
Then by definition,
[TABLE]
Step 1. We first show that
[TABLE]
Note that conditional on , is an average of independent random vectors. Thus by [9, Lemma 8],
[TABLE]
By definition (6) and maximal inequality ([29, Lemma 2.2.2] and Lemma A.11),
[TABLE]
Define
[TABLE]
Under the assumption (C4) and again maximal inequality ([29, Lemma 2.2.2] and Lemma A.11), we have
[TABLE]
Then, we have
[TABLE]
Then due to Lemma A.9 and (18), we have
[TABLE]
Step 2. We finish the proof by a similar argument as in the proof of Corollary 2.2.
Fix any rectangle , where and . Define
[TABLE]
where we recall that is defined in (5). Recall that . For any , by Markov inequality, the result from Step 1, and Corollary 2.2,
[TABLE]
Observe that for . By anti-concentration inequality [10, Lemma A.1],
[TABLE]
Finally, taking and due to convention (18), we have
[TABLE]
By a similar argument, we can show
[TABLE]
which completes the proof.
A.4 Proofs in Section 2.2
In this subsection, without loss of generality, we assume . Recall the definition in (5). Further, define a function by , and
[TABLE]
Clearly, if (C5) holds, then
[TABLE]
where again we applied Cauchy–Schwarz inequality for .
A.4.1 Bounding
The following lemma follows from an application of Bernstein’s inequality and is proved in Step 5 of the proof of [8, Theorem 3.1]. It is included here for easy reference.
Lemma A.12.
Assume . Then
[TABLE]
A.4.2 Bounding the normalized covariance estimator
Lemma A.13.
Assume (C3’), (C4) and (C5) hold. Then there exists a constant , depending only on , such that with probability at least ,
[TABLE]
Proof.
Define . Observe that
[TABLE]
We will bound these two terms separately.
Step 0. We first make a few observations. Clearly, , and for all , by Jensen’s inequality for conditional expectation and (21),
[TABLE]
Further, by definition
[TABLE]
As a result, by the assumptions (C4) and (C3’), and Lemma A.10,
[TABLE]
Step 1. We bound using Lemma A.7 with and . For , define
[TABLE]
Observe that due to Lemma A.10 and A.11,
[TABLE]
Then due to (23) and the assumptions (C4) and (C3’),
[TABLE]
Now we apply Lemma A.7, with probability at least ,
[TABLE]
Step 2. We bound using Lemma A.5 with . By (22) and (23), with probability at least ,
[TABLE]
Then the proof is complete by combining Steps 1 and 2.
A.4.3 Bounding the effect of sampling
The following quantity will appear in the proof of Theorem 2.4:
[TABLE]
The next lemma establishes conditional Gaussian approximation for .
Lemma A.14.
Suppose the assumptions in Theorem 2.4 hold. There exists a constant , depending on , such that with probability at least ,
[TABLE]
where we recall that , and we abbreviate for .
Proof.
Consider conditionally independent (conditioned on ) -valued random vectors such that
[TABLE]
Clearly, . Further, define
[TABLE]
By triangle inequality, it then suffices to show that each of the following events happens with probability at least ,
[TABLE]
on which we now focus. Without loss of generality, since , we assume
[TABLE]
for some sufficiently small constant that is to be determined. Recall that and .
Step 0. By Lemma A.13 and A.9,
[TABLE]
In particular, since , if we take small enough such that , then .
Step 1. The goal is to show that the first event in (25), , holds with probability at least .
Step 1.1. Define
[TABLE]
Further, , where
[TABLE]
By Theorem 2.1 in [10], there exist absolute constants and such that for any real numbers and , we have
[TABLE]
on the event .
In Step 0, we have shown . In Step 1.2-1.4, we select proper and such that the first two events happen with probability at least . In Step 1.5, we plug in these values.
Step 1.2: Select . Since , , and thus
[TABLE]
We will apply Lemma A.6 with and . Thus for , define
[TABLE]
First, by iterated expectation and due to (21),
[TABLE]
Second, observe that , and thus due to (C3), (C4) and Lemma A.10 and A.11,
[TABLE]
Further, observe that by Lemma A.11,
[TABLE]
Thus by the same argument, . Then by Lemma A.6, with probability at least ,
[TABLE]
Due to Lemma A.9 and assumption (26), . Thus there is a constant , depending on , such that if
[TABLE]
then .
Step 1.3: bounding . Since is a Bernoulli random variable, it is clear that on the event
[TABLE]
where we use the value (30) for .
By assumption (C3’) and Lemma A.11,
[TABLE]
Due to (26),
[TABLE]
Thus if we take in (26) to be sufficiently small such that
[TABLE]
then and .
Step 1.4: select . From Step 1.3, we have shown that
[TABLE]
Then by the same argument as in Step 1.4 of the proof of [8, Theorem 3.1] and due to (26) and , on the event , for any ,
[TABLE]
Thus there exists an absolute constant such that if we set
[TABLE]
then .
Step 1.5: plug in and . Recall the definition and in (30) and (31). With these selections, we have shown that , where we recall that . Further, on the event ,
[TABLE]
which completes the proof of Step 1.
Step 2. The goal is to show that the second event in (25), , holds with probability at least .
Observe that and for . By the Gaussian comparison inequality [8, Lemma C.5],
[TABLE]
on the event that . From (27) in Step 0,
[TABLE]
Thus if we set , then with probability at least ,
[TABLE]
A.4.4 Proof of Theorem 2.4
Without loss of generality, we assume that
[TABLE]
Observe that
[TABLE]
where we recall that and is defined in Section 2.1 and in (24) respectively. Denote .
Step 1: the goal is to show that
[TABLE]
For any rectangle , observe that
[TABLE]
By Lemma A.14, since , we have
[TABLE]
where we recall that is independent of all other random variables. Further, by Theorem 2.1,
[TABLE]
Observe that for any , due to (C3’), and . Then by the Gaussian comparison inequality [8, Lemma C.5] and due to (32)
[TABLE]
Similarly, we can show . Thus the proof of Step 1 is complete.
Step 2: we show that with probability at least ,
[TABLE]
Clearly, . Then due to (C3’), . Since is a multivariate Gaussian, . Then by the maximal inequality [29, Lemma 2.2.2], , which further implies that
[TABLE]
Since , and from the result in Step 1, we have
[TABLE]
Finally, due to Lemma A.12 and (32), we have with probability at least ,
[TABLE]
Since , the proof is complete.
Step 3: final step. Recall that and is defined in Step 2. For any rectangle with , by Step 2,
[TABLE]
Then by the result in Step 1, we have
[TABLE]
where , and . Observe that for , and thus by anti-concentration inequality [10, Lemma A.1],
[TABLE]
where the last inequality is due to (32). Similarly, we can show , and thus
[TABLE]
which completes the proof.
A.5 Proofs in Section 3
In this subsection, without loss of generality, we assume (see Remark 2.5).
A.5.1 Proof of Theorem 3.1
Proof.
Without loss of generality, we can assume , since otherwise we can center first. Recall the definition of in (5), , and in (20). Observe that for any integer , there exists some constant that depends only on and such that
[TABLE]
Step 0. Define and
[TABLE]
Since for , by the Gaussian comparison inequality [8, Lemma C.5],
[TABLE]
Thus it suffices to show that with probability at least , . Define
[TABLE]
Then clearly .
Without loss of generality, we can assume , since we can always take to be large enough. Then by Lemma A.12, , and thus it suffices to show that
[TABLE]
on which we now focus.
Step 1: bounding . Conditional on , by Lemma A.3,
[TABLE]
where
[TABLE]
First, by the maximal inequality ([29, Lemma 2.2.2] and Lemma A.11) and due to (C3’) and Lemma A.10 and A.11,
[TABLE]
As a result, .
Second, we will apply Lemma A.6 to bound with and . Note that by Lemma A.11, for ,
[TABLE]
As a result, due to (C5), (C3) and (C4)
[TABLE]
Then by Lemma A.6 and A.9, and due to (8) and (33)
[TABLE]
Finally, putting the two results together and again by (33), we have
[TABLE]
Then by (8), , which implies that with probability at least ,
[TABLE]
Step 2: bounding . By Lemma A.13 and A.9, and due to assumptions (8) and (33)
[TABLE]
which implies .
Step 3: bounding . By definition, . Then by Lemma A.12 and (8),
[TABLE]
with probability at least .
Step 4: bounding . Define
[TABLE]
Clearly, In the next two sub-steps, we will bound these two terms separately.
Step 4.1: bounding . Conditional on , by Lemma A.3,
[TABLE]
where .
First, by the maximal inequality ([29, Lemma 2.2.2] and Lemma A.11) and due to (C3’),
[TABLE]
As a result, .
Second, we will apply Lemma A.6 to bound with and . Define for ,
[TABLE]
By an argument similar to that in Step 1,
[TABLE]
Then by Lemma A.6 and A.9, and due to (8) and (33) we have .
Finally, putting the two results together, we have
[TABLE]
Then by (8), , which implies that with probability at least , holds.
Step 4.2: bounding . Observe that , where
[TABLE]
By directly applying Lemma A.7 with , due to (8) and Lemma A.9,
[TABLE]
By directly applying Lemma A.5 with and due to (8),
[TABLE]
Thus .
Combining Sub-steps 4.1 and 4.2, we have . Combining Steps 0-4 completes the proof.
A.5.2 Proof of Lemma 3.2
Proof.
Without loss of generality, we can assume . Recall the definition in (5). By definition, for . Then by the Gaussian comparison inequality [8, Lemma C.5],
[TABLE]
where
[TABLE]
By the same argument as in the proof of [8, Theorem 4.2],
[TABLE]
where is defined in (9), and
[TABLE]
Step 1: bounding . By the second part of (11), we have
[TABLE]
Step 2: bounding . We apply Lemma A.2 with , and note that :
[TABLE]
where and
[TABLE]
By Lemma A.11, (C2) and (C3’), . Thus
[TABLE]
Then due to the first part of (11) and (33), .
Step 3: bounding . We apply Lemma A.2 with , :
[TABLE]
Then due to the first part of (11) and (33), .
A.5.3 Proof of Theorem 3.3
Proof.
Without loss of generality, we can assume .
Step 1. Let , . Due to Theorem 3.1, Lemma 3.2 and using the same argument as in Step 3 of the proof of [8, Theorem 4.2], it suffices to show that the second part of (11) holds. From the definition (9),
[TABLE]
In Step 2, we will show that
[TABLE]
Then by Markov inequality and (12),
[TABLE]
which completes the proof.
Step 2. The goal is to show (34). Define
[TABLE]
By Jensen’s inequality,
[TABLE]
and for each , conditional on , by the Hoffmann-Jørgensen inequality [29, A.1.6],
[TABLE]
Step 2.1: bounding . Observe that for each ,
[TABLE]
Thus .
Step 2.2: bounding . Observe that for each ,
[TABLE]
Further, by Jensen’s inequality,
[TABLE]
where is defined in Step 1. Then by the same argument as in the proof of [8, Proposition 4.4],
[TABLE]
Step 2.3: combining 2.1 and 2.2. By Jensen’s inequality, assumption (C3’) and by the maximal inequality ([29, Lemma 2.2.2] and Lemma A.11)
[TABLE]
Thus combining the results from 2.1 and 2.2, we have
[TABLE]
where the second inequality is due to (12) and that and .
A.5.4 Proof of Corollary 3.5
Proof.
We have shown in Step 0 of the proof (Subsection A.5.1) for Theorem 3.1 that
[TABLE]
Further, if we take in Theorem 3.3, then in the proofs of Lemma 3.2 and Theorem 3.3, we have shown that
[TABLE]
The rest of the proof is the same as the proof for [8, Corollary A.1], and thus omitted.
A.6 Proof of Lemma 4.1
Proof.
Clearly, the inequality is for each dimension, and thus without loss of generality, we assume and omit the dependence on .
We denote and the expectation and covariance when have densities . Further, define for and by definition .
First, note that by interchanging the order of integration and differentiation
[TABLE]
Further, by a similar argument,
[TABLE]
which implies that
[TABLE]
Finally, observe that
[TABLE]
which completes the proof.
A.7 Proofs of tail probabilities in Section A.1
A.7.1 Proof of Lemma A.1
Proof.
We first define
[TABLE]
Then by the maximal inequality [29, Lemma 2.2.2], . By [10, Lemma E.4],
[TABLE]
The right hand side is if
[TABLE]
Further by [10, Lemma E.3],
[TABLE]
Combining the two parts finishes the proof.
A.7.2 Proof of Lemma A.2
Proof.
We first define
[TABLE]
Then by the maximal inequality [29, Lemma 2.2.2], . By [10, Lemma E.2],
[TABLE]
The right hand side is if
[TABLE]
Further by [10, Lemma E.1],
[TABLE]
Combining the two parts finishes the proof.
A.7.3 Proof of Lemma A.3
Proof.
We first define
[TABLE]
By [10, Lemma E.2],
[TABLE]
The right hand side is if
[TABLE]
Further by [10, Lemma E.1],
[TABLE]
Combining the two parts finishes the proof.
A.7.4 Proof of Lemma A.4
Proof.
Let , and define the following quantity
[TABLE]
Then by the maximal inequality [29, Lemma 2.2.2], . By [6, Lemma E.3],
[TABLE]
The right hand side is if we set
[TABLE]
Further, by [9, Lemma 9],
[TABLE]
Putting the two parts together, we have
[TABLE]
which completes the proof.
A.7.5 Proof of Lemma A.5
Proof.
Let , and define the following quantity
[TABLE]
Then by the maximal inequality [29, Lemma 2.2.2], . By [8, Lemma C.3],
[TABLE]
The right hand side is if we take
[TABLE]
Further, by [9, Lemma 8],
[TABLE]
Putting the two parts together completes the proof.
A.7.6 Proof of Lemma A.6
Proof.
First, observe that . Denote
[TABLE]
Then conditional on , by Lemma A.1,
[TABLE]
By Lemma A.4,
[TABLE]
Further, by the maximal inequality [29, Lemma 2.2.2],
[TABLE]
Then the proof is complete by combining the above results.
A.7.7 Proof of Lemma A.7
Proof.
First, we define
[TABLE]
Then, by first conditioning on and applying Lemma A.2,
[TABLE]
Observe that
[TABLE]
Then by Lemma A.4 with ,
[TABLE]
Further, by the maximal inequality [29, Lemma 2.2.2],
[TABLE]
Then the proof is complete by combining the above results.
A.8 Proofs of additional lemmas
The following lemma is similar to [10, Lemma C.1], and is needed in proving Lemma A.8.
Lemma A.15.
Let , and be a non-negative random variable such that . Then there exists a constant , depending only on , such that
[TABLE]
Proof.
Since , we have for ,
[TABLE]
By change of variable, we have
[TABLE]
Proof of Lemma A.8.
For , it has been established by [10, Proposition 2.1]. For , the proof is almost identical to that for [10, Proposition 2.1], except that we replace [10, Lemma C.1] by Lemma A.15.
Proof of Lemma A.11.
(i). Without loss of generality, we assume , and . Observe that
[TABLE]
(ii). From Lemma 5.4, for ,
[TABLE]
which, by the convexity of and the fact , implies . By the standard maximal inequality (e.g., see [29, Lemma 2.2.2]) and Lemma 5.4, . Thus by Lemma 5.4,
[TABLE]
Now we let such that . Then by Jensen’s inequality ( for a.s.),
[TABLE]
which implies that .
Acknowledgements
X. Chen is supported in part by NSF DMS-1404891, NSF CAREER Award DMS-1752614, and UIUC Research Board Awards (RB17092, RB18099).
- [1] Gunnar Blom. Some properties of incomplete $U$-statistics. Biometrika, 63(3):573–580, 1976.
- [2] Yu. V. Borovskikh. U-Statistics in Banach Spaces. V.S.P. Intl Science, 1996.
- [3] Leo Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996.
- [4] Leo Breiman. Random forests. Machine Learning, 45:5–32, 2001.
- [5] B.M. Brown and D.G. Kildea. Reduced $U$-statistics and the Hodges-Lehmann estimator. Annals of Statistics, 6:828–835, 1978.
- [6] Xiaohui Chen. Gaussian and bootstrap approximations for high-dimensional U-statistics and their applications. The Annals of Statistics, 46(2):642–678, 2018.
- [7] Xiaohui Chen and Kengo Kato. Jackknife multiplier bootstrap: finite sample approximations to the $U$-process supremum with applications. 2017. arXiv:1708.02705.
- [8] Xiaohui Chen and Kengo Kato. Randomized incomplete $U$-statistics in high dimensions. The Annals of Statistics, accepted (available at arXiv:1712.00771), 2018+.
