Approximating high-dimensional infinite-order $U$-statistics: statistical and computational guarantees

Yanglei Song; Xiaohui Chen; Kengo Kato

arXiv:1901.01163·math.ST·December 11, 2019


TL;DR

This paper develops statistical and computational methods for approximating high-dimensional infinite-order U-statistics, enabling uncertainty quantification in ensemble methods like random forests with guarantees on accuracy and efficiency.

Contribution

It introduces non-asymptotic Gaussian approximation bounds and data-driven bootstrap methods for incomplete IOUS, addressing computational challenges in high dimensions.

Findings

01

Derived non-asymptotic Gaussian approximation error bounds.

02

Established statistical guarantees for bootstrap inference.

03

Provided computational efficiency results for incomplete IOUS.

Abstract

We study the problem of distributional approximations to high-dimensional non-degenerate $U$ -statistics with random kernels of diverging orders. Infinite-order $U$ -statistics (IOUS) are a useful tool for constructing simultaneous prediction intervals that quantify the uncertainty of ensemble methods such as subbagging and random forests. A major obstacle in using the IOUS is their computational intractability when the sample size and/or order are large. In this article, we derive non-asymptotic Gaussian approximation error bounds for an incomplete version of the IOUS with a random kernel. We also study data-driven inferential methods for the incomplete IOUS via bootstraps and develop their statistical and computational guarantees.

Equations

$$U_{n}:=\frac{1}{|I_{n,r}|}\sum_{\iota\in I_{n,r}}h(X_{i_{1}},\dots,X_{i_{r}}):=\frac{1}{|I_{n,r}|}\sum_{\iota\in I_{n,r}}h(X_{\iota}),$$

$$\widehat{U}_{n}:=|I_{n,r}|^{-1}\sum_{\iota\in I_{n,r}}H(X_{i_{1}},\dots,X_{i_{r}},W_{\iota})=|I_{n,r}|^{-1}\sum_{\iota\in I_{n,r}}H(X_{\iota},W_{\iota}),$$

$$\sigma_{g,j}^{2}:=\mathbb{E}[(g_{j}(X_{1})-\theta_{j})^{2}]\ \text{ for }1\leqslant j\leqslant d,\qquad\underline{\sigma}_{g}^{2}:=\min_{1\leqslant j\leqslant d}\sigma_{g,j}^{2}.$$

$$\mathbb{E}[(h_{j}(X_{1},\dots,X_{r})-\theta_{j})^{2}]\geqslant\sum_{1\leqslant i\leqslant r}\mathbb{E}[(g_{j}(X_{i})-\theta_{j})^{2}]=r\sigma_{g,j}^{2}.$$

$$U_{n,N}^{\prime}:=\widehat{N}^{-1}\sum_{\iota\in I_{n,r}}Z_{\iota}H(X_{\iota},W_{\iota}),\quad\text{where }\widehat{N}:=\sum_{\iota\in I_{n,r}}Z_{\iota}.$$

$$\underline{\sigma}_{g}^{-2}=O(r^{2}).$$

$$\Gamma_{g}:=\mathrm{Cov}(g(X_{1})),\qquad\Gamma_{H}:=\mathrm{Cov}(H(X_{1}^{r},W)),$$

$$\sigma_{H,j}^{2}:=\mathbb{E}[(H_{j}(X_{1}^{r},W)-\theta_{j})^{2}]\ \text{ for }1\leqslant j\leqslant d.$$

$$\Lambda_{g,jj}:=\sigma_{g,j}^{2}\leqslant\sigma_{H,j}^{2}=:\Lambda_{H,jj}\ \text{ for }1\leqslant j\leqslant d.$$

$$\rho(U,Y):=\sup_{R\in\mathcal{R}}\left|\mathbb{P}(U\in R)-\mathbb{P}(Y\in R)\right|,$$

$$B_{n,j}(x_{1},\dots,x_{r}):=\|H_{j}(x_{1},\dots,x_{r},W)-h_{j}(x_{1},\dots,x_{r})\|_{\psi_{q}}.$$

$$\sigma_{g,j}^{2}>0,\ \text{ for all }j=1,\dots,d,$$

$$\mathbb{E}|g_{j}(X_{1})-\theta_{j}|^{4}\leqslant\sigma_{g,j}^{2}D_{n}^{2},\ \text{ for all }j=1,\dots,d,$$

$$\|h_{j}(X_{1}^{r})-\theta_{j}\|_{\psi_{q}}\leqslant D_{n},\ \text{ for all }j=1,\dots,d,$$

$$\|B_{n,j}(X_{1}^{r})\|_{\psi_{q}}\leqslant D_{n}\ \text{ for all }j=1,\dots,d.$$

$$\rho(\sqrt{n}(\widehat{U}_{n}-\theta),rY_{A})\lesssim\left(\frac{r^{2}D_{n}^{2}\log^{q_{*}}(dn)}{\underline{\sigma}_{g}^{2}\,n}\right)^{1/6},$$

$$\rho(\sqrt{n}(U_{n}-\theta),rY_{A})\lesssim\left(\frac{r^{2}D_{n}^{2}\log^{q_{*}}(dn)}{\underline{\sigma}_{g}^{2}\,n}\right)^{1/6}.$$

$$\|H_{j}(X_{1}^{r},W)-\theta_{j}\|_{\psi_{q}}\leqslant D_{n},\ \text{ for all }j=1,\dots,d,$$

$$\mathbb{E}|H_{j}(X_{1}^{r},W)-\theta_{j}|^{4}\leqslant\sigma_{H,j}^{2}D_{n}^{2},\ \text{ for all }j=1,\dots,d.$$

$$\rho(\sqrt{n}(U_{n,N}^{\prime}-\theta),rY_{A}+\alpha_{n}^{1/2}Y_{B})\lesssim\varpi_{n},\quad\text{where }\varpi_{n}:=\left(\frac{r^{q_{1}}D_{n}^{2}\log^{q_{*}}(dn)}{\underline{\sigma}_{g}^{2}\,(n\wedge N)}\right)^{1/6},$$

$$\Gamma_{H}$$

$$U_{n,B}^{\#}:=\frac{1}{\sqrt{\widehat{N}}}\sum_{\iota\in I_{n,r}}\xi_{\iota}^{\prime}\sqrt{Z_{\iota}}\,(H(X_{\iota},W_{\iota})-U_{n,N}^{\prime}).$$

$$\frac{r^{q_{1}}D_{n}^{2}\log^{q_{2}}(dn)}{(\underline{\sigma}_{H}^{2}\wedge 1)(n\wedge N)}\leqslant C_{1}n^{-\zeta},$$

$$\sup_{R\in\mathcal{R}}\left|\mathbb{P}_{|\mathcal{D}_{n}}(U_{n,B}^{\#}\in R)-\mathbb{P}(Y_{B}\in R)\right|\leqslant Cn^{-\zeta/6}.$$

$$\Delta_{A,1}:=\max_{1\leqslant j\leqslant d}\frac{1}{n_{1}\sigma_{g,j}^{2}}\sum_{i_{1}\in S_{1}}(G_{i_{1},j}-g_{j}(X_{i_{1}}))^{2}.$$

$$U_{n_{1},A}^{\#}:=\frac{1}{\sqrt{n_{1}}}\sum_{i_{1}\in S_{1}}\xi_{i_{1}}(G_{i_{1}}-\overline{G}),$$

$$\frac{D_{n}^{2}\log^{q_{2}}(dn)}{\underline{\sigma}_{g}^{2}\,n_{1}}\leqslant C_{1}n^{-\zeta_{1}},\quad\text{and}\quad\mathbb{P}\left(\Delta_{A,1}\log^{4}(d)>C_{1}n^{-\zeta_{2}}\right)\leqslant C_{1}n^{-1},$$

$$\sup_{R\in\mathcal{R}}\left|\mathbb{P}_{|\mathcal{D}_{n}}(U_{n_{1},A}^{\#}\in R)-\mathbb{P}(Y_{A}\in R)\right|\leqslant Cn^{-(\zeta_{1}\wedge\zeta_{2})/6},$$

$$\overline{S}_{2,k}^{(i_{1})}:=\{i_{1}\}\cup S_{2,k}^{(i_{1})},\qquad G_{i_{1}}:=\frac{1}{K}\sum_{k=1}^{K}H\big(X_{\overline{S}_{2,k}^{(i_{1})}},W_{\overline{S}_{2,k}^{(i_{1})}}\big).$$

$$U_{n,n_{1}}^{\#}:=rU_{n_{1},A}^{\#}+\alpha_{n}^{1/2}U_{n,B}^{\#}.$$

$$\frac{r^{q_{1}}D_{n}^{2}\log^{q_{*}}(dn)}{\underline{\sigma}_{g}^{2}\,(n_{1}\wedge N)}\leqslant C_{1}n^{-\zeta},$$

$$\sup_{R\in\mathcal{R}}\left|\mathbb{P}_{|\mathcal{D}_{n}}(U_{n,n_{1}}^{\#}\in R)-\mathbb{P}(rY_{A}+\alpha_{n}^{1/2}Y_{B}\in R)\right|\leqslant Cn^{-(\zeta-1/\nu)/6}.$$

$$\sup_{R\in\mathcal{R}}\left|\mathbb{P}(\sqrt{n}(U_{n,N}^{\prime}-\theta)\in R)-\mathbb{P}_{|\mathcal{D}_{n}}(U_{n,n_{1}}^{\#}\in R)\right|\leqslant Cn^{-\zeta/7}.$$


Full text

Approximating high-dimensional infinite-order $U$ -statistics: statistical and computational guarantees

Yanglei Song

Xiaohui Chen

Kengo Kato

Department of Mathematics and Statistics, Queen’s University, 48 University Ave, Kingston, ON, Canada, K7L 3N6

Department of Statistics, University of Illinois at Urbana-Champaign, 725 S. Wright Street, Champaign, IL 61820

Department of Statistics and Data Science, Cornell University, 1194 Comstock Hall, Ithaca, NY 14853

University of Illinois at Urbana-Champaign and Cornell University


Keywords: infinite-order $U$-statistics, incomplete $U$-statistics, Gaussian approximation, bootstrap, random forests, uncertainty quantification.

1 Introduction

Let $X_{1},\dots,X_{n}$ be independent and identically distributed (i.i.d.) random variables taking values in a measurable space $(S,\mathcal{S})$ with common distribution $P$, and let $h:S^{r}\to\mathbb{R}^{d}$ be a symmetric and measurable function with respect to the product space $S^{r}$ equipped with the product $\sigma$-field $\mathcal{S}^{r}=\mathcal{S}\otimes\cdots\otimes\mathcal{S}$ ($r$ times). Assume $\mathbb{E}[|h_{j}(X_{1},\ldots,X_{r})|]<\infty$ for $1\leqslant j\leqslant d$, and consider statistical inference on the mean vector $\theta=(\theta_{1},\ldots,\theta_{d})^{T}=\mathbb{E}[h(X_{1},\dots,X_{r})]$. A natural estimator for $\theta$ is the $U$-statistic with kernel $h$:

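$$U_{n}:=\frac{1}{|I_{n,r}|}\sum_{\iota\in I_{n,r}}h(X_{i_{1}},\dots,X_{i_{r}}):=\frac{1}{|I_{n,r}|}\sum_{\iota\in I_{n,r}}h(X_{\iota}),\tag{1}$$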

where $I_{n,r}:=\{\iota=(i_{1},\ldots,i_{r}):1\leqslant i_{1}<\ldots<i_{r}\leqslant n\}$ is the set of all ordered $r$ -tuples of $1,\dots,n$ and $|\cdot|$ denotes the set cardinality. The positive integer $r$ is called the order or degree of the kernel $h$ or the $U$ -statistic $U_{n}$ . We refer to [21] as an excellent monograph on $U$ -statistics.
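To make the definition concrete, here is a minimal Python sketch that averages a kernel over all size-$r$ subsets of a small sample. The names `complete_u_statistic` and `mean_kernel` are illustrative assumptions, not part of the paper, and the brute-force enumeration is only feasible for toy values of $n$ and $r$.

```python
from itertools import combinations
from math import comb

import numpy as np


def complete_u_statistic(x, r, kernel):
    """Average `kernel` over all size-r subsets of the rows of `x`.

    `kernel` is a hypothetical user-supplied symmetric function mapping an
    (r, ...) array to a length-d vector; the cost grows like C(n, r).
    """
    n = len(x)
    total = None
    for iota in combinations(range(n), r):  # iota = (i_1 < ... < i_r)
        value = np.asarray(kernel(x[list(iota)]), dtype=float)
        total = value if total is None else total + value
    return total / comb(n, r)


def mean_kernel(sub):
    # Linear kernel h(x_1, ..., x_r) = (x_1 + ... + x_r) / r, per coordinate.
    return sub.mean(axis=0)


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))  # n = 8 observations, d = 3 coordinates
u_n = complete_u_statistic(x, r=3, kernel=mean_kernel)
```

For the linear kernel every observation appears in the same number of subsets, so `u_n` coincides with the ordinary sample mean, which gives a quick sanity check of the enumeration.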

In the present paper, we are interested in the situation where the order $r$ may be non-negligible relative to the sample size $n$, i.e., $r=r_{n}\to\infty$ as $n\to\infty$. $U$-statistics with divergent orders are called infinite-order $U$-statistics (IOUS) [14]. IOUS have attracted renewed interest in the recent statistics and machine learning literature in relation to uncertainty quantification for Breiman’s bagging [3] and random forests [4]. In such applications, the tree-based prediction rules can be thought of as $U$-statistics with deterministic and random kernels, respectively, and their order corresponds to the sub-sample size of the training data [23]. Statistically, the sub-sample size $r$ used to build each tree needs to increase with the total sample size $n$ to produce reliable predictions. As a leading example, we consider the construction of simultaneous prediction intervals for a version of random forests discussed in [23].

Example 1.1 (Simultaneous prediction intervals for random forests).

Consider a training dataset of size $n$, $\{(Y_{1},Z_{1}),\dots,(Y_{n},Z_{n})\}=\{X_{1},\dots,X_{n}\}=X_{1}^{n}$, where $Y_{i}\in\mathcal{Y}$ is a vector of features and $Z_{i}\in\mathbb{R}$ is a response. Let $h$ be a deterministic prediction rule that takes as input a sub-sample $\{X_{i_{1}},\dots,X_{i_{r}}\}$ and outputs predictions on $d$ testing points $(y_{1}^{*},\dots,y_{d}^{*})$ in the feature space $\mathcal{Y}$. Then $U_{n}$ in (1) is the overall prediction obtained by averaging over all possible sub-samples of size $r$.

For random forests [4, 23], the tree-based prediction rule may be constructed with additional randomness: in building a tree or multiple trees based on a sub-sample, the split at each node may only occur on a randomly selected subset of features. Thus, let $\{W_{\iota}:\iota\in I_{n,r}\}$ be a collection of i.i.d. random variables taking values in a measurable space $(S^{\prime},\mathcal{S}^{\prime})$ that are independent of the data $X_{1}^{n}$, and that determine the potential splits for each sub-sample. Here, each $W_{\iota}$ captures the random mechanism in building a prediction function based on $X_{\iota}=(X_{i_{1}},\ldots,X_{i_{r}})$, and the $W_{\iota}$ are assumed to be independent across different sub-samples. Further, let $H:S^{r}\times S^{\prime}\to\mathbb{R}^{d}$ be an $\mathcal{S}^{r}\otimes\mathcal{S}^{\prime}$-measurable function that represents the random forest algorithm, such that $\mathbb{E}[H(x_{1},\ldots,x_{r},W)]=h(x_{1},\ldots,x_{r})$. Then the predictions of random forests are given by a $d$-dimensional $U$-statistic with random kernel $H$:

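$$\widehat{U}_{n}:=|I_{n,r}|^{-1}\sum_{\iota\in I_{n,r}}H(X_{i_{1}},\dots,X_{i_{r}},W_{\iota})=|I_{n,r}|^{-1}\sum_{\iota\in I_{n,r}}H(X_{\iota},W_{\iota}),\tag{2}$$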

where the random kernel $H$ varies with $r$ .

Compared to $U$-statistics with fixed orders (i.e., $r$ being fixed), the analysis of IOUS brings nontrivial computational and statistical challenges due to increasing orders. First, even for a moderately large value of $r$, exact computation of all ${n\choose r}$ possible trees is intractable. For diverging $r$, it is not possible to compute $U_{n}$ in time polynomial in $n$. Second, the variance of the Hájek projection (i.e., the first-order term in the Hoeffding decomposition [19]) of $U_{n}-\theta$ tends to zero as $r\to\infty$. To wit, define a function $g:S\to\mathbb{R}^{d}$ by $g(x_{1})=\mathbb{E}[h(x_{1},X_{2},\ldots,X_{r})]$, and

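$$\sigma_{g,j}^{2}:=\mathbb{E}[(g_{j}(X_{1})-\theta_{j})^{2}]\ \text{ for }1\leqslant j\leqslant d,\qquad\underline{\sigma}_{g}^{2}:=\min_{1\leqslant j\leqslant d}\sigma_{g,j}^{2}.$$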

Then the Hájek projection of $U_{n}-\theta$ is given by $n^{-1}r\sum_{i=1}^{n}(g(X_{i})-\theta)$ . By the orthogonality of the projections, we have

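$$\mathbb{E}[(h_{j}(X_{1},\dots,X_{r})-\theta_{j})^{2}]\geqslant\sum_{1\leqslant i\leqslant r}\mathbb{E}[(g_{j}(X_{i})-\theta_{j})^{2}]=r\sigma_{g,j}^{2}.$$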

Thus the variances of the kernel $h$ and its associated Hájek projection $g$ have different magnitudes. In particular, if the variance of $h_{j}(X_{1},\ldots,X_{r})$ is bounded by a constant $C>0$ (which is often assumed for random forests, cf. [23]), then ${\sigma}^{2}_{g,j}\leqslant C/r$, which vanishes as $r$ diverges. Thus standard Gaussian approximation results in the literature are no longer applicable in our setting, since they require that the componentwise variances are bounded away from zero to avoid degeneracy, i.e., there is an absolute constant $\underline{\sigma}^{2}>0$ such that $\underline{\sigma}^{2}_{g}\geqslant\underline{\sigma}^{2}$ (cf. [11, 10, 6]).

In this work, our focus is to derive computationally tractable and statistically valid sub-sampling procedures for making inference on $\theta$ with a class of high-dimensional random kernels (i.e., large $d$ ) of diverging orders (i.e., increasing $r$ ). To break the computational bottleneck, we consider the incomplete version of $\widehat{U}_{n}$ by sampling (possibly much) fewer terms than $|I_{n,r}|$ . In particular, we consider the Bernoulli sampling scheme introduced in [8]. Given a positive integer $N$ , which represents our computational budget, define the sparsity design parameter $p_{n}:=N/|I_{n,r}|$ , and let $\{Z_{\iota}:\iota\in I_{n,r}\}$ be a collection of i.i.d. Bernoulli random variables with success probability $p_{n}$ , that are independent of the data $X_{1}^{n}$ and $\{W_{\iota}:\iota\in I_{n,r}\}$ . Consider the following incomplete $U$ -statistic (on the data $X_{1}^{n}$ ) with random kernel and weights:

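$$U_{n,N}^{\prime}:=\widehat{N}^{-1}\sum_{\iota\in I_{n,r}}Z_{\iota}H(X_{\iota},W_{\iota}),\quad\text{where }\widehat{N}:=\sum_{\iota\in I_{n,r}}Z_{\iota}.\tag{3}$$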

Obviously, $U^{\prime}_{n,N}$ is an unbiased estimator of $\theta$ and it only involves computing $\widehat{N}$ terms, which on average is much smaller than $|I_{n,r}|$ if $p_{n}\ll 1$ .
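The Bernoulli sampling scheme can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: `incomplete_u_statistic` and `noisy_mean_kernel` are hypothetical names, the random kernel is a toy "linear kernel plus noise" standing in for $H(X_{\iota},W_{\iota})$, and the loop still enumerates every subset, so it only shows the estimator itself and not a way to avoid the ${n\choose r}$ enumeration.

```python
from itertools import combinations
from math import comb

import numpy as np


def incomplete_u_statistic(x, r, random_kernel, budget, rng):
    """Bernoulli-sampled U-statistic with a random kernel.

    Each subset iota is kept independently with probability
    p_n = budget / C(n, r); `random_kernel(sub, rng)` is a hypothetical
    stand-in for H(X_iota, W_iota), with W_iota drawn from `rng`.
    """
    n = len(x)
    p_n = budget / comb(n, r)
    total, n_hat = 0.0, 0
    for iota in combinations(range(n), r):
        if rng.random() < p_n:  # Z_iota ~ Bernoulli(p_n)
            total = total + np.asarray(random_kernel(x[list(iota)], rng), dtype=float)
            n_hat += 1  # N-hat counts the sampled subsets
    if n_hat == 0:
        raise ValueError("no subset was sampled; increase the budget")
    return total / n_hat, n_hat


def noisy_mean_kernel(sub, rng):
    # H = h + independent noise, so that E[H | data] = h (the linear kernel).
    return sub.mean(axis=0) + rng.normal(scale=0.1, size=sub.shape[1])


rng = np.random.default_rng(1)
x = rng.normal(size=(12, 2))
u_inc, n_hat = incomplete_u_statistic(
    x, r=4, random_kernel=noisy_mean_kernel, budget=100, rng=rng
)
```

Here ${12\choose 4}=495$ and the budget is 100, so $p_{n}\approx 0.2$ and only about a fifth of the kernel evaluations are carried out.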

When the kernel $h$ is both deterministic and of fixed order, finite sample bounds for the Gaussian and bootstrap approximations of $U^{\prime}_{n,N}-\theta$ (after a suitable normalization) are established in [8]. Roughly speaking, error bound analysis in [8] has two major steps: $i)$ establish the Gaussian approximation to the Hájek projection, and $ii)$ bound the maximum norm of all higher-order degenerate terms. As discussed above, the first-order Hájek projection in the Hoeffding decomposition is asymptotically vanishing for the IOUS, and we must control the moments of an increasing number of degenerate terms, which makes the analysis of the incomplete IOUS with random kernels substantially more subtle.

In Section 2, we derive non-asymptotic Gaussian approximation error bounds for approximating the distribution of the incomplete IOUS $U^{\prime}_{n,N}$ with random kernels subject to sub-exponential moment conditions. Specifically, our rates of convergence for the Gaussian approximation of $U^{\prime}_{n,N}$ have the explicit dependence on all parameters $(n,N,d,r,\underline{\sigma}_{g}^{2},D_{n})$ , where $D_{n}$ is an upper bound for the $\psi_{1}$ norms of the random kernels (for precise statements, see conditions (C3), (C4), and (C3’) ahead). In particular, asymptotic validity of the Gaussian approximation can be achieved if $\underline{\sigma}_{g}^{-2}r^{2}D_{n}^{2}\log^{7}(dn)=o(n\wedge N)$ . The order of $\underline{\sigma}_{g}^{-2}$ will be application specific. As we shall verify in Section 4, under certain regularity conditions,

$$\underline{\sigma}_{g}^{-2}=O(r^{2}).\tag{4}$$

It is worth noting that (4) is sharp in the sense that for the linear kernel $h(x_{1},\cdots,x_{r})=(x_{1}+\cdots+x_{r})/r$ , we have $\underline{\sigma}_{g}^{-2}\asymp r^{2}$ if $c\leqslant\text{Var}(X_{1j})\leqslant C$ . If further $D_{n}=O(1)$ , $\log(d)=O(\log(n))$ and $n=O(N)$ (i.e., the computational complexity is at least linear in sample size), then the order of $U^{\prime}_{n,N}$ is allowed to increase at the rate of $r=o(n^{1/4-\epsilon})$ for any $\epsilon\in(0,1/4)$ . On the other hand, the dimension may grow exponentially fast in sample size (i.e., $d=O(e^{n^{c}})$ for some constant $c\in(0,1/7)$ ) to maintain the asymptotic validity of Gaussian approximations while $r$ is still allowed to increase at a polynomial rate in $n$ .
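The sharpness claim for the linear kernel can be checked by a short calculation. The following is a sketch that writes $\mu_{j}:=\mathbb{E}[X_{1j}]$ (a symbol introduced here for illustration) and uses only the definitions of $g$ and $\sigma_{g,j}^{2}$ given above.

```latex
\begin{align*}
g_{j}(x_{1}) &= \mathbb{E}\Big[\tfrac{1}{r}\big(x_{1j}+X_{2j}+\cdots+X_{rj}\big)\Big]
              = \frac{x_{1j}+(r-1)\mu_{j}}{r}, \qquad \theta_{j}=\mu_{j},\\
g_{j}(X_{1})-\theta_{j} &= \frac{X_{1j}-\mu_{j}}{r},\\
\sigma_{g,j}^{2} &= \frac{\mathrm{Var}(X_{1j})}{r^{2}}
  \quad\Longrightarrow\quad
  \frac{c}{r^{2}}\leqslant\underline{\sigma}_{g}^{2}\leqslant\frac{C}{r^{2}}
  \ \text{ whenever } c\leqslant\mathrm{Var}(X_{1j})\leqslant C .
\end{align*}
```

With $\underline{\sigma}_{g}^{-2}\asymp r^{2}$, $D_{n}=O(1)$, $\log(d)=O(\log(n))$ and $n=O(N)$, the requirement $\underline{\sigma}_{g}^{-2}r^{2}D_{n}^{2}\log^{7}(dn)=o(n\wedge N)$ reduces to $r^{4}\log^{7}(n)=o(n)$, which holds when $r=o(n^{1/4-\epsilon})$.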

The proof of our Gaussian approximation results for IOUS builds upon a number of recently developed technical tools such as Gaussian approximation results for sums of independent random vectors and $U$-statistics of fixed orders [11, 10, 6, 7], an anti-concentration inequality for Gaussian maxima [9], and an iterative conditioning argument for high-dimensional incomplete $U$-statistics (with the fixed kernel and order) [8]. However, there are three technical innovations in our proof to accommodate the issues of diverging orders and randomness of the kernel. First, we use the iterative renormalization for each dimension of $g$ and also $H$ by its variance. This simple trick turns out to be the crux to avoid the lower bound assumption for Gaussian approximation in the literature [10, 8]. Second, we derive an order-explicit maximal inequality for the expected supremum of the remainder of the Hájek projection of the IOUS (cf. Section 5). This maximal inequality is new in the literature, and our main tools include a symmetrization inequality of [27] and the Bonami inequality [13, Theorem 3.2.2] for the Rademacher chaos, both with explicit dependence on $r$. Third, we develop new tail probability inequalities for $U$-statistics with random kernels by leveraging the independence between $\{W_{\iota},\iota\in I_{n,r}\}$ and the data $X_{1}^{n}$.

In Section 3, we derive computationally tractable and fully data-driven inferential methods of $\theta$ based on the incomplete IOUS when the sample size $n$ , the dimension $d$ , and the order $r$ , are all large. We consider a multiplier bootstrap procedure consisting of two partial bootstraps that are conditionally independent given $X_{1}^{n}$ and $\{W_{\iota},Z_{\iota}:\iota\in I_{n,r}\}$ : one estimates the covariance matrix of the randomized kernel, and the other estimates the Hájek projection. The latter is usually computationally demanding, and we develop a divide and conquer algorithm to maintain the overall computational cost of our multiplier bootstrap procedure at most $O(n^{2}d+B(N+n)d)$ , where $B$ denotes the number of bootstrap iterations. Thus the computational cost of the bootstrap to approximate the sampling distribution for incomplete IOUS can be made independent of the order $r$ , even though $r$ diverges.

In Section 4, we discuss the key non-degeneracy condition (4) for deriving the validity of Gaussian and bootstrap approximations. We provide a general embedding scheme where a Cramér-Rao type lower bound can be established for the minimum $\underline{\sigma}_{g}^{2}$ of the projection variances. Specifically, the lower bound for $r^{2}\underline{\sigma}_{g}^{2}$ only involves the sensitivity of $\mathbb{E}[h(X_{1},\ldots,X_{r})]$ under perturbation and the Fisher information of the embedded family, which in some cases remain constants as $r$ diverges. In non-parametric regressions, there is a natural embedding of the response variable into a location family such that the sensitivity and Fisher information can be explicitly computed.

1.1 Connections to the literature

For univariate $U$-statistics ($d=1$), the asymptotic distributions are derived in the seminal paper [19] for the non-degenerate case. [14] introduced the notion of “infinite-order $U$-statistics” (IOUS) with diverging orders and established the central limit theorem for $U_{n}$ when $d=1$. For univariate IOUS, asymptotic normality can be found in [2, Chapter 4.6], and Berry-Esseen type bounds for IOUS were established by [16, 30, 31]. Further, [23] applied IOUS to construct a prediction interval for one test point. However, (i) [23] does not address the issue that the variance of the Hájek projection is vanishing: the two conditions in Theorem 1 therein, $\mathbb{E}h_{k_{n}}(Z_{1},\ldots,Z_{k_{n}})\leqslant C<\infty$ and $\lim\zeta_{1,k_{n}}\neq 0$, are not compatible based on our previous discussion; (ii) in practice, the size $d$ of a test set may be comparable to or even much larger than the size $n$ of a training set, and the current work is motivated by such considerations. Limit theorems for the related infinite-order $V$-statistics and infinite-order $U$-processes were studied in [28, 18]. High-dimensional Gaussian approximation results and bootstrap methods were established in [11, 10] for sums of independent random vectors, and in [6, 8] for $U$-statistics. We refer readers to these references for an extensive literature review.

Incomplete $U$-statistics were first introduced in [1], and can be viewed as a special case of weighted $U$-statistics. There is a large literature on limit theorems for weighted $U$-statistics; see [26, 24, 22, 25]. The asymptotic distributions of incomplete $U$-statistics (for fixed $d$) were derived in [5] and [20]; see also Section 4.3 in [21] for a review on incomplete $U$-statistics. Recently, incomplete $U$-statistics have gained renewed interest in the statistics and machine learning literature [12, 23]. To the best of our knowledge, the current paper is the first work that establishes distributional approximation theorems for incomplete IOUS with random kernels and increasing orders in high dimensions.

The remainder of the paper is organized as follows. We develop Gaussian approximation results for the above $U$-statistics in Section 2, and bootstrap methods for the variance of the approximating Gaussian distribution in Section 3. We apply the theoretical results to several examples in Section 4. We highlight a maximal inequality in Section 5, and present all other proofs in Appendix A.

1.2 Notation

We write $\text{l.h.s.}\lesssim\text{r.h.s.}$ if there exists a finite and positive absolute constant $C$ such that $\text{l.h.s.}\leqslant C\times\text{r.h.s.}$ . We shall use $c,C,C_{1},C_{2},\dots$ to denote finite and positive absolute constants, whose value may differ from place to place. We denote $X_{i},\ldots X_{i^{\prime}}$ by $X_{i}^{i^{\prime}}$ for $i\leqslant i^{\prime}$ .

For $a,b\in\mathbb{R}$ , let $\lfloor a\rfloor$ denote the largest integer that does not exceed $a$ , $a\vee b=\max\{a,b\}$ and $a\wedge b=\min\{a,b\}$ . For $a,b\in\mathbb{R}^{d}$ , we write $a\leqslant b$ if $a_{j}\leqslant b_{j}$ for $1\leqslant j\leqslant d$ , and write $[a,b]$ for the hyperrectangle $\prod_{j=1}^{d}[a_{j},b_{j}]$ if $a\leqslant b$ . We denote by $\mathcal{R}:=\{\prod_{j=1}^{d}[a_{j},b_{j}]:-\infty\leqslant a_{j}\leqslant b_{j}\leqslant\infty\}$ the collection of hyperrectangles in $\mathbb{R}^{d}$ . Further, for $a\in\mathbb{R}^{d}$ , $r,t\in\mathbb{R}$ , $ra+t$ is a vector in $\mathbb{R}^{d}$ with $j^{th}$ component being $ra_{j}+t$ . For a matrix $A=(a_{ij})$ , denote $\|A\|_{\infty}=\max_{i,j}|a_{ij}|$ . For a diagonal matrix $\Lambda$ with positive diagonal entries, $\Lambda^{-1/2}$ (resp. $\Lambda^{1/2}$ ) is the diagonal matrix, with $j$ -th diagonal entry being $\Lambda_{jj}^{-1/2}$ (resp. $\Lambda_{jj}^{1/2}$ ).

For $\beta>0$ , let $\psi_{\beta}:[0,\infty)\to\mathbb{R}$ be a function defined by $\psi_{\beta}(x)=e^{x^{\beta}}-1$ , and for any real-valued random variable $\xi$ , define $\|\xi\|_{\psi_{\beta}}=\inf\{C>0:\mathbb{E}[\psi_{\beta}(|\xi|/C)]\leqslant 1\}$ . Further, we define a family of functions $\{\widetilde{\psi}_{\beta}(\cdot)\}$ on $[0,\infty)$ indexed by $\beta>0$ . For $\beta\geqslant 1$ , define $\widetilde{\psi}_{\beta}=\psi_{\beta}$ . For $\beta\in(0,1)$ , define $\tau_{\beta}=(\beta e)^{1/\beta}$ , $x_{\beta}=(1/\beta)^{1/\beta}$ , and $\widetilde{\psi}_{\beta}(x)=\tau_{\beta}x\mathbbm{1}_{\{x<x_{\beta}\}}+e^{x^{\beta}}\mathbbm{1}_{\{x\geqslant x_{\beta}\}}$ .
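As a numerical illustration of the $\psi_{\beta}$ norm, the sketch below replaces the expectation in the definition by a sample average and searches for the smallest admissible $C$ by bisection. The function name `empirical_orlicz_norm` and the tolerance are assumptions made for this example, and the result only approximates the population quantity.

```python
import numpy as np


def empirical_orlicz_norm(sample, beta, tol=1e-8):
    """Smallest C with mean(exp((|xi| / C)**beta) - 1) <= 1, by bisection.

    The expectation in the definition is replaced by a sample average,
    so this only approximates the population psi_beta norm.
    """
    a = np.abs(np.asarray(sample, dtype=float))
    if not np.any(a > 0):
        return 0.0

    def excess(c):
        # Positive when c is too small to satisfy the constraint.
        with np.errstate(over="ignore"):
            return np.mean(np.expm1((a / c) ** beta)) - 1.0

    lo, hi = 0.0, float(a.max())
    while excess(hi) > 0:  # grow hi until the constraint holds
        hi *= 2.0
    while hi - lo > tol * hi:  # excess is decreasing in c
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0:
            lo = mid
        else:
            hi = mid
    return hi


norm_const_psi1 = empirical_orlicz_norm(np.ones(10), beta=1.0)
norm_const_psi2 = empirical_orlicz_norm(np.ones(10), beta=2.0)
```

For a variable that equals 1 almost surely the constraint reads $e^{C^{-\beta}}-1\leqslant 1$, so the norm is $(\log 2)^{-1/\beta}$, which the two calls above reproduce up to the bisection tolerance.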

For a generic random variable $Y$ , let $\mathbb{P}_{|Y}(\cdot)$ and $\mathbb{E}_{|Y}[\cdot]$ denote the conditional probability and expectation given $Y$ , respectively. Further, we write “a.s.” for “almost surely” and “w.r.t.” for “with respect to”. Throughout the paper, we assume that $r\geqslant 2$ , $d\geqslant 3$ , $n\geqslant 4$ , $p_{n}:=N/|I_{n,r}|\leqslant 1/2$ .

2 Gaussian approximations for IOUS

In this section, we shall derive non-asymptotic Gaussian approximation error bounds for: (i) the IOUS with random kernel $\widehat{U}_{n}$ in (2), which includes the IOUS with deterministic kernel $U_{n}$ in (1) as a special case, and (ii) the incomplete IOUS $U^{\prime}_{n,N}$ in (3) under the Bernoulli sampling scheme.

Recall that $h(x_{1}^{r})=\mathbb{E}[H(x_{1}^{r},W)]$ , $g(x_{1})=\mathbb{E}[h(x_{1},X^{r}_{2})]$ , $\theta=\mathbb{E}[g(X_{1})]$ , $\sigma_{g,j}^{2}=\mathbb{E}[(g_{j}(X_{1})-\theta_{j})^{2}]$ and $\underline{\sigma}^{2}_{g}=\min_{1\leqslant j\leqslant d}\sigma^{2}_{g,j}$ . Further, define

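$$\Gamma_{g}:=\mathrm{Cov}(g(X_{1})),\qquad\Gamma_{H}:=\mathrm{Cov}(H(X_{1}^{r},W)),\qquad\sigma_{H,j}^{2}:=\mathbb{E}[(H_{j}(X_{1}^{r},W)-\theta_{j})^{2}]\ \text{ for }1\leqslant j\leqslant d.$$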

Clearly, for $1\leqslant j\leqslant d$ , $\sigma^{2}_{H,j}\geqslant\sigma^{2}_{g,j}$ and thus $\underline{\sigma}^{2}_{H}:=\min_{1\leqslant j\leqslant d}\sigma^{2}_{H,j}\geqslant\underline{\sigma}^{2}_{g}$ . Define two $d\times d$ diagonal matrices $\Lambda_{g}$ and $\Lambda_{H}$ such that

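$$\Lambda_{g,jj}:=\sigma_{g,j}^{2}\leqslant\sigma_{H,j}^{2}=:\Lambda_{H,jj}\ \text{ for }1\leqslant j\leqslant d.$$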

Let $Y_{A}$ and $Y_{B}$ be two independent $d$-dimensional zero-mean Gaussian random vectors with covariance matrices $\Gamma_{g}$ and $\Gamma_{H}$, respectively. We may take $Y_{A}$ and $Y_{B}$ to be independent of any other random variables. Further, for any two zero-mean $d$-dimensional random vectors $U$ and $Y$, define

$$\rho(U,Y):=\sup_{R\in\mathcal{R}}\left|\mathbb{P}(U\in R)-\mathbb{P}(Y\in R)\right|,$$

where we recall that $\mathcal{R}:=\{\prod_{j=1}^{d}[a_{j},b_{j}]:-\infty\leqslant a_{j}\leqslant b_{j}\leqslant\infty\}$ is the collection of hyperrectangles in $\mathbb{R}^{d}$ .

Finally, in view of the discussions in the Introduction (Section 1) and to simplify presentation, we assume $\underline{\sigma}_{g}^{2}\leqslant 1$ . Otherwise, the conclusions in this paper hold with $\underline{\sigma}_{g}$ replaced by $\min\{\underline{\sigma}_{g},1\}$ .

2.1 IOUS with random kernel

We start with $\widehat{U}_{n}$ . Define for $1\leqslant j\leqslant d$ , $q>0$ , and $(x_{1},\ldots,x_{r})\in S^{r}$ ,

$$B_{n,j}(x_{1},\dots,x_{r}):=\|H_{j}(x_{1},\dots,x_{r},W)-h_{j}(x_{1},\dots,x_{r})\|_{\psi_{q}}.$$

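We make the following assumptions: there exist $D_{n}\geqslant 1$ and an absolute constant $q>0$ such that

(C1-ND) $\sigma_{g,j}^{2}>0$, for all $j=1,\dots,d$,

(C2) $\mathbb{E}|g_{j}(X_{1})-\theta_{j}|^{4}\leqslant\sigma_{g,j}^{2}D_{n}^{2}$, for all $j=1,\dots,d$,

(C3) $\|h_{j}(X_{1}^{r})-\theta_{j}\|_{\psi_{q}}\leqslant D_{n}$, for all $j=1,\dots,d$,

(C4) $\|B_{n,j}(X_{1}^{r})\|_{\psi_{q}}\leqslant D_{n}$, for all $j=1,\dots,d$.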

Clearly, if $\left|H_{j}(X_{1}^{r},W)\right|\lesssim D_{n}$ a.s. for $1\leqslant j\leqslant d$ , then the latter three conditions hold. Indeed, (C3) and (C4) follow immediately from the definition, and (C2) is due to the observation that $\mathbb{E}|g_{j}(X_{1})-\theta_{j}|^{4}\lesssim\mathbb{E}|g_{j}(X_{1})-\theta_{j}|^{2}D_{n}^{2}=\sigma^{2}_{g,j}D_{n}^{2}$ .

Theorem 2.1.

Assume (C1-ND), (C2), (C3) and (C4) hold. Then

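$$\rho(\sqrt{n}(\widehat{U}_{n}-\theta),rY_{A})\lesssim\left(\frac{r^{2}D_{n}^{2}\log^{q_{*}}(dn)}{\underline{\sigma}_{g}^{2}\,n}\right)^{1/6},$$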

where $q_{*}:=(6/q+1)\vee 7$ , $Y_{A}\;\sim\;N(0,\Gamma_{g})$ and $\lesssim$ means up to a multiplicative constant that only depends on $q$ .

Proof.

See Section A.3. We highlight that a key step in establishing Theorem 2.1 is to control the expected supremum of the remainder of the Hájek projection of the complete IOUS with deterministic kernel (see Theorem 5.1). The Gaussian approximation result for IOUS then follows from Gaussian approximation results for sums of independent random vectors [10] and the anti-concentration inequality [9], by an argument similar to that in [8] with proper normalization. $\blacksquare$

Clearly, in the special case of non-random kernel, i.e., $H(x_{1},\ldots,x_{r},W)=h(x_{1},\ldots,x_{r})$ , (C4) trivially holds. Thus we have the following immediate result for the IOUS with deterministic kernel $U_{n}$ in (1).

Corollary 2.2.

Assume (C1-ND), (C2) and (C3) hold. Then

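$$\rho(\sqrt{n}(U_{n}-\theta),rY_{A})\lesssim\left(\frac{r^{2}D_{n}^{2}\log^{q_{*}}(dn)}{\underline{\sigma}_{g}^{2}\,n}\right)^{1/6},$$

where $q_{*}:=(6/q+1)\vee 7$, and $\lesssim$ means up to a multiplicative constant that only depends on $q$.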

Remark 2.3 (Comparisons with existing results for $d=1$ ).

For the univariate IOUS with non-random kernels, asymptotic normality and its rate of convergence are well understood in literature; see [2] for a survey of results in this direction. In [31], a Berry-Esseen bound is derived for symmetric statistics, which include IOUS (with non-random kernels) as a special case. In particular, applying Corollary 4.1 in [31] to IOUS, the rate of convergence to normality is of order $O(r^{2}n^{-1/2}\sigma_{H}^{2}/\sigma_{g}^{2})$ for a bounded kernel, which implies that asymptotic normality requires (at least) $r=o(n^{1/6})$ . A related Berry-Esseen bound is given in [16]. In both papers, the rates of convergence are suboptimal. For elementary symmetric polynomials (which are $U$ -statistics corresponding to the product kernel $h(x_{1},\dots,x_{r})=x_{1}\cdots x_{r}$ ), it is shown in [30] that the sharp rate of convergence to normality is of order $O(rn^{-1/2})$ , provided that $\operatorname{\mathds{E}}[X_{1}]\neq 0,\text{Var}(X_{1})\in(0,\infty)$ , $\operatorname{\mathds{E}}[|X_{1}|^{3}]<\infty$ and $r=O((\log{n})^{-1}(\log_{2}(n))^{-1}n^{1/2})$ . This result implies that asymptotic normality for the IOUS with the product kernel is achieved when $r=O(\log^{-2}(n)n^{1/2})$ . If $\underline{\sigma}_{g}^{-2}=O(r^{2})$ , which holds under regularity conditions in Lemma 4.1, our Corollary 2.2 with $q=1$ implies that the rate of convergence for high-dimensional IOUS is $O((r^{4}\log^{7}(dn)n^{-1})^{1/6})$ (with suitably bounded moments). In particular, Gaussian approximation is asymptotically valid if $\log{d}=O(\log{n})$ and $r=o(n^{1/4-\epsilon})$ for any $\epsilon\in(0,1/4)$ . Even though our result is valid for a smaller range of $r$ and the rate is slower than the optimal rate in the case $d=1$ , Corollary 2.2 does allow the dimension to grow sub-exponentially fast in sample size, which is a useful feature for high-dimensional statistical inference. 
In addition, to the best of our knowledge, the validity of the bootstrap procedures proposed in Section 3 to approximate the sampling distribution of IOUS (on hyperrectangles in $\mathbb{R}^{d}$) is new in the literature.

2.2 Incomplete IOUS with random kernel

Now we consider $U_{n,N}^{\prime}$ , where we recall that $N$ is some given computational budget. We will assume the following conditions: for $q>0$ ,

[TABLE]

Clearly, (C4) and (C3’) imply (C3) up to a multiplicative constant. Further, (C3’) and (C5) hold if $\left|H_{j}(X_{1}^{r},W)\right|\lesssim D_{n}$ a.s. for $1\leqslant j\leqslant d$ .

Theorem 2.4.

Assume (C1-ND), (C2), (C4), (C3’) and (C5) hold. Then

[TABLE]

where $\alpha_{n}:=n/N$ , $q_{1}:=2\vee(2/q)$ , $q_{*}:=(6/q+1)\vee 7$ , $\lesssim$ means up to a multiplicative constant that only depends on $q$ , and we recall that $Y_{A}\sim N(0,\Gamma_{g})$ , $Y_{B}\sim N(0,\Gamma_{H})$ and $Y_{A},Y_{B}$ are independent.

Proof.

See Section A.4.4. $\blacksquare$

Remark 2.5.

If $q\geqslant 1$ , then $q_{1}=2$ and $q_{*}=7$ . Since $\|\xi\|_{\psi_{1}}\lesssim\|\xi\|_{\psi_{q}}$ for any random variable $\xi$ and $q\geqslant 1$ , we may assume without loss of generality that $q\leqslant 1$ in the proof. When $r$ is fixed, $q=1$ , the kernel is deterministic, and there exists some absolute constant $\sigma^{2}>0$ such that $\underline{\sigma}_{g}^{2}\geqslant\sigma^{2}$ , then the above Theorem recovers Theorem 3.1 from [8].

Further, by first conditioning on $X_{1}^{n}$ , we have

[TABLE]

where for two square matrices, $A\succeq B$ means $A-B$ is positive semi-definite. Thus the random kernel $H(\cdot)$ increases the variance of the approximating Gaussian distribution compared to the associated deterministic kernel $h(\cdot)$ .
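The variance inflation caused by a random kernel can be checked exactly on a toy model. The sketch below is only an illustration under stated assumptions: it uses a hypothetical one-dimensional kernel of a single argument, $H(x,w)=x+wx$ , with $X$ uniform on $\{0,1,2\}$ and $W$ uniform on $\{-1,1\}$ , none of which come from the paper. It verifies the law of total variance, $\text{Var}(H)=\text{Var}(h)+\mathbb{E}[\text{Var}(H\mid X)]$ , which is the scalar analogue of the ordering obtained by conditioning on $X_{1}^{n}$ .

```python
from fractions import Fraction
from itertools import product


def variance(pairs):
    """Variance of a finite distribution given as (value, probability) pairs."""
    mean = sum(p * v for v, p in pairs)
    return sum(p * (v - mean) ** 2 for v, p in pairs)


# Hypothetical toy model (an assumption, not taken from the paper):
# X uniform on {0, 1, 2}, W uniform on {-1, 1}, independent of X.
xs = [Fraction(0), Fraction(1), Fraction(2)]
ws = [Fraction(-1), Fraction(1)]
p_x = Fraction(1, len(xs))
p_w = Fraction(1, len(ws))


def H(x, w):
    """Random kernel; its deterministic counterpart is h(x) = E[H(x, W)] = x."""
    return x + w * x


# Variance of the random kernel H(X, W).
var_H = variance([(H(x, w), p_x * p_w) for x, w in product(xs, ws)])
# Variance of the associated deterministic kernel h(X).
var_h = variance([(sum(p_w * H(x, w) for w in ws), p_x) for x in xs])
# Average conditional variance of H given X.
mean_cond_var = sum(p_x * variance([(H(x, w), p_w) for w in ws]) for x in xs)

print(var_H, var_h, mean_cond_var)
```

Exact rational arithmetic is used so that the decomposition holds as an identity rather than up to rounding.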

3 Bootstrap approximations

In Section 2.2, we have seen that the incomplete $U$ -statistic with random kernel is approximated by a Gaussian distribution $N(0,r^{2}\Gamma_{g}+\alpha_{n}\Gamma_{H})$ . However, the covariance term is typically unknown in practice. In this section, we will estimate $\Gamma_{g}$ and $\Gamma_{H}$ by bootstrap methods.

3.1 Bootstrap for $\Gamma_{H}$

Let $\mathcal{D}_{n}:=\{X_{1},\ldots,X_{n}\}\cup\{W_{\iota},Z_{\iota}:\iota\in I_{n,r}\}$ be the data involved in the definition of $U^{\prime}_{n,N}$ , and take a collection of independent $N(0,1)$ random variables $\{\xi^{\prime}_{\iota}:\iota\in I_{n,r}\}$ that is independent of the data $\mathcal{D}_{n}$ . Define the following bootstrap distribution:

[TABLE]

The next theorem establishes the validity of $U_{n,B}^{\#}$ .

Theorem 3.1.

Assume the conditions (C1-ND), (C2), (C4), (C3’) and (C5) hold. If

[TABLE]

for $q_{1}:=2\vee(2/q)$ , $q_{2}:=(4/q+1)\vee 5$ , some constants $C_{1}>0$ and $\zeta\in(0,1)$ , then there exists a constant $C$ depending only on $q$ , $C_{1}$ and $\zeta$ such that with probability at least $1-C/n$ ,

[TABLE]

Proof.

See Section A.5.1. $\blacksquare$

3.2 Bootstrap for the approximating Gaussian distribution

Let $S_{1}\subset\{1,\ldots,n\}$ , and $n_{1}=|S_{1}|$ . Further, consider a collection of $\mathcal{D}_{n}$ -measurable $\mathbb{R}^{d}$ -valued random vectors $\{G_{i_{1}}:i_{1}\in S_{1}\}$ , where $G_{i_{1}}$ is some “good” estimator of $g(X_{i_{1}})$ , and its form is specified later. We use the following quantity to measure the quality of $G_{i_{1}}$ as an estimator of $g(X_{i_{1}})$ :

[TABLE]

Define $\overline{G}:=\frac{1}{n_{1}}\sum_{i_{1}\in S_{1}}G_{i_{1}}$ and consider the following bootstrap distribution for $N(0,\Gamma_{g})$ :

[TABLE]

where $\{\xi_{i_{1}}:i_{1}\in S_{1}\}$ is a collection of independent $N(0,1)$ random variables that is independent of $\mathcal{D}_{n}$ and $\{\xi^{\prime}_{\iota}:\iota\in I_{n,r}\}$ .
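A Gaussian multiplier bootstrap of this kind can be sketched in a few lines. The displayed definition is not reproduced here, so the code assumes the common form $n_{1}^{-1/2}\sum_{i_{1}\in S_{1}}\xi_{i_{1}}(G_{i_{1}}-\overline{G})$ ; that form, together with the function names `multiplier_bootstrap_draw` and `conditional_variances`, is an assumption made for illustration.

```python
import math
import random


def multiplier_bootstrap_draw(G, xi=None, rng=random):
    """One draw of n1^{-1/2} * sum_i xi_i * (G_i - G_bar).

    G is a list of n1 vectors of length d (the estimators G_{i_1}).
    xi is an optional list of n1 multipliers; if omitted, i.i.d. N(0, 1)
    multipliers are generated with rng.
    """
    n1, d = len(G), len(G[0])
    if xi is None:
        xi = [rng.gauss(0.0, 1.0) for _ in range(n1)]
    g_bar = [sum(row[j] for row in G) / n1 for j in range(d)]
    return [
        sum(xi[i] * (G[i][j] - g_bar[j]) for i in range(n1)) / math.sqrt(n1)
        for j in range(d)
    ]


def conditional_variances(G):
    """Coordinate-wise variances of a draw, conditional on G (assumed form)."""
    n1, d = len(G), len(G[0])
    g_bar = [sum(row[j] for row in G) / n1 for j in range(d)]
    return [sum((row[j] - g_bar[j]) ** 2 for row in G) / n1 for j in range(d)]


if __name__ == "__main__":
    G = [[0.0, 2.0], [2.0, 0.0], [4.0, 4.0], [2.0, 2.0]]
    print(multiplier_bootstrap_draw(G, rng=random.Random(0)))
    print(conditional_variances(G))
```

Passing `xi` explicitly makes a draw reproducible, which is convenient when checking the conditional variances against a simulation.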

Lemma 3.2.

Assume the conditions (C1-ND), (C2) and (C3’) hold. If

[TABLE]

for $q_{2}:=(4/q+1)\vee 5$ , some constant $C_{1}$ , and $\zeta_{1},\zeta_{2}\in(0,1)$ , then there exists a constant $C$ depending only on $q$ , $C_{1}$ and $\zeta_{1}$ such that with probability at least $1-C/n$ ,

[TABLE]

where we recall that $Y_{A}\;\sim\;N(0,\Gamma_{g})$ .

Proof.

See Subsection A.5.2. $\blacksquare$

Hereafter we consider a special case of the divide and conquer bootstrap algorithm in [8] to estimate $\Gamma_{g}$ . For each $i_{1}\in S_{1}$ , partition the remaining indices, $\{1,\ldots,n\}\setminus\{i_{1}\}$ , into disjoint subsets $\{S_{2,k}^{(i_{1})}:k=1,\ldots,K\}$ , each of size $L=r-1$ , where $K=\lfloor(n-1)/(r-1)\rfloor$ .
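The partition step can be sketched as follows. The text does not say how the blocks are chosen, so this sketch assumes consecutive blocks of the remaining indices in increasing order and simply leaves out any leftover indices; the function name `partition_remaining` is hypothetical.

```python
def partition_remaining(n, i1, r):
    """Split {1, ..., n} minus {i1} into K disjoint blocks of size L = r - 1.

    K = floor((n - 1) / (r - 1)). Indices are 1-based, as in the text.
    Assumption: blocks are consecutive runs of the remaining indices; the
    (n - 1) - K * L leftover indices, if any, are not used.
    """
    if r < 2 or not 1 <= i1 <= n:
        raise ValueError("need r >= 2 and 1 <= i1 <= n")
    L = r - 1
    K = (n - 1) // L
    remaining = [i for i in range(1, n + 1) if i != i1]
    return [remaining[k * L:(k + 1) * L] for k in range(K)]


if __name__ == "__main__":
    # n = 7, i1 = 3, r = 3 gives K = 3 blocks of size 2.
    print(partition_remaining(7, 3, 3))
```

A random permutation of the remaining indices before slicing would serve equally well; the consecutive choice only keeps the example deterministic.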

Now define for each $i_{1}\in S_{1}$ and $k=1,\ldots,K$ ,

[TABLE]

Finally, define

[TABLE]

Theorem 3.3.

Assume the conditions (C1-ND), (C2), (C4), (C3’) and (C5) hold. If

[TABLE]

for $q_{1}:=2\vee(2/q)$ , $q_{*}:=(6/q+1)\vee 7$ , and some constants $C_{1}>0$ , $\zeta\in(0,1)$ , then for any $\nu\in\left(\max\{7/6,1/\zeta\},\infty\right)$ , there exists a constant $C$ depending only on $q$ , $\zeta$ , $\nu$ and $C_{1}$ such that with probability at least $1-C/n$ ,

[TABLE]

Proof.

See Subsection A.5.3. $\blacksquare$

3.3 Simultaneous confidence intervals

We first combine the Gaussian approximation result with the bootstrap result.

Corollary 3.4.

Assume (C1-ND), (C2), (C4), (C3’) and (C5) hold. Further, assume that for some constants $C_{1}>0$ , $\zeta\in(0,1)$ , (12) holds. Then there exists a constant $C$ depending only on $q$ , $C_{1}$ and $\zeta$ such that with probability at least $1-C/n$ ,

[TABLE]

Proof.

It follows from Theorem 2.4 and Theorem 3.3 (with $\nu=7/\zeta$ ). $\blacksquare$

In simultaneous confidence interval construction, it is sometimes desirable to normalize the variance of each dimension, so that if we use maximum-type statistics, the critical value is not dominated by terms with large variance. Define for $1\leqslant j\leqslant d$ ,

[TABLE]

which are the diagonal elements in the conditional covariance matrices of $U_{n,A}^{\#}$ (10) and $U_{n,B}^{\#}$ (7) respectively. Further, define a $d\times d$ diagonal matrix $\widehat{\Lambda}$ with

[TABLE]

Corollary 3.5.

Assume the conditions in Corollary 3.4. Then there exists a constant $C$ depending only on $q$ , $C_{1}$ and $\zeta$ such that with probability at least $1-C/n$ ,

[TABLE]

Consequently,

[TABLE]

Proof.

See Subsection A.5.4. $\blacksquare$

Remark 3.6.

From Corollary 3.5, we can immediately construct confidence intervals for $\theta$ in a data-dependent way. Specifically, let $\widehat{q}_{1-\alpha}$ be a $(1-\alpha)^{th}$ quantile of the conditional distribution of $\|\widehat{\Lambda}^{-1/2}U^{\#}_{n,n_{1}}\|_{\infty}$ given $\mathcal{D}_{n}$ . Then one way to construct simultaneous confidence intervals with confidence level $(1-\alpha)$ is as follows: for $1\leqslant j\leqslant d$ , $U_{n,N,j}^{\prime}\;\pm\;\widehat{q}_{1-\alpha}\;n^{-1/2}\widehat{\Lambda}^{1/2}_{j,j}$ .
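The interval construction in Remark 3.6 can be sketched as below. The inputs are assumed to be available already: the point estimates $U_{n,N,j}^{\prime}$ , the diagonal entries $\widehat{\Lambda}_{j,j}$ , and a list of simulated values of $\|\widehat{\Lambda}^{-1/2}U^{\#}_{n,n_{1}}\|_{\infty}$ drawn conditionally on the data. Approximating $\widehat{q}_{1-\alpha}$ by an empirical quantile of those simulated values is a Monte Carlo assumption of this sketch, and the function names are hypothetical.

```python
import math


def empirical_quantile(samples, level):
    """Smallest order statistic with at least a `level` fraction of samples at or below it."""
    ordered = sorted(samples)
    k = max(1, math.ceil(level * len(ordered)))
    return ordered[k - 1]


def simultaneous_intervals(u_hat, lambda_diag, boot_max_norms, n, alpha):
    """Intervals u_hat[j] +/- q_hat * n^{-1/2} * sqrt(lambda_diag[j]) for each coordinate j.

    boot_max_norms: simulated sup-norms of the normalized bootstrap statistic.
    """
    q_hat = empirical_quantile(boot_max_norms, 1.0 - alpha)
    intervals = []
    for u_j, lam_j in zip(u_hat, lambda_diag):
        half_width = q_hat * math.sqrt(lam_j) / math.sqrt(n)
        intervals.append((u_j - half_width, u_j + half_width))
    return intervals


if __name__ == "__main__":
    draws = [8, 1, 7, 2, 6, 3, 5, 4]  # placeholder bootstrap sup-norms
    print(simultaneous_intervals([0.0, 1.0], [4.0, 1.0], draws, n=4, alpha=0.25))
```

Because every coordinate shares the same critical value $\widehat{q}_{1-\alpha}$ , the interval widths differ only through $\widehat{\Lambda}^{1/2}_{j,j}$ , which is the purpose of the normalization.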

4 Applications

In many applications, $g(x)=\mathbb{E}[h(x,X_{2},\ldots,X_{r})]$ does not admit an explicit form, and thus it is usually hard to compute $\underline{\sigma}_{g}$ in conditions (C1-ND) and (12) directly. When the kernel $h$ has special structures, we can establish a lower bound on $\underline{\sigma}_{g}$ with explicit dependence on $r$ , which can be applied to Example 1.1. We shall give additional examples in Sections 4.3 and 4.4 to illustrate the usefulness of $U$ -statistics as a tool to estimate and make inference on certain statistical functionals of $X_{1},\dots,X_{r}$ . In Section 4.3 for the expected maximum and log-mean functionals, we also establish a lower bound on $\underline{\sigma}_{g}$ with explicit dependence on $r$ . In Section 4.4 for the kernel density estimation problem, $r$ is assumed to be fixed, but we allow the diameter of the design points to diverge.

For simplicity of presentation, in this section we assume that all derivatives and integrals involved exist and are finite, and that the order of integration, as well as the order of integration and differentiation, can be exchanged. These assumptions can be justified under standard smoothness and moment conditions. For illustration, we use $q=1$ in (C4) and (C3’).

4.1 Lower bound for $\underline{\sigma}_{g}$

Suppose that the distribution $P$ of $X_{1}$ has a density function $f_{0}$ with respect to some $\sigma$ -finite (reference) measure $\mu$ , i.e.,

[TABLE]

We first embed $f_{0}$ into a family of densities $\{f_{\beta}:\beta\in B\subset\mathbb{R}^{\ell}\}$ , where $B$ is an open neighborhood of $0\in\mathbb{R}^{\ell}$ . Such embeddings always exist and below are some examples for $S=\mathbb{R}^{\ell}$ .

1. Location and scale family. If $\mu$ is the Lebesgue measure on $\mathbb{R}^{\ell}$ , we may consider the following location or scaling families: for $x\in\mathbb{R}^{\ell}$ ,

[TABLE]

2. Exponential family. If $\phi(\beta):=\log\left(\int f_{0}(x)e^{\beta^{T}x}\mu(dx)\right)<\infty$ for $\beta\in B$ , then we may consider the exponential family:

[TABLE]

3. Additive noise model. Let $\Upsilon$ be an $\mathbb{R}^{\ell}$ -valued random vector independent of $X_{1}$ , whose distribution is absolutely continuous w.r.t. $\mu$ . Then $X_{1}+\beta\Upsilon$ has a density $f_{\beta}$ given by the convolution of those of $X_{1}$ and $\beta\Upsilon$ .

For $\beta\in B$ , define the following perturbed expectation

[TABLE]

where $\mathbb{E}_{\beta}$ denotes the expectation when $X_{1},\ldots,X_{r}$ have density $f_{\beta}$ . Further, define

[TABLE]

where $\nabla$ denotes the gradient (or derivative when $\beta$ is a scalar) with respect to $\beta$ and $\text{Var}_{\beta}$ denotes the covariance matrix when $X_{1},\ldots,X_{r}$ have the density $f_{\beta}$ . Thus $\Psi(\beta)$ is the score function and $\mathcal{J}(\beta)$ is the Fisher-information for a single observation.

Lemma 4.1.

If we assume $\mathcal{J}(0)$ is positive definite, then

[TABLE]

In particular, if there exists an absolute positive constant $c$ such that

[TABLE]

then $\underline{\sigma}_{g}^{2}\geqslant cr^{-2}$ .

Proof.

See Subsection A.6. $\blacksquare$

4.2 Simultaneous prediction intervals for random forests

Consider Example 1.1 and assume that $(Y_{1},Z_{1})$ has density $q(y)p(z;y)$ w.r.t. the product measure $\nu(dy)\otimes dz$ on $\mathcal{Y}\times\mathbb{R}$ , i.e., for $A_{1}\in\mathcal{B}(\mathcal{Y}),A_{2}\in\mathcal{B}(\mathbb{R})$ ,

[TABLE]

That is, the feature $Y_{1}$ has the density $q(y)$ w.r.t. some $\sigma$ -finite measure $\nu$ on $\mathcal{Y}$ , and thus is allowed to have both continuous and discrete components. The response $Z_{1}$ given $Y_{1}=y$ has a conditional density $p(z;y)$ w.r.t. the Lebesgue measure.

For many regression algorithms such as tree based methods, if we fix the features and increase the responses of training samples by $\beta\in\mathbb{R}$ , the prediction at any test point will increase by $\beta$ , i.e., $\text{ for }1\leqslant j\leqslant d$ ,

[TABLE]

which implies that $h\left((y_{1},z_{1}+\beta),\ldots,(y_{r},z_{r}+\beta)\right)=h\left((y_{1},z_{1}),\ldots,(y_{r},z_{r})\right)+\beta$ . Now we consider the embedding into the “location” family $\{q(y)p(z-\beta;y):\beta\in\mathbb{R}\}$ . Observe that

[TABLE]

which implies that $\theta_{j}^{\prime}(0)=1$ . In addition,

[TABLE]

Thus if we assume that there exists $c$ such that

[TABLE]

then (13) reduces to $\underline{\sigma}_{g}^{2}\geqslant cr^{-2}$ . If further we assume that $H_{j}(X_{1}^{r},W)\leqslant C$ a.s. for some constant $C$ and each $1\leqslant j\leqslant d$ (this holds for example when the response is bounded a.s.), then the conditions (C2), (C3), (C4) and (C5) hold with $D_{n}=\ln^{-2}(2)C$ . With these assumptions, the condition (12) in Corollary 3.5 simplifies as

[TABLE]

Thus if $r=O(n^{1/4-\epsilon})$ for some $\epsilon>0$ , $\log(d)=O(\log(n))$ , and $n=O(n_{1}\wedge N)$ , then Corollary 3.5 can be used to construct asymptotically valid simultaneous prediction intervals with the error of approximation decaying polynomially fast in $n$ .

Remark 4.2 (Fisher information in nonparametric regressions).

Let us take a closer look at the condition (14). Consider the nonparametric regression model

[TABLE]

where $\kappa:\mathcal{Y}\to\mathbb{R}$ is a deterministic measurable function, and $\epsilon_{1},\ldots,\epsilon_{n}$ are i.i.d. with some density $f$ with respect to the Lebesgue measure. Then $p(z;y)=f(z-\kappa(y))$ and thus

[TABLE]

where for the last equality, we first perform integration w.r.t. $dz$ and apply a change of variable. Thus $\mathcal{J}(0)$ depends only on the density of the noise.
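For a concrete noise density, the Fisher information of a location family is the classical quantity $\int (f^{\prime}(z))^{2}/f(z)\,dz$ , which equals $1/\sigma^{2}$ for $N(0,\sigma^{2})$ noise. The sketch below approximates this integral numerically; the Gaussian noise, the integration range, and the function names are assumptions chosen for illustration.

```python
import math


def location_fisher_information(density, lo, hi, steps=20000, h=1e-5):
    """Midpoint-rule approximation of the integral of f'(z)^2 / f(z) over [lo, hi].

    The derivative f' is approximated by a central finite difference of width h.
    """
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        z = lo + (i + 0.5) * width
        f = density(z)
        if f <= 0.0:
            continue  # skip points where the density vanishes
        df = (density(z + h) - density(z - h)) / (2.0 * h)
        total += df * df / f * width
    return total


def gaussian_density(sigma):
    """Density of N(0, sigma^2), used here as a hypothetical noise distribution."""
    def f(z):
        return math.exp(-0.5 * (z / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
    return f


if __name__ == "__main__":
    sigma = 2.0
    approx = location_fisher_information(gaussian_density(sigma), -10 * sigma, 10 * sigma)
    print(approx, 1.0 / sigma ** 2)
```

The regression function $\kappa$ does not enter the computation, in line with the remark that only the noise density matters.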

4.3 Expected maximum and log-mean functionals

Next we compute the lower bounds on $\underline{\sigma}_{g}^{2}$ for two additional statistical functionals.

Example 4.3.

Let $S=\mathbb{R}^{d}$ and consider the following two kernels: for $1\leqslant j\leqslant d$ ,

[TABLE]

In the former case, we are interested in estimating the expectation for the coordinate-wise maxima of $r$ independent random vectors, $\{\mathbb{E}[\max_{1\leqslant i\leqslant r}X_{ij}]:1\leqslant j\leqslant d\}$ . In the latter, we assume $X_{1j}>0$ for $1\leqslant j\leqslant d$ and are interested in estimating $\{\mathbb{E}[\log(r^{-1}\sum_{i=1}^{r}X_{ij})]:1\leqslant j\leqslant d\}$ . In both cases, the coordinates of $X_{1}$ can have arbitrary dependence, and we allow $r\to\infty$ .
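For the first kernel, a randomized approximation of the $U$ -statistic can be sketched as follows. This is a simplified stand-in: it averages the coordinate-wise maxima over size- $r$ subsets drawn uniformly at random, independently across draws, which is an assumption of the sketch and not necessarily the sampling scheme used to define $U_{n,N}^{\prime}$ . The function name `incomplete_max_ustat` is hypothetical.

```python
import random


def incomplete_max_ustat(data, r, num_subsets, seed=0):
    """Average of coordinate-wise maxima over randomly drawn size-r subsets of rows.

    data: list of n vectors of length d.
    num_subsets: number of random subsets (a stand-in for the computational budget).
    """
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    totals = [0.0] * d
    for _ in range(num_subsets):
        subset = rng.sample(range(n), r)  # r distinct row indices
        for j in range(d):
            totals[j] += max(data[i][j] for i in subset)
    return [t / num_subsets for t in totals]


if __name__ == "__main__":
    gen = random.Random(1)
    sample = [[gen.gauss(0.0, 1.0) for _ in range(3)] for _ in range(200)]
    print(incomplete_max_ustat(sample, r=10, num_subsets=500))
```

The cost is of order `num_subsets * r * d` kernel operations, compared with $|I_{n,r}|$ kernel evaluations for the complete statistic.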

Consider the first kernel in Example 4.3, where $S=\mathbb{R}^{d}$ , and $h_{j}(x_{1},\ldots,x_{r})=\max_{1\leqslant i\leqslant r}x_{ij}$ for $1\leqslant j\leqslant d$ . Assume $X_{1j}$ has a density $f_{j}$ w.r.t. the Lebesgue measure on $\mathbb{R}$ for $1\leqslant j\leqslant d$ , and we consider the following embedding $\{f_{j}(\cdot-\beta):\beta\in\mathbb{R}\}$ . As in the previous example, for $\beta\in\mathbb{R}$

[TABLE]

Thus, by Lemma 4.1, if we assume for some absolute positive constant $c$

[TABLE]

we have $\underline{\sigma}_{g}^{2}\geqslant cr^{-2}$ . Further, if we assume that there exists a positive constant $C$ such that

[TABLE]

then by the maximal inequality (e.g., see [29, Lemma 2.2.2]), $\|\max_{1\leqslant i\leqslant r}X_{ij}\|_{\psi_{1}}\lesssim\log(r)$ . Then if we select $D_{n}=C^{\prime}\underline{\sigma}_{g}^{-1}\log^{2}(r)$ , the conditions (C2), (C3) and (C5) hold. Further, (C4) trivially holds for non-random kernels. With the above assumptions and this choice of $D_{n}$ , the condition (12) in Corollary 3.5 simplifies to $(n_{1}\wedge N)^{-1}{r^{6}\log^{4}(r)\log^{7}(dn)}\leqslant C_{1}n^{-\zeta}$ .

Now consider the second kernel in Example 4.3, where $h_{j}(x_{1},\ldots,x_{r})=\log\left(r^{-1}\sum_{i=1}^{r}x_{ij}\right)$ and $X_{1j}>0$ for $1\leqslant j\leqslant d$ . Assume $X_{1j}$ has a density $f_{j}$ w.r.t. the Lebesgue measure on $\mathbb{R}$ for $1\leqslant j\leqslant d$ , and consider the following embedding $\{(1+\beta)f_{j}((1+\beta)\cdot):\beta\in(-1,1)\}$ . As before, it is easy to see that for $1\leqslant j\leqslant d$ ,

[TABLE]

Thus if there exists a constant $c$ such that $\max_{1\leqslant j\leqslant d}\mathcal{J}_{j}(0)\leqslant c^{-1}$ , then $\underline{\sigma}_{g}^{2}\geqslant cr^{-2}$ . Further, if there exists a constant $C>0$ such that

[TABLE]

then the conditions (C2), (C3), (C4) and (C5) hold with $D_{n}=\ln^{-1}(2)\log(C)$ . With these assumptions, the condition (12) in Corollary 3.5 simplifies as $(n_{1}\wedge N)^{-1}{r^{4}\log^{7}(dn)}\leqslant C_{1}n^{-\zeta}$ .

4.4 Kernel density estimation

Example 4.4 (Kernel density estimation).

Let $\tau:S^{r}\to\mathbb{R}^{\ell}$ be a measurable function that is symmetric in its $r$ arguments, and $\{t_{j}:1\leqslant j\leqslant d\}\subset\mathbb{R}^{\ell}$ be $d$ design points. [15, 17] used $U_{n}$ as a kernel density estimator (KDE) for the density of $\tau(X_{1},\ldots,X_{r})$ at the given design points with

[TABLE]

where $b_{n}>0$ is a bandwidth parameter, and $\kappa(\cdot)$ is the density estimation kernel with $\int\kappa(z)dz=1$ , which should not be confused with the $U$ -statistic kernel $h$ . For this example, we will assume $r$ fixed and the bandwidth $b_{n}\to 0$ , but allow the diameter of the design points, $\max_{1\leqslant j\leqslant d}\|t_{j}\|$ , to grow, where $\|\cdot\|$ denotes the usual Euclidean norm.
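A complete $U$ -statistic density estimate of this type can be sketched for small samples. The displayed kernel is not reproduced here, so the code assumes the usual form $h_{j}(x_{1},\ldots,x_{r})=b_{n}^{-1}\kappa\left((t_{j}-\tau(x_{1},\ldots,x_{r}))/b_{n}\right)$ with $\ell=1$ and a Gaussian $\kappa$ ; these choices and the function names are assumptions for illustration.

```python
import math
from itertools import combinations


def gaussian_kernel(z):
    """Standard normal density, used as the smoothing kernel kappa."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def ustat_kde(data, tau, r, design_points, bandwidth, kappa=gaussian_kernel):
    """Density estimate of tau(X_1, ..., X_r) at each design point (scalar case).

    Averages kappa((t - tau(subset)) / bandwidth) / bandwidth over all
    size-r subsets of the data, so it is only practical for small n and r.
    """
    subsets = list(combinations(data, r))
    estimates = []
    for t in design_points:
        total = sum(kappa((t - tau(*xs)) / bandwidth) for xs in subsets)
        estimates.append(total / (len(subsets) * bandwidth))
    return estimates


if __name__ == "__main__":
    sample = [0.1, -0.4, 0.7, 1.2, -1.0, 0.3]
    # tau(x1, x2) = x1 + x2 is symmetric in its two arguments.
    print(ustat_kde(sample, lambda a, b: a + b, 2, [-1.0, 0.0, 1.0], bandwidth=0.5))
```

The function `tau` must be symmetric in its arguments, since `combinations` visits each unordered subset once.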

Assume that given $X_{1}=x_{1}$ , $\tau(x_{1},X_{2}^{r})$ has a density $f(z;x_{1})$ w.r.t. the Lebesgue measure on $\mathbb{R}^{\ell}$ , i.e., $\mathbb{P}\left(\,\tau(x_{1},X_{2},\ldots,X_{r})\in A\right)=\int_{A}f(z;x_{1})dz$ for any $A\in\mathcal{B}(\mathbb{R}^{\ell})$ . Then by definition, for $1\leqslant j\leqslant d$ ,

[TABLE]

For $t\in\mathbb{R}^{\ell}$ , denote

[TABLE]

As in [15], if $\int\kappa^{2}(z)dz<\infty$ and $\sup_{t}\mathbb{E}[f^{2}(t;X_{1})]<\infty$ , then $\lim_{n\to\infty}\mathcal{V}_{n}(t)=\mathcal{V}(t)$ for any fixed $t$ . If there exists some $R>0$ such that $\max_{1\leqslant j\leqslant d}\|t_{j}\|\leqslant R$ for any $d\in\mathbb{N}$ and $\inf_{t\in\mathbb{R}^{\ell}:\|t\|\leqslant R}\mathcal{V}(t)>0$ , then under mild continuity assumptions (e.g. the equicontinuity of $\mathcal{V}_{n}(t)$ ), there exists an absolute constant $c>0$ such that $\underline{\sigma}_{g}^{2}\geqslant c$ for large $n$ . Then we can apply the result in [8], which does not allow $\underline{\sigma}_{g}^{2}$ to vanish.

In this work, we allow $\underline{\sigma}_{g}^{2}$ to vanish, and thus allow the diameter of the design points to grow as $n$ becomes large. Specifically, if we assume $\kappa(\cdot)$ is bounded by some constant $C$ , we can select $D_{n}=\ln^{-1}(2)Cb_{n}^{-1}$ in conditions (C2), (C3), (C4) and (C5). Then the condition (12) in Corollary 3.5 simplifies as

[TABLE]

Thus if $\log(d)=O(\log(n))$ and $n=O(n_{1}\wedge N)$ , to apply Corollary 3.5, we require that $\underline{\sigma}_{g}^{-2}=O(b_{n}^{2}n^{1-\epsilon})$ for any $\epsilon>0$ .

Remark 4.5.

[15] considers the case $d=1$ , and shows the $\sqrt{n}$ -convergence rate of the KDE. The same discussion applies here. [17] constructs confidence bands (without computational considerations and bootstrap results) for the density of $\tau(X_{1}^{r})$ , under the additional assumptions required to establish the convergence of empirical processes.

5 Maximal inequality

In this section, we derive an upper bound on the expected supremum of the remainder of the Hájek projection of the complete IOUS with deterministic kernel. This maximal inequality (with the explicit dependence on $r$ ) serves as a key step to establish the Gaussian approximation result for the incomplete IOUS with random kernel.

Theorem 5.1.

Assume (C3) holds. Then there exist constants $c,C$ , depending only on $q$ , such that if ${r^{2}\log(d)}/{n}\leqslant c$ , then

[TABLE]

The proof of Theorem 5.1 is quite involved: we need to develop a number of technical tools such as the symmetrization inequality and Bonami inequality (i.e., exponential moment bound) for the Rademacher chaos, all with the explicit dependence on $r$ .

We start with some notation. Let $X^{\prime}:=(X_{1}^{\prime},\ldots,X_{n}^{\prime})$ be an independent copy of $X:=(X_{1},\ldots,X_{n})$ , and $\epsilon:=(\epsilon_{1},\ldots,\epsilon_{n})$ be i.i.d. Rademacher random variables, i.e., $\mathbb{P}(\epsilon_{1}=1)=\mathbb{P}(\epsilon_{1}=-1)=1/2$ , that are independent of $X$ and $X^{\prime}$ . If all involved random variables are independent, we write $\mathbb{E}_{\epsilon}$ (resp. $\mathbb{E}_{X^{\prime}}$ ) for expectation only w.r.t. $\epsilon$ (resp. $X^{\prime}$ ).

For a given probability space $(X,\mathcal{A},Q)$ , a measurable function $f$ on $X$ and $x\in X$ , we use the notation $Qf=\int fdQ$ whenever the latter integral is well-defined, and denote by $\delta_{x}$ the Dirac measure on $X$ , i.e., $\delta_{x}(A)=\mathds{1}{\{x\in A\}}$ for any $A\in\mathcal{A}$ . For a measurable symmetric function $f$ on $S^{r}$ and $k=0,1,\ldots,r$ , let $P^{r-k}f$ denote the function on $S^{k}$ defined by

[TABLE]

whenever it is well defined. To prove Theorem 5.1, without loss of generality, we may assume

[TABLE]

since we can always consider $h(\cdot)-\theta$ instead. For $0\leqslant k\leqslant r$ , define

[TABLE]

Clearly $\pi_{k}$ is degenerate of order $k$ with respect to the distribution $P$ in the sense of (16) below. For any $\iota=(i_{1},\ldots,i_{k})\in I_{n,k}$ , and $J=(j_{1},\ldots,j_{\ell})\in I_{k,\ell}$ where $0\leqslant\ell\leqslant k$ , define

[TABLE]

Then $\pi_{k}h(x_{\iota})=\mathbb{E}_{X^{\prime}}\left[\sum_{\ell=0}^{k}(-1)^{k-\ell}\sum_{J\in I_{k,\ell}}\widetilde{\pi}_{k}h(x_{\iota_{J}},X^{\prime}_{\iota\setminus\iota_{J}})\right]\text{ for all }\iota\in I_{n,k}.$

Further, the Hoeffding decomposition [19] for the $U$ -statistic (with $\theta=0$ ) is as follows:

[TABLE]

Finally, for any $1\leqslant k\leqslant r$ , define the envelope function

[TABLE]

5.1 Symmetrization inequality

For each integer $k$ , consider a symmetric kernel $f:S^{k}\to\mathbb{R}^{d}$ . We say that $f$ is degenerate of order $k$ with respect to the distribution $P$ if

[TABLE]

The following result is essentially due to [27, Section 3, Symmetrization inequality] in the $U$ -process setting. We provide a self-contained (and perhaps more transparent) proof for completeness.

Theorem 5.2 (Symmetrization inequality).

Assume (16) holds.

[TABLE]

Remark 5.3.

In Theorem 5.2, the symmetrization costs a multiplicative factor of $2^{k}$ for a degenerate kernel of order $k$ . The standard symmetrization argument for such degenerate $U$ -statistics (cf. [13, Theorem 3.5.3]), together with the decoupling inequalities (cf. [13, Theorem 3.1.1]) in the literature, yields that

[TABLE]

where $C_{k}=2^{4k-2}(k-1)!(k^{k}-1)((k-1)^{k-1}-1)\times\cdots\times(2^{2}-1)$ . Since $2^{k}\ll C_{k}$ , improvement of the constant to the exponential growth in $k$ turns out to be crucial to obtain the maximal inequality for the IOUS in Theorem 5.1. The major component for the super-exponential behavior of $C_{k}$ is due to the step for applying the decoupling inequality in [13, Theorem 3.1.1], which is valid for any (measurable) symmetric kernel. If the kernel $f$ is degenerate of order $k$ , then symmetrization can be directly done without the decoupling inequality (cf. the proof of Theorem 5.2 below).
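The gap between $2^{k}$ and $C_{k}$ is easy to tabulate. The short sketch below evaluates the displayed expression for $C_{k}$ with exact integer arithmetic; only the function names are introduced here for illustration.

```python
import math


def decoupling_constant(k):
    """C_k = 2^(4k-2) * (k-1)! * (k^k - 1) * ((k-1)^(k-1) - 1) * ... * (2^2 - 1)."""
    product = 1
    for m in range(2, k + 1):
        product *= m ** m - 1
    return 2 ** (4 * k - 2) * math.factorial(k - 1) * product


def symmetrization_constant(k):
    """The factor 2^k from Theorem 5.2."""
    return 2 ** k


if __name__ == "__main__":
    for k in range(2, 9):
        ratio = decoupling_constant(k) // symmetrization_constant(k)
        print(k, symmetrization_constant(k), decoupling_constant(k), ratio)
```

For instance, the formula gives $C_{2}=192$ and $C_{3}=159744$ , against $2^{2}=4$ and $2^{3}=8$ .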

Proof of Theorem 5.2.

Define a new sequence of random variables $\{Z_{i}:1\leqslant i\leqslant n\}$ :

[TABLE]

Further, for each $\iota=\{i_{1},\ldots,i_{k}\}\in I_{n,k}$ , define

[TABLE]

Due to degeneracy, we have

[TABLE]

where the first and third equalities follow from the definitions and Fubini's theorem, and the second follows from the degeneracy. To wit, on the event that $\{\epsilon_{i_{\ell}}=-1\}$ for some $1\leqslant\ell\leqslant k$ ,

[TABLE]

The rest of the argument is standard: by Jensen’s inequality,

[TABLE]

Since $(X_{1},\ldots,X_{n},\epsilon_{1},\ldots,\epsilon_{n})$ and $(Z_{1},\ldots,Z_{n},\epsilon_{1},\ldots,\epsilon_{n})$ have the same distribution, taking expectation on both sides completes the proof. $\blacksquare$

5.2 Maximal inequality

We start with a lemma, whose proof is elementary and thus omitted. Recall the definition of $\widetilde{\psi}_{\beta}$ in Subsection 1.2.

Lemma 5.4.

For any $\beta>0$ , $\widetilde{\psi}_{\beta}(\cdot)$ is strictly increasing, convex, and $\widetilde{\psi}_{\beta}(0)=0$ . Further, for any $\beta>0$ ,

[TABLE]

and consequently

[TABLE]

Now we state the maximal inequality with explicit constants.

Lemma 5.5.

Fix $\beta\in(0,1]$ . Consider a sequence of non-negative random variables $\{Z_{j}:1\leqslant j\leqslant d\}$ , and assume that there exists some real number $\Delta>0$ such that $\mathbb{E}[\widetilde{\psi}_{\beta}\left({Z_{j}}/{\Delta}\right)]\leqslant 2,\text{ for }1\leqslant j\leqslant d$ . Then

[TABLE]

Proof.

By monotonicity and convexity,

[TABLE]

Then the proof is complete by Lemma 5.4. $\blacksquare$

5.3 Exponential moment of Rademacher chaos

The goal is to establish an exponential moment bound (i.e., Bonami inequality) for Rademacher chaos of order $k$ . Based on the well-known hyper-contractivity of Rademacher chaos variables in the literature (cf. [13, Corollary 3.2.6]), our Lemma 5.6 below provides an exponential moment bound with an explicit dependence on the order.

Lemma 5.6 (Exponential moment of Rademacher chaos).

Fix $k\geqslant 2$ , $\beta=2/k$ and let $\{x_{\iota}:\iota\in I_{n,k}\}$ be a collection of real numbers. Consider the following homogeneous chaos of order $k$ :

[TABLE]

where $\epsilon_{1},\ldots,\epsilon_{n}$ are i.i.d. Rademacher random variables. Then

[TABLE]

Proof.

Denote $\kappa=\sqrt{\mathbb{E}[Z^{2}]}$ , $c=\sqrt{7}$ and thus $\Delta_{n}=c^{k}\kappa$ . Observe that $\beta\leqslant 1$ and $\beta k=2$ . From [13, Theorem 3.2.2], we have for any $q>0$

[TABLE]

Here, the first inequality clearly holds for $q\leqslant 2$ , and we use [13, Theorem 3.2.2] for $q>2$ . Then using the fact that $e^{x}\leqslant 1+\sum_{\ell=1}^{\infty}|x|^{\ell}/{\ell!}$ and by Lemma 5.4, we have

[TABLE]

Using the fact that $\ell^{\ell}\leqslant e^{\ell}\ell!$ , we have

[TABLE]

Since $c^{2}=7>e$ , we have

[TABLE]

which completes the proof.

$\blacksquare$
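The second moment $\mathbb{E}[Z^{2}]$ that drives the bound in Lemma 5.6 can be verified by brute force for small $n$ . The sketch below assumes that the chaos is $Z=\sum_{\iota\in I_{n,k}}x_{\iota}\epsilon_{i_{1}}\cdots\epsilon_{i_{k}}$ with $I_{n,k}$ the set of strictly increasing index tuples, in which case the products of signs are orthonormal and $\mathbb{E}[Z^{2}]=\sum_{\iota}x_{\iota}^{2}$ . The function name `chaos_moments` is hypothetical.

```python
from itertools import combinations, product


def chaos_moments(coeffs, n, k):
    """Exact E[Z] and E[Z^2] for a homogeneous Rademacher chaos of order k.

    coeffs maps each increasing k-tuple of indices in range(n) to x_iota.
    All 2^n sign vectors are enumerated, so this is only feasible for small n.
    """
    first, second = 0, 0
    for eps in product((-1, 1), repeat=n):
        z = 0
        for iota in combinations(range(n), k):
            sign = 1
            for i in iota:
                sign *= eps[i]
            z += coeffs[iota] * sign
        first += z
        second += z * z
    return first / 2 ** n, second / 2 ** n


if __name__ == "__main__":
    n, k = 4, 2
    coeffs = {iota: idx + 1 for idx, iota in enumerate(combinations(range(n), k))}
    print(chaos_moments(coeffs, n, k), sum(x * x for x in coeffs.values()))
```

With the integer coefficients $1,\ldots,6$ used above, both the enumeration and the sum of squares give $91$ .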

5.4 Proof of Theorem 5.1

Now we are in position to prove Theorem 5.1. Recall that we assume $\theta=0$ . First, for each $2\leqslant k\leqslant r$ and $1\leqslant j\leqslant d$ , define

[TABLE]

where $\pi_{k}h$ is defined in (15), and $\epsilon_{1},\ldots,\epsilon_{n}$ are i.i.d. Rademacher random variables. Define

[TABLE]

By Jensen’s inequality and the fact that $(\sum_{i=1}^{n}z_{i})^{2}\leqslant n\sum_{i=1}^{n}z_{i}^{2}$ , we have for any $1\leqslant j\leqslant d$ ,

[TABLE]

Then by Lemma 5.6,

[TABLE]

Further, by Lemma 5.5 with $\beta=2/k$ , we have

[TABLE]

Then by Theorem 5.2 and Jensen’s inequality, we have

[TABLE]

Now we bound $\mathbb{E}[F_{k}^{2}(X_{1},\ldots,X_{k})]$ . By the definition of $\widetilde{\pi}_{k}h_{j}$ , condition (C3), Lemma 5.4 and Jensen’s inequality, we have

[TABLE]

Since $\widetilde{\psi}_{q}(0)=0$ , by Jensen’s inequality, we have $\|\widetilde{\pi}_{k}h_{j}(X_{1},\ldots,X_{k})\|_{\widetilde{\psi}_{q}}\leqslant 2D_{n}$ . Then by the standard maximal inequality (e.g., see [29, Lemma 2.2.2]), there exists a constant $C$ , depending only on $q$ , such that for $1\leqslant k\leqslant r$ ,

[TABLE]

Thus we obtain that

[TABLE]

Observe that if $r^{2}\leqslant n$ , we have for any $1\leqslant i\leqslant r$

[TABLE]

Further, for any $x,y\geqslant 2$ , $\log^{k/2}(x+y)\leqslant 2^{k/2}(\log^{k/2}(x)+\log^{k/2}(y))$ . Now, take $c=1/500$ , and in particular $r^{2}\leqslant n$ . Then

[TABLE]

For the first term, by geometric series formula,

[TABLE]

For the second term, since for any $\ell\geqslant 1$ , $\ell^{\ell}\leqslant e^{\ell}\ell!$ , we have

[TABLE]

which completes the proof of Theorem 5.1. $\blacksquare$

Appendix A Proofs

A.1 Tail probabilities

In this section, we collect and prove some results regarding tail probabilities for sums of independent random vectors, $U$ -statistics, and $U$ -statistics with random kernels. For each type of statistic, we present two versions, one for non-negative random variables and the other for the general case.

These inequalities are used in bounding the effects due to sampling (Subsection A.4.3), and also in controlling the $\|\cdot\|_{\infty}$ distance between the bootstrap covariance matrices and their targets (Section A.5).

A.1.1 Tail probabilities for sum of independent random vectors

In this subsection, $m,n,d\geqslant 2$ are all integers.

Lemma A.1.

Let $Z_{1},\ldots,Z_{m}$ be independent $\mathbb{R}^{d}$ -valued random vectors and $\beta\in(0,1]$ . Assume that

[TABLE]

Then there exists some constant $C$ that only depends on $\beta$ such that

[TABLE]

Proof.

See Subsection A.7.1. $\blacksquare$

Lemma A.2.

Let $Z_{1},\ldots,Z_{m}$ be independent $\mathbb{R}^{d}$ -valued random vectors and $\beta\in(0,1]$ . Assume that

[TABLE]

Then there exists some constant $C$ that only depends on $\beta$ such that

[TABLE]

where $\sigma^{2}:=\max_{1\leqslant j\leqslant d}\sum_{i=1}^{m}\mathbb{E}[Z_{ij}^{2}]$ .

Proof.

See Subsection A.7.2. $\blacksquare$

Lemma A.3.

Let $Z_{1},\ldots,Z_{m}$ be independent and identically distributed Bernoulli random variables with success probability $p_{n}$ , i.e., $\mathbb{P}(Z_{i}=1)=1-\mathbb{P}(Z_{i}=0)=p_{n}$ for $1\leqslant i\leqslant m$ . Further, let $a_{1},\ldots,a_{m}$ be deterministic vectors in $\mathbb{R}^{d}$ . Then there exists an absolute constant $C$ such that

[TABLE]

where $\sigma^{2}:=\max_{1\leqslant j\leqslant d}\sum_{i=1}^{m}a_{ij}^{2}$ and $M=\max_{1\leqslant i\leqslant m,1\leqslant j\leqslant d}|a_{ij}|$ .

Proof.

See Subsection A.7.3. $\blacksquare$

A.1.2 Tail probabilities for $U$ -statistics

Lemma A.4.

Let $X_{1},\ldots,X_{n}$ be i.i.d. random variables taking values in $(S,\mathcal{S})$ and fix $\beta\in(0,1]$ . Let $f:(S^{r},\mathcal{S}^{r})\to\mathbb{R}^{d}$ be a measurable, symmetric function such that $\text{ for all }j=1,\ldots,d$ ,

[TABLE]

Define $U_{n}:=|I_{n,r}|^{-1}\sum_{\iota\in I_{n,r}}f(X_{\iota})$ . Then there exists a constant $C$ that only depends on $\beta$ such that

[TABLE]

Clearly, we can replace $v_{n}$ by $u_{n}$ .

Proof.

See Subsection A.7.4. $\blacksquare$

Lemma A.5.

Let $X_{1},\ldots,X_{n}$ be i.i.d. random variables taking values in $(S,\mathcal{S})$ and fix $\beta\in(0,1]$ . Let $f:(S^{r},\mathcal{S}^{r})\to\mathbb{R}^{d}$ be a measurable, symmetric function such that

[TABLE]

Define $U_{n}:=|I_{n,r}|^{-1}\sum_{\iota\in I_{n,r}}f(X_{\iota})$ and $\sigma^{2}:=\max_{1\leqslant j\leqslant d}\mathbb{E}[f_{j}^{2}(X_{1}^{r})]$ . Then there exists a constant $C$ that only depends on $\beta$ such that

[TABLE]

Clearly, we can replace $\sigma$ by $u_{n}$ .

Proof.

See Subsection A.7.5. $\blacksquare$

A.1.3 Tail probabilities for $U$ -statistics with random kernel

Let $X_{1},\ldots,X_{n}$ be i.i.d. random variables taking values in $(S,\mathcal{S})$ and $W,\{W_{\iota},\iota\in I_{n,r}\}$ be i.i.d. random variables taking values in $(S^{\prime},\mathcal{S}^{\prime})$ that are independent of $X_{1}^{n}$ . In this subsection, we consider a measurable function $F:S^{r}\times S^{\prime}\to\mathbb{R}^{d}$ that is symmetric in the first $r$ variables, and fix some $\beta\in(0,1]$ . Further, define

[TABLE]

We first consider the non-negative random kernels.

Lemma A.6.

Consider $Z:=\max_{1\leqslant j\leqslant d}{|I_{n,r}|}^{-1}\sum_{\iota\in I_{n,r}}F_{j}(X_{\iota},W_{\iota}).$ Assume that for all $j=1,\ldots,d$ , $F_{j}(\cdot)\geqslant 0$ , and that there exists $u_{n}\geqslant 1$ such that

[TABLE]

Then there exists some constant $C$ that only depends on $\beta$ such that with probability at least $1-8/n$ ,

[TABLE]

Proof.

See Subsection A.7.6. $\blacksquare$

Next, we consider centered random kernels.

Lemma A.7.

Consider $Z:=\max_{1\leqslant j\leqslant d}\left|{|I_{n,r}|}^{-1}\sum_{\iota\in I_{n,r}}\left(F_{j}(X_{\iota},W_{\iota})-f_{j}(X_{\iota})\right)\right|.$ Assume there exists $u_{n}\geqslant 1$ such that for all $j=1,\ldots,d$ ,

[TABLE]

Then there exists some constant $C$ that only depends on $\beta$ such that with probability at least $1-9/n$ ,

[TABLE]

Proof.

See Subsection A.7.7. $\blacksquare$

A.2 Additional lemmas

The following lemma concerns Gaussian approximation for sums of independent random vectors. It replaces the $\|\cdot\|_{\psi_{1}}$ condition in Proposition 2.1 of [10] by $\|\cdot\|_{\psi_{q}}$ .

Lemma A.8.

Let $Z_{1},\ldots,Z_{n}$ be independent $\mathbb{R}^{d}$ -valued random vectors. Assume that for some absolute constant $\underline{\sigma}^{2}>0$ , and $q>0$ ,

[TABLE]

Then there exists some constant $C$ that only depends on $q$ and $\underline{\sigma}^{2}$ such that

[TABLE]

where $q_{*}=(6/q+1)\vee 7$ , $Y\sim N\left(0,\Sigma\right)$ , and $\Sigma:=n^{-1}\sum_{i=1}^{n}\mathbb{E}[Z_{i}Z_{i}^{\prime}]$ .

Proof.

See Subsection A.8. $\blacksquare$

The following lemmas are elementary, but used repeatedly.

Lemma A.9.

Let $\beta>0$ . There exists a constant $C$ , depending only on $\beta$ , such that for any positive integers $r,n$ with $2\leqslant r\leqslant\sqrt{n}$ ,

[TABLE]

Proof.

Fix $\beta$ . If $r\to\infty$ , $n^{2}r^{\beta}/|I_{n,r}|\to 0$ . Thus there exists $M$ such that if $r\geqslant M$ , $n^{2}r^{\beta}\leqslant|I_{n,r}|$ . For $r<M$ , the inequality holds with $C=M^{\beta}$ . $\blacksquare$
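The ratio $n^{2}r^{\beta}/|I_{n,r}|$ used in this proof can be tabulated directly, since $|I_{n,r}|=\binom{n}{r}$ . The sketch below is only a numerical illustration over a finite grid, not a substitute for the argument; the function name `ratio` and the grid are arbitrary choices.

```python
import math


def ratio(n, r, beta):
    """n^2 * r^beta / |I_{n,r}|, where |I_{n,r}| is the binomial coefficient C(n, r)."""
    return n ** 2 * r ** beta / math.comb(n, r)


if __name__ == "__main__":
    beta = 1.0
    for n in (4, 16, 100, 400, 2500):
        # Only r in the range 2 <= r <= sqrt(n) is considered, as in the lemma.
        values = [(r, ratio(n, r, beta)) for r in range(2, math.isqrt(n) + 1)]
        print(n, values[0], values[-1])
```

On this grid the ratio is largest at $r=2$ , where it is close to $2^{\beta+1}$ , and it decays rapidly as $r$ grows.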

Lemma A.10.

Let $\beta,k>0$ . For any random variable $X$ ,

[TABLE]

Proof.

Observe that

[TABLE]

which implies that $\|X^{k}\|_{\psi_{\beta}}\leqslant\|X\|^{k}_{\psi_{k\beta}}$ . The reverse direction is similar. $\blacksquare$

For $\beta<1$ , $\|\cdot\|_{\psi_{\beta}}$ is not a norm, but the usual triangle inequality and maximal inequality hold up to a multiplicative constant.

Lemma A.11.

Fix $\beta\in(0,1)$ .

(i)

For any random variables $X$ and $Y$ ,

[TABLE]

(ii)

Let $\xi_{1},\ldots,\xi_{n}$ be a sequence of random variables such that $\|\xi_{i}\|_{\psi_{\beta}}\leqslant D$ for $1\leqslant i\leqslant n$ , and $n\geqslant 2$ . Then there exists a constant $C$ depending only on $\beta$ such that

[TABLE]

Proof.

See Subsection A.8. $\blacksquare$

A.3 Proofs in Section 2.1

We first prove Corollary 2.2 and then prove Theorem 2.1.

Proof of Corollary 2.2.

Let $c$ be the constant in Theorem 5.1. Without loss of generality, we assume

[TABLE]

since $\rho(\cdot,\cdot)\leqslant 1$ and we can always consider $h(\cdot)-\theta$ instead. Recall that $q_{*}=(6/q+1)\vee 7$ .

Fix any rectangle $R=[a,b]\in\mathcal{R}$ , where $a,b\in\mathbb{R}^{d}$ and $a\leqslant b$ . Define

[TABLE]

Denote

[TABLE]

Then by Theorem 5.1,

[TABLE]

For any $t>0$ , by Markov inequality and definition,

[TABLE]

Due to assumptions (C2), (C3) and Cauchy-Schwarz inequality,

[TABLE]

Then due to Lemma A.8, we have

[TABLE]

Further, by anti-concentration inequality [10, Lemma A.1],

[TABLE]

Finally, taking $t=\left(\underline{\sigma}_{g}^{-2}n^{-1}r^{2}\log^{1+2/q}(d)D_{n}^{2}\right)^{1/4}$ and due to convention (17), we have

[TABLE]

Likewise, we can show the lower inequality

[TABLE]

which completes the proof. $\blacksquare$

Proof of Theorem 2.1.

As before, without loss of generality, we assume

[TABLE]

for some sufficiently small $c_{1}\in(0,1)$ . Define for each $\iota=(i_{1},\ldots,i_{r})\in I_{n,r}$ ,

[TABLE]

Then by definition,

[TABLE]

Step 1. We first show that

[TABLE]

Note that conditional on $X_{1}^{n}$ , $\mathrm{R}_{n}$ is an average of independent random vectors. Thus by [9, Lemma 8],

[TABLE]

By definition (6) and maximal inequality ([29, Lemma 2.2.2] and Lemma A.11),

[TABLE]

Define

[TABLE]

Under the assumption (C4) and again maximal inequality ([29, Lemma 2.2.2] and Lemma A.11), we have

[TABLE]

Then, we have

[TABLE]

Then due to Lemma A.9 and (18), we have

[TABLE]

Step 2. We finish the proof by a similar argument as in the proof of Corollary 2.2.

Fix any rectangle $R=[a,b]\in\mathcal{R}$ , where $a,b\in\mathbb{R}^{d}$ and $a\leqslant b$ . Define

[TABLE]

where we recall that $\Lambda_{g}$ is defined in (5). Recall that $\widehat{U}_{n}=U_{n}+\mathrm{R}_{n}$ . For any $t>0$ , by Markov inequality, the result from Step 1, and Corollary 2.2,

[TABLE]

Observe that $\mathbb{E}[\tilde{Y}_{A,j}^{2}]=1$ for $1\leqslant j\leqslant d$ . By anti-concentration inequality [10, Lemma A.1],

[TABLE]

Finally, taking $t=\left(\underline{\sigma}^{2}_{g}n^{-1}r^{2}\log^{2/q}(dn)D_{n}^{2}\right)^{1/4}$ and due to convention (18), we have

[TABLE]

By a similar argument, we can show

[TABLE]

which completes the proof. $\blacksquare$

A.4 Proofs in Section 2.2

In this subsection, without loss of generality, we assume $\theta=0$ . Recall the definition of $\Lambda_{H}$ in (5). Further, define a function $\tilde{H}:S^{r}\times S^{\prime}\to\mathbb{R}^{d}$ by $\tilde{H}(x_{1}^{r},w)=\Lambda_{H}^{-1/2}H(x_{1}^{r},w)\text{ for any }x_{1}^{r}\in S^{r},w\in S^{\prime}$ , and

[TABLE]

Clearly, if (C5) holds, then

[TABLE]

where again we applied Cauchy–Schwarz inequality for $k=1$ .

A.4.1 Bounding $\widehat{N}/N$

The following lemma follows from an application of Bernstein’s inequality and is proved in Step 5 of the proof of [8, Theorem 3.1]. It is included here for easy reference.

Lemma A.12.

Assume $\sqrt{\log(n)/N}\leqslant 1/4$ . Then

[TABLE]
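To illustrate the kind of concentration that Bernstein's inequality gives for $\widehat{N}/N$ , here is a minimal simulation sketch. It assumes that $\widehat{N}=\sum_{\iota\in I_{n,r}}Z_{\iota}$ with i.i.d. Bernoulli( $p_{n}$ ) variables $Z_{\iota}$ and $N=p_{n}|I_{n,r}|$ , as in the Bernoulli sampling scheme of Section 2.2. The radius uses the textbook form of Bernstein's inequality at level $2/n$ ; its constants are an assumption and are not claimed to match the display in Lemma A.12. All function names are hypothetical.

```python
import math
import random


def sampled_count(num_tuples: int, p: float, rng: random.Random) -> int:
    """N_hat: number of index tuples kept by i.i.d. Bernoulli(p) sampling."""
    return sum(rng.random() < p for _ in range(num_tuples))


def bernstein_radius(n: int, N: float) -> float:
    """Textbook Bernstein radius for |N_hat/N - 1| at level 2/n (assumed constants)."""
    t = math.log(n)
    return math.sqrt(2.0 * t / N) + t / (3.0 * N)


def violation_rate(n: int, r: int, p: float, reps: int, seed: int = 0) -> float:
    """Fraction of replications in which |N_hat/N - 1| exceeds the radius."""
    rng = random.Random(seed)
    num_tuples = math.comb(n, r)   # |I_{n,r}|
    N = p * num_tuples             # computational budget N = p_n |I_{n,r}|
    radius = bernstein_radius(n, N)
    bad = sum(
        abs(sampled_count(num_tuples, p, rng) / N - 1.0) > radius
        for _ in range(reps)
    )
    return bad / reps


if __name__ == "__main__":
    # n = 30, r = 3, p_n = 0.1 satisfies sqrt(log(n)/N) <= 1/4.
    print(violation_rate(n=30, r=3, p=0.1, reps=200))
```

The observed violation rate should sit below $2/n$ ; in this regime the Bernstein bound is conservative, so the rate is typically far smaller.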

A.4.2 Bounding the normalized covariance estimator

Lemma A.13.

Assume (C3’), (C4) and (C5) hold. Then there exists a constant $C$ , depending only on $q$ , such that with probability at least $1-13/n$ ,

[TABLE]

Proof.

Define $v(x_{1}^{r}):=\mathbb{E}[\tilde{H}(x_{1}^{r},W)\tilde{H}(x_{1}^{r},W)^{T}],\;\widehat{V}:=|I_{n,r}|^{-1}\sum_{\iota\in I_{n,r}}v(X_{\iota})$ . Observe that

[TABLE]

We will bound these two terms separately.

Step 0. We first make a few observations. Clearly, $\mathbb{E}[v(X_{1}^{r})]=\Gamma_{\tilde{H}}$ , and for all $1\leqslant j,k\leqslant d$ , by Jensen’s inequality for conditional expectation and (21),

[TABLE]

Further, by definition

[TABLE]

As a result, by the assumptions (C4) and (C3’), and Lemma A.10,

[TABLE]

Step 1. We bound $\|\widehat{\Gamma}_{\tilde{H}}-\widehat{V}\|_{\infty}$ using Lemma A.7 with $F=\tilde{H}\tilde{H}^{T}$ and $\psi_{q/2}$ . For $1\leqslant j,k\leqslant d$ , define

[TABLE]

Observe that due to Lemma A.10 and A.11,

[TABLE]

Then due to (23) and the assumptions (C4) and (C3’),

[TABLE]

Now we apply Lemma A.7, with probability at least $1-9/n$ ,

[TABLE]

Step 2. We bound $\|\widehat{V}-\Gamma_{\tilde{H}}\|_{\infty}$ using Lemma A.5 with $\psi_{q/2}$ . By (22) and (23), with probability at least $1-4/n$ ,

[TABLE]

The proof is complete by combining Steps 1 and 2. $\blacksquare$
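The probability $1-13/n$ in Lemma A.13 comes from a union bound over the two steps. A sketch of the bookkeeping, writing $E_{1}$ and $E_{2}$ (labels introduced here, not in the paper) for the events on which the bounds of Step 1 and Step 2 hold:

```latex
\|\widehat{\Gamma}_{\tilde{H}}-\Gamma_{\tilde{H}}\|_{\infty}
\leqslant \|\widehat{\Gamma}_{\tilde{H}}-\widehat{V}\|_{\infty}
+ \|\widehat{V}-\Gamma_{\tilde{H}}\|_{\infty},
\qquad
\mathbb{P}(E_{1}\cap E_{2})
\geqslant 1-\mathbb{P}(E_{1}^{c})-\mathbb{P}(E_{2}^{c})
\geqslant 1-\frac{9}{n}-\frac{4}{n}
= 1-\frac{13}{n}.
```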

A.4.3 Bounding the effect of sampling

The following quantity will appear in the proof of Theorem 2.4:

[TABLE]

The next lemma establishes conditional Gaussian approximation for $\sqrt{N}\zeta_{n}$ .

Lemma A.14.

Suppose the assumptions in Theorem 2.4 hold. There exists a constant $C$ , depending on $q$ , such that with probability at least $1-C/n$ ,

[TABLE]

where we recall that $Y_{B}\sim N(0,\Gamma_{H})$ , and we abbreviate $\mathbb{P}_{|X,W}$ for $\mathbb{P}_{|X_{1}^{n},\{W_{\iota}:\iota\in I_{n,r}\}}$ .

Proof.

Consider conditionally independent (conditioned on $X,W$ ) $\mathbb{R}^{d}$ -valued random vectors $\{\widehat{Y}_{\iota}:\iota\in I_{n,r}\}$ such that

[TABLE]

Clearly, $\widehat{Y}|X,W\;\sim\;N(0,\widehat{\Gamma}_{\tilde{H}})$ . Further, define

[TABLE]

By triangle inequality, it then suffices to show that each of the following events happens with probability at least $1-C/n$ ,

[TABLE]

on which we now focus. Without loss of generality, since $\underline{\sigma}_{g}\leqslant 1$ , we assume

[TABLE]

for some sufficiently small constant $c_{1}\in(0,1)$ that is to be determined. Recall that $q_{1}=2\vee(2/q)$ and $q_{*}=(6/q+1)\vee 7$ .

Step 0. By Lemma A.13 and A.9,

[TABLE]

In particular, since $\Gamma_{\tilde{H},jj}=1$ , if we take $c_{1}$ small enough such that $Cc_{1}^{1/2}\leqslant 1/2$ , then $\mathbb{P}\left(\min_{1\leqslant j\leqslant d}\widehat{\Gamma}_{\tilde{H},jj}\geqslant 1/2\right)\geqslant 1-13/n$ .

Step 1. The goal is to show that the first event in (25), $\rho^{\mathcal{R}}_{|X,W}(\sqrt{N}\zeta_{n},\widehat{Y})\leqslant C\varpi_{n}$ , holds with probability at least $1-C/n$ .

Step 1.1. Define

[TABLE]

Further, $\widehat{M}_{n}(\phi):=\widehat{M}_{n,X}(\phi)+\widehat{M}_{n,Y}(\phi)$ , where

[TABLE]

By Theorem 2.1 in [10], there exist absolute constants $K_{1}$ and $K_{2}$ such that for any real numbers $\overline{L}_{n}$ and $\overline{M}_{n}$ , we have

[TABLE]

on the event $\mathcal{E}_{n}:=\{\widehat{L}_{n}\leqslant\overline{L}_{n}\}\cap\{\widehat{M}_{n}(\phi_{n})\leqslant\overline{M}_{n}\}\cap\{\min_{1\leqslant j\leqslant d}\widehat{\Gamma}_{\tilde{H},jj}\geqslant 1/2\}$ .

In Step 0, we have shown $\mathbb{P}\left(\min_{1\leqslant j\leqslant d}\widehat{\Gamma}_{\tilde{H},jj}\geqslant 1/2\right)\geqslant 1-13/n$ . In Steps 1.2 to 1.4, we select proper $\overline{L}_{n}$ and $\overline{M}_{n}$ such that the first two events happen with probability at least $1-C/n$ . In Step 1.5, we plug in these values.

Step 1.2: Select $\overline{L}_{n}$ . Since $p_{n}\leqslant 1/2$ , $\mathbb{E}|Z_{\iota}-p_{n}|^{3}\leqslant Cp_{n}$ , and thus

[TABLE]
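The third-moment bound for the sampling indicators can be verified directly. Assuming $Z_{\iota}\sim\text{Bernoulli}(p_{n})$ , a short computation gives the bound with constant $1$ :

```latex
\mathbb{E}|Z_{\iota}-p_{n}|^{3}
= p_{n}(1-p_{n})^{3}+(1-p_{n})p_{n}^{3}
= p_{n}(1-p_{n})\left[(1-p_{n})^{2}+p_{n}^{2}\right]
\leqslant p_{n}(1-p_{n})
\leqslant p_{n},
```

where the first inequality uses $(1-p)^{2}+p^{2}\leqslant 1$ for $p\in[0,1]$ .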

We will apply Lemma A.6 with $F(\cdot)=|\tilde{H}(\cdot)|^{3}$ and $\beta=q/3$ . Thus for $1\leqslant j\leqslant d$ , define

[TABLE]

First, by iterated expectation and due to (21),

[TABLE]

Second, observe that $\sigma_{H,j}^{3}f_{j}(x_{1}^{r})\lesssim\mathbb{E}\left[|H_{j}(x_{1}^{r},W)-h_{j}(x_{1}^{r})|^{3}\right]+|h_{j}(x_{1}^{r})|^{3}\lesssim B_{n,j}^{3}(x_{1}^{r})+|h_{j}(x_{1}^{r})|^{3}$ , and thus due to (C3), (C4) and Lemma A.10 and A.11,

[TABLE]

Further, observe that by Lemma A.11,

[TABLE]

Thus by the same argument, $\|b_{j}(X_{1}^{r})\|_{\psi_{q/3}}\;\lesssim\;(\underline{\sigma}_{H}^{-1}D_{n})^{3}$ . Then by Lemma A.6, with probability at least $1-8n^{-1}$ ,

[TABLE]

Due to Lemma A.9 and assumption (26), $\mathbb{P}(\widehat{L}_{n}\;\leqslant\;C\underline{\sigma}_{H}^{-1}p_{n}^{-1/2}D_{n})\geqslant 1-8/n$ . Thus there is a constant $C_{1}$ , depending on $q$ , such that if

[TABLE]

then $\mathbb{P}(\widehat{L}_{n}\leqslant\overline{L}_{n})\geqslant 1-8/n$ .

Step 1.3: bounding $\widehat{M}_{n,X}(\phi_{n})$ . Since $Z_{\iota}$ is a Bernoulli random variable, it is clear that $\widehat{M}_{n,X}(\phi_{n})=0$ on the event

[TABLE]

where we use the value (30) for $\overline{L}_{n}$ .

By assumption (C3’) and Lemma A.11,

[TABLE]

Due to (26),

[TABLE]

Thus if we take $c_{1}$ in (26) to be sufficiently small such that

[TABLE]

then $\mathbb{P}(\widehat{M}_{n,X}(\phi_{n})=0)\geqslant 1-2/n$ and $\phi_{n}\geqslant 1$ .

Step 1.4: select $\overline{M}_{n}$ . From Step 1.3, we have shown that

[TABLE]

Then by the same argument as in Step 1.4 of the proof of [8, Theorem 3.1] and due to (26) and $\phi_{n}\geqslant 1$ , on the event $\mathcal{E}^{\prime}_{n}$ , for any $\iota\in I_{n,r}$ ,

[TABLE]

Thus there exists an absolute constant $C_{2}$ such that if we set

[TABLE]

then $\mathbb{P}(\widehat{M}_{n,Y}(\phi_{n})\leqslant\overline{M}_{n})\geqslant 1-2/n$ .

Step 1.5: plug in $\overline{L}_{n}$ and $\overline{M}_{n}$ . Recall the definitions of $\overline{L}_{n}$ and $\overline{M}_{n}$ in (30) and (31). With these selections, we have shown that $\mathbb{P}(\mathcal{E}_{n})\geqslant 1-C/n$ , where we recall that $\mathcal{E}_{n}:=\{\widehat{L}_{n}\leqslant\overline{L}_{n}\}\cap\{\widehat{M}_{n}(\phi_{n})\leqslant\overline{M}_{n}\}\cap\{\min_{1\leqslant j\leqslant d}\widehat{\Gamma}_{\tilde{H},jj}\geqslant 1/2\}$ . Further, on the event $\mathcal{E}_{n}$ ,

[TABLE]

which completes the proof of Step 1.

Step 2. The goal is to show that the second event in (25), $\rho^{\mathcal{R}}_{|X,W}(\widehat{Y},\Lambda^{-1/2}_{H}Y_{B})\leqslant C\varpi_{n}$ , holds with probability at least $1-C/n$ .

Observe that $\text{Cov}(\Lambda^{-1/2}_{H}Y_{B})=\Gamma_{\tilde{H}}$ and $\Gamma_{\tilde{H},jj}=1$ for $1\leqslant j\leqslant d$ . By the Gaussian comparison inequality [8, Lemma C.5],

[TABLE]

on the event that $\{\|\widehat{\Gamma}_{\tilde{H}}-\Gamma_{\tilde{H}}\|_{\infty}\leqslant\overline{\Delta}\}$ . From (27) in Step 0,

[TABLE]

Thus if we set $\overline{\Delta}=C(\underline{\sigma}_{H}^{-2}n^{-1}r\log^{1\vee(2/q-1)}(dn)D_{n}^{2})^{1/2}$ , then with probability at least $1-C/n$ ,

[TABLE]

$\blacksquare$

A.4.4 Proof of Theorem 2.4

Without loss of generality, we assume that

[TABLE]

Observe that

[TABLE]

where we recall that $\widehat{U}_{n}$ and $\zeta_{n}$ are defined in Section 2.1 and in (24), respectively. Denote $Y:=rY_{A}+\alpha_{n}^{1/2}Y_{B}$ .

Step 1: the goal is to show that

[TABLE]

For any rectangle $R\in\mathcal{R}$ , observe that

[TABLE]

By Lemma A.14, since $n^{-1}\lesssim\varpi_{n}$ , we have

[TABLE]

where we recall that $Y_{B}$ is independent of all other random variables. Further, by Theorem 2.1,

[TABLE]

Observe that $\mathbb{E}[(\sigma_{g,j}^{-1}rY_{A,j})^{2}]=r^{2}\geqslant 1$ for any $1\leqslant j\leqslant d$ , $\|\Gamma_{H}\|_{\infty}\lesssim D_{n}^{2}$ due to (C3’), and $\alpha_{n}p_{n}=n/|I_{n,r}|\lesssim n^{-1}$ . Then by the Gaussian comparison inequality [8, Lemma C.5] and due to (32)

[TABLE]

Similarly, we can show $\mathbb{P}(\sqrt{n}\Phi_{n}\in R)\;\geqslant\;\mathbb{P}\left(rY_{A}+\sqrt{\alpha_{n}}Y_{B}\in R\right)-C\varpi_{n}$ . Thus the proof of Step 1 is complete.

Step 2: we show that with probability at least $1-C\varpi_{n}$ ,

[TABLE]

Clearly, $\mathbb{E}[Y_{j}^{2}]=r^{2}\sigma_{g,j}^{2}+\alpha_{n}\sigma_{H,j}^{2}$ . Then due to (C3’), $\mathbb{E}[Y_{j}^{2}]\leqslant(r^{2}+\alpha_{n})D_{n}^{2}$ . Since $Y$ is a multivariate Gaussian, $\max_{1\leqslant j\leqslant d}\|Y_{j}\|_{\psi_{2}}\leqslant\sqrt{(r^{2}+\alpha_{n})D_{n}^{2}}$ . Then by the maximal inequality [29, Lemma 2.2.2] $\|\max_{1\leqslant j\leqslant d}|Y_{j}|\|_{\psi_{2}}\leqslant C\sqrt{(r^{2}+\alpha_{n})D_{n}^{2}\log(d)}$ , which further implies that

[TABLE]
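The $\sqrt{\log(d)}$ growth of the Gaussian maximum used in this step can be illustrated by simulation. The following is a minimal Monte Carlo sketch, assuming independent coordinates with common standard deviation `sigma` (independence is only a simulation convenience; the maximal inequality does not need it). It compares the estimated $\mathbb{E}[\max_{j}|Y_{j}|]$ with the classical bound $\sigma\sqrt{2\log(2d)}$ . The function names are hypothetical.

```python
import math
import random


def mean_max_abs_gaussian(d: int, sigma: float, reps: int, seed: int = 0) -> float:
    """Monte Carlo estimate of E[max_j |Y_j|] for d independent N(0, sigma^2) coordinates."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        total += max(abs(rng.gauss(0.0, sigma)) for _ in range(d))
    return total / reps


def gaussian_max_bound(d: int, sigma: float) -> float:
    """Classical bound E[max_j |Y_j|] <= sigma * sqrt(2 log(2d))."""
    return sigma * math.sqrt(2.0 * math.log(2.0 * d))


if __name__ == "__main__":
    for d in (10, 100, 500):
        print(d, mean_max_abs_gaussian(d, 2.0, 200), gaussian_max_bound(d, 2.0))
```

In the proof, `sigma` plays the role of $\sqrt{(r^{2}+\alpha_{n})D_{n}^{2}}$ .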

Since $n^{-1}\lesssim\varpi_{n}$ , and from the result in Step 1, we have

[TABLE]

Finally, due to Lemma A.12 and (32), we have with probability at least $1-C\varpi_{n}$ ,

[TABLE]

Since $(r^{2}+\alpha_{n})N^{-1}\alpha_{n}^{-1}=r^{2}n^{-1}+N^{-1}\leqslant 2r^{2}(n\wedge N)^{-1}$ , the proof is complete.
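The last identity can be spelled out as follows, assuming $\alpha_{n}=n/N$ (which is consistent with $\alpha_{n}p_{n}=n/|I_{n,r}|$ when $p_{n}=N/|I_{n,r}|$ ) and $r\geqslant 1$ :

```latex
\frac{r^{2}+\alpha_{n}}{N\alpha_{n}}
= \frac{r^{2}}{N\alpha_{n}}+\frac{1}{N}
= \frac{r^{2}}{n}+\frac{1}{N}
\leqslant \frac{r^{2}}{n\wedge N}+\frac{r^{2}}{n\wedge N}
= \frac{2r^{2}}{n\wedge N}.
```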

Step 3: final step. Recall that $\sqrt{N}U_{n,N}^{\prime}=\sqrt{N}\Phi_{n}+(N/\widehat{N}-1)\sqrt{N}\Phi_{n}$ and $\nu_{n}$ is defined in Step 2. For any rectangle $R=[a,b]$ with $a\leqslant b$ , by Step 2,

[TABLE]

Then by the result in Step 1, we have

[TABLE]

where $\tilde{Y}=\Lambda_{H}^{-1/2}Y$ , $\tilde{a}=\Lambda_{H}^{-1/2}a$ and $\tilde{b}=\Lambda_{H}^{-1/2}b$ . Observe that $\mathbb{E}[(\alpha_{n}^{-1/2}\tilde{Y}_{j})^{2}]\geqslant\mathbb{E}[(\sigma_{H,j}^{-1}Y_{B,j})^{2}]=1$ for $1\leqslant j\leqslant d$ , and thus by the anti-concentration inequality [10, Lemma A.1],

[TABLE]

where the last inequality is due to (32). Similarly, we can show $\mathbb{P}\left(\sqrt{N}U_{n,N}^{\prime}\in R\right)\geqslant\mathbb{P}\left(\alpha_{n}^{-1/2}Y\in R\right)-C\varpi_{n}$ , and thus

[TABLE]

which completes the proof.

A.5 Proofs in Section 3

In this subsection, without loss of generality, we assume $q\leqslant 1$ (see Remark 2.5).

A.5.1 Proof of Theorem 3.1

Proof.

Without loss of generality, we can assume $\theta=\mathbb{E}[H(X_{1}^{r},W)]=0$ , since otherwise we can center $H$ first. Recall the definition of $\Lambda_{H}$ in (5), $\tilde{H}(\cdot)=\Lambda_{H}^{-1/2}H(\cdot)$ , and $\Gamma_{\tilde{H}},\;\widehat{\Gamma}_{\tilde{H}}$ in (20). Observe that for any integer $k$ , there exists some constant $C$ that depends only on $k$ and $\zeta$ such that

[TABLE]

Step 0. Define $\tilde{U}_{n,N}^{\prime}:=\Lambda_{H}^{-1/2}U_{n,N}^{\prime}$ and

[TABLE]

Since $\Gamma_{\tilde{H},jj}=1$ for $1\leqslant j\leqslant d$ , by the Gaussian comparison inequality [8, Lemma C.5],

[TABLE]

Thus it suffices to show that with probability at least $1-C/n$ , $\widehat{\Delta}_{B}\log^{2}(d)\lesssim n^{-\zeta/2}$ . Define

[TABLE]

Then clearly $\widehat{\Delta}_{B}\leqslant|N/\widehat{N}|\left(\widehat{\Delta}_{B,1}+\widehat{\Delta}_{B,2}\right)+\widehat{\Delta}_{B,3}+(N/\widehat{N})^{2}\widehat{\Delta}_{B,4}$ .

Without loss of generality, we can assume $C_{1}n^{-\zeta}\leqslant 1/16$ , since we can always take $C$ to be large enough. Then by Lemma A.12, $\mathbb{P}(|N/\widehat{N}|\leqslant C)\geqslant 1-2n^{-1}$ , and thus it suffices to show that

[TABLE]

on which we now focus.

Step 1: bounding $\widehat{\Delta}_{B,1}$ . Conditional on $\{X_{\iota},W_{\iota}:\iota\in I_{n,r}\}$ , by Lemma A.3,

[TABLE]

where

[TABLE]

First, by the maximal inequality ([29, Lemma 2.2.2] and Lemma A.11) and due to (C3’) and Lemma A.10 and A.11,

[TABLE]

As a result, $\mathbb{P}\left(M_{1}\leqslant C\underline{\sigma}_{H}^{-2}r^{2/q}D_{n}^{2}\log^{2/q}(n)\log^{2/q}(dn)\right)\geqslant 1-2/n$ .

Second, we will apply Lemma A.6 to bound $V_{n}$ with $F_{jk}(\cdot)=\tilde{H}_{j}^{2}(\cdot)\tilde{H}_{k}^{2}(\cdot)$ and $\beta=q/4$ . Note that by Lemma A.11, for $1\leqslant j,k\leqslant d$ ,

[TABLE]

As a result, due to (C5), (C3) and (C4)

[TABLE]

Then by Lemma A.6 and A.9, and due to (8) and (33)

[TABLE]

Finally, putting the two results together and again by (33), we have

[TABLE]

Then by (8), $\mathbb{P}\left(\widehat{\Delta}_{B,1}\;\leqslant\;C\underline{\sigma}_{H}^{-1}N^{-1/2}r^{1/q}\log^{1/2}(dn)D_{n}\right)\geqslant 1-C/n$ , which implies that with probability at least $1-C/n$ ,

[TABLE]

Step 2: bounding $\widehat{\Delta}_{B,2}$ . By Lemma A.13 and A.9, and due to assumptions (8) and (33)

[TABLE]

which implies $\mathbb{P}(\widehat{\Delta}_{B,2}\log^{2}(d)\leqslant Cn^{-\zeta/2})\geqslant 1-13/n$ .

Step 3: bounding $\widehat{\Delta}_{B,3}$ . By definition, $\|\Gamma_{\tilde{H}}\|_{\infty}=1$ . Then by Lemma A.12 and (8),

[TABLE]

with probability at least $1-2n^{-1}$ .

Step 4: bounding $\widehat{\Delta}_{B,4}$ . Define

[TABLE]

Clearly, $\widehat{\Delta}_{B,4}\leqslant 2\left(\widehat{\Delta}^{2}_{B,5}+\widehat{\Delta}^{2}_{B,6}\right).$ In the next two sub-steps, we will bound these two terms separately.

Step 4.1: bounding $\widehat{\Delta}_{B,5}^{2}$ . Conditional on $\{X_{\iota},W_{\iota}:\iota\in I_{n,r}\}$ , by Lemma A.3,

[TABLE]

where $\widetilde{V}_{n}:=\max_{1\leqslant j\leqslant d}|I_{n,r}|^{-1}\sum_{\iota\in I_{n,r}}\tilde{H}_{j}^{2}(X_{\iota},W_{\iota}),\quad\widetilde{M}_{1}:=\max_{\iota\in I_{n,r}}\max_{1\leqslant j\leqslant d}|\tilde{H}_{j}(X_{\iota},W_{\iota})|$ .

First, by the maximal inequality ([29, Lemma 2.2.2] and Lemma A.11) and due to (C3’),

[TABLE]

As a result, $\mathbb{P}\left(\widetilde{M}_{1}\leqslant C\underline{\sigma}_{H}^{-1}r^{1/q}D_{n}\log^{1/q}(n)\log^{1/q}(dn)\right)\geqslant 1-2/n$ .

Second, we will apply Lemma A.6 to bound $\widetilde{V}_{n}$ with $F_{j}(\cdot)=\tilde{H}_{j}^{2}(\cdot)$ and $\beta=q/2$ . Define for $1\leqslant j\leqslant d$ ,

[TABLE]

By an argument similar to that in Step 1,

[TABLE]

Then by Lemma A.6 and A.9, and due to (8) and (33) we have $\mathbb{P}(\widetilde{V}_{n}\leqslant C)\geqslant 1-8/n$ .

Finally, putting the two results together, we have

[TABLE]

Then by (8), $\mathbb{P}\left(\widehat{\Delta}_{B,5}^{2}\;\leqslant\;CN^{-1}\log(dn)\right)\geqslant 1-C/n$ , which implies that with probability at least $1-C/n$ , $\widehat{\Delta}_{B,5}^{2}\log^{2}(d)\;\leqslant\;Cn^{-\zeta}$ holds.

Step 4.2: bounding $\widehat{\Delta}_{B,6}^{2}$ . Observe that $\widehat{\Delta}_{B,6}\leqslant\widehat{\Delta}_{B,7}+\widehat{\Delta}_{B,8}$ , where

[TABLE]

By directly applying Lemma A.7 with $\beta=q$ , due to (8) and Lemma A.9,

[TABLE]

By directly applying Lemma A.5 with $\beta=q$ and due to (8),

[TABLE]

Thus $\mathbb{P}\left(\widehat{\Delta}_{B,6}^{2}\log^{2}(d)\leqslant Cn^{-\zeta}\right)\geqslant 1-C/n$ .

Combining Sub-steps 4.1 and 4.2, we have $\mathbb{P}\left(\widehat{\Delta}_{B,4}^{2}\log^{2}(d)\leqslant Cn^{-\zeta}\right)\geqslant 1-C/n$ . Combining Steps 0 to 4 finishes the proof. $\blacksquare$

A.5.2 Proof of Lemma 3.2

Proof.

Without loss of generality, we can assume $\theta=\mathbb{E}[H(X_{1}^{r},W)]=0$ . Recall the definition of $\Lambda_{g}$ in (5). By definition, $\mathbb{E}[(\sigma_{g,j}^{-1}Y_{A,j})^{2}]=1$ for $1\leqslant j\leqslant d$ . Then by the Gaussian comparison inequality [8, Lemma C.5],

[TABLE]

where

[TABLE]

By the same argument as in the proof of [8, Theorem 4.2],

[TABLE]

where $\widehat{\Delta}_{A,1}$ is defined in (9), and

[TABLE]

Step 1: bounding $\widehat{\Delta}_{A,1}$ . By the second part of (11), we have

[TABLE]

Step 2: bounding $\widehat{\Delta}_{A,2}$ . We apply Lemma A.2 with $\beta=q/2$ , $m=n_{1}$ and note that $n_{1}\leqslant n$ :

[TABLE]

where $\sigma^{2}=\max_{1\leqslant j,k\leqslant d}\sigma_{g,j}^{-2}\sigma_{g,k}^{-2}\sum_{i_{1}\in S_{1}}\mathbb{E}[(g_{j}(X_{i_{1}})g_{k}(X_{i_{1}})-\Gamma_{g,jk})^{2}]$ and

[TABLE]

By Lemma A.11, (C2) and (C3’), $\sigma^{2}\leqslant n_{1}\left(\underline{\sigma}_{g}^{-1}D_{n}\right)^{2},\;u_{n}\leqslant\left(\underline{\sigma}_{g}^{-1}D_{n}\right)^{2}$ . Thus

[TABLE]

Then due to the first part of (11) and (33), $\mathbb{P}(\widehat{\Delta}_{A,2}\log^{2}(d)\geqslant Cn^{-\zeta_{1}/2})\leqslant Cn^{-1}$ .

Step 3: bounding $\widehat{\Delta}_{A,3}$ . We apply Lemma A.2 with $\beta=q$ , $m=n_{1}$ :

[TABLE]

Then due to the first part of (11) and (33), $\mathbb{P}(\widehat{\Delta}_{A,3}^{2}\log^{2}(d)\geqslant Cn^{-\zeta_{1}})\leqslant Cn^{-1}$ . $\blacksquare$

A.5.3 Proof of Theorem 3.3

Proof.

Without loss of generality, we can assume $\theta=\mathbb{E}[H(X_{1}^{r},W)]=0$ .

Step 1. Let $\zeta_{1}:=\zeta$ , $\zeta_{2}:=\zeta-1/\nu$ . Due to Theorem 3.1, Lemma 3.2 and the same argument as in Step 3 of the proof of [8, Theorem 4.2], it suffices to show that the second part of (11) holds. From the definition (9),

[TABLE]

In Step 2, we will show that

[TABLE]

Then by Markov inequality and (12),

[TABLE]

which completes the proof.

Step 2. The goal is to show (34). Define

[TABLE]

By Jensen’s inequality,

[TABLE]

and for each $i_{1}\in S_{1}$ , conditional on $X_{i_{1}}$ , by the Hoffmann-Jørgensen inequality [29, A.1.6],

[TABLE]

Step 2.1: bounding $II_{i_{1}}$ . Observe that for each $1\leqslant k\leqslant K$ ,

[TABLE]

Thus $II_{i_{1}}\lesssim K^{-2\nu+1}b(X_{i_{1}})$ .

Step 2.2: bounding $I_{i_{1}}$ . Observe that for each $i_{1}\in S_{1}$ ,

[TABLE]

Further, by Jensen’s inequality,

[TABLE]

where $b(X_{i_{1}})$ is defined in Step 1. Then by the same argument as in the proof of [8, Proposition 4.4],

[TABLE]

Step 2.3: combining 2.1 and 2.2. By Jensen’s inequality, assumption (C3’) and by the maximal inequality ([29, Lemma 2.2.2] and Lemma A.11)

[TABLE]

Thus combining the results from 2.1 and 2.2, we have

[TABLE]

where the second inequality is due to (12) and that $\nu\geqslant 7/6$ and $K=\lfloor(n-1)/(r-1)\rfloor$ .

$\blacksquare$
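The value $K=\lfloor(n-1)/(r-1)\rfloor$ is the number of disjoint blocks of size $r-1$ that fit into the $n-1$ observations other than $X_{i_{1}}$ . A minimal sketch of such a partition, assuming a simple consecutive split of the remaining indices; the paper's own construction is given where the estimator is defined and may differ (for example, by using a random partition). The helper name `disjoint_blocks` and the zero-based indexing are illustrative.

```python
def disjoint_blocks(n: int, r: int, i1: int) -> list[list[int]]:
    """Split {0,...,n-1} minus {i1} into K = floor((n-1)/(r-1)) disjoint blocks of size r-1."""
    if r < 2 or not 0 <= i1 < n:
        raise ValueError("need r >= 2 and 0 <= i1 < n")
    others = [i for i in range(n) if i != i1]   # the n-1 indices other than i1
    K = (n - 1) // (r - 1)                      # number of complete blocks
    return [others[k * (r - 1):(k + 1) * (r - 1)] for k in range(K)]


if __name__ == "__main__":
    # Each block, together with i1, indexes one r-tuple fed to the kernel.
    for block in disjoint_blocks(n=11, r=3, i1=4):
        print([4] + block)
```

Averaging the kernel over these $K$ tuples gives an average of conditionally independent terms given $X_{i_{1}}$ , which is what makes inequalities for independent sums applicable in Step 2.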

A.5.4 Proof of Corollary 3.5

Proof.

We have shown in Step 0 of the proof of Theorem 3.1 (Subsection A.5.1) that

[TABLE]

Further, if we take $\nu=7/\zeta$ in Theorem 3.3, then in the proofs of Lemma 3.2 and Theorem 3.3, we have shown that

[TABLE]

The rest of the proof is the same as the proof for [8, Corollary A.1], and thus omitted. $\blacksquare$

A.6 Proof of Lemma 4.1

Proof.

The inequality is stated coordinate-wise, and thus without loss of generality, we assume $d=1$ and omit the dependence on $j$ .

We denote by $\mathbb{E}_{\beta}$ and $\text{Cov}_{\beta}$ the expectation and covariance when $X_{1},\ldots,X_{r}$ have densities $f_{\beta}$ . Further, define $g_{\beta}(x_{1})=\mathbb{E}_{\beta}[h(x_{1},X_{2},\ldots,X_{r})]$ for $x_{1}\in S$ and by definition $g(\cdot)=g_{0}(\cdot)$ .

First, note that by interchanging the order of integration and differentiation

[TABLE]

Further, by a similar argument,

[TABLE]

which implies that

[TABLE]

Finally, observe that

[TABLE]

which completes the proof. $\blacksquare$

A.7 Proofs of tail probabilities in Section A.1

A.7.1 Proof of Lemma A.1

Proof.

We first define

[TABLE]

Then by the maximal inequality [29, Lemma 2.2.2], $\|M\|_{\psi_{\beta}}\leqslant Cu_{n}\log^{1/\beta}(dm)$ . By [10, Lemma E.4],

[TABLE]

The right hand side is $3/n$ if

[TABLE]

Further by [10, Lemma E.3],

[TABLE]

Combining two parts finishes the proof. $\blacksquare$

A.7.2 Proof of Lemma A.2

Proof.

We first define

[TABLE]

Then by the maximal inequality [29, Lemma 2.2.2], $\|M\|_{\psi_{\beta}}\leqslant Cu_{n}\log^{1/\beta}(dm)$ . By [10, Lemma E.2],

[TABLE]

The right hand side is $4/n$ if

[TABLE]

Further by [10, Lemma E.1],

[TABLE]

Combining two parts finishes the proof.

$\blacksquare$

A.7.3 Proof of Lemma A.3

Proof.

We first define

[TABLE]

By [10, Lemma E.2],

[TABLE]

The right hand side is $4/n$ if

[TABLE]

Further by [10, Lemma E.1],

[TABLE]

Combining two parts finishes the proof. $\blacksquare$

A.7.4 Proof of Lemma A.4

Proof.

Let $m=\lfloor n/r\rfloor$ , and define the following quantity

[TABLE]

Then by the maximal inequality [29, Lemma 2.2.2], $\|M_{1}\|_{\psi_{\beta}}\leqslant Cu_{n}\log^{1/\beta}(dn)$ . By [6, Lemma E.3],

[TABLE]

The right hand side is $3/n$ if we set

[TABLE]

Further, by [9, Lemma 9],

[TABLE]

Putting two parts together, we have

[TABLE]

which completes the proof. $\blacksquare$
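The block count $m=\lfloor n/r\rfloor$ is the one that appears in Hoeffding's averaging representation of a $U$ -statistic, which reduces tail bounds for $U$ -statistics to those for averages of $m$ independent terms. Whether the cited lemmas use exactly this device is not shown in this section; the sketch below only checks the representation itself on a small sample, using exact rational arithmetic. The kernel `h_prod` and all function names are illustrative choices.

```python
from fractions import Fraction
from itertools import combinations, permutations


def u_statistic(xs, r, h):
    """Complete U-statistic: average of h over all r-subsets of the sample."""
    vals = [h(*(xs[i] for i in idx)) for idx in combinations(range(len(xs)), r)]
    return sum(vals, Fraction(0)) / len(vals)


def hoeffding_average(xs, r, h):
    """Average over all permutations of the mean of h over m = floor(n/r) disjoint blocks."""
    n = len(xs)
    m = n // r
    total, count = Fraction(0), 0
    for perm in permutations(range(n)):
        block_mean = sum(
            (h(*(xs[perm[k * r + j]] for j in range(r))) for k in range(m)),
            Fraction(0),
        ) / m
        total += block_mean
        count += 1
    return total / count


def h_prod(x, y):
    """A symmetric kernel of order r = 2 (chosen only for illustration)."""
    return x * y


if __name__ == "__main__":
    sample = [1, 4, -2, 7, 3]
    print(u_statistic(sample, 2, h_prod), hoeffding_average(sample, 2, h_prod))
```

For a symmetric kernel the two quantities coincide exactly, because under a uniformly random permutation each block is a uniformly random ordered $r$ -tuple of distinct indices.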

A.7.5 Proof of Lemma A.5

Proof.

Let $m=\lfloor n/r\rfloor$ , and define the following quantity

[TABLE]

Then by the maximal inequality [29, Lemma 2.2.2], $\|M_{1}\|_{\psi_{\beta}}\leqslant Cu_{n}\log^{1/\beta}(dn)$ . By [8, Lemma C.3],

[TABLE]

The right hand side is $4/n$ if we take

[TABLE]

Further, by [9, Lemma 8],

[TABLE]

Putting two parts together completes the proof. $\blacksquare$

A.7.6 Proof of Lemma A.6

Proof.

First, observe that $\|F_{j}(x_{1}^{r},W)\|_{\psi_{\beta}}\;\lesssim\;f_{j}(x_{1}^{r})+b_{j}(x_{1}^{r})$ . Denote

[TABLE]

Then conditional on $X_{1}^{n}$ , by Lemma A.1,

[TABLE]

By Lemma A.4,

[TABLE]

Further, by maximal inequality [29, Lemma 2.2.2]

[TABLE]

Then the proof is complete by combining above results. $\blacksquare$

A.7.7 Proof of Lemma A.7

Proof.

First, we define

[TABLE]

Then, by first conditioning on $X_{1}^{n}$ and applying Lemma A.2,

[TABLE]

Observe that

[TABLE]

Then by Lemma A.4 with $\psi_{\beta/2}$ ,

[TABLE]

Further, by maximal inequality [29, Lemma 2.2.2]

[TABLE]

Then the proof is complete by combining above results. $\blacksquare$

A.8 Proofs of additional lemmas

The following lemma is similar to [10, Lemma C.1], and is needed in proving Lemma A.8.

Lemma A.15.

Let $q\in(0,3]$ , and $\xi$ be a non-negative random variable such that $\|\xi\|_{\psi_{q}}\leqslant D$ . Then there exists a constant $C$ , depending only on $q$ , such that

[TABLE]

Proof.

Since $\|\xi\|_{\psi_{q}}\leqslant D$ , we have for $x>0$ ,

[TABLE]
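A standard way to obtain a tail bound of this type from the Orlicz-norm condition is Markov's inequality applied to $e^{(\xi/D)^{q}}$ . A sketch, assuming the definition $\|\xi\|_{\psi_{q}}=\inf\{C>0:\mathbb{E}[\exp((\xi/C)^{q})]\leqslant 2\}$ for non-negative $\xi$ (the constants in the paper's display may differ):

```latex
\mathbb{P}(\xi>x)
= \mathbb{P}\left(e^{(\xi/D)^{q}}>e^{(x/D)^{q}}\right)
\leqslant e^{-(x/D)^{q}}\,\mathbb{E}\left[e^{(\xi/D)^{q}}\right]
\leqslant 2\exp\left(-(x/D)^{q}\right),
\qquad x>0 .
```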

By change of variable, we have

[TABLE]

$\blacksquare$

Proof of Lemma A.8.

For $q\geqslant 1$ , it has been established by [10, Proposition 2.1]. For $q<1$ , the proof is almost identical to that for [10, Proposition 2.1], except that we replace [10, Lemma C.1] by Lemma A.15. $\blacksquare$

Proof of Lemma A.11.

(i). Without loss of generality, we assume $0<x:=\|X\|_{\psi_{\beta}}<\infty$ , and $0<y:=\|Y\|_{\psi_{\beta}}<\infty$ . Observe that

[TABLE]

(ii). From Lemma 5.4, for $1\leqslant i\leqslant n$ ,

[TABLE]

which, by the convexity of $\widetilde{\psi}_{\beta}$ and the fact $\widetilde{\psi}_{\beta}(0)=0$ , implies $\|\xi_{i}\|_{\widetilde{\psi}_{\beta}}\leqslant 2D$ . By the standard maximal inequality (e.g., see [29, Lemma 2.2.2]) and Lemma 5.4, $\|\max_{1\leqslant i\leqslant n}\xi_{i}\|_{\widetilde{\psi}_{\beta}}\leqslant C\log^{1/\beta}(n)D$ . Thus by Lemma 5.4,

[TABLE]

Now let $m\geqslant 1$ be such that $\left(1+e^{1/\beta}\right)^{1/m}\leqslant 2$ . Then by Jensen’s inequality ( $\mathbb{E}[X^{1/m}]\leqslant\left(\mathbb{E}[X]\right)^{1/m}$ for $X>0$ a.s.),

[TABLE]

which implies that $\|\max_{1\leqslant i\leqslant n}\xi_{i}\|_{\widetilde{\psi}_{\beta}}\lesssim\log^{1/\beta}(n)D$ . $\blacksquare$

Acknowledgements

X. Chen is supported in part by NSF DMS-1404891, NSF CAREER Award DMS-1752614, and UIUC Research Board Awards (RB17092, RB18099).

Bibliography


[1] Gunnar Blom. Some properties of incomplete $U$-statistics. Biometrika, 63(3):573–580, 1976.
[2] Yu. V. Borovskikh. U-Statistics in Banach Spaces. V.S.P. Intl Science, 1996.
[3] Leo Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996.
[4] Leo Breiman. Random forests. Machine Learning, 45:5–32, 2001.
[5] B.M. Brown and D.G. Kildea. Reduced $U$-statistics and the Hodges-Lehmann estimator. Annals of Statistics, 6:828–835, 1978.
[6] Xiaohui Chen. Gaussian and bootstrap approximations for high-dimensional $U$-statistics and their applications. The Annals of Statistics, 46(2):642–678, 2018.
[7] Xiaohui Chen and Kengo Kato. Jackknife multiplier bootstrap: finite sample approximations to the $U$-process supremum with applications. 2017. arXiv:1708.02705.
[8] Xiaohui Chen and Kengo Kato. Randomized incomplete $U$-statistics in high dimensions. The Annals of Statistics, accepted (available at arXiv:1712.00771), 2018+.
