Asymptotic power of Rao's score test for independence in high dimensions
Dennis Leung, Qi-Man Shao

TL;DR
This paper analyzes the asymptotic power of Rao's score test for independence in high-dimensional normal data, showing it is rate-optimal for detecting dependencies as both sample size and dimension grow.
Contribution
It derives the asymptotic minimax power function of Rao's score test in high dimensions, establishing its rate-optimality for dependency detection.
Findings
Rao's score test is rate-optimal for dependency signals of order sqrt(m/n)
The test's power function is characterized asymptotically in high dimensions
Both dimension and sample size tend to infinity with bounded ratio
Abstract
Let $\mathbf{R}$ be the Pearson correlation matrix of $m$ normal random variables. Rao's score test for the independence hypothesis $H_0: \mathbf{R} = \mathbf{I}_m$, where $\mathbf{I}_m$ is the identity matrix of dimension $m$, was first considered by Schott (2005) in the high-dimensional setting. In this paper, we study the asymptotic minimax power function of this test, under an asymptotic regime in which both $m$ and the sample size $n$ tend to infinity with the ratio $m/n$ upper bounded by a constant. In particular, our result implies that Rao's score test is rate-optimal for detecting a dependency signal $\|\mathbf{R} - \mathbf{I}_m\|_F$ of order $\sqrt{m/n}$, where $\|\cdot\|_F$ is the matrix Frobenius norm.
Dennis Leung
Department of Statistics, Chinese University of Hong Kong, Shatin, Hong Kong
and
Qi-Man Shao
Department of Statistics, Chinese University of Hong Kong, Shatin, Hong Kong
Abstract.
Let $\mathbf{R}$ be the Pearson correlation matrix of $m$ normal random variables. Rao’s score test for the independence hypothesis $H_0: \mathbf{R} = \mathbf{I}_m$, where $\mathbf{I}_m$ is the identity matrix of dimension $m$, was first considered by Schott (2005) in the high-dimensional setting. In this paper, we study the asymptotic exact power function of this test, under an asymptotic regime in which both $m$ and the sample size $n$ tend to infinity with the ratio $m/n$ upper bounded by a constant. In particular, our result implies that Rao’s score test is minimax rate-optimal for detecting a dependency signal $\|\mathbf{R} - \mathbf{I}_m\|_F$ of order $\sqrt{m/n}$, where $\|\cdot\|_F$ is the matrix Frobenius norm.
2000 Mathematics Subject Classification:
62H05
1. Introduction
Let be an -variate normal vector with population Pearson correlation matrix denoted by . Suppose we observe independent samples for each component , . When the dimension can be larger than the sample size , Schott (2005) was the first to consider the Rao’s score statistic
[TABLE]
for testing the independence null hypothesis
[TABLE]
where , is the sample correlation of the pair computed from the data, and is the -by- identity matrix. It was shown to be asymptotically normal under as both and go to infinity with the ratio converging to a positive constant. The purpose of this paper is to complement the theoretical study of by investigating its power under alternatives of the form
[TABLE]
where for any constant and matrix Frobenius norm , we define the set of Pearson correlation matrices
[TABLE]
which comprises a composite alternative hypothesis delineated by a signal size of order no less than .
To the best of our knowledge, there are three major approaches to testing independence with growing dimension in the literature. The first is the statistic considered in this paper. Being a “sum” of squared pairwise sample correlations as in (1.1), it is good at detecting diffuse dependency among many pairs of variables. Such dependency is most naturally described by the signal . In fact, the main result in this paper will show that is minimax rate optimal for detecting such a signal. The second approach considers the “max” statistic,
[TABLE]
Following many previous works (Jiang, 2004; Liu et al., 2008; Li et al., 2010; Zhou, 2007; Li et al., 2012), Cai and Jiang (2011) showed that it admits an asymptotic Gumbel distribution under in the ultra high dimensional regime when can be as large as for some constant , as . Naturally, it is good at detecting a structured alternative whose population correlation matrix has sparse non-zero off-diagonal entries with considerable magnitudes. Both the “sum” and “max” approaches base their tests on intuitive statistics that measure the overall dependency among the variables, and both have non-parametric extensions; see Leung and Drton (2015) and Han and Liu (2014). The third is the likelihood ratio test (LRT), which is well known to give an implementable test only if the dimension is smaller than . Despite this limitation, Jiang and Qi (2015) showed the LRT statistic to be asymptotically normal when , as long as is less than .
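To make the contrast between the “sum” and “max” approaches concrete, the following minimal Python sketch computes both quantities from a data matrix. It is only an illustration under stated assumptions: the data are taken to have known mean zero, so correlations are formed from uncentered cross-products, and the statistics are returned raw, without the centering or scaling that the displays above may apply. The function names are hypothetical.

```python
import math

def sample_correlations(data):
    """Pairwise sample correlations r_pq (p < q) from n rows of m variables.

    Assumption of this sketch: the variables have known mean zero, so
    uncentered cross-products are used. Columns must not be identically zero.
    """
    n, m = len(data), len(data[0])
    # Uncentered second moments s_pq = (1/n) * sum_i x_ip * x_iq.
    s = [[sum(row[p] * row[q] for row in data) / n for q in range(m)]
         for p in range(m)]
    return {(p, q): s[p][q] / math.sqrt(s[p][p] * s[q][q])
            for p in range(m) for q in range(p + 1, m)}

def sum_statistic(data):
    """Raw 'sum' statistic: sum of squared pairwise sample correlations."""
    return sum(r * r for r in sample_correlations(data).values())

def max_statistic(data):
    """Raw 'max' statistic: largest absolute pairwise sample correlation."""
    return max(abs(r) for r in sample_correlations(data).values())

# Toy data: columns 0 and 1 are identical, column 2 is only weakly related.
toy = [[1.0, 1.0, 2.0], [2.0, 2.0, -1.0], [-3.0, -3.0, 0.5]]
print(sum_statistic(toy), max_statistic(toy))
```

In the toy data one pair is perfectly correlated, which dominates the “max” statistic, while the “sum” statistic also accumulates the small squared correlations of the remaining pairs.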
We remark that the derivation of (1.1) as the Rao’s score statistic involves taking derivatives of the log-normal likelihood with respect to the mean vector and the precision matrix. The interested reader is referred to Appendix A in Leung and Drton (2015) for those calculations.
2. Notations and main results
For any positive integer , is defined as the set . is the symmetric group of order . Depending on the context, its elements will sometimes be treated as permutation functions on elements, or simply permutations of the set . always denotes a positive constant that is universal, i.e., its value may change from place to place but does not depend on and . “” means that for some constant . , and are expectation, variance and probability operators respectively.
In this paper we shall always assume that, for all , and . Thus, for a duple , , and its corresponding squared sample correlation is defined as
[TABLE]
where is the function
[TABLE]
and
[TABLE]
We will also use
[TABLE]
to denote the centered sample covariance. Imposing the assumption is always permitted, even if we use the more general form of Pearson correlations with all sample covariances defined alternatively as
[TABLE]
in (2.1), since the distribution of is invariant to the scaling of variables. Under normality, the restrictions and (2.3) can still be assumed without losing any generality in the results to follow; see the classical result in Anderson (2003, Theorem 3.3.2).
According to Chen and Shao (2012, Theorem 2.2) who refined the asymptotic result of Schott (2005) under , for a given , a test of asymptotic level based on (1.1) is given as
[TABLE]
where is the indicator function , , and and are respectively the cumulative distribution function and tail probability of a standard normal variate. Below, simply emphasizes that the expectation is taken with respect to a particular correlation matrix .
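As a rough illustration of how a level-alpha test of this kind can be carried out, the sketch below standardizes the sum of squared sample correlations and compares it with an upper normal quantile. The normalization is an assumption of this sketch, not a transcription of (2.5): for mean-zero normal data with uncentered correlations, each squared correlation is taken to have null mean 1/n and the sum to have null variance m(m-1)(n-1)/(n^2(n+2)), in the spirit of Schott's (2005) normalization. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def score_test(data, alpha=0.05):
    """One-sided test of independence based on the sum of r_pq^2.

    Assumptions of this sketch: known mean zero (uncentered correlations),
    null mean m(m-1)/(2n) and null variance m(m-1)(n-1)/(n^2 (n+2)).
    Returns (reject, z), where z is the standardized statistic.
    """
    n, m = len(data), len(data[0])
    # Uncentered second moments s_pq = (1/n) * sum_i x_ip * x_iq.
    s = [[sum(row[p] * row[q] for row in data) / n for q in range(m)]
         for p in range(m)]
    stat = sum(s[p][q] ** 2 / (s[p][p] * s[q][q])
               for p in range(m) for q in range(p + 1, m))
    null_mean = m * (m - 1) / (2 * n)
    null_var = m * (m - 1) * (n - 1) / (n ** 2 * (n + 2))
    z = (stat - null_mean) / math.sqrt(null_var)
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # upper alpha-quantile of N(0, 1)
    return z > z_alpha, z

toy = [[1.0, 1.0, 2.0], [2.0, 2.0, -1.0], [-3.0, -3.0, 0.5]]
print(score_test(toy))
```

The toy input only checks the arithmetic; with three samples the normal approximation is of course not meaningful.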
Theorem 2.1 (Main result: asymptotic power).
Suppose such that for some constant . For any significance level , the asymptotic power of is given as
[TABLE]
This theorem resembles Cai and Ma (2013, Theorem ), in which the different problem of testing , where is the covariance matrix of , is studied. Despite this difference, Theorem and Remark in their paper indicate that a matching lower bound on the detectable signal size as measured by can be established for our problem (1.2), which we restate next for our readers’ convenience. We add that Theorem 2.1 is slightly weaker than the parallel result of Cai and Ma (2013) in that an upper bound on the ratio is imposed, which we believe to be merely a proof artifact that is not necessary for the theorem to hold. Discussion of this point is deferred until after the proof.
Theorem 2.2 (Matching lower bound, Cai and Ma (2013)).
Let . Suppose such that for some constant . Then there exists a constant , such that
[TABLE]
for any test with significance level for testing .
The lower bound result says that no -level test for can achieve a preset target power if the signal size falls below a certain threshold modulo the separation rate . Our main result in Theorem 2.1 hence suggests that our test is “rate” optimal when the ratio is bounded, since the asymptotic power tends to one as .
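The qualitative message of the two theorems, namely that power rises from the nominal level to one as the Frobenius-norm signal grows, can be explored with a small Monte Carlo experiment. The sketch below is illustrative only. It assumes an equicorrelated alternative with common correlation rho, for which the squared Frobenius signal equals m(m-1)rho^2, and it reuses the assumed null normalization m(m-1)/(2n) and m(m-1)(n-1)/(n^2(n+2)) for mean-zero normal data. All names and parameter values are hypothetical.

```python
import math
import random
from statistics import NormalDist

def z_score(rows):
    """Standardized sum of squared uncentered sample correlations."""
    n, m = len(rows), len(rows[0])
    s = [[sum(r[p] * r[q] for r in rows) / n for q in range(m)]
         for p in range(m)]
    stat = sum(s[p][q] ** 2 / (s[p][p] * s[q][q])
               for p in range(m) for q in range(p + 1, m))
    mean0 = m * (m - 1) / (2 * n)                        # assumed null mean
    var0 = m * (m - 1) * (n - 1) / (n ** 2 * (n + 2))    # assumed null variance
    return (stat - mean0) / math.sqrt(var0)

def draw(n, m, rho, rng):
    """n rows of m standard normals with common pairwise correlation rho >= 0."""
    a, b = math.sqrt(rho), math.sqrt(1.0 - rho)
    rows = []
    for _ in range(n):
        common = rng.gauss(0.0, 1.0)  # shared factor inducing the correlation
        rows.append([a * common + b * rng.gauss(0.0, 1.0) for _ in range(m)])
    return rows

def rejection_rate(n, m, rho, alpha=0.05, reps=200, seed=0):
    """Monte Carlo estimate of the probability that the test rejects."""
    rng = random.Random(seed)
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    hits = sum(z_score(draw(n, m, rho, rng)) > z_alpha for _ in range(reps))
    return hits / reps

for rho in (0.0, 0.1, 0.5):
    print(rho, rejection_rate(n=40, m=8, rho=rho))
```

Varying rho so that m(m-1)rho^2 is a fixed multiple of m/n is one way to probe the separation rate discussed above, although small simulations of this kind cannot substitute for the asymptotic statement.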
Although the result in Theorem 2.1 is neat, its proof, which occupies the rest of this paper, is quite involved. As will become clear later, this is because our statistic is constructed from Pearson correlations, whose higher-order moment properties take a lot of computation to understand; see Hotelling (1953, Section 7) for classical work on this. At some points in this paper we will use Mathematica to help us with certain symbolic calculations. We shall begin with a Taylor expansion of the expression for in terms of the function in (2.1). We need the multi-index notation: for a vector of non-negative integers, and , and if is a function in arguments, $\partial^{\boldsymbol{\lambda}}g(\tilde{u}_{1},\dots,\tilde{u}_{k})=\frac{\partial^{|\boldsymbol{\lambda}|}g}{\partial u_{1}^{\lambda_{1}}\dots\partial u_{k}^{\lambda_{k}}}\Big|_{u_{i}=\tilde{u}_{i}}$ is its partial derivative with respect to evaluated at the point . Since , by Taylor’s theorem, for each pair ,
[TABLE]
where
[TABLE]
for some , is the remainder in Lagrange’s form. The “almost surely” qualifier appears in (2.6) because, on an event of measure zero, either or may be zero, in which case Taylor’s theorem does not apply since is defined on . Our proof depends crucially on recognizing that, when ,
[TABLE]
in light of Lemma B.1 which specifies the partial derivatives of . One can then equivalently write (2.6) as
[TABLE]
where
[TABLE]
[TABLE]
Defining , and by summing over all , from (2.8) one can write
[TABLE]
realizing that . We are now in the position to introduce three supporting lemmas that are the building blocks of Theorem 2.1. The first lemma gives a Berry-Esseen bound for the cumulative distribution function of the term with after standardization. This will ultimately drive the form of our power function in Theorem 2.1. The next two lemmas control the variability of the extra terms, and . From now on for the rest of this paper all the big , little notations are with respect to our considered asymptotic regime , .
Lemma 2.3 (Berry-Esseen theorem for ).
The following are true for :
(i) Variance:
[TABLE]
for any .
(ii) Berry-Esseen bound:
[TABLE]
Lemma 2.4 (Bound on the 2nd moment of ).
[TABLE]
for any fixed .
Lemma 2.5 (Probability bound for ).
For any , there exists such that
[TABLE]
for large enough .
The proofs of Lemmas 2.3 and 2.4 are separately given in the next two sections. Lemma 2.5 is proved by a standard maximal inequality in Appendix A. With these tools we can now establish Theorem 2.1 based on the general approach laid out in Cai and Ma (2013).
Proof of Theorem 2.1.
From (2.5) and (2.11) the power of our test can be written as
[TABLE]
By dividing the set into two subsets
[TABLE]
and
[TABLE]
where is a sufficiently large constant depending on , it suffices to show
[TABLE]
and
[TABLE]
as , . Together, they lead to the theorem since (2.15) implies that
[TABLE]
To prove (2.14) we first suppose that is larger than , and let be any positive constant satisfying . By definition, for any , it must be the case that for some . Together with the fact that and which are consequences of the choice of , by a union bound and Chebyshev’s inequality we continue from (2.13) and obtain
[TABLE]
Substituting for into the bounds for and in Lemmas 2.3 and 2.4, it is seen that the first term in (2.16) is bounded by a term of order
[TABLE]
Moreover, the second term in (2.16) converges to $0$ as by Lemma 2.5 since is larger than asymptotically for any constant , given that . Together, they imply that the constant can be taken large enough so that
[TABLE]
which is equivalent to (2.14).
To show (2.15), the uniform convergence of power on the “stripe” of alternatives with the signal bounded from above and below in size, we shall first establish that
[TABLE]
uniformly over the set , where
[TABLE]
and is any number such that . By a union bound we have
[TABLE]
for any and large enough . The last inequality comes from the Chebyshev inequality and the fact that, by taking in Lemma 2.5, for large enough , under , we have
[TABLE]
where the constant is the same as the one in Lemma 2.5. Since , it must be that for some , and substituting this into the variance bound in Lemma 2.4 it can be easily seen that
[TABLE]
uniformly over as , . This gives (2.17) since in (2.18).
To finish the proof of (2.15), by union bound arguments one has
[TABLE]
and
[TABLE]
which collectively imply
[TABLE]
since for any and . Moreover, all three terms on the right hand side of (2.20) are of order uniformly over . The first two terms are so by Lemma 2.3 and (2.17), and the last term is so since by Lemma 2.3, where the term is also uniform over . Finally, by Lemma 2.3 as , , we also have
[TABLE]
and it is not hard to see that this implies
[TABLE]
Applying these facts to (2.20) leads to (2.15). ∎
In establishing the normal tail form of our power function, perhaps the most important step is singling out as the main term that drives the asymptotic normality of the left hand side in (2.11) under the “stripe” of alternatives via the Berry-Esseen bound in Lemma 2.3. We note that is already a rather simple term to handle, but proving Lemma 2.3 for it still takes considerable effort in the next section. Moreover, has been used at different places, for instance in the convergences in (2.19) and (2.21). However, the assumption is mostly a convenient one for such statements regarding terms and , since the estimates presented in Lemmas 2.3 and 2.4 are not the sharpest possible, either for aesthetic reasons or to save us some effort in refining them in the next two sections.
It is the remainder term that truly prevents us from removing the upper bound on . In order to show it tends to zero in probability, as in (2.18), we applied the crude tail bound in Lemma 2.5 based on a maximal inequality (see Appendix A). Such an estimate does not take the correlations among the constituent summands into account, as was done for the ’s with respect to via explicitly estimating its second moment in Lemma 2.4. The major obstacle to computing is the random coefficients
[TABLE]
attached to the products in definition (2.7). Unlike , where the constituents have constant coefficients, not only is the coefficient in (2.22) a rational function in , , , but it also involves the intractable random quantity . As such, there is no straightforward way of applying Isserlis’s theorem (Theorem B.2) to compute the moment like we did for in Section 4. In fact, even with the help of Mathematica, it still took us substantial effort to get our bound in Lemma 2.4, as seen later. At this moment, we cannot think of other ways to control the term .
3. The Berry Esseen bound for
We will prove Lemma 2.3 in this section. For our presentation, given a finite set and duples indexed by a subscript that ranges over , we define the central moment quantities
[TABLE]
Recall that is defined as , where each is given in (2.9). We first observe that has a natural martingale structure: For each , let be the sigma-algebra generated by and be the trivial sigma algebra, and define
[TABLE]
as well as
[TABLE]
Then , and is a sequence of martingale differences since
[TABLE]
for , where is trivial for .
With the observations just made it is easy to see that and
[TABLE]
By the i.i.d.’ness of the samples, for each ,
[TABLE]
where, to clarify, means a summation over all pairs of duples such that for each . We have the equality in (3.4) because equals when and zero otherwise. For , let
[TABLE]
correspond to a sum over all duples , such that as a set has cardinality . From (3.3) and (3.4) we can write
[TABLE]
since . In Appendix C, we will show the following estimates hold:
[TABLE]
Substituting these into (3.6) results in Lemma 2.3. In fact, this general strategy of decomposing a sum according to the cardinality of an index set as in (3.5) and forming separate estimates will be employed repeatedly in the sequel.
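The counting behind this kind of decomposition can be checked by brute force. The sketch below assumes, for illustration, that the index set in question is the union of two duples (p, q) and (s, t) with p < q and s < t drawn from m variable indices; it groups all ordered pairs of duples by the cardinality of that union. The function name is hypothetical.

```python
from collections import Counter
from itertools import combinations

def count_by_cardinality(m):
    """Count ordered pairs of duples ((p, q), (s, t)) by |{p, q, s, t}|.

    Duples are unordered pairs p < q taken from range(m).
    """
    duples = list(combinations(range(m), 2))
    counts = Counter()
    for d1 in duples:
        for d2 in duples:
            counts[len(set(d1) | set(d2))] += 1
    return dict(counts)

print(count_by_cardinality(6))
```

For m = 6 the three groups have 15, 120 and 90 members, matching the closed forms C(m,2), 2(m-2)C(m,2) and C(m,2)C(m-2,2). The groups therefore grow at the different rates m^2, m^3 and m^4, which is why separate estimates for each cardinality are natural.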
We shall now prove the normal approximation in Lemma 2.3. By the Berry-Esseen theorem for the martingale central limit theorem in Heyde and Brown (1970), it suffices to verify the fourth-moment conditions
[TABLE]
and
[TABLE]
Note that the equality before (3.11) holds because .
We will first show (3.10). For any , on raising to the th power and taking expectation, by the i.i.d.’ness of samples, we have
[TABLE]
where the summations and are defined similarly as the one in (3.4). The last equality in (3.12) is explained as follows: For a fixed and a given set of variables index pairs , with any choice of the sample indices in order for the expectation
[TABLE]
to be non-zero, by independence it must be true that there exists a permutation function so that
[TABLE]
Since the condition in (3.14) implies that , at most many expectations in (3.13) can be non-zero. This leads to (3.12) since the expectations in (3.13), when they are non-zero, can be uniformly bounded regardless of the choice for , owing to our assumptions at the beginning of Section 2 and Theorem B.2 on higher order normal moments. Provided that , with (3.12) we further write
[TABLE]
Now the last term in (3.15) can be decomposed, according to the cardinality of the set of duples , as
[TABLE]
where for ,
[TABLE]
and the term comes from the fact that there are only many uniformly bounded extra summands under the restriction . In Appendix C we will show that
[TABLE]
for each . Collecting (3.15), (3.16) and (3.17) we get (3.10).
To show (3.11) it suffices to understand the term since the form of has been proven in Lemma 2.3. On expansion,
[TABLE]
Proceeding with our calculations,
[TABLE]
where
[TABLE]
By independence, we note that the expression
[TABLE]
on the right hand side of (3.19) can be non-zero only if the four sample indices are such that either
[TABLE]
[TABLE]
[TABLE]
or
[TABLE]
For any fixed given pair , by simple counting, there are, respectively, , , , combinations of that satisfy (3.21), (3.22), (3.23), (3.24) for which and , where and . Hence,
[TABLE]
where
[TABLE]
are the values of when satisfy the criteria (3.23) and (3.24) respectively. Substituting (3.25) into (3.19) gives
[TABLE]
where the terms in (3.25) are absorbed into the first term because they are uniformly bounded regardless of the choice of , again by our assumptions and Theorem B.2. From this it remains to show the estimates
[TABLE]
[TABLE]
and
[TABLE]
which, together with Lemma 2.3 and (3.26), imply (3.11). The proofs of these estimates will, again, be deferred to Appendix C.
4. The second moment bound for
We will now prove Lemma 2.4. Recall that , and from the definition of in (2.10) we can equivalently write it as
[TABLE]
where
[TABLE]
and
[TABLE]
We form this grouping of terms for reasons that will be explained later. As such, by defining and , one can write
[TABLE]
To finish the proof of Lemma 2.4, it suffices to bound the second moments of and respectively in terms of .
Lemma 4.1 (Bound on the second moment of ).
[TABLE]
for any .
Lemma 4.2 (Bound on the second moment of ).
[TABLE]
for any .
Using Lemmas 4.1 and 4.2, Lemma 2.4 immediately follows from
(i) and (ii) .
For each pair , the main difference between and is that when , all the coefficients appearing in the second term of (4.2) can be bounded by either or up to some multiplicative constants. This makes proving the useful bound for in terms of the norm amenable to the straightforward approach of squaring and taking expectation. Thus we shall defer the proof of Lemma 4.2 to Appendix D and address the bound in Lemma 4.1 for the rest of this section.
We will start with the fact that
[TABLE]
and form estimates for the terms on the right hand side. To understand the mean and variance of , it is more instructive to first recognize that each term in (4.1) can be written as a U-statistic of degree . For instance, for any four distinct indices , if we only treat as a four tuple in , the function
[TABLE]
is symmetric in its four arguments, and the first term in (4.1) can be written as the U-statistic
[TABLE]
where the summation on the right hand side is over all distinct unordered quadruples that can be formed from . We note that the factor appears as a denominator in (4.5) because for each , the summand will appear only once on the left hand side of (4.6), while by the definition of it will appear in kernels that are summed over on the right hand side of (4.6) (since for each , there will be choices of to form a quadruple from ). The factor in the denominator of definition (4.5) thus accounts for this multiple counting.
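The multiple-counting argument can be illustrated with a generic sketch, which is not the kernel of (4.5). It assumes a sum over unordered pairs of sample points of a symmetric function f, and rewrites it as a sum over unordered quadruples by dividing each pair's contribution by the number of quadruples containing that pair, here C(n-2, 2). A generic U-statistic helper is included for comparison. All names are hypothetical.

```python
from itertools import combinations
from math import comb

def u_statistic(samples, kernel, degree):
    """Average of a symmetric kernel over all unordered `degree`-subsets."""
    total = sum(kernel(*subset) for subset in combinations(samples, degree))
    return total / comb(len(samples), degree)

def pair_sum_via_quadruples(samples, f):
    """Rewrite sum_{i<j} f(x_i, x_j) as a sum of a degree-4 kernel.

    Each pair lies in comb(n - 2, 2) quadruples, so dividing by that factor
    inside the kernel removes the multiple counting.
    """
    dup = comb(len(samples) - 2, 2)

    def kernel(*quad):
        return sum(f(a, b) for a, b in combinations(quad, 2)) / dup

    return sum(kernel(*quad) for quad in combinations(samples, 4))

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
direct = sum(a * b for a, b in combinations(xs, 2))
print(direct, pair_sum_via_quadruples(xs, lambda a, b: a * b))
# Degree-2 example: the kernel (x - y)^2 / 2 gives the unbiased sample variance.
print(u_statistic(xs, lambda x, y: (x - y) ** 2 / 2, 2))
```

The two printed sums agree, which is the identity the denominator is designed to enforce; the same bookkeeping applies when the summands depend on three or four distinct indices, with a different duplication factor.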
Note that the other terms of the form in (4.1) are indexed by equal to , , , . These terms can be represented as U-statistics of degree using a similar strategy: With four distinct indices from , by defining the symmetric kernel function
[TABLE]
for , where above we interpret as permutation functions on distinct elements, we have the U-statistic representation of degree
[TABLE]
Note that (4.8) simply comes from Lemma B.1. What we have done here is that, for each term in (4.8) with not necessarily distinct, we find any distinct indices that contain as sets, and arrange the term into one of the three summands of order , and in (4.7) according to the actual set cardinality , which can be equal to , or . Since there are choices of distinct that contain as sets, to account for the duplications we put the factors , , as denominators for the three summands in the definition (4.7) of the kernel. By a simple symmetry argument if we define the kernel
[TABLE]
where , we have
[TABLE]
In the same vein, for equals or and four distinct indices from , we leave it to the reader to check that one can define a symmetric kernel of degree as shown in Appendix D such that
[TABLE]
and
[TABLE]
where
[TABLE]
Letting denote the entire -th sample, we have the degree- U-statistic representation for :
[TABLE]
where
[TABLE]
Hence,
[TABLE]
The expectation for each of in the preceding display can be computed by taking expectation for each of the product terms appearing in in definitions (4.5), (4.7) as well as the counterparts in the definition of in Appendix D (note that quite a few of these expectations are simply zero due to independence of samples). Exploiting symmetry, the same can be done for (4.9) and (4.10). In principle, these higher-order normal moments can all be obtained by laboriously applying Isserlis’s theorem (Theorem B.2) repeatedly. With symbolic computation software such as Mathematica, however, they can be computed with much less effort. These computations lead to
[TABLE]
and further details are given in Appendix D. As a direct consequence of Hoeffding (1948)’s classical result on the variance of U-statistics, we also have the bound
[TABLE]
where
[TABLE]
and the functions , , are defined as
[TABLE]
Hence, forming estimates of the quantities can lead to an estimate of .
Lemma 4.3 (Bound for the ’s).
[TABLE]
Again, proving these estimates involves repeatedly applying Theorem B.2 with the help of Mathematica, and the details will be deferred to Appendix D. We note that these estimates are by no means sharp, but suffice for our purpose. Putting Lemma 4.3 and (4.14) together, it is a routine task to check that
[TABLE]
for any . This, together with (4.4) and (4.13), proves Lemma 4.1.
5. Conclusion
In this paper, we studied the exact power of the Rao’s score statistic for testing independence, under the asymptotic regime where both the dimension and sample size grow to infinity when the ratio is bounded. A consequence of our main result is that the Rao’s score test is minimax rate optimal under this regime, with respect to a signal size of order .
While previous related work (Chen and Shao, 2012) on the null theory only requires the random variables to have finite moments, our power analysis relied on the normality assumption in different ways. Via applications of Isserlis’s theorem on normal moments (Theorem B.2), all the higher moment quantities appearing in the calculations for the terms and in Sections 3 and 4 can be controlled in terms of , a second moment quantity in the original variables per se. It is thus conceivable that one can replace normality with appropriate higher moment conditions by carefully keeping track of these calculations. The tail bound for in Lemma 2.5 relies on a maximal inequality applicable to sub-exponential random variables, a property that holds for the centered sample covariances when they are formed with normal data (see Appendix A). When normality cannot be assumed, we expect that one can use more general maximal inequalities such as Chernozhukov et al. (2015, Lemma 8) along with their consequential moment conditions. A final caveat for pursuing the non-normal generality is that one should consider the more common definition of the sample covariance in (2.4) when constructing the Pearson correlations. Comparing (2.3) with (2.4), the insertion of sample means will likely complicate the calculations to follow under our current proof strategy.
Acknowledgments
We thank the referees for their valuable comments and suggestions. Qi-Man Shao’s research is partially supported by the grant Hong Kong RGC GRF14302515.
Appendix A Probability tail bound of
We will prove the tail bound for in Lemma 2.5. For , by a standard trick (Bickel and Levina, 2008, p.221), for any , one can show the sub-exponential inequality
[TABLE]
under our assumptions at the beginning of Section 2. Then by the maximal inequality in van der Vaart and Wellner (1996, Lemma 2.2.10) and a union bound, we have for any ,
[TABLE]
Note that by the definition of ,
[TABLE]
for . If , for all it must be true that
[TABLE]
since . Combining (A.1), (A.2) and (A.3), with probability larger than
[TABLE]
for large .
Appendix B Technical tools
In this section we will lay out the technical tools required to finish the proofs in the paper.
Lemma B.1.
Let be as defined in (2.2). For any
[TABLE]
Theorem B.2 (Isserlis (1918)).
For any natural number , let be a mean zero normal vector with covariance matrix . Then
[TABLE]
where the summation is over all possible partitions of the indices into pairs .
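The sum over pair partitions can be evaluated mechanically. The following sketch enumerates all pairings of a list of indices and sums the corresponding products of covariances; it is a direct, unoptimized reading of the theorem for small index sets, with hypothetical function names, and it returns zero for an odd number of factors.

```python
def pair_partitions(indices):
    """Yield every way of splitting the list `indices` into unordered pairs."""
    if not indices:
        yield []
        return
    first, rest = indices[0], indices[1:]
    for k, partner in enumerate(rest):
        remaining = rest[:k] + rest[k + 1:]
        for tail in pair_partitions(remaining):
            yield [(first, partner)] + tail

def isserlis_moment(cov, indices):
    """E[prod_k X_{indices[k]}] for a mean-zero normal vector with covariance `cov`."""
    if len(indices) % 2:
        return 0.0  # odd-order moments of a centered normal vector vanish
    total = 0.0
    for pairing in pair_partitions(list(indices)):
        term = 1.0
        for a, b in pairing:
            term *= cov[a][b]
        total += term
    return total

cov = [[1.0, 0.5], [0.5, 1.0]]
print(isserlis_moment(cov, (0, 0, 0, 0)))  # fourth moment of a standard normal
print(isserlis_moment(cov, (0, 0, 1, 1)))  # E[X_0^2 X_1^2] = 1 + 2 * 0.5^2
```

The number of pairings of 2k indices is (2k-1)!!, so this enumeration is only practical for the low orders that arise in hand calculations; symbolic software performs the same bookkeeping at scale.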
Corollary B.3.
For any four indices ,
[TABLE]
Proof.
A simple corollary of Theorem B.2. ∎
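For orientation, the four-variable case of Theorem B.2 can be written out explicitly. The notation below is chosen for illustration and need not match the display of Corollary B.3: $(X_1, X_2, X_3, X_4)$ is a mean-zero normal vector with covariances $\sigma_{ab} = E[X_a X_b]$, and the three terms correspond to the three ways of pairing four indices.

```latex
\begin{align*}
E[X_1 X_2 X_3 X_4]
  &= \sigma_{12}\sigma_{34} + \sigma_{13}\sigma_{24} + \sigma_{14}\sigma_{23}, \\
\operatorname{Cov}(X_1 X_2,\, X_3 X_4)
  &= E[X_1 X_2 X_3 X_4] - \sigma_{12}\sigma_{34}
   = \sigma_{13}\sigma_{24} + \sigma_{14}\sigma_{23}.
\end{align*}
```

The second line is the form that typically enters variance calculations for products of normal variables, since the pairing that matches each product with itself cancels against the product of the means.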
Lemma B.4.
For a fixed natural number , suppose , are any pairs of variable indices. Then
[TABLE]
where the term is uniform for all choices of , .
Proof.
On expansion,
[TABLE]
so we only need to show the term in on the right hand side above is a uniform term. We note that, by independence, an expectation on the right hand of the preceding display can only be non-zero if
[TABLE]
One way that (B.1) may happen is when there is a permutation such that
[TABLE]
There can at most be many combinations of satisfying (B.2) since when (B.2) is true, the set can at most have elements leaving us with many choices for the combination of . We note that when a configuration in (B.1) is such that the set has cardinality exactly equal to ,
[TABLE]
by Corollary B.3. One can also easily see that there are at most many combinations of other than ones satisfying (B.2) that can lead to (B.1). Hence by Theorem B.2 and our assumption at the beginning of Section 2, we have
[TABLE]
where the is uniform for all choices of . ∎
The next two lemmas on sums of products of the entries in the population correlation matrix are keys for finishing our proofs.
Lemma B.5.
Suppose is a particular permutation of the four indices , say, . The following estimates are true:
[TABLE]
Proof of Lemma B.5.
With a slight abuse of notations, the expression “” means that is a number that is not equal to nor .
By the fact that for all ,
[TABLE]
which proves (B.4). Similarly,
[TABLE]
where the last inequality comes from a similar proof as the one for (B.4). ∎
Lemma B.6.
For ,
(i)
[TABLE]
(ii) Suppose and are two fixed permutations of the eight indices ; for instance, can be equal to, say, . Then
[TABLE]
Proof of Lemma B.6.
We first note that for , By the inequality that for all , we have
[TABLE]
hence to show it suffices to show
[TABLE]
Given , when of the indices are distinct, it must be the case that there exist pairs of such that all indices from these pairs are distinct elements from . Without loss of generality we can assume these pairs to be , which contain a total of distinct indices, and for proving and (B.6) it suffices to show, respectively,
[TABLE]
and
[TABLE]
As all the ’s are bounded in absolute value by , summing over the other indices different from results in a term which gives
[TABLE]
Now since
[TABLE]
by standard norm inequality, evaluating the sum on the right hand side of (B.9) we further obtain
[TABLE]
which is exactly (B.7). Similarly, summing over the other indices different from on the left hand side of (B.8) results in a term and hence
[TABLE]
Since , we get (B.8) by continuing from the preceding display. ∎
Appendix C Proofs for Section 3
C.1. Proof of (3.7)-(3.9)
First, we show the estimates in (3.7)-(3.9). Note that by Corollary B.3,
[TABLE]
for . Also recall that for all . We will analyze the sum in (C.1) for different .
(3.7): When , with for , it must be that and , and hence from (C.1)
[TABLE]
since and .
(3.8): When , one possible configuration of as a set with cardinality is that
[TABLE]
Taking a sum just over the terms in (C.1) whose indices satisfy the configuration (C.2) we get
[TABLE]
where the second last inequality is true because we enlarged the set of indices we are summing over and used the fact that
[TABLE]
since any is less than , and the last inequality follows from (B.5) and that
[TABLE]
The same estimates can be proved for other set configurations of similar to the one in (C.2). Since there are only finitely many such configurations, we get the estimate in (3.8).
(3.9): By considering different configurations for the set with cardinality , from (C.1) we have
[TABLE]
where the last inequality used (B.4) and
[TABLE]
C.2. Proof of (3.17)
In fact, the strategy we used in proving (3.8) will also lead to a quick proof of the estimates for , in (3.17). By definition,
[TABLE]
By expanding the product at the end of the above equation and taking expectations by means of Theorem B.2, one can see that
[TABLE]
where here we interpret as a permutation of the eight indices . When the permutation , we have
[TABLE]
by Lemma B.6. Although (C.4) is only proved for , the same argument easily generalizes to give the same bound for all other permutations, which gives our estimate in (3.17) in light of (C.3).
C.3. Proof of (3.27)-(3.29)
(3.27): We first write
[TABLE]
where the term comes from a remaining sum of many universally bounded terms when . By the definition of in (3.20), Corollary B.3 and Lemma B.6, it can be seen that for each ,
[TABLE]
giving (3.27) in light of (C.5).
(3.28) and (3.29): Similar to (C.5) for , we can write
[TABLE]
By Corollary B.3 we get that is a finite sum of terms each having the form
[TABLE]
for and that are certain permutations of the indices . As such, by Lemma B.6, for a given ,
[TABLE]
Given (C.6) and (C.3) it remains to show
[TABLE]
and, for ,
[TABLE]
to prove (3.28) and (3.29). To that end we make the following claim:
Claim. Suppose and are two given permutations of eight indices . Then
[TABLE]
unless, as elements in ,
[TABLE]
for all when for all and .
The proof of this claim will be left till the end of this section. Using this, we will first show (C.9) for while the proof for follows similarly and is thus omitted.
By Corollary B.3, on expansion we get that the is a finite sum of terms each having the form as in the left hand side of (C.10) with and NOT satisfying the description in (C.11) of the claim. For example, by Corollary B.3, on expansion
[TABLE]
[TABLE]
which leads to
[TABLE]
where
[TABLE]
and
[TABLE]
and similar terms are omitted above. Note that when and , there must be a pair among that contains two distinct elements in due to a mismatch of the permutations and : for if not, in consideration of it must be the case that , , and with being four distinct elements in , but this will imply , a contradiction. By the claim above, the first term on the right hand side of (C.12) equals , whereas the finitely many omitted terms in (C.12) can be bounded similarly, and (C.9) is proved.
We now show (C.8). Again with Corollary B.3, we expand
[TABLE]
where we leave it to the reader to check that the omitted terms in of (C.15) are of order due to a mismatch of permutations as in (C.13) and (C.14). In fact, summing over the three terms on the second line of (C.15) also contributes a term of order . For example, the sum over the last term on the second line of (C.15) equals
[TABLE]
with
[TABLE]
When and , we cannot have for all and hence by the previous claim (C.16) is of order . Hence it remains to show that summing over the terms on the first line of (C.15) gives
[TABLE]
When with for all , as a set can take the configuration
[TABLE]
When (C.18) is true, , and hence
[TABLE]
For any configuration of the set other than (C.18), one of
(i) , (ii) , (iii) or (iv) must be true. For example, one such configuration is
[TABLE]
For this particular configuration, is true. Then we leave it to the reader to verify that by the same line of reasoning as in the proof of the claim below, we can show
[TABLE]
where similar bounds can in fact be proved for all configurations of other than (C.18). This, together with (C.19), leads to (C.17).
Proof of the claim.
Suppose (C.11) is not true for some , and without loss of generality we assume . Since for all , we have
[TABLE]
as desired. ∎
Appendix D Proofs for Section 4
Before finishing the proofs for Section 4, we first give the definition of the kernel as mentioned in the main text:
[TABLE]
We now proceed with the remaining proofs.
Proof of Lemma 4.2.
Note that by definition,
[TABLE]
Since there are only finitely many we are summing over for the second term in (D.1), by the general fact that , it suffices to show that, for with and , the quantities
[TABLE]
as well as
[TABLE]
can be bounded by the right hand side of (4.3) up to some multiplicative constants. We will first show it for (D.2) case by case according to the multi-index degree of . The arguments rely on the fact that, by Lemma B.1, it must be true that
[TABLE]
and
[TABLE]
for some constant . We consider the following cases:
or : By the facts in (D.4) and (D.5), together with Lemma B.4, (D.2) is less than
[TABLE]
Respectively, by properties of norms they can be estimated by
[TABLE]
which are both less than the right hand side of (4.3) up to constants since or .
: The only with and are , , . When , by (D.5) and Lemma B.4 the second moment quantity in (D.2) is bounded by
[TABLE]
which is less than the right hand side of (4.3). When , by Lemma B.1, (D.2) equals
[TABLE]
where the second equality comes from the fact that
[TABLE]
because the samples are i.i.d. and by Corollary B.3. To show the last equality, by exploiting symmetry it is easy to see that
[TABLE]
Observe that
[TABLE]
In light of Lemma B.5, applying these bounds to (D.8) implies (D.7).
: The only ’s with and are and . For the first three of these since , by (D.5) and Lemma B.4 the quantity in (D.2) equals
[TABLE]
For , with Lemma B.1 the quantity in (D.2) equals
[TABLE]
By a simple argument as in the proof of Lemma B.4 and Corollary B.3, it is not hard to see that
[TABLE]
Substituting (D.11) into (D.10) we get
[TABLE]
where the last two inequalities use arguments similar to those that prove (D.7). By a symmetry argument the same estimate holds for . Both (D.9) and (D.12) are less than the right hand side of (4.3).
It remains to form an estimate for (D.3). Note that
[TABLE]
where the term comes from an argument similar to the proof of Lemma B.4, and the term comes from the many choices for when . Hence it now suffices to show that the first term on the right hand side of the preceding display is less than the right hand side of (4.3). The argument is simple but a little tedious, so we only sketch it here. By an argument similar to the proof of Lemma B.4, we must have
[TABLE]
where the expectations on the right come from the fact that must pair with one of , , , and as in (B.3). By Corollary B.3, for equal to , or , it must be that
[TABLE]
and for equal to or , it must be that
[TABLE]
Substituting (D.14) and (D.15) into (D.13) gives that
[TABLE]
which gives us an estimate less than the one required. ∎
Proof of (4.13) and Lemma 4.3.
As described in the main text, with the help of the Expectation function provided by Mathematica, we find that
[TABLE]
[TABLE]
[TABLE]
for each pair . Collecting these and summing over all gives the expectation of the kernel in (4.13). We now prove Lemma 4.3, dealing first with (4.19). Note that simply equals the kernel function ; in particular, for a set of distinct sample indices we have
[TABLE]
by collecting the terms in the definition of , where is just a fixed polynomial in whose form is irrelevant to us. Using the fact that for all , we have
[TABLE]
A key observation is that upon squaring and taking expectation, must be a sum of finitely many terms, where for some sample indices , each of these terms can be “” bounded by the form
[TABLE]
where for any sample index and variable index , and may equal one of
[TABLE]
Now if , the form in (D.17) equals zero. If , the form in (D.17) can be bounded by
[TABLE]
and by applying Corollary B.3, we leave it to the reader to check that the leading term in the preceding display must be “” bounded by . Summarizing this gives us the bound in (4.19).
We now turn to (4.16)-(4.18). The functions , for the kernel can be found by simply conditioning and taking expectation using Theorem B.2. With the help of Mathematica, they are found to be
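The mixed Gaussian moments that this proof obtains from Mathematica can be cross-checked independently. The following is a minimal sketch, not the authors' computation: it swaps in the standard Isserlis (Wick) pairing formula for a centered Gaussian vector in place of Mathematica's Expectation function. The function names `pairings` and `gaussian_moment`, and the example correlation value, are hypothetical choices made for illustration.

```python
from fractions import Fraction


def pairings(items):
    """Yield every way to split the list `items` into unordered pairs."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for k in range(len(rest)):
        partner = rest[k]
        remaining = rest[:k] + rest[k + 1:]
        for tail in pairings(remaining):
            yield [(first, partner)] + tail


def gaussian_moment(indices, cov):
    """E[prod_k X_{indices[k]}] for a centered Gaussian vector with covariance `cov`.

    Uses the Isserlis pairing formula: the moment is the sum, over all
    pairings of the listed indices, of the product of the paired covariances.
    Moments of odd total degree vanish.
    """
    if len(indices) % 2 == 1:
        return 0
    total = 0
    for pairing in pairings(list(indices)):
        term = 1
        for a, b in pairing:
            term *= cov[a][b]
        total += term
    return total


# Assumed example: two standardized normal variables with correlation rho.
rho = Fraction(1, 2)
cov = [[1, rho], [rho, 1]]
second_order = gaussian_moment([0, 0, 1, 1], cov)  # E[X0^2 X1^2] = 1 + 2*rho^2
```

Exact rational arithmetic via `Fraction` keeps the check free of floating-point error, which is convenient when comparing against closed-form polynomials in the correlations.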
[TABLE]
[TABLE]
and
[TABLE]
Above, and are simply three fixed polynomials in their respective arguments, and their forms are irrelevant to us. , and simply collect the terms that do not involve and , respectively. Note that by the same fact that for ,
[TABLE]
Note that in the definition of , there is a leading factor of order . By applying Theorem B.2 with the help of Mathematica, we find that is a finite sum of terms, each of which, up to a factor of order , can be bounded by one of the forms:
[TABLE]
We leave it to the reader to check that with the two estimates in Lemma B.5 and the familiar trick of decomposing a sum according to the cardinality of an index set as in (3.5), the forms in (D.21) can all be bounded by
[TABLE]
up to constants, and hence from (D.18) we obtain the estimate for in (4.16). By the same token, with the help of Mathematica we observe that
[TABLE]
and, again by the index set decomposition trick and Lemma B.5, we have
[TABLE]
Collecting (D.19), (D.20), (D.22) and (D.23) gives us (4.17) and (4.18).
∎
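The index set decomposition used in the proofs above groups the terms of a sum over index tuples by how many distinct values each tuple contains. The reason this helps is purely combinatorial: among tuples of length k with entries in a set of size m, those with exactly c distinct values number S(k, c) m(m-1)...(m-c+1), where S(k, c) is a Stirling number of the second kind, so groups with repeated indices have lower-order counts in m. The sketch below is a generic illustration of that counting fact by brute force, not a reproduction of (3.5); the function name and the values of `m` and `k` are assumptions chosen for the example.

```python
from collections import Counter
from itertools import product


def count_by_cardinality(m, k):
    """Count tuples in {0, ..., m-1}^k, grouped by the number of distinct entries."""
    counts = Counter()
    for tup in product(range(m), repeat=k):
        counts[len(set(tup))] += 1
    return dict(counts)


# Assumed example: quadruples of indices drawn from a set of size 5.
counts = count_by_cardinality(5, 4)
```

For quadruples, the group with four distinct indices has about m^4 members, while the groups with three, two, or one distinct index have on the order of m^3, m^2, and m members respectively, which is why uniformly bounded summands on the degenerate groups only contribute lower-order remainders.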
