Universality for eigenvalue algorithms on sample covariance matrices
Percy Deift, Thomas Trogdon

TL;DR
This paper establishes a universal statistical behavior for the iteration count of eigenvalue algorithms applied to random sample covariance matrices, providing complexity estimates that hold with high probability.
Contribution
It proves a universal limit theorem for the halting time of key eigenvalue algorithms on sample covariance matrices, linking algorithm complexity to random matrix theory results.
Findings
Universal limit theorem for halting time of eigenvalue algorithms
High-probability complexity estimates for random covariance matrices
Application of eigenvalue and eigenvector statistics to algorithm analysis
Abstract
We prove a universal limit theorem for the halting time, or iteration count, of the power/inverse power methods and the QR eigenvalue algorithm. Specifically, we analyze the required number of iterations to compute extreme eigenvalues of random, positive-definite sample covariance matrices to within a prescribed tolerance. The universality theorem provides a complexity estimate for the algorithms which, in this random setting, holds with high probability. The method of proof relies on recent results on the statistics of the eigenvalues and eigenvectors of random sample covariance matrices (i.e., delocalization, rigidity and edge universality).
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsRandom Matrices and Applications · Advanced Combinatorial Mathematics · Advanced Algebra and Geometry
Universality for eigenvalue algorithms on sample covariance matrices
Percy Deift
Courant Institute of Mathematical Sciences, New York University, 251 Mercer St., New York, NY 10012, USA
and
Thomas Trogdon
Department of Mathematics, University of California, Irvine, Irvine, CA 92697-3875, USA
Abstract.
We prove a universal limit theorem for the halting time, or iteration count, of the power/inverse power methods and the QR eigenvalue algorithm. Specifically, we analyze the required number of iterations to compute extreme eigenvalues of random, positive-definite sample covariance matrices to within a prescribed tolerance. The universality theorem provides a complexity estimate for the algorithms which, in this random setting, holds with high probability. The method of proof relies on recent results on the statistics of the eigenvalues and eigenvectors of random sample covariance matrices (i.e., delocalization, rigidity and edge universality).
Key words and phrases:
universality, eigenvalue computation, random matrix theory
2000 Mathematics Subject Classification:
15B52, 65L15, 70H06
The authors would like to thank Folkmar Bornemann for the data to display . This work was supported in part by grants NSF-DMS-1303018 (TT) and NSF-DMS-1300965 (PD).
1. Introduction
In this paper, we prove a universal limit theorem for the fluctuations in the runtime (or halting time) of three classical eigenvalue algorithms applied to positive-definite random matrices. The theorem is universal in the sense that the limiting distribution does not depend on the distribution of the entries of the matrix (within a class).
One can trace the search for universal behavior in eigenvalue algorithm runtimes to the largely-experimental work of Pfrang, Deift and Menon [20]. The authors considered three algorithms (QR, matrix sign and Toda) and ran the algorithms to the time of first deflation, which we now describe in more detail. Given an matrix , the algorithms produce isospectral iterates , , , and generically . Necessarily, the ’s are the eigenvalues of . However, one does not typically run the algorithm until the norm of all of the off-diagonal entries is small. Rather, one considers the submatrices which consist of the entries of that are in the first rows and the last columns. The -deflation times are defined as
[TABLE]
Here denotes the Frobenius norm111The authors in [20] actually considered a scaled -norm rather than the Frobenius norm.. Then the time of first deflation is given by . We define to be the largest value of such that . It follows that when , the eigenvalues of the leading and submatrices approximate the eigenvalues of to . The algorithm is then applied to the smaller submatrices, and so on.
A typical experiment from [20] goes as follows. Let and be matrices of iid standard normal and iid mean-zero, variance-one Bernoulli random variables, respectively. Then define and which are real, symmetric random matrices (see [6] for complex Hermitian matrices). After sampling the integer-valued random variables and , for large, say 10,000 times, we define the empirical fluctuations
[TABLE]
where and represent the sample mean and sample standard deviation, respectively. We plot the histograms for the empirical fluctuations in Figure 1. The histograms overlap surprisingly well for the two different ensembles, indicating that after centering and rescaling, the distribution of the time of first deflation is universal.
Proving theorems about the random variable is particularly difficult as one has to analyze the minimum of correlated random variables. In [8] the authors proved a (universal) limit theorem for the -deflation time of the so-called Toda algorithm. In this paper we prove an analog of that result for the -deflation time for the QR (eigenvalue) algorithm acting on positive definite matrices. This is an important first step in proving a limit theorem for because it is the most likely that , see Figure 1. We also include similar results for the power (P) and inverse power (IP) methods as the analysis is similar. For these two methods, we incorporate random starting vectors. The analysis and results of the current work are quite similar to that in [8], where we prove universality for the Toda eigenvalue algorithm, showing its wide applicability.
1.1. Relation to previous work and complexity theory
The statistical analysis of algorithms has been performed in many settings, usually with an eye towards complexity theory. In relation to Gaussian elimination, the seminal work is the analysis of Goldstine and von Neumann [13] on the condition number of random matrices. This is closely related to the later work of Edelman [9], also on condition numbers. The expected number of pivot steps in the simplex algorithm was analyzed by Smale [27] and Borgwardt [4]. The methodology of smoothed analysis was introduced in [28] and applied in a variety of settings [17, 19, 24].
The closest work, within the realm of complexity theory, to the current work is that of Kostlan [16]. Kostlan showed that for the power method on the expected halting time to compute an eigenvector is infinite. Kostlan showed that when one conditions on all of the eigenvalues being positive, the upper bound on the halting time is . Instead of conditioning, and eigenvector computation, we turn to sample covariance matrices which (with high probability) have positive eigenvalues and use the power methods to compute the extreme eigenvalues. With this we are able to determine the precise limiting distribution of the halting time, which contains far more information than simply an upper bound. To our knowledge, this is the first time this has been done for a classical numerical method.
For in the scaling region given by Condition 2.1, the halting times given in Theorem 1 scale like in order to obtain an accuracy of . This is a key conclusion of our results which gives an estimate on the complexity of the QR algorithm, and also the power and inverse power methods.
Through many detailed computations, universality in numerical computation has been observed in many numerical algorithms beyond the QR algorithm and the power and inverse power methods (see [8, 6, 7, 20, 23] ): the conjugate gradient algorithm, the matrix sign eigenvalue algorithm, the Toda eigenvalue algorithm, the Jacobi eigenvalue algorithm, the GMRES algorithm, a genetic algorithm and the gradient and stochastic gradient descent algorithms. This work presents further examples, in addition to [8], where one can prove this type of universality. This advances the contention of the authors that universality is a bona fide and basic phenomenon in numerical computation.
1.2. Open questions
The main open question related to this work is the asymptotics of the time of first deflation . A related and unknown detail is the tail behavior of the limiting distribution. As discussed in detail in [8], the limiting distribution in Theorem 1 in [8] for the halting time has one finite moment for real matrices and two finite moments for complex matrices. If one constructed an algorithm with a sub-Gaussian limiting distribution, it may be preferable. We believe this is the case for . We also believe that its distribution is related to the largest gap in the spectrum of the stochastic Airy operator [22]. Furthermore, can one extend our results for the QR algorithm to indefinite ensembles?
We only consider random matrices with entries that are exponentially localized, see 1. It is not known if this condition can be relaxed but it is itself an important open question. Finally, additional halting criteria can be employed. One could look for the time to compute eigenvectors with the power method, or to compute the entire spectrum with the QR algorithm.
2. Main results
In this paper we discuss computing the smallest and largest eigenvalues of random positive definite matrices to an accuracy . We have a basic condition that we enforce on which requires that is appropriately small.
Condition 2.1**.**
[TABLE]
for fixed.
For , we let be the iterates of the QR algorithm (QR, defined in Section 5.2) and and be the iterates of the power and inverse power methods, respectively (P and IP, respectively, defined in Section 5.1). We specify (discrete) halting times for these algorithms applied to a matrix with starting vector as follows:
[TABLE]
Note that for the QR algorithm the entry of , , is an approximation of the smallest eigenvalue as is . On the other hand, is an approximation of the largest eigenvalue . Our main results are summarized in the following Theorem and Propsosition. See Definition 1 for the definition of sample covariance matrices and Definition 3 for the distribution function . The constants and are given in (2).
Theorem 1** (Universality).**
Let be a real () or complex () sample covariance matrix and let be a (random) unit vector independent of . Assuming satisfies Condition 2.1, for
[TABLE]
This theorem is a direct consequence of Theorems 5 and 6, after noting that, for example, where appears in Theorem 5. This is a universality theorem in the sense that it states that for is sufficiently large the distribution of the halting time is independent of the distribution on .
The following Proposition shows that we obtain an accuracy of but not of , i.e. that our halting criteria are sufficient but not too restrictive. It is a restatement of Propositions 4 and 5.
Proposition 1**.**
Assuming satisfies Condition 2.1, for any real or complex sample covariance matrix
[TABLE]
converge to zero in probability, while
[TABLE]
converge to in probability.
A numerical demonstration of Theorem 1 is given in Section 4.
The outline of the paper is as follows. In Section 3 we discuss the fundamental results of random matrix theory that are required to prove our results. In Section 4 we give a numerical demonstration of Theorem 1. Next, in Section 5, we discuss the fundamentals of the power methods and the QR algorithm before we apply the random matrix estimates in Section 6 to prove our results. In Appendix A we analyze the true error of the methods with our chosen halting criteria to see that these criteria are indeed appropriate to the task. Finally, in Appendix B we discuss the asymptotic normality of eigenvector projections of random vectors. This allows us to show that Theorem 1 indeed holds for random starting vectors in the power and inverse power methods.
3. Results from random matrix theory
We now introduce the ideas and results from random matrix theory that are needed to prove our main theorems. Let be an real or complex matrix with . We consider the ordered eigenvalues , of , . Let denote the absolute value of the last components of the associated normalized eigenvectors. We only consider sample covariance matrices from independent samples.
Definition 1** (Sample covariance matrix (SCM)).**
A sample covariance matrix (ensemble) is a real symmetric () or complex Hermitian () matrix , such that are independent random variables for , given by a probability measure with
[TABLE]
Next, assume there is a fixed constant (independent of ) such that
[TABLE]
For (when is complex-valued) the condition
[TABLE]
must also be satisfied.
We assume all SCMs have . Define the averaged empirical spectral measure
[TABLE]
where the expectation is taken with respect to the given ensemble. For technical reasons we let and satisfy . More specifically, we consider .
Remark 3.1**.**
The case where is of considerable interest: If then it is known that the limiting distribution of the smallest eigenvalue is given in terms of the so-called Bessel kernel [2, 10] when has Gaussian divisible entries. If , and are standard complex normal random variables then it is known that the smallest eigevalue has Tracy–Widom fluctuations [7]. It is noted in [21, Section 1.4] that establishing all estimates we use below in the case is a difficult problem. In light of the current work, this is a particularly interesting problem as it would give different scalings for the halting times.
Define the Marchenko–Pastur law
[TABLE]
and denotes the positive part. For SCMs, converges to weakly and is called the equilibrium measure for the ensemble (see, for example, [18, 21, 26, 29, 32]).
Definition 2**.**
Define to be the smallest value of such that
[TABLE]
Thus represent the quantiles of the equilibrium measure. We now describe conditions on the matrices that simplify the analysis of the algorithms QR, P and IP.
Condition 3.1**.**
For ,
- •
.
Let denote the set of matrices that satisfy this condition.
Condition 3.2**.**
For ,
- •
.
Let denote the set of matrices that satisfy this condition.
Given an SCM, let be a random (or deterministic) unit vector independent of the SCM. Define , where is the th eigenvector of the SCM.
Condition 3.3**.**
For any fixed ,
- (1)
* for all * 2. (2)
* for ,* 3. (3)
, for , 4. (4)
, for , and 5. (5)
* for all .*
Let denote the set of matrices that satisfy these conditions.
Remark 3.2**.**
Clearly the quantiles lie in the interval . Property (5) above implies, in particular, that for sufficiently large, the eigenvalues of matrices in lie in the interval for any given .
The analysis of the eigenvalues of sample covariance matrices has a long history, beginning with the work of Marc̆enko and Pastur [18]. The seminal work of Geman [12] showed that for , , the largest eigenvalue of an SCM converges a.s. to . Silverstein [25] established that for , the smallest eigenvalue converges a.s. to when are iid standard normal random variables. See [10, 14, 15] for the first results on the fluctuations of the largest and smallest eigenvalues when are iid (real or complex) standard normal distributions. Universality for the eigenvalues of at the edges and in the bulk, was first proved by Ben Arous an Peché [2] for Gaussian divisible ensembles, in the limit , , fixed. We reference [21] and [3] for the most comprehensive results. Note that we require (1) which is stronger than the assumptions in [12, 32] which only require moment conditions. Various limits of the eigenvectors have also been considered, see [1, 26]. But we reference [3] for the full generality we need to prove our theorems.
Theorem 2**.**
*For SCMs *
[TABLE]
and
[TABLE]
separately converge jointly in distribution to random variables which are the smallest three eigenvalues of the so-called stochastic Airy operator. Furthermore, are distinct with probability one.
Proof.
The first statement follows from [3, Theorem 8.3]. The second statement follows from [21, Theorem 1.1 & Corollary 1.2]. The fact that the eigenvalues of the stochastic Airy operator are distinct is shown in [22, Theorem 1.1]. ∎
Definition 3**.**
The distribution function , supported on for is given by
[TABLE]
The remaining theorems in this section are compiled from results that have been obtained recently in the literature. We use a simple lemma (see, for example, [8, Lemma 3.2]):
Lemma 1**.**
If in distribution222For convergence in distribution, we require that the limiting random variable satisfies . as then for any
[TABLE]
as provided that .
Theorem 3**.**
For SCMs, Condition 3.3 holds with high probability as , that is, for any
[TABLE]
as .
Proof.
It suffices to show that each of the sub-conditions 1-5 in Condition 3.3 hold with high probability. Conditions 3.3.1-2 hold with high probability directly by Proposition 6. Conditions 3.3.3-4 hold with high probability by the joint convergence of the top (bottom) three eigenvalues in Theorem 2 and Lemma 1. Finally, Condition 3.3.5 holds with high probability as a direct consequence of [21, Theorem 3.3]. ∎
Theorem 4**.**
For SCMs,
[TABLE]
Proof.
It follows from Theorem 2 that
[TABLE]
Then
[TABLE]
But from [22, Theorem 1.1] . And so, it suffices to show that
[TABLE]
This will, in turn follow, if we show that
[TABLE]
converges to zero in probability for fixed. We set where converges jointly in distribution by Theorem 2. Let be the event and for consider
[TABLE]
Given , we perform a formal expansion
[TABLE]
Therefore, given , tends to zero uniformly and we find
[TABLE]
Because of joint convergence (in distribution) of , the right-hand side tends to zero as . This establishes the result for . Similar considerations yield the result for . ∎
4. A numerical demonstration
We include some numerical simulations that serve to demonstrate Theorem 1. We include ideas that were discussed in detail in [8]. En route to proving Theorem 1 we perform the following approximation step for A = QR, IP or P
[TABLE]
where is given in (7) and (14) below. The difference is always less than unity and the difference is (see Proposition 2, for example). Then converges in distribution, after rescaling, to but it is clear from the proof of Theorem 6 that the rate of covergence is logarithmic, at best. To improve the rate we note that
[TABLE]
for any constant . Here is taken if A = P and is taken if A = QR, IP. We choose (cf. with chosen in [8]), using (7), by
[TABLE]
After examining (14), we choose
[TABLE]
Then changing and in (14) we choose
[TABLE]
Despite the fact that these ’s are not constant, from Theorem 2 one should expect they have well-defined limits as . These effective constants can be easily approximated by sampling the associated matrix distributions.
In Figure 2 we demonstrate (3) and hence Theorem 1 for the QR algorithm. Figures 3 and 4 demonstrate the analogous results for the inverse power method and power method, respectively. The ensembles we use are the following:
- LOE
: (in Definition 1 below) has iid standard real Gaussian entries,
- LUE
: has iid standard complex Gaussian entries,
- BE
: has iid mean-zero, variance-one Bernoulli entries ( with equal probability),
- CBE
: has iid mean-zero, variance-one complex Bernoulli entries ( , , with equal probability)
The density was computed by the authors in [8]. We sample the matrix distributions for large and use appropriate interpolation. The density was computed in [31] (and rescaled in [8]) and the data to reproduce it here was provided by the authors of that work.
Finally, in Figure 5 we show the statistics of the time of first deflation, as defined in the introduction, for LOE and BE when . This demonstrates universality for the time of first deflation but the limiting distribution (whatever it may be!) is clearly distinct from both histograms in Figure 1 and the limiting distribution in Theorem 1. And so, computing the limiting distribution for the rescaled time of first deflation requires information about much more than just the -deflation time.
5. Fundamentals of the algorithms
Here we discuss the QR algorithm and power/inverse power methods. We derive explicit formulae to analyze the halting times of the algorithms.
5.1. The power and inverse power methods
Let be a sequence of independent, real, mean-zero, and variance-one random variables. The power and inverse power methods with random starting are given in Algorithms 1 and 2.
The power method (see Algorithm 1 above) is halted when successive approximations have a difference that is less than . Our analysis reveals (see Proposition 1 and Remark A.2) that typically is less than the true error and so one has to run until the difference is . Similarly, the inverse power method is given by Algorithm 2 below where we use the convention .
Let , be a spectral decomposition for the matrix . A random unit vector is given by
[TABLE]
for the given random variables . With the inverse power method, at each iteration, we have
[TABLE]
For the power method we have
[TABLE]
5.1.1. The halting time
We define the halting time for the inverse power method as
[TABLE]
Provided that the smallest eigenvalue of is order , this halting condition will give the same order of approximation in as a possibly more natural condition . We choose (4) for convenience and show it is sufficient. Similarly, the halting time for the power method is
[TABLE]
Define the function
[TABLE]
Using the notation , , we have
[TABLE]
Note that
[TABLE]
Remark 5.1**.**
We focus on the inverse power method here. There is an anlogous function for the power method which can be found through the mapping . And so, if we can estabilish properties of under assumptions on that also satifies, the properties extend to .
5.2. The QR (eigenvalue) algorithm
Unlike the power and inverse power methods, the convergence criterion for the QR algorithm (without shifts) is much more subtle even though convergence is guaranteed for the matrices we consider [11]. We consider a general error control function , see below. The basis of the algorithm is the QR factorization of a non-singular matrix. We use () to denote this factorization where is unitary and is upper-triangular with positive diagonal entries. The QR factorization can be found via the modified Gram–Schmidt procedure or Householder reflections, for example. It is unique when it exists. The QR algorithm is given by the following steps:
Provided that is suitably chosen, (a subset of) the diagonal entries of will be an approximation of eigenvalues of . We develop a more analytically tractable description of the QR algorithm. For a positive-definite matrix , let denote its th power, . Define , and via
[TABLE]
For the QR algorithm we are interested in but for additional remarks we want to consider . And so, it is important to note that and are infinitely differentiable matrix-valued functions of .
It is well known that , gives the iterates of the QR algorithm, with, of course, . For the convenience of the reader, we provide the following standard proof.
Lemma 2**.**
For all , .
Proof.
Using induction, the QR algorithm is described as
[TABLE]
Then we consider the QR factorization of
[TABLE]
It then follows that by the uniqueness of the QR factorization. Therefore . ∎
Let , be a333Note that is not uniquely defined. Furthermore, if the spectrum is not simple then is not even uniquely defined modulo phases. spectral decomposition of . Then define so that . We first compute , by considering ( is the th canonical basis column vector and )
[TABLE]
And so, to determine , we sum over and use the normalization of the rows of :
[TABLE]
When it comes to the choice of the function in Algorithm 3, we first give two options that we do not analyze but are of great interest:
- •
Compute the entire spectrum: . Here is a diagonal matrix containing just the diagonal of and is the Frobenius norm.
- •
Deflation444Here refers to the submatrix containing entries in rows through and columns through .:
[TABLE]
For our purposes here we choose as
[TABLE]
This is the sum of the off-diagonal entries in the last row of . And so, if is small then is close to an eigenvalue of . Continuing,
[TABLE]
Remark 5.2**.**
It is worth emphasizing that , the interpolation of the QR iterates , is the solution of a nonlinear differential equation [5]. Furthermore, in the real symmetric case, this is generically a system in variables that is Hamiltonian and completely integrable. The eigenvalues of , constitute of the integrals of the motion, i.e. the flow is, in particular, isopectral. The equations of motion are given by
[TABLE]
where is the (strictly) lower-triangular part of . The Hamiltonian is given by . See [5] and [30] for more details.
5.2.1. The halting time
We define the halting time for the QR algorithm as
[TABLE]
Note that we do not assume here that is an integer. The “true” halting time for the QR algorithm is but it will turn out that this has the same limiting distribution as .
The first step in the analysis of the QR algorithm is to write as a sum of two positive parts, as follows. Define for
[TABLE]
Then
[TABLE]
It is clear that for all and we isolate this term:
[TABLE]
Heuristically, is quadratic in and is not. Therefore should provide the leading order behavior of as provided that the ’s are not too large. Note that by the Cauchy–Schwartz inequality, .
6. Proofs of the main theorems
In order to prove our main theorems we take the following approach. The dynamics of the QR algorithm closely mirrors that of the so-called Toda algorithm and therefore many of the results of [8] apply directly. And to prove Theorem 1 for the QR algorithm we almost exclusively simply quote results from [8]. To prove Theorem 1 for the power and inverse power methods, we discuss the calculations in more detail.
For convenience let . Then Condition 2.1 takes the form
[TABLE]
with and fixed.
6.1. Technical lemmas
We begin by modifying the technical lemmas from [8] as our formulae now depend on the ratio of eigenvalues as opposed to their differences in [8]. The main fact is that if a matrix satisfies Condition 3.3 with then so does with quantiles provided is sufficiently large. Indeed for
[TABLE]
for sufficiently large. Applying Condition 3.3(5) with replaced by , we conclude that for sufficiently large for all . Concerning Condition 3.3(3), note that for sufficiently large
[TABLE]
and one then proceeds as before. The proof of Condition 3.3(4) is similar.
Recall the notation and define for .
Lemma 3** ([8]).**
Let . Given Condition 3.3, then the cardinality of is given by
[TABLE]
for sufficiently large, where c denotes the compliment relative to .
Recalling the notation , for matrices in we have and because for the unitary matrix of eigenvectors. We also have the following result.
Lemma 4** ([8]).**
Given Condition 3.3, and fixed there exists an absolute constant such that
[TABLE]
for sufficiently large.
6.2. Main estimates for the QR algorithm
The steps of the proof are the following:
- (1)
a priori estimates on that will hold with high probability, 2. (2)
a lower bound on over a region determined in (1), 3. (3)
finding and estimating an approximation of , and 4. (4)
establishing that converges in distribution and then using (1)-(3) to show that indeed is close to .
If is sufficiently small we expect to control the convergence of the algorithm. Consider
[TABLE]
and the approximation of is given by
[TABLE]
Thus
[TABLE]
To determine how close is to we use the following relation
[TABLE]
for some between and . So, we need to show that the left-hand side of (8) is small and is not too small. This is accomplished by following Lemmas 5-9 and Proposition 2. The the proofs of these results make heavy use of Lemmas 3 and 4.
Lemma 5** ([8], Lemma 2.1).**
Given Condition 3.3, the halting time for the QR algorithm satisfies
[TABLE]
for sufficiently large .
Define the interval
[TABLE]
Lemma 6** ([8], Lemma 2.2).**
Given Condition 3.3 and
[TABLE]
for sufficiently large .
The next estimate is immediate from the definition of
Lemma 7** ([8], Lemma 2.3).**
Given Condition 3.3
[TABLE]
for sufficiently large , i.e. .
Lemma 8** ([8], Lemma 2.4).**
[TABLE]
for sufficiently large .
Lemma 9** ([8], (2.6)).**
Given Conditions 3.3, for
[TABLE]
for sufficiently large .
Proposition 2**.**
Given Conditions 3.1 and 3.3 for and fixed with sufficiently small (depending on and )
[TABLE]
for sufficiently large.
Proof.
We use (8) to estimate the difference (for some between and and apply Lemmas 5, 6, 7, 8 and 9 to find
[TABLE]
Using the assumption that the proposition follows. ∎
Thus far, no estimates had any probabalistic input. We now introduce the probabilistic considerations needed to prove our main theorem for the QR algorithm.
Theorem 5**.**
Let be an SCM. For ,
[TABLE]
Proof.
We first prove that the following three random variables converge to zero in probability:
[TABLE]
The proof for the first random variable follows [8, Lemma 3.1] and requires the specific use of Condition 3.1, the proof for the second follows [8, Lemma 3.4]. For the last, we write , where converges jointly in distribution. Let be the event where . Given , consider
[TABLE]
Then
[TABLE]
where the approximation is uniform as (given ). For , (sufficiently small) using uniform convergence
[TABLE]
Letting we establish that converges to zero in probability. Appealing to Definition 3 we finally have
[TABLE]
∎
6.3. Main estimates for the power/inverse power method
We now follow the same steps that were performed for the QR algorithm for the inverse power method. First, we establish as given in (6).
Define and use the notation . It follows that the non-negativity of is equivalent to
[TABLE]
From Jensen’s inequality for concave functions
[TABLE]
which gives
[TABLE]
The last inequality follows from another application of Jensen’s inequality (for convex functions).
Lemma 10**.**
Given Condition 3.3, the halting time for the inverse power method satisfies
[TABLE]
for sufficiently large .
Proof.
Because and are both positive we know that if on the interval then . We first estimate as follows:
[TABLE]
Then we set and use Lemma 4 to estimate the denominator
[TABLE]
To estimate the numerator we use Lemma 4 again
[TABLE]
Therefore
[TABLE]
Recall that , and assume that . We have
[TABLE]
which is larger than for sufficiently large , and we conclude
[TABLE]
For , note that
[TABLE]
Now choose (cf. Lemma 4) so that , i.e., . Note that as , such a exists. Furthermore, as , it follows from the above inequality that there exists such that
[TABLE]
for sufficiently large . Then (again for )
[TABLE]
Now for , we have , and so for , we again have for sufficiently large . This establishes the lower bound on .
To establish the upper bound on we use the absolute boundedness of (given Condition 3.3: see also Remark 3.2) for , together with
[TABLE]
If with , then . But then , and so, taking , . Hence . Hence . Thus
[TABLE]
So, for these values of , for sufficiently large . Next, we show that the same holds for . We use the estimate with and any , to obtain
[TABLE]
for and sufficiently large. Then using (given Condition 3.3), and taking , for and sufficiently large. Thus
[TABLE]
This shows that for large . ∎
Remark 6.1**.**
We take , rather than , for technical reasons, see Lemma 14 below.
Similar to the case of the QR algorithm, define
[TABLE]
Lemma 11**.**
Given Condition 3.3 and
[TABLE]
for sufficiently large.
Proof.
By direct calculation
[TABLE]
Then using (9) and and keeping only the leading term
[TABLE]
for . Define by . Then we use (10) and (11) with and
[TABLE]
for . The last inequality follows because and . From here it follows that for sufficiently large
[TABLE]
∎
Our next step is to construct an approximation of . We write
[TABLE]
Define by
[TABLE]
Lemma 12**.**
Given Condition 3.3, .
Proof.
Using Condition 3.3
[TABLE]
we find
[TABLE]
for sufficiently large , establishing the lemma. ∎
Lemma 13**.**
[TABLE]
for and sufficiently large .
Proof.
By direct calculation
[TABLE]
Since the denominator is at least unity, it is enough to estimate the numerators. As ,
[TABLE]
For , define . We estimate
[TABLE]
First,
[TABLE]
Here we used and estimated . If we set and use the inequality , then
[TABLE]
as . Second, using that and Condition 3.2 along with , (), we consider
[TABLE]
by Condition 3.3, and so . Since , we find
[TABLE]
for sufficiently large . By (12),
[TABLE]
for sufficiently large as . We find
[TABLE]
This proves the lemma. ∎
The next lemma is a restatement of (13)
Lemma 14**.**
Given Condition 3.3, for
[TABLE]
for sufficiently large .
Proposition 3**.**
Given Conditions 3.1 and 3.3 for and , fixed, with
[TABLE]
for sufficiently large.
Proof.
We use the analog of (8) to estimate the difference (for some ) and apply Lemmas 10, 11, 12, 13 and 14 to find
[TABLE]
The proposition follows. ∎
Now, we introduce probabilistic considerations as we did for the QR algorithm.
Theorem 6**.**
Let be an SCM and let be a random unit vector independent of . For ,
[TABLE]
Proof.
As was the case in the proof of Theorem 5, we show the following three random variables converge to zero in probability:
[TABLE]
We start with the first. For
[TABLE]
Provided that , on , tends to zero uniformly. Then
[TABLE]
From Theorem 3, , and letting , using Theorem 4 we find
[TABLE]
For the second random variable:
[TABLE]
Again, we write , and let be the event where . Next let be event where . It then follows for and sufficiently large
[TABLE]
Therefore
[TABLE]
And because and converge in distribution, if we let in (15) it follows that . The convergence in probability of the last random variable follows directly from the proof of Theorem 5. Using Definition 3 we have
[TABLE]
∎
Finally, we establish the analogous theorem for the power method. Following Remark 5.1, we note that is defined by sending and satisfies the same estimates as (Theorem 4 and Theorem 3). We have the following theorem.
Theorem 7**.**
Let be an SCM and let be a random unit vector independent of . For ,
[TABLE]
Appendix A Error analysis
In this section we establish that the halting times given above for the QR algorithm and the inverse power method are adaquate to acheive an order approximation of the smallest eigenvalue.
A.1. QR algorithm
The true error in the QR algorithm is
[TABLE]
Applying Lemma 4 (see also (10)) we find for , given Condition 3.3
[TABLE]
for sufficiently large . We obtain the following error estimate.
Proposition 4**.**
For , ,
[TABLE]
converges to zero in probability, while
[TABLE]
converges to in probability.
Proof.
First, given Condition 3.3, and for
[TABLE]
It then follows that on for sufficiently large and
[TABLE]
Therefore
[TABLE]
It then follows that on for sufficiently large and
[TABLE]
Therefore
[TABLE]
∎
Remark A.1**.**
Define the “true” halting time by
[TABLE]
We omit the details, but one can show that
[TABLE]
So that has the same limiting distribution as .
A.2. Inverse power method
The true error for the inverse power method is also given by
[TABLE]
Following the calculations that led to (16), given Condition 3.3, for sufficiently large and ,
[TABLE]
Here we are conservative with the factor on sot hat it mirrorw (16). An analogous formula holds for the power method. We arrive at the following propsition that is proved in the exact same way as Proposition 4.
Proposition 5**.**
For , ,
[TABLE]
converge to zero in probability, while
[TABLE]
converge to in probability.
Remark A.2**.**
Following, Remark A.1 define the “true” halting time by
[TABLE]
Again, omitting the details, one can show that
[TABLE]
So that has the same limiting distribution as . This further justifies the definition of .
Appendix B Asymptotic normality of the eigenvector projections
This section presents the estimates that are required to prove Theorem 1 for the power and inverse power methods when the initial unit vector is chosen randomly.
Theorem 8**.**
Let () or () be a unit vector555To be precise about this, fix a semi-infinite vector . Then for define . and fix . Let and be the eigenvectors of an SCM corresponding to and . Then for any bounded, continuous function
[TABLE]
where is either a standard normal () or a standard complex () random variable. That is, we have convergence in distribution to
Proof.
We present the proof for and as the other cases are completely analogous. From [3, Theorem 8.2] it follows that
[TABLE]
where is the expectation with respect to the Wishart (LOE) ensemble ( are iid standard normal random variables). And so, it is enough to prove the statement for the Wishart ensemble. In this case it is well known that the eigenvectors are distributed with Haar measure on the orthogonal group. Let be a vector of iid standard normal random variables. It follows that
[TABLE]
and is a standard normal random variable. And, so it suffices to show that
[TABLE]
Indeed, if the difference of two random variable converges to zero in probability and the first converges in distribution then so does the second (to the same distribution). Fix and consider for
[TABLE]
A consequence of the Strong Law of Large Numbers (SLLN) is that the latter term tends to zero as : The SLLN implies that a.s., hence a.s. and therefore in probability. Then letting , (17) follows. ∎
Corollary 1**.**
Theorem 8 holds when is a random unit vector, independent of the given SCM. In this case should be understood as the expectation with respect to both the distribution on and the SCM.
Proof.
We express . Let be a bounded, continuous function . Then Theorem 8 states
[TABLE]
By the bounded convergence theorem
[TABLE]
as , and the corollary follows. ∎
Proposition 6**.**
Given an SCM, let be a random666This also holds for deterministic . unit vector independent of the SCM. Define , where is the th eigenvector of the SCM. Fix and let be the set of matrices where
- •
* for all , and*
- •
* for .*
Then as , i.e. these conditions hold with high probability.
Proof.
The delocalization result [3, Theorem 2.17] states that for deteriministic unit vectors and all :
[TABLE]
This implies
[TABLE]
So, let . Stated another way,
[TABLE]
uniformly in . And so, taking an expectation with respect to the law of we find that
[TABLE]
Then for
[TABLE]
follows from Corollary 1 after applying Lemma 1.
∎
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] Z. D. Bai, B. Q. Miao, and G. M. Pan. On asymptotics of eigenvectors of large sample covariance matrix. Ann. Probab. , 35(4):1532–1572, 2007.
- 2[2] G Ben Arous and S Péché. Universality of local eigenvalue statistics for some sample covariance matrices. Commun. Pure Appl. Math. , 58(10):1316–1357, oct 2005.
- 3[3] Alex Bloemendal, Antti Knowles, Horng-Tzer Yau, and Jun Yin. On the principal components of sample covariance matrices. Probab. Theory Relat. Fields , 164(1-2):459–552, feb 2016.
- 4[4] K H Borgwardt. The simplex method: A probabilistic analysis . Springer–Verlag, Berlin, Heidelberg, 1987.
- 5[5] P Deift, T Nanda, and C Tomei. Ordinary differential equations and the symmetric eigenvalue problem. SIAM J. Numer. Anal. , 20:1–22, 1983.
- 6[6] P A Deift, G Menon, S Olver, and T Trogdon. Universality in numerical computations with random data. Proc. Natl. Acad. Sci. U. S. A. , 111(42):14973–8, oct 2014.
- 7[7] P A Deift, G Menon, and T Trogdon. On the condition number of the critically-scaled Laguerre Unitary Ensemble. Discret. Contin. Dyn. Syst. , 36(8):4287–4347, mar 2016.
- 8[8] Percy Deift and Thomas Trogdon. Universality for the Toda algorithm to compute the eigenvalues of a random matrix. ar Xiv Prepr. ar Xiv 1604.07384 , apr 2016.
