The conjugate gradient algorithm on well-conditioned Wishart matrices is almost deterministic
Percy Deift, Thomas Trogdon

TL;DR
This paper demonstrates that for large well-conditioned Wishart matrices, the conjugate gradient algorithm's iteration count becomes nearly deterministic, with error and residual norms converging rapidly in probability and almost surely.
Contribution
It establishes that the iteration count for conjugate gradient on large Wishart matrices is almost deterministic, providing explicit convergence results.
Findings
Iteration count concentrates around a deterministic value
Error and residual norms decay exponentially fast
Convergence occurs in probability, mean, and almost surely
Abstract
We prove that the number of iterations required to solve a random positive definite linear system with the conjugate gradient algorithm is almost deterministic for large matrices. We treat the case of Wishart matrices where is and for . Precisely, we prove that for most choices of error tolerance, as the matrix increases in size, the probability that the iteration count deviates from an explicit deterministic value tends to zero. In addition, for a fixed iteration count, we show that the norm of the error vector and the norm of the residual converge exponentially fast in probability, converge in mean and converge almost surely.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
The conjugate gradient algorithm on well-conditioned Wishart matrices is almost deterministic
Percy Deift
New York University
Courant Institute of Mathematical Sciences
251 Mercer St.
New York, NY 10012
and
Thomas Trogdon
University of Washington
Department of Applied Mathematics
Seattle, WA 98195-3925
Abstract.
We prove that the number of iterations required to solve a random positive definite linear system with the conjugate gradient algorithm is almost deterministic for large matrices. We treat the case of Wishart matrices where is and for . Precisely, we prove that for most choices of error tolerance, as the matrix increases in size, the probability that the iteration count deviates from an explicit deterministic value tends to zero. In addition, for a fixed iteration count, we show that the norm of the error vector and the norm of the residual converge exponentially fast in probability, converge in mean and converge almost surely.
2010 Mathematics Subject Classification:
Primary: 65F10, 60B20
We are grateful for discussions with Elliot Paquette, Joel Tropp and Roman Vershynin that have greatly improved the paper. This work was supported in part by NSF DMS-1300965 (PD) and NSF DMS-1753185, DMS-1945652 (TT)
1. Introduction
The conjugate gradient algorithm (CGA) [HS52] is arguably the most effective iterative method from numerical linear algebra. In exact arithmetic, the algorithm requires at most iterations to solve a positive-definite linear system and it often requires many less iterations to compute a good approximate solution. It is exceedingly simple to implement and there are well-known error bounds available. And, despite the fact that the CGA is sensitive to round-off errors these error bounds still effectively hold for floating point arithmetic [Gre89]. While we present the algorithm in full below (see Algorithm 2.1), the variational characterization of the method is summarized as follows: Consider the linear system , . Given an initial guess , find the unique vector that satisfies
[TABLE]
At each step of the iteration one can easily construct and the algorithm itself computes , . One has to then determine a computable stopping criterion, and typically, the algorithm is halted when , , for a chosen error tolerance
Here we focus on two main measures of the error, :
[TABLE]
in the particular case when . The associated halting times are
[TABLE]
We emphasize the importance of analyzing both quantities because is what is observed throughout the iteration and, of course, is the true error.
Our results (Theorems 3.1, 3.2, 3.3) are derived for both real and complex Gaussian matrices 111We can also easily extend the results to the case of quarternion entries.. We assume
[TABLE]
where is an matrix whose entries are iid real or complex standard normal random variables. This is the real or complex Wishart distribution. Suppose further that for (Note that if , i.e. , then is singular and does not have a unique solution). Then if b is a random unit vector, independent of , our results show that as
[TABLE]
If then as
[TABLE]
Furthermore, there are discrete sets and with the property that if is in the complement of these sets, fixed, then
[TABLE]
Therefore, the halting time becomes effectively deterministic. We also present estimates that demonstrate that the probability that the errors deviate from their means decays exponentially with respect . In the case , a consequence of our results is that for any fixed and ,
[TABLE]
Remark 1.1*.*
It is important to point out that in (1.2) is not necessarily a near-identity matrix. Indeed as , the eigenvalues of typically lie in the interval
[TABLE]
and have an asymptotic density given by the famous Marchenko–Pastur law, see Definition 2.1. For finite , some of the eigenvalues of lie outside this interval, and the control of these eigenvalues plays a crucial role in the proofs of Theorems 3.1, 3.2 and 3.3 (see, for example, the proofs of (4.5) and (4.9)).
Our proofs make critical use of the invariance of the Wishart distribution and the relation between Householder bidiagonalization and the Lanczos iteration. This allows one to use classical estimates on chi-distributed random variables in a crucial way. The specific tools and results we incorporate from random matrix theory include global eigenvalue estimates [DS01], the convergence of the empirical spectral measure [BMP07] and the central limit theorem for linear statistics [LP09].
The remainder of the paper is setup as follows. In Section 1.1 we compare our analysis with facts already known about the conjugate gradient algorithm. We also demonstrate our results with numerical examples. In Section 2 we introduce our random matrix ensembles, the basic definitions from random matrix theory and review the Householder bidiagonalization procedure applied to these ensembles. We also review the connections between the conjugate gradient algorithm, the Lanczos iteration and the Householder bidiagonalization procedure. In Section 3 we present our main theorems. In Section 4 we introduce the results from probability and random matrix theory that are required to prove our theorems. In Section 5 we give the proofs of the theorems.
1.1. Comparison and demonstration
We now give a demonstration and discussion of the results. In what follows denotes the sample average of a random variable using samples. We will refer to the matrix
[TABLE]
where is an matrix, having iid entries, with equal probability, as the Bernoulli ensemble (BE).
1.1.1. A numerical demonstration
To demonstrate our main results, in Figure 1 we plot the following quantities as a function of for different values of
[TABLE]
with error bars that indicate where % of the samples lie. In Figure 2 we plot the same statistics for
[TABLE]
Both Figure 1 and Figure 2 demonstrate the concentration of the errors about their means. We demonstrate the limiting behavior of the halting times in Figure 3.
In all of these figures we have included computations for distributions of random matrices, in particular the Bernoulli ensemble, that are beyond the class for which our results apply. Nonetheless, it is clear that the behavior persists. This universality will be investigated in future work.
1.1.2. Relation to previous work
The classical error estimate for the CGA is [HS52, Gre89]
[TABLE]
where is the condition number of . Here are the eigenvalues of . It is a classical result in random matrix theory [BY93] that the condition number of (1.2) converges almost surely to . Roughly, one then obtains
[TABLE]
which is often just simplified to
[TABLE]
This overestimates the actual error by just a factor of 2.
In [MT16], the authors used (1.3) and tail bounds on the condition number to estimate the halting times (1.1) in the case . A key observation was that the actual number of iterations appears to be of the same asymptotic order as the estimate obtained using (1.1). This is something that will indeed be true if the error estimate used decays exponentially and turns out to be an overestimate by a constant factor.
Remark 1.2*.*
Of particular interest is this case where depends on and as . For example, was seen in [DMT16, DMOT14] to produce universal fluctuations for the halting times. Similarly, one would want to treat the case as
Remark 1.3*.*
Our calculations in this work apply only to matrices with Gaussian entries. An important question, one of universality, is if our results hold when this assumption is relaxed. Indeed, one expects this to be true by the computations in Figures 1, 2 and 5 and the wealth of theoretical universality results from random matrix theory [PY14, BKYY16, BMP07].
2. The bidiagonalization of Wishart matrices and invariance
Definition 2.1**.**
For set . Let be an matrix of iid standard normal random variables () or where and are independent copies of an matrix of iid standard normal random variables (). Then
[TABLE]
has the -Wishart distribution. The associated empirical spectral measure (ESM) is given by
[TABLE]
where are the eigenvalues of . Define the averaged EMS (or density of states) by
[TABLE]
for every222 denotes bounded continuous functions on . .
Definition 2.2**.**
Marchenko–Pastur law on is given by the density
[TABLE]
The relation of the Marchenko–Pastur law to the eigenvalues of a Wishart matrix is given in the following section. But we demonstrate this relationship in Figure 4.
The Wishart distribution is invariant under orthogonal () or unitary () conjugation. Using , this means that if is a random unitary matrix that is independent of then
[TABLE]
For the Householder bidiagionalization procedure [TBI97] operates on on the left and the right with Householder reflections , and Householder reflections , so that
[TABLE]
where all entries are non-negative. Because of invariance, are independent -distributed random variables, see [DE02] and the references therein. Specifically,
[TABLE]
where all entries are independent. Define the infinite matrix by the entry-wise limit
[TABLE]
Therefore
[TABLE]
Lastly, define to be the upper-left submatrix of .
2.1. Householder bidiagonalization, the Lanczos iteration and the CG algorithm
The conjugate gradient algorithm (CGA) for the iterative solution of
[TABLE]
is given by
Algorithm 1: Conjugate Gradient Algorithm
(1)
is the initial guess.
(2)
Set , .
(3)
For
(a)
Compute .
(b)
Set .
(c)
Set .
(d)
Compute .
(e)
Set .
The error at step is given by . Define the norm . A variational characterization of the CGA is that
[TABLE]
where \mathbb{P}_{k}^{(0)}=\{p:p~{}\text{ is a polynomial of degree k},~{}p(0)=1\}. The unique minimizer in can be described by the Lanczos algorithm. The Lanczos algorithm is a tridiagonalization algorithm given by
Algorithm 2: Lanczos Iteration
(1)
is the initial vector. Suppose
(2)
Set
(3)
For
(a)
Compute .
(b)
Set .
(c)
Compute and if , set .
The Lanczos algorithm produces a tridiagonal matrix
[TABLE]
and for some unitary matrix . We use to denote the upper-left subblock of . Then, it is well-known that (see [Gre89], for example)
[TABLE]
where as above. Then, write and for , so that ,
[TABLE]
Now, consider the special case , so that . We further analyze the relation with . The Lanczos algorithm gives the matrix representation of in the orthonormal basis found by applying the Gram–Schmidt procedure to the sequence
[TABLE]
So, the first vector is , and so the first column of is . The main consequence of this is that that first components of the eigenvectors333This is true modulo permutations and normalizations. of are the same as those of .
Basic assumption: Henceforth, throughout the paper we will assume .
Finally, we make a simple observation that the Householder bidiagonalization procedure [TBI97] applied to where (or Householder tridiagonalization applied to ) leaves the eigenvalues of unchanged and also leaves the first components of the eigenvectors unchanged. So, provided that Lanczos completes ( for ), the Householder bidiagonalization must produce . This is indeed true because a Jacobi matrix is uniquely defined by eigenvalues and first-components of eigenvectors (see, e.g. [DLT85]).
3. Main results
The proof of our main theorem is given in Section 5. The convention used in this paper is that and are fixed constants. The symbols with an assortment of subscripts will be used to denote constants and their (possible) dependencies. We suppress any dependence of these constants on but include dependence on , with a view to forthcoming work where we will allow to vary.
Theorem 3.1**.**
Assume the conjugate gradient algorithm is applied to solve where is a (possibly) random vector, independent of and . Let , be the associated error vectors.
- (1)
For any fixed and there exists a constant such that
[TABLE] 2. (2)
Furthermore
[TABLE]
*for some constant and a non-decreasing function that satisfies for . * 3. (3)
Lastly, if and , (1) and (2) hold.
Theorem 3.2**.**
In the setting of Theorem 3.1, for and , define444If , .
[TABLE]
Then as , . Furthermore,
[TABLE]
For , (3.1) holds if , and for
[TABLE]
Corollary 3.2.1**.**
In the setting of Theorem 3.1, for and and
[TABLE]
If these relations hold for .
Proof.
The first claim follows from the Borel–Cantelli Lemma. The second follows from the observation
[TABLE]
which gives
[TABLE]
∎
The last of our main results is almost just a corollary of the above theorems and it concerns halting times (i.e. runtimes or iteration counts, recall (1.1)):
[TABLE]
Since converges almost surely to and converges almost surely to we produce the candidate limit halting times
[TABLE]
Theorem 3.3**.**
In the setting of Theorem 3.1, , for suppose that for , , then555Note that is just a statement that for and for , and so , .
[TABLE]
If , i.e. , for then
[TABLE]
Proof.
Assume for . Then as as . Note that if then
[TABLE]
Then as
[TABLE]
We estimate these two terms individually. First, let be the event where is weakly decreasing for . Then
[TABLE]
For sufficiently large , by Lemma 4.8,
[TABLE]
and for such a value of
[TABLE]
by Theorem 3.1(2). It remains to show that . For this is immediate because . For , consider the event . Then and
[TABLE]
And this tends to zero by Theorem 3.1(2). The estimate for is analogous, but now we do not need monotonicity for and
[TABLE]
as . When , we use similar arguments to show that
[TABLE]
As and are integers, (3.2) follows. ∎
Remark 3.4*.*
We conjecture that if , i.e , for then
[TABLE]
Indeed Figure 5 indicates this is true because
[TABLE]
appears to be asymptotically normal with a variance that decays like . We note that this is related to, but not a consequence of, the central limit theorem for linear spectral statistics (CLT for LSS). For the CLT for LSS the variance decays as . In the case at hand, the fluctuations that occur in the random weights (see in (5.1)) and the fluctuations that occur in the random polynomial contribute to the variance on the order of . This conjecture will be resolved in a forthcoming publication.
4. Technical results from random matrix theory
Lemma 4.1**.**
Let be a chi distributed random variable with degrees of freedom. Then for any fixed integers and there exists such that
[TABLE]
Furthermore, for
[TABLE]
For ,
[TABLE]
Proof.
Because has a density given by
[TABLE]
we are led to analyze
[TABLE]
The result then follows by the change of variables and applying the method of steepest descent (Laplace’s method, see e.g., [AF03, Lemma 6.2.3]) for integrals along with Stirling’s approximation. Indeed, suppose is smooth and satisfies the bound for . Then for we must estimate
[TABLE]
Note that has a global minimum of zero at on . For , write
[TABLE]
which decays exponentially to zero as . The same calculation on gives
[TABLE]
which, again, decays exponentially because . Laplace’s method gives
[TABLE]
As remarked, Stirling’s approximation then gives the first inequality in the lemma. We note that
[TABLE]
For the second inequality then uses that (4.1)
[TABLE]
The last follows from (4.1) and a similar calculation
[TABLE]
∎
Definition 4.2**.**
A mean-zero random variable is called sub-exponential with parameters if for .
It then follows that centered chi variables are sub-exponential. If is sub-exponential, then clearly for some .
A good reference for the next classical result is [Ver18, Section 2.8].
Theorem 4.3** (Bernstein’s inequality for sub-exponential random variables).**
Let be a sequence of independent mean zero random variables and define
[TABLE]
Then for and any real numbers
[TABLE]
and . Here is some absolute constant independent of .
The estimate
[TABLE]
can be found by estimating the density for a random variable or by applying Bernstein’s inequality (Recall that a sum of independent chi-square variables , is again a chi-square variable with ). We will use three elementary facts that are encapsulated in the following lemma.
Lemma 4.4**.**
Let be random variables and assume . The following inequalities hold
- (1)
* where denotes the positive part and ,* 2. (2)
, and 3. (3)
.
Lemma 4.5**.**
Suppose . Let be independent chi-distributed random variables with degrees of freedom. Define weights
[TABLE]
Then the Kolmogorov–Smirnov distance
[TABLE]
of
[TABLE]
satisfies
[TABLE]
and the tail estimate
[TABLE]
for absolute constants .
Note that .
Proof.
First, it follows that
[TABLE]
So, we are led to analyze the sums
[TABLE]
As has expected value zero, Bernstein’s inequality gives for
[TABLE]
for absolute constants . From the moment generating function for a chi-square distribution, we have, as is a chi-square random variable with degrees of freedom,
[TABLE]
This minimum occurs at , giving
[TABLE]
Then we write for
[TABLE]
so that
[TABLE]
Now set , to find
[TABLE]
Then it is easy to see that for
[TABLE]
giving the estimate
[TABLE]
We then write and so that
[TABLE]
and then we apply each property of Lemma 4.4, in order, to obtain
[TABLE]
for . The tail estimate (4.3) follows by a union bound. We examine more closely, and get a crude bound
[TABLE]
While we do not specifically need the value, it follows that for a -squared random variable with degrees of freedom gives . In summary, we obtain
[TABLE]
Then, we can estimate moments
[TABLE]
where denotes the Gamma function. As for all , it follows that for some .
Thus for some absolute constant . By Jensen’s inequality
[TABLE]
and choosing we obtain
[TABLE]
Thus , for some new constant . Hence converges to zero in , in probability and almost surely666Because is always less than or equal to unity, almost sure convergence gives convergence, but we have obtained a rate.. ∎
Theorem 4.6** (Global eigenvalue bounds, see, e.g. [DS01]).**
*For the eigenvalues of a -Wishart distribution *
[TABLE]
for an absolute constant .
This immediately implies that for any interval such that there exists a constant such that
[TABLE]
And it also implies the bound on the distribution function for . Define so
[TABLE]
where . And the important conclusion from this is that
[TABLE]
where is independent of . We give an analogue of (4.5) for with replaced with in (4.11).
Lemma 4.7**.**
The marginal density of the smallest eigenvalue of satisfies
[TABLE]
Moreover, if ,
[TABLE]
where
[TABLE]
Proof.
We follow [ES05]. Define the multivariate Gamma function
[TABLE]
Then the joint density of the eigenvalues of is
[TABLE]
where
[TABLE]
Then
[TABLE]
Then we have
[TABLE]
Define the modified multivariate Gamma function
[TABLE]
and we have
[TABLE]
Then we need to simplify
[TABLE]
This all gives
[TABLE]
Then to estimate, we use that as for fixed to write
[TABLE]
where and . This is just the reciprocal of the Beta function with asymptotics
[TABLE]
as , found using Stirling’s formula. Since where we can write , where . Therefore
[TABLE]
Since is positive and bounded by unity, as
[TABLE]
This gives
[TABLE]
∎
Lemma 4.8**.**
For fixed ,
[TABLE]
as . If these estimates hold for .
Proof.
The case of is classical and implies weak convergence of the ESM to the Marchenko–Pastur law, see [BS10, Section 3.1], for example. For negative powers more work is required. Recall the definition (2.2)
[TABLE]
for all . We extend this definition to , . Introduce a continuous truncation of :
[TABLE]
The monotone convergence theorem gives
[TABLE]
The last term is finite for fixed , provided is sufficiently large by Lemma 4.7.
For the sake of notation, set . Then, consider
[TABLE]
We estimate each of these terms separately. First, we use that
[TABLE]
and show that this tends to zero exponentially. But to establish the last inequality introduce
[TABLE]
The dominated convergence theorem then provides
[TABLE]
And each term in the last sum is bounded by setting .
To estimate the expectation (4.9), we use estimates on the marginal density for . Specifically, (4.4) implies that for any
[TABLE]
for some constants that do not depend on . And so, for this term we are left estimating
[TABLE]
Then we use (4.6), introducing some constants to estimate
[TABLE]
for some . Indeed, this converges to zero super exponentially. From (4.10) we obtain
[TABLE]
and we may assume these constants are the same. Therefore . To estimate we write
[TABLE]
Then the Kolmogorov–Smirnov distance is given by [BS10, Theorem 8.10]
[TABLE]
And therefore . Finally, to the variance estimate (4.8) for . From [LP09, (4.16) and Remark 4.1]
[TABLE]
We then have
[TABLE]
Then
[TABLE]
The last term here tends to zero at a exponential rate. And because is a non-negative, monotonic function it suffices to estimate
[TABLE]
which vanishes again, at an exponential rate. This establishes (4.8) with in place of of . Then (4.8) follows from (4.7) once one notes that
[TABLE]
∎
The proof of Lemma 4.8 implies the following corollary, which complements the inequality (4.5).
Corollary 4.8.1**.**
For each there exists , independent of , such that
[TABLE]
The final results that we need from random matrix theory come from [GZ00, Corollary 1.8]
Theorem 4.9**.**
For any Lipschitz function
[TABLE]
for some constants .
Corollary 4.9.1**.**
Let be a continuous function on , Lipschitz in a neighborhood of , with at most polynomial growth at [math] and at , then
[TABLE]
for some constants .
Proof.
Let , such that is Lipschitz on . Then define
[TABLE]
And estimate
[TABLE]
By Theorem 4.9
[TABLE]
Using Theorem 4.6
[TABLE]
By assumption, there exists such that
[TABLE]
Following the proof of Lemma 4.8, we find that
[TABLE]
Similarly, using Theorem 4.6 (with the same constants, for convenience)
[TABLE]
So, the last quantity to estimate is
[TABLE]
Then this probability is bounded above by and
[TABLE]
Suppose . Then . Now because as for . Hence . However, if , then clearly so for all . The corollary follows by applying Lemma 4.4(2) twice. ∎
Remark 4.10*.*
One might expect Corollary 4.9.1 to hold for all . For this to indeed be true, needs to be globally Lipschitz (i.e., Lipschitz on every compact subset of ) and the dependence of and in Theorem 4.9 on needs to be known.
5. Proofs of the main theorems
Proof of Theorem 3.1.
We first use invariance. It follows that the errors , realized in the CGA are invariant under unitary transformations, i.e. for
[TABLE]
for any unitary matrix . This follows because if a polynomial of degree then (recall )
[TABLE]
And so, the mimimum over must be same in both cases. So, by invarince of it suffices to solve
[TABLE]
We then recall formula (2.6) for with
[TABLE]
where
[TABLE]
Here the distribution of is parameterized by (see Appendix A)
[TABLE]
where is a vector of iid -squared random variables. The variable is the square of the first components of the eigenvectors of . It is well-known that the eigenvalues and eigenvectors of are independent. But, is dependent on both the eigenvalues and eigenvectors.
Lemma 5.1**.**
For ,
[TABLE]
Proof.
We begin with a simple observation
[TABLE]
By Lemma 4.1 it follows that for , , where the bound is independent of . Similarly
[TABLE]
regardless of if is positive or negative, see (4.5) and (4.11). Therefore, it suffices to show that
[TABLE]
Because of (2.5) we have
[TABLE]
where these chi-distributed random variables are independent. Then repeated use of the identity
[TABLE]
gives
[TABLE]
The first term tends to zero in any norm by Lemma 4.1 at a rate , and the second term is bounded uniformly in any norm. ∎
Next, we argue that while the measure is still random, we can replace the integrand with a deterministic one.
Lemma 5.2**.**
For ,
[TABLE]
Proof.
Write . Using the notation of (5.2), it suffices to show that in at a rate . Consider the product
[TABLE]
where and where the chi random variables are independent. Using (5.3) we write
[TABLE]
where . Then
[TABLE]
Using (5.3) and Lemma 4.1 it follows that . Then for
[TABLE]
where and . This follows from again using Lemma 4.1, which implies that all norms of are bounded as . Then, one notes that the norm of can be bounded by a sum of terms of the form (5.4). This establishes this lemma.
∎
Lemma 5.3**.**
For ,
[TABLE]
Proof.
Write and integrate by parts
[TABLE]
where . Therefore
[TABLE]
Therefore
[TABLE]
by the independence of eigenvalues and eigenvectors ( is independent of the eigenvectors). Then, we just note that there exists power such that
[TABLE]
and therefore
[TABLE]
is bounded uniformly in by (4.5) and (4.11). The lemma follows from Lemma 4.5. ∎
These three lemmas combined with Lemma 4.8 establishes the Theorem 3.1(1).
For the second part, we again establish a series of lemmas.
Lemma 5.4**.**
For
[TABLE]
for a non-decreasing function that satisfies for , and for some constant .
Proof.
For , let be the event on which for and . Then
[TABLE]
where . This follows from (4.4). Now, we make two elementary observations about
[TABLE]
Recall (2.3) and it follows that is a Lipschitz function of the entries of in any closed -neighborhood of in the max norm777The max norm gives the maximum entry, in modulus. on lower-triangular matrices. Let be the Lipschitz constant. The second observation is to let be the event where
[TABLE]
By Lemma 4.1, for , for some constants . Therefore
[TABLE]
The lemma follows. ∎
Lemma 5.5**.**
For
[TABLE]
Proof.
Recalling the notation of the proof of the previous lemma, we then define the event
[TABLE]
Using the notation of (4.3)
[TABLE]
Then
[TABLE]
and then we find for a constant
[TABLE]
Therefore
[TABLE]
and this establishes the lemma. ∎
Applying Corollary 4.9.1 establish along with these two lemmas establishes Theorem 3.1(2).
For the case of and no inverse powers of will be encountered in any integral. So, the fact that Lemma 4.8 applies only for is not an issue. Theorem 4.6 holds for , Theorem 4.9 indeed holds for and Corollary 4.9.1 holds for provided the function is Lipschitz continuous at . And Lemmas 5.1, 5.2, 5.3, 5.4 and 5.5 hold for provided . ∎
Proof of Theorem 3.2.
To evaluate
[TABLE]
we make a simple change of variable so that
[TABLE]
Then examine
[TABLE]
Next, we define and compute
[TABLE]
Note that (5.7) is the recurrence relation for the Chebyshev polynomials and of the first and second kinds. We need some elementary properties of and (see, e.g. [MH03]):
[TABLE]
where the ′ denotes that the term is halved. From the last equality it follows by differentiation that
[TABLE]
Recalling (5.6) with we have
[TABLE]
Matching initial conditions for at and we obtain
[TABLE]
Therefore
[TABLE]
Continuing,
[TABLE]
and this gives
[TABLE]
For , we use for and to find
[TABLE]
Then
[TABLE]
And this gives
[TABLE]
For we find
[TABLE]
and this gives
[TABLE]
Lastly, one can use the bound to see that as provided .
∎
Appendix A The eigenvalues and eigenvectors of Wishart matrices
Let , . It is an important fact that the joint distribution of the vector
[TABLE]
can be parameterized by
[TABLE]
where is a vector of iid random variables. For the convenience of the reader we now derive (A.1).
Definition A.1**.**
(resp., ) denotes the group of orthogonal (resp., unitary) matrices.
We recall some general facts about Haar measure (see, e.g., [Fol99]).
Theorem A.2**.**
Let be a locally compact Hausdorff topological group. Then has a left invariant measure (i.e., a left Haar measure) and an right invariant measure (i.e., a right Haar measure) on the -algebra generated by all open subsets of . The measures are unique up to a positive multiplicative constant.
For a Borel set , let be the set of inverses of . Define
[TABLE]
Then it is easy to see that is a right Haar measure. Thus by uniqueness,
[TABLE]
for some . Now, on the other hand, the left translate of a right invariant measure is still right invariant. Thus, for all , by uniqueness,
[TABLE]
for some positive scaling factor — the modular function. is a continuous group homomorphism into the multiplicative group of positive numbers. A group is called unimodular if . Clearly, it follows from (A.3) that is unimodular if and only if Haar measure is both left and right invariant, i.e., , . There are many examples of unimodular groups: most importantly for us, compact groups, (such as or ) are unimodular.
If is unimodular, it follows from (A.2) that
[TABLE]
Setting in (A.4) we find . Hence
[TABLE]
In particular for , we see that
[TABLE]
and for
[TABLE]
We proceed to show that (A.1) holds if is distributed according to Haar measure on or . In order to construct Haar measure on (or resp.) let be an matrix of iid real (or complex, resp.) standard normal random variables. In such a setting we say that belongs to the real (or complex, resp.) Ginibre ensemble. Then the QR decomposition of gives
[TABLE]
The decomposition is unique if is non-singular. Now the Ginibre ensemble is clearly invariant under multiplication on the left by a matrix (or , resp.),
[TABLE]
So, the pair gives the QR decomposition of . Thus . Hence induces Haar measure on . But
[TABLE]
implying
[TABLE]
By (A.7) it follows that
[TABLE]
Remark A.3*.*
From (A.9) we see that the components of the “first” eigenvector are proportional to independent variables. But there is no “first” eigenvector: this should be the distribution for the components of any one eigenvector! This follows by just reordering the eigenvalues.
To establish (A.1), we now show that the eigenvectors of are Haar distributed on when the entries of are iid standard complex normal random variables. Here is , . Similarly, the eigenvectors of , is , , are Haar distributed on , where the entries of are iid standard (real) normal random variables.
We follow the argument in [For10].
Step 1
We consider the complex case, . The case is similar. Apply the QR decomposition to to obtain where is and is , , is upper triangular, .
We use the following notation: if
[TABLE]
is the matrix of differentials of , then denotes the wedge product of independent entries of ; e.g. if is real symmetric then
[TABLE]
Then one finds (see [For10, Proposition 3.2.5] that
[TABLE]
Step 2
From , we find (see [For10, Proposition 3.2.6])
[TABLE]
Step 3
Substituting (A.11) into (A.10) we find
[TABLE]
and so
[TABLE]
Integrating out the independent variables we finally arrive at the pdf for
[TABLE]
for some normalization constant .
Step 4
Recomputation in the case we find the general formula for the pdf of for , and some constant
[TABLE]
Step 5
. Now we use the standard computation for when it is Hermitian (): For the spectral decomposition () , we find (see [For10, (1.11)])
[TABLE]
and for ,
[TABLE]
Inserting (A.15), (A.14) into (A.13) we find the pdf of
[TABLE]
where for and is the Vandermonde for .
Finally we see that the singular values of and the singular vectors for are independent. As is left (and hence, right) invariant, we see that is Haar measure and hence the eigenvectors of are Haar distributed. Therefore (A.1) follows.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[AF 03] M J Ablowitz and A S Fokas, Complex Varibles: Introduction and Applications , second ed., Cambridge University Press, 2003.
- 2[BKYY 16] A Bloemendal, A Knowles, H-T Yau, and J Yin, On the principal components of sample covariance matrices , Probability Theory and Related Fields 164 (2016), no. 1-2, 459–552.
- 3[BMP 07] Z. D. Bai, B. Q. Miao, and G. M. Pan, On asymptotics of eigenvectors of large sample covariance matrix , Annals of Probability 35 (2007), no. 4, 1532–1572.
- 4[BS 10] Z Bai and J W Silverstein, Spectral Analysis of Large Dimensional Random Matrices , Springer Series in Statistics, Springer New York, New York, NY, 2010.
- 5[BY 93] Zhidong Bai and Y.Q. Yin, Limit of the Smallest Eigenvalue of a Large Dimensional Sample Covariance Matrix , Annals of Probability 21 (1993), no. 3, 1275–1294.
- 6[DE 02] I Dumitriu and A Edelman, Matrix models for beta ensembles , Journal of Mathematical Physics 43 (2002), no. 11, 5830.
- 7[DLT 85] P Deift, L C Li, and C Tomei, Toda flows with infinitely many variables , Journal of Functional Analysis 64 (1985), no. 3, 358–402.
- 8[DMOT 14] P A Deift, G Menon, S Olver, and T Trogdon, Universality in numerical computations with random data , Proceedings of the National Academy of Sciences of the United States of America 111 (2014), no. 42, 14973–8.
