Universality for eigenvalue algorithms on sample covariance matrices

Percy Deift; Thomas Trogdon

arXiv:1701.01896 · math.NA · January 10, 2017 · SIAM J. Numer. Anal.

TL;DR

This paper establishes a universal statistical behavior for the iteration count of eigenvalue algorithms applied to random sample covariance matrices, providing complexity estimates that hold with high probability.

Contribution

It proves a universal limit theorem for the halting time of key eigenvalue algorithms on sample covariance matrices, linking algorithm complexity to random matrix theory results.

Findings

01. Universal limit theorem for halting time of eigenvalue algorithms

02. High-probability complexity estimates for random covariance matrices

03. Application of eigenvalue and eigenvector statistics to algorithm analysis

Abstract

We prove a universal limit theorem for the halting time, or iteration count, of the power/inverse power methods and the QR eigenvalue algorithm. Specifically, we analyze the required number of iterations to compute extreme eigenvalues of random, positive-definite sample covariance matrices to within a prescribed tolerance. The universality theorem provides a complexity estimate for the algorithms which, in this random setting, holds with high probability. The method of proof relies on recent results on the statistics of the eigenvalues and eigenvectors of random sample covariance matrices (i.e., delocalization, rigidity and edge universality).

Equations

$$T^{(k)}(H):=\min\{n:\|X_{n}^{(k)}\|_{\mathrm{F}}\leq\epsilon\},\quad\epsilon>0.$$

$$\bar{T}_{\mathrm{def}}(H_{\mathrm{G}})=\frac{T_{\mathrm{def}}(H_{\mathrm{G}})-\langle T_{\mathrm{def}}(H_{\mathrm{G}})\rangle}{\sigma_{\mathrm{G}}},\qquad\bar{T}_{\mathrm{def}}(H_{\mathrm{B}})=\frac{T_{\mathrm{def}}(H_{\mathrm{B}})-\langle T_{\mathrm{def}}(H_{\mathrm{B}})\rangle}{\sigma_{\mathrm{B}}}.$$

$$\frac{\alpha}{2}:=\log\epsilon^{-1}/\log N\geq 5/3+\sigma/2,$$

$$\tau_{\mathrm{QR},\epsilon}(H),\qquad\tau_{\mathrm{P},\epsilon}(H,v),\qquad\tau_{\mathrm{IP},\epsilon}(H,v)$$

$$F_{\beta}^{\mathrm{gap}}(t)=\lim_{N\to\infty}\mathbb{P}\left(\frac{\tau_{\mathrm{IP},\epsilon}(H,v)}{2^{-7/6}\lambda_{-}^{1/3}d^{-1/2}N^{2/3}(\log\epsilon^{-1}-\frac{2}{3}\log N)}\leq t\right)=\lim_{N\to\infty}\mathbb{P}\left(\frac{\tau_{\mathrm{P},\epsilon}(H,v)}{2^{-7/6}\lambda_{+}^{1/3}d^{-1/2}N^{2/3}(\log\epsilon^{-1}-\frac{2}{3}\log N)}\leq t\right).$$

$$\epsilon^{-1}|[X_{\tau_{\mathrm{QR},\epsilon}}]_{NN}-\lambda_{1}|,\quad\epsilon^{-1}|\lambda_{\mathrm{IP},\tau_{\mathrm{IP},\epsilon}}-\lambda_{1}|,\quad\text{and}\quad\epsilon^{-1}|\lambda_{\mathrm{P},\tau_{\mathrm{P},\epsilon}}-\lambda_{N}|$$

$$\epsilon^{-2}|[X_{\tau_{\mathrm{QR},\epsilon}}]_{NN}-\lambda_{1}|,\quad\epsilon^{-2}|\lambda_{\mathrm{IP},\tau_{\mathrm{IP},\epsilon}}-\lambda_{1}|,\quad\text{and}\quad\epsilon^{-2}|\lambda_{\mathrm{P},\tau_{\mathrm{P},\epsilon}}-\lambda_{N}|$$

$$\mathbb{E}V_{ij}=0,\quad\mathbb{E}|V_{ij}|^{2}=1,$$

$$\mathbb{P}(|V_{ij}|>x)\leq\nu^{-1}\exp(-x^{\nu}),\quad x>1.$$

$$\mathbb{E}V_{ij}^{2}=0,$$

$$\mu_{N}(z)=\mathbb{E}\frac{1}{N}\sum_{i=1}^{N}\delta(\lambda_{i}-z),$$

$$\rho_{d}(x):=\frac{1}{2\pi d}\sqrt{\frac{[(\lambda_{+}-x)(x-\lambda_{-})]_{+}}{x^{2}}},\qquad\lambda_{\pm}=(1\pm\sqrt{d})^{2},$$

$$\frac{n}{N}=\int_{-\infty}^{t}\rho_{d}(x)\,\mathrm{d}x,\quad n=1,2,\dots,N.$$

$$N^{2/3}\lambda_{+}^{-2/3}d^{1/2}(\lambda_{+}-\lambda_{N},\lambda_{+}-\lambda_{N-1},\lambda_{+}-\lambda_{N-2})$$

$$N^{2/3}\lambda_{-}^{-2/3}d^{1/2}(\lambda_{1}-\lambda_{-},\lambda_{2}-\lambda_{-},\lambda_{3}-\lambda_{-})$$

$$F_{\beta}^{\mathrm{gap}}(t)=\mathbb{P}\left(\frac{1}{\Lambda_{2,\beta}-\Lambda_{1,\beta}}\leq t\right)=\lim_{N\to\infty}\mathbb{P}\left(\frac{1}{2^{-7/6}N^{2/3}\lambda_{-}^{-2/3}d^{-1/2}(\lambda_{2}-\lambda_{1})}\leq t\right).$$

$$\mathbb{P}(|X_{N}/a_{N}|<R)=1+o(1)$$

$$\mathbb{P}(\mathcal{R}_{N,s})=1+o(1),$$

$$\lim_{p\downarrow 0}\limsup_{N\to\infty}\mathbb{P}(\mathcal{U}_{N,p}^{c})=\lim_{p\downarrow 0}\limsup_{N\to\infty}\mathbb{P}(\mathcal{L}_{N,p}^{c})=0.$$

$$\lim_{N\to\infty}\mathbb{P}(\lambda_{3}-\lambda_{2}<p(\lambda_{2}-\lambda_{1}))=\mathbb{P}(\Lambda_{3,\beta}-\Lambda_{2,\beta}<p(\Lambda_{2,\beta}-\Lambda_{1,\beta})).$$

$$\lim_{p\downarrow 0}\mathbb{P}(\Lambda_{3,\beta}-\Lambda_{2,\beta}<p(\Lambda_{2,\beta}-\Lambda_{1,\beta}))=\mathbb{P}(\Lambda_{3,\beta}=\Lambda_{2,\beta}).$$

$$\lim_{N\to\infty}\mathbb{P}(\lambda_{3}-\lambda_{2}<p(\lambda_{2}-\lambda_{1}))=\lim_{N\to\infty}\mathbb{P}\left(\frac{\lambda_{2}}{\lambda_{3}}<\left(\frac{\lambda_{1}}{\lambda_{2}}\right)^{p}\right).$$

$$\Gamma_{N}:=\lambda_{3}-\lambda_{2}-p(\lambda_{2}-\lambda_{1})+\lambda_{-}\left[\frac{\lambda_{2}}{\lambda_{3}}-\left(\frac{\lambda_{1}}{\lambda_{2}}\right)^{p}\right]$$

$$\mathbb{P}(|\Gamma_{N}|\geq\delta)=\mathbb{P}(|\Gamma_{N}|\geq\delta,B_{R})+\mathbb{P}(|\Gamma_{N}|\geq\delta,B_{R}^{c}).$$

$$\frac{\lambda_{2}}{\lambda_{3}}-\left(\frac{\lambda_{1}}{\lambda_{2}}\right)^{p}=\lambda_{-}^{-1}N^{-2/3}(\xi_{2}-\xi_{3})-p\lambda_{-}^{-1}N^{-2/3}(\xi_{1}-\xi_{2})+\mathcal{O}(N^{-4/3}).$$

$$\limsup_{N\to\infty}\mathbb{P}(|\Gamma_{N}|\geq\delta)\leq\limsup_{N\to\infty}\mathbb{P}(B_{R}^{c}).$$

$$\tau_{\mathrm{A},\epsilon}=\underbrace{\tau_{\mathrm{A},\epsilon}-T_{\mathrm{A},\epsilon}}_{=:D_{1}}+\underbrace{T_{\mathrm{A},\epsilon}-T_{\mathrm{A},\epsilon}^{*}}_{=:D_{2}}+T_{\mathrm{A},\epsilon}^{*}.$$

Taxonomy

Topics: Random Matrices and Applications · Advanced Combinatorial Mathematics · Advanced Algebra and Geometry

Full text

Universality for eigenvalue algorithms on sample covariance matrices

Percy Deift

Courant Institute of Mathematical Sciences, New York University, 251 Mercer St., New York, NY 10012, USA

and

Thomas Trogdon

Department of Mathematics, University of California, Irvine, Irvine, CA 92697-3875, USA

Abstract.

We prove a universal limit theorem for the halting time, or iteration count, of the power/inverse power methods and the QR eigenvalue algorithm. Specifically, we analyze the required number of iterations to compute extreme eigenvalues of random, positive-definite sample covariance matrices to within a prescribed tolerance. The universality theorem provides a complexity estimate for the algorithms which, in this random setting, holds with high probability. The method of proof relies on recent results on the statistics of the eigenvalues and eigenvectors of random sample covariance matrices (i.e., delocalization, rigidity and edge universality).

Key words and phrases:

universality, eigenvalue computation, random matrix theory

2000 Mathematics Subject Classification:

15B52, 65L15, 70H06

The authors would like to thank Folkmar Bornemann for the data to display $F_{2}^{\mathrm{gap}}(t)$ . This work was supported in part by grants NSF-DMS-1303018 (TT) and NSF-DMS-1300965 (PD).

1. Introduction

In this paper, we prove a universal limit theorem for the fluctuations in the runtime (or halting time) of three classical eigenvalue algorithms applied to positive-definite random matrices. The theorem is universal in the sense that the limiting distribution does not depend on the distribution of the entries of the matrix (within a class).

One can trace the search for universal behavior in eigenvalue algorithm runtimes to the largely-experimental work of Pfrang, Deift and Menon [20]. The authors considered three algorithms (QR, matrix sign and Toda) and ran the algorithms to the time of first deflation, which we now describe in more detail. Given an $N\times N$ matrix $H$ , the algorithms produce isospectral iterates $X_{n}$ , $X_{0}=H$ , $\mathrm{spec}\,X_{n}=\mathrm{spec}\,H$ , and generically $X_{n}\to\operatorname{diag}(\lambda_{1},\ldots,\lambda_{N})$ . Necessarily, the $\lambda_{i}$ ’s are the eigenvalues of $H$ . However, one does not typically run the algorithm until the norm of all of the off-diagonal entries is small. Rather, one considers the submatrices $X_{n}^{(k)}$ which consist of the entries of $X_{n}$ that are in the first $k$ rows and the last $N-k$ columns. The $k$ -deflation times are defined as

$$T^{(k)}(H):=\min\{n:\|X_{n}^{(k)}\|_{\mathrm{F}}\leq\epsilon\},\quad\epsilon>0.$$

Here $\|\cdot\|_{\mathrm{F}}$ denotes the Frobenius norm. (The authors in [20] actually considered a scaled $\infty$-norm rather than the Frobenius norm.) Then the time of first deflation is given by $T_{\mathrm{def}}(H):=\min_{1\leq k\leq N-1}T^{(k)}(H)$ . We define $\hat{k}=\hat{k}(H)$ to be the largest value of $k$ such that $T^{(k)}(H)=T_{\mathrm{def}}(H)$ . It follows that when $k=\hat{k}(H)$ , the eigenvalues of the leading $k\times k$ and trailing $(N-k)\times(N-k)$ submatrices approximate the eigenvalues of $H$ to $\mathcal{O}(\epsilon)$ . The algorithm is then applied to the smaller submatrices, and so on.
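The deflation test can be illustrated on a single iterate. The following is a minimal sketch, not the authors' code: the helper names `block_norm` and `deflation_index` and the 3×3 matrix are hypothetical, chosen only to show which block of $X_{n}$ enters $\|X_{n}^{(k)}\|_{\mathrm{F}}$.

```python
import math

def block_norm(X, k):
    """Frobenius norm of X^{(k)}: entries in the first k rows and last N-k columns."""
    N = len(X)
    return math.sqrt(sum(abs(X[i][j]) ** 2 for i in range(k) for j in range(k, N)))

def deflation_index(X, eps):
    """Largest k in 1..N-1 whose block norm is <= eps, or None if no block is small."""
    N = len(X)
    small = [k for k in range(1, N) if block_norm(X, k) <= eps]
    return max(small) if small else None

# Hypothetical iterate that is nearly block diagonal after the second row.
X = [[2.0, 0.5, 1e-9],
     [0.5, 1.0, 1e-9],
     [1e-9, 1e-9, 0.1]]
print(deflation_index(X, 1e-6))  # -> 2
```

Applied to successive iterates $X_{0},X_{1},\ldots$ , the first $n$ at which `deflation_index` returns a value is $T_{\mathrm{def}}(H)$ , and the value returned there is $\hat{k}(H)$ .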

A typical experiment from [20] goes as follows. Let $Y_{\mathrm{G}}$ and $Y_{\mathrm{B}}$ be $N\times N$ matrices of iid standard normal and iid mean-zero, variance-one Bernoulli random variables, respectively. Then define $H_{\mathrm{G}}=(Y_{\mathrm{G}}+Y_{\mathrm{G}}^{T})/\sqrt{2N}$ and $H_{\mathrm{B}}=(Y_{\mathrm{B}}+Y_{\mathrm{B}}^{T})/\sqrt{2N}$ which are real, symmetric random matrices (see [6] for complex Hermitian matrices). After sampling the integer-valued random variables $T_{\mathrm{def}}(H_{\mathrm{G}})$ and $T_{\mathrm{def}}(H_{\mathrm{B}})$ a large number of times (say 10,000) for $N$ large, we define the empirical fluctuations

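$$\bar{T}_{\mathrm{def}}(H_{\mathrm{G}})=\frac{T_{\mathrm{def}}(H_{\mathrm{G}})-\langle T_{\mathrm{def}}(H_{\mathrm{G}})\rangle}{\sigma_{\mathrm{G}}},\qquad\bar{T}_{\mathrm{def}}(H_{\mathrm{B}})=\frac{T_{\mathrm{def}}(H_{\mathrm{B}})-\langle T_{\mathrm{def}}(H_{\mathrm{B}})\rangle}{\sigma_{\mathrm{B}}},$$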

where $\langle\cdot\rangle$ and $\sigma_{(\cdot)}$ represent the sample mean and sample standard deviation, respectively. We plot the histograms for the empirical fluctuations in Figure 1. The histograms overlap surprisingly well for the two different ensembles, indicating that after centering and rescaling, the distribution of the time of first deflation is universal.
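The centering and rescaling step is simple enough to sketch. The function name `empirical_fluctuations` and the sample values below are made up for illustration; in an actual experiment the list would hold sampled values of $T_{\mathrm{def}}$ .

```python
import statistics

def empirical_fluctuations(samples):
    """Subtract the sample mean and divide by the sample standard deviation."""
    mean = statistics.fmean(samples)
    sigma = statistics.stdev(samples)  # sample (n-1) standard deviation
    return [(t - mean) / sigma for t in samples]

# Made-up integer halting times standing in for samples of T_def(H).
halting_times = [12, 15, 11, 19, 14, 13, 22, 16]
fluct = empirical_fluctuations(halting_times)
```

Histograms of `fluct` for two ensembles can then be compared directly, since both have sample mean zero and sample variance one by construction.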

Proving theorems about the random variable $T_{\mathrm{def}}(H)$ is particularly difficult as one has to analyze the minimum of $N-1$ correlated random variables. In [8] the authors proved a (universal) limit theorem for the $1$ -deflation time of the so-called Toda algorithm. In this paper we prove an analog of that result for the $(N-1)$ -deflation time for the QR (eigenvalue) algorithm acting on positive definite matrices. This is an important first step in proving a limit theorem for $T_{\mathrm{def}}(H)$ because it is most likely that $\hat{k}=N-1$ , see Figure 1. We also include similar results for the power (P) and inverse power (IP) methods as the analysis is similar. For these two methods, we incorporate random starting vectors. The analysis and results of the current work are quite similar to those in [8], where we prove universality for the Toda eigenvalue algorithm; this shows the wide applicability of the approach.

1.1. Relation to previous work and complexity theory

The statistical analysis of algorithms has been performed in many settings, usually with an eye towards complexity theory. In relation to Gaussian elimination, the seminal work is the analysis of Goldstine and von Neumann [13] on the condition number of random matrices. This is closely related to the later work of Edelman [9], also on condition numbers. The expected number of pivot steps in the simplex algorithm was analyzed by Smale [27] and Borgwardt [4]. The methodology of smoothed analysis was introduced in [28] and applied in a variety of settings [17, 19, 24].

The closest work, within the realm of complexity theory, to the current work is that of Kostlan [16]. Kostlan showed that for the power method on $H_{G}$ the expected halting time to compute an eigenvector is infinite. He also showed that when one conditions on all of the eigenvalues being positive, the upper bound on the halting time is $\mathcal{O}(N^{2}\log N)$ . Instead of conditioning and computing eigenvectors, we turn to sample covariance matrices, which (with high probability) have positive eigenvalues, and use the power methods to compute the extreme eigenvalues. With this we are able to determine the precise limiting distribution of the halting time, which contains far more information than simply an upper bound. To our knowledge, this is the first time this has been done for a classical numerical method.

For $\alpha$ in the scaling region given by Condition 2.1, the halting times given in Theorem 1 scale like $(\alpha-2/3)N^{2/3}\log N$ in order to obtain an accuracy of $N^{-\alpha/2}$ . This is a key conclusion of our results which gives an estimate on the complexity of the QR algorithm, and also the power and inverse power methods.

Through many detailed computations, universality has been observed in many numerical algorithms beyond the QR algorithm and the power and inverse power methods (see [8, 6, 7, 20, 23]): the conjugate gradient algorithm, the matrix sign eigenvalue algorithm, the Toda eigenvalue algorithm, the Jacobi eigenvalue algorithm, the GMRES algorithm, a genetic algorithm and the gradient and stochastic gradient descent algorithms. This work presents further examples, in addition to [8], where one can prove this type of universality. This advances the contention of the authors that universality is a bona fide and basic phenomenon in numerical computation.

1.2. Open questions

The main open question related to this work is the asymptotics of the time of first deflation $T_{\mathrm{def}}$ . A related and unknown detail is the tail behavior of the limiting distribution. As discussed in detail in [8], the limiting distribution in Theorem 1 in [8] for the halting time has one finite moment for real matrices and two finite moments for complex matrices. If one constructed an algorithm with a sub-Gaussian limiting distribution, it may be preferable. We believe this is the case for $T_{\mathrm{def}}$ . We also believe that its distribution is related to the largest gap in the spectrum of the stochastic Airy operator [22]. Furthermore, can one extend our results for the QR algorithm to indefinite ensembles?

We only consider random matrices with entries that are exponentially localized, see (1). It is not known whether this condition can be relaxed; this is itself an important open question. Finally, additional halting criteria can be employed. One could look for the time to compute eigenvectors with the power method, or to compute the entire spectrum with the QR algorithm.

2. Main results

In this paper we discuss computing the smallest and largest eigenvalues of random positive definite matrices to an accuracy $\epsilon$ . We have a basic condition that we enforce on $\epsilon$ which requires that $\epsilon$ is appropriately small.

Condition 2.1.

$$\frac{\alpha}{2}:=\log\epsilon^{-1}/\log N\geq 5/3+\sigma/2,$$

for $0<\sigma<1/3$ fixed.

For $j=0,1,2,\dots$ , we let $X_{j}$ be the iterates of the QR algorithm (QR, defined in Section 5.2) and $\lambda_{\mathrm{P},j}$ and $\lambda_{\mathrm{IP},j}$ be the iterates of the power and inverse power methods, respectively (P and IP, respectively, defined in Section 5.1). We specify (discrete) halting times for these algorithms applied to a matrix $H$ with starting vector $v$ as follows:

[TABLE]

Note that for the QR algorithm the $(N,N)$ entry of $X_{j}$ , $[X_{j}]_{NN}$ , is an approximation of the smallest eigenvalue $\lambda_{1}$ as is $\lambda_{\mathrm{IP},j}$ . On the other hand, $\lambda_{\mathrm{P},j}$ is an approximation of the largest eigenvalue $\lambda_{N}$ . Our main results are summarized in the following Theorem and Proposition. See Definition 1 for the definition of sample covariance matrices and Definition 3 for the distribution function $F_{\beta}^{\mathrm{gap}}(t)$ . The constants $\lambda_{\pm}$ and $d$ are given in (2).

Theorem 1 (Universality).

Let $H$ be a real ( $\beta=1$ ) or complex ( $\beta=2$ ) $N\times N$ sample covariance matrix and let $v$ be a (random) unit vector independent of $H$ . Assuming $\epsilon$ satisfies Condition 2.1, for $t\in\mathbb{R}$

[TABLE]

This theorem is a direct consequence of Theorems 5 and 6, after noting that, for example, $|\tau_{\mathrm{QR},\epsilon}-T_{\mathrm{QR},\epsilon}|\leq 1$ where $T_{\mathrm{QR},\epsilon}$ appears in Theorem 5. This is a universality theorem in the sense that it states that for $N$ sufficiently large the distribution of the halting time is independent of the distribution on $H$ .
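To get a feel for the size of the halting times, one can evaluate the normalizing factor $2^{-7/6}\lambda_{\pm}^{1/3}d^{-1/2}N^{2/3}(\log\epsilon^{-1}-\tfrac{2}{3}\log N)$ that divides $\tau_{\mathrm{P},\epsilon}$ and $\tau_{\mathrm{IP},\epsilon}$ in the limit. The sketch below assumes the standard Marchenko–Pastur edges $\lambda_{\pm}=(1\pm\sqrt{d})^{2}$ ; the function name and the sample values of $N$ , $d$ and $\epsilon$ are illustrative only.

```python
import math

def halting_time_scale(N, d, eps, edge="lower"):
    """Normalizing factor for the halting time.

    edge="lower" uses lambda_- (inverse power method),
    edge="upper" uses lambda_+ (power method).
    Assumes lambda_pm = (1 -/+ sqrt(d))**2.
    """
    sign = -1.0 if edge == "lower" else 1.0
    lam = (1.0 + sign * math.sqrt(d)) ** 2
    log_term = -math.log(eps) - (2.0 / 3.0) * math.log(N)
    return 2.0 ** (-7.0 / 6.0) * lam ** (1.0 / 3.0) * d ** (-0.5) * N ** (2.0 / 3.0) * log_term

# Illustrative values: eps = N**(-2), i.e. alpha/2 = 2, which satisfies alpha/2 >= 5/3 + sigma/2.
N, d = 1000, 0.5
scale_ip = halting_time_scale(N, d, N ** -2.0, edge="lower")
scale_p = halting_time_scale(N, d, N ** -2.0, edge="upper")
```

With $\epsilon=N^{-\alpha/2}$ the logarithmic factor equals $(\alpha/2-2/3)\log N$ , so the factor grows like $N^{2/3}\log N$ .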

The following Proposition shows that we obtain an accuracy of $\epsilon$ but not of $\epsilon^{2}$ , i.e. that our halting criteria are sufficient but not too restrictive. It is a restatement of Propositions 4 and 5.

Proposition 1.

Assuming $\epsilon$ satisfies Condition 2.1, for any real or complex sample covariance matrix

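$$\epsilon^{-1}|[X_{\tau_{\mathrm{QR},\epsilon}}]_{NN}-\lambda_{1}|,\quad\epsilon^{-1}|\lambda_{\mathrm{IP},\tau_{\mathrm{IP},\epsilon}}-\lambda_{1}|,\quad\text{and}\quad\epsilon^{-1}|\lambda_{\mathrm{P},\tau_{\mathrm{P},\epsilon}}-\lambda_{N}|$$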

converge to zero in probability, while

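$$\epsilon^{-2}|[X_{\tau_{\mathrm{QR},\epsilon}}]_{NN}-\lambda_{1}|,\quad\epsilon^{-2}|\lambda_{\mathrm{IP},\tau_{\mathrm{IP},\epsilon}}-\lambda_{1}|,\quad\text{and}\quad\epsilon^{-2}|\lambda_{\mathrm{P},\tau_{\mathrm{P},\epsilon}}-\lambda_{N}|$$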

converge to $\infty$ in probability.

A numerical demonstration of Theorem 1 is given in Section 4.

The outline of the paper is as follows. In Section 3 we discuss the fundamental results of random matrix theory that are required to prove our results. In Section 4 we give a numerical demonstration of Theorem 1. Next, in Section 5, we discuss the fundamentals of the power methods and the QR algorithm before we apply the random matrix estimates in Section 6 to prove our results. In Appendix A we analyze the true error of the methods with our chosen halting criteria to see that these criteria are indeed appropriate to the task. Finally, in Appendix B we discuss the asymptotic normality of eigenvector projections of random vectors. This allows us to show that Theorem 1 indeed holds for random starting vectors in the power and inverse power methods.

3. Results from random matrix theory

We now introduce the ideas and results from random matrix theory that are needed to prove our main theorems. Let $V$ be an $M\times N$ real or complex matrix with $M\geq N$ . We consider the ordered eigenvalues $\lambda_{j}(H)=\lambda_{j}$ , $j=1,2,\ldots,N$ of $H=V^{*}V/M$ , $\lambda_{1}\leq\lambda_{2}\leq\cdots\leq\lambda_{N}$ . Let $\beta_{1},\beta_{2},\ldots,\beta_{N}$ denote the absolute value of the last components of the associated normalized eigenvectors. We only consider sample covariance matrices from independent samples.

Definition 1 (Sample covariance matrix (SCM)).

A sample covariance matrix (ensemble) is a real symmetric ( $\beta=1$ ) or complex Hermitian ( $\beta=2$ ) matrix $H=V^{*}V/M$ , $V=(V_{ij})_{1\leq i\leq M,1\leq j\leq N}$ such that $V_{ij}$ are independent random variables for $1\leq i\leq M$ , $1\leq j\leq N$ given by a probability measure $\nu_{ij}$ with

$$\mathbb{E}V_{ij}=0,\quad\mathbb{E}|V_{ij}|^{2}=1.$$

Next, assume there is a fixed constant $\nu$ (independent of $N,i,j$ ) such that

$$\mathbb{P}(|V_{ij}|>x)\leq\nu^{-1}\exp(-x^{\nu}),\quad x>1.\tag{1}$$

For $\beta=2$ (when $V_{ij}$ is complex-valued) the condition

$$\mathbb{E}V_{ij}^{2}=0$$

must also be satisfied.

We assume all SCMs have $M\geq N$ . Define the averaged empirical spectral measure

$$\mu_{N}(z)=\mathbb{E}\frac{1}{N}\sum_{i=1}^{N}\delta(\lambda_{i}-z),$$

where the expectation is taken with respect to the given ensemble. For technical reasons we let $M=M(N)$ and $d_{N}:=N/M$ satisfy $\lim_{N\to\infty}d_{N}=:d\in(0,1)$ . More specifically, we consider $M=\lfloor N/d\rfloor$ .
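A real sample covariance matrix with $M=\lfloor N/d\rfloor$ can be generated in a few lines. This is a minimal sketch under stated assumptions: the name `sample_scm` and its `entry` argument are hypothetical, only the real case $H=V^{T}V/M$ is covered, and plain Python lists stand in for a matrix library.

```python
import math
import random

def sample_scm(N, d, rng, entry=None):
    """Return the N x N matrix H = V^T V / M with M = floor(N / d).

    `entry` draws one mean-zero, variance-one real entry of V;
    it defaults to a standard normal.
    """
    M = math.floor(N / d)
    draw = entry or (lambda: rng.gauss(0.0, 1.0))
    V = [[draw() for _ in range(N)] for _ in range(M)]
    return [[sum(V[k][i] * V[k][j] for k in range(M)) / M for j in range(N)]
            for i in range(N)]

rng = random.Random(0)
H_gauss = sample_scm(5, 0.5, rng)                                      # Gaussian entries
H_bern = sample_scm(5, 0.5, rng, entry=lambda: rng.choice((-1.0, 1.0)))  # Bernoulli entries
```

Both choices of entry distribution have bounded or Gaussian tails, so they are meant as examples of entries satisfying the tail condition in Definition 1.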

Remark 3.1.

The case where $\lim_{N\to\infty}d_{N}=1$ is of considerable interest: If $M=N+R$ then it is known that the limiting distribution of the smallest eigenvalue is given in terms of the so-called Bessel kernel [2, 10] when $V_{ij}$ has Gaussian divisible entries. If $R\to\infty$ , $R\leq CN^{1/2}$ and $V_{ij}$ are standard complex normal random variables then it is known that the smallest eigenvalue has Tracy–Widom fluctuations [7]. It is noted in [21, Section 1.4] that establishing all estimates we use below in the $\lim_{N\to\infty}d_{N}=1$ case is a difficult problem. In light of the current work, this is a particularly interesting problem as it would give different scalings for the halting times.

Define the Marchenko–Pastur law

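$$\rho_{d}(x):=\frac{1}{2\pi d}\sqrt{\frac{[(\lambda_{+}-x)(x-\lambda_{-})]_{+}}{x^{2}}},\qquad\lambda_{\pm}=(1\pm\sqrt{d})^{2},\tag{2}$$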

and $[\cdot]_{+}$ denotes the positive part. For SCMs, $\mu_{N}$ converges to $\rho_{d}(x)\mathrm{d}x$ weakly and $\rho_{d}(x)\mathrm{d}x$ is called the equilibrium measure for the ensemble (see, for example, [18, 21, 26, 29, 32]).

Definition 2.

Define $\gamma_{n}$ to be the smallest value of $t$ such that

$$\frac{n}{N}=\int_{-\infty}^{t}\rho_{d}(x)\,\mathrm{d}x,\quad n=1,2,\dots,N.$$

Thus $\{\gamma_{n}\}$ represent the quantiles of the equilibrium measure. We now describe conditions on the matrices that simplify the analysis of the algorithms QR, P and IP.
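Before turning to those conditions, note that the quantiles of Definition 2 are easy to approximate numerically. The sketch below assumes the standard Marchenko–Pastur density $\rho_{d}(x)=\sqrt{(\lambda_{+}-x)(x-\lambda_{-})}/(2\pi dx)$ on $[\lambda_{-},\lambda_{+}]$ with $\lambda_{\pm}=(1\pm\sqrt{d})^{2}$ . The substitution $x=1+d+2\sqrt{d}\cos\theta$ , the midpoint rule, and the function names are choices made here for illustration.

```python
import math

def mp_cdf(t, d, panels=400):
    """CDF of the Marchenko-Pastur law at t, using x = c + r*cos(theta)."""
    c, r = 1.0 + d, 2.0 * math.sqrt(d)   # lambda_- = c - r, lambda_+ = c + r
    if t <= c - r:
        return 0.0
    if t >= c + r:
        return 1.0
    theta_t = math.acos((t - c) / r)
    h = (math.pi - theta_t) / panels
    total = 0.0
    for i in range(panels):
        th = theta_t + (i + 0.5) * h     # midpoint of the i-th panel
        total += 2.0 * math.sin(th) ** 2 / (math.pi * (c + r * math.cos(th)))
    return total * h

def mp_quantiles(N, d):
    """Approximate gamma_1 <= ... <= gamma_N by bisection on the CDF."""
    lam_minus, lam_plus = (1 - math.sqrt(d)) ** 2, (1 + math.sqrt(d)) ** 2
    gammas = []
    for n in range(1, N + 1):
        lo, hi = lam_minus, lam_plus
        for _ in range(60):
            mid = 0.5 * (lo + hi)
            if mp_cdf(mid, d) < n / N:
                lo = mid
            else:
                hi = mid
        gammas.append(hi)
    return gammas

gammas = mp_quantiles(8, 0.5)
```

In the new variable the integrand is smooth, which is why a plain midpoint rule suffices; the last quantile $\gamma_{N}$ coincides with $\lambda_{+}$ up to the bisection tolerance.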

Condition 3.1.

For $0<p<\sigma/4$ ,

• $\frac{\lambda_{N-2}}{\lambda_{N-1}}<\left(\frac{\lambda_{N-1}}{\lambda_{N}}\right)^{p}$ .

Let $\mathcal{U}_{N,p}$ denote the set of matrices that satisfy this condition.

Condition 3.2.

For $0<p<\sigma/4$ ,

• $\frac{\lambda_{2}}{\lambda_{3}}<\left(\frac{\lambda_{1}}{\lambda_{2}}\right)^{p}$ .

Let $\mathcal{L}_{N,p}$ denote the set of matrices that satisfy this condition.

Given an SCM, let $v$ be a random (or deterministic) unit vector independent of the SCM. Define $\beta_{n}=|\langle v,u_{n}\rangle|$ , $n=1,2,\ldots,N$ where $u_{n}$ is the $n$ th eigenvector of the SCM.

Condition 3.3.

For any fixed $0<s<\sigma/40$ ,

(1) $\beta_{n}\leq N^{-1/2+s/2}$ for all $n$ ,

(2) $N^{-1/2-s/2}\leq\beta_{n}$ for $n=1,2,N-1,N$ ,

(3) $N^{-2/3-s/2}\leq\lambda_{N}-\lambda_{n-1}\leq N^{-2/3+s/2}$ , for $n=N,N-1$ ,

(4) $N^{-2/3-s/2}\leq\lambda_{n}-\lambda_{1}\leq N^{-2/3+s/2}$ , for $n=2,3$ , and

(5) $|\lambda_{n}-\gamma_{n}|\leq N^{-2/3+s/2}(\min\{n,N-n+1\})^{-1/3}$ for all $n$ .

Let $\mathcal{R}_{N,s}$ denote the set of matrices that satisfy these conditions.

Remark 3.2.

Clearly the quantiles $\{\gamma_{n}\}$ lie in the interval $(\lambda_{-},\lambda_{+})$ . Property (5) above implies, in particular, that for $N$ sufficiently large, the eigenvalues $\{\lambda_{n}\}$ of matrices in $\mathcal{R}_{N,s}$ lie in the interval $(\lambda_{-}-\eta,\lambda_{+}+\eta)$ for any given $\eta>0$ .

The analysis of the eigenvalues of sample covariance matrices has a long history, beginning with the work of Marčenko and Pastur [18]. The seminal work of Geman [12] showed that for $M,N\to\infty$ , $N/M\to y\in(0,\infty)$ , the largest eigenvalue of an SCM converges a.s. to $\lambda_{+}$ . Silverstein [25] established that for $M,N\to\infty$ , $N/M\to y\in(0,1)$ the smallest eigenvalue converges a.s. to $\lambda_{-}$ when $V_{ij}$ are iid standard normal random variables. See [10, 14, 15] for the first results on the fluctuations of the largest and smallest eigenvalues when $V_{ij}$ are iid (real or complex) standard normal random variables. Universality for the eigenvalues of $\frac{1}{N}V^{*}V$ at the edges and in the bulk was first proved by Ben Arous and Péché [2] for Gaussian divisible ensembles, in the limit $N,M\to\infty$ , $M=N+\nu$ , $\nu$ fixed. We reference [21] and [3] for the most comprehensive results. Note that we require (1), which is stronger than the assumptions in [12, 32], which only require moment conditions. Various limits of the eigenvectors have also been considered, see [1, 26]. But we reference [3] for the full generality we need to prove our theorems.

Theorem 2.

For SCMs,

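$$N^{2/3}\lambda_{+}^{-2/3}d^{1/2}(\lambda_{+}-\lambda_{N},\lambda_{+}-\lambda_{N-1},\lambda_{+}-\lambda_{N-2})$$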

and

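$$N^{2/3}\lambda_{-}^{-2/3}d^{1/2}(\lambda_{1}-\lambda_{-},\lambda_{2}-\lambda_{-},\lambda_{3}-\lambda_{-})$$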

separately converge jointly in distribution to random variables $(\Lambda_{1,\beta},\Lambda_{2,\beta},\Lambda_{3,\beta})$ which are the smallest three eigenvalues of the so-called stochastic Airy operator. Furthermore, $(\Lambda_{1,\beta},\Lambda_{2,\beta},\Lambda_{3,\beta})$ are distinct with probability one.

Proof.

The first statement follows from [3, Theorem 8.3]. The second statement follows from [21, Theorem 1.1 & Corollary 1.2]. The fact that the eigenvalues of the stochastic Airy operator are distinct is shown in [22, Theorem 1.1]. ∎

Definition 3.

The distribution function $F^{\mathrm{gap}}_{\beta}(t)$ , supported on $t\geq 0$ for $\beta=1,2$ is given by

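$$F_{\beta}^{\mathrm{gap}}(t)=\mathbb{P}\left(\frac{1}{\Lambda_{2,\beta}-\Lambda_{1,\beta}}\leq t\right)=\lim_{N\to\infty}\mathbb{P}\left(\frac{1}{2^{-7/6}N^{2/3}\lambda_{-}^{-2/3}d^{-1/2}(\lambda_{2}-\lambda_{1})}\leq t\right).$$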

The remaining theorems in this section are compiled from results that have been obtained recently in the literature. We use a simple lemma (see, for example, [8, Lemma 3.2]):

Lemma 1.

If $X_{N}\to X$ in distribution as $N\to\infty$ (where, for convergence in distribution, we require that the limiting random variable $X$ satisfies $\mathbb{P}(|X|<\infty)=1$ ) then for any $R>0$

$$\mathbb{P}(|X_{N}/a_{N}|<R)=1+o(1)$$

as $N\to\infty$ provided that $a_{N}\to\infty$ .

Theorem 3.

For SCMs, Condition 3.3 holds with high probability as $N\to\infty$ , that is, for any $s>0$

$$\mathbb{P}(\mathcal{R}_{N,s})=1+o(1),$$

as $N\to\infty$ .

Proof.

It suffices to show that each of the sub-conditions 1-5 in Condition 3.3 hold with high probability. Conditions 3.3.1-2 hold with high probability directly by Proposition 6. Conditions 3.3.3-4 hold with high probability by the joint convergence of the top (bottom) three eigenvalues in Theorem 2 and Lemma 1. Finally, Condition 3.3.5 holds with high probability as a direct consequence of [21, Theorem 3.3]. ∎

Theorem 4.

For SCMs,

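$$\lim_{p\downarrow 0}\limsup_{N\to\infty}\mathbb{P}(\mathcal{U}_{N,p}^{c})=\lim_{p\downarrow 0}\limsup_{N\to\infty}\mathbb{P}(\mathcal{L}_{N,p}^{c})=0.$$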

Proof.

It follows from Theorem 2 that

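$$\lim_{N\to\infty}\mathbb{P}(\lambda_{3}-\lambda_{2}<p(\lambda_{2}-\lambda_{1}))=\mathbb{P}(\Lambda_{3,\beta}-\Lambda_{2,\beta}<p(\Lambda_{2,\beta}-\Lambda_{1,\beta})).$$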

Then

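$$\lim_{p\downarrow 0}\mathbb{P}(\Lambda_{3,\beta}-\Lambda_{2,\beta}<p(\Lambda_{2,\beta}-\Lambda_{1,\beta}))=\mathbb{P}(\Lambda_{3,\beta}=\Lambda_{2,\beta}).$$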

But from [22, Theorem 1.1] $\mathbb{P}(\Lambda_{3,\beta}=\Lambda_{2,\beta})=0$ . And so, it suffices to show that

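$$\lim_{N\to\infty}\mathbb{P}(\lambda_{3}-\lambda_{2}<p(\lambda_{2}-\lambda_{1}))=\lim_{N\to\infty}\mathbb{P}\left(\frac{\lambda_{2}}{\lambda_{3}}<\left(\frac{\lambda_{1}}{\lambda_{2}}\right)^{p}\right).$$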

This will, in turn, follow if we show that

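$$\Gamma_{N}:=\lambda_{3}-\lambda_{2}-p(\lambda_{2}-\lambda_{1})+\lambda_{-}\left[\frac{\lambda_{2}}{\lambda_{3}}-\left(\frac{\lambda_{1}}{\lambda_{2}}\right)^{p}\right]$$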

converges to zero in probability for $p$ fixed. We set $\lambda_{j}=\lambda_{-}+N^{-2/3}\xi_{j}$ where $(\xi_{1},\xi_{2},\xi_{3})$ converges jointly in distribution by Theorem 2. Let $B_{R}$ be the event $\|(\xi_{1},\xi_{2},\xi_{3})\|\leq R$ and for $\delta>0$ consider

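$$\mathbb{P}(|\Gamma_{N}|\geq\delta)=\mathbb{P}(|\Gamma_{N}|\geq\delta,B_{R})+\mathbb{P}(|\Gamma_{N}|\geq\delta,B_{R}^{c}).$$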

Given $B_{R}$ , we perform a formal expansion

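$$\frac{\lambda_{2}}{\lambda_{3}}-\left(\frac{\lambda_{1}}{\lambda_{2}}\right)^{p}=\lambda_{-}^{-1}N^{-2/3}(\xi_{2}-\xi_{3})-p\lambda_{-}^{-1}N^{-2/3}(\xi_{1}-\xi_{2})+\mathcal{O}(N^{-4/3}).$$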

Therefore, given $B_{R}$ , $\Gamma_{N}$ tends to zero uniformly and we find

$$\limsup_{N\to\infty}\mathbb{P}(|\Gamma_{N}|\geq\delta)\leq\limsup_{N\to\infty}\mathbb{P}(B_{R}^{c}).$$

Because of joint convergence (in distribution) of $(\xi_{1},\xi_{2},\xi_{3})$ , the right-hand side tends to zero as $R\to\infty$ . This establishes the result for $\mathcal{L}_{N,p}$ . Similar considerations yield the result for $\mathcal{U}_{N,p}$ . ∎

4. A numerical demonstration

We include some numerical simulations that serve to demonstrate Theorem 1. We include ideas that were discussed in detail in [8]. En route to proving Theorem 1 we perform the following approximation step for A = QR, IP or P

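$$\tau_{\mathrm{A},\epsilon}=\underbrace{\tau_{\mathrm{A},\epsilon}-T_{\mathrm{A},\epsilon}}_{=:D_{1}}+\underbrace{T_{\mathrm{A},\epsilon}-T_{\mathrm{A},\epsilon}^{*}}_{=:D_{2}}+T_{\mathrm{A},\epsilon}^{*},$$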

where $T^{*}_{\mathrm{A},\epsilon}$ is given in (7) and (14) below. The difference $D_{1}$ is always less than unity and the difference $D_{2}$ is $\mathcal{O}(N^{2/3})$ (see Proposition 2, for example). Then $T^{*}_{\mathrm{A},\epsilon}$ converges in distribution, after rescaling, to $F_{\beta}^{\mathrm{gap}}$ but it is clear from the proof of Theorem 6 that the rate of convergence is logarithmic, at best. To improve the rate we note that

[TABLE]

for any constant $\zeta_{A}$ . Here $\lambda_{+}$ is taken if A = P and $\lambda_{-}$ is taken if A = QR, IP. We choose $\zeta_{\mathrm{QR}}$ (cf. with $\zeta$ chosen in [8]), using (7), by

[TABLE]

After examining (14), we choose

[TABLE]

Then changing $\lambda_{2}\to\lambda_{N-1}^{-1}$ and $\lambda_{1}\to\lambda_{N}^{-1}$ in (14) we choose

[TABLE]

Despite the fact that these $\zeta_{\mathrm{A}}$ ’s are not constant, from Theorem 2 one should expect they have well-defined limits as $N\to\infty$ . These effective constants can be easily approximated by sampling the associated matrix distributions.

In Figure 2 we demonstrate (3) and hence Theorem 1 for the QR algorithm. Figures 3 and 4 demonstrate the analogous results for the inverse power method and power method, respectively. The ensembles we use are the following:

LOE: $V$ (in Definition 1) has iid standard real Gaussian entries,

LUE: $V$ has iid standard complex Gaussian entries,

BE: $V$ has iid mean-zero, variance-one Bernoulli entries ( $\pm 1$ with equal probability),

CBE: $V$ has iid mean-zero, variance-one complex Bernoulli entries ( $\{a,-a,\bar{a},-\bar{a}\}$ , $a=(1+i)/\sqrt{2}$ , with equal probability).
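The four entry distributions can be sampled as follows. This is an illustrative sketch: `entry_sampler` is a hypothetical helper, the standard complex Gaussian is taken to be $(g_{1}+ig_{2})/\sqrt{2}$ with $g_{1},g_{2}$ independent standard normals, and the complex Bernoulli value is taken to be $a=(1+i)/\sqrt{2}$ so that $|a|=1$ , as the variance-one requirement forces.

```python
import math
import random

A = complex(1.0, 1.0) / math.sqrt(2.0)  # unit modulus, so E|V_ij|^2 = 1

def entry_sampler(name, rng):
    """Return a zero-argument function drawing one entry of V for the named ensemble."""
    if name == "LOE":
        return lambda: rng.gauss(0.0, 1.0)
    if name == "LUE":
        return lambda: complex(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) / math.sqrt(2.0)
    if name == "BE":
        return lambda: rng.choice((-1.0, 1.0))
    if name == "CBE":
        return lambda: rng.choice((A, -A, A.conjugate(), -A.conjugate()))
    raise ValueError(f"unknown ensemble: {name}")

rng = random.Random(1)
draws = {name: [entry_sampler(name, rng)() for _ in range(20000)]
         for name in ("LOE", "LUE", "BE", "CBE")}
```

For the complex ensembles the draws also satisfy $\mathbb{E}V_{ij}^{2}=0$ : in the CBE case $V_{ij}^{2}$ takes the values $\pm i$ with equal probability.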

The density $\frac{d}{dt}F_{1}^{\mathrm{gap}}(t)$ was computed by the authors in [8]. We sample the matrix distributions for $N$ large and use appropriate interpolation. The density $\frac{d}{dt}F_{2}^{\mathrm{gap}}(t)$ was computed in [31] (and rescaled in [8]) and the data to reproduce it here was provided by the authors of that work.

Finally, in Figure 5 we show the statistics of the time of first deflation, as defined in the introduction, for LOE and BE when $d=2$ . This demonstrates universality for the time of first deflation but the limiting distribution (whatever it may be!) is clearly distinct from both histograms in Figure 1 and the limiting distribution in Theorem 1. And so, computing the limiting distribution for the rescaled time of first deflation requires information about much more than just the $(N-1)$ -deflation time.

5. Fundamentals of the algorithms

Here we discuss the QR algorithm and power/inverse power methods. We derive explicit formulae to analyze the halting times of the algorithms.

5.1. The power and inverse power methods

Let $Y_{1},Y_{2},Y_{3},\ldots,$ be a sequence of independent, real, mean-zero, and variance-one random variables. The power and inverse power methods with random starting are given in Algorithms 1 and 2.

The power method (see Algorithm 1 above) is halted when successive approximations have a difference that is less than $\epsilon^{2}$ . Our analysis reveals (see Proposition 1 and Remark A.2) that typically $|\lambda-\lambda_{\mathrm{old}}|$ is less than the true error $|\lambda-\lambda_{N}|$ and so one has to run until the difference is $\epsilon^{2}$ . Similarly, the inverse power method is given by Algorithm 2 below where we use the convention $0^{-1}=\infty$ .
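Since Algorithm 1 is not reproduced in this text, the following is only a sketch of a power iteration with the halting rule just described: stop once successive Rayleigh-quotient approximations differ by at most $\epsilon^{2}$ . The function names, the optional `v0` argument, the `max_iter` safeguard and the restriction to real symmetric matrices are assumptions made for illustration.

```python
import math
import random

def matvec(H, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(hij * vj for hij, vj in zip(row, v)) for row in H]

def power_method(H, eps, rng=None, v0=None, max_iter=100_000):
    """Return (lam, steps): the final approximation and the number of iterations."""
    rng = rng or random.Random()
    n = len(H)
    # Random start from independent mean-zero, variance-one variables unless v0 is given.
    y = list(v0) if v0 is not None else [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in y))
    v = [x / norm for x in y]
    lam_old = math.inf
    lam = sum(a * b for a, b in zip(v, matvec(H, v)))
    steps = 0
    while abs(lam - lam_old) > eps ** 2 and steps < max_iter:
        w = matvec(H, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
        lam_old, lam = lam, sum(a * b for a, b in zip(v, matvec(H, v)))
        steps += 1
    return lam, steps

# Small positive-definite test matrix with eigenvalues 3 - sqrt(3), 3, 3 + sqrt(3).
H = [[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 4.0]]
lam, steps = power_method(H, 1e-4, v0=[1.0, 1.0, 1.0])
```

Replacing the product with $H$ by a linear solve with $H$ would give the corresponding sketch of the inverse power method.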

Let $H=U\Lambda U^{*}$ , $U=(u_{1},u_{2},\ldots,u_{N})$ be a spectral decomposition for the matrix $H$ . A random unit vector is given by

[TABLE]

for the given random variables $Y_{j}$ . With the inverse power method, at each iteration, $t=1,2,3,\ldots$ we have

[TABLE]

For the power method we have

[TABLE]

5.1.1. The halting time

We define the halting time for the inverse power method as

[TABLE]

Provided that the smallest eigenvalue of $H$ is order $1$ , this halting condition will give the same order of approximation in $\epsilon$ as a possibly more natural condition $\inf\{t:|\lambda_{\mathrm{IP}}(t)-\lambda_{\mathrm{IP}}(t+1)|\leq\epsilon\}$ . We choose (4) for convenience and show it is sufficient. Similarly, the halting time for the power method is

[TABLE]

Define the function

[TABLE]

Using the notation $\delta_{n}=\lambda_{1}^{2}/\lambda^{2}_{n}\leq 1$ , $\nu_{n}=\beta_{n}^{2}/\beta_{1}^{2}$ , we have

[TABLE]

Note that

[TABLE]

Remark 5.1.

We focus on the inverse power method here. There is an analogous function $E_{\mathrm{P}}(t)$ for the power method which can be found through the mapping $\lambda_{j}\to\lambda_{j}^{-1}$ . And so, if we can establish properties of $E_{\mathrm{IP}}(t)$ under assumptions on $H$ that $H^{-1}$ also satisfies, the properties extend to $E_{\mathrm{P}}(t)$ .

5.2. The QR (eigenvalue) algorithm

Unlike the power and inverse power methods, the convergence criterion for the QR algorithm (without shifts) is much more subtle even though convergence is guaranteed for the matrices we consider [11]. We consider a general error control function $f(H)\geq 0$ , see below. The basis of the algorithm is the QR factorization of a non-singular matrix. We use $(Q,R)=\mathrm{QR}(H)$ ( $H=QR$ ) to denote this factorization where $Q$ is unitary and $R$ is upper-triangular with positive diagonal entries. The QR factorization can be found via the modified Gram–Schmidt procedure or Householder reflections, for example. It is unique when it exists. The QR algorithm is given by the following steps:
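The algorithm listing itself does not appear in this text, so what follows is a minimal sketch of the unshifted iteration (factor $X=QR$ , set $X\leftarrow RQ$ , stop when $f(X)\leq\epsilon$ ), restricted to real symmetric positive-definite input. The modified Gram–Schmidt factorization matches the convention above (positive diagonal of $R$ ); the error control function `last_row_norm` is a hypothetical choice used only to make the example run, not the function analyzed in the paper.

```python
import math

def matmul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def qr_gram_schmidt(A):
    """Modified Gram-Schmidt: A = Q R with Q orthogonal and diag(R) > 0."""
    n = len(A)
    cols = [[A[i][j] for i in range(n)] for j in range(n)]
    q_cols = []
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j][:]
        for i, q in enumerate(q_cols):
            R[i][j] = sum(qk * vk for qk, vk in zip(q, v))
            v = [vk - R[i][j] * qk for qk, vk in zip(q, v)]
        R[j][j] = math.sqrt(sum(vk * vk for vk in v))
        q_cols.append([vk / R[j][j] for vk in v])
    Q = [[q_cols[j][i] for j in range(n)] for i in range(n)]
    return Q, R

def qr_algorithm(H, f, eps, max_iter=100_000):
    """Iterate X <- R Q until f(X) <= eps; return the final X and the iteration count."""
    X = [row[:] for row in H]
    n = 0
    while f(X) > eps and n < max_iter:
        Q, R = qr_gram_schmidt(X)
        X = matmul(R, Q)
        n += 1
    return X, n

def last_row_norm(X):
    """Hypothetical error control: norm of the off-diagonal part of the last row."""
    return math.sqrt(sum(x * x for x in X[-1][:-1]))

H = [[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 4.0]]
X, n_iter = qr_algorithm(H, last_row_norm, 1e-8)
smallest = X[-1][-1]  # approximates the smallest eigenvalue, 3 - sqrt(3)
```

Each step is a similarity transformation, so the trace (and the spectrum) of `X` agrees with that of `H` up to rounding.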

Provided that $f$ is suitably chosen, (a subset of) the diagonal entries of $X$ will be an approximation of eigenvalues of $H$ . We develop a more analytically tractable description of the QR algorithm. For a positive-definite matrix $H$ , let $H^{t}$ denote its $t$ th power, $t\geq 0$ . Define $Q(t)$ , $R(t)$ and $X(t)$ via

[TABLE]

For the QR algorithm we are interested in $t\in\mathbb{N}$ but for additional remarks we want to consider $t\geq 0$ . And so, it is important to note that $Q(t)$ and $R(t)$ are infinitely differentiable matrix-valued functions of $t$ .

It is well known that $X(n)$ , $n=0,1,2,\ldots$ gives the iterates $X_{n}$ of the QR algorithm, with, of course, $X(0)=X_{0}=H$ . For the convenience of the reader, we provide the following standard proof.

Lemma 2.

For all $n\in\mathbb{N}$ , $X(n)=X_{n}$ .

Proof.

Using induction, the QR algorithm is described as

[TABLE]

Then we consider the QR factorization of $H^{n}$

[TABLE]

It then follows that $Q(n)=Q_{0}Q_{1}\cdots Q_{n-1}$ by the uniqueness of the QR factorization. Therefore $X(n)=X_{n}$ . ∎
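Lemma 2 can be checked numerically. The sketch below is an illustration under stated assumptions: a small, well-conditioned real positive-definite matrix, and a helper `qr_pos` (a hypothetical name) that normalizes NumPy's QR factorization so that $R$ has positive diagonal entries, as the uniqueness argument requires. It compares $n$ steps of the unshifted iteration $X\mapsto RQ$ with $Q(n)^{*}HQ(n)$, where $Q(n)$ comes from the QR factorization of $H^{n}$.

```python
import numpy as np

def qr_pos(A):
    """QR factorization normalized so that R has a positive diagonal."""
    Q, R = np.linalg.qr(A)
    s = np.sign(np.diag(R))
    s[s == 0] = 1.0
    # (Q S)(S R) = Q R because S is diagonal with entries +-1.
    return Q * s, s[:, None] * R

def qr_iterate(H, n):
    """Apply n steps of the unshifted QR algorithm: X = QR, then X <- RQ."""
    X = H.copy()
    for _ in range(n):
        Q, R = qr_pos(X)
        X = R @ Q
    return X

def x_from_power(H, n):
    """Compute Q(n)^T H Q(n), with Q(n) from the QR factorization of H^n."""
    Q, _ = qr_pos(np.linalg.matrix_power(H, n))
    return Q.T @ H @ Q

rng = np.random.default_rng(0)
N, M = 4, 40                      # illustrative sizes (assumption)
A = rng.standard_normal((N, M))
H = A @ A.T / M                   # real positive-definite matrix

n = 3
X_iter = qr_iterate(H, n)
X_power = x_from_power(H, n)
```

For large $n$ the matrix $H^{n}$ becomes ill-conditioned, so this direct comparison is only meaningful in floating point for small $n$.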

Let $H=V\Lambda V^{*}$, $\Lambda=\operatorname{diag}(\lambda_{1},\ldots,\lambda_{N})$ be a spectral decomposition of $H$. (Note that $V$ is not uniquely defined. Furthermore, if the spectrum is not simple then $V$ is not even uniquely defined modulo phases.) Then define $U(t)=Q^{*}(t)V$ so that $X(t)=U(t)\Lambda U^{*}(t)$. We first compute $U_{Nn}(t)$, $n=1,2,\ldots,N$ by considering ($e_{j}$ is the $j$th canonical basis column vector and $U(0)=V$)

[TABLE]

And so, to determine $R_{NN}>0$ , we sum over $n$ and use the normalization of the rows of $U(t)$ :

[TABLE]

When it comes to the choice of the function $f(X)$ in Algorithm 3, we first give two options that we do not analyze but are of great interest:

•

Compute the entire spectrum: $f(X)=\|X-\operatorname{diag}(X)\|_{\mathrm{F}}$ . Here $\operatorname{diag}(X)$ is a diagonal matrix containing just the diagonal of $X$ and $\|\cdot\|_{\mathrm{F}}$ is the Frobenius norm.

•

Deflation (here $X(i:j,l:k)$ refers to the submatrix containing entries in rows $i$ through $j$ and columns $l$ through $k$):

[TABLE]

For our purposes here we choose $f(X)$ as

[TABLE]

This is the sum of the off-diagonal entries in the last row of $X$ . And so, if $f(X)$ is small then $X_{NN}$ is close to an eigenvalue of $X$ . Continuing,

[TABLE]
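The role of the last-row error control can be illustrated with a short sketch. It is hedged in two ways: the displayed formula for $f$ is not reproduced here, so the code assumes $f(X)=\sum_{n<N}|X_{Nn}|^{2}$, and it assumes the iteration halts once $f(X)\leq\epsilon^{2}$. The names `last_row_error` and `qr_until` are hypothetical. For a symmetric $X$, the residual $\|Xe_{N}-X_{NN}e_{N}\|_{2}$ equals $\sqrt{f(X)}$, so some eigenvalue lies within $\sqrt{f(X)}$ of $X_{NN}$.

```python
import numpy as np

def last_row_error(X):
    """Assumed error control: sum of squared off-diagonal entries in the last row."""
    return float(np.sum(np.abs(X[-1, :-1]) ** 2))

def qr_until(H, eps, max_iter=10_000):
    """Run the unshifted QR algorithm until last_row_error(X) <= eps**2.

    Returns the final iterate and the number of QR steps taken.
    The sign convention of the QR factors only conjugates X by a diagonal
    matrix of signs, so it does not affect the diagonal of X or this error.
    """
    X = H.copy()
    for n in range(max_iter + 1):
        if last_row_error(X) <= eps ** 2:
            return X, n
        Q, R = np.linalg.qr(X)
        X = R @ Q
    raise RuntimeError("error control did not fall below eps**2")

rng = np.random.default_rng(0)
N, M = 4, 40                      # illustrative sizes (assumption)
A = rng.standard_normal((N, M))
H = A @ A.T / M

eps = 1e-6
X_final, n_steps = qr_until(H, eps)
lam = np.linalg.eigvalsh(H)
gap_to_spectrum = float(np.min(np.abs(lam - X_final[-1, -1])))
```

The integer `n_steps` returned here plays the role of the halting time for this particular choice of $f$ and tolerance.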

Remark 5.2.

It is worth emphasizing that $X(t)$, the interpolation of the QR iterates $\{X_{n}\}$, is the solution of a nonlinear differential equation [5]. Furthermore, in the real symmetric case, this is generically a system in $2[N^{2}/4]$ variables that is Hamiltonian and completely integrable. The eigenvalues of $X(0)=H$ constitute $N$ of the $[N^{2}/4]$ integrals of the motion, i.e. the flow is, in particular, isospectral. The equations of motion are given by

[TABLE]

where $Y_{-}$ is the (strictly) lower-triangular part of $Y$ . The Hamiltonian is given by $H(X)=\operatorname{tr}(X(\log X-1))$ . See [5] and [30] for more details.

5.2.1. The halting time

We define the halting time for the QR algorithm as

[TABLE]

Note that we do not assume here that $t$ is an integer. The “true” halting time for the QR algorithm is $\lceil T_{\mathrm{QR},\epsilon}(H)\rceil$ but it will turn out that this has the same limiting distribution as $T_{\mathrm{QR},\epsilon}(H)$ .

The first step in the analysis of the QR algorithm is to write $E_{\mathrm{QR}}(t)$ as a sum of two positive parts, as follows. Define for $n\geq 1$

[TABLE]

Then

[TABLE]

It is clear that $\delta_{1}=1\geq\delta_{n}$ for all $n$ and we isolate this term:

[TABLE]

Heuristically, $E_{\mathrm{QR},1}(t)$ is quadratic in $\delta_{2}^{t}$ and $E_{\mathrm{QR},0}(t)$ is not. Therefore $E_{\mathrm{QR},0}(t)$ should provide the leading order behavior of $E_{\mathrm{QR}}(t)$ as $t\to\infty$ provided that the $\nu_{n}$’s are not too large. Note that by the Cauchy–Schwarz inequality, $E_{\mathrm{QR},1}(t)\geq 0$.

6. Proofs of the main theorems

In order to prove our main theorems we take the following approach. The dynamics of the QR algorithm closely mirror those of the so-called Toda algorithm, and therefore many of the results of [8] apply directly. To prove Theorem 1 for the QR algorithm we rely almost exclusively on results quoted from [8]. To prove Theorem 1 for the power and inverse power methods, we discuss the calculations in more detail.

For convenience let $\epsilon=N^{-\alpha/2}$ . Then Condition 2.1 takes the form

[TABLE]

with $\sigma>0$ and fixed.

6.1. Technical lemmas

We begin by modifying the technical lemmas from [8] as our formulae now depend on the ratio of eigenvalues as opposed to their differences in [8]. The main fact is that if a matrix $H$ satisfies Condition 3.3 with $0<s<1/5$ then so does $\log H$ with quantiles $\hat{\gamma}_{n}=\log\gamma_{n}$ provided $N$ is sufficiently large. Indeed for $0<s<\sigma/40$

[TABLE]

for $N$ sufficiently large. Applying Condition 3.3(5) with $s$ replaced by $s/2$ , we conclude that for $N$ sufficiently large $|\log\lambda_{n}-\hat{\gamma}_{n}|\leq N^{-2/3+s/2}(\min\{n,N-n+1\})^{-1/3}$ for all $n$ . Concerning Condition 3.3(3), note that for $N$ sufficiently large

[TABLE]

and one then proceeds as before. The proof of Condition 3.3(4) is similar.

Recall the notation $\delta_{n}=\lambda_{1}^{2}/\lambda_{n}^{2}$ and define $I_{c}=\{2\leq n\leq N:\delta_{n}\leq\delta_{2}^{1+c}\}$ for $c>0$ .

Lemma 3 ([8]).

Let $0<c<10/\sigma$ . Given Condition 3.3, then the cardinality of $I_{c}^{c}$ is given by

[TABLE]

for $N$ sufficiently large, where ${}^{c}$ denotes the complement relative to $\{1,\ldots,N-1\}$.
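To make the index set concrete, here is a small sketch that computes $\delta_{n}=\lambda_{1}^{2}/\lambda_{n}^{2}$ and $I_{c}=\{2\leq n\leq N:\delta_{n}\leq\delta_{2}^{1+c}\}$ from a list of eigenvalues. The function names and the toy spectrum are illustrative assumptions, and indices are reported in the 1-based convention of the text.

```python
import numpy as np

def delta(lam):
    """Return delta_n = lam_1^2 / lam_n^2 for eigenvalues sorted in ascending order."""
    lam = np.sort(np.asarray(lam, dtype=float))
    return (lam[0] / lam) ** 2

def index_set(lam, c):
    """Return the 1-based indices n in {2,...,N} with delta_n <= delta_2^(1+c)."""
    d = delta(lam)
    n = np.arange(1, len(d) + 1)
    mask = (n >= 2) & (d <= d[1] ** (1.0 + c))
    return n[mask]

# Toy spectrum (assumption, for illustration only).
lam = [1.0, 2.0, 3.0, 10.0]
I_c = index_set(lam, c=1.0)
# Here delta = (1, 1/4, 1/9, 1/100) and delta_2^2 = 1/16, so only n = 4 qualifies.
```

Since $\delta_{2}<1$ whenever $\lambda_{1}<\lambda_{2}$, the index $n=2$ never belongs to $I_{c}$ for $c>0$; the set collects the indices whose $\delta_{n}$ is smaller than $\delta_{2}$ by a definite power.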

Recalling the notation $\nu_{n}=\beta_{n}^{2}/\beta_{1}^{2}$, for matrices in $\mathcal{R}_{N,s}$ we have $\nu_{n}\leq N^{2s}$ and $\sum_{n}\nu_{n}=\beta_{1}^{-2}\leq N^{1+s}$ because $\sum_{n=1}^{N}\beta_{n}^{2}=\sum_{n=1}^{N}|\langle v,u_{n}\rangle|^{2}=\|U^{*}v\|_{2}^{2}=\|v\|_{2}^{2}=1$ for the unitary matrix $U$ of eigenvectors. We also have the following result.

Lemma 4 ([8]).

Given Condition 3.3, $0<c<10/\sigma$ and $j\leq 3$ fixed there exists an absolute constant $C$ such that

[TABLE]

for $N$ sufficiently large.
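The normalization identities for $\beta_{n}$ and $\nu_{n}$ used above can be verified directly. The following sketch assumes a real Gaussian SCM and a normalized Gaussian vector $v$ (illustrative choices, not taken from the source) and checks $\sum_{n}\beta_{n}^{2}=1$ and $\sum_{n}\nu_{n}=\beta_{1}^{-2}$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 50, 100                     # illustrative sizes (assumption)
A = rng.standard_normal((N, M))
H = A @ A.T / M                    # real sample covariance matrix

lam, U = np.linalg.eigh(H)         # ascending eigenvalues, orthonormal columns u_n
Y = rng.standard_normal(N)
v = Y / np.linalg.norm(Y)          # unit vector

beta = np.abs(U.T @ v)             # beta_n = |<v, u_n>|
nu = beta**2 / beta[0]**2          # nu_n = beta_n^2 / beta_1^2

sum_beta_sq = float(np.sum(beta**2))
sum_nu = float(np.sum(nu))
```

The bounds $\nu_{n}\leq N^{2s}$ and $\beta_{1}^{-2}\leq N^{1+s}$ are probabilistic statements about $\mathcal{R}_{N,s}$ and are not asserted by this sketch.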

6.2. Main estimates for the QR algorithm

The steps of the proof are the following:

(1) a priori estimates on $T_{\mathrm{QR},\epsilon}$ that will hold with high probability,

(2) a lower bound on $-E_{\mathrm{QR},0}^{\prime}(t)$ over a region determined in (1),

(3) finding and estimating an approximation $T^{*}_{\mathrm{QR},\epsilon}$ of $T_{\mathrm{QR},\epsilon}$, and

(4) establishing that $T^{*}_{\mathrm{QR},\epsilon}$ converges in distribution and then using (1)-(3) to show that indeed $T^{*}_{\mathrm{QR},\epsilon}$ is close to $T_{\mathrm{QR},\epsilon}$.

If $\epsilon$ is sufficiently small we expect $E_{\mathrm{QR},0}(t)$ to control the convergence of the algorithm. Consider

[TABLE]

and the approximation $T^{*}_{\mathrm{QR},\epsilon}$ of $T_{\mathrm{QR},\epsilon}$ is given by

[TABLE]

Thus

[TABLE]

To determine how close $T^{*}_{\mathrm{QR},\epsilon}$ is to $T_{\mathrm{QR},\epsilon}$ we use the following relation

[TABLE]

for some $\eta$ between $T^{*}_{\mathrm{QR},\epsilon}$ and $T_{\mathrm{QR},\epsilon}$. So, we need to show that the left-hand side of (8) is small and $E_{\mathrm{QR},0}^{\prime}(\eta)$ is not too small. This is accomplished by the following Lemmas 5-9 and Proposition 2. The proofs of these results make heavy use of Lemmas 3 and 4.

Lemma 5 ([8], Lemma 2.1).

Given Condition 3.3, the halting time $T_{\mathrm{QR},\epsilon}$ for the QR algorithm satisfies

[TABLE]

for sufficiently large $N$ .

Define the interval

[TABLE]

Lemma 6 ([8], Lemma 2.2).

Given Condition 3.3 and $t\in L_{\alpha}$

[TABLE]

for sufficiently large $N$ .

The next estimate is immediate from the definition of $T^{*}_{\mathrm{QR},\epsilon}$.

Lemma 7 ([8], Lemma 2.3).

Given Condition 3.3

[TABLE]

for sufficiently large $N$ , i.e. $T^{*}_{\mathrm{QR},\epsilon}\in L_{\alpha}$ .

Lemma 8 ([8], Lemma 2.4).

Given Conditions 3.2 and 3.3

[TABLE]

for sufficiently large $N$ .

Lemma 9 ([8], (2.6)).

Given Condition 3.3, for $t\in L_{\alpha}$

[TABLE]

for sufficiently large $N$ .

Proposition 2.

Given Conditions 3.1 and 3.3 for $\sigma$ and $p$ fixed with $s$ sufficiently small (depending on $\sigma$ and $p$ )

[TABLE]

for $N$ sufficiently large.

Proof.

We use (8) to estimate the difference (for some $\eta$ between $T_{\mathrm{QR},\epsilon}$ and $T^{*}_{\mathrm{QR},\epsilon})$ and apply Lemmas 5, 6, 7, 8 and 9 to find

[TABLE]

Using the assumption that $\alpha>10/3$ the proposition follows. ∎

Thus far, no estimates have had any probabilistic input. We now introduce the probabilistic considerations needed to prove our main theorem for the QR algorithm.

Theorem 5.

Let $H$ be an SCM. For $\alpha\geq 10/3+\sigma$ , $\sigma>0$

[TABLE]

Proof.

We first prove that the following three random variables converge to zero in probability:

[TABLE]

The proof for the first random variable follows [8, Lemma 3.1] and requires the specific use of Condition 3.1, the proof for the second follows [8, Lemma 3.4]. For the last, we write $\lambda_{j}=\lambda_{-}+N^{-2/3}\xi_{j}$ , $j=1,2$ where $(\xi_{1},\xi_{2})$ converges jointly in distribution. Let $B_{R}$ be the event where $\|(\xi_{1},\xi_{2})\|_{2}\leq R$ . Given $B_{R}$ , consider

[TABLE]

Then

[TABLE]

where the approximation is uniform as $N\to\infty$ (given $B_{R}$ ). For $\delta>0$ , $s>0$ (sufficiently small) using uniform convergence

[TABLE]

Letting $R\to\infty$ we establish that $X_{N}$ converges to zero in probability. Appealing to Definition 3 we finally have

[TABLE]

∎

6.3. Main estimates for the power/inverse power method

We now follow the same steps that were performed for the QR algorithm for the inverse power method. First, we establish $E_{\mathrm{IP},1}(t)\geq 0$ as given in (6).

Define $w_{n}=\delta_{n}^{t}\nu_{n}/\left(\sum_{n=2}^{N}\delta_{n}^{t}\nu_{n}\right)$ and use the notation $\mathbb{E}_{w}[\delta^{\alpha}]=\sum_{n=2}^{N}\delta_{n}^{\alpha}w_{n}$ . It follows that the non-negativity of $E_{\mathrm{IP},1}(t)$ is equivalent to

[TABLE]

From Jensen’s inequality for concave functions

[TABLE]

which gives

[TABLE]

The last inequality follows from another application of Jensen’s inequality (for convex functions).

Lemma 10.

Given Condition 3.3, the halting time $T_{\mathrm{IP},\epsilon}$ for the inverse power method satisfies

[TABLE]

for sufficiently large $N$ .

Proof.

Because $E_{\mathrm{IP},0}(t)$ and $E_{\mathrm{IP},1}(t)$ are both positive we know that if $E_{\mathrm{IP},0}(t)>\epsilon^{2}$ on the interval $[0,T]$ then $T_{\mathrm{IP},\epsilon}>T$ . We first estimate $E_{\mathrm{IP},0}(t)$ as follows:

[TABLE]

Then we set $t=a\log N/\log\delta_{2}^{-1}$ and use Lemma 4 to estimate the denominator

[TABLE]

To estimate the numerator we use Lemma 4 again

[TABLE]

Therefore

[TABLE]

Recall that $\alpha\geq 10/3+\sigma$ , $0<s<\sigma/40<1/120<1/5$ and assume that $a\leq\sigma/2$ . We have

[TABLE]

which is larger than $\epsilon^{2}\leq N^{-10/3-\sigma}$ for sufficiently large $N$ , and we conclude

[TABLE]

For $a\geq\sigma/2$ , note that

[TABLE]

Now choose $c<10/\sigma$ (cf. Lemma 4) so that $1+s-c\sigma/2<0$, i.e., $c>2(1+s)/\sigma$. Note that since $s<\sigma/40<1$, we have $2(1+s)<10$, so such a $c$ exists. Furthermore, as $s<\sigma/40$, it follows from the above inequality that there exists $C>1$ such that

[TABLE]

for sufficiently large $N$ . Then (again for $t=a\log N/\log\delta_{2}^{-1}$ )

[TABLE]

Now for $s<\sigma/40$ , we have $\alpha-4/3-5s>\sigma/2$ , and so for $\sigma/2<a<\alpha-4/3-5s$ , we again have $E_{\mathrm{IP},0}(t)\geq CN^{-\alpha+s}\geq\epsilon^{2}$ for sufficiently large $N$ . This establishes the lower bound on $T_{\mathrm{IP},\epsilon}$ .

To establish the upper bound on $T_{\mathrm{IP},\epsilon}$ we use the absolute boundedness of $\lambda_{1}^{-1}$ (given Condition 3.3: see also Remark 3.2) for $c>0$ , together with $\delta_{n}\leq 1$

[TABLE]

If $t=a\log N/\log\delta_{2}^{-1}$ with $a\geq(\alpha-4/3+6s)$, then $N^{-a+5s-4/3}\leq N^{-\alpha-s}$. But then $a\geq\alpha-4/3+6s\geq 2$, and so, taking $c=2(<10/\sigma)$, $1+s-ca\leq s-3$. Consequently $-a+1+s-ca\leq-(\alpha-4/3+6s)+s-3=-\alpha-5/3-5s$, and hence $N^{-a+1+s-ca}\leq N^{-\alpha-5/3-5s}\leq N^{-\alpha-s}$. Thus

[TABLE]

So, for these values of $t$ , $E_{\mathrm{IP},0}(t)<\epsilon^{2}$ for sufficiently large $N$ . Next, we show that the same holds for $E_{\mathrm{IP},1}(t)$ . We use the estimate with $c=2$ and any $\gamma$ , to obtain

[TABLE]

for $t\geq(\alpha-4/3+\gamma s)\log N/\log\delta_{2}^{-1}$ and $N$ sufficiently large. Then using $\lambda_{n}^{-1}\leq\lambda_{1}^{-1}\leq C$ (given Condition 3.3), and taking $\gamma=-5$ , $E_{\mathrm{IP},1}(t)\leq CN^{-2\alpha+8/3+18s}$ for $t\geq(\alpha-4/3-5s)\log N/\log\delta_{2}^{-1}$ and $N$ sufficiently large. Thus

[TABLE]

This shows that $T_{\mathrm{IP},\epsilon}\leq(\alpha-4/3+6s)\log N/\log\delta_{2}^{-1}$ for large $N$ . ∎
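The proof localizes the halting time to a window of the form $a\log N/\log\delta_{2}^{-1}$ with $a$ between $\alpha-4/3-5s$ and $\alpha-4/3+6s$, as read off from the bounds above. The sketch below simply evaluates the two endpoints for given $\lambda_{1}<\lambda_{2}$; the function name and the sample inputs are hypothetical, and the statement only has force for $N$ sufficiently large and matrices satisfying Condition 3.3.

```python
import math

def ip_halting_window(lam1, lam2, N, alpha, s):
    """Return (lower, upper) = ((alpha-4/3-5s), (alpha-4/3+6s)) * log N / log(1/delta_2).

    Here delta_2 = lam1^2 / lam2^2, so log(1/delta_2) = 2 * log(lam2 / lam1).
    """
    if not (0.0 < lam1 < lam2):
        raise ValueError("expected 0 < lam1 < lam2")
    log_inv_delta2 = 2.0 * math.log(lam2 / lam1)
    scale = math.log(N) / log_inv_delta2
    lower = (alpha - 4.0 / 3.0 - 5.0 * s) * scale
    upper = (alpha - 4.0 / 3.0 + 6.0 * s) * scale
    return lower, upper

# Illustrative inputs (assumptions): alpha = 4 corresponds to eps = N^{-2}.
lower, upper = ip_halting_window(lam1=1.0, lam2=2.0, N=100, alpha=4.0, s=0.01)
```

The width of the window is $11s\log N/\log\delta_{2}^{-1}$, which is why $s$ must be taken small relative to $\sigma$ in the subsequent estimates.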

Remark 6.1.

We take $\gamma=-5$ , rather than $\gamma=6$ , for technical reasons, see Lemma 14 below.

Similar to the case of the QR algorithm, define

[TABLE]

Lemma 11.

Given Condition 3.3 and $t\in\hat{L}_{\alpha}$

[TABLE]

for $N$ sufficiently large.

Proof.

By direct calculation

[TABLE]

Then using (9) and $\log\delta_{2}^{-1}\geq CN^{-2/3-s/2}$ and keeping only the leading term

[TABLE]

for $t\in\hat{L}_{\alpha}$ . Define $G(t)$ by $-E_{\mathrm{IP},0}^{\prime}(t)=G(t)+F(t)$ . Then we use (10) and (11) with $c=2$ and $t=(\alpha-4/3-5s)\log N/\log\delta_{2}^{-1}$

[TABLE]

for $t\in\hat{L}_{\alpha}$ . The last inequality follows because $\alpha\geq 10/3+\sigma$ and $\sigma>40s$ . From here it follows that for $N$ sufficiently large

[TABLE]

∎

Our next step is to construct an approximation $T^{*}_{\mathrm{IP},\epsilon}$ of $T_{\mathrm{IP},\epsilon}$ . We write

[TABLE]

Define $T^{*}_{\mathrm{IP},\epsilon}$ by

[TABLE]

Lemma 12.

Given Condition 3.3, $T^{*}_{\mathrm{IP},\epsilon}\in\hat{L}_{\alpha}$ .

Proof.

Using Condition 3.3

[TABLE]

we find

[TABLE]

for sufficiently large $N$ , establishing the lemma. ∎

Lemma 13.

Given Conditions 3.1 and 3.3,

[TABLE]

for $t\in\hat{L}_{\alpha}$ and sufficiently large $N$ .

Proof.

By direct calculation

[TABLE]

Since the denominator is at least unity, it is enough to estimate the numerators. As $\lambda_{n}^{-1}+\lambda_{1}^{-1}\leq 2\lambda_{1}^{-1}$ ,

[TABLE]

For $c>0$ , define $\hat{I}_{c}=\{3\leq n\leq N:\delta_{n}\leq\delta_{2}^{1+c}\}$ . We estimate

[TABLE]

First,

[TABLE]

Here we used $\sum_{n}\frac{\nu_{n}}{\nu_{2}}=\beta_{2}^{-2}\leq N^{1+s}$ and estimated $(1-\delta_{N}^{1/2})/(1-\delta_{2}^{1/2})\leq C/(1-\delta_{2}^{1/2})\leq CN^{2/3+s}$ . If we set $c=2$ and use the inequality $\alpha-4/3-5s>2+\sigma-5s$ , then

[TABLE]

as $s<\sigma/40$ . Second, using that $\delta_{3}/\delta_{2}\geq\delta_{n}/\delta_{2}$ and Condition 3.2 along with $\delta_{n}>\delta_{2}^{1+c}$ , $n\not\in\hat{I}_{c}$ ( $c=2$ ), we consider

[TABLE]

by Condition 3.3, and so $S_{2}\leq CN^{4s-2p}$ . Since $p<1/2$ , we find

[TABLE]

for sufficiently large $N$ . By (12),

[TABLE]

for sufficiently large $N$ as $s<\sigma/40$ . We find

[TABLE]

This proves the lemma. ∎

The next lemma is a restatement of (13).

Lemma 14.

Given Condition 3.3, for $t\in\hat{L}_{\alpha}$

[TABLE]

for sufficiently large $N$ .

Proposition 3.

Given Conditions 3.1 and 3.3 for $\sigma<1/3$ and $p<\sigma/4$ , fixed, with $s<\sigma/40$

[TABLE]

for $N$ sufficiently large.

Proof.

We use the analog of (8) to estimate the difference $|T^{*}_{\mathrm{IP},\epsilon}-T_{\mathrm{IP},\epsilon}|$ (for some $\eta\in\hat{L}_{\alpha}$ ) and apply Lemmas 10, 11, 12, 13 and 14 to find

[TABLE]

The proposition follows. ∎

Now, we introduce probabilistic considerations as we did for the QR algorithm.

Theorem 6.

Let $H$ be an SCM and let $v$ be a random unit vector independent of $H$ . For $\alpha\geq 10/3+\sigma$ , $\sigma>0$

[TABLE]

Proof.

As was the case in the proof of Theorem 5, we show the following three random variables converge to zero in probability:

[TABLE]

We start with the first. For $\delta>0$

[TABLE]

Provided that $s<(2/15)p$ , on $\mathcal{R}_{N,s}$ , $N^{-2/3}|T^{*}_{\mathrm{IP},\epsilon}-T_{\mathrm{IP},\epsilon}|$ tends to zero uniformly. Then

[TABLE]

From Theorem 3, $\limsup_{N\to\infty}\mathbb{P}(\mathcal{R}^{c}_{N,s})=0$ , and letting $p\downarrow 0$ , using Theorem 4 we find

[TABLE]

For the second random variable:

[TABLE]

Again, we write $\lambda_{j}=\lambda_{-}+N^{-2/3}\xi_{j}$, $j=1,2$ and let $B_{R}$ be the event where $\|(\xi_{1},\xi_{2})\|_{2}\leq R$. Next let $H_{j,R}$ be the event where $1/R\leq N\beta_{j}^{2}\leq R$. It then follows for $\delta>0$ and sufficiently large $N$

[TABLE]

Therefore

[TABLE]

And because $(\xi_{1},\xi_{2})$, $N\beta_{1}^{2}$ and $N\beta_{2}^{2}$ converge in distribution, if we let $R\to\infty$ in (15) it follows that $\lim_{N\to\infty}\mathbb{P}(Y_{N}\geq\delta)=0$. The convergence in probability of the last random variable follows directly from the proof of Theorem 5. Using Definition 3 we have

[TABLE]

∎

Finally, we establish the analogous theorem for the power method. Following Remark 5.1, we note that $E_{\mathrm{P}}(t)$ is defined by sending $\lambda_{j}\to\lambda_{j}^{-1}$ and $H^{-1}$ satisfies the same estimates as $H$ (Theorem 4 and Theorem 3). We have the following theorem.

Theorem 7.

Let $H$ be an SCM and let $v$ be a random unit vector independent of $H$ . For $\alpha\geq 10/3+\sigma$ , $\sigma>0$

[TABLE]

Appendix A Error analysis

In this section we establish that the halting times given above for the QR algorithm and the inverse power method are adequate to achieve an order $\epsilon$ approximation of the smallest eigenvalue.

A.1. QR algorithm

The true error in the QR algorithm is

[TABLE]

Applying Lemma 4 (see also (10)) we find for $t\in L_{\alpha}$ , given Condition 3.3

[TABLE]

for sufficiently large $N$ . We obtain the following error estimate.

Proposition 4.

For $\alpha\geq 10/3+\sigma$ , $\epsilon=N^{-\alpha/2}$ ,

[TABLE]

converges to zero in probability, while

[TABLE]

converges to $\infty$ in probability.

Proof.

First, given Condition 3.3, $T_{\mathrm{QR},\epsilon}\in L_{\alpha}$ and for $\delta>0$

[TABLE]

It then follows that on $\mathcal{R}_{N,s}$ for sufficiently large $N$ and $s<\sigma/22$

[TABLE]

Therefore

[TABLE]

It then follows that on $\mathcal{R}_{N,s}$ for sufficiently large $N$ and $s<1/33$

[TABLE]

Therefore

[TABLE]

∎

Remark A.1.

Define the “true” halting time by

[TABLE]

We omit the details, but one can show that

[TABLE]

So that $T_{\mathrm{QR},\epsilon}^{\mathrm{True}}$ has the same limiting distribution as $T_{\mathrm{QR},\epsilon}$ .

A.2. Inverse power method

The true error for the inverse power method is also given by

[TABLE]

Following the calculations that led to (16), given Condition 3.3, for sufficiently large $N$ and $t\in\hat{L}_{\alpha}$ ,

[TABLE]

Here we are conservative with the factor on $s$ so that it mirrors (16). An analogous formula holds for the power method. We arrive at the following proposition, which is proved in exactly the same way as Proposition 4.

Proposition 5.

For $\alpha\geq 10/3+\sigma$ , $\epsilon=N^{-\alpha/2}$ ,

[TABLE]

converge to zero in probability, while

[TABLE]

converge to $\infty$ in probability.

Remark A.2.

Following Remark A.1, define the “true” halting time by

[TABLE]

Again, omitting the details, one can show that

[TABLE]

So that $T_{\mathrm{IP},\epsilon}^{\mathrm{True}}$ has the same limiting distribution as $T_{\mathrm{IP},\epsilon}$ . This further justifies the definition of $T_{\mathrm{IP},\epsilon}$ .

Appendix B Asymptotic normality of the eigenvector projections

This section presents the estimates that are required to prove Theorem 1 for the power and inverse power methods when the initial unit vector $v$ is chosen randomly.

Theorem 8.

Let $v=v_{N}\in\mathbb{R}^{N}$ ($\beta=1$) or $v=v_{N}\in\mathbb{C}^{N}$ ($\beta=2$) be a unit vector (to be precise, fix a semi-infinite vector $w=(w_{1},w_{2},\ldots,w_{N},\ldots)$; then for $y_{N}:=(w_{1},\ldots,w_{N})$ define $v_{N}=y_{N}/\|y_{N}\|_{2}$) and fix $j>0$. Let $u_{j}$ and $u_{N-j+1}$ be the eigenvectors of an SCM corresponding to $\lambda_{j}$ and $\lambda_{N-j+1}$. Then for any bounded, continuous function $h:\mathbb{R}\to\mathbb{R}$

[TABLE]

where $G_{\beta}$ is either a standard normal ($\beta=1$) or a standard complex ($\beta=2$) random variable. That is, we have convergence in distribution to $|G_{\beta}|$.

Proof.

We present the proof for $\beta=1$ and $u_{j}$ as the other cases are completely analogous. From [3, Theorem 8.2] it follows that

[TABLE]

where $\mathbb{E}_{\mathrm{W}}$ is the expectation with respect to the Wishart (LOE) ensemble ( $X_{ij}$ are iid standard normal random variables). And so, it is enough to prove the statement for the Wishart ensemble. In this case it is well known that the eigenvectors are distributed with Haar measure on the orthogonal group. Let $Y=(Y_{1},\ldots,Y_{N})^{T}$ be a vector of iid standard normal random variables. It follows that

[TABLE]

and $\langle v,Y\rangle$ is a standard normal random variable. And so, it suffices to show that

[TABLE]

Indeed, if the difference of two random variables converges to zero in probability and the first converges in distribution, then so does the second (to the same distribution). Fix $\delta>0$ and consider for $R>0$

[TABLE]

A consequence of the Strong Law of Large Numbers (SLLN) is that the latter term tends to zero as $N\to\infty$ : The SLLN implies that $\frac{\|Y\|^{2}_{2}}{N}\to 1$ a.s., hence $\frac{N^{1/2}}{\|Y\|_{2}}\to 1$ a.s. and therefore $\frac{N^{1/2}}{\|Y\|_{2}}\to 1$ in probability. Then letting $R\to\infty$ , (17) follows. ∎
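The Gaussian step of this argument can be illustrated by simulation. The sketch below is a Monte Carlo check under stated assumptions, not part of the proof: it draws Haar-distributed unit vectors as $Y/\|Y\|_{2}$ with $Y$ standard normal, fixes the deterministic unit vector $v=N^{-1/2}(1,\ldots,1)$ (an arbitrary choice), and compares the moments of $N^{1/2}|\langle v,Y\rangle|/\|Y\|_{2}$ with those of $|G_{1}|$, for which $\mathbb{E}|G_{1}|=\sqrt{2/\pi}$ and $\mathbb{E}|G_{1}|^{2}=1$. The function name and sample sizes are hypothetical.

```python
import numpy as np

def projection_samples(v, n_samples, rng):
    """Sample sqrt(N) * |<v, u>| with u = Y / ||Y||_2 and Y standard normal in R^N."""
    N = v.shape[0]
    Y = rng.standard_normal((n_samples, N))
    u = Y / np.linalg.norm(Y, axis=1, keepdims=True)   # rows are uniform on the sphere
    return np.sqrt(N) * np.abs(u @ v)

rng = np.random.default_rng(2)
N = 200                                  # illustrative dimension (assumption)
v = np.ones(N) / np.sqrt(N)              # fixed deterministic unit vector
samples = projection_samples(v, 20_000, rng)

second_moment = float(np.mean(samples**2))   # expected to be close to 1
mean_abs = float(np.mean(samples))           # expected to be close to sqrt(2/pi)
```

For the Wishart ensemble the eigenvectors are Haar distributed, which is why sampling $Y/\|Y\|_{2}$ stands in for an eigenvector here; for a general SCM the comparison with the Wishart case is what [3, Theorem 8.2] supplies.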

Corollary 1.

Theorem 8 holds when $v$ is a random unit vector, independent of the given SCM. In this case $\mathbb{E}(\cdot)$ should be understood as the expectation with respect to both the distribution on $v$ and the SCM.

Proof.

We express $\mathbb{E}=\mathbb{E}_{v}\mathbb{E}_{\mathrm{SCM}}$ . Let $h$ be a bounded, continuous function $h:\mathbb{R}\to\mathbb{R}$ . Then Theorem 8 states

[TABLE]

By the bounded convergence theorem

[TABLE]

as $N\to\infty$ , and the corollary follows. ∎

Proposition 6.

Given an SCM, let $v$ be a random unit vector independent of the SCM (this also holds for deterministic $v$). Define $\beta_{j}=|\langle v,u_{j}\rangle|$, $j=1,2,\ldots,N$ where $u_{j}$ is the $j$th eigenvector of the SCM. Fix $s>0$ and let $\mathcal{U}_{N,s}$ be the set of matrices where

•

$\beta_{j}\leq N^{-1/2+s/2}$ for all $1\leq j\leq N$, and

•

$\beta_{j}\geq N^{-1/2-s/2}$ for $j=1,2,3,N-2,N-1,N$.

Then $\mathbb{P}(\mathcal{U}_{N,s})=1+o(1)$ as $N\to\infty$ , i.e. these conditions hold with high probability.

Proof.

The delocalization result [3, Theorem 2.17] states that for deterministic unit vectors $v$ and all $s,D>0$:

[TABLE]

This implies

[TABLE]

So, let $D>1$ . Stated another way,

[TABLE]

uniformly in $v$ . And so, taking an expectation with respect to the law of $v$ we find that

[TABLE]

Then for $j=1,2,3,N-2,N-1,N$

[TABLE]

follows from Corollary 1 after applying Lemma 1.

∎

Bibliography

[1] Z. D. Bai, B. Q. Miao, and G. M. Pan. On asymptotics of eigenvectors of large sample covariance matrix. Ann. Probab., 35(4):1532–1572, 2007.
[2] G. Ben Arous and S. Péché. Universality of local eigenvalue statistics for some sample covariance matrices. Commun. Pure Appl. Math., 58(10):1316–1357, Oct. 2005.
[3] A. Bloemendal, A. Knowles, H.-T. Yau, and J. Yin. On the principal components of sample covariance matrices. Probab. Theory Relat. Fields, 164(1-2):459–552, Feb. 2016.
[4] K. H. Borgwardt. The simplex method: A probabilistic analysis. Springer–Verlag, Berlin, Heidelberg, 1987.
[5] P. Deift, T. Nanda, and C. Tomei. Ordinary differential equations and the symmetric eigenvalue problem. SIAM J. Numer. Anal., 20:1–22, 1983.
[6] P. A. Deift, G. Menon, S. Olver, and T. Trogdon. Universality in numerical computations with random data. Proc. Natl. Acad. Sci. U. S. A., 111(42):14973–8, Oct. 2014.
[7] P. A. Deift, G. Menon, and T. Trogdon. On the condition number of the critically-scaled Laguerre Unitary Ensemble. Discret. Contin. Dyn. Syst., 36(8):4287–4347, Mar. 2016.
[8] P. Deift and T. Trogdon. Universality for the Toda algorithm to compute the eigenvalues of a random matrix. arXiv preprint arXiv:1604.07384, Apr. 2016.
