The conjugate gradient algorithm on well-conditioned Wishart matrices is almost deterministic

Percy Deift; Thomas Trogdon

arXiv:1901.09007·math.NA·October 4, 2019

TL;DR

This paper demonstrates that for large well-conditioned Wishart matrices, the conjugate gradient algorithm's iteration count becomes nearly deterministic, with error and residual norms converging rapidly in probability and almost surely.

Contribution

It establishes that the iteration count for conjugate gradient on large Wishart matrices is almost deterministic, providing explicit convergence results.

Findings

1. Iteration count concentrates around a deterministic value

2. Error and residual norms decay exponentially fast

3. Convergence occurs in probability, mean, and almost surely

Abstract

We prove that the number of iterations required to solve a random positive definite linear system with the conjugate gradient algorithm is almost deterministic for large matrices. We treat the case of Wishart matrices $W = X X^{*}$ where $X$ is $n \times m$ and $n / m \sim d$ for $0 < d < 1$ . Precisely, we prove that for most choices of error tolerance, as the matrix increases in size, the probability that the iteration count deviates from an explicit deterministic value tends to zero. In addition, for a fixed iteration count, we show that the norm of the error vector and the norm of the residual converge exponentially fast in probability, converge in mean and converge almost surely.

Equations

$\|\textbf{x}-\textbf{x}_{k}\|_{W}=\min_{\textbf{y}\in\mathcal{X}_{k}}\|\textbf{y}-\textbf{x}\|_{W},$

$\mathcal{X}_{k}=\textbf{x}_{0}+\operatorname{span}\{\textbf{r}_{0},W\textbf{r}_{0},\ldots,W^{k-1}\textbf{r}_{0}\},\quad\|\textbf{y}\|_{W}^{2}=\textbf{y}^{*}W\textbf{y},\quad\textbf{r}_{0}=\textbf{b}-W\textbf{x}_{0}.$

$\|\textbf{e}_{k}\|_{W}\ \text{and}\ \|\textbf{r}_{k}\|_{2}=\|\textbf{e}_{k}\|_{W^{2}},\quad\textbf{r}_{k}(W,\textbf{b})=\textbf{r}_{k}=\textbf{b}-W\textbf{x}_{k},$

$t_{\epsilon}^{(1)}(W,\textbf{b})=\min\{k:\|\textbf{e}_{k}\|_{W}<\epsilon\},\quad t_{\epsilon}^{(2)}(W,\textbf{b})=\min\{k:\|\textbf{r}_{k}\|_{2}<\epsilon\}.$

$W=XX^{*}/m,$

$\|\textbf{r}_{k}(W,\textbf{b})\|_{2}\overset{\text{a.s.}}{\longrightarrow}d^{k/2}.$

$\|\textbf{e}_{k}(W,\textbf{b})\|_{W}\overset{\text{a.s.}}{\longrightarrow}\frac{d^{k/2}}{\sqrt{1-d}}.$

$\lim_{n\to\infty}\mathbb{P}\left(t_{\epsilon}^{(2)}(W,\textbf{b})>k\right)=1.$

$[(1-\sqrt{d})^{2},(1+\sqrt{d})^{2}],$

$\langle\|\textbf{e}_{k}(W,\textbf{b})\|_{W}\rangle\ \text{and}\ \frac{d^{k/2}}{\sqrt{1-d}}$

$\langle\|\textbf{r}_{k}(W,\textbf{b})\|_{2}\rangle\ \text{compared with}\ d^{k/2}.$

$\|\textbf{e}_{k}(W,\textbf{b})\|_{W}\leq 2\left[\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{k}+\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{-k}\right]^{-1}\|\textbf{e}_{0}(W,\textbf{b})\|_{W},$

$\|\textbf{e}_{k}(W,\textbf{b})\|_{W}\lesssim 2\left[d^{k/2}+d^{-k/2}\right]^{-1}\|\textbf{e}_{0}(W,\textbf{b})\|_{W},$

$\|\textbf{e}_{k}(W,\textbf{b})\|_{W}\lesssim 2d^{k/2}\|\textbf{e}_{0}(W,\textbf{b})\|_{W}.$

$W_{n,\beta,d}:=\frac{1}{\beta m}XX^{*}$

$\mu_{n,\beta,d}=\frac{1}{n}\sum_{j=1}^{n}\delta_{\lambda_{j}(n,\beta,d)}$

$\int f(\lambda)\,\mathbb{E}\mu_{n,\beta,d}(\mathrm{d}\lambda):=\mathbb{E}\left(\int f(\lambda)\,\mu_{n,\beta,d}(\mathrm{d}\lambda)\right)$

$\rho_{\mathrm{MP},d}(x)=\frac{1}{2\pi d}\frac{\sqrt{(d_{+}-x)(x-d_{-})}}{x}\mathbf{1}_{[d_{-},d_{+}]}(x),\quad d_{\pm}=(1\pm\sqrt{d})^{2}.$

$UW_{n,\beta,d}U^{*}\overset{\text{dist.}}{=}W_{n,\beta,d}.$

$\begin{bmatrix}H_{n,\beta,d}&0\end{bmatrix}:=\begin{bmatrix}\zeta_{11}&&&&0&\cdots&0\\ \zeta_{21}&\zeta_{22}&&&0&\cdots&0\\ &\ddots&\ddots&&\vdots&&\vdots\\ &&\zeta_{n,n-1}&\zeta_{nn}&0&\cdots&0\end{bmatrix}$

$H_{n,\beta,d}\overset{\text{dist.}}{=}\begin{bmatrix}\chi_{\beta m}\\ \chi_{\beta(n-1)}&\chi_{\beta(m-1)}\\ &\chi_{\beta(n-2)}&\chi_{\beta(m-2)}\\ &&\ddots&\ddots\\ &&&\chi_{\beta}&\chi_{\beta(m-n+1)}\end{bmatrix}$

$(\mathbb{T}_{d})_{ij}:=\lim_{n\to\infty}\frac{1}{\beta m}\mathbb{E}\left[(H_{n,\beta,d}H_{n,\beta,d}^{*})_{ij}\right],\quad 1\leq i,j.$

$\mathbb{T}_{d},\quad\mathbb{H}_{d}$

$W\textbf{x}=\textbf{b},\quad W>0$

$\|\textbf{e}_{k}\|_{W}=\|p_{k}^{\dagger}(W)\textbf{e}_{0}\|_{W}=\min_{p_{k}\in\mathbb{P}_{k}^{(0)}}\|p_{k}(W)\textbf{e}_{0}\|_{W},$

$T=T(W,\textbf{y}_{1})=\begin{bmatrix}\alpha_{1}&\beta_{1}\\ \beta_{1}&\alpha_{2}&\ddots\\ &\ddots&\ddots&\beta_{n-1}\\ &&\beta_{n-1}&\alpha_{n}\end{bmatrix}$

$p_{k}^{\dagger}(\lambda)=\frac{\det\left(T_{k}\left(W,\frac{\textbf{r}_{0}}{\|\textbf{r}_{0}\|}\right)-\lambda I\right)}{\det T_{k}\left(W,\frac{\textbf{r}_{0}}{\|\textbf{r}_{0}\|}\right)},$

Full text

The conjugate gradient algorithm on well-conditioned Wishart matrices is almost deterministic

Percy Deift

New York University

Courant Institute of Mathematical Sciences

251 Mercer St.

New York, NY 10012

and

Thomas Trogdon

University of Washington

Department of Applied Mathematics

Seattle, WA 98195-3925

Abstract.

We prove that the number of iterations required to solve a random positive definite linear system with the conjugate gradient algorithm is almost deterministic for large matrices. We treat the case of Wishart matrices $W=XX^{*}$ where $X$ is $n\times m$ and $n/m\sim d$ for $0<d<1$ . Precisely, we prove that for most choices of error tolerance, as the matrix increases in size, the probability that the iteration count deviates from an explicit deterministic value tends to zero. In addition, for a fixed iteration count, we show that the norm of the error vector and the norm of the residual converge exponentially fast in probability, converge in mean and converge almost surely.

2010 Mathematics Subject Classification:

Primary: 65F10, 60B20

We are grateful for discussions with Elliot Paquette, Joel Tropp and Roman Vershynin that have greatly improved the paper. This work was supported in part by NSF DMS-1300965 (PD) and NSF DMS-1753185, DMS-1945652 (TT).

1. Introduction

The conjugate gradient algorithm (CGA) [HS52] is arguably the most effective iterative method from numerical linear algebra. In exact arithmetic, the algorithm requires at most $n$ iterations to solve an $n\times n$ positive-definite linear system and it often requires far fewer iterations to compute a good approximate solution. It is exceedingly simple to implement and there are well-known error bounds available. And, despite the fact that the CGA is sensitive to round-off errors, these error bounds still effectively hold for floating point arithmetic [Gre89]. While we present the algorithm in full below (see Algorithm 2.1), the variational characterization of the method is summarized as follows: Consider the linear system $W\textbf{x}=\textbf{b}$ , $W>0$ . Given an initial guess $\textbf{x}_{0}$ , find the unique vector $\textbf{x}_{k}$ that satisfies

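$$\|\textbf{x}-\textbf{x}_{k}\|_{W}=\min_{\textbf{y}\in\mathcal{X}_{k}}\|\textbf{y}-\textbf{x}\|_{W},$$

$$\mathcal{X}_{k}=\textbf{x}_{0}+\operatorname{span}\{\textbf{r}_{0},W\textbf{r}_{0},\ldots,W^{k-1}\textbf{r}_{0}\},\quad\|\textbf{y}\|_{W}^{2}=\textbf{y}^{*}W\textbf{y},\quad\textbf{r}_{0}=\textbf{b}-W\textbf{x}_{0}.$$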

At each step $k$ of the iteration one can easily construct $\textbf{x}_{k}$ and the algorithm itself computes $\textbf{r}_{k}=\textbf{b}-W\textbf{x}_{k}$ , $k=0,1,2,\ldots,n$ . One has to then determine a computable stopping criterion, and typically, the algorithm is halted when $\|\textbf{r}_{k}\|_{2}<\epsilon$ , $\|\textbf{y}\|^{2}_{2}=\textbf{y}^{*}\textbf{y}$ , for a chosen error tolerance $\epsilon.$

Here we focus on two main measures of the error, $\textbf{e}_{k}(W,\textbf{b})=\textbf{e}_{k}:=\textbf{x}-\textbf{x}_{k}$ :

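$$\|\textbf{e}_{k}\|_{W}\quad\text{and}\quad\|\textbf{r}_{k}\|_{2}=\|\textbf{e}_{k}\|_{W^{2}},\qquad\textbf{r}_{k}(W,\textbf{b})=\textbf{r}_{k}=\textbf{b}-W\textbf{x}_{k},$$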

in the particular case when $\textbf{x}_{0}=0$ . The associated halting times are

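$$t_{\epsilon}^{(1)}(W,\textbf{b})=\min\{k:\|\textbf{e}_{k}\|_{W}<\epsilon\},\qquad t_{\epsilon}^{(2)}(W,\textbf{b})=\min\{k:\|\textbf{r}_{k}\|_{2}<\epsilon\}.\tag{1.1}$$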

We emphasize the importance of analyzing both quantities because $\textbf{r}_{k}$ is what is observed throughout the iteration and, of course, $\textbf{e}_{k}$ is the true error.

Our results (Theorems 3.1, 3.2, 3.3) are derived for both real and complex Gaussian matrices. (The results can also easily be extended to the case of quaternion entries.) We assume

$$W=XX^{*}/m,\tag{1.2}$$

where $X$ is an $n\times m$ matrix whose entries are iid real or complex standard normal random variables. This is the real or complex Wishart distribution. Suppose further that $m=\lfloor n/d\rfloor$ for $0<d\leq 1$ (note that if $d>1$ , i.e. $m<n$ , then $W$ is singular and $W\textbf{x}=\textbf{b}$ does not have a unique solution). Then if $\textbf{b}$ is a random unit vector, independent of $W$ , our results show that as $n\to\infty$

$$\|\textbf{r}_{k}(W,\textbf{b})\|_{2}\overset{\text{a.s.}}{\longrightarrow}d^{k/2}.$$

If $d<1$ then as $n\to\infty$

$$\|\textbf{e}_{k}(W,\textbf{b})\|_{W}\overset{\text{a.s.}}{\longrightarrow}\frac{d^{k/2}}{\sqrt{1-d}}.$$

Furthermore, there are discrete sets $S_{d}^{(1)}$ and $S_{d}^{(2)}$ with the property that if $\epsilon>0$ is in the complement of these sets, $\epsilon$ fixed, then

[TABLE]

Therefore, the halting time becomes effectively deterministic. We also present estimates that demonstrate that the probability that the errors $\textbf{e}_{k}$ deviate from their means decays exponentially with respect to $n$ . In the case $d=1$ , a consequence of our results is that for any fixed $k>0$ and $\epsilon<1$ ,

$$\lim_{n\to\infty}\mathbb{P}\left(t_{\epsilon}^{(2)}(W,\textbf{b})>k\right)=1.$$

Remark 1.1.

It is important to point out that $W$ in (1.2) is not necessarily a near-identity matrix. Indeed as $n\to\infty$ , the eigenvalues of $W$ typically lie in the interval

$$[(1-\sqrt{d})^{2},(1+\sqrt{d})^{2}],$$

and have an asymptotic density given by the famous Marchenko–Pastur law, see Definition 2.1. For finite $n$ , some of the eigenvalues of $W$ lie outside this interval, and the control of these eigenvalues plays a crucial role in the proofs of Theorems 3.1, 3.2 and 3.3 (see, for example, the proofs of (4.5) and (4.9)).

Our proofs make critical use of the invariance of the Wishart distribution and the relation between Householder bidiagonalization and the Lanczos iteration. This allows one to use classical estimates on chi-distributed random variables in a crucial way. The specific tools and results we incorporate from random matrix theory include global eigenvalue estimates [DS01], the convergence of the empirical spectral measure [BMP07] and the central limit theorem for linear statistics [LP09].

The remainder of the paper is organized as follows. In Section 1.1 we compare our analysis with facts already known about the conjugate gradient algorithm. We also demonstrate our results with numerical examples. In Section 2 we introduce our random matrix ensembles, the basic definitions from random matrix theory and review the Householder bidiagonalization procedure applied to these ensembles. We also review the connections between the conjugate gradient algorithm, the Lanczos iteration and the Householder bidiagonalization procedure. In Section 3 we present our main theorems. In Section 4 we introduce the results from probability and random matrix theory that are required to prove our theorems. In Section 5 we give the proofs of the theorems.

1.1. Comparison and demonstration

We now give a demonstration and discussion of the results. In what follows $\langle\cdot\rangle$ denotes the sample average of a random variable using $20,000$ samples. We will refer to the matrix

$$W=XX^{*}/m,$$

where $X$ is an $n\times m$ matrix, having iid entries, $X_{11}=\pm 1$ with equal probability, as the Bernoulli ensemble (BE).

1.1.1. A numerical demonstration

To demonstrate our main results, in Figure 1 we plot the following quantities as a function of $k$ for different values of $n$

$$\langle\|\textbf{e}_{k}(W,\textbf{b})\|_{W}\rangle\quad\text{and}\quad\frac{d^{k/2}}{\sqrt{1-d}}$$

with error bars that indicate where $99.9$ % of the samples lie. In Figure 2 we plot the same statistics for

$$\langle\|\textbf{r}_{k}(W,\textbf{b})\|_{2}\rangle\quad\text{compared with}\quad d^{k/2}.$$

Both Figure 1 and Figure 2 demonstrate the concentration of the errors about their means. We demonstrate the limiting behavior of the halting times $t_{\epsilon}^{(j)}$ in Figure 3.

In all of these figures we have included computations for distributions of random matrices, in particular the Bernoulli ensemble, that are beyond the class for which our results apply. Nonetheless, it is clear that the behavior persists. This universality will be investigated in future work.
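The experiment behind Figures 1 and 2 can be imitated on a small scale. The following is a minimal sketch, not the authors' code: it draws one real Wishart matrix $W=XX^{*}/m$, runs the conjugate gradient iteration with $\textbf{x}_{0}=0$ and a random unit vector $\textbf{b}$, and returns $\|\textbf{r}_{k}\|_{2}$ and $\|\textbf{e}_{k}\|_{W}$ next to the limiting values $d^{k/2}$ and $d^{k/2}/\sqrt{1-d}$. The function names, default sizes and seed are illustrative assumptions, and the true solution is obtained with a dense solve purely to measure the error.

```python
import numpy as np


def cg_norms(W, b, num_steps):
    """Run CG with x_0 = 0; return ||r_k||_2 and ||e_k||_W for k = 0..num_steps."""
    x_true = np.linalg.solve(W, b)  # dense solve, used only to measure the error
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    res_norms, err_norms = [], []
    for _ in range(num_steps + 1):
        e = x_true - x
        res_norms.append(np.linalg.norm(r))
        err_norms.append(np.sqrt(e @ W @ e))
        Wp = W @ p
        rr = r @ r
        a = rr / (p @ Wp)            # step length
        x = x + a * p
        r = r - a * Wp
        p = r + (r @ r) / rr * p     # next search direction
    return np.array(res_norms), np.array(err_norms)


def wishart_demo(n=1000, d=0.5, num_steps=5, seed=0):
    """One sample of W = X X^T / m (real case) and the CG norms on it."""
    rng = np.random.default_rng(seed)
    m = int(np.floor(n / d))
    X = rng.standard_normal((n, m))
    W = X @ X.T / m
    b = rng.standard_normal(n)
    b /= np.linalg.norm(b)           # random unit right-hand side
    res, err = cg_norms(W, b, num_steps)
    k = np.arange(num_steps + 1)
    return res, d ** (k / 2), err, d ** (k / 2) / np.sqrt(1 - d)


if __name__ == "__main__":
    res, res_limit, err, err_limit = wishart_demo()
    for k in range(len(res)):
        print(k, res[k], res_limit[k], err[k], err_limit[k])
```

A single sample only illustrates the concentration; reproducing the error bars in the figures would require repeating the draw many times and changing the entry distribution for the Bernoulli ensemble.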

1.1.2. Relation to previous work

The classical error estimate for the CGA is [HS52, Gre89]

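$$\|\textbf{e}_{k}(W,\textbf{b})\|_{W}\leq 2\left[\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{k}+\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{-k}\right]^{-1}\|\textbf{e}_{0}(W,\textbf{b})\|_{W},\tag{1.3}$$

where $\kappa=\lambda_{1}/\lambda_{n}$ is the condition number of $W$ . Here $\lambda_{1}\geq\cdots\geq\lambda_{n}>0$ are the eigenvalues of $W$ . It is a classical result in random matrix theory [BY93] that the condition number of (1.2) converges almost surely to $\frac{(1+\sqrt{d})^{2}}{(1-\sqrt{d})^{2}}$ . Roughly, one then obtains

$$\|\textbf{e}_{k}(W,\textbf{b})\|_{W}\lesssim 2\left[d^{k/2}+d^{-k/2}\right]^{-1}\|\textbf{e}_{0}(W,\textbf{b})\|_{W},$$

which is often just simplified to

$$\|\textbf{e}_{k}(W,\textbf{b})\|_{W}\lesssim 2d^{k/2}\|\textbf{e}_{0}(W,\textbf{b})\|_{W}.$$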

This overestimates the actual error by just a factor of 2.
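The passage from the condition-number bound to a rate of $\sqrt{d}$ per iteration can be spelled out. Assuming the classical estimate is used in its usual form, with contraction factor $(\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)$, inserting the almost sure limit of $\kappa$ gives:

```latex
\sqrt{\kappa}\;\longrightarrow\;\frac{1+\sqrt{d}}{1-\sqrt{d}},
\qquad
\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}
\;\longrightarrow\;
\frac{(1+\sqrt{d})-(1-\sqrt{d})}{(1+\sqrt{d})+(1-\sqrt{d})}
=\frac{2\sqrt{d}}{2}=\sqrt{d},
\qquad
\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{k}\;\longrightarrow\;d^{k/2}.
```

Since $\|\textbf{e}_{0}\|_{W}$ tends to $(1-d)^{-1/2}$ and $\|\textbf{e}_{k}\|_{W}$ tends to $d^{k/2}(1-d)^{-1/2}$, the simplified bound $2d^{k/2}\|\textbf{e}_{0}\|_{W}$ exceeds the limiting error by exactly the factor 2.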

In [MT16], the authors used (1.3) and tail bounds on the condition number to estimate the halting times (1.1) in the case $d=1+o(1)$ . A key observation was that the actual number of iterations appears to be of the same asymptotic order as the estimate obtained using (1.1). This is something that will indeed be true if the error estimate used decays exponentially and turns out to be an overestimate by a constant factor.

Remark 1.2.

Of particular interest is the case where $d$ depends on $n$ and $d\to 1$ as $n\to\infty$ . For example, $d=1-1/n^{-1/2}$ was seen in [DMT16, DMOT14] to produce universal fluctuations for the halting times. Similarly, one would want to treat the case $\epsilon=\epsilon(n)\to 0$ as $n\to\infty.$

Remark 1.3.

Our calculations in this work apply only to matrices with Gaussian entries. An important question, one of universality, is if our results hold when this assumption is relaxed. Indeed, one expects this to be true by the computations in Figures 1, 2 and 5 and the wealth of theoretical universality results from random matrix theory [PY14, BKYY16, BMP07].

2. The bidiagonalization of Wishart matrices and invariance

Definition 2.1.

For $0<d\leq 1$ set $m=\lfloor n/d\rfloor$ . Let $X$ be an $n\times m$ matrix of iid standard normal random variables ( $\beta=1$ ) or $X=X_{1}+\operatorname{i}X_{2}$ where $X_{1}$ and $X_{2}$ are independent copies of an $n\times m$ matrix of iid standard normal random variables ( $\beta=2$ ). Then

$$W_{n,\beta,d}:=\frac{1}{\beta m}XX^{*}$$

has the $\beta$ -Wishart distribution. The associated empirical spectral measure (ESM) is given by

$$\mu_{n,\beta,d}=\frac{1}{n}\sum_{j=1}^{n}\delta_{\lambda_{j}(n,\beta,d)}$$

where $\lambda_{1}(n,\beta,d)\geq\lambda_{2}(n,\beta,d)\geq\cdots\geq\lambda_{n}(n,\beta,d)$ are the eigenvalues of $W_{n,\beta,d}$ . Define the averaged ESM $\mathbb{E}\mu_{n,\beta,d}$ (or density of states) by

$$\int f(\lambda)\,\mathbb{E}\mu_{n,\beta,d}(\mathrm{d}\lambda):=\mathbb{E}\left(\int f(\lambda)\,\mu_{n,\beta,d}(\mathrm{d}\lambda)\right)$$

for every $f\in C_{b}(\mathbb{R}^{+})$ , where $C_{b}(\mathbb{R}^{+})$ denotes the bounded continuous functions on $[0,\infty)$ .

Definition 2.2.

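The Marchenko–Pastur law $\mu_{\mathrm{MP},d}$ on $\mathbb{R}$ is given by the density

$$\rho_{\mathrm{MP},d}(x)=\frac{1}{2\pi d}\frac{\sqrt{(d_{+}-x)(x-d_{-})}}{x}\mathbf{1}_{[d_{-},d_{+}]}(x),\quad d_{\pm}=(1\pm\sqrt{d})^{2}.$$

The relation of the Marchenko–Pastur law to the eigenvalues of a Wishart matrix is given in the following section; we illustrate this relationship in Figure 4.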
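As a quick numerical sanity check of Definition 2.2, the sketch below evaluates the density with $d_{\pm}=(1\pm\sqrt{d})^{2}$ and integrates a few powers of $\lambda$ against it with a plain trapezoidal rule. The function names and grid size are illustrative assumptions. For $0<d<1$ the zeroth and first moments should be close to $1$, the second close to $1+d$, and the moment of order $-1$ close to $1/(1-d)$, which is the value of $\mathfrak{e}^{2}_{1,0,d}$ used later in the paper.

```python
import numpy as np


def mp_density(x, d):
    """Marchenko-Pastur density with parameter 0 < d <= 1, zero outside [d_-, d_+]."""
    d_minus, d_plus = (1 - np.sqrt(d)) ** 2, (1 + np.sqrt(d)) ** 2
    x = np.asarray(x, dtype=float)
    inside = (x > d_minus) & (x < d_plus)
    rho = np.zeros_like(x)
    xi = x[inside]
    rho[inside] = np.sqrt((d_plus - xi) * (xi - d_minus)) / (2 * np.pi * d * xi)
    return rho


def mp_moment(power, d, num_points=200_001):
    """Trapezoidal approximation of the integral of lambda**power against the density.

    Intended for 0 < d < 1, so that the support stays away from zero
    and negative powers are well defined.
    """
    d_minus, d_plus = (1 - np.sqrt(d)) ** 2, (1 + np.sqrt(d)) ** 2
    lam = np.linspace(d_minus, d_plus, num_points)
    f = lam ** float(power) * mp_density(lam, d)
    return float(np.sum((f[1:] + f[:-1]) * np.diff(lam)) / 2)


if __name__ == "__main__":
    d = 0.5
    for power in (0, 1, 2, -1):
        print(power, mp_moment(power, d))
```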

The Wishart distribution is invariant under orthogonal ( $\beta=1$ ) or unitary ( $\beta=2$ ) conjugation. Using $\beta=2$ , this means that if $U$ is a random unitary matrix that is independent of $W_{n,\beta,d}$ then

$$UW_{n,\beta,d}U^{*}\overset{\text{dist.}}{=}W_{n,\beta,d}.$$

For $W_{n,\beta,d}=\frac{1}{\beta m}XX^{*}$ the Householder bidiagonalization procedure [TBI97] operates on $X$ on the left and the right with $n\times n$ Householder reflections $R_{1},R_{2},\ldots,R_{n}$ , and $m\times m$ Householder reflections $\tilde{R}_{1},\tilde{R}_{2},\ldots,\tilde{R}_{n}$ , so that

[TABLE]

where all entries are non-negative. Because of invariance, $\{\zeta_{ij}\}$ are independent $\chi$ -distributed random variables, see [DE02] and the references therein. Specifically,

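$$H_{n,\beta,d}\overset{\text{dist.}}{=}\begin{bmatrix}\chi_{\beta m}\\ \chi_{\beta(n-1)}&\chi_{\beta(m-1)}\\ &\chi_{\beta(n-2)}&\chi_{\beta(m-2)}\\ &&\ddots&\ddots\\ &&&\chi_{\beta}&\chi_{\beta(m-n+1)}\end{bmatrix}$$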

where all entries are independent. Define the infinite matrix $\mathbb{T}_{d}$ by the entry-wise limit

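$$(\mathbb{T}_{d})_{ij}:=\lim_{n\to\infty}\frac{1}{\beta m}\mathbb{E}\left[(H_{n,\beta,d}H_{n,\beta,d}^{*})_{ij}\right],\quad 1\leq i,j.$$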

Therefore

[TABLE]

Lastly, define $\mathbb{T}_{k,d}$ to be the upper-left $k\times k$ submatrix of $\mathbb{T}_{d}$ .
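The bidiagonal model makes it cheap to sample the tridiagonal matrix $\frac{1}{\beta m}H_{n,\beta,d}H_{n,\beta,d}^{*}$ without forming $X$. The sketch below is a minimal illustration under our reading of the chi-distributed bidiagonal form: $H_{n,\beta,d}$ is taken to be lower bidiagonal with diagonal entries $\chi_{\beta m},\chi_{\beta(m-1)},\ldots,\chi_{\beta(m-n+1)}$ and subdiagonal entries $\chi_{\beta(n-1)},\ldots,\chi_{\beta}$, all independent. The function names are hypothetical. Under that assumption the leading entries of the sampled matrix should be close to $1$, $\sqrt{d}$ and $1+d$ for large $n$, which is what the entry-wise limit defining $\mathbb{T}_{d}$ would give.

```python
import numpy as np


def sample_bidiagonal(n, d, beta=1, seed=0):
    """Sample an n x n lower bidiagonal matrix with independent chi entries."""
    rng = np.random.default_rng(seed)
    m = int(np.floor(n / d))
    diag_df = beta * (m - np.arange(n))           # beta*m, ..., beta*(m-n+1)
    sub_df = beta * (n - 1 - np.arange(n - 1))    # beta*(n-1), ..., beta
    diag = np.sqrt(rng.chisquare(diag_df))        # chi = sqrt(chi-square)
    sub = np.sqrt(rng.chisquare(sub_df))
    H = np.diag(diag) + np.diag(sub, k=-1)
    return H, m


def sample_tridiagonal(n, d, beta=1, seed=0):
    """Return H H^T / (beta m), a tridiagonal matrix (dense storage, small n only)."""
    H, m = sample_bidiagonal(n, d, beta, seed)
    return H @ H.T / (beta * m)


if __name__ == "__main__":
    d = 0.5
    T = sample_tridiagonal(500, d)
    print(T[0, 0], T[1, 0], T[1, 1])              # compare with 1, sqrt(d), 1 + d
    print(np.linalg.eigvalsh(T)[[0, -1]])         # compare with (1 -/+ sqrt(d))**2
```

Dense storage is used only to keep the sketch short; for large $n$ one would store the two diagonals instead.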

2.1. Householder bidiagonalization, the Lanczos iteration and the CG algorithm

The conjugate gradient algorithm (CGA) for the iterative solution of

$$W\textbf{x}=\textbf{b},\quad W>0$$

is given by

Algorithm 1: Conjugate Gradient Algorithm

(1) $\textbf{x}_{0}$ is the initial guess.

(2) Set $\textbf{r}_{0}=\textbf{b}-W\textbf{x}_{0}$ , $\textbf{p}_{0}=\textbf{r}_{0}$ .

(3) For $k=1,2,\ldots,n$

(a) Compute $\displaystyle a_{k-1}=\frac{\textbf{r}_{k-1}^{*}\textbf{r}_{k-1}}{\textbf{r}_{k-1}^{*}W\textbf{p}_{k-1}}$ .

(b) Set $\textbf{x}_{k}=\textbf{x}_{k-1}+a_{k-1}\textbf{p}_{k-1}$ .

(c) Set $\textbf{r}_{k}=\textbf{r}_{k-1}-a_{k-1}W\textbf{p}_{k-1}$ .

(d) Compute $\displaystyle b_{k-1}=-\frac{\textbf{r}_{k}^{*}\textbf{r}_{k}}{\textbf{r}_{k-1}^{*}\textbf{r}_{k-1}}$ .

(e) Set $\textbf{p}_{k}=\textbf{r}_{k}-b_{k-1}\textbf{p}_{k-1}$ .
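The steps of the algorithm translate almost line by line into code. The following is a minimal NumPy sketch under stated assumptions: $W$ is a dense Hermitian positive definite array, the function name and return format are illustrative choices, and the real part of each inner product is taken since these quantities are real in exact arithmetic.

```python
import numpy as np


def conjugate_gradient(W, b, num_steps, x0=None):
    """Follow steps (1)-(3) of the algorithm for a Hermitian positive definite W.

    Returns the lists [x_0, ..., x_K] and [r_0, ..., r_K] with K <= num_steps.
    """
    dtype = np.result_type(W.dtype, b.dtype, np.float64)
    x = np.zeros(b.shape[0], dtype=dtype) if x0 is None else np.array(x0, dtype=dtype)
    r = b - W @ x                                    # step (2)
    p = r.copy()
    xs, rs = [x.copy()], [r.copy()]
    for _ in range(num_steps):
        rr = np.vdot(r, r).real
        if rr == 0.0:                                # exact solution reached
            break
        Wp = W @ p
        a = rr / np.vdot(r, Wp).real                 # step (a)
        x = x + a * p                                # step (b)
        r_new = r - a * Wp                           # step (c)
        b_coef = -np.vdot(r_new, r_new).real / rr    # step (d), sign as in the text
        p = r_new - b_coef * p                       # step (e)
        r = r_new
        xs.append(x.copy())
        rs.append(r.copy())
    return xs, rs
```

Keeping every iterate is wasteful in practice but convenient here, because the quantities studied in the paper are the norms of $\textbf{r}_{k}$ and $\textbf{e}_{k}$ at each fixed $k$.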

The error at step $k$ is given by $\textbf{e}_{k}=\textbf{x}-\textbf{x}_{k}$ . Define the norm $\|\textbf{y}\|_{W}^{2}=\textbf{y}^{*}W\textbf{y}$ . A variational characterization of the CGA is that

$$\|\textbf{e}_{k}\|_{W}=\|p_{k}^{\dagger}(W)\textbf{e}_{0}\|_{W}=\min_{p_{k}\in\mathbb{P}_{k}^{(0)}}\|p_{k}(W)\textbf{e}_{0}\|_{W},$$

where $\mathbb{P}_{k}^{(0)}=\{p:p\text{ is a polynomial of degree }k,\ p(0)=1\}$ . The unique minimizer $p_{k}^{\dagger}$ in $\mathbb{P}_{k}^{(0)}$ can be described by the Lanczos algorithm. The Lanczos algorithm is a tridiagonalization algorithm given by

Algorithm 2: Lanczos Iteration

(1) $\textbf{y}_{1}$ is the initial vector. Suppose $\|\textbf{y}_{1}\|_{2}^{2}=\textbf{y}_{1}^{*}\textbf{y}_{1}=1$ .

(2) Set $\beta_{0}=0$ .

(3) For $k=1,2,\ldots,n$

(a) Compute $\displaystyle\alpha_{k}=(W\textbf{y}_{k}-\beta_{k-1}\textbf{y}_{k-1})^{*}\textbf{y}_{k}$ .

(b) Set $\textbf{v}_{k}=W\textbf{y}_{k}-\alpha_{k}\textbf{y}_{k}-\beta_{k-1}\textbf{y}_{k-1}$ .

(c) Compute $\beta_{k}=\|\textbf{v}_{k}\|_{2}$ and if $\beta_{k}\neq 0$ , set $\textbf{y}_{k+1}=\textbf{v}_{k}/\beta_{k}$ .
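The Lanczos recurrence can be sketched in the same style. This is a minimal illustration, assuming a dense Hermitian $W$; the function name is hypothetical, no reorthogonalization is performed, and the real part of $\alpha_{k}$ is taken because it is real in exact arithmetic. The function returns the diagonal and off-diagonal entries of the $k\times k$ tridiagonal matrix produced after $k$ steps.

```python
import numpy as np


def lanczos(W, y1, num_steps):
    """Return (alphas, betas): the diagonal and off-diagonal of the tridiagonal matrix."""
    y = y1 / np.linalg.norm(y1)        # enforce the normalization in step (1)
    y_prev = np.zeros_like(y)
    beta_prev = 0.0                    # step (2)
    alphas, betas = [], []
    for _ in range(num_steps):
        Wy = W @ y
        alpha = np.vdot(Wy - beta_prev * y_prev, y).real   # step (a)
        v = Wy - alpha * y - beta_prev * y_prev            # step (b)
        beta = np.linalg.norm(v)                           # step (c)
        alphas.append(alpha)
        if beta == 0.0:                # invariant subspace found, stop early
            break
        betas.append(beta)
        y_prev, y, beta_prev = y, v / beta, beta
    # A k x k tridiagonal matrix has only k - 1 off-diagonal entries.
    return np.array(alphas), np.array(betas[: len(alphas) - 1])
```

In floating point the computed vectors gradually lose orthogonality, so for long runs a practical code would add some form of reorthogonalization; for the small fixed $k$ considered in the paper this is not an issue.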

The Lanczos algorithm produces a tridiagonal matrix $T$

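$$T=T(W,\textbf{y}_{1})=\begin{bmatrix}\alpha_{1}&\beta_{1}\\ \beta_{1}&\alpha_{2}&\ddots\\ &\ddots&\ddots&\beta_{n-1}\\ &&\beta_{n-1}&\alpha_{n}\end{bmatrix}$$

and $T=QWQ^{*}$ for some unitary matrix $Q$ . We use $T_{k}=T_{k}(W,\textbf{y}_{1}),k=1,2,\ldots,n$ to denote the upper-left $k\times k$ subblock of $T$ . Then, it is well-known that (see [Gre89], for example)

$$p_{k}^{\dagger}(\lambda)=\frac{\det\left(T_{k}\left(W,\frac{\textbf{r}_{0}}{\|\textbf{r}_{0}\|}\right)-\lambda I\right)}{\det T_{k}\left(W,\frac{\textbf{r}_{0}}{\|\textbf{r}_{0}\|}\right)},$$

where $\textbf{r}_{0}=\textbf{b}-W\textbf{x}_{0}$ as above. Then, write $W=U\Lambda U^{*},\ \Lambda=\mathrm{diag}(\lambda_{1},\lambda_{2},\ldots,\lambda_{n})$ and for $\textbf{x}_{0}=\textbf{0}$ , so that $\textbf{e}_{0}=\textbf{x}$ ,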

[TABLE]

Now, consider the special case $\textbf{b}=\textbf{b}_{0}:=[1,0,\ldots,0]^{T}$ , so that $\textbf{r}_{0}=\textbf{b}_{0}$ . We further analyze the relation $Q^{*}TQ=W$ with $\textbf{y}_{1}=\textbf{b}_{0}$ . The Lanczos algorithm gives the matrix representation of $W$ in the orthonormal basis found by applying the Gram–Schmidt procedure to the sequence

[TABLE]

So, the first vector is $\textbf{b}_{0}$ , and so the first column of $Q$ is $\textbf{b}_{0}$ . The main consequence of this is that the first components of the eigenvectors of $W$ are the same as those of $T$ (this is true modulo permutations and normalizations).

Basic assumption: Henceforth, throughout the paper we will assume $\textbf{x}_{0}=\textbf{0}$ .

Finally, we make a simple observation that the Householder bidiagonalization procedure [TBI97] applied to $X$ where $W=XX^{*}$ (or Householder tridiagonalization applied to $W$ ) leaves the eigenvalues of $W$ unchanged and also leaves the first components of the eigenvectors unchanged. So, provided that Lanczos completes ( $\beta_{k}\neq 0$ for $k=1,\ldots,n-1$ ), the Householder bidiagonalization must produce $T(W,\textbf{b}_{0})$ . This is indeed true because a Jacobi matrix is uniquely defined by eigenvalues and first-components of eigenvectors (see, e.g. [DLT85]).

3. Main results

The proof of our main theorem is given in Section 5. The convention used in this paper is that $\beta$ and $d$ are fixed constants. The symbols $C,c,C^{\prime},c^{\prime}$ with an assortment of subscripts will be used to denote constants and their (possible) dependencies. We suppress any dependence of these constants on $\beta$ but include dependence on $d$ , with a view to forthcoming work where we will allow $d$ to vary.

Theorem 3.1.

Assume the conjugate gradient algorithm is applied to solve $W_{n,\beta,d}\textbf{x}=\textbf{b}$ where $\|\textbf{b}\|_{2}=1$ is a (possibly) random vector, independent of $W_{n,\beta,d}=W$ and $0<d<1$ . Let $\textbf{e}_{k}=\textbf{x}-\textbf{x}_{k}$ , $k=0,1,2,\ldots$ be the associated error vectors.

(1) For any fixed $\ell\in\mathbb{Z}$ and $n>1$ there exists a constant $C_{\ell,d,k}>0$ such that

[TABLE]

(2) Furthermore

[TABLE]

for some constant $C^{\prime}_{\ell,d,k}>0$ and a non-decreasing function $h(t)$ that satisfies $h(t)>0$ for $t>0$ .

(3) Lastly, if $d=1$ and $\ell\geq 2$ , (1) and (2) hold.

Theorem 3.2.

In the setting of Theorem 3.1, for $0<d<1$ and $\ell\in\mathbb{Z}$ , define (with the convention that, if $k=0$ , $\mathfrak{e}^{2}_{\ell,0,d}:=\int\lambda^{\ell-2}\mu_{\mathrm{MP},d}(\mathrm{d}\lambda)$ )

[TABLE]

Then as $k\to\infty$ , $\mathfrak{e}^{2}_{\ell,k,d}\to 0$ . Furthermore,

[TABLE]

For $d=1$ , (3.1) holds if $\ell\geq 2$ , and for $0<d\leq 1$

[TABLE]

Corollary 3.2.1.

In the setting of Theorem 3.1, for $0<d<1$ and $\ell\in\mathbb{Z}$ and $n>1$

[TABLE]

If $d=1$ these relations hold for $\ell\geq 2$ .

Proof.

The first claim follows from the Borel–Cantelli Lemma. The second follows from the observation

[TABLE]

which gives

[TABLE]

∎

The last of our main results is almost just a corollary of the above theorems and it concerns halting times (i.e. runtimes or iteration counts, recall (1.1)):

[TABLE]

Since $\|\textbf{e}_{k}\|_{W}$ converges almost surely to $\mathfrak{e}_{1,k,d}$ and $\|\textbf{r}_{k}\|_{2}=\|\textbf{e}_{k}\|_{W^{2}}$ converges almost surely to $\mathfrak{e}_{2,k,d}$ we produce the candidate limit halting times

[TABLE]
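To make the candidate limit halting times concrete, the sketch below assumes they are the first index $k$ at which the limiting norm drops below $\epsilon$, and it uses the closed forms $\mathfrak{e}_{2,k,d}=d^{k/2}$ and $\mathfrak{e}_{1,k,d}=d^{k/2}/\sqrt{1-d}$ suggested by the almost sure limits quoted in the introduction. The function name, this reading of the definition and these closed forms are assumptions of the illustration.

```python
import math


def candidate_halting_time(eps, d, ell):
    """Smallest k >= 0 whose limiting norm is below eps.

    ell = 1 uses d**(k/2) / sqrt(1 - d) (error in the W-norm),
    ell = 2 uses d**(k/2) (residual in the 2-norm).
    """
    if not (0 < d < 1) or eps <= 0 or ell not in (1, 2):
        raise ValueError("need 0 < d < 1, eps > 0 and ell in {1, 2}")
    scale = 1 / math.sqrt(1 - d) if ell == 1 else 1.0
    k = 0
    while scale * d ** (k / 2) >= eps:
        k += 1
    return k


if __name__ == "__main__":
    for ell in (1, 2):
        print(ell, candidate_halting_time(0.07, 0.25, ell))
```

With $d=0.25$ and $\epsilon=0.07$ the residual criterion stops at $k=4$ while the error criterion needs $k=5$, since $0.25^{2}=0.0625<0.07$ but $0.0625/\sqrt{0.75}\approx 0.072>0.07$. Tolerances that coincide with one of the limiting norms are the exceptional values treated separately in Theorem 3.3.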

Theorem 3.3.

In the setting of Theorem 3.1, $0<d<1$ , for $\ell=1,2$ suppose that $\epsilon\neq\mathfrak{e}_{\ell,k,d}$ for $k=0,1,2,\ldots$ , and $\epsilon<\mathfrak{e}_{\ell,0,d}$ . (Note that $\epsilon<\mathfrak{e}_{\ell,0,d}$ is just a statement that $\epsilon^{2}<(1-d)^{-1}$ for $\ell=1$ and $\epsilon<1$ for $\ell=2$ , and so $\tau_{\epsilon}^{(\ell)}(\beta,d)\geq 1$ , $\ell=1,2$ .) Then

[TABLE]

If $\epsilon=\mathfrak{e}_{\ell,k,d}$ , i.e. $k=\tau^{(\ell)}(\beta,d)$ , for $k>0$ then

[TABLE]

Proof.

Assume $\epsilon\neq\mathfrak{e}_{\ell,k,d}$ for $k=1,2,\ldots$ . Then $\delta=\min_{k}|\epsilon^{2}-\mathfrak{e}_{\ell,k,d}^{2}|>0$ as $\mathfrak{e}_{\ell,k,d}\to 0$ as $k\to\infty$ . Note that if $\kappa=\tau^{(\ell)}_{\epsilon}(\beta,d)$ then

[TABLE]

Then as $\delta\leq|\epsilon^{2}-\mathfrak{e}_{\ell,k-1,d}^{2}|=\mathfrak{e}_{\ell,k-1,d}^{2}-\epsilon^{2}$

[TABLE]

We estimate these two terms individually. First, let $\mathcal{M}_{k}$ be the event where $\|\textbf{e}_{j}\|_{W^{\ell}}$ is weakly decreasing for $j=0,1,2,\ldots,k-1$ . Then

[TABLE]

For sufficiently large $n$ , by Lemma 4.8,

[TABLE]

and for such a value of $n$

[TABLE]

by Theorem 3.1(2). It remains to show that $\mathbb{P}(\mathcal{M}_{k}^{c})\to 0$ . For $\ell=1$ this is immediate because $\mathbb{P}(\mathcal{M}_{k}^{c})=0$ . For $\ell=2$ , consider the event $\mathcal{S}_{k}=\{d^{j+1/2}<\|\textbf{e}_{j}\|^{2}_{W^{2}}<d^{j-1/2},\ j=1,2,\ldots,k-1\}$ . Then $\mathbb{P}(\mathcal{M}_{k}^{c})\leq\mathbb{P}(\mathcal{S}_{k}^{c})$ and

[TABLE]

And this tends to zero by Theorem 3.1(2). The estimate for $\mathbb{P}\left(t^{(\ell)}_{\epsilon}>\tau^{(\ell)}_{\epsilon}(\beta,d)\right)$ is analogous, but now we do not need monotonicity for $\|\textbf{e}_{j}\|_{W^{\ell}}$ and

[TABLE]

as $\delta\leq|\epsilon^{2}-\mathfrak{e}^{2}_{\ell,k,d}|=\epsilon^{2}-\mathfrak{e}_{\ell,k,d}^{2}$ . When $\epsilon=\mathfrak{e}_{\ell,k,d}$ , we use similar arguments to show that

[TABLE]

As $t_{\epsilon}^{(\ell)}(W,\textbf{b})$ and $\tau_{\epsilon}^{(\ell)}(\beta,d)$ are integers, (3.2) follows. ∎

Remark 3.4.

We conjecture that if $\epsilon=\mathfrak{e}_{\ell,k,d}$ , i.e. $k=\tau^{(\ell)}(\beta,d)$ , for $k>0$ then

[TABLE]

Indeed Figure 5 indicates this is true because

[TABLE]

appears to be asymptotically normal with a variance that decays like $1/n$ . We note that this is related to, but not a consequence of, the central limit theorem for linear spectral statistics (CLT for LSS). For the CLT for LSS the variance decays as $1/n^{2}$ . In the case at hand, the fluctuations that occur in the random weights (see $\omega_{j}$ in (5.1)) and the fluctuations that occur in the random polynomial $p_{k}^{\dagger}$ contribute to the variance on the order of $1/n$ . This conjecture will be resolved in a forthcoming publication.

4. Technical results from random matrix theory

Lemma 4.1.

Let $\chi_{k}$ be a chi distributed random variable with $k\geq 1$ degrees of freedom. Then for any fixed integers $p$ and $q>0$ there exists $C_{q,p}>0$ such that

[TABLE]

Furthermore, for $t\geq 0$

[TABLE]

For $t\leq 0$ ,

[TABLE]

Proof.

Because $\chi_{k}$ has a density given by

[TABLE]

we are led to analyze

[TABLE]

The result then follows by the change of variables $x=\sqrt{k}y$ and applying the method of steepest descent (Laplace’s method, see e.g., [AF03, Lemma 6.2.3]) for integrals along with Stirling’s approximation. Indeed, suppose $f:(0,\infty)\to\mathbb{R}$ is smooth and satisfies the bound $|f(y)|\leq C(y^{-K}+y^{L})$ for $K,L,C>0$ . Then for $k>K$ we must estimate

[TABLE]

Note that $\ell(y)$ has a global minimum of zero at $y=1$ on $(0,\infty)$ . For $0<\epsilon<1$ , write

[TABLE]

which decays exponentially to zero as $\ell(1-\epsilon)>0$ . The same calculation on $[1+\epsilon,\infty)$ gives

[TABLE]

which, again, decays exponentially because $\ell(1+\epsilon)>0$ . Laplace’s method gives

[TABLE]

As remarked, Stirling’s approximation then gives the first inequality in the lemma. We note that

[TABLE]

The second inequality then uses (4.1)

[TABLE]

The last follows from (4.1) and a similar calculation

[TABLE]

∎

Definition 4.2.

A mean-zero random variable $X$ is called sub-exponential with parameters $\nu,\alpha>0$ if $\mathbb{E}\left[\operatorname{e}^{\lambda X}\right]\leq\operatorname{e}^{\lambda^{2}\nu^{2}/2}$ for $|\lambda|<\frac{1}{\alpha}$ .

It then follows that centered chi variables are sub-exponential. If $X$ is sub-exponential, then clearly $\mathbb{E}\left[\operatorname{e}^{|X|/t}\right]<2$ for some $t>0$ .

A good reference for the next classical result is [Ver18, Section 2.8].

Theorem 4.3 (Bernstein’s inequality for sub-exponential random variables).

Let $(X_{i})_{i\geq 1}$ be a sequence of independent mean zero random variables and define

[TABLE]

Then for $t\geq 0$ and any real numbers $a_{1},\ldots,a_{n}$

[TABLE]

and $K=\max_{1\leq i\leq n}\|X_{i}\|_{\psi_{1}}$ . Here $c>0$ is some absolute constant independent of $\{X_{i}\},\{a_{j}\}$ .

The estimate

[TABLE]

can be found by estimating the density for a $\chi_{\beta n}^{2}$ random variable or by applying Bernstein’s inequality (Recall that a sum of $n$ independent chi-square variables $\chi_{\sigma_{i}}^{2}$ , $i=1,\ldots,n$ is again a chi-square variable $\chi_{\sigma}^{2}$ with $\sigma=\sum_{i}\sigma_{i}$ ). We will use three elementary facts that are encapsulated in the following lemma.

Lemma 4.4.

Let $Z_{1},Z_{2},Y$ be random variables and assume $\mathbb{P}(Y=0)=0$ . The following inequalities hold

(1) $\displaystyle\mathbb{P}\left(\frac{|Z_{1}|}{|Y|}\geq t\right)\leq\mathbb{P}\left(|Z_{1}|\left[\frac{1}{|Y|}-\frac{1}{\mu}\right]_{+}+\frac{|Z_{1}|}{\mu}\geq t\right)$ , where $[\cdot]_{+}$ denotes the positive part and $\mu>0$ ,

(2) $\displaystyle\mathbb{P}(|Z_{1}|+|Z_{2}|\geq t)\leq\mathbb{P}(|Z_{1}|\geq t/p)+\mathbb{P}(|Z_{2}|\geq t/q)$ , $1/p+1/q=1$ , and

(3) $\displaystyle\mathbb{P}(|Z_{1}||Z_{2}|\geq t)\leq\mathbb{P}(|Z_{1}|\geq t^{1/2})+\mathbb{P}(|Z_{2}|\geq t^{1/2})$ .

Lemma 4.5.

Suppose $-\infty<\lambda_{1,n}<\lambda_{2,n}<\cdots<\lambda_{n,n}<\infty$ . Let $(\chi_{\beta}^{(j)})_{j\geq 1}$ be independent chi-distributed random variables with $\beta$ degrees of freedom. Define weights

[TABLE]

Then the Kolmogorov–Smirnov distance

[TABLE]

of

[TABLE]

satisfies

[TABLE]

and the tail estimate

[TABLE]

for absolute constants $C,C_{1},C_{2},c_{1},c_{2}>0$ .

Note that $d_{\mathrm{KS}}(\mu_{n},\nu_{n})\leq 1$ .

Proof.

First, it follows that

[TABLE]

So, we are led to analyze the sums

[TABLE]

As $S_{k}$ has expected value zero, Bernstein’s inequality gives for $t\geq 0$

[TABLE]

for absolute constants $c,K>0$ . From the moment generating function for a chi-square distribution, we have, as $S$ is a chi-square random variable with $n\beta$ degrees of freedom,

[TABLE]

This minimum occurs at $s=\frac{n\beta}{2t}-\frac{1}{2}$ , giving

[TABLE]

Then we write for $0\leq s\leq 1$

[TABLE]

so that

[TABLE]

Now set $t=\frac{1}{1-s}-1=\frac{s}{1-s}$ , $s=\frac{t}{t+1}$ to find

[TABLE]

Then it is easy to see that for $0\leq t\leq 1$

[TABLE]

giving the estimate

[TABLE]

We then write $\tilde{S}_{k}=S_{k}/(n\beta)$ and $\tilde{S}=S/(n\beta)$ so that

[TABLE]

and then we apply each property of Lemma 4.4, in order, to obtain

[TABLE]

for $1/p+1/q=1$ . The tail estimate (4.3) follows by a union bound. We examine $F$ more closely, and get a crude bound

[TABLE]

While we do not specifically need the value, for a chi-squared random variable $\chi_{\beta}^{2}$ with $\beta$ degrees of freedom one has $\|\chi_{\beta}^{2}\|_{\psi_{1}}=\frac{2}{1-(1/2)^{2/\beta}}$ , which gives $K=\|\chi_{\beta}^{2}-\beta\|_{\psi_{1}}<\infty$ . In summary, we obtain

[TABLE]

Then, we can estimate moments

[TABLE]

where $\Gamma$ denotes the Gamma function. As $(\alpha_{1}+\cdots+\alpha_{k})^{1/m}\leq\alpha_{1}^{1/m}+\cdots+\alpha_{k}^{1/m}$ for all $\alpha_{i}\geq 0$ , it follows that $\left(\mathbb{E}\left[\left|\frac{S_{k}}{S}\right|^{m}\right]\right)^{1/m}\leq C^{\prime}\frac{m}{\sqrt{n}}$ for some $C^{\prime}>0$ .

Thus $\|n^{1/2}\left|\frac{S_{k}}{S}\right|\|_{\psi_{1}}\leq C^{\prime\prime}$ for some absolute constant $C^{\prime\prime}$ . By Jensen’s inequality

[TABLE]

and choosing $s=\max_{k}\|X_{k}\|_{\psi_{1}}$ we obtain

[TABLE]

Thus $\mathbb{E}[d_{\mathrm{KS}}(\mu_{n},\nu_{n})]\leq C\frac{\log n}{n^{1/2}}$ , for some new constant $C$ . Hence $d_{\mathrm{KS}}(\mu_{n},\nu_{n})$ converges to zero in $L^{1}$ , in probability and almost surely. (Because $d_{\mathrm{KS}}(\mu,\nu)$ is always less than or equal to unity, almost sure convergence gives $L^{1}$ convergence, but we have obtained a rate.) ∎

Theorem 4.6 (Global eigenvalue bounds, see, e.g. [DS01]).

For the eigenvalues $\lambda_{n}\leq\cdots\leq\lambda_{1}$ of a $\beta$ -Wishart distribution

[TABLE]

for an absolute constant $c$ .

This immediately implies that for any interval $(a,b)$ such that $[(1-\sqrt{d})^{2},(1+\sqrt{d})^{2}]\subset(a,b)$ there exists a constant $\gamma=\gamma(a,b)>0$ such that

[TABLE]

And it also implies the bound on the distribution function for $\lambda_{n}$ . Define $d_{n}=\frac{n}{m}=d+o(1)$ so

[TABLE]

where $s_{\pm}=t-(1\pm\sqrt{d_{n}})^{2}$ . And the important conclusion from this is that

[TABLE]

where $C_{k}$ is independent of $n$ . We give an analogue of (4.5) for $k<0$ with $\lambda_{1}$ replaced with $\lambda_{n}$ in (4.11).

Lemma 4.7.

The marginal density $R(\mu)$ of the smallest eigenvalue of $\beta mW_{n,\beta,d}$ satisfies

[TABLE]

Moreover, if $m=\lfloor n/d\rfloor$ , $0<d<1$

[TABLE]

where

[TABLE]

Proof.

We follow [ES05]. Define the multivariate Gamma function

[TABLE]

Then the joint density of the eigenvalues $\mu_{n}\leq\cdots\leq\mu_{1}$ of $\beta mW_{n,\beta,d}$ is

[TABLE]

where

[TABLE]

Then

[TABLE]

Then we have

[TABLE]

Define the modified multivariate Gamma function

[TABLE]

and we have

[TABLE]

Then we need to simplify

[TABLE]

This all gives

[TABLE]

Then to estimate, we use that $\Gamma(x+a)=\Gamma(x)x^{a}(1+o(1))$ as $x\to\infty$ for $a$ fixed to write

[TABLE]

where $x=n\beta/2$ and $y=p$ . This is just the reciprocal of the Beta function with asymptotics

[TABLE]

as $x,y\to\infty$ , found using Stirling’s formula. Since $m=n/d-\sigma_{n}$ where $0\leq\sigma_{n}<1$ we can write $x+y=\frac{\beta}{2}(m+1)=\frac{\beta n}{2d}+\gamma_{n}$ , where $0<\gamma_{n}\leq\frac{\beta}{2}\leq 1$ . Therefore

[TABLE]

Since $\gamma_{n}$ is positive and bounded by unity, as $n\to\infty$

[TABLE]

This gives

[TABLE]

∎
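The asymptotic relation $\Gamma(x+a)=\Gamma(x)x^{a}(1+o(1))$ used in the proof above is easy to check numerically. The following sketch evaluates the ratio through the log-Gamma function; the value $a=1.5$ and the grid of $x$ are arbitrary illustrative choices.

```python
import math

def gamma_ratio(x, a):
    """Gamma(x + a) / (Gamma(x) * x**a), computed via log-gamma to avoid
    overflow for large x."""
    return math.exp(math.lgamma(x + a) - math.lgamma(x) - a * math.log(x))

a = 1.5
xs = (10.0, 100.0, 1000.0, 10000.0)
ratios = [gamma_ratio(x, a) for x in xs]
for x, ratio in zip(xs, ratios):
    print(x, ratio)  # the ratio approaches 1 as x grows
```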

Lemma 4.8.

For fixed $k\in\mathbb{Z}$ , $0<d<1$

[TABLE]

as $n\to\infty$ . If $d=1$ these estimates hold for $k\geq 0$ .

Proof.

The case of $k\geq 0$ is classical and implies weak convergence of the ESM to the Marchenko–Pastur law, see [BS10, Section 3.1], for example. For negative powers more work is required. Recall the definition (2.2)

[TABLE]

for all $f\in C_{b}(\mathbb{R}^{+})$ . We extend this definition to $f(\lambda)=\lambda^{-k}$ , $k>0$ . Introduce a continuous truncation of $\lambda^{-k}$ :

[TABLE]

The monotone convergence theorem gives

[TABLE]

The last term is finite for fixed $k$ , provided $n$ is sufficiently large by Lemma 4.7.

For the sake of notation, set $g(\lambda)=g(\lambda;d_{-})$ . Then, consider

[TABLE]

We estimate each of these terms separately. First, we use that

[TABLE]

and show that this tends to zero exponentially. But to establish the last inequality introduce

[TABLE]

The dominated convergence theorem then provides

[TABLE]

And each term in the last sum is bounded by setting $\ell=n$ .

To estimate the expectation (4.9), we use estimates on the marginal density $R(\lambda)$ for $\lambda_{n}$ . Specifically, (4.4) implies that for any $d>0$

[TABLE]

for some constants $C_{k,d},c_{k,d}$ that do not depend on $n$ . And so, for this term we are left estimating

[TABLE]

Then we use (4.6), introducing some constants $c_{d},C_{d}>0$ to estimate

[TABLE]

for some $C_{k,d},c_{k,d}>0$ . Indeed, this converges to zero super exponentially. From (4.10) we obtain

[TABLE]

and we may assume these constants are the same. Therefore $I_{1}\leq C_{k,d}\operatorname{e}^{-c_{k,d}n}$ . To estimate $I_{2}$ we write

[TABLE]

Then the Kolmogorov–Smirnov distance is given by [BS10, Theorem 8.10]

[TABLE]

And therefore $I_{2}=O(n^{-1/2})$ . Finally, we turn to the variance estimate (4.8) for $k<0$ . From [LP09, (4.16) and Remark 4.1]

[TABLE]

We then have

[TABLE]

Then

[TABLE]

The last term here tends to zero at an exponential rate. And because $\lambda^{-k}-g(\lambda)$ is a non-negative, monotonic function it suffices to estimate

[TABLE]

which vanishes again, at an exponential rate. This establishes (4.8) with $m_{k,d,n}$ in place of $m_{k,d}$ . Then (4.8) follows from (4.7) once one notes that

[TABLE]

∎
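The convergence of moments, including a negative one, can be observed in a small experiment. The sketch below assumes the normalization $W=XX^{T}/m$ with real Gaussian entries and compares empirical spectral moments with the standard Marchenko–Pastur values $1$ , $1+d$ and $1/(1-d)$ for the powers $1$ , $2$ and $-1$ ; the function name and sizes are illustrative.

```python
import numpy as np

def empirical_moments(n, m, powers, seed=0):
    """Return {k: (1/n) * sum_j lambda_j**k} for W = X X^T / m."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, m))
    eigs = np.linalg.eigvalsh(X @ X.T / m)
    return {k: float(np.mean(eigs ** float(k))) for k in powers}

n, m = 200, 400
d = n / m
# Moments of the Marchenko-Pastur law with ratio d (unit variance entries).
mp_moments = {1: 1.0, 2: 1.0 + d, -1: 1.0 / (1.0 - d)}
moments = empirical_moments(n, m, mp_moments.keys())
for k in mp_moments:
    print(k, moments[k], mp_moments[k])
```

The negative power is finite here because $d<1$ keeps the smallest eigenvalue away from zero with high probability.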

The proof of Lemma 4.8 implies the following corollary, which complements the inequality (4.5).

Corollary 4.8.1.

For each $k=-1,-2,\ldots$ there exists $C_{k}>0$ , independent of $n$ , such that

[TABLE]

The final results that we need from random matrix theory come from [GZ00, Corollary 1.8]

Theorem 4.9.

For any Lipschitz function $f:\mathbb{R}^{+}\to\mathbb{R}$

[TABLE]

for some constants $C_{f,d},c_{f,d}$ .

Corollary 4.9.1.

Let $f$ be a continuous function on $(0,\infty)$ , Lipschitz in a neighborhood of $[(1-\sqrt{d})^{2},(1+\sqrt{d})^{2}]$ , with at most polynomial growth at $0$ and at $\infty$ , then

[TABLE]

for some constants $C_{f,d},c_{f,d}>0$ .

Proof.

Let $[(1-\sqrt{d})^{2}-t,(1+\sqrt{d})^{2}+t]\subset(a,b)$ , $a>0$ such that $f$ is Lipschitz on $[a,b]$ . Then define

[TABLE]

And estimate

[TABLE]

By Theorem 4.9

[TABLE]

Using Theorem 4.6

[TABLE]

By assumption, there exists $p,q>0$ such that

[TABLE]

Following the proof of Lemma 4.8, we find that

[TABLE]

Similarly, using Theorem 4.6 (with the same constants, for convenience)

[TABLE]

So, the last quantity to estimate is

[TABLE]

Then this probability is bounded above by $\mathbb{P}(C_{3}\operatorname{e}^{-c_{3}n}\geq t)$ and

[TABLE]

Suppose $t/C_{3}<1$ . Then $\mathbb{P}(C_{3}\operatorname{e}^{-c_{3}n}\geq t)\leq\mathbbm{1}_{\{n/\log(C_{3}/t)\leq 1/c_{3}\}}\leq\operatorname{e}^{-n/\log(C_{3}/t)}\operatorname{e}^{1/c_{3}}$ . Now $n/\log(C_{3}/t)\geq n\frac{t}{C_{3}}$ because $\frac{t}{C_{3}}\log\frac{C_{3}}{t}\leq 1$ as $x\log x^{-1}\leq 1/\operatorname{e}$ for $0<x\leq 1$ . Hence $\mathbb{P}(C_{3}\operatorname{e}^{-c_{3}n}\geq t)\leq\operatorname{e}^{1/c_{3}}\operatorname{e}^{-nt/C_{3}}$ . However, if $t/C_{3}\geq 1$ , then clearly $\mathbb{P}(C_{3}\operatorname{e}^{-c_{3}n}\geq t)=0$ so $\mathbb{P}(C_{3}\operatorname{e}^{-c_{3}n}\geq t)\leq\operatorname{e}^{1/c_{3}}\operatorname{e}^{-nt/C_{3}}$ for all $t>0$ . The corollary follows by applying Lemma 4.4(2) twice. ∎
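The elementary step at the end of this proof, which converts the deterministic indicator of $\{C_{3}\operatorname{e}^{-c_{3}n}\geq t\}$ into the bound $\operatorname{e}^{1/c_{3}}\operatorname{e}^{-nt/C_{3}}$ , can be checked on a grid. The constants below are hypothetical placeholders for $C_{3},c_{3}$ , chosen only for illustration.

```python
import math

def indicator_tail(n, t, C3, c3):
    """The 'probability' of the deterministic event C3 * exp(-c3 * n) >= t."""
    return 1.0 if C3 * math.exp(-c3 * n) >= t else 0.0

def exponential_bound(n, t, C3, c3):
    """The claimed upper bound exp(1 / c3) * exp(-n * t / C3)."""
    return math.exp(1.0 / c3) * math.exp(-n * t / C3)

C3, c3 = 5.0, 0.3  # hypothetical constants, not taken from the paper
ts = [C3 * 10.0 ** (-j / 4.0) for j in range(40)] + [2.0 * C3, 10.0 * C3]
bound_holds = all(
    indicator_tail(n, t, C3, c3) <= exponential_bound(n, t, C3, c3)
    for n in range(1, 200)
    for t in ts
)

# The auxiliary inequality x * log(1/x) <= 1/e on (0, 1].
grid = [i / 10000.0 for i in range(1, 10001)]
max_xlog = max(x * math.log(1.0 / x) for x in grid)
print(bound_holds, max_xlog, 1.0 / math.e)
```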

Remark 4.10.

One might expect Corollary 4.9.1 to hold for all $t>0$ . For this to indeed be true, $f$ needs to be globally Lipschitz (i.e., Lipschitz on every compact subset of $(0,\infty)$ ) and the dependence of $C_{f,d}$ and $c_{f,d}$ in Theorem 4.9 on $f$ needs to be known.

5. Proofs of the main theorems

Proof of Theorem 3.1.

We first use invariance. It follows that the errors $\|\textbf{e}_{k}\|_{W^{\ell}_{n,\beta,d}}$ , $\textbf{e}_{k}=\textbf{e}_{k}(W,\textbf{b})$ realized in the CGA are invariant under unitary transformations, i.e. for $\tilde{W}=UWU^{*}$

[TABLE]

for any unitary matrix $U$ . This follows because if $p_{k}(\lambda)$ is a polynomial of degree $k$ then (recall $\textbf{x}_{0}=0$ )

[TABLE]

And so, the minimum over $p_{k}\in\mathbb{P}_{k}^{(0)}$ must be the same in both cases. So, by invariance of $W_{n,\beta,d}$ it suffices to solve

[TABLE]
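The invariance used above can be observed directly in floating point. The following sketch runs a textbook conjugate gradient iteration from $\textbf{x}_{0}=0$ on $(W,\textbf{b})$ and on $(UWU^{*},U\textbf{b})$ and compares the $W$ -norm errors; it assumes the invariance takes the form $\|\textbf{e}_{k}(W,\textbf{b})\|_{W}=\|\textbf{e}_{k}(UWU^{*},U\textbf{b})\|_{UWU^{*}}$ , uses the real case with an orthogonal $U$ , and all names and sizes are illustrative.

```python
import numpy as np

def cg_error_norms(W, b, steps):
    """Run `steps` CG iterations from x0 = 0; return ||x - x_k||_W for each k."""
    x_true = np.linalg.solve(W, b)
    x = np.zeros_like(b)
    r = b.copy()          # residual b - W x0 with x0 = 0
    p = r.copy()          # first search direction
    errors = []
    for _ in range(steps):
        Wp = W @ p
        alpha = (r @ r) / (p @ Wp)
        x = x + alpha * p
        r_new = r - alpha * Wp
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
        e = x_true - x
        errors.append(float(np.sqrt(e @ W @ e)))
    return np.array(errors)

rng = np.random.default_rng(0)
n, m = 50, 200
X = rng.standard_normal((n, m))
W = X @ X.T / m                                     # assumed normalization
b = rng.standard_normal(n)
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))    # an orthogonal matrix
errs = cg_error_norms(W, b, 8)
errs_rot = cg_error_norms(Q @ W @ Q.T, Q @ b, 8)
print(np.max(np.abs(errs - errs_rot) / errs))
```

The two error sequences agree up to rounding, which is why the right-hand side may be rotated to a convenient fixed direction.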

We then recall formula (2.6) for $p_{k}^{\dagger}(\lambda)$ with $T_{k}=T_{k}(W_{n,\beta,d},\textbf{b}_{0})$

[TABLE]

where

[TABLE]

Here the distribution of $\boldsymbol{\omega}$ is parameterized by (see Appendix A)

[TABLE]

where $\boldsymbol{\nu}$ is a vector of iid $\chi_{\beta}$ -squared random variables. The variable $\boldsymbol{\nu}$ is the square of the first components of the eigenvectors of $W_{n,\beta,d}$ . It is well-known that the eigenvalues and eigenvectors of $W_{n,\beta,d}$ are independent. But, $T_{k}$ is dependent on both the eigenvalues and eigenvectors.

Lemma 5.1.

For $n>0$ ,

[TABLE]

Proof.

We begin with a simple observation

[TABLE]

By Lemma 4.1 it follows that for $q>0$ , $\mathbb{E}[t_{j}^{q}]\leq C_{j,q}$ , where the bound is independent of $n$ . Similarly

[TABLE]

regardless of if $q$ is positive or negative, see (4.5) and (4.11). Therefore, it suffices to show that

[TABLE]

Because of (2.5) we have

[TABLE]

where these chi-distributed random variables are independent. Then repeated use of the identity

[TABLE]

gives

[TABLE]

The first term tends to zero in any $L^{p}$ norm by Lemma 4.1 at a rate $n^{-1/2}$ , and the second term is bounded uniformly in any $L^{p}$ norm. ∎
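The $n^{-1/2}$ rate for normalized chi variables, on which this proof relies, can be seen in simulation. The sketch assumes that the relevant statement is $\chi_{N}/\sqrt{N}\to 1$ in $L^{2}$ with error of order $N^{-1/2}$ ; the sample count and degrees of freedom are illustrative.

```python
import numpy as np

def chi_relative_rms(dof, samples=20000, seed=0):
    """Monte Carlo estimate of sqrt(E[(chi_dof / sqrt(dof) - 1)^2])."""
    rng = np.random.default_rng(seed)
    chi = np.sqrt(rng.chisquare(dof, size=samples))
    return float(np.sqrt(np.mean((chi / np.sqrt(dof) - 1.0) ** 2)))

rms = {dof: chi_relative_rms(dof) for dof in (100, 10000)}
for dof, value in rms.items():
    # The rescaled column should stay roughly constant as dof grows.
    print(dof, value, value * np.sqrt(dof))
```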

Next, we argue that while the measure is still random, we can replace the integrand with a deterministic one.

Lemma 5.2.

For $n>0$ ,

[TABLE]

Proof.

Write $\det(\mathbb{T}_{k}-\lambda I)^{2}=\sum_{j=0}^{2k}\tau_{j}\lambda^{j}$ . Using the notation of (5.2), it suffices to show that $|t_{j}-\tau_{j}|\to 0$ in $L^{2}$ at a rate $n^{-1/2}$ . Consider the product

[TABLE]

where $p_{j},q_{j}\in\{0,1,2,3,4\}$ and $0\leq d_{j},s_{j}\leq k$ where the chi random variables are independent. Using (5.3) we write

[TABLE]

where $\chi_{j}^{(1)}=\chi_{\beta(n-d_{j})}^{p_{j}}/(\beta dm)^{p_{j}/2}$ and $\chi_{j}^{(2)}=\chi_{\beta(n-s_{j})}^{q_{j}}/(\beta m)^{q_{j}/2}$ . Then

[TABLE]

Using (5.3) and Lemma 4.1 it follows that $\mathbb{E}\left[\left(\chi_{j}^{(1)}\chi_{j}^{(2)}-1\right)^{4}\right]^{1/2}=O(n^{-2})$ . Then for $n>0$

[TABLE]

where $p=\sum_{j}p_{j}$ and $q=\sum_{j}q_{j}$ . This follows from again using Lemma 4.1, which implies that all $L^{p}$ norms of $\chi_{j}^{(i)}$ are bounded as $n\to\infty$ . Then, one notes that the $L^{2}$ norm of $t_{j}-\tau_{j}$ can be bounded by a sum of terms of the form (5.4). This establishes this lemma.

∎
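The replacement of the random tridiagonal matrix by a deterministic one can be illustrated with a bidiagonal chi model in the spirit of [DE 02]. The sketch below assumes that $H$ is lower bidiagonal with independent entries $h_{jj}\sim\chi_{\beta(m-j)}$ and $h_{j,j-1}\sim\chi_{\beta(n-j)}$ , and that the deterministic limit of the leading block of $HH^{T}/(\beta m)$ is tridiagonal with diagonal $1,1+d,1+d,\ldots$ and off-diagonal $\sqrt{d}$ ; both are assumptions of this illustration rather than statements quoted from the text.

```python
import numpy as np

def leading_block(n, m, size, beta=1, seed=0):
    """Top-left size x size block of H H^T / (beta m) for the assumed
    lower-bidiagonal chi model. Only the first `size` rows of H matter."""
    rng = np.random.default_rng(seed)
    j = np.arange(size)
    diag = np.sqrt(rng.chisquare(beta * (m - j)))        # h_{jj}
    sub = np.sqrt(rng.chisquare(beta * (n - j[1:])))     # h_{j,j-1}
    H = np.diag(diag) + np.diag(sub, -1)
    return H @ H.T / (beta * m)

def limiting_block(d, size):
    """Assumed deterministic limit: tridiagonal, first diagonal entry 1."""
    T = (1 + d) * np.eye(size) + np.sqrt(d) * (np.eye(size, k=1) + np.eye(size, k=-1))
    T[0, 0] = 1.0
    return T

n, m, size = 20000, 40000, 4
deviation = float(np.max(np.abs(leading_block(n, m, size) - limiting_block(n / m, size))))
print(deviation)  # expected to be of order n**(-1/2)
```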

Lemma 5.3.

For $n>1$ ,

[TABLE]

Proof.

Write $f(\lambda)=\lambda^{\ell-2}{\det(\mathbb{T}_{k}-\lambda I)^{2}}$ and integrate by parts

[TABLE]

where $F_{\beta,d}(x)=\mu_{n,\beta,d}((-\infty,x])-\nu_{n,\beta,d}((-\infty,x])$ . Therefore

[TABLE]

Therefore

[TABLE]

by the independence of eigenvalues and eigenvectors ( $d_{\mathrm{KS}}(\mu_{n,\beta,d},\nu_{n,\beta,d})$ is independent of the eigenvectors). Then, we just note that there exist powers $p,q\geq 0$ such that

[TABLE]

and therefore

[TABLE]

is bounded uniformly in $n$ by (4.5) and (4.11). The lemma follows from Lemma 4.5. ∎

These three lemmas combined with Lemma 4.8 establish Theorem 3.1(1).

For the second part, we again establish a series of lemmas.

Lemma 5.4.

For $n\geq 0$

[TABLE]

for a non-decreasing function $g(t)$ that satisfies $g(t)>0$ for $t>0$ , and for some constant $C>0$ .

Proof.

For $t\geq 0$ , let $\Lambda_{d}(C)$ be the event on which $C^{-1}\leq\lambda_{n}\leq\lambda_{1}\leq C$ for $C>(1+\sqrt{d})^{2}$ and $1/C<(1-\sqrt{d})^{2}$ . Then

[TABLE]

where $g_{d}(C)>0$ . This follows from (4.4). Now, we make two elementary observations about

[TABLE]

Recall (2.3) and it follows that $\tau_{j}=\tau_{j}(H_{n,\beta,d}/\sqrt{\beta m})$ is a Lipschitz function of the entries $(h_{ij})_{i\geq j}$ of $H_{n,\beta,d}/\sqrt{\beta m}$ in any closed $\epsilon$ -neighborhood $0<\epsilon<1$ of $\mathbb{H}_{d}$ in the max norm (the max norm gives the maximum entry, in modulus) on lower-triangular matrices. Let $L_{\epsilon,j}$ be the Lipschitz constant. The second observation is to let $Z_{d}(t)$ be the event where

[TABLE]

By Lemma 4.1, for $0<t\leq\epsilon\leq 1$ , $\mathbb{P}(Z_{d}(t))\geq 1-C_{k,d}\operatorname{e}^{-nc_{k,d}}$ for some constants $C_{k,d},c_{k,d}>0$ . Therefore

[TABLE]

The lemma follows. ∎

Lemma 5.5.

For $n\geq 0$

[TABLE]

Proof.

Recalling the notation $\Lambda_{d}(C)$ of the proof of the previous lemma, we then define the event

[TABLE]

Using the notation of (4.3)

[TABLE]

Then

[TABLE]

and then we find for a constant $C_{k}>0$

[TABLE]

Therefore

[TABLE]

and this establishes the lemma. ∎

Applying Corollary 4.9.1 along with these two lemmas establishes Theorem 3.1(2).

For the case of $d=1$ and $\ell\geq 2$ no inverse powers of $\lambda$ will be encountered in any integral. So, the fact that Lemma 4.8 applies only for $k\geq 0$ is not an issue. Theorem 4.6 holds for $d=1$ , Theorem 4.9 indeed holds for $d=1$ and Corollary 4.9.1 holds for $d=1$ provided the function $f$ is Lipschitz continuous at $\lambda=0$ . And Lemmas 5.1, 5.2, 5.3, 5.4 and 5.5 hold for $d=1$ provided $\ell\geq 2$ . ∎

Proof of Theorem 3.2.

To evaluate

[TABLE]

we make a simple change of variable $\lambda=\frac{d_{+}-d_{-}}{2}x+\frac{d_{+}+d_{-}}{2}=2x\sqrt{d}+1+d$ so that

[TABLE]

Then examine

[TABLE]

Next, we define $\det D_{0,d}(x)=1$ and compute

[TABLE]

Note that (5.7) is the recurrence relation for the Chebyshev polynomials $T_{n}$ and $U_{n}$ of the first and second kinds. We need some elementary properties of $T_{n}$ and $U_{n}$ (see, e.g. [MH03]):

[TABLE]

where the ′ denotes that the $j=0$ term is halved. From the last equality it follows by differentiation that

[TABLE]

Recalling (5.6) with $\ell=1$ we have

[TABLE]

Matching initial conditions for $D_{k,d}$ at $k=1$ and $k=2$ we obtain

[TABLE]

Therefore

[TABLE]

Continuing,

[TABLE]

and this gives

[TABLE]

For $\mathfrak{e}_{2,k,d}$ , we use $T_{k}(x)=\frac{1}{2}U_{k}(x)-\frac{1}{2}U_{k-2}(x)$ for $k\geq 1$ and $U_{0}(x)=T_{0}(x)$ to find

[TABLE]

Then

[TABLE]

And this gives

[TABLE]

For $\mathfrak{e}_{3,k,d}$ we find

[TABLE]

and this gives

[TABLE]

Lastly, one can use the bound $|U_{k}(x)|\leq k+1$ for $|x|\leq 1$ to see that $\mathfrak{e}_{l,k,d}\to 0$ as $k\to\infty$ provided $0<d<1$ .

∎
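The Chebyshev facts used in this proof can be verified numerically. The sketch below generates $T_{k}$ and $U_{k}$ from the common three-term recurrence, checks the identity $T_{k}=\frac{1}{2}U_{k}-\frac{1}{2}U_{k-2}$ and the standard bound $|U_{k}(x)|\leq k+1$ on $[-1,1]$ , and confirms that the change of variable $\lambda=2x\sqrt{d}+1+d$ maps $[-1,1]$ onto $[(1-\sqrt{d})^{2},(1+\sqrt{d})^{2}]$ . The grid, the maximal degree and the value of $d$ are illustrative.

```python
import numpy as np

def chebyshev_T_U(kmax, x):
    """Lists [T_0..T_kmax] and [U_0..U_kmax] evaluated at x, built from
    P_{k+1}(x) = 2 x P_k(x) - P_{k-1}(x) with T_1 = x and U_1 = 2x."""
    x = np.asarray(x, dtype=float)
    T = [np.ones_like(x), x.copy()]
    U = [np.ones_like(x), 2 * x]
    for _ in range(2, kmax + 1):
        T.append(2 * x * T[-1] - T[-2])
        U.append(2 * x * U[-1] - U[-2])
    return T[: kmax + 1], U[: kmax + 1]

kmax = 12
x = np.linspace(-1.0, 1.0, 2001)
T, U = chebyshev_T_U(kmax, x)

# Identity T_k = (U_k - U_{k-2}) / 2 for k >= 2.
identity_gap = max(
    float(np.max(np.abs(T[k] - 0.5 * (U[k] - U[k - 2])))) for k in range(2, kmax + 1)
)
# Bound |U_k(x)| <= k + 1 on [-1, 1], with equality at x = 1.
bound_ok = all(float(np.max(np.abs(U[k]))) <= k + 1 + 1e-9 for k in range(kmax + 1))

d = 0.25
lam = 2 * x * np.sqrt(d) + 1 + d   # change of variable from the proof
print(identity_gap, bound_ok, lam[0], lam[-1])
```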

Appendix A The eigenvalues and eigenvectors of Wishart matrices

Let $W=W_{n,\beta,d}=U\Lambda U^{*}$ , $U^{*}U=I$ . It is an important fact that the joint distribution of the vector

[TABLE]

can be parameterized by

[TABLE]

where $\boldsymbol{\nu}$ is a vector of iid $\chi_{\beta}^{2}$ random variables. For the convenience of the reader we now derive (A.1).

Definition A.1.

$O(n)$ (resp., $U(n)$ ) denotes the group of $n\times n$ orthogonal (resp., unitary) matrices.

We recall some general facts about Haar measure (see, e.g., [Fol99]).

Theorem A.2.

Let $G$ be a locally compact Hausdorff topological group. Then $G$ has a left invariant measure $\mu$ (i.e., a left Haar measure) and a right invariant measure $\nu$ (i.e., a right Haar measure) on the $\sigma$ -algebra generated by all open subsets of $G$ . The measures are unique up to a positive multiplicative constant.

For a Borel set $S$ , let $S^{-1}$ be the set of inverses of $S$ . Define

[TABLE]

Then it is easy to see that $\mu_{-1}$ is a right Haar measure. Thus by uniqueness,

[TABLE]

for some $k>0$ . Now, on the other hand, the left translate of a right invariant measure is still right invariant. Thus, for all $g$ , by uniqueness,

[TABLE]

for some positive scaling factor $\Delta(g)$ , the modular function. $\Delta(g)$ is a continuous group homomorphism into the multiplicative group of positive numbers. A group is called unimodular if $\Delta(g)\equiv 1$ . Clearly, it follows from (A.3) that $G$ is unimodular if and only if Haar measure is both left and right invariant, i.e., $\nu(S)=k^{\prime}\mu(S)$ , $k^{\prime}>0$ . There are many examples of unimodular groups: most importantly for us, compact groups (such as $O(n)$ or $U(n)$ ) are unimodular.

If $G$ is unimodular, it follows from (A.2) that

[TABLE]

Setting $S=G$ in (A.4) we find $kk^{\prime}=1$ . Hence

[TABLE]

In particular for $U(n)$ , we see that

[TABLE]

and for $O(n)$

[TABLE]

We proceed to show that (A.1) holds if $U$ is distributed according to Haar measure on $O(n)$ or $U(n)$ . In order to construct Haar measure on $O(n)$ (or $U(n)$ resp.) let $X$ be an $n\times n$ matrix of iid real (or complex, resp.) standard normal random variables. In such a setting we say that $X$ belongs to the real (or complex, resp.) Ginibre ensemble. Then the QR decomposition of $X$ gives

[TABLE]

The $QR$ decomposition is unique if $X$ is non-singular. Now the Ginibre ensemble is clearly invariant under multiplication on the left by a matrix $G\in O(n)$ (or $G\in U(n)$ , resp.),

[TABLE]

So, the pair $(GQ,R)$ gives the QR decomposition of $GX\overset{\text{dist.}}{=}X$ . Thus $Q\overset{\text{dist.}}{=}GQ$ . Hence $X\mapsto Q$ induces Haar measure on $Q$ . But

[TABLE]

implying

[TABLE]

By (A.7) it follows that

[TABLE]

Remark A.3.

From (A.9) we see that the components of the “first” eigenvector are proportional to independent $\chi_{\beta}$ variables. But there is no “first” eigenvector: this should be the distribution for the components of any one eigenvector! This follows by just reordering the eigenvalues.
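The QR construction of Haar measure described above can be sketched in the real case. The code fixes the sign ambiguity by making the diagonal of $R$ positive, which is a standard convention and an assumption about the normalization intended here; it then checks that the first column of $Q$ is the normalized first column of the Ginibre matrix $X$ , so that its squared entries have the form $\nu_{j}/\sum_{i}\nu_{i}$ with $\nu_{j}$ iid $\chi_{1}^{2}$ . Function and variable names are illustrative.

```python
import numpy as np

def haar_orthogonal(n, rng):
    """Sample Q from the QR factorization of a real Ginibre matrix X,
    with column signs fixed so that R has a positive diagonal."""
    X = rng.standard_normal((n, n))
    Q, R = np.linalg.qr(X)
    signs = np.sign(np.diag(R))
    return Q * signs, X   # multiplies column j of Q by signs[j]

rng = np.random.default_rng(0)
n = 6
Q, X = haar_orthogonal(n, rng)

first_col = X[:, 0] / np.linalg.norm(X[:, 0])
weights = Q[:, 0] ** 2     # equals X[:, 0]**2 / sum(X[:, 0]**2)
print(np.max(np.abs(Q[:, 0] - first_col)), weights.sum())
```

By the reordering argument of the remark, the same distribution holds for the squared components of any single column.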

To establish (A.1), we now show that the eigenvectors of $W=XX^{*}$ are Haar distributed on $U(n)$ when the entries of $X$ are iid standard complex normal random variables. Here $X$ is $n\times m$ , $m\geq n$ . Similarly, the eigenvectors of $W=XX^{T}$ , $X$ is $n\times m$ , $m\geq n$ , are Haar distributed on $O(n)$ , where the entries of $X$ are iid standard (real) normal random variables.

We follow the argument in [For10].

Step 1

We consider the complex case, $\beta=2$ . The case $\beta=1$ is similar. Apply the QR decomposition to $X^{*}$ to obtain $X^{*}=U_{1}T$ where $U_{1}$ is $m\times n$ and $T$ is $n\times n$ , $U_{1}^{*}U_{1}=I_{n}$ , $T$ is upper triangular, $T_{ii}>0$ .

We use the following notation: if

[TABLE]

is the matrix of differentials of $Y$ , then $(\mathrm{d}Y)$ denotes the wedge product of independent entries of $Y$ ; e.g. if $Y$ is real symmetric then

[TABLE]

Then one finds (see [For10, Proposition 3.2.5]) that

[TABLE]

Step 2

From $W=T^{*}T$ , we find (see [For10, Proposition 3.2.6])

[TABLE]

Step 3

Substituting (A.11) into (A.10) we find

[TABLE]

and so

[TABLE]

Integrating out the independent variables $U_{1}$ we finally arrive at the pdf for $W$

[TABLE]

for some normalization constant $C_{n,m}>0$ .

Step 4

Repeating the computation in the $\beta=1$ case, we find the general formula for the pdf of $W$ for $\beta=1,2$ , and some constant $C_{n,m,\beta}$

[TABLE]

Step 5

Now we use the standard computation for $(\mathrm{d}W)$ when it is Hermitian ( $\beta=1,2$ ): For the spectral decomposition ( $\beta=2$ ) $W=Q\Lambda Q^{*}$ , we find (see [For10, (1.11)])

[TABLE]

and for $\beta=1$ , $W=Q\Lambda Q^{T}$

[TABLE]

Inserting (A.15), (A.14) into (A.13) we find the pdf of $W$

[TABLE]

where $Q^{*}=Q^{T}$ for $\beta=1$ and $V(\lambda)$ is the Vandermonde for $\lambda_{1},\ldots,\lambda_{n}$ .

Finally we see that the singular values of $X$ and the singular vectors for $X$ are independent. As $Q^{*}\mathrm{d}Q$ is left (and hence, right) invariant, we see that $Q^{*}\mathrm{d}Q$ is Haar measure and hence the eigenvectors of $W=XX^{*}$ are Haar distributed. Therefore (A.1) follows.

Bibliography

[AF03] M. J. Ablowitz and A. S. Fokas, Complex Variables: Introduction and Applications, second ed., Cambridge University Press, 2003.
[BKYY16] A. Bloemendal, A. Knowles, H.-T. Yau, and J. Yin, On the principal components of sample covariance matrices, Probability Theory and Related Fields 164 (2016), no. 1-2, 459–552.
[BMP07] Z. D. Bai, B. Q. Miao, and G. M. Pan, On asymptotics of eigenvectors of large sample covariance matrix, Annals of Probability 35 (2007), no. 4, 1532–1572.
[BS10] Z. Bai and J. W. Silverstein, Spectral Analysis of Large Dimensional Random Matrices, Springer Series in Statistics, Springer New York, New York, NY, 2010.
[BY93] Zhidong Bai and Y. Q. Yin, Limit of the Smallest Eigenvalue of a Large Dimensional Sample Covariance Matrix, Annals of Probability 21 (1993), no. 3, 1275–1294.
[DE02] I. Dumitriu and A. Edelman, Matrix models for beta ensembles, Journal of Mathematical Physics 43 (2002), no. 11, 5830.
[DLT85] P. Deift, L. C. Li, and C. Tomei, Toda flows with infinitely many variables, Journal of Functional Analysis 64 (1985), no. 3, 358–402.
[DMOT14] P. A. Deift, G. Menon, S. Olver, and T. Trogdon, Universality in numerical computations with random data, Proceedings of the National Academy of Sciences of the United States of America 111 (2014), no. 42, 14973–8.
