The Local Ledoit-Peche Law
Van Latimer, Benjamin D. Robinson

TL;DR
This paper refines the Ledoit-Peche law by establishing an optimal convergence rate for functions of random covariance matrices, with implications for improved shrinkage covariance estimation.
Contribution
It provides an essentially optimal convergence rate for the Ledoit-Peche law, advancing understanding of covariance matrix estimators in high-dimensional statistics.
Findings
Established an optimal convergence rate for the Ledoit-Peche law
Hypothesized the rate to be the minimal possible for MV loss
Implications for improved shrinkage covariance estimation
Abstract
Ledoit and Peche proved convergence of certain functions of a random covariance matrix's resolvent; we refer to this as the Ledoit-Peche law. One important application of their result is shrinkage covariance estimation with respect to so-called Minimum Variance (MV) loss, discussed in the work of Ledoit and Wolf. We provide an essentially optimal rate of convergence and hypothesize it to be the smallest possible rate of excess MV loss within the shrinkage class.
Taxonomy
Topics: Random Matrices and Applications · Matrix Theory and Algorithms · Blind Source Separation Techniques
The Local Ledoit-Péché Law
Van Latimer and Benjamin D. Robinson
Abstract.
Ledoit and Péché, in [12], proved convergence of certain functions of a random covariance matrix’s resolvent; we refer to this as the Ledoit-Péché law. One important application of their result is shrinkage covariance estimation with respect to so-called Minimum Variance (MV) loss, discussed in the work of Ledoit and Wolf [13]. We provide an essentially optimal rate of convergence and hypothesize it to be the smallest possible rate of excess MV loss within the shrinkage class.
1. Introduction
Let be an matrix; we assume that converges to a limit as both and tend to (although this may be relaxed). Let
[TABLE]
with , be an real symmetric or complex Hermitian positive definite matrix together with its eigendecomposition.
We will make the following assumption about the “training-data” matrix .
Assumption 1.
is diagonal and the matrix has i.i.d. columns .
As is common in Random Matrix Theory, the dimensions and of and , and of most every other matrix we will study, are assumed to go to infinity. Thus practically every major quantity of interest is a sequence of quantities, and all properties that we desire to study are those which emerge in the large dimensional limit. We therefore always, except perhaps when special emphasis is needed, suppress the dependence of matrices and functions thereof on the dimensions , , etc. We consider the sample covariance matrix
[TABLE]
We let
[TABLE]
with be its spectral decomposition.
It is a problem of great theoretical and practical interest to understand how the properties of the sample covariance matrix relate to properties of the population covariance matrix .
Random Matrix Theory has had great success in the last decade in getting very fine control of random matrices by way of their resolvent
[TABLE]
This was first done by Marčenko and Pastur ([19]). Their approach was to show that the trace of the resolvent of a random matrix approximately satisfies a self-consistent equation and then to reason that the trace of the resolvent, which is also the Stieltjes transform of the empirical eigenvalue measure, must be close to the true solution of the self-consistent equation. This has remained a popular and powerful technique.
One finds that the resolvent of can be written as
[TABLE]
Ledoit and Péché, in their paper [12], consider functions of the form
[TABLE]
for some function with finitely many discontinuities, which amounts to a weighting of the spectral decomposition of the resolvent; components of the resolvent in different eigendirections of the population covariance matrix are weighted according to the value of applied to the associated eigenvalue of the population covariance matrix. may be simplified as follows:
[TABLE]
Here we recall that for a function and a diagonal matrix , we define a matrix by
[TABLE]
and for a real symmetric or complex Hermitian matrix with spectral decomposition , we define
[TABLE]
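As a numerical illustration of the definitions above, the following sketch checks that the weighted resolvent trace can be computed either from the resolvent directly or from the two spectral decompositions. It is only a sketch under stated assumptions: the normalization of the sample covariance matrix by the number of columns `N`, the form (1/M) Tr[g(Σ)(S − z)^{-1}] of the functional (as we understand the definition in [12]), and all function and variable names are choices made for this example.

```python
import numpy as np

def apply_to_hermitian(g, A):
    """Functional calculus: g(A) = V g(D) V^* for Hermitian A = V D V^*."""
    evals, evecs = np.linalg.eigh(A)
    return (evecs * g(evals)) @ evecs.conj().T

def theta_trace(g, Sigma, S, z):
    """Weighted resolvent trace (1/M) Tr[ g(Sigma) (S - z)^{-1} ] (assumed form)."""
    M = S.shape[0]
    resolvent = np.linalg.inv(S - z * np.eye(M))
    return np.trace(apply_to_hermitian(g, Sigma) @ resolvent) / M

def theta_spectral(g, Sigma, S, z):
    """The same quantity written through the two eigendecompositions."""
    tau, V = np.linalg.eigh(Sigma)           # population eigenpairs
    lam, U = np.linalg.eigh(S)               # sample eigenpairs
    overlaps = np.abs(U.conj().T @ V) ** 2   # entry (i, j) is |u_i^* v_j|^2
    weights = overlaps @ g(tau)              # sum_j |u_i^* v_j|^2 g(tau_j)
    return np.sum(weights / (lam - z)) / len(lam)

rng = np.random.default_rng(0)
M, N = 50, 100                               # hypothetical dimensions
tau = np.linspace(1.0, 3.0, M)
Sigma = np.diag(tau)                         # diagonal population covariance
X = rng.standard_normal((M, N))
sqrt_Sigma = np.diag(np.sqrt(tau))
S = sqrt_Sigma @ X @ X.T @ sqrt_Sigma / N    # assumed normalization by N
z = 1.5 + 0.1j

indicator = lambda t: (t <= 2.0).astype(float)   # g with a single discontinuity
print(theta_trace(indicator, Sigma, S, z))
print(theta_spectral(indicator, Sigma, S, z))
```

Taking g identically equal to 1 in either function returns the average of 1/(λ_i − z) over the sample eigenvalues, that is, the Stieltjes transform of the empirical eigenvalue measure.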
The case is of course of interest. In this case we have
[TABLE]
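where for a measure $\mu$ we define the Stieltjes transform $m_\mu$ by
$$m_\mu(z) = \int \frac{d\mu(x)}{x - z}, \qquad z \in \mathbb{C}^+,$$
and where by $\mathbb{C}^+$ we denote the complex upper half-plane
$$\mathbb{C}^+ = \{ E + i\eta : E \in \mathbb{R},\ \eta > 0 \}.$$
We note that $E$ and $\eta$ are usually the way we will denote the real and imaginary parts of a complex argument to a Stieltjes transform.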
The quantity (1.10) has been deeply understood for a fairly general class of matrices, which we will detail soon.
We will also be particularly interested in the case that is the identity, i.e., . In this case, has another simplification:
[TABLE]
where
[TABLE]
The reason this case of is of particular interest to us is that the quantities
[TABLE]
are precisely the quantities which describe the Frobenius-norm optimal rotation equivariant shrinkage estimator for the population covariance matrix, as shown in [12]. Another result of the same paper was bounding close to a deterministic measure, although they did not provide rates of convergence. In this paper we improve their result by providing an essentially optimal rate of convergence.
Definition 1.1 (Shrinkage Estimators and Loss Function).
Given a realization of the sample covariance matrix , we define a (rotation equivariant) shrinkage estimator for the population covariance matrix via
[TABLE]
for some diagonal matrix . That is, we estimate the from by keeping the eigenvectors and changing the eigenvalues, presumably “shrinking” them since has the tendency to “spread out” the eigenvalues of .
To measure the success of we define the loss function
[TABLE]
Here stands for minimum variance, and “ represents the true variance of the linear combination of the original variables that has the minimum estimated variance.” (See [13] for the quote and for more discussion of the suitability of this loss function.)
Lemma 1.2.
With respect to , the optimal shrinkage estimator is given when , where is a scaling constant that we will take to be 1.
Definition 1.3.
We define
[TABLE]
and we call the shrinkage oracle.
Remark 1.4.
As noted above, the same choice of is optimal for a Frobenius norm loss function [12].
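The following sketch illustrates Definition 1.1 and Lemma 1.2 numerically. It assumes the MV loss in the form given by Ledoit and Wolf, namely [Tr(Σ̂^{-1} Σ Σ̂^{-1})/p] / [Tr(Σ̂^{-1})/p]^2 − 1/[Tr(Σ^{-1})/p], and takes the oracle eigenvalues to be the quadratic forms u_i^* Σ u_i; the dimensions, the Gaussian data, the normalization by `N`, and all names are hypothetical choices for this example.

```python
import numpy as np

def mv_loss(Sigma_hat, Sigma):
    """Minimum Variance loss, in the Ledoit-Wolf form assumed for this sketch."""
    p = Sigma.shape[0]
    inv_hat = np.linalg.inv(Sigma_hat)
    numerator = np.trace(inv_hat @ Sigma @ inv_hat) / p
    denominator = (np.trace(inv_hat) / p) ** 2
    return numerator / denominator - 1.0 / (np.trace(np.linalg.inv(Sigma)) / p)

def shrinkage_estimator(U, d):
    """Rotation equivariant estimator U diag(d) U^*: keep eigenvectors, replace eigenvalues."""
    return (U * d) @ U.conj().T

rng = np.random.default_rng(1)
M, N = 40, 80                                   # hypothetical dimensions
tau = np.linspace(0.5, 4.0, M)
Sigma = np.diag(tau)                            # diagonal population covariance
X = rng.standard_normal((M, N))
S = np.sqrt(tau)[:, None] * (X @ X.T) * np.sqrt(tau)[None, :] / N
lam, U = np.linalg.eigh(S)

# Oracle eigenvalues d_i = u_i^* Sigma u_i; they require Sigma, so they are
# not available from data alone.
d_oracle = np.einsum("ji,jk,ki->i", U.conj(), Sigma, U).real

loss_sample = mv_loss(S, Sigma)
loss_oracle = mv_loss(shrinkage_estimator(U, d_oracle), Sigma)
print(loss_sample, loss_oracle)
```

Under these assumptions the oracle loss is nonnegative and no larger than the loss of the sample covariance matrix, and rescaling `d_oracle` by a positive constant leaves the loss unchanged, which is why the scaling constant in Lemma 1.2 may be taken to be 1.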
The optimal shrunken eigenvalues are experimentally unavailable to us, so for statistical purposes Lemma 1.2 is of limited use to us. However, just as was done for , in [2], namely, bounding it optimally close to a deterministic limit, we will do for .
Given our random matrix with independent entries and our population covariance matrix , we define the resolvent, or Green function introduced in [2]:
[TABLE]
The reader accustomed to Random Matrix Theory will note that this is not the usual definition of the resolvent. It is however an important realization made in [2] that the more familiar resolvent
[TABLE]
can be neatly gotten from , as well as the related resolvent
[TABLE]
and that this single matrix containing both resolvents unlocks powerful tools for studying resolvent estimates developed in the context of Wigner matrices. In fact, may be decomposed in block form as
[TABLE]
where the blocks labeled are not of interest to us currently. The conjugation of by is not of great importance to the authors of [2], but it is very fortunate for us, the reason being that , our object of greatest interest, is precisely
[TABLE]
by the invariance of under cyclical permutation. Let us make a few more definitions and then quote a result of [2].
We define the population spectral measure, or PSM, of by
[TABLE]
This is of course just the probability measure which places equal weight at each of ’s eigenvalues, counted with multiplicity.
We also define the following notion of size for random variables, introduced in [20], which has proven very helpful for formulating results in RMT.
Definition 1.5 (Stochastic Domination).
Given two sequences of random variables and (note that we again suppress that quantities of interest are sequences in ), we say that stochastically dominates , or that , if for any (small) , (large) , and sufficiently large , we have
[TABLE]
A little more notation: we define the important function as the unique such value solving
[TABLE]
for .
We will list some things that we know about .
- The equation (1.26), which we have taken as the definition of , is the one which allows us to make the connection between the different definitions of in the papers [12] and [2].
- is also the unique solution to , where
[TABLE]
- exists and is given by
[TABLE]
where is the Radon-Nikodym derivative of with respect to , and where is the Hilbert transform (see [13]). We mention Hilbert transforms because [13] presents it this way and explains how the presence of the Hilbert transform provides a theoretical explanation for the phenomenon of eigenvalue shrinkage, but we will not heavily use the Hilbert transform in our treatment; we will use an equation from [2] which satisfies to get control of ’s real and imaginary parts directly.
At this time, let us also define the shrinkage function
[TABLE]
This function appeared first in a slightly different form in [12] and then in its stated form in [13]; in both cases it is useful to us as an approximator to the values which describe the optimal shrinkage estimator.
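As a sanity check on the role of the shrinkage function, consider the special case of an identity population covariance matrix, where every oracle value u_i^* Σ u_i equals 1. The sketch below assumes the shrinkage function in the form x / |1 − c − c x m(x)|^2 associated with [12], with c the ratio of dimension to sample size and m the Stieltjes transform of the Marchenko-Pastur law; the closed-form quadratic for m, the small imaginary part `eta`, and all names are assumptions made for this example.

```python
import numpy as np

def mp_stieltjes(z, c):
    """Stieltjes transform of the Marchenko-Pastur law with ratio c (identity covariance).

    Solves c z m^2 + (z + c - 1) m + 1 = 0 and returns the root with the
    larger imaginary part, which lies in the upper half-plane.
    """
    disc = np.sqrt((z + c - 1) ** 2 - 4 * c * z + 0j)
    roots = [(-(z + c - 1) + sign * disc) / (2 * c * z) for sign in (1, -1)]
    return max(roots, key=lambda m: m.imag)

def shrinkage_function(x, c, eta=1e-9):
    """Assumed form x / |1 - c - c x m(x)|^2, with m(x) approximated by m(x + i eta)."""
    z = x + 1j * eta
    return x / abs(1 - c - c * z * mp_stieltjes(z, c)) ** 2

c = 0.5                                          # hypothetical ratio M / N
left, right = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2
grid = np.linspace(left + 0.1, right - 0.1, 25)  # points inside the bulk
values = np.array([shrinkage_function(x, c) for x in grid])
print(values.min(), values.max())
```

In this special case the function is identically 1 on the bulk: every sample eigenvalue, whether it lies above or below 1, is mapped back to the common population eigenvalue.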
Remark 1.6.
We note that there is a small discrepancy between our definition of and the definition of in the context of [12] for . However, this discrepancy only amounts to how the limiting empirical spectral measure weights 0, and thus can be easily accounted for.
Theorem 1.7 (Informal Statement of [2]’s main result).
Define the matrix
[TABLE]
Then element by element, is very close to .
Theorem 1.8 (Slightly more Formal Statement of part of Theorem 1.7).
If ’s population spectral measure satisfies some mild regularity constraints, then
[TABLE]
This result is essentially optimal (up to the definitions in ) and cannot be obtained naively. The paper also provides essentially optimal bounds on individual resolvent elements; individual diagonal entries of are themselves close to , but the difference is of an order in the bulk spectrum; this means that in averaging the diagonal elements to get the normalized trace, there is a fair bit of cancellation between different diagonal elements, as there is between independent random variables. The task of finding the “parts” of the random variables which are independent of one another and thus provide this cancellation is the content of a “Fluctuating Averaging Lemma” in random matrix theory.
A corollary of this result is the “Marchenko-Pastur law on small scales”:
Corollary 1.8.1.
For any interval , we have
[TABLE]
This corollary is important in that it captures the fact that statements about Stieltjes transforms of measures, which we have in great strength thanks to the techniques of [2], can be translated into statements about the measures themselves. In other words, the empirical eigenvalue distribution is given very accurately by a certain deterministic distribution, even on very fine scales: it says “even in intervals which are predicted to contain only eigenvalues of according to the deterministic measure , we do have that the prediction is correct to leading order with very high probability.”
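The content of such a small-scale statement can be seen in a simple simulation. The sketch below treats only the special case of an identity population covariance matrix with Gaussian data, where the deterministic measure is the classical Marchenko-Pastur law; the dimensions, the interval, the midpoint-rule integration, and all names are assumptions made for this example, and it is not a substitute for the general statement.

```python
import numpy as np

def mp_density(x, c):
    """Marchenko-Pastur density for ratio c = M/N <= 1 and identity covariance."""
    left, right = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2
    inside = np.clip((right - x) * (x - left), 0.0, None)
    return np.sqrt(inside) / (2 * np.pi * c * x)

def mp_mass(a, b, c, n_points=20_000):
    """Midpoint-rule approximation of the Marchenko-Pastur mass of [a, b]."""
    edges = np.linspace(a, b, n_points + 1)
    mids = 0.5 * (edges[:-1] + edges[1:])
    return np.sum(mp_density(mids, c)) * (b - a) / n_points

rng = np.random.default_rng(2)
M, N = 400, 800                                  # hypothetical dimensions
X = rng.standard_normal((M, N))
eigenvalues = np.linalg.eigvalsh(X @ X.T / N)    # assumed normalization by N

a, b = 0.9, 1.1                                  # a short interval in the bulk
observed = int(np.sum((eigenvalues >= a) & (eigenvalues <= b)))
predicted = M * mp_mass(a, b, M / N)
print(observed, predicted)
```

The interval is predicted to hold only a few dozen of the 400 eigenvalues, yet the observed count tracks the prediction closely; shrinking the interval while increasing `M` and `N` probes progressively finer scales.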
What we will first do in this note is adapt the proof of Theorem 1.8 to deal also with the quantity
[TABLE]
which leads us to our first main result:
Theorem 1.9.
If satisfies the same regularity conditions as required for Theorem 1.8, then we have
[TABLE]
Using equation (1.26), we may rewrite the limit as
[TABLE]
Just as the Marchenko-Pastur law on small scales followed from Theorem 1.8, our second main result follows from Theorem 1.9. First we observe that [12, Theorem 4] is equivalent to saying that the function from (1.29) is the Radon-Nikodym derivative of the limiting measure for against the deformed Marchenko-Pastur law. Our second main result adds a rate of convergence to this limiting behavior, as follows:
Corollary 1.9.1.
We have for any interval that
[TABLE]
If is the shrinkage estimator , then the above implies an order of error between and :
[TABLE]
Boundedness and continuity of , together with the Portmanteau theorem, can then be used to show:
[TABLE]
Further, we hypothesize that no bona fide shrinkage estimator can make this error asymptotically smaller, in which case this would be the smallest possible excess MV loss within the shrinkage class, as claimed in the abstract.
2. Relation to Previous Works
This paper is, firstly, a direct successor to the papers [12] and [13] which advances a program established therein using recent advances in RMT. Another important connection is to the papers [18] and [17], which discuss a measure which is related to ours: for a fixed unit vector , they study
[TABLE]
As in our context, they prove that this measure is close to a deterministic limit, which [17] calls (up to adjusting from their context to ours). In particular, they prove
[TABLE]
The earlier paper [18] establishes the optimality of the factor by remarkably establishing the joint asymptotic Gaussian distribution of any different analytic functions integrated against (one way that [17] differs from or improves on [18] is in the very high probability with which the error bounds hold).
We can recover our measure from : indeed, if are the eigenvectors of , then
[TABLE]
Similarly, the limiting deterministic measures satisfy
[TABLE]
So our main result, with the error weakened from to , is a consequence of the results of [17].
The improvement by a factor of that occurs after averaging is reminiscent of the central limit theorem, which hints at a sort of independence between the measures and . Also note that this improvement of does not reflect an improvement of our work over theirs; the error bound of obtained in [17] is optimal, and it is only after averaging over many different deterministic vectors that one sees the improvement. Thus, our work is to their work as an averaged local law is to an entry-wise local law.
Lastly, one should note that an ultimate goal of the program investigated in [17] is to establish the convergence of the CDF of to a Brownian bridge, which would amount to finding some “internal” independence inside in the form of independence between the quantities . The “external” independence that we have hinted at between the measures and is not unrelated to this.
The connections between the papers [18], [17] and ours are perhaps deeper than we have realized; future work will hopefully bring further connections to light.
3. Tools
First, we extend the definition of matrix multiplication to matrices indexed by arbitrary sets.
Definition 3.1.
Let be a finite set for . Let be a matrix, and let be a matrix. We define the matrix product to be a matrix satisfying
[TABLE]
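A minimal sketch of this generalized product may help fix ideas. It assumes the natural reading of Definition 3.1, in which each entry of the product is the sum, over the shared middle index set, of products of entries; the dictionary representation, the labels, and the function name are hypothetical choices for this example.

```python
from itertools import product

def indexed_matmul(A, B, rows, inner, cols):
    """Product of A (indexed by rows x inner) and B (indexed by inner x cols).

    Matrices are dicts keyed by index pairs; the index sets may be any finite
    collections of hashable labels, not necessarily {1, ..., n}.
    """
    return {
        (i, j): sum(A[i, k] * B[k, j] for k in inner)
        for i, j in product(rows, cols)
    }

rows, inner, cols = ["a", "b"], [2, 5, 7], ["x"]
A = {(i, k): (1.0 if i == "a" else 2.0) for i in rows for k in inner}
B = {(k, j): float(k) for k in inner for j in cols}
C = indexed_matmul(A, B, rows, inner, cols)
print(C)
```

The point of the convention is that a minor of a matrix keeps the labels of the surviving rows and columns, so identities relating a matrix to its minors can be stated without re-indexing.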
Definition 3.2.
Let be an invertible matrix and let be its inverse. We define the minor via
[TABLE]
Lemma 3.3 (Resolvent Identities).
Let be an invertible matrix and .
(1) If ,
[TABLE]
(2) If ,
[TABLE]
(3) We have
[TABLE]
Proof.
(1) By the definition of the inverse, it suffices to show that
[TABLE]
But the left-hand side indeed yields
[TABLE]
since we have assumed .
(2) This proof is taken from [16]. Using part , we get
[TABLE]
as desired.
(3) This is an immediate consequence of Schur’s complement formula, wherein if
[TABLE]
then
[TABLE]
provided all the inverses exist. To see how Schur’s complement formula applies, it is helpful to write
[TABLE]
It is of course crucial that we use the correct definition of matrix multiplication here.
∎
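Identities of this kind are easy to check numerically. The sketch below verifies two standard facts for a generic invertible matrix H with inverse G: the relation G_ij = G^(k)_ij + G_ik G_kj / G_kk for i, j different from k, and the Schur complement expression for 1/G_kk. It assumes that the minor G^(k) is the inverse of H with row and column k removed; that convention, the test matrix, and all names are assumptions made for this example.

```python
import numpy as np

def minor_inverse(H, k):
    """Inverse of H with row and column k removed, embedded back with zeros in row/column k."""
    n = H.shape[0]
    keep = [i for i in range(n) if i != k]
    G_minor = np.zeros_like(H)
    G_minor[np.ix_(keep, keep)] = np.linalg.inv(H[np.ix_(keep, keep)])
    return G_minor

rng = np.random.default_rng(3)
n, k = 6, 2
A = rng.standard_normal((n, n))
H = A @ A.T + np.eye(n)      # symmetric positive definite, hence invertible
G = np.linalg.inv(H)
G_k = minor_inverse(H, k)
keep = [i for i in range(n) if i != k]

# G_ij = G^(k)_ij + G_ik G_kj / G_kk for all i, j != k.
lhs = G[np.ix_(keep, keep)]
rhs = G_k[np.ix_(keep, keep)] + np.outer(G[keep, k], G[k, keep]) / G[k, k]

# Schur complement: 1 / G_kk = H_kk - sum_{i, j != k} H_ki G^(k)_ij H_jk.
schur = H[k, k] - H[k, keep] @ G_k[np.ix_(keep, keep)] @ H[keep, k]

print(np.max(np.abs(lhs - rhs)), abs(1.0 / G[k, k] - schur))
```

Both printed discrepancies should be at the level of floating-point round-off.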
Next we apply these general matrix algebra facts to our specific resolvent .
Corollary 3.3.1 (Resolvent Identities for ).
(1) For we have
[TABLE]
(2) If , then
[TABLE]
Similarly if , then
[TABLE]
Lastly if and , then
[TABLE]
(3) We have
[TABLE]
Proof.
These all follow from the lemma; only the correct definition of matrix multiplication must be used, and one should note that many of the lemma’s conclusions are insensitive to the diagonal entries of , so that one may see a
[TABLE]
when one expects to see a
[TABLE]
∎
Let us cite one of the main results of [2].
Theorem 3.4 (Entrywise Local Law).
For any deterministic unit vectors , one has
[TABLE]
where
[TABLE]
4. Proof of Theorem 1.9
First, we prove a partial result. We do not include many details, and only show how section 5 of [2] may be quickly adapted to our setting.
Lemma 4.1.
If is diagonal, then Theorem 1.9 holds.
Proof.
This proof amounts to adapting section 5 of [2] to address the top-left corner of the matrix . We use the notation of [2] without further comment.
Equation (5.15) of [2] reads
[TABLE]
A Taylor expansion on this quantity, justified because by Lemma 5.2 of [2], yields
[TABLE]
Averaging now over yields
[TABLE]
Using that , we have by Lemma 5.6 of [2] that
[TABLE]
Using equation (3.10) of [2] to bound , we have , so that our equation (4.4) becomes
[TABLE]
Equation (3.1) of [2] also yields that holds with high probability, so that equation (4.3) yields
[TABLE]
Furthermore, it is a result of section 5 of [2] that
[TABLE]
so that may be replaced with (it is noted that under our regularity assumptions). Finally, is precisely , and
[TABLE]
so that we are done.
∎
Remark 4.2.
In the recent paper [17], a result very similar to our Lemma 4.1 is proven in their equation (3.13), under much more general moment assumptions on ; however, the resolvent used in that paper differs slightly from the one used in this paper. We hope to use some of their techniques to weaken some of the moment assumptions in our work.
Acknowledgements
This work was supported by the US Air Force Office of Scientific Research, Lab Task number 19RYCOR036. The views and opinions of this paper do not necessarily reflect the official positions of the Air Force. The public affairs approval number of this document is AFRL-2021-2586.
- [1] B. D. Robinson, R. Malinas and A. O. Hero, “Space-Time Adaptive Detection at Low Sample Support,” in IEEE Transactions on Signal Processing, vol. 69, pp. 2939-2954, 2021, doi: 10.1109/TSP.2021.3076883.
- [2] Knowles, A. and Yin, J. Anisotropic Local Laws for Random Matrices. Probab. Theory Relat. Fields 169, 257–352 (2017). https://doi.org/10.1007/s00440-016-0730-4
- [3] Antti Knowles, Jun Yin. The outliers of a deformed Wigner matrix. Ann. Probab. 42 (5) 1980–2031, September 2014. https://doi.org/10.1214/13-AOP855
- [4] Knowles, A. and Yin, J. (2013), The Isotropic Semicircle Law and Deformation of Wigner Matrices. Commun. Pur. Appl. Math., 66: 1663–1749. https://doi.org/10.1002/cpa.21450
- [5] A. Bloemendal, L. Erdős, A. Knowles, H.-T. Yau, and J. Yin, Isotropic local laws for sample covariance and generalized Wigner matrices, Electron. J. Probab. 19 (2014), 1–53.
- [6] Bloemendal, A., Knowles, A., Yau, H.-T. and Yin, J. On the principal components of sample covariance matrices. Probab. Theory Relat. Fields 164, 459–552 (2016). https://doi.org/10.1007/s00440-015-0616-x
- [7] Bao, Z., Ding, X., Wang, K., and Wang, K. Statistical Inference for Principal Components of Spiked Covariance Matrices (2020). arXiv:2008.11903v2 [math.ST]
- [8] Noureddine El Karoui. Concentration of measure and spectra of random matrices: Applications to correlation matrices, elliptical distributions and beyond. Ann. Appl. Probab. 19 (6) 2362–2405, December 2009. https://doi.org/10.1214/08-AAP548
