How close is the sample covariance matrix to the actual covariance matrix?
Roman Vershynin

TL;DR
This paper asks how many samples N are needed in dimension n for the sample covariance matrix to approximate the true covariance matrix with fixed accuracy in the operator norm, and proves a bound that matches the conjectured optimal N = O(n) up to an iterated logarithmic factor.
Contribution
It conjectures that a sample size linear in the dimension, N = O(n), is optimal for accurate covariance estimation for all distributions with finite fourth moment, and proves this up to an iterated logarithmic factor.
Findings
Proves that a sample size of N = O(n) suffices, up to an iterated logarithmic factor.
Supports the conjecture that N = O(n) suffices for distributions with finite fourth moments.
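Builds on Rudelson's theorem (N = O(n \log n) for distributions with finite second moment) and on the result of Adamczak, Litvak, Pajor and Tomczak-Jaegermann (N = O(n) for sub-exponential distributions).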
Abstract
Given a probability distribution in R^n with general (non-white) covariance, a classical estimator of the covariance matrix is the sample covariance matrix obtained from a sample of N independent points. What is the optimal sample size N = N(n) that guarantees estimation with a fixed accuracy in the operator norm? Suppose the distribution is supported in a centered Euclidean ball of radius \sqrt{n}. We conjecture that the optimal sample size is N = O(n) for all distributions with finite fourth moment, and we prove this up to an iterated logarithmic factor. This problem is motivated by the optimal theorem of Rudelson which states that N = O(n \log n) for distributions with finite second moment, and a recent result of Adamczak, Litvak, Pajor and Tomczak-Jaegermann which guarantees that N = O(n) for sub-exponential distributions.
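The quantity in question can be explored numerically. The following is a minimal sketch, not taken from the paper: it assumes a mean-zero distribution with independent uniform coordinates and a hypothetical diagonal (non-white) covariance whose trace is 0.3n, so every sample lies in the centered Euclidean ball of radius \sqrt{n}. The function names and the choice of distribution are illustrative assumptions. It computes the sample covariance matrix from N points and reports the relative error in the operator norm.

```python
import numpy as np


def sample_bounded(n, N, variances, rng):
    """Draw N mean-zero points in R^n with independent uniform coordinates.

    Coordinate i is uniform on [-sqrt(3 v_i), sqrt(3 v_i)], so it has variance v_i.
    Every point has squared norm at most 3 * sum(v_i).
    """
    half_widths = np.sqrt(3.0 * variances)
    return rng.uniform(-1.0, 1.0, size=(N, n)) * half_widths


def sample_covariance(points):
    """Sample covariance (1/N) * sum_i x_i x_i^T, assuming the mean is zero."""
    return points.T @ points / points.shape[0]


def relative_error(n, N, rng):
    """Operator-norm error of the sample covariance, relative to the true norm."""
    # Hypothetical non-white covariance: variances from 0.1 to 0.5 (mean 0.3),
    # so 3 * sum(v_i) = 0.9 * n and the support lies in the ball of radius sqrt(n).
    variances = np.linspace(0.1, 0.5, n)
    cov = np.diag(variances)
    est = sample_covariance(sample_bounded(n, N, variances, rng))
    # ord=2 gives the largest singular value, i.e. the operator norm.
    return np.linalg.norm(est - cov, ord=2) / np.linalg.norm(cov, ord=2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 50
    for ratio in (2, 20, 200):
        err = relative_error(n, ratio * n, rng)
        print(f"N = {ratio}n: relative operator-norm error = {err:.3f}")
```

Running the script prints the error for sample sizes that are fixed multiples of the dimension. Coordinates this well behaved are far easier than the general finite-fourth-moment case the conjecture covers, so the experiment only illustrates the quantity being bounded and is not evidence for the conjecture.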
Taxonomy
Topics: Random Matrices and Applications · Advanced Statistical Methods and Models · Sparse and Compressive Sensing Techniques
