On the Nyström and Column-Sampling Methods for the Approximate Principal Components Analysis of Large Data Sets
Darren Homrighausen, Daniel J. McDonald

TL;DR
This paper evaluates Nyström and column-sampling methods for approximate PCA on large datasets, analyzing their theoretical accuracy and practical efficiency through simulations and real data experiments.
Contribution
It provides a theoretical comparison and empirical assessment of these methods' effectiveness for large-scale PCA, clarifying their utility in statistical applications.
Findings
Theoretical bounds on subspace approximation error.
Trade-offs between accuracy and computational efficiency.
Empirical validation on real-world email data.
Abstract
In this paper we analyze approximate methods for undertaking a principal components analysis (PCA) on large data sets. PCA is a classical dimension reduction method that involves the projection of the data onto the subspace spanned by the leading eigenvectors of the covariance matrix. This projection can be used either for exploratory purposes or as an input for further analysis, e.g. regression. If the data have billions of entries or more, the computational and storage requirements for saving and manipulating the design matrix in fast memory are prohibitive. Recently, the Nyström and column-sampling methods have appeared in the numerical linear algebra community for the randomized approximation of the singular value decomposition of large matrices. However, their utility for statistical applications remains unclear. We compare these approximations theoretically by bounding the…
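To make the two approximations named in the abstract concrete, here is a minimal sketch in Python with NumPy. It is not taken from the paper: the function names `nystrom_eig` and `column_sampling_eig` are hypothetical, and the scaling constants follow the form commonly given for these methods in the numerical linear algebra literature, which may differ from the paper's own definitions. Both functions assume a symmetric positive semidefinite matrix `G` (for example a covariance or Gram matrix) and a set of sampled column indices `idx`.

```python
import numpy as np


def nystrom_eig(G, idx, k):
    """Nystrom-style approximation of the top-k eigenpairs of a symmetric PSD G.

    Assumption: uses the common scaling (n/l) for eigenvalues and
    sqrt(l/n) * C U Lambda^{-1} for eigenvectors; the paper may differ.
    """
    n, l = G.shape[0], len(idx)
    C = G[:, idx]            # n x l block of sampled columns
    W = C[idx, :]            # l x l intersection block
    evals_w, evecs_w = np.linalg.eigh(W)
    order = np.argsort(evals_w)[::-1][:k]      # largest eigenvalues first
    lam, U = evals_w[order], evecs_w[:, order]
    approx_vals = (n / l) * lam
    # Extend the small eigenvectors to n rows; the result is generally
    # not orthonormal. Requires the top-k eigenvalues of W to be positive.
    approx_vecs = np.sqrt(l / n) * (C @ U) / lam
    return approx_vals, approx_vecs


def column_sampling_eig(G, idx, k):
    """Column-sampling approximation: left singular vectors of the sampled columns."""
    n, l = G.shape[0], len(idx)
    C = G[:, idx]
    U, s, _ = np.linalg.svd(C, full_matrices=False)
    # Left singular vectors are orthonormal by construction.
    return np.sqrt(n / l) * s[:k], U[:, :k]


# Illustrative usage on synthetic data (hypothetical sizes).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
G = X @ X.T                                    # rank-3 PSD matrix
idx = rng.choice(200, size=20, replace=False)  # uniform column sample
vals_nys, vecs_nys = nystrom_eig(G, idx, k=3)
vals_col, vecs_col = column_sampling_eig(G, idx, k=3)
```

Both routines touch only the sampled columns of `G`, which is the source of the computational and storage savings the abstract refers to. In this sketch the Nyström version needs an eigendecomposition of the small l × l block only, while column sampling takes an SVD of the n × l block and returns orthonormal vectors.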
Taxonomy
Methods: Principal Components Analysis
