An algorithm for the principal component analysis of large data sets
Nathan Halko, Per-Gunnar Martinsson, Yoel Shkolnisky, and Mark Tygert

TL;DR
This paper introduces an out-of-core randomized PCA algorithm suitable for very large datasets that cannot fit into RAM, demonstrating its effectiveness through numerical experiments.
Contribution
It adapts a randomized PCA method for out-of-core data processing, enabling efficient analysis of datasets too large for memory.
Findings
PCA was carried out on a disk-resident data set so large that less than a hundredth of it fits in RAM.
The randomized method that the algorithm builds on yields nearly optimal accuracy.
Several numerical examples illustrate the algorithm's performance on data sets stored on disk.
Abstract
Recently popularized randomized methods for principal component analysis (PCA) efficiently and reliably produce nearly optimal accuracy --- even on parallel processors --- unlike the classical (deterministic) alternatives. We adapt one of these randomized methods for use with data sets that are too large to be stored in random-access memory (RAM). (The traditional terminology is that our procedure works efficiently "out-of-core.") We illustrate the performance of the algorithm via several numerical examples. For example, we report on the PCA of a data set stored on disk that is so large that less than a hundredth of it can fit in our computer's RAM.
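To make the out-of-core idea concrete, here is a minimal sketch in Python with NumPy. It is not the authors' algorithm as published: it is a generic blocked randomized subspace iteration, written on the assumption that the data matrix has far more rows than columns, so that a few n-by-l work arrays fit in RAM while the m-by-n data stay on disk. The names `row_blocks`, `out_of_core_pca`, `oversample`, `n_iter`, and `block_rows` are hypothetical and chosen only for illustration.

```python
import numpy as np


def row_blocks(A, block_rows):
    """Yield consecutive row blocks so only one block is in RAM at a time."""
    for start in range(0, A.shape[0], block_rows):
        yield np.asarray(A[start:start + block_rows], dtype=np.float64)


def out_of_core_pca(A, k, oversample=10, n_iter=2, block_rows=1024, seed=0):
    """Approximate the top-k principal components of the rows of A.

    A may be any 2-D array-like that supports row slicing, for example
    a numpy.memmap backed by a file on disk.
    """
    m, n = A.shape
    l = min(k + oversample, m, n)  # sketch width, slightly larger than k
    rng = np.random.default_rng(seed)

    # Pass 1: column means, accumulated block by block.
    mean = np.zeros(n)
    for blk in row_blocks(A, block_rows):
        mean += blk.sum(axis=0)
    mean /= m

    # Subspace iteration on C^T C, where C is the centered data.
    # Each iteration costs one pass over the data on disk.
    Q, _ = np.linalg.qr(rng.standard_normal((n, l)))
    for _ in range(max(1, n_iter)):
        Z = np.zeros((n, l))
        for blk in row_blocks(A, block_rows):
            C = blk - mean
            Z += C.T @ (C @ Q)
        Q, _ = np.linalg.qr(Z)

    # Final pass: project onto Q and form the small l-by-l Gram matrix.
    G = np.zeros((l, l))
    for blk in row_blocks(A, block_rows):
        B = (blk - mean) @ Q
        G += B.T @ B

    # Eigenpairs of G give squared singular values and rotated directions.
    evals, W = np.linalg.eigh(G)
    order = np.argsort(evals)[::-1][:k]
    svals = np.sqrt(np.clip(evals[order], 0.0, None))
    comps = (Q @ W[:, order]).T  # shape (k, n), one component per row
    return comps, svals, mean
```

For data on disk, the matrix could be opened with something like `np.memmap("data.f64", dtype=np.float64, mode="r", shape=(m, n))`, where the file name and shape are placeholders. Forming the Gram matrix squares the singular values, which costs accuracy for the smallest ones; the sketch accepts that in exchange for never holding an m-by-l array in memory.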
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Face and Expression Recognition
