Projecting "better than randomly": How to reduce the dimensionality of very large datasets in a way that outperforms random projections
Michael Wojnowicz, Di Zhang, Glenn Chisholm, Xuan Zhao, Matt Wolff

TL;DR
This paper introduces LS-RPCA, a new algorithm for large-scale randomized principal component analysis, demonstrating it significantly outperforms random projections in supervised learning tasks on massive datasets.
Contribution
The paper develops LS-RPCA, extending randomized PCA to handle arbitrarily large datasets, and shows it improves classification accuracy over random projections.
Findings
LS-RPCA reduces classification error by 37-54% compared to random projections.
LS-RPCA scales to a dataset with over 10 million samples and almost 100,000 features.
Randomized PCA can outperform random projections when dataset rank and accuracy are critical.
Abstract
For very large datasets, random projections (RP) have become the tool of choice for dimensionality reduction. This is due to the computational complexity of principal component analysis. However, the recent development of randomized principal component analysis (RPCA) has opened up the possibility of obtaining approximate principal components on very large datasets. In this paper, we compare the performance of RPCA and RP in dimensionality reduction for supervised learning. In Experiment 1, we study a malware classification task on a dataset with over 10 million samples, almost 100,000 features, and over 25 billion non-zero values, with the goal of reducing the dimensionality to a compressed representation of 5,000 features. To apply RPCA to this dataset, we develop a new algorithm called large sample RPCA (LS-RPCA), which extends the RPCA algorithm to work on datasets with…
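To make the RP versus RPCA contrast in the abstract concrete, here is a minimal sketch in Python with NumPy. It is not the paper's LS-RPCA algorithm or its malware experiment. It assumes a small synthetic low-rank dataset, a Gaussian random projection with an orthonormalized basis, and a basic randomized range finder with power iterations as the RPCA stand-in. All function names and parameters (`random_projection_basis`, `randomized_pca_basis`, `n_oversamples`, `n_power_iter`, `compare_on_synthetic`) are hypothetical, and the fraction of variance retained is used only as a rough proxy for how much structure each method preserves, not as a measure of classification error.

```python
import numpy as np


def random_projection_basis(d, k, rng):
    """Return a d x k orthonormal basis spanning a random Gaussian subspace."""
    R = rng.standard_normal((d, k))
    Q, _ = np.linalg.qr(R)  # orthonormalize so variance fractions are comparable
    return Q


def randomized_pca_basis(Xc, k, rng, n_oversamples=10, n_power_iter=2):
    """Approximate the top-k principal directions of centered data Xc.

    Assumption: a generic randomized range finder with power iterations,
    used here as a stand-in for RPCA (not the paper's LS-RPCA).
    """
    d = Xc.shape[1]
    # Sketch the column space of Xc with a random test matrix.
    omega = rng.standard_normal((d, k + n_oversamples))
    Q, _ = np.linalg.qr(Xc @ omega)
    # Power iterations sharpen the sketch toward the dominant directions.
    for _ in range(n_power_iter):
        Z, _ = np.linalg.qr(Xc.T @ Q)
        Q, _ = np.linalg.qr(Xc @ Z)
    # Exact SVD of the small projected matrix gives approximate components.
    B = Q.T @ Xc
    _, _, Vt = np.linalg.svd(B, full_matrices=False)
    return Vt[:k].T  # d x k, orthonormal columns


def captured_variance(Xc, basis):
    """Fraction of total variance of Xc retained after projecting onto basis."""
    return np.linalg.norm(Xc @ basis) ** 2 / np.linalg.norm(Xc) ** 2


def compare_on_synthetic(seed=0, n=500, d=200, rank=5, k=5, noise=0.1):
    """Build low-rank-plus-noise data and compare RP with randomized PCA."""
    rng = np.random.default_rng(seed)
    signal = rng.standard_normal((n, rank)) @ rng.standard_normal((rank, d))
    X = signal + noise * rng.standard_normal((n, d))
    Xc = X - X.mean(axis=0)  # PCA operates on centered data
    rp_frac = captured_variance(Xc, random_projection_basis(d, k, rng))
    pca_frac = captured_variance(Xc, randomized_pca_basis(Xc, k, rng))
    return rp_frac, pca_frac


if __name__ == "__main__":
    rp_frac, pca_frac = compare_on_synthetic()
    print(f"variance retained by RP:   {rp_frac:.3f}")
    print(f"variance retained by RPCA: {pca_frac:.3f}")
```

On data like this, where the target dimensionality matches a low underlying rank, the randomized PCA basis should retain almost all of the variance, while a random subspace retains only about k/d of it on average. Real datasets are rarely this clean, so the gap should be read as illustrative only.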
Taxonomy
Methods: Principal Components Analysis
