TL;DR
This paper introduces a randomized preconditioning and sampling scheme for compressing large datasets, enabling faster PCA and K-means clustering with theoretical guarantees and practical benefits.
Contribution
It proposes a novel data sparsification method that combines randomized preconditioning with sampling, improving efficiency and providing theoretical error bounds for PCA and K-means.
Findings
The paper gives guarantees for PCA in terms of the covariance matrix, and for K-means in terms of the error in the center estimators at a given step.
Numerical experiments on standard test data sets indicate that the bounds are nearly tight and that the sparsified data yields a real practical benefit.
The approach offers certain benefits over related sampling approaches.
Abstract
We analyze a compression scheme for large data sets that randomly keeps a small percentage of the components of each data sample. The benefit is that the output is a sparse matrix and therefore subsequent processing, such as PCA or K-means, is significantly faster, especially in a distributed-data setting. Furthermore, the sampling is single-pass and applicable to streaming data. The sampling mechanism is a variant of previous methods proposed in the literature combined with a randomized preconditioning to smooth the data. We provide guarantees for PCA in terms of the covariance matrix, and guarantees for K-means in terms of the error in the center estimators at a given step. We present numerical evidence to show both that our bounds are nearly tight and that our algorithms provide a real benefit when applied to standard test data sets, as well as providing certain benefits over related…
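The two steps the abstract describes, smoothing each sample with a randomized preconditioner and then keeping only a few of its components, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the preconditioner is assumed to be a random sign flip followed by an orthonormal Walsh-Hadamard transform, entries are assumed to be sampled uniformly without replacement, and the names `fwht`, `precondition`, and `sparsify` are hypothetical. Only the Python standard library is used so the sketch stays self-contained.

```python
import math
import random


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    y = list(x)
    n = len(y)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform norm-preserving
    return [v * scale for v in y]


def precondition(x, signs):
    """Assumed preconditioner: flip signs entrywise, then mix with the Hadamard transform."""
    return fwht([s * v for s, v in zip(signs, x)])


def sparsify(y, m, rng):
    """Keep m entries of y, chosen uniformly without replacement (an assumption)."""
    keep = rng.sample(range(len(y)), m)
    return {i: y[i] for i in keep}  # sparse representation: index -> value


rng = random.Random(0)
p, m = 8, 2
signs = [rng.choice((-1.0, 1.0)) for _ in range(p)]  # one sign vector shared by all samples
x = [4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]         # a "spiky" sample
y = precondition(x, signs)                           # energy is now spread over all entries
z = sparsify(y, m, rng)                              # single pass over the sample
```

In this toy case the single nonzero entry of `x` is spread evenly in magnitude across all `p` entries of `y`, so a uniform sample of `m` entries no longer risks missing all of the sample's energy. Under this uniform sampling assumption each entry is kept with probability `m / p`, so rescaling the kept values by `p / m` gives an unbiased estimate of `y`.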
Taxonomy
Methods: Principal Components Analysis
