Preconditioned Data Sparsification for Big Data with Applications to PCA and K-means

Farhad Pourkamali-Anaraki; Stephen Becker

arXiv:1511.00152 · stat.ML · February 24, 2017

TL;DR

This paper introduces a randomized preconditioning and sampling scheme for compressing large datasets, enabling faster PCA and K-means clustering with theoretical guarantees and practical benefits.

Contribution

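The paper proposes a data sparsification method that combines randomized preconditioning with a variant of existing sampling schemes, keeping only a small percentage of the components of each sample. The sparse output speeds up subsequent processing, and the method comes with theoretical error bounds for PCA and K-means.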

Findings

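01 The method comes with guarantees for PCA (in terms of the covariance matrix) and for K-means (in terms of the error in the center estimators at a given step); numerical evidence indicates the bounds are nearly tight.

02 Numerical experiments show a real benefit on standard test data sets, where the sparse output makes subsequent PCA and K-means significantly faster.

03 The approach offers certain benefits over related sampling techniques.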

Abstract

We analyze a compression scheme for large data sets that randomly keeps a small percentage of the components of each data sample. The benefit is that the output is a sparse matrix and therefore subsequent processing, such as PCA or K-means, is significantly faster, especially in a distributed-data setting. Furthermore, the sampling is single-pass and applicable to streaming data. The sampling mechanism is a variant of previous methods proposed in the literature combined with a randomized preconditioning to smooth the data. We provide guarantees for PCA in terms of the covariance matrix, and guarantees for K-means in terms of the error in the center estimators at a given step. We present numerical evidence to show both that our bounds are nearly tight and that our algorithms provide a real benefit when applied to standard test data sets, as well as providing certain benefits over related…
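The pipeline described in the abstract can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the abstract does not name the preconditioner, so the code assumes random sign flips followed by an orthonormal Hadamard transform (which requires the dimension to be a power of two), and it assumes each sample keeps exactly `m` of its `p` components, chosen uniformly without replacement. The rescaling in `covariance_estimate` is the standard unbiased correction for that sampling model and may differ from the estimator analyzed in the paper. All function names are hypothetical.

```python
import numpy as np

def hadamard(p):
    """Orthonormal Hadamard matrix of size p x p (p must be a power of two)."""
    if p < 1 or p & (p - 1):
        raise ValueError("p must be a power of two")
    H = np.ones((1, 1))
    while H.shape[0] < p:
        H = np.kron(H, np.array([[1.0, 1.0], [1.0, -1.0]]))
    return H / np.sqrt(p)

def precondition(X, rng):
    """Assumed preconditioner: Q = H @ diag(random signs), applied to columns of X."""
    p = X.shape[0]
    signs = rng.choice([-1.0, 1.0], size=p)
    Q = hadamard(p) * signs  # broadcasting scales column j by signs[j]
    return Q @ X, Q

def sparsify(Y, m, rng):
    """Keep m randomly chosen components of each sample (column); zero the rest.

    Single pass over the samples. A dense array is used here for clarity;
    a real pipeline would store the result in a sparse matrix format.
    """
    p, n = Y.shape
    W = np.zeros_like(Y)
    for i in range(n):
        keep = rng.choice(p, size=m, replace=False)
        W[keep, i] = Y[keep, i]
    return W

def covariance_estimate(W, m):
    """Unbiased estimate of (1/n) Y @ Y.T from the sparsified data (needs m >= 2).

    A diagonal entry survives with probability m/p, and an off-diagonal
    entry with probability m(m-1) / (p(p-1)), so each is rescaled accordingly.
    """
    p, n = W.shape
    S = W @ W.T / n
    C = S * (p * (p - 1) / (m * (m - 1)))
    np.fill_diagonal(C, np.diag(S) * (p / m))
    return C

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 500))        # p = 8 features, n = 500 samples
Y, Q = precondition(X, rng)              # smooth the data
W = sparsify(Y, m=3, rng=rng)            # keep 3 of 8 components per sample
C_hat = Q.T @ covariance_estimate(W, m=3) @ Q  # map back to original coordinates
```

Because `Q` is orthonormal, undoing the preconditioner is just a multiplication by `Q.T`, so the principal components of `C_hat` can be compared directly with those of the original data.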

Taxonomy

Methods: Principal Components Analysis