An algorithm for the principal component analysis of large data sets

Nathan Halko, Per-Gunnar Martinsson, Yoel Shkolnisky, and Mark Tygert

arXiv:1007.5510 · stat.CO · December 23, 2011 · 5 cites

Open Access

TL;DR

This paper introduces an out-of-core randomized PCA algorithm suitable for very large datasets that cannot fit into RAM, demonstrating its effectiveness through numerical experiments.

Contribution

It adapts a randomized PCA method for out-of-core data processing, enabling efficient analysis of datasets too large for memory.

Findings

01

Performed PCA on a disk-resident dataset so large that less than a hundredth of it fits in RAM.

02

The algorithm achieves near-optimal accuracy comparable to in-memory methods.

03

Illustrated the algorithm's performance on several numerical examples with large, disk-resident datasets.

Abstract

Recently popularized randomized methods for principal component analysis (PCA) efficiently and reliably produce nearly optimal accuracy, even on parallel processors, unlike the classical (deterministic) alternatives. We adapt one of these randomized methods for use with data sets that are too large to be stored in random-access memory (RAM). (The traditional terminology is that our procedure works efficiently "out-of-core.") We illustrate the performance of the algorithm via several numerical examples. For example, we report on the PCA of a data set stored on disk that is so large that less than a hundredth of it can fit in our computer's RAM.
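
To make the out-of-core idea concrete, here is a minimal Python/NumPy sketch of a generic randomized PCA that reads the data matrix only a block of rows at a time. It is not the authors' exact procedure: the function name `blocked_randomized_pca`, its parameters (`oversample`, `n_iter`, `block_rows`), and the use of QR-stabilized power iterations are assumptions made for illustration. The sketch also assumes the columns of the matrix are already centered and that a tall, thin m x (k + oversample) work array fits in RAM.

```python
import numpy as np

def blocked_randomized_pca(A, k, oversample=10, n_iter=2, block_rows=1000, seed=0):
    """Approximate rank-k SVD of A, reading A only block_rows rows at a time.

    A is any 2-D array-like with a .shape and row slicing (e.g. np.memmap).
    Illustrative sketch; columns of A are assumed to be centered already.
    """
    m, n = A.shape
    l = min(k + oversample, m, n)          # width of the random sketch
    rng = np.random.default_rng(seed)

    def blocks():
        # Yield (row offset, block) pairs; only one block is in RAM at a time.
        for start in range(0, m, block_rows):
            yield start, np.asarray(A[start:start + block_rows])

    def times(X):
        # Compute A @ X block by block (X is n x l).
        Y = np.empty((m, X.shape[1]))
        for start, blk in blocks():
            Y[start:start + blk.shape[0]] = blk @ X
        return Y

    def t_times(Y):
        # Compute A.T @ Y block by block (Y is m x l).
        Z = np.zeros((n, Y.shape[1]))
        for start, blk in blocks():
            Z += blk.T @ Y[start:start + blk.shape[0]]
        return Z

    # Orthonormal basis Q that approximately captures the range of A.
    Q, _ = np.linalg.qr(times(rng.standard_normal((n, l))))
    for _ in range(n_iter):
        # Power iterations, re-orthonormalized for numerical stability.
        W, _ = np.linalg.qr(t_times(Q))
        Q, _ = np.linalg.qr(times(W))

    # Small l x n matrix B = Q.T @ A; its SVD yields the approximate SVD of A.
    B = t_times(Q).T
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :k], s[:k], Vt[:k]
```

Each call to `times` or `t_times` makes one sequential pass over the data, so the sketch above reads the matrix 2 * n_iter + 2 times in total. For data on disk, `A` could be an `np.memmap` opened in read-only mode, with `block_rows` chosen so that one block fits comfortably in memory.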

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Face and Expression Recognition