Fast and Faster: A Comparison of Two Streamed Matrix Decomposition Algorithms

Radim Řehůřek

arXiv:1102.5597·cs.NA·August 14, 2016·2 cites


Open Access

TL;DR

This paper compares two streaming matrix decomposition algorithms suitable for large datasets, analyzing their accuracy and performance trade-offs in practical, real-world scenarios like processing the entire English Wikipedia for Latent Semantic Analysis.

Contribution

It provides a practical comparison of a single-pass distributed method and a two-pass stochastic algorithm for large-scale matrix decomposition.

Findings

01

Distributed method performs well with fewer passes

02

Oversampling improves accuracy in both algorithms

03

Memory trade-offs significantly affect performance

Abstract

With the explosion of the size of digital datasets, the limiting factor for decomposition algorithms is the number of passes over the input, as the input is often stored out-of-core or even off-site. Moreover, we're only interested in algorithms that operate in constant memory w.r.t. the input size, so that arbitrarily large input can be processed. In this paper, we present a practical comparison of two such algorithms: a distributed method that operates in a single pass over the input vs. a streamed two-pass stochastic algorithm. The experiments track the effect of distributed computing, oversampling and memory trade-offs on the accuracy and performance of the two algorithms. To ensure meaningful results, we choose the input to be a real dataset, namely the whole of the English Wikipedia, in the application settings of Latent Semantic Analysis.
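
To make the idea of a streamed two-pass stochastic decomposition concrete, here is a minimal NumPy sketch. It follows the general randomized range-finder approach and is not taken from the paper: the function names `stream_chunks` and `two_pass_stochastic_svd`, the `oversample` parameter, and the choice to stream the input as column blocks (documents) are all assumptions made for illustration. Memory use depends on the number of rows, the requested rank and the chunk size, but not on the total number of columns.

```python
import numpy as np


def stream_chunks(matrix, chunk_size):
    """Yield consecutive column blocks; a stand-in for an out-of-core document stream."""
    for start in range(0, matrix.shape[1], chunk_size):
        yield matrix[:, start:start + chunk_size]


def two_pass_stochastic_svd(make_stream, num_rows, rank, oversample=10, seed=0):
    """Approximate the top `rank` left singular vectors and singular values.

    `make_stream` is a hypothetical zero-argument callable that returns a fresh
    iterator over column blocks each time it is called (once per pass).
    """
    rng = np.random.default_rng(seed)
    width = rank + oversample  # oversampling widens the random sketch

    # Pass 1: accumulate Y = A @ Omega, drawing Omega block by block so that
    # neither A nor Omega is ever held in memory as a whole.
    y = np.zeros((num_rows, width))
    for chunk in make_stream():
        omega = rng.standard_normal((chunk.shape[1], width))
        y += chunk @ omega
    q, _ = np.linalg.qr(y)  # orthonormal basis for the sketched range

    # Pass 2: accumulate the small matrix X = (Q^T A)(Q^T A)^T.
    x = np.zeros((width, width))
    for chunk in make_stream():
        b = q.T @ chunk
        x += b @ b.T

    # Eigenvalues of X are squared singular values of Q^T A.
    eigvals, eigvecs = np.linalg.eigh(x)  # returned in ascending order
    order = np.argsort(eigvals)[::-1][:rank]
    singular_values = np.sqrt(np.clip(eigvals[order], 0.0, None))
    left_vectors = q @ eigvecs[:, order]
    return left_vectors, singular_values
```

A call might look like `two_pass_stochastic_svd(lambda: stream_chunks(a, 1000), a.shape[0], rank=200)`, where `a` is any 2-D array used here only to simulate a stream. Forming the small Gram matrix in the second pass avoids storing anything proportional to the number of documents, at the cost of some precision in the smallest singular values.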


Taxonomy

Topics: Algorithms and Data Compression · Neural Networks and Applications · Machine Learning and Algorithms