Scalable Manifold Learning for Big Data with Apache Spark

Frank Schoeneman; Jaroslaw Zola

arXiv:1808.10776 · cs.DC · September 3, 2018

TL;DR

This paper introduces a scalable, distributed implementation of the exact Isomap manifold learning algorithm on Apache Spark, enabling efficient processing of large-scale datasets without relying on secondary storage.

Contribution

It presents a novel distributed memory framework for exact Isomap in Spark, optimizing each step for scalability and efficiency on large datasets.

Findings

01 Demonstrates excellent scalability on a 25-node cluster.

02 Enables processing of datasets orders of magnitude larger than previous methods could handle.

03 Achieves end-to-end exact Isomap computation in a distributed environment.

Abstract

Non-linear spectral dimensionality reduction methods, such as Isomap, remain an important technique for learning manifolds. However, due to its computational complexity, exact manifold learning using Isomap is currently impossible for large-scale data. In this paper, we propose a distributed memory framework implementing end-to-end exact Isomap under the Apache Spark model. We show how each critical step of the Isomap algorithm can be efficiently realized using the basic Spark model, without the need to provision data in secondary storage. We show how the entire method can be implemented using PySpark, offloading compute-intensive linear algebra routines to BLAS. Through experimental results, we demonstrate excellent scalability of our method, and we show that it can process datasets orders of magnitude larger than what is currently possible, using a 25-node parallel cluster.
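For readers unfamiliar with the steps the abstract refers to, the following is a minimal single-machine sketch of textbook exact Isomap in NumPy: a k-nearest-neighbor graph, all-pairs shortest paths, double centering, and an eigendecomposition. It is not the authors' Spark implementation and says nothing about how they distribute the work. The function name `isomap` and its parameters are hypothetical, chosen only for illustration.

```python
import numpy as np

def isomap(X, n_neighbors=5, n_components=2):
    """Textbook exact Isomap on one machine (illustrative sketch only)."""
    n = X.shape[0]

    # Step 1: dense pairwise Euclidean distances.
    sq = np.sum(X ** 2, axis=1)
    D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0))
    np.fill_diagonal(D, 0.0)

    # Step 2: k-nearest-neighbor graph; missing edges are +inf.
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    idx = np.argsort(D, axis=1)[:, 1:n_neighbors + 1]  # column 0 is the point itself
    rows = np.repeat(np.arange(n), n_neighbors)
    G[rows, idx.ravel()] = D[rows, idx.ravel()]
    G = np.minimum(G, G.T)  # symmetrize: keep an edge if either endpoint chose it

    # Step 3: all-pairs shortest paths (Floyd-Warshall, vectorized per pivot).
    for k in range(n):
        G = np.minimum(G, G[:, k:k + 1] + G[k:k + 1, :])
    if not np.isfinite(G).all():
        raise ValueError("neighborhood graph is disconnected; increase n_neighbors")

    # Step 4: double centering of squared geodesic distances (classical MDS).
    J = np.eye(n) - 1.0 / n
    B = -0.5 * J @ (G ** 2) @ J

    # Step 5: the top eigenpairs give the low-dimensional coordinates.
    w, V = np.linalg.eigh(B)
    top = np.argsort(w)[::-1][:n_components]
    return V[:, top] * np.sqrt(np.maximum(w[top], 0.0))

# Usage: unroll a semicircle embedded in 3-D into one dimension.
t = np.linspace(0.0, np.pi, 40)
X = np.column_stack([np.cos(t), np.sin(t), np.zeros_like(t)])
Y = isomap(X, n_neighbors=2, n_components=1)
```

In this sketch every step from 1 to 4 materializes a dense n × n matrix and step 3 costs O(n³) time, which illustrates why the exact method becomes impractical on a single machine as n grows.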

Code & Models

Repositories

https://gitlab.com/SCoRe-Group/IsomapSpark
none · Official

Taxonomy

Topics: Neural Networks and Applications · Advanced Clustering Algorithms Research · Image Retrieval and Classification Techniques