Accelerating Large-Scale Data Analysis by Offloading to High-Performance Computing Libraries using Alchemist
Alex Gittens, Kai Rothauge, Shusen Wang, Michael W. Mahoney, Lisa Gerhardt, Prabhat, Jey Kottalam, Michael Ringenburg, Kristyn Maschhoff

TL;DR
This paper introduces Alchemist, a system that enables Apache Spark to offload linear algebra computations to MPI-based high-performance libraries, significantly accelerating large-scale data analysis tasks while maintaining Spark's usability.
Contribution
Alchemist provides an interface through which Spark applications can call MPI-based libraries, improving performance for linear algebra and machine learning computations.
Findings
Order-of-magnitude speedup for the conjugate gradient method
Up to 7.9x faster SVD on large datasets
Scales to datasets of up to 17.6 TB
Abstract
Apache Spark is a popular system aimed at the analysis of large data sets, but recent studies have shown that certain computations, in particular many linear algebra computations that are the basis for solving common machine learning problems, are significantly slower in Spark than when done using libraries written in a high-performance computing framework such as the Message-Passing Interface (MPI). To remedy this, we introduce Alchemist, a system designed to call MPI-based libraries from Apache Spark. Using Alchemist with Spark helps accelerate linear algebra, machine learning, and related computations, while still retaining the benefits of working within the Spark environment. We discuss the motivation behind the development of Alchemist, and we provide a brief overview of its design and implementation. We also compare the performances of pure Spark implementations with those…
