TL;DR
This paper implements Cannon's distributed matrix multiplication algorithm on top of Apache Spark's barrier execution mode. The result improves on the existing MLlib implementation, with up to 24% better performance on 10,000x10,000 matrices and a lower memory footprint.
Contribution
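It presents an implementation of Cannon's algorithm that runs inside Spark barrier tasks, combining distributed message passing over asynchronous network IO with OpenJDK's auto-vectorization to improve on MLlib's distributed matrix multiplication.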
Findings
Up to 24% performance improvement on 10,000x10,000 matrices
Significantly lower memory footprint compared to existing implementations
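Shows that Spark's barrier mode, which was added to embed distributed deep learning training as a Spark stage, can also host non-map/reduce algorithms that need inter-task communication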
Abstract
The new barrier mode in Apache Spark allows embedding distributed deep learning training as a Spark stage to simplify the distributed training workflow. In Spark, a task in a stage does not depend on any other tasks in the same stage, and hence it can be scheduled independently. However, several algorithms require more sophisticated inter-task communications, similar to the MPI paradigm. By combining distributed message passing (using asynchronous network IO), OpenJDK's new auto-vectorization and Spark's barrier execution mode, we can add non-map/reduce based algorithms, such as Cannon's distributed matrix multiplication, to Spark. We document an efficient distributed matrix multiplication using Cannon's algorithm, which improves significantly on the performance of the existing MLlib implementation. Used within a barrier task, the algorithm described herein results in an up to 24 percent…
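To make the communication pattern of Cannon's algorithm concrete, here is a minimal single-process sketch in plain Python. It is not the paper's implementation: the function names (`cannon`, `split`, `join`, `matmul`, `add`) are hypothetical, the q x q grid of blocks stands in for q*q barrier tasks, and the list rotations stand in for the block exchanges that tasks would perform over the network. It assumes square matrices whose size is divisible by q.

```python
def matmul(a, b):
    """Plain dense product of two matrices stored as lists of lists."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def add(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def split(m, q):
    """Cut an n x n matrix into a q x q grid of (n/q) x (n/q) blocks."""
    s = len(m) // q
    return [[[row[j * s:(j + 1) * s] for row in m[i * s:(i + 1) * s]]
             for j in range(q)] for i in range(q)]


def join(blocks):
    """Reassemble a grid of blocks into one matrix."""
    out = []
    for block_row in blocks:
        for r in range(len(block_row[0])):
            out.append([x for blk in block_row for x in blk[r]])
    return out


def cannon(a, b, q):
    """Simulate Cannon's algorithm on a q x q grid (one cell per 'task')."""
    n = len(a)
    assert n % q == 0, "matrix size must be divisible by the grid size"
    s = n // q
    A, B = split(a, q), split(b, q)

    # Initial skew: block row i of A rotates left by i positions,
    # block column j of B rotates up by j positions.
    A = [[A[i][(j + i) % q] for j in range(q)] for i in range(q)]
    B = [[B[(i + j) % q][j] for j in range(q)] for i in range(q)]

    # Each cell accumulates its own block of the result.
    C = [[[[0] * s for _ in range(s)] for _ in range(q)] for _ in range(q)]

    for _ in range(q):
        # Local multiply-accumulate in every cell.
        C = [[add(C[i][j], matmul(A[i][j], B[i][j])) for j in range(q)]
             for i in range(q)]
        # Shift A blocks one step left and B blocks one step up.
        # In a distributed setting this is the inter-task message exchange.
        A = [[A[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        B = [[B[(i + 1) % q][j] for j in range(q)] for i in range(q)]

    return join(C)
```

Each cell only ever holds one block of each operand and one block of the result, and all cells must take part in every shift step. That lockstep requirement is why the tasks cannot be scheduled independently, which is the situation barrier execution mode addresses.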
