Distributed-Memory Sparse Kernels for Machine Learning
Vivek Bharadwaj, Aydin Buluç, James Demmel

TL;DR
This paper develops and benchmarks distributed-memory algorithms for fused sparse-dense matrix operations, significantly reducing communication costs and accelerating large-scale machine learning tasks.
Contribution
It introduces communication-eliding strategies for FusedMM (an SDDMM followed directly by an SpMM) and shows that distributed-memory 1.5D and 2.5D SpMM algorithms can be converted to SDDMM algorithms with identical communication costs.
Findings
FusedMM algorithms save at least 30% of communication time compared to executing SDDMM and SpMM in sequence.
The algorithms achieve at least a 10x speedup over PETSc's SpMM on large real-world matrices.
Communication-eliding techniques improve runtime by up to 1.6x over unoptimized sequences of the two kernels.
Abstract
Sampled Dense Times Dense Matrix Multiplication (SDDMM) and Sparse Times Dense Matrix Multiplication (SpMM) appear in diverse settings, such as collaborative filtering, document clustering, and graph embedding. Frequently, the SDDMM output becomes the input sparse matrix for a subsequent SpMM operation. Existing work has focused on shared memory parallelization of these primitives. While there has been extensive analysis of communication-minimizing distributed 1.5D algorithms for SpMM, no such analysis exists for SDDMM or the back-to-back sequence of SDDMM and SpMM, termed FusedMM. We show that distributed memory 1.5D and 2.5D algorithms for SpMM can be converted to algorithms for SDDMM with identical communication costs and input / output data layouts. Further, we give two communication-eliding strategies to reduce costs further for FusedMM kernels: either reusing the replication of an…
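To make the three kernels named in the abstract concrete, here is a minimal single-process sketch in plain Python. It is not the paper's distributed algorithm. It assumes the common definitions SDDMM(S, A, B) = S ∘ (A Bᵀ), evaluated only at the nonzeros of S, and SpMM(S, B) = S B. The coordinate-triple storage and the function names `sddmm`, `spmm`, and `fusedmm` are illustrative choices, not taken from the paper.

```python
# Sparse matrix S is stored as a list of (row, col, value) triples.
# This layout is an assumption made for illustration only.

def sddmm(S, A, B):
    """Sample the dense product A @ B^T at the nonzeros of S and scale by S."""
    out = []
    for i, j, s in S:
        dot = sum(a * b for a, b in zip(A[i], B[j]))
        out.append((i, j, s * dot))
    return out

def spmm(S, B, num_rows):
    """Multiply sparse S (num_rows x n) by dense B (n x r); returns a dense list of rows."""
    r = len(B[0])
    out = [[0.0] * r for _ in range(num_rows)]
    for i, j, s in S:
        for k in range(r):
            out[i][k] += s * B[j][k]
    return out

def fusedmm(S, A, B):
    """Feed the SDDMM output straight into SpMM, as in the back-to-back sequence."""
    return spmm(sddmm(S, A, B), B, len(A))

# Small example: S is 2 x 3, A is 2 x 2, B is 3 x 2.
S = [(0, 0, 1.0), (0, 2, 2.0), (1, 1, 3.0)]
A = [[1.0, 2.0], [3.0, 4.0]]
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

result = fusedmm(S, A, B)
print(result)  # [[7.0, 6.0], [0.0, 12.0]]
```

Both kernels loop over the same nonzeros of S and read the same rows of B, which is why the two operations are natural candidates for sharing data movement when run back to back.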
Taxonomy
Topics: Advanced Graph Neural Networks · Functional Brain Connectivity Studies · Stochastic Gradient Optimization Techniques
