TL;DR
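FusedMM is a single high-performance kernel that fuses sampled dense-dense matrix multiplication (SDDMM) and sparse-dense matrix multiplication (SpMM) to accelerate graph embedding and GNN computations. It is an order of magnitude faster than the equivalent Deep Graph Library kernels and speeds up end-to-end graph embedding by up to 28x on Intel, AMD, and ARM processors.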
Contribution
The paper introduces FusedMM, a unified kernel that expresses sampled dense-dense and sparse-dense matrix multiplication as a single operation customized through user-defined functions. It runs an order of magnitude faster than the equivalent kernels in Deep Graph Library.
Findings
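FusedMM is an order of magnitude faster than its equivalent kernels in Deep Graph Library.
A code generator tunes the kernel, which performs equally well on Intel, AMD, and ARM processors.
FusedMM speeds up an end-to-end graph embedding algorithm by up to 28x on different processors.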
Abstract
We develop a fused matrix multiplication kernel that unifies sampled dense-dense matrix multiplication and sparse-dense matrix multiplication under a single operation called FusedMM. By using user-defined functions, FusedMM can capture almost all computational patterns needed by popular graph embedding and GNN approaches. FusedMM is an order of magnitude faster than its equivalent kernels in Deep Graph Library. The superior performance of FusedMM comes from the low-level vectorized kernels, a suitable load balancing scheme and an efficient utilization of the memory bandwidth. FusedMM can tune its performance using a code generator and perform equally well on Intel, AMD and ARM processors. FusedMM speeds up an end-to-end graph embedding algorithm by up to 28x on different processors.
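To make the fusion described in the abstract concrete, here is a minimal pure-Python sketch. It is not the paper's API or its vectorized implementation. It assumes a graph stored in CSR form (`indptr`, `indices`), dense feature matrices `X` and `Y` given as lists of rows, and a single user-defined function `edge_op` applied to each per-edge dot product. All function and parameter names are hypothetical. The unfused path materializes one scalar per edge (SDDMM) and then aggregates neighbor rows (SpMM); the fused path does both in one pass without storing the intermediate edge values.

```python
import math


def sddmm(indptr, indices, X, Y, edge_op):
    """Sampled dense-dense product: one scalar per stored edge (u, v)."""
    vals = []
    for u in range(len(indptr) - 1):
        for k in range(indptr[u], indptr[u + 1]):
            v = indices[k]
            dot = sum(a * b for a, b in zip(X[u], Y[v]))
            vals.append(edge_op(dot))
    return vals


def spmm(indptr, indices, vals, Y):
    """Sparse-dense product: Z[u] = sum over edges (u, v) of vals * Y[v]."""
    d = len(Y[0])
    Z = [[0.0] * d for _ in range(len(indptr) - 1)]
    for u in range(len(indptr) - 1):
        for k in range(indptr[u], indptr[u + 1]):
            v = indices[k]
            for j in range(d):
                Z[u][j] += vals[k] * Y[v][j]
    return Z


def fused_mm(indptr, indices, X, Y, edge_op):
    """Fused variant: compute each edge scalar and consume it immediately,
    so no per-edge intermediate array is ever stored."""
    d = len(Y[0])
    Z = [[0.0] * d for _ in range(len(indptr) - 1)]
    for u in range(len(indptr) - 1):
        for k in range(indptr[u], indptr[u + 1]):
            v = indices[k]
            dot = sum(a * b for a, b in zip(X[u], Y[v]))
            s = edge_op(dot)  # user-defined function (assumed scalar -> scalar)
            for j in range(d):
                Z[u][j] += s * Y[v][j]
    return Z


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


# Tiny 3-node example graph: edges 0->1, 0->2, 1->0, 2->1 (illustrative data).
indptr = [0, 2, 3, 4]
indices = [1, 2, 0, 1]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = X

Z_unfused = spmm(indptr, indices, sddmm(indptr, indices, X, Y, sigmoid), Y)
Z_fused = fused_mm(indptr, indices, X, Y, sigmoid)
```

Both paths perform the same arithmetic in the same order, so `Z_fused` matches `Z_unfused`. The difference is that the fused loop never writes the per-edge values to memory, which is one plausible reading of the abstract's "efficient utilization of the memory bandwidth".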
