Exploring Sparse Matrix Multiplication Kernels on the Cerebras CS-3
Milan Shah, Sheng Di, Michela Becchi

TL;DR
This paper investigates the performance of sparse matrix multiplication kernels on the Cerebras CS-3 accelerator, proposing optimized designs and benchmarking against CPU performance for various sparsity levels.
Contribution
It introduces low-level kernel designs for sparse-dense and sampled dense-dense matrix multiplication on the CS-3 and evaluates their scalability and efficiency.
Findings
The CS-3 outperforms the CPU by 100x on SpMM at 90% sparsity.
The CS-3 outperforms the CPU by 20x on SDDMM at 90% sparsity.
Performance degrades beyond 99% sparsity, where the CS-3 becomes slower than the CPU.
Abstract
In recent years, novel AI accelerators have emerged as promising alternatives to GPUs for AI model training and inference tasks. One such accelerator, the Cerebras CS-3, achieves strong performance on large model training as well as scientific applications like molecular dynamics simulations. While dense compute workloads have been thoroughly explored for the CS-3, its potential for sparse workloads has not been fully examined. Applications requiring sparse linear algebra kernels, such as GNNs, linear solvers, and recommendation systems, could achieve good performance on a dataflow accelerator like the CS-3. In this work, we explore two key sparse linear algebra kernels, sparse-dense matrix multiplication (SpMM) and sampled dense-dense matrix multiplication (SDDMM), on the Cerebras CS-3. We propose low-level CS-3 kernel designs for these operations and optimize our designs to improve…
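For readers unfamiliar with the two kernels named in the abstract, the following is a minimal pure-Python reference sketch of what SpMM and SDDMM compute. It is not the paper's CS-3 kernel design. The CSR storage layout, the function names `spmm_csr` and `sddmm_csr`, and the convention that SDDMM computes S ⊙ (X Yᵀ) with both dense operands stored row-wise are assumptions made here for illustration.

```python
# Reference (CPU, pure-Python) definitions of the two kernels.
# Assumption: the sparse operand is stored in CSR form (indptr, indices, data).
# This illustrates the math only; it is not the CS-3 implementation.

def spmm_csr(indptr, indices, data, B):
    """SpMM: return dense C = A @ B, with A sparse (CSR) and B a dense list of rows."""
    n_rows = len(indptr) - 1
    n_cols = len(B[0])
    C = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        # Visit only the stored nonzeros of row i of A.
        for p in range(indptr[i], indptr[i + 1]):
            k, a_ik = indices[p], data[p]
            row_b = B[k]
            for j in range(n_cols):
                C[i][j] += a_ik * row_b[j]
    return C


def sddmm_csr(indptr, indices, data, X, Y):
    """SDDMM: return the values of S * (X @ Y^T), computed only at S's nonzeros.

    S is sparse (CSR); X has one row per row of S, Y has one row per column of S.
    The result reuses S's sparsity pattern, so only the value array is returned.
    """
    out = [0.0] * len(data)
    for i in range(len(indptr) - 1):
        for p in range(indptr[i], indptr[i + 1]):
            j = indices[p]
            dot = sum(x * y for x, y in zip(X[i], Y[j]))
            out[p] = data[p] * dot
    return out


# A = [[1, 0, 2],
#      [0, 0, 3]] in CSR form
indptr, indices, data = [0, 2, 3], [0, 2, 2], [1.0, 2.0, 3.0]

B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
C = spmm_csr(indptr, indices, data, B)

X = [[1.0, 2.0], [3.0, 4.0]]
Y = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
vals = sddmm_csr(indptr, indices, data, X, Y)
```

In both kernels the work is proportional to the number of stored nonzeros. SpMM produces a dense output, while SDDMM evaluates the dense product only at the sampled positions and produces a sparse one.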
