Exploring Sparse Matrix Multiplication Kernels on the Cerebras CS-3

Milan Shah; Sheng Di; Michela Becchi

arXiv:2604.27985 · cs.DC · May 1, 2026

TL;DR

This paper investigates the performance of sparse matrix multiplication kernels on the Cerebras CS-3 accelerator, proposing optimized designs and benchmarking against CPU performance for various sparsity levels.

Contribution

It introduces low-level kernel designs for sparse-dense and sampled dense-dense matrix multiplication on the CS-3 and evaluates their scalability and efficiency.

Findings

01. CS-3 outperforms CPU by 100x for 90% sparse SpMM.

02. CS-3 outperforms CPU by 20x for 90% sparse SDDMM.

03. Performance degrades beyond 99% sparsity, making CS-3 slower than CPU.

Abstract

In recent years, novel AI accelerators have emerged as promising alternatives to GPUs for AI model training and inference tasks. One such accelerator, the Cerebras CS-3, achieves strong performance on large model training as well as scientific applications like molecular dynamics simulations. While dense compute workloads have been thoroughly explored for the CS-3, its potential for sparse workloads has not been fully examined. Applications requiring sparse linear algebra kernels, such as GNNs, linear solvers, and recommendation systems, could achieve good performance on a dataflow accelerator like the CS-3. In this work, we explore two key sparse linear algebra kernels, sparse-dense matrix multiplication (SpMM) and sampled dense-dense matrix multiplication (SDDMM), on the Cerebras CS-3. We propose low-level CS-3 kernel designs for these operations and optimize our designs to improve…
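
For readers unfamiliar with the two kernels named in the abstract, the sketch below gives their reference semantics in plain Python. It is not the paper's CS-3 kernel design. The CSR storage layout, the function names `spmm_csr` and `sddmm_csr`, and the SDDMM convention used here (sampling the product X·Yᵀ at the nonzeros of a sparse matrix S and scaling by S's values) are assumptions made for illustration; the truncated abstract does not specify any of them.

```python
# Reference semantics only; NOT the paper's CS-3 kernel design.
# CSR storage and all function names are illustrative assumptions.

def spmm_csr(indptr, indices, data, B):
    """SpMM: C = A @ B, with sparse A in CSR form (m x k) and dense B (k x n)."""
    m = len(indptr) - 1
    n = len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):
        # Visit only the stored nonzeros of row i of A.
        for p in range(indptr[i], indptr[i + 1]):
            a, row_b = data[p], B[indices[p]]
            for j in range(n):
                C[i][j] += a * row_b[j]
    return C


def sddmm_csr(indptr, indices, data, X, Y):
    """SDDMM: out = S * (X @ Y^T), evaluated only at the nonzeros of S.

    S is sparse in CSR form (m x n), X is dense (m x d), Y is dense (n x d).
    Returns the new CSR values array; the sparsity pattern is unchanged.
    """
    out = [0.0] * len(data)
    for i in range(len(indptr) - 1):
        for p in range(indptr[i], indptr[i + 1]):
            j = indices[p]
            dot = sum(x * y for x, y in zip(X[i], Y[j]))
            out[p] = data[p] * dot
    return out


# A = S = [[1, 0, 2],
#          [0, 0, 3]] stored in CSR form.
indptr, indices, data = [0, 2, 3], [0, 2, 2], [1, 2, 3]

B = [[1, 2], [3, 4], [5, 6]]          # dense, 3 x 2
C = spmm_csr(indptr, indices, data, B)

X = [[1, 2], [3, 4]]                  # dense, 2 x 2
Y = [[1, 0], [0, 1], [1, 1]]          # dense, 3 x 2
vals = sddmm_csr(indptr, indices, data, X, Y)
```

In both loops the work is proportional to the number of stored nonzeros rather than to the full matrix size, which is why sparsity level is the natural axis for the comparisons listed under Findings.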
