LOw-cOst yet High-Performant Sparse Matrix-Matrix Multiplication on Arm SME Architectures
Kelun Lei, Hailong Yang, Kaige Zhang, Kejie Ma, Yiqing Wang, Xin You, Yufan Xu, Enrique S. Quintana-Orti, Zhongzhi Luan, Yi Liu, Depei Qian

TL;DR
LOOPS is a novel hybrid framework that efficiently accelerates sparse matrix-matrix multiplication on Arm SME architectures, achieving high speedups and energy efficiency compared to CPU and GPU baselines.
Contribution
It introduces a hybrid execution framework combining CSR and BCSR layouts to exploit SME and SIMD resources for unstructured sparse workloads.
Findings
Achieves up to 14.4× speedup over TACO on CPU.
Delivers up to 33.5× speedup over GPU methods on an NVIDIA A100.
Significantly improves energy efficiency compared to GPU implementations.
Abstract
Sparse matrix-dense matrix multiplication (SpMM) is a critical kernel in both scientific computing and emerging graph learning workloads. The recent Armv9 architecture introduces the Scalable Matrix Extension (SME), enabling tile-based matrix operations with high throughput. However, effectively exploiting both SME and traditional SIMD resources for unstructured sparse workloads remains an open challenge. To address this, we propose LOOPS, a hybrid execution framework that combines a row-wise CSR-part with a vector-wise BCSR-part layout, enabling cooperative utilization of vector instructions (NEON) and SME resources. LOOPS supports multi-precision SpMM across FP64, FP32, and FP16 via an adaptive two-level parallelization scheme guided by a lightweight performance model. Experimental results on the entire SuiteSparse collection on Apple's M4 Pro CPU show that LOOPS achieves…
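As a rough illustration of the hybrid idea described in the abstract, the following Python sketch splits a small matrix into a CSR part and a part made of dense column-vector blocks, then multiplies both parts against a dense matrix. It is not the LOOPS implementation: the function names (`split_hybrid`, `hybrid_spmm`), the `block_h` and `min_nnz` parameters, and the density heuristic are all assumptions made for illustration, and plain Python loops stand in for NEON and SME instructions.

```python
def split_hybrid(a, block_h=2, min_nnz=2):
    """Split dense-stored matrix `a` into a CSR part and a vector-block part.

    Hypothetical heuristic: rows are grouped into panels of `block_h` rows.
    A block_h x 1 column slice with at least `min_nnz` nonzeros is kept as a
    dense column-vector block; all other nonzeros go to the CSR part.
    """
    n_rows, n_cols = len(a), len(a[0])
    rows_csr = [[] for _ in range(n_rows)]  # (col, value) pairs per row
    blocks = []                             # (first_row, col, values)
    for r0 in range(0, n_rows, block_h):
        panel = range(r0, min(r0 + block_h, n_rows))
        for j in range(n_cols):
            col = [a[i][j] for i in panel]
            if sum(v != 0 for v in col) >= min_nnz:
                blocks.append((r0, j, col))
            else:
                for i, v in zip(panel, col):
                    if v != 0:
                        rows_csr[i].append((j, v))
    # Flatten the per-row lists into standard CSR arrays.
    indptr, indices, data = [0], [], []
    for row in rows_csr:
        for j, v in row:
            indices.append(j)
            data.append(v)
        indptr.append(len(indices))
    return (indptr, indices, data), blocks


def hybrid_spmm(csr, blocks, b):
    """Compute C = A @ B, where A is given as a CSR part plus vector blocks."""
    indptr, indices, data = csr
    n_rows, n = len(indptr) - 1, len(b[0])
    c = [[0.0] * n for _ in range(n_rows)]
    # CSR part: each nonzero scales one row of B (the vector-style path).
    for i in range(n_rows):
        for p in range(indptr[i], indptr[i + 1]):
            j, v = indices[p], data[p]
            for k in range(n):
                c[i][k] += v * b[j][k]
    # Block part: outer product of a column vector of A with a row of B,
    # accumulated into a block_h x n tile of C (the matrix-style path).
    for r0, j, col in blocks:
        for di, v in enumerate(col):
            for k in range(n):
                c[r0 + di][k] += v * b[j][k]
    return c


def dense_matmul(a, b):
    """Reference dense product used only to check the hybrid result."""
    return [[sum(a[i][j] * b[j][k] for j in range(len(b)))
             for k in range(len(b[0]))] for i in range(len(a))]
```

Calling `split_hybrid` on a matrix and passing its two outputs to `hybrid_spmm` should give the same result as `dense_matmul`. The outer-product form of the block path is chosen here because it mirrors how tile-based matrix units accumulate results, but the actual block shape and partitioning rule used by LOOPS are not stated in the text above.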
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Graph Theory and Algorithms · Big Data and Digital Economy
