Demystifying ARM SME to Optimize General Matrix Multiplications

Chencheng Deng; Weiling Yang; Jianbin Fang; Dezun Dong

arXiv:2512.21473·cs.DC·December 29, 2025

Open Access

TL;DR

This paper introduces MpGEMM, an open-source library that uses ARM's Scalable Matrix Extension (SME) to accelerate GEMM. On real-world DeepSeek and LLaMA workloads it achieves an average 1.23x speedup over Apple's Accelerate library.

Contribution

The paper systematically characterizes ARM SME, derives optimization guidelines from that characterization, and applies them in an efficient GEMM library that supports multiple precisions.

Findings

01. Achieves an average 1.23x speedup over the Apple Accelerate library on an Apple M4 Pro.

02. Outperforms other open-source GEMM libraries.

03. Makes effective use of SME features for large matrix multiplications.

Abstract

General Matrix Multiplication (GEMM) is a critical kernel in high-performance computing and deep learning. While modern architectures like ARM's Scalable Matrix Extension (SME) introduce dedicated hardware for matrix operations, existing linear algebra libraries fail to fully exploit its potential, particularly for large matrices. This paper presents MpGEMM, an open-source library that leverages key architectural features of SME to optimize GEMM across multiple precisions. Through a systematic characterization of SME, we derive optimization guidelines that inform our design. MpGEMM employs cache-aware partitioning, efficient data packing with on-the-fly transposition, and specialized micro-kernels that utilize multi-vector loads and all available tile registers. Evaluated on an Apple M4 Pro with real-world workloads from DeepSeek and LLaMA, MpGEMM achieves an average speedup of 1.23x…
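The pipeline named in the abstract (partition into cache-sized blocks, pack with transposition, hand each block to a micro-kernel) can be illustrated with a minimal pure-Python sketch. This is not MpGEMM's implementation. The function names, the block sizes `mc`, `nc`, `kc`, and the loop order are all assumptions chosen for illustration, and a plain loop of rank-1 updates stands in for the SME micro-kernel, since SME's matrix instructions accumulate outer products into tile storage.

```python
# Illustrative sketch only: every name and block size here is hypothetical,
# not taken from MpGEMM. Pure Python lists stand in for packed buffers.

def pack_a(A, i0, mb, p0, kb):
    """Copy block A[i0:i0+mb, p0:p0+kb] into a transposed (k-major) buffer.

    The transposition happens during the copy, so each packed row holds one
    column of the A block, ready to be used in an outer product.
    """
    return [[A[i0 + i][p0 + p] for i in range(mb)] for p in range(kb)]


def pack_b(B, p0, kb, j0, nb):
    """Copy block B[p0:p0+kb, j0:j0+nb] into a contiguous row-major buffer."""
    return [B[p0 + p][j0:j0 + nb] for p in range(kb)]


def micro_kernel(a_panel, b_panel, C, i0, j0):
    """Accumulate one rank-1 update (outer product) into the C tile per k step."""
    for a_col, b_row in zip(a_panel, b_panel):
        for i, a in enumerate(a_col):
            c_row = C[i0 + i]
            for j, b in enumerate(b_row):
                c_row[j0 + j] += a * b


def blocked_gemm(A, B, mc=4, nc=4, kc=4):
    """Compute C = A @ B with blocking over n, k, and m (assumed loop order)."""
    m, k, n = len(A), len(A[0]), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for j0 in range(0, n, nc):
        nb = min(nc, n - j0)
        for p0 in range(0, k, kc):
            kb = min(kc, k - p0)
            b_panel = pack_b(B, p0, kb, j0, nb)  # packed once, reused for all row blocks
            for i0 in range(0, m, mc):
                mb = min(mc, m - i0)
                a_panel = pack_a(A, i0, mb, p0, kb)
                micro_kernel(a_panel, b_panel, C, i0, j0)
    return C
```

For example, `blocked_gemm([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]], mc=1, nc=1, kc=2)` produces the same product as a naive triple loop, including the ragged edge block where `k` is not a multiple of `kc`. In a real library the block sizes would be tuned to the cache hierarchy and the micro-kernel would be written with hardware-specific instructions.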

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Numerical Methods and Algorithms · Low-power high-performance VLSI design