Hello SME! Generating Fast Matrix Multiplication Kernels Using the Scalable Matrix Extension
Stefan Remke, Alexander Breuer

TL;DR
This paper studies Arm's Scalable Matrix Extension (SME) as implemented on Apple's M4 chip, benchmarks its performance, and introduces a just-in-time code generator that outperforms existing BLAS implementations for small matrix multiplications.
Contribution
It provides an in-depth analysis of SME on M4 and presents a novel JIT code generator that enhances small matrix multiplication performance.
Findings
SME on M4 achieves over 2.3 TFLOPS of FP32 throughput.
Optimized load/store strategies are essential for utilizing the available bandwidth to and from the matrix registers.
JIT kernels outperform vendor-optimized BLAS in most cases.
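To make the findings above concrete, here is a minimal Python sketch of how a small matrix multiplication can be expressed as a sequence of outer-product updates, the style of operation that SME's outer-product-and-accumulate instructions provide. It illustrates the idea only and is not the paper's code generator. The function names are hypothetical, and the 2·M·N·K operation count is the conventional way to count GEMM floating-point operations, assumed here rather than taken from the paper.

```python
def matmul_outer(a, b):
    """Compute C = A @ B as a sum of K rank-1 (outer-product) updates.

    a is an M x K list of lists, b is a K x N list of lists.
    """
    m, k = len(a), len(b)
    n = len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    c = [[0.0] * n for _ in range(m)]
    for p in range(k):
        col = [a[i][p] for i in range(m)]  # p-th column of A
        row = b[p]                         # p-th row of B
        # One outer-product-and-accumulate step: C += col * row^T
        for i in range(m):
            for j in range(n):
                c[i][j] += col[i] * row[j]
    return c


def gemm_flops(m, n, k):
    """Conventional GEMM operation count: one multiply and one add per term."""
    return 2 * m * n * k


def gflops(m, n, k, seconds):
    """Throughput in GFLOPS for one M x N x K multiplication taking `seconds`."""
    return gemm_flops(m, n, k) / seconds / 1e9
```

Under this counting, a 16 x 16 x 16 multiplication performs 8192 floating-point operations, so sustaining a throughput in the TFLOPS range requires completing many such small kernels per microsecond.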
Abstract
Modern central processing units (CPUs) feature single-instruction, multiple-data pipelines to accelerate compute-intensive floating-point and fixed-point workloads. Traditionally, these pipelines and corresponding instruction set architectures (ISAs) were designed for vector parallelism. In recent years, major hardware vendors have further increased the throughput of their CPUs by introducing matrix units with corresponding ISA extensions. The Scalable Matrix Extension (SME) was announced for the Arm architecture in 2021, and Apple's M4 chip is the first to support it. This paper presents an in-depth study of SME on M4. Our microbenchmarks determine the maximum floating-point and fixed-point throughput of M4's SME acceleration and study the achievable bandwidth for transfers to and from the matrix registers. Furthermore, we used the insights gained to design a just-in-time code…
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Numerical Methods and Algorithms · Computational Physics and Python Applications
