AutoTSMM: An Auto-tuning Framework for Building High-Performance Tall-and-Skinny Matrix-Matrix Multiplication on CPUs

Chendi Li; Haipeng Jia; Hang Cao; Jianyu Yao; Boqian Shi; Chunyang Xiang; Jinbo Sun; Pengqi Lu; Yunquan Zhang

arXiv:2208.08088·cs.DC·January 24, 2025

TL;DR

AutoTSMM is an auto-tuning framework for tall-and-skinny matrix-matrix multiplication on CPUs. It selects inner kernels at install time and generates an execution plan at runtime, matching state-of-the-art tall-and-skinny implementations and outperforming conventional matrix-matrix multiplication implementations.

Contribution

It introduces a novel auto-tuning framework specifically designed for efficient tall-and-skinny matrix multiplication on CPUs, addressing a gap in existing optimization techniques.

Findings

01

AutoTSMM achieves performance competitive with state-of-the-art tall-and-skinny matrix-matrix multiplication implementations.

02

It outperforms all conventional matrix-matrix multiplication implementations.

03

The framework effectively selects optimal kernels and execution plans for tall-and-skinny matrices.

Abstract

In recent years, general matrix-matrix multiplication with non-regular-shaped input matrices has been widely used in applications such as deep learning and has drawn increasing attention. However, conventional implementations are not suited to non-regular-shaped matrix-matrix multiplication, and few works focus on optimizing tall-and-skinny matrix-matrix multiplication on CPUs. This paper proposes an auto-tuning framework, AutoTSMM, to build high-performance tall-and-skinny matrix-matrix multiplication. AutoTSMM selects the optimal inner kernels in the install-time stage and generates an execution plan for the pre-pack tall-and-skinny matrix-matrix multiplication in the runtime stage. Experiments demonstrate that AutoTSMM achieves performance competitive with state-of-the-art tall-and-skinny matrix-matrix multiplication. It also outperforms all conventional matrix-matrix…
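
The install-time kernel selection described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `tsmm_blocked` and `select_kernel`, the candidate row-block sizes, and the sample problem dimensions are all hypothetical, and the pure-Python kernel only stands in for the real inner kernels, which the abstract does not show. The idea is simply to time each candidate on a representative tall-and-skinny problem and keep the fastest.

```python
import time


def tsmm_blocked(a, b, row_block):
    """Compute C = A @ B for a tall-and-skinny A (m x k, m >> k) and small B (k x n).

    Rows of A are processed in chunks of `row_block` (a hypothetical tuning knob).
    """
    m = len(a)
    # "Pre-pack" B once: store its columns contiguously so each dot product
    # walks a plain list. This is an illustrative stand-in for packing.
    b_cols = [list(col) for col in zip(*b)]
    c = [None] * m
    for i0 in range(0, m, row_block):
        for i in range(i0, min(i0 + row_block, m)):
            row = a[i]
            c[i] = [sum(x * y for x, y in zip(row, col)) for col in b_cols]
    return c


def select_kernel(candidates, m=512, k=8, n=8, repeats=3):
    """Time every candidate block size on a sample problem and return the fastest."""
    a = [[float(i + j) for j in range(k)] for i in range(m)]
    b = [[float(i - j) for j in range(n)] for i in range(k)]
    best, best_time = None, float("inf")
    for row_block in candidates:
        elapsed = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            tsmm_blocked(a, b, row_block)
            # Keep the minimum over repeats to reduce timing noise.
            elapsed = min(elapsed, time.perf_counter() - start)
        if elapsed < best_time:
            best, best_time = row_block, elapsed
    return best
```

A caller would run `select_kernel([1, 4, 16])` once, store the winner, and reuse it for later multiplications. In pure Python the block size has little effect on speed; the sketch shows only the select-by-measurement structure, not realistic performance behavior.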
