A Machine Learning Approach Towards Runtime Optimisation of Matrix Multiplication

Yufan Xia; Marco De La Pierre; Amanda S. Barnard; Giuseppe Maria Junior Barca

arXiv:2601.09114·cs.DC·January 15, 2026

Open Access

TL;DR

This paper introduces ADSALA, a proof-of-concept machine-learning-based library that selects the number of threads for GEMM at runtime on multi-core shared-memory systems, achieving speedups of 25-40% over traditional GEMM implementations.

Contribution

It presents a proof-of-concept approach that uses a machine learning model, built from collected training data, to automatically select the number of threads for each GEMM task, improving runtime on modern HPC node architectures.

Findings

01. Speedups of 25-40% over traditional GEMM implementations.

02. Effective on both Intel Cascade Lake and AMD Zen 3 architectures.

03. The optimization is most beneficial for GEMM tasks that use up to 100 MB of memory.
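
To make the 100 MB figure concrete, the sketch below estimates the memory footprint of a GEMM task. It assumes the footprint counts the three operands A (m x k), B (k x n) and C (m x n), double-precision elements, and 1 MB = 10^6 bytes; the page does not state how the paper measures memory usage, and the function names are hypothetical.

```python
def gemm_memory_bytes(m, k, n, bytes_per_element=8):
    """Bytes needed to hold A (m x k), B (k x n) and C (m x n).

    Assumption: only the three operands are counted, with 8-byte
    (double-precision) elements by default.
    """
    return (m * k + k * n + m * n) * bytes_per_element


def within_budget(m, k, n, budget_mb=100, bytes_per_element=8):
    """True if the task fits in the budget (assuming 1 MB = 10**6 bytes)."""
    return gemm_memory_bytes(m, k, n, bytes_per_element) <= budget_mb * 10**6
```

Under these assumptions, a square double-precision GEMM with m = k = n = 2000 needs 96 MB and falls inside the range, while m = k = n = 2100 does not.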

Abstract

The GEneral Matrix Multiplication (GEMM) is one of the essential algorithms in scientific computing. Single-thread GEMM implementations are well-optimised with techniques like blocking and autotuning. However, due to the complexity of modern multi-core shared memory systems, it is challenging to determine the number of threads that minimises the multi-thread GEMM runtime. We present a proof-of-concept approach to building an Architecture and Data-Structure Aware Linear Algebra (ADSALA) software library that uses machine learning to optimise the runtime performance of BLAS routines. More specifically, our method uses a machine learning model on-the-fly to automatically select the optimal number of threads for a given GEMM task based on the collected training data. Test results on two different HPC node architectures, one based on a two-socket Intel Cascade Lake and the other on a…
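
The thread-selection idea in the abstract can be sketched roughly as follows. This is a minimal illustration, not the ADSALA implementation: the abstract does not say which model is used, so a nearest-neighbour runtime predictor stands in for it, and the training samples come from a synthetic cost model (`toy_runtime`) rather than from measured GEMM timings. All function names and constants are hypothetical.

```python
import math


def toy_runtime(m, k, n, threads):
    """Synthetic stand-in for a timed GEMM call (not measured data)."""
    work = 2.0 * m * k * n / 1e9            # GFLOP performed by the GEMM
    return work / threads + 1e-4 * threads  # parallel work plus per-thread overhead


def build_training_set(shapes, thread_counts):
    """Collect ((m, k, n, threads), runtime) samples for every combination."""
    return [((m, k, n, t), toy_runtime(m, k, n, t))
            for (m, k, n) in shapes for t in thread_counts]


def features(m, k, n, threads):
    # Log-scaled features so small and large problems are comparable.
    return (math.log2(m), math.log2(k), math.log2(n), math.log2(threads))


def predict_runtime(train, m, k, n, threads, neighbours=1):
    """Nearest-neighbour regression: average the runtimes of the closest samples."""
    query = features(m, k, n, threads)
    nearest = sorted(train, key=lambda s: math.dist(query, features(*s[0])))
    nearest = nearest[:neighbours]
    return sum(runtime for _, runtime in nearest) / len(nearest)


def select_threads(train, m, k, n, thread_counts):
    """Pick the candidate thread count with the lowest predicted runtime."""
    return min(thread_counts, key=lambda t: predict_runtime(train, m, k, n, t))


shapes = [(64, 64, 64), (256, 256, 256), (1024, 1024, 1024)]
thread_counts = [1, 2, 4, 8, 16, 32, 48]
train = build_training_set(shapes, thread_counts)

small = select_threads(train, 64, 64, 64, thread_counts)
large = select_threads(train, 1024, 1024, 1024, thread_counts)
```

With this toy cost model the small problem is assigned 2 threads and the large one 48, which illustrates why a fixed thread count is not always the fastest choice. In a real setting the training set would hold measured runtimes from the target node.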

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Numerical Methods and Algorithms · Low-power high-performance VLSI design