A Flexible Instruction Set Architecture for Efficient GEMMs

Alexandre de Limas Santana; Adrià Armejach; Francesc Martinez; Erich Focht; Marc Casas

arXiv:2507.03522·cs.AR·July 8, 2025


TL;DR

This paper introduces the Matrix Tile Extension (MTE), a flexible matrix ISA that improves GEMM performance by decoupling the instruction set architecture from the microarchitecture, achieving a 1.35x speed-up over state-of-the-art matrix ISAs.

Contribution

The paper presents MTE, the first matrix ISA that interacts seamlessly with vector ISAs and adapts dynamically to application-specific data formats, improving GEMM efficiency.

Findings

01 MTE achieves a 1.35x speed-up over state-of-the-art matrix ISAs.

02 MTE can vectorize GEMMs across the M, N, and K dimensions.

03 MTE requires minimal implementation overhead: a few instructions and a 64-bit CSR.
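
For readers unfamiliar with the M, N, and K naming, the following is a minimal reference sketch of a GEMM loop nest blocked along all three dimensions. It does not use or model MTE instructions. The function name `tiled_gemm` and the tile sizes `tm`, `tn`, and `tk` are hypothetical, chosen only for illustration. It shows the three loop dimensions that a matrix ISA could map onto hardware tiles, and how edge tiles shrink when a matrix is small, tall, or skinny.

```python
def tiled_gemm(A, B, tm=2, tn=2, tk=2):
    """Compute C = A @ B with a loop nest blocked over M, N, and K.

    A is M x K and B is K x N, both given as lists of rows.
    tm, tn, tk are illustrative tile sizes, not values from the paper.
    """
    M, K = len(A), len(A[0])
    N = len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, tm):          # tiles along M (rows of A and C)
        for j0 in range(0, N, tn):      # tiles along N (columns of B and C)
            for k0 in range(0, K, tk):  # tiles along K (reduction dimension)
                # min() clamps partial tiles at the matrix edges
                for i in range(i0, min(i0 + tm, M)):
                    for j in range(j0, min(j0 + tn, N)):
                        acc = C[i][j]
                        for k in range(k0, min(k0 + tk, K)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6]]          # M = 2, K = 3
    B = [[7, 8], [9, 10], [11, 12]]     # K = 3, N = 2
    print(tiled_gemm(A, B))
```

With a fixed tile shape, a skinny input (for example N = 1 with `tn = 2`) leaves part of every tile unused, which is one way to picture the rigidity the abstract attributes to existing matrix ISAs.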

Abstract

GEneral Matrix Multiplications (GEMMs) are recurrent in high-performance computing and deep learning workloads. Typically, high-end CPUs accelerate GEMM workloads with Single-Instruction Multiple-Data (SIMD) or vector Instruction Set Architectures (ISAs). Because these ISAs face significant issues when running GEMM workloads, particularly with small, tall, or skinny matrices, major hardware vendors have proposed and implemented matrix ISAs in recent years. Although these matrix ISAs deliver higher GEMM throughput than their SIMD/vector counterparts, they are rigid solutions that cannot dynamically adapt to application-specific aspects such as the data format. This paper demonstrates that state-of-the-art matrix ISAs deliver suboptimal performance when running the most commonly used convolution and transformer models. This paper proposes the…
