A Flexible Instruction Set Architecture for Efficient GEMMs
Alexandre de Limas Santana, Adrià Armejach, Francesc Martinez, Erich Focht, Marc Casas

TL;DR
This paper introduces the Matrix Tile Extension (MTE), a flexible matrix ISA that improves GEMM performance by decoupling the instruction set architecture from the microarchitecture, achieving a 1.35x speed-up over state-of-the-art matrix ISAs.
Contribution
The paper presents MTE, the first matrix ISA that interacts seamlessly with vector ISAs and adapts dynamically to application-specific data formats, improving GEMM efficiency.
Findings
MTE achieves a 1.35x speed-up over state-of-the-art matrix ISAs.
MTE can vectorize GEMMs across M, N, and K dimensions.
MTE requires minimal implementation overhead with few instructions and a 64-bit CSR.
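To make the second finding concrete, the sketch below shows in plain Python what blocking a GEMM along the M, N, and K dimensions means. It is a generic illustration only, not MTE's instructions or semantics: the function name `tiled_gemm` and the parameters `tile_m`, `tile_n`, and `tile_k` are hypothetical names chosen for this example.

```python
def tiled_gemm(A, B, tile_m, tile_n, tile_k):
    """Compute C = A @ B block by block.

    A is M x K and B is K x N, both given as lists of rows.
    tile_m, tile_n and tile_k are hypothetical tile sizes along M, N and K;
    they are illustrative parameters, not values defined by MTE.
    """
    M, K = len(A), len(A[0])
    N = len(B[0])
    if len(B) != K:
        raise ValueError("inner dimensions of A and B do not match")

    C = [[0] * N for _ in range(M)]
    for i0 in range(0, M, tile_m):          # blocks along M
        i1 = min(i0 + tile_m, M)            # edge tiles shrink, no padding
        for j0 in range(0, N, tile_n):      # blocks along N
            j1 = min(j0 + tile_n, N)
            for k0 in range(0, K, tile_k):  # blocks along K (reduction)
                k1 = min(k0 + tile_k, K)
                # Accumulate this block's partial products into C.
                for i in range(i0, i1):
                    for j in range(j0, j1):
                        acc = C[i][j]
                        for k in range(k0, k1):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C


# A tall, skinny case (5 x 2 times 2 x 1) with two different tile shapes.
A = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
B = [[1], [1]]
assert tiled_gemm(A, B, 4, 4, 4) == tiled_gemm(A, B, 2, 1, 1)
```

The result does not depend on the tile shape, but the amount of work per block does: with a fixed 4 x 4 x 4 tile, the 5 x 2 by 2 x 1 product above fills only a small part of each block, which is the kind of mismatch a rigid tile geometry produces on small, tall, or skinny matrices.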
Abstract
GEneral Matrix Multiplications (GEMMs) are recurrent in high-performance computing and deep learning workloads. Typically, high-end CPUs accelerate GEMM workloads with Single-Instruction Multiple Data (SIMD) or vector Instruction Set Architectures (ISAs). Since these ISAs face significant issues when running GEMM workloads, particularly when dealing with small, tall, or skinny matrices, matrix ISAs have been proposed and implemented by major hardware vendors in recent years. Although these matrix ISAs deliver larger throughput when running GEMMs than their SIMD/vector counterparts, they are rigid solutions unable to dynamically adapt themselves to application-specific aspects like the data format. This paper demonstrates that the state-of-the-art matrix ISAs deliver suboptimal performance when running the most commonly used convolution and transformer models. This paper proposes the…
