SISA: A Scale-In Systolic Array for GEMM Acceleration
Luigi Altamura, Alessio Cicero, Mateo Vázquez Maceiras, Mohammad Ali Maleki, Pedro Trancoso

TL;DR
SISA is a systolic array architecture that partitions the traditional square array into horizontal slabs that can be scheduled independently, so the small or skewed matrices common in LLM workloads leave fewer processing elements idle than on a monolithic square array.
Contribution
The paper introduces SISA (Scale-In Systolic Array), a systolic array (SA) design for GEMM acceleration in LLMs. It splits the array into rectangular slabs that run small or skewed GEMMs in parallel, while large GEMMs still use the full array.
Findings
Achieves up to 8.52x speedup on representative LLMs.
Reduces energy-delay product (EDP) by 93% compared to monolithic SAs.
Effectively handles input-dependent and skewed matrices in LLM workloads.
Abstract
The currently dominant AI/ML workloads, such as Large Language Models (LLMs), rely on the efficient execution of General Matrix-Matrix Multiplication (GEMM) operations. Thus, most systems are equipped with dedicated matrix hardware accelerators based on square Systolic Arrays (SAs) of Processing Elements (PEs). While this organization was effective for traditional Deep Neural Networks (DNNs), LLMs introduce input-dependent and highly skewed matrices, leading to underutilized SA resources. To address this challenge, we propose SISA (Scale-In Systolic Array), a novel SA architecture that partitions the traditional square array into horizontal rectangular slabs. With minimal overhead, SISA exposes parallelism through independently scheduled slabs for efficient execution of small or skewed matrix shapes, while retaining full-array operation for large GEMMs. SISA achieves up to 8.52x speedup…
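The underutilization argument in the abstract can be illustrated with a toy tile-counting model. The sketch below is not the paper's scheduler or performance model: it assumes each pass of a `rows` x `cols` array produces one output tile, that every pass takes the same time, and that slabs are equal-height horizontal strips that each work on their own tile. The function names, the 128 x 128 array size, and the choice of 8 slabs are all hypothetical and chosen only for illustration.

```python
from math import ceil


def passes_monolithic(m, n, rows, cols):
    """Passes needed when the whole rows x cols array computes one output tile per pass."""
    return ceil(m / rows) * ceil(n / cols)


def passes_slabbed(m, n, rows, cols, slabs):
    """Passes needed when the array is split into `slabs` horizontal strips.

    Assumption: each strip has rows // slabs PE rows and computes its own
    (rows // slabs) x cols output tile, independently of the other strips.
    """
    if rows % slabs != 0:
        raise ValueError("rows must be divisible by slabs in this toy model")
    slab_rows = rows // slabs
    tiles = ceil(m / slab_rows) * ceil(n / cols)
    return ceil(tiles / slabs)  # `slabs` tiles are processed per pass


def utilization(m, n, rows, cols, passes):
    """Fraction of PE slots that produce a useful output element."""
    return (m * n) / (passes * rows * cols)


# Skewed GEMM output (16 x 4096) on a hypothetical 128 x 128 array.
mono = passes_monolithic(16, 4096, 128, 128)
slab = passes_slabbed(16, 4096, 128, 128, slabs=8)
print(mono, utilization(16, 4096, 128, 128, mono))  # 32 0.125
print(slab, utilization(16, 4096, 128, 128, slab))  # 4 1.0

# Large GEMM output (4096 x 4096): both organizations need the same passes.
print(passes_monolithic(4096, 4096, 128, 128))        # 1024
print(passes_slabbed(4096, 4096, 128, 128, slabs=8))  # 1024
```

In this simplified model the skewed case fills only 16 of the 128 PE rows on the monolithic array, while eight 16-row slabs keep every row busy. The large case shows no difference, which matches the abstract's point that full-array operation is retained for large GEMMs. The pass ratio here is a property of the toy model and should not be read as the paper's measured speedup.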
