D-Legion: A Scalable Many-Core Architecture for Accelerating Matrix Multiplication in Quantized LLMs
Ahmed J. Abdelmaksoud, Cristian Sestito, Shiwei Wang, and Themis Prodromakis

TL;DR
D-Legion is a scalable many-core architecture built from adaptive-precision systolic arrays that accelerates matrix multiplication in quantized large language models, reporting up to 8.2× lower latency and up to 3.8× higher memory savings than the state of the art.
Contribution
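The paper introduces D-Legion, a many-core architecture organized as a set of Legions, each containing a group of adaptive-precision systolic array cores, with optimized scheduling tailored to quantized LLM workloads. It supports both dense and block-structured sparse quantized matrix multiplication.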
Findings
Up to 8.2× lower latency than the state of the art.
Up to 3.8× higher memory savings.
Achieves 135.68 TOPS at 1 GHz with 8 Legions.
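To put the throughput figure in per-cycle terms, the following is a back-of-the-envelope sketch in Python. Only the 135.68 TOPS, 1 GHz, and 8 Legions values come from the summary above. The even split across Legions and the convention of counting one multiply-accumulate (MAC) as two operations are assumptions made here for illustration; the paper may count operations differently.

```python
# Back-of-the-envelope breakdown of the reported peak throughput.
PEAK_TOPS = 135.68   # reported above
CLOCK_HZ = 1e9       # 1 GHz, reported above
NUM_LEGIONS = 8      # reported above
OPS_PER_MAC = 2      # ASSUMPTION: one multiply plus one add per MAC


def ops_per_cycle(peak_tops: float, clock_hz: float) -> float:
    """Convert a tera-ops-per-second figure into operations per clock cycle."""
    return peak_tops * 1e12 / clock_hz


total_ops = ops_per_cycle(PEAK_TOPS, CLOCK_HZ)
# ASSUMPTION: throughput is split evenly across the Legions.
per_legion_ops = total_ops / NUM_LEGIONS
per_legion_macs = per_legion_ops / OPS_PER_MAC

print(f"total ops/cycle:       {total_ops:,.0f}")
print(f"ops/cycle per Legion:  {per_legion_ops:,.0f}")
print(f"MACs/cycle per Legion: {per_legion_macs:,.0f}")
```

Under these assumptions the figure works out to about 135,680 operations per cycle overall, 16,960 per Legion, or 8,480 MACs per cycle per Legion.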
Abstract
The performance gains obtained by large language models (LLMs) are closely linked to their substantial computational and memory requirements. Quantized LLMs, particularly extremely quantized models, offer significant advantages, motivating the development of specialized architectures to accelerate their workloads. This paper proposes D-Legion, a novel scalable many-core architecture, built from many adaptive-precision systolic array cores, to accelerate matrix multiplication in quantized LLMs. The proposed architecture consists of a set of Legions, where each Legion has a group of adaptive-precision systolic arrays. D-Legion supports multiple computation modes, including quantized sparse and dense matrix multiplications. Block-structured sparsity is exploited within fully sparse or partially sparse windows. In addition, memory accesses of partial summations (psums) are spatially reduced…
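The idea of exploiting block-structured sparsity in a quantized matrix multiplication can be illustrated with a small software sketch. This is not the D-Legion hardware dataflow or its window scheme. It is a minimal pure-Python model in which the weight matrix `W`, the activation matrix `X`, the block size `bs`, and both function names are hypothetical. It shows only that an all-zero block of one operand contributes nothing to the partial sums and can be skipped.

```python
def block_is_zero(W, r0, c0, bs):
    """Return True if the bs x bs block of W starting at (r0, c0) is all zeros."""
    return all(W[r][c] == 0
               for r in range(r0, r0 + bs)
               for c in range(c0, c0 + bs))


def block_sparse_matmul(W, X, bs):
    """Compute W @ X on integer matrices, skipping all-zero bs x bs blocks of W.

    Returns the product and the number of blocks skipped.
    """
    m, k, n = len(W), len(X), len(X[0])
    assert len(W[0]) == k, "inner dimensions must match"
    assert m % bs == 0 and k % bs == 0, "W must tile evenly into blocks"
    C = [[0] * n for _ in range(m)]
    skipped = 0
    for r0 in range(0, m, bs):
        for k0 in range(0, k, bs):
            if block_is_zero(W, r0, k0, bs):
                skipped += 1          # no work issued for a zero block
                continue
            for r in range(r0, r0 + bs):
                for kk in range(k0, k0 + bs):
                    w = W[r][kk]
                    for c in range(n):
                        C[r][c] += w * X[kk][c]   # accumulate partial sums
    return C, skipped


# Small integer example standing in for quantized values (hypothetical data).
W = [[1, 2, 0, 0],
     [3, 4, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 5, 6]]
X = [[1, 0],
     [0, 1],
     [2, 2],
     [1, 1]]
C, skipped = block_sparse_matmul(W, X, bs=2)
print(C, skipped)
```

With 2×2 blocks, two of the four blocks of `W` are entirely zero, so only half of the block products are computed while the result matches a dense multiplication.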
