RASA: Efficient Register-Aware Systolic Array Matrix Engine for CPU
Geonhwa Jeong, Eric Qin, Ananda Samajdar, Christopher J. Hughes, Sreenivas Subramoney, Hyesoon Kim, Tushar Krishna

TL;DR
RASA introduces a register-aware systolic array architecture for CPUs that enhances performance by overlapping execution stages, reducing stalls and under-utilization with minimal area and power overhead.
Contribution
The paper proposes RASA, a novel register-aware systolic array design that improves CPU matrix engine efficiency through execution stage subdivision and instruction overlapping.
Findings
RASA-based designs improve performance significantly.
Area and power overhead is negligible.
Overlapping instruction sub-stages reduces the stalls and under-utilization caused by limited register storage.
Abstract
As AI-based applications become pervasive, CPU vendors are starting to incorporate matrix engines within the datapath to boost efficiency. Systolic arrays have been the premier architectural choice as matrix engines in offload accelerators. However, we demonstrate that incorporating them inside CPUs can introduce under-utilization and stalls due to limited register storage to amortize the fill and drain times of the array. To address this, we propose RASA, Register-Aware Systolic Array. We develop techniques to divide an execution stage into several sub-stages and overlap instructions to hide overheads and run them concurrently. RASA-based designs improve performance significantly with negligible area and power overhead.
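To make the fill/drain argument concrete, here is a minimal sketch of a toy cycle-count model. It is not the paper's model or its evaluation methodology: the three-phase split (fill, compute, drain), the function names, and every cycle count below are assumptions chosen for illustration. It compares running matrix instructions back to back against letting each instruction's fill phase overlap the previous instruction's drain phase.

```python
# Toy model (hypothetical, not from the paper): each matrix instruction
# spends `fill` cycles loading the array, `compute` cycles doing useful
# work, and `drain` cycles emptying it.

def serial_cycles(n, fill, compute, drain):
    """Total cycles when n instructions run strictly one after another."""
    return n * (fill + compute + drain)


def overlapped_cycles(n, fill, compute, drain):
    """Total cycles when instruction i+1 fills while instruction i drains.

    Each boundary between two instructions costs max(fill, drain)
    instead of fill + drain; the first fill and last drain stay exposed.
    """
    if n == 0:
        return 0
    return fill + n * compute + (n - 1) * max(fill, drain) + drain


def utilization(n, compute, total):
    """Fraction of cycles spent in the compute phase."""
    return n * compute / total if total else 0.0


if __name__ == "__main__":
    n, fill, compute, drain = 4, 16, 16, 16  # illustrative values only
    for name, fn in (("serial", serial_cycles), ("overlapped", overlapped_cycles)):
        total = fn(n, fill, compute, drain)
        print(f"{name}: {total} cycles, utilization {utilization(n, compute, total):.2f}")
```

In this toy setting, the short compute phase (bounded by how much data the registers hold) is what keeps utilization low in the serial case; overlapping shrinks the exposed fill and drain time without changing the amount of useful work.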
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Distributed and Parallel Computing Systems
