TL;DR
This paper introduces specialized multisystolic array hardware architectures optimized for Strassen's matrix multiplication algorithm, achieving significant resource savings and improved performance in FPGA implementations for machine learning accelerators.
Contribution
The paper presents novel multisystolic array architectures tailored for Strassen's algorithm, demonstrating hardware resource efficiency and performance improvements over prior designs.
Findings
FPGA implementations reduce DSP requirements by a factor of 1.14^r for r recursion levels.
Architectures support matrix sizes down to 32x32 and 24x24 with 1-2 recursion levels.
The proposed designs achieve state-of-the-art performance in machine learning accelerators.
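The 1.14^r factor reported above is consistent with Strassen's algorithm replacing 8 block multiplications with 7 at each recursion level, which shrinks the multiplication count by 8/7 ≈ 1.14 per level. The sketch below only works through that arithmetic. The function name is illustrative, and it assumes DSP usage scales with the number of multiplications, which this summary does not state explicitly.

```python
from fractions import Fraction


def dsp_reduction_factor(r: int) -> Fraction:
    """Ideal multiplication-count reduction after r Strassen levels: (8/7)**r.

    Assumption: one recursion level replaces 8 block products with 7,
    and DSP usage is proportional to the number of multiplications.
    """
    if r < 0:
        raise ValueError("r must be non-negative")
    return Fraction(8, 7) ** r


# Print the exact and approximate factor for 0, 1, and 2 recursion levels.
for r in range(3):
    factor = dsp_reduction_factor(r)
    print(r, factor, float(factor))
```

Using exact fractions keeps the ratio free of rounding error; converting to float is only for display.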
Abstract
While Strassen's matrix multiplication algorithm reduces the complexity of naive matrix multiplication, general-purpose hardware is not suitable for achieving the algorithm's promised theoretical speedups. This raises the question of whether it could be better exploited in custom hardware architectures designed specifically for executing the algorithm. However, there is limited prior work on this, and it is not immediately clear how to derive such architectures or whether they can ultimately lead to real improvements. We bridge this gap, presenting and evaluating new systolic array architectures that efficiently translate the theoretical complexity reductions of Strassen's algorithm directly into hardware resource savings. Furthermore, the architectures are multisystolic array designs that can multiply smaller matrices with higher utilization than single-systolic array designs. The proposed…
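For readers unfamiliar with the algorithm the abstract refers to, the following is a minimal software sketch of Strassen's recursion, which forms seven sub-block products per level instead of eight. It is not the paper's hardware architecture. The function names and the `levels` parameter are illustrative, and the sketch assumes square matrices whose size is divisible by 2 at every recursion level.

```python
def add(A, B):
    """Element-wise sum of two equally sized matrices (lists of rows)."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def sub(A, B):
    """Element-wise difference of two equally sized matrices."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def naive(A, B):
    """Conventional triple-loop matrix product."""
    inner, cols = len(B), len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(inner)) for j in range(cols)]
            for row in A]


def split(M):
    """Split a square matrix into four quadrants: M11, M12, M21, M22."""
    h = len(M) // 2
    return ([row[:h] for row in M[:h]], [row[h:] for row in M[:h]],
            [row[:h] for row in M[h:]], [row[h:] for row in M[h:]])


def strassen(A, B, levels):
    """Apply `levels` Strassen recursion levels, then fall back to naive."""
    if levels == 0:
        return naive(A, B)
    a11, a12, a21, a22 = split(A)
    b11, b12, b21, b22 = split(B)
    r = levels - 1
    # Seven sub-block products instead of the eight a naive split needs.
    m1 = strassen(add(a11, a22), add(b11, b22), r)
    m2 = strassen(add(a21, a22), b11, r)
    m3 = strassen(a11, sub(b12, b22), r)
    m4 = strassen(a22, sub(b21, b11), r)
    m5 = strassen(add(a11, a12), b22, r)
    m6 = strassen(sub(a21, a11), add(b11, b12), r)
    m7 = strassen(sub(a12, a22), add(b21, b22), r)
    # Recombine the products into the four output quadrants.
    c11 = add(sub(add(m1, m4), m5), m7)
    c12 = add(m3, m5)
    c21 = add(m2, m4)
    c22 = add(add(sub(m1, m2), m3), m6)
    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom
```

A call such as `strassen(A, B, 1)` on two 4x4 integer matrices should match `naive(A, B)` exactly. The additions and subtractions are the overhead traded for the saved multiplication.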
