TL;DR
This paper introduces a high-level synthesis approach for FPGA-based matrix multiplication that minimizes data movement and maximizes performance, supporting arbitrary data types and ensuring portability across FPGA devices.
Contribution
The paper presents a new model and architecture for FPGA matrix multiplication that uses high-level synthesis to optimize both I/O and performance, and it is accompanied by an open-source implementation.
Findings
Achieves competitive performance that scales with the available compute and memory resources.
Supports arbitrary data types through high-level synthesis.
Provides an open-source, portable FPGA matrix multiplication solution.
Abstract
Data movement is the dominating factor affecting performance and energy in modern computing systems. Consequently, many algorithms have been developed to minimize the number of I/O operations for common computing patterns. Matrix multiplication is no exception, and lower bounds have been proven and implemented both for shared and distributed memory systems. Reconfigurable hardware platforms are a lucrative target for I/O minimizing algorithms, as they offer full control of memory accesses to the programmer. While bounds developed in the context of fixed architectures still apply to these platforms, the spatially distributed nature of their computational and memory resources requires a decentralized approach to optimize algorithms for maximum hardware utilization. We present a model to optimize matrix multiplication for FPGA platforms, simultaneously targeting maximum performance and…
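To make the I/O-minimization idea in the abstract concrete, the following is a minimal sketch of a tiled matrix multiplication that counts transfers to and from slow memory. It is a generic illustration, not the paper's model or architecture: the function name `tiled_matmul`, the `tile` parameter, and the load/store accounting are assumptions introduced here. Each square output tile is assumed to stay in fast memory while one column slice of A and one row slice of B are streamed in per step.

```python
def tiled_matmul(a, b, tile):
    """Multiply square matrices a and b using tile x tile output tiles.

    Returns (c, loads, stores), where loads and stores count elements moved
    between a hypothetical slow memory and a fast on-chip buffer.
    """
    n = len(a)
    assert n % tile == 0, "this sketch assumes the tile size divides n"
    c = [[0] * n for _ in range(n)]
    loads = 0
    stores = 0
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            # Output tile held in fast memory for the whole k loop.
            acc = [[0] * tile for _ in range(tile)]
            for k in range(n):
                # Stream one column slice of A and one row slice of B.
                a_col = [a[i0 + i][k] for i in range(tile)]
                b_row = [b[k][j0 + j] for j in range(tile)]
                loads += 2 * tile
                # Outer product: tile * tile multiply-adds per 2 * tile loads.
                for i in range(tile):
                    for j in range(tile):
                        acc[i][j] += a_col[i] * b_row[j]
            # Write the finished tile back exactly once.
            for i in range(tile):
                for j in range(tile):
                    c[i0 + i][j0 + j] = acc[i][j]
            stores += tile * tile
    return c, loads, stores
```

Under this accounting, an n x n product loads 2n³/tile elements and stores n² elements, so for n = 8 moving from `tile=1` to `tile=4` cuts loads from 1024 to 256 while the result is unchanged. The amount of fast memory bounds how large the tile can be, which is why I/O and on-chip memory capacity have to be considered together.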
