Fast, Scalable, Energy-Efficient Non-element-wise Matrix Multiplication on FPGA

Xuqi Zhu; Huaizhi Zhang; JunKyu Lee; Jiacheng Zhu; Chandrajit Pal; Sangeet Saha; Klaus D. McDonald-Maier; Xiaojun Zhai

arXiv:2407.02362·cs.AR·July 9, 2024·1 cites

TL;DR

This paper introduces a high-throughput, energy-efficient FPGA-based approximate matrix multiplication unit that significantly accelerates neural network computations by reducing redundancy and optimizing memory access.

Contribution

The paper presents the design of a novel Approximate Multiplication Unit (AMU) that improves scalability and energy efficiency for matrix multiplication on FPGAs, surpassing existing solutions.

Findings

1. Up to 9x higher throughput than the state of the art.

2. 112x higher energy efficiency than existing FPGA solutions.

3. Effective reduction of redundancies in LUT-based matrix multiplication.

Abstract

Modern Neural Network (NN) architectures rely heavily on vast numbers of multiply-accumulate arithmetic operations, which constitute their predominant computational cost. This paper therefore proposes a high-throughput, scalable and energy-efficient non-element-wise matrix multiplication unit on FPGAs as a basic component of NNs. We first streamline the inter-layer and intra-layer redundancies of the MADDNESS algorithm, a LUT-based approximate matrix multiplication, to design a fast, efficient and scalable approximate matrix multiplication module termed "Approximate Multiplication Unit (AMU)". The AMU further optimizes LUT-based matrix multiplications through dedicated memory management and access design, decoupling computational overhead from input resolution and significantly boosting the efficiency of FPGA-based NN accelerators. The experimental results show that using our AMU achieves up to 9x higher…
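
To make the idea of a "LUT-based approximate matrix multiplication" concrete, here is a minimal Python sketch of the general lookup-table approach. It is not the AMU design and not the MADDNESS algorithm itself: MADDNESS encodes inputs with learned hash functions, whereas this sketch swaps in a plain nearest-prototype search to keep the code short. The function names (`split`, `encode`, `build_luts`, `approx_matmul`) and the hand-picked prototypes are assumptions made for illustration only.

```python
def split(vec, n_sub):
    """Split a vector into n_sub equal-length subvectors."""
    step = len(vec) // n_sub
    return [vec[i * step:(i + 1) * step] for i in range(n_sub)]


def encode(vec, prototypes):
    """Map each subvector to the index of its nearest prototype.

    Assumption: nearest-prototype search stands in for the hash-based
    encoder used by MADDNESS.
    """
    codes = []
    for sub, protos in zip(split(vec, len(prototypes)), prototypes):
        dists = [sum((s - p) ** 2 for s, p in zip(sub, proto))
                 for proto in protos]
        codes.append(dists.index(min(dists)))
    return codes


def build_luts(prototypes, B):
    """Precompute luts[c][k][j]: prototype k of subspace c dotted with
    the matching rows of column j of B. Done once, offline."""
    step = len(B) // len(prototypes)
    n_cols = len(B[0])
    luts = []
    for c, protos in enumerate(prototypes):
        rows = B[c * step:(c + 1) * step]
        luts.append([[sum(p[d] * rows[d][j] for d in range(step))
                      for j in range(n_cols)]
                     for p in protos])
    return luts


def approx_matmul(A, prototypes, luts):
    """Approximate A @ B using only encoding, lookups and additions."""
    n_cols = len(luts[0][0])
    out = []
    for row in A:
        codes = encode(row, prototypes)
        out.append([sum(luts[c][k][j] for c, k in enumerate(codes))
                    for j in range(n_cols)])
    return out


# Toy data (hypothetical): 2 subspaces, 2 prototypes each, dimension 2.
prototypes = [[[1, 0], [0, 1]], [[1, 0], [0, 1]]]
B = [[1, 2], [3, 4], [5, 6], [7, 8]]
A = [[1, 0, 0, 1], [0, 1, 1, 0]]
luts = build_luts(prototypes, B)
print(approx_matmul(A, prototypes, luts))
```

In this toy case every subvector of `A` coincides with a prototype, so the lookup result equals the exact product `[[8, 10], [8, 10]]`; for general inputs the result is only an approximation. Note that the per-row work at inference time is one encoding step plus table lookups and additions, with no multiplications against `B`.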

Taxonomy

Topics: Low-power high-performance VLSI design · VLSI and FPGA Design Techniques · Interconnection Networks and Systems