LUTMUL: Exceed Conventional FPGA Roofline Limit by LUT-based Efficient Multiplication for Neural Network Inference

Yanyue Xie; Zhengang Li; Dana Diaconu; Suranga Handagala; Miriam Leeser; Xue Lin

arXiv:2411.11852·cs.AR·November 20, 2024

TL;DR

LUTMUL performs multiplications in FPGA neural network accelerators with look-up tables (LUTs) instead of traditional DSP blocks, raising inference throughput beyond the peak performance of conventional DSP-based designs.

Contribution

This paper introduces LUTMUL, a LUT-based multiplication method that outperforms DSP-based approaches in FPGA neural network inference, setting new speed benchmarks.

Findings

01  Achieves an inference throughput of 1627 images/sec.

02  Maintains 70.95% top-1 accuracy on ImageNet.

03  Reports the best inference speed among all FPGA-based accelerators.

Abstract

For FPGA-based neural network accelerators, digital signal processing (DSP) blocks have traditionally been the cornerstone for handling multiplications. This paper introduces LUTMUL, which harnesses the potential of look-up tables (LUTs) for performing multiplications. The availability of LUTs typically outnumbers that of DSPs by a factor of 100, offering a significant computational advantage. By exploiting this advantage of LUTs, our method demonstrates a potential boost in the performance of FPGA-based neural network accelerators with a reconfigurable dataflow architecture. Our approach challenges the conventional peak performance on DSP-based accelerators and sets a new benchmark for efficient neural network inference on FPGAs. Experimental results demonstrate that our design achieves the best inference speed among all FPGA-based accelerators, achieving a throughput of 1627 images…
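The core idea of replacing a multiplier with a table lookup can be illustrated in software. The sketch below is not the authors' implementation; it assumes the weights are fixed after training and the activations are quantized to a small unsigned width. The 4-bit width, the function names `build_weight_lut` and `lut_dot`, and the sample values are all hypothetical choices made for illustration. For each constant weight, every possible product is precomputed once, so the "multiplication" at inference time reduces to indexing a table, which is the kind of function an FPGA LUT can store.

```python
from typing import List


def build_weight_lut(weight: int, act_bits: int = 4) -> List[int]:
    """Precompute weight * a for every unsigned activation a of act_bits bits."""
    return [weight * a for a in range(1 << act_bits)]


def lut_dot(weights: List[int], activations: List[int], act_bits: int = 4) -> int:
    """Dot product in which each multiply is replaced by a table lookup."""
    # One table per constant weight; built once, before any inference runs.
    tables = [build_weight_lut(w, act_bits) for w in weights]
    # At inference time the activation value is used only as a table index.
    return sum(table[a] for table, a in zip(tables, activations))


# Hypothetical values: signed weights, unsigned 4-bit activations (0..15).
weights = [3, -2, 7, 1]
activations = [5, 15, 0, 9]

result = lut_dot(weights, activations)
reference = sum(w * a for w, a in zip(weights, activations))
assert result == reference
```

Each table here has 2^4 = 16 entries, so the storage cost grows exponentially with the activation bit width; that is why a table-based multiplier of this kind is only attractive at low precision.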

