High Performance Scalable FPGA Accelerator for Deep Neural Networks
Sudarshan Srinivasan, Pradeep Janedula, Saurabh Dhoble, Sasikanth Avancha, Dipankar Das, Naveen Mellempudi, Bharat Daga, Martin Langhammer, Gregg Baeckler, Bharat Kaul

TL;DR
This paper presents a high-performance FPGA accelerator for CNN inference using low-precision INT-8-2 compute, achieving performance metrics that surpass CPUs and GPUs and approach ASIC levels, while maintaining FPGA versatility.
Contribution
The work introduces a novel ALM-based FPGA design for INT-8-2 compute (8-bit activations, 2-bit weights), enabling high AI-TOPS performance for CNN inference. No known ASIC, CPU, or GPU natively supports this precision format.
Findings
Achieves 5 AI-TOPS on Arria10 FPGA.
Projects 76 AI-TOPS at 0.7 TOPS/W on Stratix10.
Surpasses known CPU and GPU performance, approaching ASIC levels.
Abstract
Low precision is the first-order knob for achieving higher Artificial Intelligence Operations (AI-TOPS). However, the algorithmic space for sub-8-bit precision compute is diverse, with disruptive changes happening frequently, making FPGAs a natural choice for Deep Neural Network inference. In this work we present an FPGA-based accelerator for CNN inference. We use INT-8-2 compute (with 8-bit activations and 2-bit weights), which has recently shown promise in the literature and which no known ASIC, CPU, or GPU natively supports today. Using a novel Adaptive Logic Module (ALM) based design, as a departure from traditional DSP-based designs, we achieve a measured performance of 5 AI-TOPS on Arria10 and project a performance of 76 AI-TOPS at 0.7 TOPS/W for Stratix10. This exceeds known CPU and GPU performance and comes close to best known…
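To make the INT-8-2 format concrete, here is a minimal Python sketch of a dot product between 8-bit activations and 2-bit weights. It is an illustration only, not the paper's hardware design. It assumes unsigned 8-bit activations (0 to 255) and a ternary 2-bit weight encoding with values in {-1, 0, +1}; the abstract does not state the encoding, and the function name `int8_2_dot` is hypothetical. Under that assumption each product reduces to an add, a subtract, or a skip, which is one plausible reason such compute maps onto general logic rather than DSP multipliers.

```python
def int8_2_dot(activations, weights):
    """Dot product of 8-bit activations with 2-bit weights.

    Assumptions (not taken from the paper): activations are unsigned
    8-bit integers, weights are ternary values in {-1, 0, +1}.
    """
    if len(activations) != len(weights):
        raise ValueError("activations and weights must have equal length")

    acc = 0  # Python ints are unbounded; hardware would use a wide accumulator
    for a, w in zip(activations, weights):
        if not 0 <= a <= 255:
            raise ValueError(f"activation {a} does not fit in 8 unsigned bits")
        if w not in (-1, 0, 1):
            raise ValueError(f"weight {w} is not in the assumed ternary set")
        # No multiplication needed: a ternary weight selects add, subtract, or skip.
        if w == 1:
            acc += a
        elif w == -1:
            acc -= a
    return acc


if __name__ == "__main__":
    # 10 - 200 + 0 + 255
    print(int8_2_dot([10, 200, 3, 255], [1, -1, 0, 1]))  # prints 65
```

Each add or subtract here counts as one low-precision operation, so a throughput figure in AI-TOPS corresponds to how many such operations the device completes per second.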
