# Full-stack Optimization for Accelerating CNNs with FPGA Validation

**Authors:** Bradley McDanel, Sai Qian Zhang, H. T. Kung, Xin Dong

arXiv: 1905.00462 · 2019-05-03

## TL;DR

This paper presents a full-stack optimization framework for CNN inference on FPGAs that jointly optimizes the model, the computing architecture, and the hardware implementation. A 170MHz FPGA implementation reaches 2.28ms latency on ImageNet with 9x higher energy efficiency than other implementations of comparable latency.

## Contribution

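The paper jointly optimizes CNN models, computing architectures, and hardware implementations, and introduces a Selector-Accumulator (SAC) architecture as a replacement for the multiplier-accumulator (MAC). Compared with an FPGA implementation of a traditional 8-bit MAC, SAC uses 4.85x fewer Look-up Tables and 2.48x less power.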

## Key findings

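- A 170MHz FPGA implementation achieves 2.28ms inference latency on the ImageNet benchmark.
- Compared with an FPGA implementation of a traditional 8-bit MAC, SAC needs 4.85x fewer Look-up Tables and consumes 2.48x less power.
- The latency is among the lowest reported at comparable accuracy, and energy efficiency is 9x higher than that of other implementations with comparable latency.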
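To put the headline numbers in perspective, the reported clock rate and latency can be turned into a per-image cycle budget. The short sketch below uses only the 170MHz and 2.28ms figures from this summary; the function names are illustrative, and the throughput figure assumes images are processed strictly one at a time with no overlap, which the summary does not confirm.

```python
def cycles_per_inference(clock_hz: float, latency_s: float) -> int:
    """Clock cycles elapsed during one inference at the given clock rate."""
    return round(clock_hz * latency_s)


def sequential_throughput(latency_s: float) -> float:
    """Images per second, ASSUMING one image at a time with no pipelining."""
    return 1.0 / latency_s


CLOCK_HZ = 170e6      # 170MHz, from the abstract
LATENCY_S = 2.28e-3   # 2.28ms, from the abstract

if __name__ == "__main__":
    print(cycles_per_inference(CLOCK_HZ, LATENCY_S))
    print(f"{sequential_throughput(LATENCY_S):.1f} images/s")
```

Under these assumptions the budget is 387,600 cycles per image, or roughly 439 images per second.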

## Abstract

We present a full-stack optimization framework for accelerating inference of CNNs (Convolutional Neural Networks) and validate the approach with field-programmable gate array (FPGA) implementations. By jointly optimizing CNN models, computing architectures, and hardware implementations, our full-stack approach achieves unprecedented performance in the trade-off space characterized by inference latency, energy efficiency, hardware utilization and inference accuracy. As a validation vehicle, we have implemented a 170MHz FPGA inference chip achieving 2.28ms latency for the ImageNet benchmark. The achieved latency is among the lowest reported in the literature while achieving comparable accuracy. However, our chip shines in that it has 9x higher energy efficiency compared to other implementations achieving comparable latency. A highlight of our full-stack approach which contributes to the achieved high energy efficiency is an efficient Selector-Accumulator (SAC) architecture for implementing the multiplier-accumulator (MAC) operation present in any digital CNN hardware. For instance, compared to an FPGA implementation of a traditional 8-bit MAC, SAC substantially reduces required hardware resources (4.85x fewer Look-up Tables) and power consumption (2.48x).
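The abstract does not spell out how SAC avoids the multiplier. One common way to do so, assumed here purely for illustration, is to restrict each weight to zero or a signed power of two, so that every product becomes a selection of a shifted input followed by an add or subtract. The Python sketch below is a software analogy under that assumption, not the paper's hardware design; `mac_dot`, `sac_dot`, and the `(sign, exponent)` weight encoding are hypothetical names introduced for this example.

```python
from typing import Sequence


def mac_dot(weights: Sequence[int], inputs: Sequence[int]) -> int:
    """Reference dot product using a general multiply-accumulate."""
    acc = 0
    for w, x in zip(weights, inputs):
        acc += w * x
    return acc


def sac_dot(signs: Sequence[int], exponents: Sequence[int],
            inputs: Sequence[int]) -> int:
    """Dot product with weights encoded as sign * 2**exponent.

    ASSUMPTION: weights are zero (sign == 0) or signed powers of two.
    No multiplication is performed: each term is a shifted input that
    is added, subtracted, or skipped according to the sign.
    """
    acc = 0
    for s, e, x in zip(signs, exponents, inputs):
        if s == 0:
            continue          # zero weight contributes nothing
        term = x << e         # selecting a shifted copy replaces x * 2**e
        acc += term if s > 0 else -term
    return acc


if __name__ == "__main__":
    inputs = [3, 5, 7]
    # Weights 4, -1, 0 expressed as (sign, exponent) pairs.
    print(mac_dot([4, -1, 0], inputs))
    print(sac_dot([1, -1, 0], [2, 0, 0], inputs))
```

Both calls compute the same value, which is the point of the comparison: when weights take this restricted form, the general multiplier in `mac_dot` can be replaced by the shift-and-select step in `sac_dot`.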

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1905.00462/full.md

## Figures

17 figures with captions in the complete paper: https://tomesphere.com/paper/1905.00462/full.md

## References

51 references — full list in the complete paper: https://tomesphere.com/paper/1905.00462/full.md

---
Source: https://tomesphere.com/paper/1905.00462