VWA: Hardware Efficient Vectorwise Accelerator for Convolutional Neural Network
Kuo-Wei Chang, Tian-Sheuan Chang

TL;DR
This paper introduces a hardware-efficient vectorwise CNN accelerator built around a systolic array optimized for 3x3 filters and a 1-D broadcast dataflow, achieving high utilization, low area, and power efficiency across several CNN models.
Contribution
It presents a reconfigurable CNN accelerator whose simple, regular dataflow improves hardware utilization and reduces area and power costs.
Findings
Achieves 93.7% to 99% hardware utilization across VGG-16, ResNet-34, GoogLeNet, and MobileNet.
Supports 168 GOPS throughput with low power consumption.
Uses a 40nm implementation with 266.9K NAND gates and 191KB SRAM.
Abstract
Hardware accelerators for convolutional neural networks (CNNs) enable real-time applications of artificial intelligence technology. However, most existing designs suffer from low hardware utilization or high area cost due to complex dataflow. This paper proposes a hardware-efficient vectorwise CNN accelerator that adopts a systolic array optimized for 3x3 filters, using 1-D broadcast dataflow to generate partial sums. This enables easy reconfiguration for different kinds of kernels with interleaved-input or elementwise-input dataflow. This simple and regular dataflow results in low area cost while attaining high hardware utilization. The presented design achieves 99%, 97%, 93.7%, and 94% hardware utilization for VGG-16, ResNet-34, GoogLeNet, and MobileNet, respectively. Hardware implementation with TSMC 40nm technology takes 266.9K NAND gate count and 191KB SRAM to support 168GOPS…
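To make the idea of building a 3x3 convolution from 1-D partial sums more concrete, here is a minimal software sketch. It is not the paper's architecture: the function names, the loop order, and the choice to broadcast one weight across a vector of pixels from a single input row are all assumptions made for illustration. It only shows that accumulating such row-wise partial sums reproduces a direct 3x3 convolution.

```python
def conv3x3_direct(image, kernel):
    """Reference: valid 3x3 convolution (cross-correlation, as used in CNNs)."""
    rows, cols = len(image), len(image[0])
    out = [[0] * (cols - 2) for _ in range(rows - 2)]
    for r in range(rows - 2):
        for c in range(cols - 2):
            out[r][c] = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(3)
                for j in range(3)
            )
    return out


def conv3x3_vectorwise(image, kernel):
    """Same result, accumulated from 1-D partial sums.

    Assumption (not taken from the paper): each kernel weight is broadcast
    to a whole vector of pixels from one input row, and the vector of
    products is added into one output row as a partial sum.
    """
    rows, cols = len(image), len(image[0])
    out_cols = cols - 2
    out = [[0] * out_cols for _ in range(rows - 2)]
    for r in range(rows - 2):
        for i in range(3):                 # kernel row i pairs with input row r + i
            in_row = image[r + i]
            for j in range(3):             # one weight is broadcast per step
                w = kernel[i][j]
                for c in range(out_cols):  # vector lanes share the same weight
                    out[r][c] += w * in_row[c + j]
    return out


# Hypothetical 4x5 input and a Sobel-like kernel, chosen only for the demo.
image = [[r * 5 + c for c in range(5)] for r in range(4)]
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
assert conv3x3_vectorwise(image, kernel) == conv3x3_direct(image, kernel)
```

The innermost loop over `c` is the part that a vectorwise design would plausibly execute in parallel, since every lane uses the same weight `w` and a contiguous slice of one input row.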
