Auto-ViT-Acc: An FPGA-Aware Automatic Acceleration Framework for Vision Transformer with Mixed-Scheme Quantization

Zhengang Li; Mengshu Sun; Alec Lu; Haoyu Ma; Geng Yuan; Yanyue Xie; Hao Tang; Yanyu Li; Miriam Leeser; Zhangyang Wang; Xue Lin; Zhenman Fang

arXiv:2208.05163 · cs.CV · August 11, 2022 · 6 cites

TL;DR

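This paper introduces an FPGA-aware framework for accelerating Vision Transformers (ViTs) using mixed-scheme quantization. Its quantization achieves 0.47% to 1.36% higher Top-1 accuracy than state-of-the-art ViT quantization methods at the same bit-width, and its accelerator reaches about 5.6x the frame rate of a 32-bit floating-point baseline FPGA accelerator.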

Contribution

It presents what the authors describe as the first FPGA-based ViT acceleration framework to explore model quantization, improving Top-1 accuracy over prior algorithm-only ViT quantization work and frame rate over a floating-point FPGA baseline.

Findings

01

Achieves about 5.6x the frame rate of the 32-bit floating-point baseline FPGA accelerator (56.8 FPS vs. 10.0 FPS).

02

Improves Top-1 accuracy by 0.47% to 1.36% over state-of-the-art ViT quantization methods at the same bit-width.

03

Keeps accuracy competitive, with a 0.71% accuracy drop on ImageNet relative to the 32-bit floating-point baseline.
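The headline speedup follows from simple arithmetic on the reported frame rates. The per-frame times below are derived here as reciprocals of the reported FPS (average time per frame, not necessarily end-to-end latency); they are not quoted from the paper.

```latex
\text{speedup} = \frac{56.8\ \text{FPS}}{10.0\ \text{FPS}} = 5.68 \quad (\text{reported as} \approx 5.6\times)

t_{\text{frame}}^{\text{Auto-ViT-Acc}} = \frac{1}{56.8\ \text{FPS}} \approx 17.6\ \text{ms},
\qquad
t_{\text{frame}}^{\text{FP32}} = \frac{1}{10.0\ \text{FPS}} = 100\ \text{ms}
```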

Abstract

Vision transformers (ViTs) are emerging with significantly improved accuracy in computer vision tasks. However, their complex architecture and enormous computation/storage demand impose urgent needs for new hardware accelerator design methodology. This work proposes an FPGA-aware automatic ViT acceleration framework based on the proposed mixed-scheme quantization. To the best of our knowledge, this is the first FPGA-based ViT acceleration framework exploring model quantization. Compared with state-of-the-art ViT quantization work (algorithmic approach only without hardware acceleration), our quantization achieves 0.47% to 1.36% higher Top-1 accuracy under the same bit-width. Compared with the 32-bit floating-point baseline FPGA accelerator, our accelerator achieves around 5.6x improvement on the frame rate (i.e., 56.8 FPS vs. 10.0 FPS) with 0.71% accuracy drop on ImageNet dataset for…
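The abstract does not spell out how mixed-scheme quantization works. As a rough illustration only, the sketch below assumes it means assigning each row of a weight matrix to one of two schemes: uniform fixed-point quantization, or power-of-two (PoT) quantization, whose multiplications reduce to bit shifts and can therefore be mapped to FPGA logic rather than DSP blocks. The function names, the `pot_ratio` parameter, and the row-selection heuristic (send to PoT the rows it hurts least) are hypothetical choices for this example, not the authors' algorithm.

```python
import numpy as np

# Illustrative sketch only: "mixed-scheme" is ASSUMED here to mean a row-wise mix of
# fixed-point and power-of-two (PoT) quantization. Names and heuristics are hypothetical.


def quantize_fixed_point(row: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric uniform (fixed-point) quantization of one weight row."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = float(np.max(np.abs(row)))
    if max_abs == 0.0:
        return np.zeros_like(row, dtype=float)
    scale = max_abs / qmax
    # Integer codes in [-qmax, qmax], mapped back to real values.
    return np.clip(np.round(row / scale), -qmax, qmax) * scale


def quantize_power_of_two(row: np.ndarray, bits: int) -> np.ndarray:
    """PoT quantization: each magnitude snaps to 0 or max_abs * 2**-k."""
    max_abs = float(np.max(np.abs(row)))
    if max_abs == 0.0:
        return np.zeros_like(row, dtype=float)
    # 1 sign bit; the 2**(bits-1) magnitude codes are {0} plus exponents k = 0..K-1.
    num_exponents = 2 ** (bits - 1) - 1
    levels = np.concatenate(([0.0], 2.0 ** -np.arange(num_exponents)))
    normalized = np.abs(row) / max_abs  # values in [0, 1]
    nearest = np.argmin(np.abs(normalized[:, None] - levels[None, :]), axis=1)
    return np.sign(row) * max_abs * levels[nearest]


def mixed_scheme_quantize(weight: np.ndarray, bits: int, pot_ratio: float):
    """Quantize each row with fixed-point or PoT; returns (quantized, is_pot_mask).

    Hypothetical heuristic: the rows where PoT adds the least extra squared error
    (relative to fixed-point) are assigned to PoT until `pot_ratio` of rows use it.
    """
    fixed = np.stack([quantize_fixed_point(r, bits) for r in weight])
    pot = np.stack([quantize_power_of_two(r, bits) for r in weight])
    err_fixed = np.sum((weight - fixed) ** 2, axis=1)
    err_pot = np.sum((weight - pot) ** 2, axis=1)
    num_pot = int(round(pot_ratio * weight.shape[0]))
    is_pot = np.zeros(weight.shape[0], dtype=bool)
    is_pot[np.argsort(err_pot - err_fixed)[:num_pot]] = True
    return np.where(is_pot[:, None], pot, fixed), is_pot


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(8, 16))  # toy stand-in for a ViT linear-layer weight
    wq, is_pot = mixed_scheme_quantize(w, bits=4, pot_ratio=0.5)
    print(int(is_pot.sum()))  # 4 rows use PoT, 4 use fixed-point
```

With `pot_ratio=0.5`, half of the eight toy rows are assigned to PoT. In a hardware-aware setting the ratio would presumably be tied to the available DSP and LUT budget, which this sketch does not model.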

Taxonomy

Topics: CCD and CMOS Imaging Sensors · Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications