TerViT: An Efficient Ternary Vision Transformer

Sheng Xu, Yanjing Li, Teli Ma, Bohan Zeng, Baochang Zhang, Peng Gao, and Jinhu Lv

arXiv:2201.08050 · cs.CV · January 24, 2022 · 5 cites

Open Access

TL;DR

TerViT introduces a ternary weight quantization method for vision transformers, significantly reducing model size and computational cost while maintaining competitive accuracy on ImageNet.

Contribution

The paper proposes a progressive training scheme (8-bit transformers first, then ternary) and channel-wise ternarization for ViTs, targeting deployment on resource-constrained devices.

Findings

01. Achieves 79% Top-1 accuracy with a 13.1MB Swin-S model.

02. Outperforms conventional quantization methods in training stability and accuracy.

03. Demonstrates competitive performance on the ImageNet dataset.

Abstract

Vision transformers (ViTs) have demonstrated great potential in various visual tasks, but they incur high computational and memory costs when deployed on resource-constrained devices. In this paper, we introduce a ternary vision transformer (TerViT) that ternarizes the weights in ViTs, a task made difficult by the large loss-surface gap between real-valued and ternary parameters. To address this issue, we introduce a progressive training scheme that first trains 8-bit transformers and then TerViT, achieving better optimization than conventional methods. Furthermore, we introduce channel-wise ternarization, which partitions each weight matrix into channels, each with its own distribution and ternarization interval. We apply our methods to the popular DeiT and Swin backbones, and extensive results show that we can achieve competitive performance. For example, TerViT…
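To make the channel-wise idea in the abstract concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the exact ternarization rule, so the sketch assumes the threshold heuristic from Ternary Weight Networks (threshold = 0.7 × mean absolute weight) and treats each row of the matrix as one channel. The function name `ternarize_channelwise` and the parameter `threshold_factor` are hypothetical.

```python
import numpy as np


def ternarize_channelwise(weight, threshold_factor=0.7):
    """Ternarize each row (treated as one channel) of a 2-D weight matrix.

    Assumption: the per-channel threshold follows the Ternary Weight
    Networks heuristic (threshold_factor * mean|w|). TerViT's actual
    rule may differ.
    """
    abs_w = np.abs(weight)

    # One ternarization interval per channel, shape (channels, 1).
    delta = threshold_factor * abs_w.mean(axis=1, keepdims=True)

    # Weights inside [-delta, delta] become 0; the rest keep their sign.
    mask = abs_w > delta
    codes = (np.sign(weight) * mask).astype(np.int8)  # values in {-1, 0, +1}

    # Per-channel scale: mean magnitude of the weights that were kept.
    kept = np.maximum(mask.sum(axis=1, keepdims=True), 1)  # avoid divide-by-zero
    alpha = (abs_w * mask).sum(axis=1, keepdims=True) / kept

    return codes, alpha


# Two channels with very different magnitudes get different intervals.
w = np.array([[1.0, -0.1, 0.5, -2.0],
              [0.01, 0.02, -0.03, 0.0]])
codes, alpha = ternarize_channelwise(w)
w_approx = codes * alpha  # dequantized approximation of w
print(codes)
print(alpha.ravel())
```

With a single threshold for the whole matrix, the small-magnitude second row would collapse entirely to zero. A per-channel interval keeps its sign pattern, which is the motivation the abstract gives for ternarizing channel by channel.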

Taxonomy

Topics: CCD and CMOS Imaging Sensors · Advanced Memory and Neural Computing · Infrared Target Detection Methodologies

Methods: Attention Is All You Need · Linear Layer · Attention Dropout · Dropout · Layer Normalization · Softmax · Residual Connection · Feedforward Network · Data-efficient Image Transformer · Multi-Head Attention