TerViT: An Efficient Ternary Vision Transformer
Sheng Xu, Yanjing Li, Teli Ma, Bohan Zeng, Baochang Zhang, Peng Gao, and Jinhu Lv

TL;DR
TerViT introduces a ternary weight quantization method for vision transformers, significantly reducing model size and computational cost while maintaining competitive accuracy on ImageNet.
Contribution
The paper proposes a progressive training scheme (8-bit transformers are trained first, then the ternary model) and channel-wise ternarization for ViTs, enabling efficient deployment on resource-constrained devices.
Findings
Achieves 79% Top-1 accuracy with a 13.1MB Swin-S model.
Outperforms conventional quantization methods in training stability and accuracy.
Demonstrates competitive performance on the ImageNet dataset with DeiT and Swin backbones.
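As a rough sanity check on the reported model size, the arithmetic below estimates storage for a model of this scale. It is only a back-of-the-envelope sketch: the parameter count of about 50 million for Swin-S and the 2-bit encoding per ternary weight are assumptions not stated on this page, and `model_size_mb` is a hypothetical helper.

```python
def model_size_mb(num_params: int, bits_per_weight: int) -> float:
    """Storage in megabytes (10^6 bytes) if every weight uses the same bit width."""
    return num_params * bits_per_weight / 8 / 1e6


# Assumption: Swin-S has roughly 50 million parameters.
SWIN_S_PARAMS = 50_000_000

fp32_mb = model_size_mb(SWIN_S_PARAMS, 32)    # full-precision baseline
ternary_mb = model_size_mb(SWIN_S_PARAMS, 2)  # assumption: 2 bits per ternary weight

print(f"fp32: {fp32_mb:.1f} MB, ternary: {ternary_mb:.1f} MB")
```

Under these assumptions the ternary estimate comes out close to the reported 13.1MB. The remaining gap presumably reflects scaling factors and any parts of the network kept at higher precision, which this summary does not specify.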
Abstract
Vision transformers (ViTs) have demonstrated great potential in various visual tasks, but suffer from high computational and memory costs when deployed on resource-constrained devices. In this paper, we introduce a ternary vision transformer (TerViT) to ternarize the weights in ViTs, a task made difficult by the large loss surface gap between real-valued and ternary parameters. To address this issue, we introduce a progressive training scheme that first trains 8-bit transformers and then TerViT, achieving better optimization than conventional methods. Furthermore, we introduce channel-wise ternarization, which partitions each matrix into channels, each with its own distribution and ternarization interval. We apply our methods to popular DeiT and Swin backbones, and extensive results show that we can achieve competitive performance. For example, TerViT…
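The idea of channel-wise ternarization can be illustrated with a minimal sketch. This is not the authors' implementation: the threshold rule (0.7 times the mean absolute weight of the channel) is borrowed from the common ternary-weight-network heuristic and is an assumption here, as are the names `ternarize_channelwise`, `dequantize`, and `delta_factor`. Each row of the weight matrix is treated as one channel and receives its own threshold and scale.

```python
def ternarize_channelwise(weight, delta_factor=0.7):
    """Map each row (channel) of `weight` to codes in {-1, 0, +1} plus one scale per row.

    Assumption: the per-channel threshold is delta_factor * mean(|w|) over that row.
    """
    ternary, scales = [], []
    for row in weight:
        mean_abs = sum(abs(w) for w in row) / len(row)
        threshold = delta_factor * mean_abs
        # Weights inside [-threshold, threshold] become 0; the rest keep only their sign.
        codes = [0 if abs(w) <= threshold else (1 if w > 0 else -1) for w in row]
        # The scale is the mean magnitude of the weights that were not zeroed.
        kept = [abs(w) for w, c in zip(row, codes) if c != 0]
        scales.append(sum(kept) / len(kept) if kept else 0.0)
        ternary.append(codes)
    return ternary, scales


def dequantize(ternary, scales):
    """Reconstruct an approximate real-valued matrix from codes and per-row scales."""
    return [[s * c for c in row] for row, s in zip(ternary, scales)]


# Two channels with very different magnitudes.
weight = [
    [0.9, -0.8, 0.05, 0.1],
    [0.02, -0.01, 0.03, -0.002],
]
codes, scales = ternarize_channelwise(weight)
approx = dequantize(codes, scales)
print(codes)
print(scales)
```

In this toy matrix the second channel has much smaller weights than the first. With per-channel thresholds it still keeps nonzero codes, whereas a single threshold computed over the whole matrix under the same rule would zero it out entirely.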
Taxonomy
Topics: CCD and CMOS Imaging Sensors · Advanced Memory and Neural Computing · Infrared Target Detection Methodologies
Methods: Attention Is All You Need · Linear Layer · Attention Dropout · Dropout · Layer Normalization · Softmax · Residual Connection · Feedforward Network · Data-efficient Image Transformer · Multi-Head Attention
