BinaryViT: Pushing Binary Vision Transformers Towards Convolutional Models
Phuoc-Hoan Charles Le, Xinlin Li

TL;DR
BinaryViT introduces architectural modifications inspired by CNNs to enhance the performance of binary vision transformers, achieving competitive results on ImageNet-1k without convolutions.
Contribution
The paper proposes BinaryViT, a novel binary vision transformer architecture that incorporates CNN-inspired operations to improve binary ViT performance without using convolutions.
Findings
BinaryViT achieves competitive accuracy with state-of-the-art binary CNNs.
Architectural modifications significantly improve binary ViT representational capacity.
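Binarizing weights and activations reduces model size and computational cost, since dot products can be computed with popcount operations, while accuracy remains competitive.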
Abstract
With the increasing popularity and the increasing size of vision transformers (ViTs), there has been an increasing interest in making them more efficient and less computationally costly for deployment on edge devices with limited computing resources. Binarization can be used to help reduce the size of ViT models and their computational cost significantly, using popcount operations when the weights and the activations are in binary. However, ViTs suffer a larger performance drop when directly applying convolutional neural network (CNN) binarization methods or existing binarization methods to binarize ViTs compared to CNNs on datasets with a large number of classes such as ImageNet-1k. With extensive analysis, we find that binary vanilla ViTs such as DeiT miss out on a lot of key architectural properties that CNNs have that allow binary CNNs to have much higher representational capability…
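The abstract's remark about popcount operations can be made concrete with a small sketch. It is not taken from the paper's code. It assumes weights and activations are binarized to {-1, +1} with a plain sign function (zero mapped to +1), and the helper names `sign_binarize`, `pack_signs`, and `binary_dot` are hypothetical. With that encoding, a dot product reduces to an XNOR followed by a bit count.

```python
def sign_binarize(values):
    """Map real values to {-1, +1}; zero is sent to +1 (an assumption of this sketch)."""
    return [1 if v >= 0 else -1 for v in values]


def pack_signs(signs):
    """Pack a {-1, +1} vector into an int: bit i is set where signs[i] == +1."""
    bits = 0
    for i, s in enumerate(signs):
        if s > 0:
            bits |= 1 << i
    return bits


def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed {-1, +1} vectors of length n.

    XNOR marks the positions where the two vectors agree. Each agreement
    contributes +1 and each disagreement -1, so the dot product is
    2 * popcount(xnor) - n.
    """
    mask = (1 << n) - 1                 # keep only the n valid bits
    agree = ~(a_bits ^ b_bits) & mask   # XNOR restricted to n bits
    return 2 * bin(agree).count("1") - n


activations = sign_binarize([0.7, -1.2, 0.1, 2.0])   # [1, -1, 1, 1]
weights = sign_binarize([0.3, 0.9, -0.4, 1.5])       # [1, 1, -1, 1]
result = binary_dot(pack_signs(activations), pack_signs(weights), 4)
reference = sum(a * w for a, w in zip(activations, weights))
```

Here `result` and `reference` agree, which shows that the bitwise form replaces the multiply-accumulate loop. Optimized kernels would pack many values per machine word and use a hardware popcount instruction rather than Python integers.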
Taxonomy
Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Infrared Target Detection Methodologies
Methods: Attention Is All You Need · Softmax · Linear Layer · Multi-Head Attention · Dense Connections · Dropout · Feedforward Network · Average Pooling · Attention Dropout · Data-efficient Image Transformer
