BHViT: Binarized Hybrid Vision Transformer
Tian Gao, Zhiyuan Zhang, Yu Zhang, Huajun Liu, Kaijie Yin, Chengzhong Xu, and Hui Kong

TL;DR
BHViT is a binarized hybrid Vision Transformer architecture that combines local information interaction, shift-based modules, and attention binarization to achieve state-of-the-art results among binary ViT methods, targeting energy-efficient edge deployment.
Contribution
The paper proposes BHViT, a binarization-friendly hybrid ViT architecture with new modules and techniques that improve the performance of binary Vision Transformers.
Findings
Achieves state-of-the-art results among binary ViT methods.
Effectively reduces computational redundancy in token processing.
Enhances binary MLP performance with shift operation modules.
Abstract
Model binarization has made significant progress in enabling real-time and energy-efficient computation for convolutional neural networks (CNN), offering a potential solution to the deployment challenges faced by Vision Transformers (ViTs) on edge devices. However, due to the structural differences between CNN and Transformer architectures, simply applying binary CNN strategies to the ViT models will lead to a significant performance drop. To tackle this challenge, we propose BHViT, a binarization-friendly hybrid ViT architecture and its full binarization model with the guidance of three important observations. Initially, BHViT utilizes the local information interaction and hierarchical feature aggregation technique from coarse to fine levels to address redundant computations stemming from excessive tokens. Then, a novel module based on shift operations is proposed to enhance the…
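As background for the term "model binarization" used in the abstract, the following is a minimal sketch of a binarized linear layer. It is not BHViT's method: the per-row scaling factor (mean absolute weight, in the style of earlier binary CNN work) and the helper names `binarize_weights` and `binary_linear` are assumptions introduced here for illustration only.

```python
import numpy as np

def binarize_weights(w):
    """Map each weight to {-1, +1} and return a per-output-row scale.

    Assumption: the scale is the mean absolute value of each row,
    a common choice in binary CNNs, not something stated in the abstract.
    """
    alpha = np.mean(np.abs(w), axis=1, keepdims=True)  # shape (out, 1)
    w_bin = np.where(w >= 0, 1.0, -1.0)                # zeros map to +1
    return alpha, w_bin

def binary_linear(x, w):
    """Linear layer with binarized activations and weights (illustrative)."""
    alpha, w_bin = binarize_weights(w)
    x_bin = np.where(x >= 0, 1.0, -1.0)
    # With both operands in {-1, +1}, this product reduces to XNOR plus
    # popcount on suitable hardware; here it is a plain matrix product.
    return (x_bin @ w_bin.T) * alpha.T

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8))   # 4 output features, 8 input features
x = rng.standard_normal((2, 8))   # batch of 2 inputs
y = binary_linear(x, w)
print(y.shape)
```

Replacing full-precision multiply-accumulates with sign-only operands is where the energy savings come from. The accuracy cost of that replacement is what binarization-friendly architectures aim to reduce.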
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Memory and Neural Computing · Ferroelectric and Negative Capacitance Devices
Methods: Attention Is All You Need · Byte Pair Encoding · Dense Connections · Residual Connection · Absolute Position Encodings · Linear Layer · Layer Normalization · Label Smoothing · Multi-Head Attention · Position-Wise Feed-Forward Layer
