BiBERT: Accurate Fully Binarized BERT
Haotong Qin, Yifu Ding, Mingyuan Zhang, Qinghua Yan, Aishan Liu, Qingqing Dang, Ziwei Liu, Xianglong Liu

TL;DR
BiBERT is a fully binarized BERT model that significantly reduces computation and memory costs while maintaining high performance, achieved through novel attention and distillation techniques.
Contribution
This paper introduces BiBERT, the first fully binarized BERT, with new attention and distillation methods to overcome performance drops in binarization.
Findings
Outperforms existing quantized BERTs on NLP benchmarks
Reduces FLOPs by 56.3x and model size by 31.2x
Maintains competitive accuracy with ultra-low bit activations
Abstract
The large pre-trained BERT has achieved remarkable performance on Natural Language Processing (NLP) tasks but is also expensive in computation and memory. As one of the most powerful compression approaches, binarization drastically reduces computation and memory consumption by using 1-bit parameters and bitwise operations. Unfortunately, the full binarization of BERT (i.e., 1-bit weights, embeddings, and activations) usually suffers a significant performance drop, and few studies have addressed this problem. In this paper, with theoretical justification and empirical analysis, we identify that the severe performance drop can be mainly attributed to information degradation in the forward propagation and optimization direction mismatch in the backward propagation, and we propose BiBERT, an accurate fully binarized BERT, to eliminate these performance bottlenecks. Specifically, BiBERT…
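To make the "1-bit parameters and bitwise operations" idea in the abstract concrete, here is a minimal sketch of generic sign binarization and an XNOR-popcount dot product. It is not BiBERT's attention or distillation method, which the truncated abstract does not describe; the function names `binarize`, `pack_bits`, and `xnor_popcount_dot` are hypothetical and chosen only for illustration.

```python
import random

def binarize(values):
    """Map real values to {-1, +1} with the sign function (0 is mapped to +1)."""
    return [1 if v >= 0 else -1 for v in values]

def pack_bits(signs):
    """Pack a {-1, +1} vector into one integer: +1 becomes bit 1, -1 becomes bit 0."""
    bits = 0
    for i, s in enumerate(signs):
        if s == 1:
            bits |= 1 << i
    return bits

def xnor_popcount_dot(a_bits, b_bits, n):
    """Dot product of two packed {-1, +1} vectors of length n.

    XNOR marks the positions where the signs agree. Each agreement
    contributes +1 and each disagreement -1, so dot = 2 * agreements - n.
    """
    mask = (1 << n) - 1                  # keep only the n valid bit positions
    xnor = ~(a_bits ^ b_bits) & mask
    agreements = bin(xnor).count("1")    # popcount
    return 2 * agreements - n

# Compare against the ordinary dot product on random data (illustrative only).
random.seed(0)
n = 64
w = binarize([random.gauss(0.0, 1.0) for _ in range(n)])
x = binarize([random.gauss(0.0, 1.0) for _ in range(n)])
reference = sum(wi * xi for wi, xi in zip(w, x))
fast = xnor_popcount_dot(pack_bits(w), pack_bits(x), n)
assert reference == fast
```

Packing 64 signs into one machine word is what lets a single XNOR and popcount stand in for 64 multiply-accumulates, and this is the general source of the compute and memory savings that binarization offers.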
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech Recognition and Synthesis · Natural Language Processing Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Residual Connection · Weight Decay · Layer Normalization · Bilinear Attention · Linear Warmup With Linear Decay
