BinaryBERT: Pushing the Limit of BERT Quantization
Haoli Bai, Wei Zhang, Lu Hou, Lifeng Shang, Jing Jin, Xin Jiang, Qun Liu, Michael Lyu, Irwin King

TL;DR
BinaryBERT introduces a novel weight binarization method for BERT, achieving significant compression with minimal performance loss by leveraging ternary weight splitting and fine-tuning.
Contribution
The paper presents a new approach to BERT quantization using weight binarization with ternary weight splitting, enabling effective training and high compression rates.
Findings
BinaryBERT is 24x smaller than full-precision BERT.
It achieves state-of-the-art compression on GLUE and SQuAD.
Performance drop is minimal compared to full-precision models.
Abstract
The rapid development of large pre-trained language models has greatly increased the demand for model compression techniques, among which quantization is a popular solution. In this paper, we propose BinaryBERT, which pushes BERT quantization to the limit by weight binarization. We find that a binary BERT is harder to train directly than a ternary counterpart due to its complex and irregular loss landscape. Therefore, we propose ternary weight splitting, which initializes BinaryBERT by equivalently splitting from a half-sized ternary network. The binary model thus inherits the good performance of the ternary one, and can be further enhanced by fine-tuning the new architecture after splitting. Empirical results show that our BinaryBERT has only a slight performance drop compared with the full-precision model while being 24x smaller, achieving the state-of-the-art compression results…
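The idea of "equivalently splitting" a ternary network can be illustrated with a minimal sketch. It is a simplified illustration of the quantized-level equivalence only, not the paper's exact construction: it assumes a ternary weight takes values in {-alpha, 0, +alpha} and splits it into two binary weights in {-alpha/2, +alpha/2} whose sum reproduces the ternary value. The function name `split_ternary` and the choice of alpha/2 as the binary scale are assumptions made for this example.

```python
def split_ternary(alpha, codes):
    """Split ternary weights alpha * codes (codes in {-1, 0, +1}) into two
    binary weight lists, each with entries in {-alpha/2, +alpha/2}.

    Simplified illustration: only the quantized values are split here.
    The name, signature, and alpha/2 scale are assumptions for this sketch.
    """
    half = alpha / 2
    first, second = [], []
    for code in codes:
        if code == 0:
            # Opposite signs cancel, reproducing a ternary zero.
            first.append(half)
            second.append(-half)
        else:
            # Same sign in both halves adds up to +alpha or -alpha.
            first.append(code * half)
            second.append(code * half)
    return first, second


# Toy example with values chosen to be exact in floating point.
alpha = 0.5
codes = [1, 0, -1, 0, 1]
first, second = split_ternary(alpha, codes)
recombined = [a + b for a, b in zip(first, second)]
print(recombined == [alpha * c for c in codes])
```

Because each ternary weight becomes two binary weights, the binary model in this sketch has twice as many weights as the ternary one it starts from, which is consistent with the abstract's description of the ternary network as half-sized. Since the split preserves the layer's output, the binary model starts from the ternary model's behavior before any further fine-tuning.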
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Linear Layer · TernaryBERT · Ternary Weight Splitting · BinaryBERT · Dropout · Softmax · Linear Warmup With Linear Decay · Dense Connections · Attention Dropout · Attention Is All You Need
