BEBERT: Efficient and Robust Binary Ensemble BERT
Jiayi Tian, Chao Fang, Haonan Wang, Zhongfeng Wang

TL;DR
BEBERT is an ensemble of binary BERT models that improves accuracy and robustness over existing binary BERTs while training faster, and it remains much smaller and cheaper to run than full-precision BERT.
Contribution
According to the authors, this is the first work to apply ensemble techniques to binary BERT models. It achieves higher accuracy and robustness than prior binary BERTs, and it removes knowledge distillation during ensembling to speed up training without compromising accuracy.
Findings
BEBERT outperforms existing binary BERT models in accuracy and robustness.
BEBERT trains about 2x faster than existing binary BERT models.
Compared with the full-precision BERT baseline, BEBERT reduces model size by 13x and FLOPs by 15x with minimal accuracy loss.
Abstract
Pre-trained BERT models have achieved impressive accuracy on natural language processing (NLP) tasks. However, their excessive amount of parameters hinders them from efficient deployment on edge devices. Binarization of the BERT models can significantly alleviate this issue but comes with a severe accuracy drop compared with their full-precision counterparts. In this paper, we propose an efficient and robust binary ensemble BERT (BEBERT) to bridge the accuracy gap. To the best of our knowledge, this is the first work employing ensemble techniques on binary BERTs, yielding BEBERT, which achieves superior accuracy while retaining computational efficiency. Furthermore, we remove the knowledge distillation procedures during ensemble to speed up the training process without compromising accuracy. Experimental results on the GLUE benchmark show that the proposed BEBERT significantly…
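The two ideas the abstract combines, weight binarization and ensembling, can be illustrated with a minimal sketch. This is a generic illustration, not the authors' implementation: the sign-plus-scale binarization, the probability averaging, and every function name here (`binarize`, `binary_linear`, `ensemble_predict`) are assumptions chosen for clarity, and the paper's actual ensemble scheme may differ.

```python
import math

def binarize(weights):
    """Sign-binarize a weight vector (assumed scheme: signs plus one
    mean-absolute-value scale, so each weight needs only one bit)."""
    scale = sum(abs(w) for w in weights) / len(weights)
    signs = [1.0 if w >= 0 else -1.0 for w in weights]
    return scale, signs

def binary_linear(x, scale, signs):
    """Dot product of an input with the binarized weights."""
    return scale * sum(xi * si for xi, si in zip(x, signs))

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_predict(member_logits, member_weights=None):
    """Combine ensemble members by a weighted average of their class
    probabilities. Uniform weights are an assumption of this sketch."""
    n = len(member_logits)
    if member_weights is None:
        member_weights = [1.0 / n] * n
    num_classes = len(member_logits[0])
    avg = [0.0] * num_classes
    for w, logits in zip(member_weights, member_logits):
        for c, p in enumerate(softmax(logits)):
            avg[c] += w * p
    label = max(range(num_classes), key=lambda c: avg[c])
    return label, avg

# Hypothetical example: three binary members vote on a two-class input.
scale, signs = binarize([0.4, -0.2, 0.6])
out = binary_linear([1.0, 2.0, 3.0], scale, signs)
label, probs = ensemble_predict([[2.0, 0.0], [0.0, 1.0], [1.5, 0.5]])
```

In this toy run two of the three members favor class 0, so the averaged prediction is class 0 even though one member disagrees. That is the intuition behind using an ensemble to recover accuracy lost by individual binary models.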
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning and Data Classification
Methods: Multi-Head Attention · Attention Is All You Need · Linear Warmup With Linear Decay · Attention Dropout · Weight Decay · Dense Connections · Linear Layer · Layer Normalization · Residual Connection
