BiPFT: Binary Pre-trained Foundation Transformer with Low-rank Estimation of Binarization Residual Polynomials
Xingrun Xing, Li Du, Xinyuan Wang, Xianlin Zeng, Yequan Wang, Zheng Zhang, Jiajun Zhang

TL;DR
BiPFT is a binary pretrained transformer that cuts operations by 56 times and memory by 28 times while maintaining high performance on NLU tasks, using a novel low-rank estimation of binarization residual polynomials.
Contribution
This work presents the first binary pretrained transformer for NLP, employing low-rank estimators of binarization residuals to enhance binary neural network capabilities.
Findings
Surpasses baseline by 15.4% on GLUE benchmark
Reduces operations by 56 times and memory by 28 times
Improves robustness and efficiency of BNNs
Abstract
Pretrained foundation models offer substantial benefits for a wide range of downstream tasks and may be one of the most promising routes toward artificial general intelligence. However, scaling up foundation transformers for maximal task-agnostic knowledge has brought computational challenges, especially on resource-limited devices such as mobile phones. This work proposes the first Binary Pretrained Foundation Transformer (BiPFT) for natural language understanding (NLU) tasks, which reduces operations by 56 times and memory by 28 times. In contrast to previous task-specific binary transformers, BiPFT substantially enhances the learning capabilities of binary neural networks (BNNs), bringing BNNs into the era of pre-training. Benefiting from extensive pretraining data, we further propose a data-driven binarization method. Specifically, we first analyze the…
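To make the idea of a "binarization residual" and its low-rank estimate concrete, here is a minimal NumPy sketch. It is a generic illustration and not the method from the paper, whose details are truncated above: it assumes sign binarization of a single weight matrix with one scale factor, and it uses a truncated SVD as the low-rank estimator. The names `binarize` and `low_rank`, the matrix size, and the rank are all hypothetical choices.

```python
import numpy as np

def binarize(w):
    """Sign binarization with one scale factor (assumed: mean absolute value)."""
    alpha = np.abs(w).mean()
    b = np.where(w >= 0, 1.0, -1.0)  # entries are exactly +1 or -1
    return alpha, b

def low_rank(m, rank):
    """Best rank-`rank` approximation of m in Frobenius norm (truncated SVD)."""
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))      # stand-in for a full-precision weight matrix

alpha, b = binarize(w)
residual = w - alpha * b               # what binarization loses
correction = low_rank(residual, rank=4)  # cheap estimate of that loss

err_binary = np.linalg.norm(w - alpha * b)
err_corrected = np.linalg.norm(w - (alpha * b + correction))
print(f"binary only: {err_binary:.3f}, with low-rank correction: {err_corrected:.3f}")
```

In this sketch the rank controls the trade-off: a higher rank recovers more of the residual but adds more full-precision parameters on top of the 1-bit matrix.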
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Speech Recognition and Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Layer Normalization · Residual Connection · Dropout · Dense Connections · Position-Wise Feed-Forward Layer · Absolute Position Encodings
