BiPFT: Binary Pre-trained Foundation Transformer with Low-rank Estimation of Binarization Residual Polynomials

Xingrun Xing; Li Du; Xinyuan Wang; Xianlin Zeng; Yequan Wang; Zheng Zhang; Jiajun Zhang

arXiv:2312.08937·cs.LG·June 21, 2024·1 cites

TL;DR

BiPFT is a binary pretrained transformer for NLU tasks that cuts operations by 56 times and memory by 28 times, using low-rank estimation of binarization residual polynomials to preserve accuracy.

Contribution

This work presents the first binary pretrained foundation transformer for NLU tasks, employing low-rank estimators of binarization residuals to improve the learning capability of binary neural networks.
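To make the phrase "low-rank estimators of binarization residuals" concrete, here is a minimal generic sketch, not the BiPFT method itself. It assumes a plain weight matrix, sign binarization with a per-row scale, and a truncated SVD as the low-rank estimator; the function names `binarize` and `low_rank_residual`, the matrix size, and the rank are all hypothetical choices made for illustration only.

```python
import numpy as np

def binarize(w):
    """Sign binarization with a per-row scale (mean absolute value).
    This is a common generic scheme, assumed here for illustration."""
    alpha = np.abs(w).mean(axis=1, keepdims=True)
    b = np.where(w >= 0, 1.0, -1.0)
    return alpha * b

def low_rank_residual(w, w_bin, rank):
    """Rank-`rank` truncated-SVD approximation of the residual w - w_bin."""
    u, s, vt = np.linalg.svd(w - w_bin, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))      # stand-in full-precision matrix
w_bin = binarize(w)                    # binary approximation
w_corr = w_bin + low_rank_residual(w, w_bin, rank=4)

err_bin = np.linalg.norm(w - w_bin)    # Frobenius error, binary only
err_corr = np.linalg.norm(w - w_corr)  # error after low-rank correction
print(err_bin, err_corr)
```

The correction stores only two thin factors in full precision, so the extra cost grows with the chosen rank rather than with the full matrix size. How BiPFT defines and estimates its residual polynomials is not shown in this excerpt and may differ substantially from this sketch.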

Findings

01 Surpasses baseline by 15.4% on GLUE benchmark
02 Reduces operations by 56 times and memory by 28 times
03 Improves robustness and efficiency of BNNs

Abstract

Pretrained foundation models offer substantial benefits for a wide range of downstream tasks and may be among the most promising routes toward artificial general intelligence. However, scaling up foundation transformers to maximize task-agnostic knowledge has brought computational challenges, especially on resource-limited devices such as mobile phones. This work proposes the first Binary Pretrained Foundation Transformer (BiPFT) for natural language understanding (NLU) tasks, which requires 56 times fewer operations and 28 times less memory. In contrast to previous task-specific binary transformers, BiPFT substantially enhances the learning capabilities of binary neural networks (BNNs), bringing BNNs into the era of pre-training. Benefiting from extensive pretraining data, we further propose a data-driven binarization method. Specifically, we first analyze the…

Code & Models

Repositories

xingrun-xing/bipft (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Speech Recognition and Synthesis

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Layer Normalization · Residual Connection · Dropout · Dense Connections · Position-Wise Feed-Forward Layer · Absolute Position Encodings