BiT: Robustly Binarized Multi-distilled Transformer

Zechun Liu; Barlas Oguz; Aasish Pappu; Lin Xiao; Scott Yih; Meng Li; Raghuraman Krishnamoorthi; Yashar Mehdad

arXiv:2205.13016·cs.LG·October 4, 2022·22 cites

TL;DR

This paper introduces BiT, a set of techniques for training fully binarized transformers that retain high accuracy, which makes them practical for resource-constrained environments. It combines a two-set binarization scheme, an elastic binary activation function with learned parameters, and successive distillation of higher-precision models into lower-precision students.

Contribution

The paper presents a novel combination of binarization schemes, elastic binary activation functions, and progressive distillation to create highly accurate fully binarized transformers.

Findings

01

Achieves near-full-precision BERT accuracy on the GLUE benchmark

02

Introduces a two-set binarization scheme and elastic binary activation

03

Shows that successively distilling higher-precision models into lower-precision students can push quantization to its limit without a large accuracy loss
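
The successive distillation in finding 03 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' training code: the helper names `distillation_loss` and `distillation_chain`, the use of a soft-label cross-entropy on output logits, and the precision labels in the usage example are all hypothetical and are not specified on this page.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits: Sequence[float],
                      student_logits: Sequence[float]) -> float:
    """Cross-entropy of the student's distribution against the teacher's
    soft targets (an assumed, commonly used distillation objective)."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))


def distillation_chain(precisions: Sequence[str]) -> List[Tuple[str, str]]:
    """Pair each precision level with the next lower one as (teacher, student)."""
    return list(zip(precisions, precisions[1:]))


# Hypothetical precision labels; the actual schedule is not given here.
for teacher, student in distillation_chain(["full-precision", "intermediate", "binary"]):
    print(f"distill {teacher} -> {student}")
```

Each student in the chain becomes the teacher for the next, lower-precision step, so no single step has to bridge the whole gap from full precision to binary.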

Abstract

Modern pre-trained transformers have rapidly advanced the state-of-the-art in machine learning, but have also grown in parameters and computational complexity, making them increasingly difficult to deploy in resource-constrained environments. Binarization of the weights and activations of the network can significantly alleviate these issues, but it is technically challenging from an optimization perspective. In this work, we identify a series of improvements that enable binary transformers at a much higher accuracy than what was possible previously. These include a two-set binarization scheme, a novel elastic binary activation function with learned parameters, and a method to quantize a network to its limit by successively distilling higher precision models into lower precision students. These approaches allow, for the first time, fully binarized transformer models that are at a…
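
To make the idea of a binary activation with learned parameters concrete, here is a minimal sketch in plain Python. It rests on assumptions that the abstract does not spell out: that the learned parameters are a scale `alpha` and a threshold `beta`, and that the "two sets" are a signed output set {-alpha, +alpha} and an unsigned set {0, alpha}. The class name `ElasticBinarizer` is hypothetical, and the parameters are fixed numbers here rather than trained values.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ElasticBinarizer:
    """Binarizer with a scale (alpha) and a threshold (beta).

    Assumed form, for illustration only: in training, alpha and beta
    would be learned; here they are plain constants.
    """
    alpha: float          # scale of the output levels, assumed > 0
    beta: float = 0.0     # shift applied before thresholding
    signed: bool = True   # True -> {-alpha, +alpha}; False -> {0, alpha}

    def __call__(self, xs: List[float]) -> List[float]:
        out = []
        for x in xs:
            z = (x - self.beta) / self.alpha
            if self.signed:
                level = 1.0 if z >= 0 else -1.0   # sign-style binarization
            else:
                level = 1.0 if z >= 0.5 else 0.0  # round-and-clip to {0, 1}
            out.append(self.alpha * level)
        return out


signed = ElasticBinarizer(alpha=0.5)
print(signed([-1.2, 0.3]))           # [-0.5, 0.5]

unsigned = ElasticBinarizer(alpha=0.4, beta=0.1, signed=False)
print(unsigned([0.05, 0.2, 0.9]))    # [0.0, 0.0, 0.4]
```

The unsigned set suits inputs that are never negative, where a {-alpha, +alpha} output would spend one of its two levels on values that cannot occur.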

Code & Models

Videos

BiT: Robustly Binarized Multi-distilled Transformer · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Weight Decay · Dropout · Adam · WordPiece · Linear Warmup With Linear Decay · Attention Dropout