Optimal Subarchitecture Extraction For BERT
Adrian de Wynter, Daniel J. Perry

TL;DR
This paper introduces "Bort", an optimal, smaller subarchitecture of BERT extracted through neural architecture search, which substantially reduces pretraining time and improves performance on NLP benchmarks.
Contribution
We present a neural architecture search method to extract an optimal, smaller BERT subarchitecture, reducing size and training time while enhancing performance.
Findings
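Bort's effective size (not counting the embedding layer) is 5.5% that of BERT-large.
Pretraining Bort takes only 288 GPU hours, 1.2% of the time required to pretrain RoBERTa-large.
Bort outperforms other compressed variants of BERT, and some non-compressed variants including BERT-large, on NLP benchmarks.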
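The percentages in these findings can be turned back into absolute figures with simple arithmetic. The sketch below is illustrative only: the helper names `implied_baseline` and `reduction_factor` are hypothetical, and the inputs are the rounded values quoted above, so the results are approximate rather than numbers reported by the authors. With these inputs, 288 GPU hours at 1.2% implies roughly 24,000 GPU hours for RoBERTa-large.

```python
# Rounded figures quoted in the findings above (not exact measurements).
BORT_PRETRAIN_GPU_HOURS = 288.0
FRACTION_OF_ROBERTA_LARGE = 0.012  # 1.2% of RoBERTa-large pretraining time
EFFECTIVE_SIZE_FRACTION = 0.055    # 5.5% of BERT-large, embedding layer excluded


def implied_baseline(value: float, fraction: float) -> float:
    """Return the baseline such that value == fraction * baseline."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return value / fraction


def reduction_factor(fraction: float) -> float:
    """Return how many times smaller a quantity is, given its fraction of the original."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return 1.0 / fraction


if __name__ == "__main__":
    roberta_hours = implied_baseline(BORT_PRETRAIN_GPU_HOURS, FRACTION_OF_ROBERTA_LARGE)
    print(f"Implied RoBERTa-large pretraining cost: about {roberta_hours:,.0f} GPU hours")
    print(f"Pretraining time reduction: about {reduction_factor(FRACTION_OF_ROBERTA_LARGE):.1f}x")
    print(f"Effective size reduction: about {reduction_factor(EFFECTIVE_SIZE_FRACTION):.1f}x")
```

Because the source percentages are rounded to two significant figures, the derived values should be read as order-of-magnitude checks, not as precise costs.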
Abstract
We extract an optimal subset of architectural parameters for the BERT architecture from Devlin et al. (2018) by applying recent breakthroughs in algorithms for neural architecture search. This optimal subset, which we refer to as "Bort", is demonstrably smaller, having an effective (that is, not counting the embedding layer) size of 5.5% the original BERT-large architecture, and 16% of the net size. Bort is also able to be pretrained in 288 GPU hours, which is 1.2% of the time required to pretrain the highest-performing BERT parametric architectural variant, RoBERTa-large (Liu et al., 2019), and about 33% of that of the world-record, in GPU hours, required to train BERT-large on the same hardware. It is also 7.9x faster on a CPU, as well as being better performing than other compressed variants of the architecture, and some of the non-compressed variants: it obtains…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Algorithms
Methods: Linear Layer · Bort · Dense Connections · Multi-Head Attention · Dropout · Linear Warmup With Linear Decay · Attention Dropout · Weight Decay · Residual Connection
