Optimal Subarchitecture Extraction For BERT

Adrian de Wynter; Daniel J. Perry

arXiv:2010.10499·cs.CL·November 10, 2020·23 cites

TL;DR

This paper introduces "Bort", an optimal, smaller subarchitecture of BERT extracted through neural architecture search. Bort pretrains in far fewer GPU hours and performs better than other compressed variants of the architecture, and some non-compressed ones, on NLP benchmarks.

Contribution

We present a neural architecture search method to extract an optimal, smaller BERT subarchitecture, reducing size and training time while enhancing performance.

Findings

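01. Bort's effective size (not counting the embedding layer) is 5.5% of BERT-large, and its net size is 16%.

02. Pretraining Bort takes 288 GPU hours, 1.2% of the time required to pretrain RoBERTa-large.

03. Bort outperforms BERT-large, other compressed variants, and some non-compressed variants on NLP benchmarks.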

Abstract

We extract an optimal subset of architectural parameters for the BERT architecture from Devlin et al. (2018) by applying recent breakthroughs in algorithms for neural architecture search. This optimal subset, which we refer to as "Bort", is demonstrably smaller, having an effective (that is, not counting the embedding layer) size of $5.5\%$ the original BERT-large architecture, and $16\%$ of the net size. Bort is also able to be pretrained in $288$ GPU hours, which is $1.2\%$ of the time required to pretrain the highest-performing BERT parametric architectural variant, RoBERTa-large (Liu et al., 2019), and about $33\%$ of that of the world-record, in GPU hours, required to train BERT-large on the same hardware. It is also $7.9\times$ faster on a CPU, as well as being better performing than other compressed variants of the architecture, and some of the non-compressed variants: it obtains…
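The ratios quoted in the abstract imply rough baseline costs. The sketch below is a back-of-the-envelope calculation only: it takes the rounded figures from the abstract (288 GPU hours, 1.2%, about 33%) at face value, and the function and variable names are illustrative, not from the paper. The results are therefore approximate.

```python
# Back-of-the-envelope arithmetic on the abstract's rounded figures.
# All names here are illustrative; only the three constants come from the abstract.
BORT_GPU_HOURS = 288            # Bort pretraining cost
FRACTION_OF_ROBERTA = 0.012     # "1.2%" of RoBERTa-large pretraining time
FRACTION_OF_BERT_RECORD = 0.33  # "about 33%" of the BERT-large record run


def implied_baseline_hours(hours: float, fraction: float) -> float:
    """Return the baseline cost implied by `hours` being `fraction` of it."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return hours / fraction


roberta_hours = implied_baseline_hours(BORT_GPU_HOURS, FRACTION_OF_ROBERTA)
bert_record_hours = implied_baseline_hours(BORT_GPU_HOURS, FRACTION_OF_BERT_RECORD)

print(f"Implied RoBERTa-large pretraining: ~{roberta_hours:,.0f} GPU hours")
print(f"Implied BERT-large record run:     ~{bert_record_hours:,.0f} GPU hours")
```

Because the percentages are rounded, these implied totals should be read as order-of-magnitude estimates rather than figures reported by the authors.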

Code & Models

Models

amazon/bort (Hugging Face model)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Algorithms

Methods: Linear Layer · Bort · Dense Connections · Multi-Head Attention · Dropout · Linear Warmup With Linear Decay · Attention Dropout · Weight Decay · Residual Connection