Token Dropping for Efficient BERT Pretraining

Le Hou; Richard Yuanzhe Pang; Tianyi Zhou; Yuexin Wu; Xinying Song; Xiaodan Song; Denny Zhou

arXiv:2203.13240·cs.CL·March 25, 2022

Open Access

TL;DR

This paper introduces a token dropping method for BERT pretraining that selectively drops unimportant tokens during training, reducing pretraining cost by 25% while achieving similar overall fine-tuning performance on standard downstream tasks.

Contribution

It proposes a simple, effective token dropping technique that uses the already built-in MLM loss to identify unimportant tokens, accelerating BERT pretraining while keeping overall downstream fine-tuning performance similar.

Findings

01. BERT pretraining cost is reduced by 25%.

02. Overall fine-tuning performance on standard downstream tasks remains similar.

03. Token importance is identified from the existing MLM loss with practically no computational overhead.

Abstract

Transformer-based models generally allocate the same amount of computation for each token in a given sequence. We develop a simple but effective "token dropping" method to accelerate the pretraining of transformer models, such as BERT, without degrading its performance on downstream tasks. In short, we drop unimportant tokens starting from an intermediate layer in the model to make the model focus on important tokens; the dropped tokens are later picked up by the last layer of the model so that the model still produces full-length sequences. We leverage the already built-in masked language modeling (MLM) loss to identify unimportant tokens with practically no computational overhead. In our experiments, this simple approach reduces the pretraining cost of BERT by 25% while achieving similar overall fine-tuning performance on standard downstream tasks.
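The routing described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`update_importance`, `split_tokens`, `forward_with_dropping`), the running-average scheme for turning MLM loss into a per-token importance score, the `momentum` value, and the choice of how many tokens to keep are all hypothetical. Layers are modeled as plain callables over lists of vectors so that the example stays self-contained.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]
Layer = Callable[[List[Vector]], List[Vector]]


def update_importance(scores: Dict[int, float], token_ids: Sequence[int],
                      mlm_losses: Sequence[float], momentum: float = 0.9) -> None:
    """Assumed scheme: keep a running average of MLM loss per vocabulary id.

    Only bookkeeping on losses that are already computed, so no extra
    forward or backward pass is needed.
    """
    for tok, loss in zip(token_ids, mlm_losses):
        old = scores.get(tok, loss)  # first observation initializes the average
        scores[tok] = momentum * old + (1.0 - momentum) * loss


def split_tokens(token_ids: Sequence[int], scores: Dict[int, float],
                 keep: int) -> Tuple[List[int], List[int]]:
    """Return (kept, dropped) positions; higher-scoring tokens are kept."""
    order = sorted(range(len(token_ids)),
                   key=lambda i: scores.get(token_ids[i], 0.0), reverse=True)
    # Restore sequence order inside each group.
    return sorted(order[:keep]), sorted(order[keep:])


def forward_with_dropping(hidden: List[Vector], token_ids: Sequence[int],
                          scores: Dict[int, float], first_layers: Sequence[Layer],
                          middle_layers: Sequence[Layer], last_layer: Layer,
                          keep: int) -> List[Vector]:
    """Run the full sequence, then only kept tokens, then the full sequence again."""
    for layer in first_layers:          # all tokens
        hidden = layer(hidden)

    kept, _dropped = split_tokens(token_ids, scores, keep)
    short = [hidden[i] for i in kept]   # important tokens only
    for layer in middle_layers:
        short = layer(short)

    merged = list(hidden)               # dropped tokens keep their earlier state
    for pos, vec in zip(kept, short):
        merged[pos] = vec
    return last_layer(merged)           # full-length output for the MLM head
```

In this sketch the middle layers see a shorter sequence, which is where the savings would come from, while the last layer receives every position again so the output has the original length.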

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Neural Network Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Layer Normalization · Adam · Attention Dropout · Residual Connection · Dense Connections