RomeBERT: Robust Training of Multi-Exit BERT

Shijie Geng; Peng Gao; Zuohui Fu; Yongfeng Zhang

arXiv:2101.09755 · cs.CL · January 26, 2021 · 6 cites

TL;DR

RomeBERT introduces a robust training method for multi-exit BERT models. Gradient regularized self-distillation improves early-exit performance, a one-stage joint training strategy reduces training time, and the resulting model outperforms DeeBERT on GLUE tasks.

Contribution

The paper proposes RomeBERT, a novel training approach that enhances early exit performance and simplifies training for multi-exit BERT models using gradient regularized self-distillation.
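To make the idea of self-distillation in a multi-exit model concrete, here is a minimal sketch in plain Python. It is not the paper's exact objective: it assumes the final exit acts as the teacher for every earlier exit, and the function names, the `temperature` and `alpha` parameters, and their default values are all hypothetical. The gradient regularization component is not modeled, since this page does not describe it.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-likelihood of the gold label under softmax(logits)."""
    return -math.log(softmax(logits)[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def joint_self_distillation_loss(exit_logits, label, temperature=2.0, alpha=0.5):
    """One-stage joint loss over all exits (illustrative assumption).

    exit_logits: list of per-exit logit lists, ordered shallow to deep.
    The last exit is treated as the teacher for all earlier exits.
    """
    teacher = softmax(exit_logits[-1], temperature)
    # Every exit, including the last, is trained on the task label.
    task_loss = sum(cross_entropy(logits, label) for logits in exit_logits)
    # Earlier exits are also pulled toward the softened teacher distribution.
    distill_loss = sum(
        kl_divergence(teacher, softmax(logits, temperature))
        for logits in exit_logits[:-1]
    )
    return task_loss + alpha * distill_loss
```

Because all exits contribute to a single scalar loss, the backbone and the exit classifiers can be optimized together in one stage. When every exit already matches the last exit, the distillation term vanishes and the loss reduces to the sum of per-exit cross-entropies.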

Findings

01 RomeBERT outperforms DeeBERT on GLUE datasets.

02 RomeBERT achieves better early-exit accuracy.

03 The joint training strategy reduces training time.

Abstract

BERT has achieved superior performance on Natural Language Understanding (NLU) tasks. However, BERT has a large number of parameters and demands considerable resources to deploy. To accelerate inference, Dynamic Early Exiting for BERT (DeeBERT) was recently proposed; it incorporates multiple exits and adopts a dynamic early-exit mechanism. Although this yields an efficiency-performance tradeoff, the early exits in multi-exit BERT perform significantly worse than the late exits. In this paper, we leverage gradient regularized self-distillation for RObust training of Multi-Exit BERT (RomeBERT), which can effectively solve the performance imbalance problem between early and late exits. Moreover, the proposed RomeBERT adopts a one-stage joint training strategy for multi-exits and the BERT backbone while DeeBERT needs two stages that require more training…
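The dynamic early-exit mechanism mentioned in the abstract can be illustrated with a small sketch. It assumes an entropy-threshold criterion of the kind used by DeeBERT: inference stops at the first exit whose predictive entropy falls below a threshold. The function names and the `threshold` parameter are hypothetical, and the per-exit logits are passed in as a precomputed list only for illustration; a real model would compute each layer lazily and skip the remaining layers after exiting.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def early_exit_predict(exit_logits, threshold):
    """Return (exit_index, predicted_class) for one input.

    exit_logits: per-exit logits ordered shallow to deep (precomputed
    here for illustration only). The last exit is always accepted.
    """
    last = len(exit_logits) - 1
    for index, logits in enumerate(exit_logits):
        probs = softmax(logits)
        # Low entropy means the exit is confident enough to stop here.
        if entropy(probs) < threshold or index == last:
            predicted = max(range(len(probs)), key=probs.__getitem__)
            return index, predicted


if __name__ == "__main__":
    logits_per_exit = [[0.1, 0.0], [4.0, -4.0], [5.0, -5.0]]
    print(early_exit_predict(logits_per_exit, threshold=0.1))
```

A higher threshold lets more inputs leave at shallow exits, which saves computation but relies more heavily on the quality of the early classifiers. This is why the performance gap between early and late exits matters for the efficiency-performance tradeoff.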

Code & Models

Repositories

romebert/RomeBERT (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques

Methods: Linear Layer · DeeBERT · Layer Normalization · WordPiece · Residual Connection · Attention Dropout · Attention Is All You Need · Dense Connections · Adam · Linear Warmup With Linear Decay