RomeBERT: Robust Training of Multi-Exit BERT
Shijie Geng, Peng Gao, Zuohui Fu, Yongfeng Zhang

TL;DR
RomeBERT is a robust training method for multi-exit BERT models. It uses gradient regularized self-distillation to improve the performance of early exits and a one-stage joint training strategy to reduce training time, and it outperforms previous methods such as DeeBERT on GLUE tasks.
Contribution
The paper proposes RomeBERT, a novel training approach that enhances early exit performance and simplifies training for multi-exit BERT models using gradient regularized self-distillation.
Findings
RomeBERT outperforms DeeBERT on GLUE datasets.
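Its early exits are more accurate, which narrows the performance gap between early and late exits.
Training time is reduced because the exits and the BERT backbone are trained jointly in one stage, whereas DeeBERT needs two stages.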
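To illustrate what a one-stage joint objective with self-distillation could look like, here is a minimal sketch in plain Python. It assumes the deepest exit acts as the teacher for all shallower exits and that each shallower exit combines a cross-entropy term with a temperature-scaled KL term. The names `joint_loss`, `alpha`, and `temperature` are hypothetical, and the gradient regularization component is not shown because this summary does not specify it. This is not the authors' implementation.

```python
import math

def log_softmax(logits, temperature=1.0):
    # Numerically stable log-softmax over temperature-scaled logits.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]

def cross_entropy(logits, label):
    # Negative log-likelihood of the true label.
    return -log_softmax(logits)[label]

def kl_divergence(teacher_logits, student_logits, temperature):
    # KL(teacher || student) on temperature-softened distributions.
    t = log_softmax(teacher_logits, temperature)
    s = log_softmax(student_logits, temperature)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))

def joint_loss(exit_logits, label, alpha=0.5, temperature=2.0):
    # Assumption: the last exit is the teacher. In a real framework its
    # logits would be detached so no gradient flows through the teacher.
    teacher = exit_logits[-1]
    total = cross_entropy(teacher, label)
    for student in exit_logits[:-1]:
        total += (1 - alpha) * cross_entropy(student, label)
        total += alpha * temperature ** 2 * kl_divergence(
            teacher, student, temperature)
    return total

# Three exits (two early, one final) for a binary task, true label 0.
loss = joint_loss([[0.2, 0.1], [1.0, -0.5], [3.0, -3.0]], label=0)
```

Because all exits contribute to a single scalar loss, the backbone and every exit classifier can be updated in the same backward pass, which is the property the one-stage strategy relies on.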
Abstract
BERT has achieved superior performances on Natural Language Understanding (NLU) tasks. However, BERT possesses a large number of parameters and demands certain resources to deploy. For acceleration, Dynamic Early Exiting for BERT (DeeBERT) has been proposed recently, which incorporates multiple exits and adopts a dynamic early-exit mechanism to ensure efficient inference. While obtaining an efficiency-performance tradeoff, the performances of early exits in multi-exit BERT are significantly worse than late exits. In this paper, we leverage gradient regularized self-distillation for RObust training of Multi-Exit BERT (RomeBERT), which can effectively solve the performance imbalance problem between early and late exits. Moreover, the proposed RomeBERT adopts a one-stage joint training strategy for multi-exits and the BERT backbone while DeeBERT needs two stages that require more training…
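The dynamic early-exit mechanism described in the abstract can be sketched as follows. This is a simplified illustration in plain Python, assuming an entropy-based confidence criterion: inference stops at the first exit whose prediction entropy falls below a threshold. The function names and the `threshold` parameter are hypothetical, and the logits are precomputed here for clarity, whereas a real model would compute each layer only when needed.

```python
import math

def softmax(logits):
    # Numerically stable softmax.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats; low entropy means a confident prediction.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def early_exit(exit_logits, threshold):
    """Return (exit_index, predicted_class) for the first confident exit.

    Falls back to the final exit if no earlier exit is confident enough.
    """
    last = len(exit_logits) - 1
    for i, logits in enumerate(exit_logits):
        probs = softmax(logits)
        if entropy(probs) < threshold or i == last:
            predicted = max(range(len(probs)), key=probs.__getitem__)
            return i, predicted

# The first exit is uncertain, the second is confident, so inference
# stops at exit index 1 and skips the remaining layers.
result = early_exit([[0.1, 0.0], [4.0, -4.0], [5.0, -5.0]], threshold=0.3)
```

A lower threshold forces more samples through to deeper exits (slower, typically more accurate), while a higher threshold lets more samples leave early. This is why the accuracy of the early exits matters for the overall efficiency-performance tradeoff.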
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
Methods: Linear Layer · DeeBERT · Layer Normalization · WordPiece · Residual Connection · Attention Dropout · Attention Is All You Need · Dense Connections · Adam · Linear Warmup With Linear Decay
