bert2BERT: Towards Reusable Pretrained Language Models
Cheng Chen, Yichun Yin, Lifeng Shang, Xin Jiang, Yujia Qin, Fengyu Wang, Zhi Wang, Xiao Chen, Zhiyuan Liu, Qun Liu

TL;DR
bert2BERT introduces a method to transfer knowledge from smaller pre-trained models to larger ones, significantly reducing pre-training costs and improving efficiency across different transformer-based models.
Contribution
It extends function-preserving initialization to Transformer-based language models, improves it with advanced knowledge initialization, and proposes a two-stage pre-training method to further speed up training of the large model.
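To make "function-preserving" concrete, here is a minimal sketch of width expansion on a toy two-layer feed-forward block, in the spirit of Net2Net-style widening. It is an illustration under stated assumptions, not the paper's exact procedure: the helper names `widen_hidden` and `mlp`, the toy shapes, and the random choice of which units to duplicate are all hypothetical. The idea is that new hidden units copy existing ones, and the outgoing weights of each copied unit are divided by its number of copies, so the widened network computes the same output as the small one.

```python
import numpy as np


def mlp(x, w1, b1, w2):
    """Toy block: y = w2 @ relu(w1 @ x + b1)."""
    return w2 @ np.maximum(w1 @ x + b1, 0.0)


def widen_hidden(w1, b1, w2, new_hidden, rng):
    """Widen the hidden layer from w1.shape[0] to new_hidden units
    without changing the function the block computes (hypothetical helper)."""
    hidden = w1.shape[0]
    if new_hidden < hidden:
        raise ValueError("new_hidden must be >= the current hidden size")

    # Each new unit copies a randomly chosen existing unit.
    extra = rng.integers(0, hidden, size=new_hidden - hidden)
    mapping = np.concatenate([np.arange(hidden), extra])

    # How many copies of each original unit exist after widening.
    counts = np.bincount(mapping, minlength=hidden)

    # Incoming weights and biases are copied as-is.
    w1_new = w1[mapping, :]
    b1_new = b1[mapping]

    # Outgoing weights are split evenly across the copies,
    # so the summed contribution of each original unit is unchanged.
    w2_new = w2[:, mapping] / counts[mapping]
    return w1_new, b1_new, w2_new


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w1 = rng.normal(size=(4, 6))   # hidden=4, input=6
    b1 = rng.normal(size=4)
    w2 = rng.normal(size=(3, 4))   # output=3
    x = rng.normal(size=6)

    w1_big, b1_big, w2_big = widen_hidden(w1, b1, w2, new_hidden=8, rng=rng)
    # The small and widened blocks should agree up to floating-point error.
    assert np.allclose(mlp(x, w1, b1, w2), mlp(x, w1_big, b1_big, w2_big))
```

Because the widened block starts from the small model's function rather than from random weights, further training continues from that point. Applying the same idea to attention heads, layer normalization, and depth in a full Transformer needs additional care that this toy example does not cover.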
Findings
Reduces pre-training computational cost by about 45% for BERT_BASE and 47% for GPT_BASE when reusing models of roughly half their size.
Demonstrates the method's applicability to various pre-trained models.
Achieves significant efficiency improvements over training from scratch.
Abstract
In recent years, researchers have tended to pre-train ever-larger language models to explore the upper limit of deep models. However, pre-training large language models consumes intensive computational resources, and most models are trained from scratch without reusing existing pre-trained models, which is wasteful. In this paper, we propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model (e.g., BERT_BASE) to a large model (e.g., BERT_LARGE) through parameter initialization and significantly improve the pre-training efficiency of the large model. Specifically, we extend the previous function-preserving method to Transformer-based language models, and further improve it by proposing advanced knowledge initialization for the large model. In addition, a two-stage pre-training method is proposed to further accelerate the training process. We did…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Layer Normalization · Dense Connections · Softmax · Residual Connection · Attention Dropout
