bert2BERT: Towards Reusable Pretrained Language Models

Cheng Chen; Yichun Yin; Lifeng Shang; Xin Jiang; Yujia Qin; Fengyu Wang; Zhi Wang; Xiao Chen; Zhiyuan Liu; Qun Liu

arXiv:2110.07143·cs.CL·October 15, 2021·6 cites

TL;DR

bert2BERT introduces a method to transfer knowledge from smaller pre-trained models to larger ones, significantly reducing pre-training costs and improving efficiency across different transformer-based models.

Contribution

The paper extends function-preserving initialization techniques to Transformer-based language models and proposes a two-stage pre-training method to improve large-model initialization and training efficiency.

Findings

01. Reduces pre-training computational costs by approximately 45-47%.

02. Demonstrates the method's applicability to various pre-trained models.

03. Achieves significant efficiency improvements over training from scratch.

Abstract

In recent years, researchers have tended to pre-train ever-larger language models to explore the upper limit of deep models. However, pre-training a large language model requires intensive computational resources, and most models are trained from scratch without reusing existing pre-trained models, which is wasteful. In this paper, we propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model (e.g., BERT_BASE) to a large model (e.g., BERT_LARGE) through parameter initialization and significantly improve the pre-training efficiency of the large model. Specifically, we extend previous function-preserving methods to Transformer-based language models, and further improve them by proposing advanced knowledge for the large model's initialization. In addition, a two-stage pre-training method is proposed to further accelerate the training process. We did…
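The abstract's notion of a function-preserving initialization can be illustrated with a small sketch. The example below is not the paper's Transformer-specific procedure; it is a Net2Net-style widening of a toy two-layer ReLU network in NumPy, assuming the larger model only has a wider hidden layer. All names (`widen_hidden`, `mlp`, the layer sizes) are hypothetical and chosen for illustration.

```python
import numpy as np


def widen_hidden(w1, b1, w2, new_hidden, rng):
    """Widen the hidden layer of y = relu(x @ w1 + b1) @ w2 without changing y.

    Assumption: a Net2Net-style expansion, used here only as an illustration
    of function-preserving initialization.
    """
    hidden = w1.shape[1]
    if new_hidden < hidden:
        raise ValueError("new_hidden must be at least the current hidden size")

    # Each new hidden unit copies a randomly chosen existing unit.
    extra = rng.integers(0, hidden, size=new_hidden - hidden)
    mapping = np.concatenate([np.arange(hidden), extra])

    # How many times each original unit now appears in the wide layer.
    counts = np.bincount(mapping, minlength=hidden)

    # Copy incoming weights and biases so duplicated units compute the same value.
    w1_wide = w1[:, mapping]
    b1_wide = b1[mapping]

    # Split outgoing weights across the copies so their contributions sum
    # to the original unit's contribution.
    w2_wide = w2[mapping, :] / counts[mapping][:, None]
    return w1_wide, b1_wide, w2_wide


def mlp(x, w1, b1, w2):
    """Two-layer network with a ReLU hidden layer."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w1 = rng.normal(size=(8, 16))
b1 = rng.normal(size=16)
w2 = rng.normal(size=(16, 8))

w1_wide, b1_wide, w2_wide = widen_hidden(w1, b1, w2, new_hidden=24, rng=rng)

small_out = mlp(x, w1, b1, w2)
wide_out = mlp(x, w1_wide, b1_wide, w2_wide)
max_diff = float(np.max(np.abs(small_out - wide_out)))
```

Because every duplicated hidden unit produces the same activation as its source and the outgoing weights are divided by the number of copies, the wide network starts from the same function as the small one, so `max_diff` should be at floating-point noise level. Further training then begins from the small model's behavior rather than from a random initialization.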

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Layer Normalization · Dense Connections · Softmax · Residual Connection · Attention Dropout