Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping

Jesse Dodge; Gabriel Ilharco; Roy Schwartz; Ali Farhadi; Hannaneh Hajishirzi; Noah Smith

arXiv:2002.06305·cs.CL·February 19, 2020·216 cites

TL;DR

This paper investigates the variability in fine-tuning pretrained language models, revealing how random seed choices affect performance, and offers practical guidelines to improve stability and reproducibility in NLP tasks.

Contribution

It systematically analyzes the impact of weight initialization and data order on fine-tuning variability, providing new insights and best practices for NLP model training.
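To study weight initialization and data order separately, the two sources of randomness need independent seeds. The sketch below illustrates that idea with the Python standard library only. The function `make_run`, its parameters, and the toy weights and example indices are hypothetical stand-ins, not the authors' code; a real setup would seed the classifier-layer initializer and the data loader shuffle instead.

```python
import random


def make_run(init_seed, order_seed, n_weights=4, n_examples=6):
    """Draw toy initial weights and a toy data order from two independent RNGs.

    All names and sizes here are illustrative assumptions.
    """
    init_rng = random.Random(init_seed)    # controls weight initialization only
    order_rng = random.Random(order_seed)  # controls training data order only

    # Stand-in for initializing new task-specific weights.
    weights = [init_rng.gauss(0.0, 0.02) for _ in range(n_weights)]

    # Stand-in for shuffling the training examples.
    order = list(range(n_examples))
    order_rng.shuffle(order)
    return weights, order


# Same initialization seed, different data-order seed.
weights_a, order_a = make_run(init_seed=1, order_seed=1)
weights_b, order_b = make_run(init_seed=1, order_seed=2)

# Different initialization seed, same data-order seed.
weights_c, order_c = make_run(init_seed=2, order_seed=1)
```

Holding one seed fixed while sweeping the other gives a grid of runs in which any difference along one axis can be attributed to a single factor.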

Findings

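01. Performance varies substantially across random seeds, even with identical hyperparameters.

02. Some weight initializations perform well across all tasks studied.

03. On small datasets, many fine-tuning runs diverge partway through training, so stopping unpromising runs early saves compute.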

Abstract

Fine-tuning pretrained contextual word embedding models to supervised downstream tasks has become commonplace in natural language processing. This process, however, is often brittle: even with the same hyperparameter values, distinct random seeds can lead to substantially different results. To better understand this phenomenon, we experiment with four datasets from the GLUE benchmark, fine-tuning BERT hundreds of times on each while varying only the random seeds. We find substantial performance increases compared to previously reported results, and we quantify how the performance of the best-found model varies as a function of the number of fine-tuning trials. Further, we examine two factors influenced by the choice of random seed: weight initialization and training data order. We find that both contribute comparably to the variance of out-of-sample performance, and that some weight…
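One way to summarize how the best-found model improves with the number of fine-tuning trials is the expected maximum of n validation scores drawn with replacement from the observed runs. The sketch below computes that quantity from a list of scores. Whether it matches the paper's exact estimator is an assumption, since the abstract here is truncated; the function name `expected_best` and the example scores are hypothetical and are not results from the paper.

```python
def expected_best(scores, n):
    """Expected maximum of n scores drawn with replacement from `scores`."""
    ordered = sorted(scores)
    total = len(ordered)
    expected = 0.0
    for i, value in enumerate(ordered, start=1):
        # Probability that the maximum of n draws is the i-th smallest score.
        prob = (i / total) ** n - ((i - 1) / total) ** n
        expected += value * prob
    return expected


# Made-up validation accuracies, for illustration only.
toy_scores = [0.5, 0.7]
best_of_one = expected_best(toy_scores, 1)  # the plain mean of the scores
best_of_two = expected_best(toy_scores, 2)  # weighted toward the higher score
```

Plotting `expected_best(scores, n)` against n shows how much additional gain to expect from running more seeds under a fixed hyperparameter setting.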

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Linear Layer · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Dense Connections · Adam · WordPiece · Softmax