Improving Stability of Fine-Tuning Pretrained Language Models via Component-Wise Gradient Norm Clipping

Chenghao Yang; Xuezhe Ma

arXiv:2210.10325·cs.CL·October 20, 2022

TL;DR

This paper introduces a component-wise gradient norm clipping technique to enhance the stability and convergence of fine-tuning large pretrained language models, addressing issues caused by varying layer convergence speeds.

Contribution

The paper proposes a novel component-wise gradient norm clipping method that improves fine-tuning stability and performance of pretrained language models.

Findings

01. Improved training stability and convergence speed.
02. Enhanced generalization performance.
03. Consistent results across different models and datasets.

Abstract

Fine-tuning over large pretrained language models (PLMs) has established many state-of-the-art results. Despite its superior performance, such fine-tuning can be unstable, resulting in significant variance in performance and potential risks for practical applications. Previous works have attributed such instability to the catastrophic forgetting problem in the top layers of PLMs, which suggests that iteratively fine-tuning layers in a top-down manner is a promising solution. In this paper, we first point out that this method does not always work, because different layers/modules converge at different speeds. Inspired by this observation, we propose a simple component-wise gradient norm clipping method to adjust the convergence speed for different components. Experimental results demonstrate that our method achieves consistent improvements in terms of generalization performance,…
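The contrast between clipping each component separately and clipping one global norm can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a "component" is simply a named group of parameters, and the component names, gradient values, and `max_norm` threshold below are hypothetical.

```python
import math
from typing import Dict, List

Grads = Dict[str, List[float]]


def l2_norm(values: List[float]) -> float:
    """Euclidean norm of a flat list of gradient values."""
    return math.sqrt(sum(v * v for v in values))


def clip_by_component(grads: Grads, max_norm: float) -> Grads:
    """Rescale each component so that its own L2 norm is at most max_norm."""
    clipped = {}
    for name, g in grads.items():
        norm = l2_norm(g)
        # Only components whose norm exceeds the threshold are scaled down.
        scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
        clipped[name] = [v * scale for v in g]
    return clipped


def clip_globally(grads: Grads, max_norm: float) -> Grads:
    """Baseline for comparison: one scale factor from the norm of all gradients."""
    total = l2_norm([v for g in grads.values() for v in g])
    scale = min(1.0, max_norm / total) if total > 0 else 1.0
    return {name: [v * scale for v in g] for name, g in grads.items()}


# Hypothetical gradients: one component with a large norm, one with a small norm.
grads = {"attention.query": [3.0, 4.0], "ffn.weight": [0.3, 0.4]}

per_component = clip_by_component(grads, max_norm=1.0)
global_clip = clip_globally(grads, max_norm=1.0)

for name in grads:
    print(name, l2_norm(per_component[name]), l2_norm(global_clip[name]))
```

In this sketch, global clipping shrinks the small-norm component along with the large one, whereas per-component clipping leaves it untouched. In PyTorch, a comparable effect could presumably be obtained by calling `torch.nn.utils.clip_grad_norm_` once per parameter group rather than once over all parameters; how the paper actually defines its components is not stated in the abstract.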

Code & Models

Repositories

yangalan123/finetuningstability (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis · Natural Language Processing Techniques
