Improving Stability of Fine-Tuning Pretrained Language Models via Component-Wise Gradient Norm Clipping
Chenghao Yang, Xuezhe Ma

TL;DR
This paper introduces a component-wise gradient norm clipping technique to enhance the stability and convergence of fine-tuning large pretrained language models, addressing issues caused by varying layer convergence speeds.
Contribution
The paper proposes a novel component-wise gradient norm clipping method that improves fine-tuning stability and performance of pretrained language models.
Findings
Improved training stability and convergence speed.
Enhanced generalization performance.
Consistent results across different models and datasets.
Abstract
Fine-tuning over large pretrained language models (PLMs) has established many state-of-the-art results. Despite its superior performance, such fine-tuning can be unstable, resulting in significant variance in performance and potential risks for practical applications. Previous works have attributed such instability to the catastrophic forgetting problem in the top layers of PLMs, which indicates iteratively that fine-tuning layers in a top-down manner is a promising solution. In this paper, we first point out that this method does not always work out due to the different convergence speeds of different layers/modules. Inspired by this observation, we propose a simple component-wise gradient norm clipping method to adjust the convergence speed for different components. Experiment results demonstrate that our method achieves consistent improvements in terms of generalization performance,…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Speech Recognition and Synthesis · Natural Language Processing Techniques
MethodsSPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
