LayerNorm: A key component in parameter-efficient fine-tuning
Taha ValizadehAslani, Hualou Liang

TL;DR
This paper identifies LayerNorm as the BERT component that changes most during fine-tuning and shows that fine-tuning only LayerNorm achieves performance comparable to full fine-tuning, making adaptation to NLP tasks more parameter-efficient.
Contribution
The study reveals that fine-tuning only LayerNorm layers in BERT is sufficient for competitive performance, offering a simple yet effective parameter-efficient fine-tuning method.
Findings
The output LayerNorm changes more than any other BERT component during fine-tuning, across tasks.
Fine-tuning only LayerNorm achieves performance comparable to full fine-tuning.
Fine-tuning only a small subset of LayerNorm incurs a negligible performance loss.
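The LayerNorm-only setup described above can be illustrated with a minimal PyTorch sketch. It is not the authors' code: the helper name freeze_all_but_layernorm and the toy network are hypothetical stand-ins, and the sketch assumes the LayerNorm layers are instances of torch.nn.LayerNorm. Whether a task-specific classification head is also trained is not stated in this summary, so the sketch leaves everything except LayerNorm frozen.

```python
import torch
from torch import nn


def freeze_all_but_layernorm(model: nn.Module) -> int:
    """Leave only nn.LayerNorm parameters trainable; return how many scalars that is."""
    # Freeze every parameter first.
    for param in model.parameters():
        param.requires_grad = False
    # Re-enable the gain (weight) and bias of each LayerNorm module.
    trainable = 0
    for module in model.modules():
        if isinstance(module, nn.LayerNorm):
            for param in module.parameters():
                param.requires_grad = True
                trainable += param.numel()
    return trainable


# Toy stand-in for an encoder sub-block (hypothetical; not BERT itself).
toy = nn.Sequential(
    nn.Linear(16, 16),
    nn.LayerNorm(16),
    nn.Linear(16, 2),
)
n_trainable = freeze_all_but_layernorm(toy)
n_total = sum(p.numel() for p in toy.parameters())

# Hand only the still-trainable parameters to the optimizer.
optimizer = torch.optim.AdamW([p for p in toy.parameters() if p.requires_grad])
```

In this toy network only the 32 LayerNorm scalars (16 gains and 16 biases) out of 338 parameters remain trainable. The same helper could in principle be applied to a pre-trained BERT model, provided its normalization layers are nn.LayerNorm instances.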
Abstract
Fine-tuning a pre-trained model, such as Bidirectional Encoder Representations from Transformers (BERT), has been proven to be an effective method for solving many natural language processing (NLP) tasks. However, due to the large number of parameters in many state-of-the-art NLP models, including BERT, the process of fine-tuning is computationally expensive. One attractive solution to this issue is parameter-efficient fine-tuning, which involves modifying only a minimal segment of the model while keeping the remainder unchanged. Yet, it remains unclear which segment of the BERT model is crucial for fine-tuning. In this paper, we first analyze different components in the BERT model to pinpoint which one undergoes the most significant changes after fine-tuning. We find that output LayerNorm changes more than any other component when fine-tuned for different General Language…
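One way to picture the component-wise analysis is to compare a model's parameters before and after fine-tuning and average the change within each component. The sketch below is only an illustration: the metric (mean absolute difference), the helper name mean_abs_change, and the toy model are assumptions, and the paper's actual measure of change is not given in this excerpt.

```python
from collections import defaultdict

import torch
from torch import nn


def mean_abs_change(before: dict, after: dict, group_of) -> dict:
    """Average |after - before| over all scalars in each parameter group."""
    total = defaultdict(float)
    count = defaultdict(int)
    for name, old in before.items():
        group = group_of(name)
        total[group] += (after[name] - old).abs().sum().item()
        count[group] += old.numel()
    return {g: total[g] / count[g] for g in total}


# Toy stand-in for a pre-trained model (hypothetical; not BERT itself).
toy = nn.Sequential(nn.Linear(8, 8), nn.LayerNorm(8))
before = {k: v.clone() for k, v in toy.state_dict().items()}

# Stand-in for fine-tuning: shift only the LayerNorm gain and bias.
with torch.no_grad():
    toy[1].weight.add_(0.5)
    toy[1].bias.add_(0.5)
after = toy.state_dict()

# Group by submodule index: "0" is the Linear layer, "1" is the LayerNorm.
change = mean_abs_change(before, after, group_of=lambda name: name.split(".")[0])
```

Here change["0"] is zero and change["1"] is 0.5, because only the LayerNorm parameters were perturbed. With real checkpoints, group_of would map each parameter name to its component (for example attention, feed-forward, or LayerNorm), and a normalized metric may be more appropriate when components differ in scale.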
Taxonomy
Topics: Semiconductor Lasers and Optical Devices
Methods: Attention Is All You Need · Linear Layer · Dropout · Layer Normalization · WordPiece · Multi-Head Attention · Weight Decay · Softmax · Dense Connections
