Parameter-Efficient Tuning Makes a Good Classification Head
Zhuoyi Yang, Ming Ding, Yanhui Guo, Qingsong Lv, Jie Tang

TL;DR
This paper shows that a classification head trained jointly with parameter-efficient tuning is a better starting point than a randomly initialized one: substituting it for the random head in finetuning gives stable performance gains across multiple NLU tasks.
Contribution
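The paper proposes pretraining the classification head with parameter-efficient tuning and then using it in place of the randomly initialized head. This sidesteps the weakness of head-only pretraining (LP-FT), where the head's input, the final-layer output of the backbone, shifts greatly during finetuning.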
Findings
Classification heads pretrained jointly with parameter-efficient tuning outperform randomly initialized heads.
The approach is effective across 9 tasks from GLUE and SuperGLUE.
The gains are stable and come from simply replacing the randomly initialized head.
Abstract
In recent years, pretrained models revolutionized the paradigm of natural language understanding (NLU), where we append a randomly initialized classification head after the pretrained backbone, e.g. BERT, and finetune the whole model. As the pretrained backbone makes a major contribution to the improvement, we naturally expect a good pretrained classification head can also benefit the training. However, the final-layer output of the backbone, i.e. the input of the classification head, will change greatly during finetuning, making the usual head-only pretraining (LP-FT) ineffective. In this paper, we find that parameter-efficient tuning makes a good classification head, with which we can simply replace the randomly initialized heads for a stable performance gain. Our experiments demonstrate that the classification head jointly pretrained with parameter-efficient tuning consistently…
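The two-stage recipe described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy `Classifier`, the synthetic data, and the `train` helper are hypothetical; bias-only tuning stands in for the parameter-efficient method, which the abstract does not name; and restoring the backbone to its pretrained weights before the second stage is an assumption the abstract does not spell out.

```python
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)


class Classifier(nn.Module):
    """Toy stand-in for a pretrained backbone plus a classification head."""

    def __init__(self, dim=16, num_classes=2):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim), nn.Tanh()
        )
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        return self.head(self.backbone(x))


def train(model, params, x, y, steps=50, lr=1e-2):
    """Optimize only the given parameters with a plain cross-entropy loss."""
    opt = torch.optim.AdamW(list(params), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()


# Synthetic binary task (hypothetical data, for illustration only).
x = torch.randn(64, 16)
y = (x[:, 0] > 0).long()

# Snapshot of the "pretrained" backbone weights shared by both stages.
pretrained_state = copy.deepcopy(Classifier().backbone.state_dict())

# Stage 1: parameter-efficient tuning. Backbone weight matrices are frozen;
# only the backbone biases and the randomly initialized head are trained.
stage1 = Classifier()
stage1.backbone.load_state_dict(pretrained_state)
for name, p in stage1.backbone.named_parameters():
    p.requires_grad = name.endswith("bias")
train(stage1, [p for p in stage1.parameters() if p.requires_grad], x, y)

# Stage 2: start again from the pretrained backbone (assumption), replace the
# random head with the stage-1 head, and finetune the whole model.
stage2 = Classifier()
stage2.backbone.load_state_dict(pretrained_state)
stage2.head.load_state_dict(stage1.head.state_dict())
train(stage2, stage2.parameters(), x, y)
```

The point of the sketch is the hand-off between stages: the head in stage 1 is trained against backbone features that stay close to the pretrained ones, so it is a reasonable initialization when every parameter becomes trainable in stage 2.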
Taxonomy
Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Warmup With Linear Decay · Attention Dropout · Weight Decay · Dense Connections · Linear Layer · Layer Normalization · Residual Connection · Dropout
