Do we need Label Regularization to Fine-tune Pre-trained Language Models?
Ivan Kobyzev, Aref Jafari, Mehdi Rezagholizadeh, Tianda Li, Alan Do-Omri, Peng Lu, Pascal Poupart, Ali Ghodsi

TL;DR
This paper investigates whether label regularization techniques like Knowledge Distillation are necessary for fine-tuning pre-trained language models, finding that pre-training itself suffices as a regularizer and additional methods are redundant.
Contribution
The study provides a comprehensive experimental analysis showing that label regularization techniques are unnecessary when fine-tuning pre-trained language models, challenging common practices.
Findings
KD and other label regularization techniques do not improve fine-tuning of pre-trained models.
Pre-training acts as an effective regularizer, reducing the need for additional regularization.
Extensive experiments across NLP and vision tasks support the conclusions.
Abstract
Knowledge Distillation (KD) is a prominent neural model compression technique that heavily relies on teacher network predictions to guide the training of a student model. Considering the ever-growing size of pre-trained language models (PLMs), KD is often adopted in many NLP tasks involving PLMs. However, it is evident that in KD, deploying the teacher network during training adds to the memory and computational requirements of training. In the computer vision literature, the necessity of the teacher network is put under scrutiny by showing that KD is a label regularization technique that can be replaced with lighter teacher-free variants such as the label-smoothing technique. However, to the best of our knowledge, this issue is not investigated in NLP. Therefore, this work concerns studying different label regularization techniques and whether we actually need them to improve the…
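The abstract contrasts KD, which needs teacher predictions during training, with teacher-free label smoothing. The sketch below illustrates that difference using one common textbook formulation of each objective; it is not taken from the paper's code. The function names and the `alpha`, `temperature`, and `epsilon` defaults are illustrative assumptions, and the soft term uses cross-entropy against the teacher distribution, which differs from the KL form only by a term that is constant in the student.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, student_logits, temperature=1.0):
    """Cross-entropy between a target distribution and the student's softmax."""
    probs = softmax(student_logits, temperature)
    return -sum(t * math.log(p) for t, p in zip(target_probs, probs) if t > 0)


def one_hot(label, num_classes):
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def label_smoothing_targets(label, num_classes, epsilon=0.1):
    """Teacher-free soft targets: mix the one-hot label with a uniform distribution."""
    return [(1.0 - epsilon) * h + epsilon / num_classes
            for h in one_hot(label, num_classes)]


def label_smoothing_loss(student_logits, label, epsilon=0.1):
    targets = label_smoothing_targets(label, len(student_logits), epsilon)
    return cross_entropy(targets, student_logits)


def kd_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=2.0):
    """Hard-label loss mixed with a loss against temperature-softened teacher outputs.

    Assumed formulation: (1 - alpha) * CE(hard) + alpha * T^2 * CE(teacher, student).
    """
    hard = cross_entropy(one_hot(label, len(student_logits)), student_logits)
    soft_targets = softmax(teacher_logits, temperature)
    soft = cross_entropy(soft_targets, student_logits, temperature)
    return (1.0 - alpha) * hard + alpha * temperature ** 2 * soft


if __name__ == "__main__":
    student = [2.0, 0.5, -1.0]   # hypothetical student logits
    teacher = [3.0, 1.0, -2.0]   # hypothetical teacher logits
    print("label smoothing:", label_smoothing_loss(student, label=0))
    print("KD:", kd_loss(student, teacher, label=0))
```

In this sketch, only `kd_loss` needs `teacher_logits`, so only KD requires running a teacher network during training; label smoothing builds its soft targets from the label alone.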
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Neural Network Applications
Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Weight Decay · Linear Warmup With Linear Decay · Dense Connections · Dropout · Cosine Annealing · Discriminative Fine-Tuning · Adam
