Do we need Label Regularization to Fine-tune Pre-trained Language Models?

Ivan Kobyzev; Aref Jafari; Mehdi Rezagholizadeh; Tianda Li; Alan Do-Omri; Peng Lu; Pascal Poupart; Ali Ghodsi

arXiv:2205.12428 · cs.LG · April 13, 2023 · 1 citation

Open Access

TL;DR

This paper investigates whether label regularization techniques like Knowledge Distillation are necessary for fine-tuning pre-trained language models, finding that pre-training itself suffices as a regularizer and additional methods are redundant.

Contribution

The study provides a comprehensive experimental analysis showing that label regularization techniques are unnecessary when fine-tuning pre-trained language models, challenging the common practice of applying KD in NLP tasks that involve such models.

Findings

1. KD and label regularization do not improve fine-tuning of pre-trained models.

2. Pre-training acts as an effective regularizer, reducing the need for additional regularization.

3. Extensive experiments across NLP and vision tasks support the conclusions.

Abstract

Knowledge Distillation (KD) is a prominent neural model compression technique that heavily relies on teacher network predictions to guide the training of a student model. Considering the ever-growing size of pre-trained language models (PLMs), KD is often adopted in many NLP tasks involving PLMs. However, it is evident that in KD, deploying the teacher network during training adds to the memory and computational requirements of training. In the computer vision literature, the necessity of the teacher network is put under scrutiny by showing that KD is a label regularization technique that can be replaced with lighter teacher-free variants such as the label-smoothing technique. However, to the best of our knowledge, this issue is not investigated in NLP. Therefore, this work concerns studying different label regularization techniques and whether we actually need them to improve the…
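
To make the comparison in the abstract concrete, the following is a minimal Python sketch of the two kinds of training target it contrasts: a teacher-based KD loss and a teacher-free label-smoothing target. It is not taken from the paper. The function names, the mixing weight `alpha`, the temperature, and the smoothing factor `eps` are illustrative assumptions, and the KD term uses cross-entropy against the teacher's softened distribution, which differs from the more common KL-divergence form only by a term that does not depend on the student.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, probs):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


def one_hot(label, num_classes):
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def label_smoothing_target(label, num_classes, eps=0.1):
    """Teacher-free target: mix the one-hot label with a uniform distribution.

    eps is an assumed smoothing factor, not a value from the paper.
    """
    return [(1 - eps) * h + eps / num_classes for h in one_hot(label, num_classes)]


def kd_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=2.0):
    """Teacher-based loss: hard-label term plus a soft-target term.

    alpha and temperature are assumed hyperparameters. The temperature**2
    factor is the usual rescaling that keeps the soft term's gradients
    comparable in magnitude to the hard term's.
    """
    num_classes = len(student_logits)
    hard = cross_entropy(one_hot(label, num_classes), softmax(student_logits))
    soft = cross_entropy(
        softmax(teacher_logits, temperature),
        softmax(student_logits, temperature),
    )
    return (1 - alpha) * hard + alpha * temperature**2 * soft


if __name__ == "__main__":
    student = [2.0, 0.5, -1.0]
    teacher = [3.0, 1.0, -2.0]
    smoothed = label_smoothing_target(0, 3, eps=0.1)
    print("label-smoothing loss:", cross_entropy(smoothed, softmax(student)))
    print("KD loss:", kd_loss(student, teacher, label=0))
```

In both cases the student is trained against a softened version of the one-hot label. The difference is the source of the softening: label smoothing uses a fixed uniform distribution, whereas KD needs a forward pass through the teacher for every example, which is the extra training cost the abstract refers to.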

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Neural Network Applications

Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Weight Decay · Linear Warmup With Linear Decay · Dense Connections · Dropout · Cosine Annealing · Discriminative Fine-Tuning · Adam