On the Interplay Between Fine-tuning and Sentence-level Probing for Linguistic Knowledge in Pre-trained Transformers
Marius Mosbach, Anna Khokhlova, Michael A. Hedderich, Dietrich Klakow

TL;DR
This paper uses sentence-level probing to investigate how fine-tuning affects the linguistic knowledge encoded in pre-trained models such as BERT, RoBERTa, and ALBERT. It finds that fine-tuning can both raise and lower probing accuracy, depending on the model, the fine-tuning task, and the probing task.
Contribution
It provides a detailed analysis of how fine-tuning changes the linguistic knowledge encoded in pre-trained transformers, and shows that these effects vary considerably across models, fine-tuning tasks, and probing tasks.
Findings
Fine-tuning causes substantial changes in probing accuracy for some probing tasks.
Fine-tuning changes a model's representations more in its higher layers.
Only in very few cases does fine-tuning improve probing accuracy beyond what the pre-trained model achieves with a strong pooling method.
Abstract
Fine-tuning pre-trained contextualized embedding models has become an integral part of the NLP pipeline. At the same time, probing has emerged as a way to investigate the linguistic knowledge captured by pre-trained models. Very little is, however, understood about how fine-tuning affects the representations of pre-trained models and thereby the linguistic knowledge they encode. This paper contributes towards closing this gap. We study three different pre-trained models: BERT, RoBERTa, and ALBERT, and investigate through sentence-level probing how fine-tuning affects their representations. We find that for some probing tasks fine-tuning leads to substantial changes in accuracy, possibly suggesting that fine-tuning introduces or even removes linguistic knowledge from a pre-trained model. These changes, however, vary greatly across different models, fine-tuning and probing tasks. Our…
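To make the idea of sentence-level probing concrete, the following is a minimal sketch, not the authors' implementation. It assumes mean pooling over non-padding tokens and a logistic-regression probe, and it uses synthetic arrays in place of real hidden states from BERT, RoBERTa, or ALBERT. The function names (mean_pool, train_linear_probe, probe_accuracy) and all hyperparameters are hypothetical and chosen only for illustration.

```python
import numpy as np

def mean_pool(hidden, mask):
    """Average token vectors over non-padding positions.

    hidden: (batch, seq_len, dim) token representations from one layer.
    mask:   (batch, seq_len), 1 for real tokens and 0 for padding.
    """
    mask = mask[..., None].astype(hidden.dtype)
    return (hidden * mask).sum(axis=1) / np.clip(mask.sum(axis=1), 1.0, None)

def train_linear_probe(x, y, epochs=200, lr=0.5):
    """Fit a binary logistic-regression probe on frozen sentence vectors."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted probabilities
        grad = p - y                            # gradient of the log loss
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(x, y, w, b):
    """Fraction of sentences whose label the probe predicts correctly."""
    return float((((x @ w + b) > 0).astype(int) == y).mean())

# Synthetic stand-in for one layer's hidden states (assumption: in practice
# these would be extracted from a pre-trained or fine-tuned model).
rng = np.random.default_rng(0)
n, seq_len, dim = 400, 12, 16
labels = rng.integers(0, 2, size=n)
hidden = rng.normal(size=(n, seq_len, dim))
hidden[:, :, 0] += 2.0 * labels[:, None] - 1.0  # plant a label-dependent signal
lengths = rng.integers(3, seq_len + 1, size=n)
mask = (np.arange(seq_len)[None, :] < lengths[:, None]).astype(int)

# Pool tokens into sentence vectors, then train and evaluate the probe.
sentence_vecs = mean_pool(hidden, mask)
w, b = train_linear_probe(sentence_vecs[:300], labels[:300])
accuracy = probe_accuracy(sentence_vecs[300:], labels[300:], w, b)
```

Running the same procedure on representations from a pre-trained model and from its fine-tuned counterpart, layer by layer, would give the kind of probing-accuracy comparison the abstract describes. The encoder stays frozen during probing, so any difference in accuracy reflects the representations and the pooling method, not additional training of the model.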
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Interpreting and Communication in Healthcare
Methods: Linear Layer · Dense Connections · Layer Normalization · WordPiece · Multi-Head Attention · Dropout · Linear Warmup With Linear Decay · Attention Dropout · Weight Decay · LAMB
