How transfer learning impacts linguistic knowledge in deep NLP models?

Nadir Durrani; Hassan Sajjad; Fahim Dalvi

arXiv:2105.15179·cs.CL·June 1, 2021

TL;DR

This paper examines how fine-tuning pre-trained language models like BERT, RoBERTa, and XLNet affects their internal linguistic knowledge, revealing that linguistic information is redistributed across layers depending on the task and architecture.

Contribution

It provides a detailed analysis of the layer-wise distribution of linguistic knowledge in models before and after fine-tuning across multiple architectures and tasks.

Findings

01

Linguistic knowledge is preserved in some tasks but forgotten in others after fine-tuning.

02

Post fine-tuning, linguistic information tends to localize in lower layers.

03

Different architectures retain linguistic knowledge at different depths, with BERT preserving it deeper than RoBERTa and XLNet.

Abstract

Transfer learning from pre-trained neural language models towards downstream tasks has been a predominant theme in NLP recently. Several researchers have shown that deep NLP models learn a non-trivial amount of linguistic knowledge, captured at different layers of the model. We investigate how fine-tuning towards downstream NLP tasks impacts the learned linguistic knowledge. We carry out a study across popular pre-trained models BERT, RoBERTa and XLNet using layer- and neuron-level diagnostic classifiers. We found that for some GLUE tasks, the network relies on the core linguistic information and preserves it deeper in the network, while for others it forgets. Linguistic information is distributed in the pre-trained language models but becomes localized to the lower layers post fine-tuning, reserving higher layers for the task-specific knowledge. The pattern varies across architectures,…
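The abstract refers to layer-level diagnostic classifiers without spelling out the mechanics. The sketch below illustrates the general idea only: train a small classifier on one layer's activations to predict a linguistic label, then read held-out accuracy as a rough measure of how much of that label the layer encodes. Everything in it is an assumption made for illustration: the activations are synthetic Gaussian vectors rather than BERT, RoBERTa, or XLNet outputs, the probe is a plain logistic regression, and the names `synthetic_layer`, `train_probe`, `probe_all_layers`, and `SIGNALS` are hypothetical. It does not reproduce the authors' setup or results.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def synthetic_layer(labels, dim, signal, rng):
    # Stand-in for one layer's activations (NOT real model outputs):
    # feature 0 encodes the binary label with strength `signal`,
    # every other feature is unit Gaussian noise.
    rows = []
    for y in labels:
        row = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        row[0] += signal if y == 1 else -signal
        rows.append(row)
    return rows


def train_probe(X, y, epochs=100, lr=0.5):
    # Logistic-regression probe trained with full-batch gradient descent.
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for row, target in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b) - target
            for j in range(dim):
                grad_w[j] += err * row[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, X, y):
    # Fraction of examples whose predicted label matches the target.
    hits = 0
    for row, target in zip(X, y):
        pred = 1 if sum(wj * xj for wj, xj in zip(w, row)) + b > 0 else 0
        hits += pred == target
    return hits / len(X)


def probe_all_layers(signals, n_train=300, n_test=150, dim=8, seed=0):
    # Train one probe per "layer" and report held-out accuracy for each.
    rng = random.Random(seed)
    labels = [rng.randint(0, 1) for _ in range(n_train + n_test)]
    results = {}
    for name, signal in signals.items():
        acts = synthetic_layer(labels, dim, signal, rng)
        w, b = train_probe(acts[:n_train], labels[:n_train])
        results[name] = accuracy(w, b, acts[n_train:], labels[n_train:])
    return results


# Hypothetical signal strengths, chosen only to illustrate the readout.
SIGNALS = {"lower": 2.0, "middle": 1.0, "upper": 0.0}

if __name__ == "__main__":
    for name, acc in probe_all_layers(SIGNALS).items():
        print(f"{name}: probe accuracy = {acc:.2f}")
```

In a real analysis the synthetic rows would be replaced by activations extracted from each layer of the pre-trained and fine-tuned models, and the per-layer accuracies compared before and after fine-tuning. A neuron-level variant might additionally rank features by the magnitude of the learned weights, though that step is not shown here.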

Taxonomy

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Linear Warmup With Linear Decay · Layer Normalization · SentencePiece · Residual Connection · WordPiece