How transfer learning impacts linguistic knowledge in deep NLP models?
Nadir Durrani, Hassan Sajjad, Fahim Dalvi

TL;DR
This paper examines how fine-tuning pre-trained language models like BERT, RoBERTa, and XLNet affects their internal linguistic knowledge, revealing that linguistic information is redistributed across layers depending on the task and architecture.
Contribution
It provides a detailed analysis of the layer-wise distribution of linguistic knowledge in models before and after fine-tuning across multiple architectures and tasks.
Findings
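For some GLUE tasks, the fine-tuned network relies on core linguistic knowledge and preserves it deeper in the network; for others, it forgets it.
After fine-tuning, linguistic information that was distributed across the pre-trained model tends to localize in the lower layers, leaving the higher layers for task-specific knowledge.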
Different architectures retain linguistic knowledge at different depths, with BERT preserving it deeper than RoBERTa and XLNet.
Abstract
Transfer learning from pre-trained neural language models towards downstream tasks has been a predominant theme in NLP recently. Several researchers have shown that deep NLP models learn a non-trivial amount of linguistic knowledge, captured at different layers of the model. We investigate how fine-tuning towards downstream NLP tasks impacts the learned linguistic knowledge. We carry out a study across the popular pre-trained models BERT, RoBERTa and XLNet using layer-level and neuron-level diagnostic classifiers. We found that for some GLUE tasks, the network relies on the core linguistic information and preserves it deeper in the network, while for others it forgets it. Linguistic information is distributed in the pre-trained language models but becomes localized to the lower layers post fine-tuning, reserving higher layers for the task-specific knowledge. The pattern varies across architectures,…
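The layer-level diagnostic classifiers mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it trains a small logistic-regression probe on synthetic vectors that stand in for per-layer activations, and every name in it (`train_probe`, `probe_accuracy`, `synthetic_layer`, `layerwise_probe`) as well as the `signal` parameter is hypothetical and chosen only for illustration.

```python
import math
import random


def train_probe(features, labels, epochs=50, lr=0.1):
    """Fit a binary logistic-regression probe with plain SGD (pure Python)."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - y  # gradient of the log loss w.r.t. z
            weights = [w - lr * grad * xi for w, xi in zip(weights, x)]
            bias -= lr * grad
    return weights, bias


def probe_accuracy(weights, bias, features, labels):
    """Fraction of examples the linear probe classifies correctly."""
    correct = 0
    for x, y in zip(features, labels):
        z = bias + sum(w * xi for w, xi in zip(weights, x))
        correct += int((z > 0) == (y == 1))
    return correct / len(labels)


def synthetic_layer(labels, signal, rng, dim=8):
    """Toy stand-in for one layer's activations (an assumption, not real data).

    `signal` controls how strongly the label is encoded in the first
    dimension; all remaining dimensions are pure Gaussian noise.
    """
    rows = []
    for y in labels:
        first = signal * (1.0 if y == 1 else -1.0) + rng.gauss(0.0, 1.0)
        rows.append([first] + [rng.gauss(0.0, 1.0) for _ in range(dim - 1)])
    return rows


def layerwise_probe(signals, n_train=200, n_test=200, seed=0):
    """Train one probe per simulated layer and return held-out accuracies."""
    rng = random.Random(seed)
    train_y = [rng.randint(0, 1) for _ in range(n_train)]
    test_y = [rng.randint(0, 1) for _ in range(n_test)]
    accuracies = []
    for signal in signals:
        train_x = synthetic_layer(train_y, signal, rng)
        test_x = synthetic_layer(test_y, signal, rng)
        weights, bias = train_probe(train_x, train_y)
        accuracies.append(probe_accuracy(weights, bias, test_x, test_y))
    return accuracies


if __name__ == "__main__":
    # One simulated layer that encodes the property, one that does not.
    for idx, acc in enumerate(layerwise_probe([2.0, 0.0])):
        print(f"simulated layer {idx}: probe accuracy = {acc:.3f}")
```

In an actual study, the synthetic vectors would be replaced by activations extracted from each layer of the pre-trained and the fine-tuned model, and the labels by linguistic annotations (for example, part-of-speech tags). Comparing per-layer probe accuracy before and after fine-tuning is then one way to see where a linguistic property remains recoverable.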
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Linear Warmup With Linear Decay · Layer Normalization · SentencePiece · Residual Connection · WordPiece
