Fine-Tuned Transformers Show Clusters of Similar Representations Across Layers
Jason Phang, Haokun Liu, Samuel R. Bowman

TL;DR
This paper investigates how fine-tuning changes the representations of pretrained language encoders. It finds that layer representations group into clusters of similar layers and that the top few layers can be discarded without hurting performance.
Contribution
It applies centered kernel alignment (CKA), a method for comparing learned representations, to fine-tuned Transformers and uncovers a consistent block diagonal similarity pattern across layers.
Findings
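Representation similarity across layers forms a block diagonal structure: it is strong within clusters of earlier and later layers, but not between them.
Later layers contribute only marginally to task performance.
The top few layers of fine-tuned Transformers can be discarded without hurting performance.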
Abstract
Despite the success of fine-tuning pretrained language encoders like BERT for downstream natural language understanding (NLU) tasks, it is still poorly understood how neural networks change after fine-tuning. In this work, we use centered kernel alignment (CKA), a method for comparing learned representations, to measure the similarity of representations in task-tuned models across layers. In experiments across twelve NLU tasks, we discover a consistent block diagonal structure in the similarity of representations within fine-tuned RoBERTa and ALBERT models, with strong similarity within clusters of earlier and later layers, but not between them. The similarity of later layer representations implies that later layers only marginally contribute to task performance, and we verify in experiments that the top few layers of fine-tuned Transformers can be discarded without hurting performance,…
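The layer-by-layer comparison described in the abstract can be illustrated with a minimal sketch. It assumes the linear form of CKA, since this summary does not say which CKA variant the authors use, and it substitutes synthetic activations for real model outputs. The names `linear_cka` and `layer_similarity` are hypothetical helpers written for this illustration, not functions from the paper.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two (n_examples, dim) activation matrices.

    Assumption: linear-kernel CKA; the paper's exact variant is not
    stated in this summary.
    """
    # Center each feature so the comparison ignores mean offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)


def layer_similarity(layers):
    """Pairwise CKA matrix for a list of per-layer activation matrices."""
    n = len(layers)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            sim[i, j] = linear_cka(layers[i], layers[j])
    return sim


# Synthetic stand-in for real activations (an assumption for illustration):
# three "early" layers share one random base, three "late" layers another.
rng = np.random.default_rng(0)
early = rng.normal(size=(200, 16))
late = rng.normal(size=(200, 16))
layers = [early + 0.1 * rng.normal(size=early.shape) for _ in range(3)]
layers += [late + 0.1 * rng.normal(size=late.shape) for _ in range(3)]

sim = layer_similarity(layers)
print(np.round(sim, 2))
```

With real models, each entry of `layers` would hold the hidden states of one layer over the same set of inputs. In this toy setup the printed matrix has high values inside the two 3x3 blocks and low values between them, which is the kind of block diagonal pattern the abstract describes.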
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Residual Connection · Dense Connections · Weight Decay · LAMB · RoBERTa · Dropout
