Fine-Tuned Transformers Show Clusters of Similar Representations Across Layers

Jason Phang; Haokun Liu; Samuel R. Bowman

arXiv:2109.08406·cs.CL·September 21, 2021

TL;DR

This paper investigates how fine-tuning changes the layer-wise representations of pretrained language encoders. It finds that representations cluster into blocks of similar earlier layers and similar later layers, and that the top few layers of a fine-tuned model can be discarded without hurting performance.

Contribution

It applies centered kernel alignment (CKA), an existing method for comparing learned representations, to fine-tuned RoBERTa and ALBERT models across twelve NLU tasks, and uncovers a consistent block diagonal pattern in the similarity of representations across layers.

Findings

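01. Representation similarity across layers forms a block diagonal structure: strong within clusters of earlier and later layers, weak between them.

02. Later layers contribute only marginally to task performance.

03. The top few layers can be removed without hurting performance.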

Abstract

Despite the success of fine-tuning pretrained language encoders like BERT for downstream natural language understanding (NLU) tasks, it is still poorly understood how neural networks change after fine-tuning. In this work, we use centered kernel alignment (CKA), a method for comparing learned representations, to measure the similarity of representations in task-tuned models across layers. In experiments across twelve NLU tasks, we discover a consistent block diagonal structure in the similarity of representations within fine-tuned RoBERTa and ALBERT models, with strong similarity within clusters of earlier and later layers, but not between them. The similarity of later layer representations implies that later layers only marginally contribute to task performance, and we verify in experiments that the top few layers of fine-tuned Transformers can be discarded without hurting performance,…
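As a rough illustration of the layer-by-layer comparison the abstract describes, the sketch below computes linear CKA between every pair of layers and returns the resulting similarity matrix. It assumes the linear-kernel variant of CKA (the abstract does not say which kernel the authors use), and the function names `linear_cka` and `layerwise_cka` and the synthetic "layers" are hypothetical stand-ins, not the authors' code or data.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two (n_examples, n_features) activation matrices."""
    # Center each feature so the comparison ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def layerwise_cka(layer_reps):
    """Symmetric matrix of linear CKA scores for a list of per-layer activations."""
    n = len(layer_reps)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            sim[i, j] = sim[j, i] = linear_cka(layer_reps[i], layer_reps[j])
    return sim


# Synthetic stand-in for six layers: two groups of three noisy copies of
# unrelated base representations (an assumption made purely for illustration).
rng = np.random.default_rng(0)
base_early = rng.normal(size=(200, 16))
base_late = rng.normal(size=(200, 16))
layers = [base_early + 0.1 * rng.normal(size=base_early.shape) for _ in range(3)]
layers += [base_late + 0.1 * rng.normal(size=base_late.shape) for _ in range(3)]

sim = layerwise_cka(layers)
print(np.round(sim, 2))
```

On this toy input the matrix has high values within each group of three layers and low values between the groups, which is the kind of block diagonal pattern the abstract reports. With a real model, each entry of `layer_reps` would instead hold one layer's activations over the same set of examples.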

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Residual Connection · Dense Connections · Weight Decay · LAMB · RoBERTa · Dropout