Not All Models Localize Linguistic Knowledge in the Same Place: A Layer-wise Probing on BERToids' Representations
Mohsen Fayyaz, Ehsan Aghazadeh, Ali Modarressi, Hosein Mohebbi, Mohammad Taher Pilehvar

TL;DR
This paper investigates how different transformer models encode linguistic knowledge across layers, revealing that model architecture and training objectives influence where and how linguistic information is stored, and proposes a more reliable probing method.
Contribution
The study extends layer-wise probing to ELECTRA and XLNet, demonstrating their different encoding behaviors and introducing an information-theoretic probing approach for more accurate analysis.
Findings
ELECTRA encodes linguistic knowledge mainly in deeper layers.
XLNet concentrates linguistic information in earlier layers.
Weight mixing evaluation can be misleading due to representation norm disparities.
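The last finding can be illustrated with a small sketch. It assumes the common form of weight mixing, in which a probe learns one scalar per layer, normalizes the scalars with a softmax, and takes the weighted sum of the layer representations. The layer vectors, the 10x norm ratio, and the helper names `scalar_mix` and `effective_contributions` are hypothetical and chosen only for illustration; they do not reproduce the paper's setup or numbers.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())
    return e / e.sum()

def scalar_mix(layer_reprs, logits):
    """Weighted sum of per-layer vectors with softmax-normalized scalar weights.

    layer_reprs: array of shape (num_layers, hidden_dim)
    logits:      array of shape (num_layers,), one learnable scalar per layer
    """
    weights = softmax(logits)
    return weights @ layer_reprs, weights

def effective_contributions(layer_reprs, logits):
    """Share of each layer in the mix once its vector norm is taken into account."""
    weights = softmax(logits)
    norms = np.linalg.norm(layer_reprs, axis=1)
    scaled = weights * norms
    return scaled / scaled.sum()

# Synthetic data (assumption): four unit-norm layers, then the last one scaled by 10.
rng = np.random.default_rng(0)
num_layers, hidden_dim = 4, 8
layers = rng.normal(size=(num_layers, hidden_dim))
layers /= np.linalg.norm(layers, axis=1, keepdims=True)
layers[-1] *= 10.0

logits = np.zeros(num_layers)  # uniform mixing weights: every layer gets 0.25
mixed, weights = scalar_mix(layers, logits)
contrib = effective_contributions(layers, logits)

print("mixing weights:        ", weights)
print("norm-aware contribution:", contrib)
```

With uniform mixing weights, the high-norm layer still accounts for most of the mixed vector's magnitude, so reading layer importance directly from the weights would understate its role in this toy setting.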
Abstract
Most of the recent works on probing representations have focused on BERT, with the presumption that the findings might be similar to the other models. In this work, we extend the probing studies to two other models in the family, namely ELECTRA and XLNet, showing that variations in the pre-training objectives or architectural choices can result in different behaviors in encoding linguistic information in the representations. Most notably, we observe that ELECTRA tends to encode linguistic knowledge in the deeper layers, whereas XLNet instead concentrates that in the earlier layers. Also, the former model undergoes a slight change during fine-tuning, whereas the latter experiences significant adjustments. Moreover, we show that drawing conclusions based on the weight mixing evaluation strategy -- which is widely used in the context of layer-wise probing -- can be misleading given the…
Taxonomy
Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · SentencePiece · Dense Connections · Residual Connection · Adam · Multi-Head Attention · Layer Normalization
