Outlier Dimensions that Disrupt Transformers Are Driven by Frequency
Giovanni Puccetti, Anna Rogers, Aleksandr Drozd, Felice Dell'Orletta

TL;DR
This paper investigates the outlier phenomenon in Transformer models, linking it to token frequency and embedding space geometry, and highlights the importance of token distribution in model robustness.
Contribution
It replicates the outlier phenomenon and connects it to token frequency and embedding geometry, suggesting new pre-training strategies for improved robustness.
Findings
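In BERT and RoBERTa, the magnitude of hidden-state coefficients in outlier dimensions correlates with the frequency of the encoded tokens in the pre-training data.
Disabling only 48 of the 110M parameters in BERT-base drops its MNLI performance by nearly 30%.
Outlier dimensions contribute to the "vertical" self-attention pattern that lets the model focus on special tokens.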
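The frequency finding above is a statement about correlation, so a rank correlation is one plausible way to check it on your own data. The sketch below is a minimal, standard-library Spearman correlation; the function names (`ranks`, `pearson`, `spearman`) are illustrative and not taken from the paper, and nothing here is claimed to be the authors' exact procedure.

```python
import math
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

To use it, pass one value per token type in each sequence, for example a pre-training corpus count and the mean absolute hidden-state coefficient in a chosen outlier dimension. Both inputs are assumptions about how one might set up the measurement; the sketch implies nothing about the sign or strength of the result.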
Abstract
While Transformer-based language models are generally very robust to pruning, there is the recently discovered outlier phenomenon: disabling only 48 out of 110M parameters in BERT-base drops its performance by nearly 30% on MNLI. We replicate the original evidence for the outlier phenomenon and we link it to the geometry of the embedding space. We find that in both BERT and RoBERTa the magnitude of hidden state coefficients corresponding to outlier dimensions correlates with the frequency of encoded tokens in pre-training data, and it also contributes to the "vertical" self-attention pattern enabling the model to focus on the special tokens. This explains the drop in performance from disabling the outliers, and it suggests that to decrease anisotropicity in future models we need pre-training schemas that would better take into account the skewed token distributions.
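To make "disabling" outliers concrete, the sketch below flags dimensions of a weight vector that lie far from the mean and sets them to zero. It is an illustration under stated assumptions, not the authors' code: the 3-standard-deviation threshold, the choice to zero the values, and the names `find_outlier_dims` and `disable_dims` are all assumptions, and the input vector is toy data rather than real model weights.

```python
import statistics
from typing import Iterable, List, Sequence


def find_outlier_dims(weights: Sequence[float], num_std: float = 3.0) -> List[int]:
    """Indices whose value deviates from the mean by more than num_std
    population standard deviations (the threshold is an assumption)."""
    mean = statistics.fmean(weights)
    std = statistics.pstdev(weights)
    return [i for i, w in enumerate(weights) if abs(w - mean) > num_std * std]


def disable_dims(weights: Sequence[float], dims: Iterable[int]) -> List[float]:
    """Return a copy of weights with the given dimensions set to zero."""
    disabled = set(dims)
    return [0.0 if i in disabled else w for i, w in enumerate(weights)]


# Toy vector (hypothetical values): one dimension is far larger than the rest.
toy_weights = [1.0] * 20
toy_weights[7] = 10.0

outliers = find_outlier_dims(toy_weights)
pruned = disable_dims(toy_weights, outliers)

# Share of parameters touched in the abstract's example: 48 out of 110M.
fraction_disabled = 48 / 110_000_000
```

The last line only restates the abstract's numbers as a ratio (roughly 4.4e-7 of the parameters), which is why the reported performance drop is notable.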
Taxonomy
Topics: Neural Networks and Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Weight Decay · Byte Pair Encoding · Dense Connections · Linear Warmup With Linear Decay · Dropout · Absolute Position Encodings · Attention Dropout
