Outliers Dimensions that Disrupt Transformers Are Driven by Frequency

Giovanni Puccetti; Anna Rogers; Aleksandr Drozd; Felice Dell'Orletta

arXiv:2205.11380·cs.CL·June 19, 2024·1 cites

TL;DR

This paper investigates the outlier phenomenon in Transformer models, linking it to token frequency and embedding space geometry, and highlights the importance of token distribution in model robustness.

Contribution

The paper replicates the outlier phenomenon and connects it to token frequency and embedding geometry, suggesting that future pre-training schemas should account for skewed token distributions to reduce anisotropy.

Findings

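01. In both BERT and RoBERTa, the magnitude of hidden-state coefficients in outlier dimensions correlates with the frequency of the encoded tokens in pre-training data.

02. Disabling the outliers sharply reduces performance: disabling 48 of BERT-base's 110M parameters drops MNLI performance by nearly 30%.

03. Outlier dimensions contribute to the "vertical" self-attention pattern that lets the model focus on special tokens.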
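To make the idea of "disabling" an outlier concrete, here is a minimal pure-Python sketch. It rests on assumptions that this page does not confirm: that an outlier is disabled by zeroing the LayerNorm scale and bias at a single hidden dimension, and that the count of 48 arises as 12 layers x 2 LayerNorms per layer x 2 parameters (scale and bias) for that one dimension. The names `layer_norm`, `disable_dimension`, and `OUTLIER_DIM`, and all numbers in the toy vectors, are hypothetical.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-12):
    """Standard layer normalization of one hidden vector."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    scale = math.sqrt(var + eps)
    return [g * (v - mean) / scale + b for v, g, b in zip(x, gamma, beta)]

def disable_dimension(gamma, beta, dim):
    """Return copies of gamma and beta with the entry at `dim` set to zero.

    Assumption: "disabling" an outlier means zeroing these two parameters.
    """
    new_gamma, new_beta = list(gamma), list(beta)
    new_gamma[dim] = 0.0
    new_beta[dim] = 0.0
    return new_gamma, new_beta

# Toy 4-dimensional example; the values are made up for illustration.
hidden = [0.5, -1.0, 2.0, 0.1]
gamma = [1.0, 1.0, 8.0, 1.0]   # unusually large scale at the hypothetical outlier
beta = [0.0, 0.0, -3.0, 0.0]
OUTLIER_DIM = 2                # hypothetical index, not taken from the paper

before = layer_norm(hidden, gamma, beta)
gamma_off, beta_off = disable_dimension(gamma, beta, OUTLIER_DIM)
after = layer_norm(hidden, gamma_off, beta_off)

# Parameter count under the stated assumption:
# 12 layers x 2 LayerNorms x 2 parameters (scale, bias) x 1 dimension.
params_touched = 12 * 2 * 2 * 1
```

Only the output at `OUTLIER_DIM` changes within this single layer; the performance drop reported in the abstract is a downstream effect in the full model that a toy vector cannot reproduce.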

Abstract

While Transformer-based language models are generally very robust to pruning, there is the recently discovered outlier phenomenon: disabling only 48 out of 110M parameters in BERT-base drops its performance by nearly 30% on MNLI. We replicate the original evidence for the outlier phenomenon and we link it to the geometry of the embedding space. We find that in both BERT and RoBERTa the magnitude of hidden state coefficients corresponding to outlier dimensions correlates with the frequency of encoded tokens in pre-training data, and it also contributes to the "vertical" self-attention pattern enabling the model to focus on the special tokens. This explains the drop in performance from disabling the outliers, and it suggests that to decrease anisotropicity in future models we need pre-training schemas that would better take into account the skewed token distributions.
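The frequency claim in the abstract can be illustrated with a rank correlation between token counts and the average magnitude of a hidden-state coefficient. The sketch below is not the authors' procedure: the choice of Spearman correlation, the helper names, the index `OUTLIER_DIM`, and all toy data are assumptions made for illustration, and the sign and strength of the correlation in the toy data say nothing about real models.

```python
from collections import defaultdict

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = sum((x - mean_a) ** 2 for x in a) ** 0.5
    norm_b = sum((y - mean_b) ** 2 for y in b) ** 0.5
    return cov / (norm_a * norm_b)

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

def mean_abs_by_token(tokens, hidden_states, dim):
    """Average |hidden_state[dim]| over all occurrences of each token."""
    sums, counts = defaultdict(float), defaultdict(int)
    for tok, h in zip(tokens, hidden_states):
        sums[tok] += abs(h[dim])
        counts[tok] += 1
    return {tok: sums[tok] / counts[tok] for tok in sums}

# Hypothetical pre-training counts and toy 2-dimensional hidden states.
token_counts = {"the": 1000, "cat": 50, "sat": 30, "zygote": 2}
tokens = ["the", "cat", "sat", "the", "zygote"]
hidden_states = [[0.1, 4.0], [0.2, 2.5], [0.0, 2.0], [0.3, 4.4], [0.1, 0.5]]
OUTLIER_DIM = 1  # hypothetical outlier index

magnitude = mean_abs_by_token(tokens, hidden_states, OUTLIER_DIM)
vocab = sorted(magnitude)
rho = spearman([token_counts[t] for t in vocab], [magnitude[t] for t in vocab])
print(f"Spearman rho between frequency and magnitude: {rho:.3f}")
```

In a real analysis the hidden states would come from a model's encoder layers and the counts from the pre-training corpus; both are replaced by hand-written values here.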

Code & Models

Repositories

gpucce/outliersvsfreq (PyTorch, official)

Taxonomy

Topics: Neural Networks and Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Weight Decay · Byte Pair Encoding · Dense Connections · Linear Warmup With Linear Decay · Dropout · Absolute Position Encodings · Attention Dropout