BERT Busters: Outlier Dimensions that Disrupt Transformers
Olga Kovaleva, Saurabh Kulshreshtha, Anna Rogers, Anna Rumshisky

TL;DR
This paper shows that pre-trained Transformer models are highly sensitive to the removal of a tiny subset of high-magnitude LayerNorm parameters (fewer than 0.0001% of model weights), and that disabling them sharply degrades performance across several architectures.
Contribution
It uncovers the critical role of outlier LayerNorm parameters in Transformer robustness, challenging the belief that these models are highly resilient to pruning.
Findings
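Removing a tiny fraction of LayerNorm parameters (<0.0001% of model weights) significantly degrades performance.
The high-magnitude normalization parameters emerge early in pre-training and appear in the same dimensional position throughout the model, across several architectures.
Disabling the outliers significantly degrades both the MLM loss and downstream task performance.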
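The idea of an outlier that shows up in the same dimension across layers can be illustrated with a minimal sketch. It assumes each layer's LayerNorm scaling factors are available as a plain list of floats, and it flags a dimension when its value lies more than k standard deviations from that layer's mean. The threshold k = 3, the function names outlier_dims and consistent_outliers, and the toy data are all assumptions made for illustration; the summary above does not state the exact criterion the authors use.

```python
from statistics import mean, pstdev


def outlier_dims(vector, k=3.0):
    """Indices whose value lies more than k standard deviations from the mean.

    The k-sigma rule is an illustrative assumption, not a criterion taken
    from the summary above.
    """
    mu = mean(vector)
    sigma = pstdev(vector)
    if sigma == 0:
        return set()
    return {i for i, v in enumerate(vector) if abs(v - mu) > k * sigma}


def consistent_outliers(layers, k=3.0):
    """Indices flagged as outliers in every layer's parameter vector."""
    per_layer = [outlier_dims(vec, k) for vec in layers]
    return set.intersection(*per_layer) if per_layer else set()


# Toy data (hypothetical): 4 layers, 32 dimensions, all scaling factors 1.0.
toy_layers = [[1.0] * 32 for _ in range(4)]
for vec in toy_layers:
    vec[7] = 9.0            # dimension 7 is extreme in every layer
toy_layers[2][20] = 9.0     # dimension 20 is extreme in one layer only

print(consistent_outliers(toy_layers))
```

In this toy setup only dimension 7 survives the intersection, because dimension 20 is extreme in a single layer. Real LayerNorm parameters would be read from a model checkpoint rather than constructed by hand.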
Abstract
Multiple studies have shown that Transformers are remarkably robust to pruning. Contrary to this received wisdom, we demonstrate that pre-trained Transformer encoders are surprisingly fragile to the removal of a very small number of features in the layer outputs (<0.0001% of model weights). In the case of BERT and other pre-trained encoder Transformers, the affected component is the scaling factors and biases in the LayerNorm. The outliers are high-magnitude normalization parameters that emerge early in pre-training and show up consistently in the same dimensional position throughout the model. We show that disabling them significantly degrades both the MLM loss and the downstream task performance. This effect is observed across several BERT-family models and other popular pre-trained Transformer architectures, including BART, XLNet and ELECTRA; we also show a similar effect in GPT-2.
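To make the affected component concrete, here is a minimal sketch of a standard LayerNorm forward pass with one dimension disabled. It assumes that "disabling" means setting both the scaling factor and the bias to zero at the chosen dimension, which the abstract does not spell out. The helper names layer_norm and disable_dims and the toy values are hypothetical.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-5):
    """Standard LayerNorm over one vector: gamma * (x - mean) / std + beta."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mu) * inv_std + b for v, g, b in zip(x, gamma, beta)]


def disable_dims(gamma, beta, dims):
    """Return copies of gamma and beta with the given dimensions zeroed.

    Zeroing is an assumed reading of "disabling" in the abstract.
    """
    new_gamma = [0.0 if i in dims else g for i, g in enumerate(gamma)]
    new_beta = [0.0 if i in dims else b for i, b in enumerate(beta)]
    return new_gamma, new_beta


# Toy example (hypothetical values): dimension 2 carries a large scale and bias.
x = [1.0, 2.0, 3.0, 4.0]
gamma = [1.0, 1.0, 5.0, 1.0]
beta = [0.0, 0.0, 2.0, 0.0]

before = layer_norm(x, gamma, beta)
pruned_gamma, pruned_beta = disable_dims(gamma, beta, {2})
after = layer_norm(x, pruned_gamma, pruned_beta)

print(before)
print(after)
```

Only the output at the disabled dimension changes, since the mean and variance are computed from the input rather than from the parameters. In a full model, every later layer would then receive a zero in that position, which is one way to picture how so few parameters could matter.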
Taxonomy
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Cosine Annealing · Linear Warmup With Cosine Annealing · Discriminative Fine-Tuning · GPT-2 · SentencePiece · BART
