Understanding and Minimising Outlier Features in Neural Network Training
Bobby He, Lorenzo Noci, Daniele Paliotta, Imanol Schlag, Thomas Hofmann

TL;DR
This paper investigates the emergence of Outlier Features in neural networks, introduces methods to measure and minimize them, and demonstrates improved quantization performance in large transformer models.
Contribution
It introduces the Outlier Protected transformer block and highlights the benefits of non-diagonal preconditioning to reduce Outlier Features during training.
Findings
Outlier Features can be effectively measured using kurtosis metrics.
The proposed OP block and the SOAP optimiser significantly reduce Outlier Features.
Combining OP block with non-diagonal preconditioning improves quantization accuracy.
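To illustrate why Outlier Features matter for quantisation, here is a minimal sketch. It assumes a simple per-tensor symmetric absmax int8 scheme, which is chosen only for illustration and is not necessarily the quantisation setup used in the paper. The function names, the synthetic activations, and the outlier scale of 50 are all hypothetical.

```python
import numpy as np

def absmax_quantise(x: np.ndarray, bits: int = 8) -> np.ndarray:
    """Per-tensor symmetric absmax quantisation, returned in dequantised form.

    Illustrative assumption: a single scale for the whole tensor.
    Assumes x is not all zeros.
    """
    qmax = 2 ** (bits - 1) - 1               # 127 for int8
    scale = np.max(np.abs(x)) / qmax         # one scale shared by every neuron
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale

def mean_abs_error(x: np.ndarray, cols: np.ndarray) -> float:
    """Mean absolute quantisation error restricted to the selected neurons."""
    err = np.abs(x - absmax_quantise(x))
    return float(np.mean(err[:, cols]))

rng = np.random.default_rng(0)
acts = rng.standard_normal((512, 64))        # synthetic (tokens, width) activations

acts_outlier = acts.copy()
acts_outlier[:, 3] *= 50.0                   # hypothetical outlier neuron

regular = np.ones(64, dtype=bool)
regular[3] = False                           # measure error on the other 63 neurons

err_clean = mean_abs_error(acts, regular)
err_outlier = mean_abs_error(acts_outlier, regular)
print(f"error without outlier: {err_clean:.4f}")
print(f"error with outlier:    {err_outlier:.4f}")
```

Because the single outlier neuron stretches the shared scale, most values in the remaining neurons round to zero, so their error grows sharply even though their activations are unchanged.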
Abstract
Outlier Features (OFs) are neurons whose activation magnitudes significantly exceed the average over a neural network's (NN) width. They are well known to emerge during standard transformer training and have the undesirable effect of hindering quantisation in afflicted models. Despite their practical importance, little is known about why OFs emerge during training, or how one can minimise them. Our work focuses on the above questions, first identifying several quantitative metrics, such as the kurtosis over neuron activation norms, to measure OFs. With these metrics, we study how architectural and optimisation choices influence OFs, and provide practical insights to minimise OFs during training. As highlights, we introduce a novel unnormalised transformer block, the Outlier Protected block, and present a previously unknown benefit of non-diagonal preconditioning optimisers, finding…
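The kurtosis over neuron activation norms mentioned above can be sketched as follows. This is a minimal illustration under one assumed form of the metric: the fourth moment of the per-neuron RMS activations divided by the square of their second moment, taken across the width. The exact definition in the paper may differ in detail, and the function name and synthetic data are hypothetical.

```python
import numpy as np

def neuron_kurtosis(acts: np.ndarray) -> float:
    """Kurtosis across neurons of per-neuron RMS activation.

    acts has shape (n_tokens, width). With s_j the RMS of neuron j over tokens,
    this returns mean_j(s_j^4) / mean_j(s_j^2)^2 (assumed form of the metric).
    The value is 1 when all neurons have equal norm and approaches `width`
    when a single neuron dominates.
    """
    second_moment = np.mean(acts ** 2, axis=0)   # s_j^2 for each neuron
    return float(np.mean(second_moment ** 2) / np.mean(second_moment) ** 2)

rng = np.random.default_rng(0)
acts = rng.standard_normal((512, 64))            # synthetic activations, no outliers

acts_outlier = acts.copy()
acts_outlier[:, 3] *= 50.0                       # hypothetical outlier neuron

baseline = neuron_kurtosis(acts)
with_outlier = neuron_kurtosis(acts_outlier)
print(f"kurtosis without outlier: {baseline:.3f}")
print(f"kurtosis with outlier:    {with_outlier:.3f}")
```

Under this form, a single scalar per layer summarises how unevenly activation mass is spread over the width, which makes it convenient to track during training.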
Taxonomy
Topics: Neural Networks and Applications
Methods: Adam
