Understanding and Minimising Outlier Features in Neural Network Training

Bobby He, Lorenzo Noci, Daniele Paliotta, Imanol Schlag, Thomas Hofmann

arXiv:2405.19279 · cs.LG · November 8, 2024 · 2 cites

TL;DR

This paper investigates why Outlier Features emerge during neural network training, introduces metrics to measure them and methods to minimise them, and reports improved quantisation performance in large transformer models.

Contribution

The paper introduces the Outlier Protected (OP) block, a novel unnormalised transformer block, and shows that non-diagonal preconditioning optimisers reduce Outlier Features during training.

Findings

01. Outlier Features can be measured with quantitative metrics such as the kurtosis over neuron activation norms.

02. The proposed OP block and the SOAP optimiser both significantly reduce Outlier Features.

03. Combining the OP block with non-diagonal preconditioning improves quantisation accuracy.
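
To see why outlier features hinder quantisation, the sketch below applies a generic symmetric absmax int8 scheme to a small activation vector with and without one large entry. The scheme, the function names `absmax_quantise` and `mean_abs_error`, and the toy values are assumptions chosen for illustration; they are not taken from the paper and may differ from the quantisation setup it evaluates.

```python
def absmax_quantise(values, bits=8):
    """Quantise to signed integers with one shared scale, then dequantise.

    Hypothetical helper: symmetric per-tensor absmax quantisation.
    """
    qmax = 2 ** (bits - 1) - 1          # 127 for int8
    largest = max(abs(v) for v in values)
    if largest == 0:
        return list(values)             # nothing to scale
    scale = largest / qmax              # one scale for the whole vector
    return [round(v / scale) * scale for v in values]


def mean_abs_error(values, bits=8):
    """Mean absolute difference between values and their quantised copies."""
    dequantised = absmax_quantise(values, bits)
    return sum(abs(a - b) for a, b in zip(values, dequantised)) / len(values)


# Toy activations: all entries share a similar magnitude.
typical = [0.5, -0.3, 0.8, -0.6, 0.2, -0.9, 0.4, 0.7]
# Same vector, but the last neuron is an outlier that dominates the scale.
with_outlier = typical[:-1] + [60.0]

err_typical = mean_abs_error(typical)
err_outlier = mean_abs_error(with_outlier)
print(f"mean abs error without outlier: {err_typical:.5f}")
print(f"mean abs error with outlier:    {err_outlier:.5f}")
```

Because the single large entry sets the shared scale, the remaining entries are squeezed into only a few integer levels, so their rounding error grows sharply.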

Abstract

Outlier Features (OFs) are neurons whose activation magnitudes significantly exceed the average over a neural network's (NN) width. They are well known to emerge during standard transformer training and have the undesirable effect of hindering quantisation in afflicted models. Despite their practical importance, little is known behind why OFs emerge during training, nor how one can minimise them. Our work focuses on the above questions, first identifying several quantitative metrics, such as the kurtosis over neuron activation norms, to measure OFs. With these metrics, we study how architectural and optimisation choices influence OFs, and provide practical insights to minimise OFs during training. As highlights, we introduce a novel unnormalised transformer block, the Outlier Protected block, and present a previously unknown benefit of non-diagonal preconditioning optimisers, finding…
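
The abstract excerpt names the kurtosis over neuron activation norms as a metric but does not give its formula. A minimal sketch under one assumed formulation: for activations with n inputs and d neurons, take each neuron's mean squared activation over the inputs, then divide the mean of their squares by the square of their mean. The function name `activation_kurtosis` and the toy matrices are hypothetical.

```python
def activation_kurtosis(acts):
    """Kurtosis of per-neuron activation norms (assumed formulation).

    acts: list of n rows (inputs), each a list of d neuron activations.
    Returns E_j[s_j^4] / (E_j[s_j^2])^2, where s_j^2 is the mean squared
    activation of neuron j over the n inputs.
    """
    n = len(acts)
    d = len(acts[0])
    # s_j^2: mean squared activation of each neuron across inputs.
    sq_norms = [sum(row[j] ** 2 for row in acts) / n for j in range(d)]
    second_moment = sum(sq_norms) / d
    fourth_moment = sum(v ** 2 for v in sq_norms) / d
    return fourth_moment / second_moment ** 2


# All four neurons have the same norm: the metric takes its minimum, 1.
uniform = [[1.0, -1.0, 1.0, -1.0],
           [-1.0, 1.0, 1.0, 1.0]]
# One neuron dominates: the metric approaches its maximum, d = 4.
outlier = [[0.1, 0.1, 0.1, 20.0],
           [0.1, -0.1, 0.1, -20.0]]

k_uniform = activation_kurtosis(uniform)
k_outlier = activation_kurtosis(outlier)
print(k_uniform, k_outlier)
```

Under this formulation the value ranges from 1 (every neuron has the same norm) to d (a single neuron carries all the activation mass), so larger values indicate stronger outlier features.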

Code & Models

Repositories

bobby-he/simplified_transformers (PyTorch, official)

Taxonomy

Topics: Neural Networks and Applications

Methods: Adam · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings