Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models
Xiuying Wei, Yunchen Zhang, Xiangguo Zhang, Ruihao Gong, Shanghang Zhang, Qi Zhang, Fengwei Yu, Xianglong Liu

TL;DR
This paper introduces a novel outlier suppression framework for low-bit transformer models, significantly improving quantization performance and enabling 6-bit BERT quantization to reach full-precision accuracy.
Contribution
It reveals the role of LayerNorm gamma as an outlier amplifier and proposes Gamma Migration and Token-Wise Clipping to effectively suppress outliers without extra computational burden.
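To illustrate why moving gamma out of LayerNorm need not change the network's output, the following is a minimal sketch in plain Python. It assumes the LayerNorm output feeds a single linear layer and that every gamma entry is nonzero; the function names (`non_scaling_layernorm`, `migrate_gamma`) and the toy numbers are illustrative choices, not taken from the paper's code, and any residual branch is left out for brevity.

```python
import math

def normalize(x, eps=1e-5):
    # Zero-mean, unit-variance normalization over the hidden dimension.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def layernorm(x, gamma, beta):
    # Standard LayerNorm: gamma * normalize(x) + beta.
    return [g * n + b for n, g, b in zip(normalize(x), gamma, beta)]

def non_scaling_layernorm(x, gamma, beta):
    # Hypothetical helper: LayerNorm with gamma factored out,
    # i.e. normalize(x) + beta / gamma (assumes gamma has no zeros).
    return [n + b / g for n, g, b in zip(normalize(x), gamma, beta)]

def linear(W, x):
    # Plain matrix-vector product; W is a list of rows.
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def migrate_gamma(W, gamma):
    # Fold gamma into the input columns of the following weight matrix.
    return [[w * g for w, g in zip(row, gamma)] for row in W]

# Toy data: channel 1 has a large gamma, so it dominates the LN output.
x = [0.2, 1.5, -0.7, -1.0]
gamma = [0.5, 8.0, 1.0, 0.7]
beta = [0.1, -0.4, 0.0, 0.2]
W = [[0.3, -0.1, 0.2, 0.5],
     [-0.4, 0.05, 0.1, 0.2]]

h_ref = layernorm(x, gamma, beta)              # activation a quantizer would see
y_ref = linear(W, h_ref)

h_mig = non_scaling_layernorm(x, gamma, beta)  # activation after migration
y_mig = linear(migrate_gamma(W, gamma), h_mig)

print("max |activation| before:", max(abs(v) for v in h_ref))
print("max |activation| after: ", max(abs(v) for v in h_mig))
print("outputs match:", all(abs(a - b) < 1e-9 for a, b in zip(y_ref, y_mig)))
```

In this toy case the layer output is unchanged, while the activation that would be quantized has a much smaller range, which is consistent with the claim that the transformation adds no extra computation at inference time.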
Findings
Surpasses existing methods in outlier suppression
Enables 6-bit BERT quantization to match full-precision performance
Provides a plug-and-play framework for low-bit transformer quantization
Abstract
Transformer architecture has become the fundamental element of the widespread natural language processing (NLP) models. With the trends of large NLP models, the increasing memory and computation costs hinder their efficient deployment on resource-limited devices. Therefore, transformer quantization attracts wide research interest. Recent work recognizes that structured outliers are the critical bottleneck for quantization performance. However, their proposed methods increase the computation overhead and still leave the outliers there. To fundamentally address this problem, this paper delves into the inherent inducement and importance of the outliers. We discover that γ in LayerNorm (LN) acts as a sinful amplifier for the outliers, and the importance of outliers varies greatly where some outliers provided by a few tokens cover a large area but can be clipped sharply…
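The observation that a few tokens contribute most of the extreme values suggests choosing a clipping range from per-token statistics rather than from the global minimum and maximum. The sketch below is one way this could look, under loudly stated assumptions: the abstract above is truncated and does not specify the procedure, so the quantile-of-token-extremes rule, the parameter `q`, and all function names are hypothetical and should not be read as the authors' exact Token-Wise Clipping algorithm.

```python
import math

def quantile(values, q):
    # Linear-interpolation quantile of a list, 0 <= q <= 1.
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(s) - 1)
    frac = pos - lo
    return s[lo] * (1 - frac) + s[hi] * frac

def token_wise_clip_range(acts, q):
    # acts: list of tokens, each a list of hidden values.
    # Assumption: take each token's max/min, then a quantile across tokens,
    # so a handful of outlier tokens cannot stretch the range.
    token_max = [max(tok) for tok in acts]
    token_min = [min(tok) for tok in acts]
    upper = quantile(token_max, q)
    lower = quantile(token_min, 1 - q)
    return lower, upper

def clip(acts, lower, upper):
    # Clamp every value into [lower, upper].
    return [[min(max(v, lower), upper) for v in tok] for tok in acts]

# Toy activations: the fourth token carries the outliers.
acts = [
    [0.5, -1.0, 1.0],
    [1.2, -0.8, 0.3],
    [0.9, -1.1, 0.1],
    [60.0, -45.0, 0.2],
    [1.1, -0.9, 0.4],
]

lower, upper = token_wise_clip_range(acts, q=0.75)
clipped = clip(acts, lower, upper)

global_range = (min(min(t) for t in acts), max(max(t) for t in acts))
print("global min/max range:", global_range)
print("token-wise clip range:", (lower, upper))
```

With these toy values only the single outlier token is affected by clipping, while the other tokens pass through unchanged. In practice `q` would need to be tuned or searched against a quantization loss.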
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dropout · Dense Connections · Residual Connection · Weight Decay · Attention Dropout
