Rotated Runtime Smooth: Training-Free Activation Smoother for accurate INT4 inference
Ke Yi, Zengke Liu, Jianwei Zhang, Chengyuan Li, Tong Zhang, Junyang Lin, Jingren Zhou

TL;DR
This paper introduces Rotated Runtime Smooth, a training-free activation smoothing technique that improves INT4 quantization accuracy for large language models by effectively handling outliers without high latency or accuracy loss.
Contribution
The paper proposes a novel plug-and-play activation smoother combining Runtime Smooth and Rotation to better handle outliers in quantization, outperforming existing methods.
Findings
Outperforms state-of-the-art methods on LLaMA and Qwen models.
Reduces WikiText-2 perplexity from 57.33 to 6.66 for INT4 inference.
Effectively mitigates outliers without high latency or accuracy degradation.
Abstract
Large language models have demonstrated promising capabilities upon scaling up parameters. However, serving large language models incurs substantial computation and memory movement costs due to their large scale. Quantization methods have been employed to reduce service costs and latency. Nevertheless, outliers in activations hinder the development of INT4 weight-activation quantization. Existing approaches separate outliers and normal values into two matrices or migrate outliers from activations to weights, suffering from high latency or accuracy degradation. Based on observing activations from large language models, outliers can be classified into channel-wise and spike outliers. In this work, we propose Rotated Runtime Smooth (RRS), a plug-and-play activation smoother for quantization, consisting of Runtime Smooth and the Rotation operation. Runtime Smooth (RS) is introduced to…
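To make the abstract's point about channel-wise outliers concrete, here is a minimal pure-Python sketch, not the authors' implementation. It injects one large-magnitude channel into random activations and compares the per-token INT4 reconstruction error with and without a channel-wise smoothing step. The smoothing rule (dividing each channel by its maximum absolute value observed at runtime) is an assumption made for illustration, since the abstract above is truncated before it describes Runtime Smooth. The names `quantize_int4_per_token` and `mean_abs_error` are hypothetical helpers.

```python
import random


def quantize_int4_per_token(row):
    """Symmetric per-token INT4 fake-quantization.

    Values are scaled so the row's largest magnitude maps to 7, rounded and
    clamped to the signed 4-bit range [-8, 7], and then dequantized.
    """
    max_abs = max(abs(v) for v in row)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    return [max(-8, min(7, round(v / scale))) * scale for v in row]


def mean_abs_error(original, reconstructed):
    """Mean absolute elementwise difference between two 2-D lists."""
    total, count = 0.0, 0
    for row_a, row_b in zip(original, reconstructed):
        for a, b in zip(row_a, row_b):
            total += abs(a - b)
            count += 1
    return total / count


random.seed(0)
tokens, channels = 16, 32

# Synthetic activations: unit Gaussian noise with one channel-wise outlier
# (channel 3 is 50x larger in every token). The sizes are illustrative only.
acts = [[random.gauss(0.0, 1.0) for _ in range(channels)] for _ in range(tokens)]
for row in acts:
    row[3] *= 50.0

# Baseline: quantize each token directly. The outlier channel sets the scale,
# so most ordinary values collapse onto very few quantization levels.
baseline = [quantize_int4_per_token(row) for row in acts]
err_baseline = mean_abs_error(acts, baseline)

# Assumed smoothing step: divide each channel by its runtime max magnitude,
# quantize the smoothed tokens, then multiply the channel scales back.
# Multiplying back here is only a way to measure reconstruction error.
ch_max = [max(abs(acts[t][c]) for t in range(tokens)) for c in range(channels)]
smoothed = [[v / s for v, s in zip(row, ch_max)] for row in acts]
restored = [
    [q * s for q, s in zip(quantize_int4_per_token(row), ch_max)]
    for row in smoothed
]
err_smoothed = mean_abs_error(acts, restored)

print(f"mean abs error without smoothing: {err_baseline:.4f}")
print(f"mean abs error with smoothing:    {err_smoothed:.4f}")
```

In this toy setting the smoothed path should show a noticeably lower error, because no single channel dominates the per-token scale. The sketch covers only channel-wise outliers and says nothing about spike outliers or the Rotation operation.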
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning
Methods: LLaMA
