Rotated Runtime Smooth: Training-Free Activation Smoother for accurate INT4 inference

Ke Yi; Zengke Liu; Jianwei Zhang; Chengyuan Li; Tong Zhang; Junyang Lin; Jingren Zhou

arXiv:2409.20361·cs.LG·November 12, 2024

TL;DR

This paper introduces Rotated Runtime Smooth, a training-free activation smoothing technique that improves INT4 quantization accuracy for large language models by effectively handling outliers without high latency or accuracy loss.

Contribution

The paper proposes a novel plug-and-play activation smoother combining Runtime Smooth and Rotation to better handle outliers in quantization, outperforming existing methods.

Findings

1. Outperforms state-of-the-art methods on LLaMA and Qwen models.

2. Significantly reduces WikiText-2 perplexity from 57.33 to 6.66 for INT4 inference.

3. Effectively mitigates outliers without high latency or accuracy degradation.

Abstract

Large language models have demonstrated promising capabilities upon scaling up parameters. However, serving large language models incurs substantial computation and memory movement costs due to their large scale. Quantization methods have been employed to reduce service costs and latency. Nevertheless, outliers in activations hinder the development of INT4 weight-activation quantization. Existing approaches separate outliers and normal values into two matrices or migrate outliers from activations to weights, suffering from high latency or accuracy degradation. Based on observing activations from large language models, outliers can be classified into channel-wise and spike outliers. In this work, we propose Rotated Runtime Smooth (RRS), a plug-and-play activation smoother for quantization, consisting of Runtime Smooth and the Rotation operation. Runtime Smooth (RS) is introduced to…
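To make the channel-wise outlier problem concrete, the following is a minimal sketch, not the paper's implementation. It assumes a toy activation matrix with made-up values, symmetric INT4 fake-quantization with a single scale, and a generic "divide each channel by its absolute maximum, quantize, then multiply back" smoother. The helper names (`quantize_int4`, `mean_abs_error`) and the data are hypothetical, and the Rotation step is not modeled.

```python
def quantize_int4(values, scale):
    """Symmetric INT4 fake-quantization: round to integers in [-8, 7], then dequantize."""
    return [max(-8, min(7, round(v / scale))) * scale for v in values]

def mean_abs_error(reference, approx):
    """Mean absolute difference between two equal-length lists."""
    return sum(abs(r - a) for r, a in zip(reference, approx)) / len(reference)

# Hypothetical activations: 4 tokens x 4 channels; channel 2 is a channel-wise outlier.
acts = [
    [0.5, -0.8, 40.0, 0.3],
    [-0.2, 0.9, -35.0, -0.6],
    [0.7, 0.1, 48.0, -0.4],
    [-0.9, -0.3, 42.0, 0.8],
]
tokens, channels, outlier_channel = len(acts), len(acts[0]), 2
flat = [v for row in acts for v in row]

# Baseline: one scale for the whole tensor, dominated by the outlier channel.
baseline_scale = max(abs(v) for v in flat) / 7
baseline = quantize_int4(flat, baseline_scale)
baseline_err = mean_abs_error(flat, baseline)

# Assumed smoother: divide each channel by its absolute maximum before quantizing,
# then multiply the dequantized values back by the same per-channel factor.
channel_max = [max(abs(acts[t][c]) for t in range(tokens)) for c in range(channels)]
smoothed = [acts[t][c] / channel_max[c] for t in range(tokens) for c in range(channels)]
smoothed_scale = max(abs(v) for v in smoothed) / 7
dequantized = quantize_int4(smoothed, smoothed_scale)
restored = [dequantized[i] * channel_max[i % channels] for i in range(len(dequantized))]
smoothed_err = mean_abs_error(flat, restored)

print(f"per-tensor INT4 error:        {baseline_err:.3f}")
print(f"channel-smoothed INT4 error:  {smoothed_err:.3f}")
```

In this toy setup the single per-tensor scale is so coarse that every non-outlier entry rounds to zero, while the smoothed variant keeps those entries distinguishable. A real kernel would fold the per-channel factors into the matrix multiplication rather than restore activations explicitly; that detail is outside what this sketch shows.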

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning

Methods: LLaMA