OutlierTune: Efficient Channel-Wise Quantization for Large Language Models

Jinguang Wang; Yuexi Yin; Haifeng Sun; Qi Qi; Jingyu Wang; Zirui Zhuang; Tingting Yang; Jianxin Liao

arXiv:2406.18832·cs.CL·June 28, 2024

TL;DR

OutlierTune is a per-channel post-training quantization (PTQ) method for LLM activations. It folds the activation scaling factors into the model weights and symmetrizes numerical ranges, improving accuracy while enabling faster inference and lower memory usage than FP16.

Contribution

The paper introduces a per-channel PTQ approach for activations that combines pre-execution of dequantization with symmetrization to handle structured outliers in LLM activations.

Findings

01. Outperforms existing quantization methods across multiple tasks.

02. Achieves Int6 quantization with accuracy comparable to FP16 for instruction-tuned LLMs.

03. Runs 1.48x faster than FP16 with half the memory usage.

Abstract

Quantizing the activations of large language models (LLMs) has been a significant challenge due to the presence of structured outliers. Most existing methods focus on the per-token or per-tensor quantization of activations, making it difficult to achieve both accuracy and hardware efficiency. To address this problem, we propose OutlierTune, an efficient per-channel post-training quantization (PTQ) method for the activations of LLMs. OutlierTune consists of two components: pre-execution of dequantization and symmetrization. The pre-execution of dequantization updates the model weights by the activation scaling factors, avoiding the internal scaling and costly additional computational overheads brought by the per-channel activation quantization. The symmetrization further reduces the quantization differences arising from the weight updates by ensuring the balanced numerical ranges across…
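The idea of pre-executing dequantization can be illustrated with a small sketch. For a linear layer Y = XW with one scale s_j per activation channel, X is approximately X_q · diag(s), so Y is approximately X_q · (diag(s) · W): the scales can be multiplied into the rows of W once, offline. The code below is a minimal pure-Python illustration under that assumption, not the authors' implementation. All function names and the toy data are hypothetical, the scales use a simple symmetric max-abs rule, the weights are kept in floating point, and symmetrization is not modeled.

```python
# Sketch only: per-channel activation quantization with the dequantization
# scales folded into the weights ahead of time. Names and data are hypothetical.

def channel_scales(X, bits=8):
    """One symmetric max-abs scale per activation channel (column of X)."""
    qmax = 2 ** (bits - 1) - 1
    scales = []
    for j in range(len(X[0])):
        max_abs = max(abs(row[j]) for row in X)
        scales.append(max_abs / qmax if max_abs > 0 else 1.0)
    return scales

def quantize(X, scales, bits=8):
    """Round each activation to an integer using its channel's scale."""
    qmax = 2 ** (bits - 1) - 1
    return [[max(-qmax, min(qmax, round(x / s))) for x, s in zip(row, scales)]
            for row in X]

def fold_scales_into_weights(W, scales):
    """W has shape (channels, out); row j is multiplied by scale j."""
    return [[s * w for w in row] for row, s in zip(W, scales)]

def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# Toy activations (tokens x channels); channel 1 plays the outlier channel.
X = [[0.5, 40.0, -0.2],
     [-0.3, -35.0, 0.1],
     [0.4, 38.0, -0.15]]
# Toy weights (channels x outputs).
W = [[0.2, -0.1],
     [0.01, 0.03],
     [-0.5, 0.4]]

scales = channel_scales(X)                      # computed once, offline
W_folded = fold_scales_into_weights(W, scales)  # dequantization pre-executed
X_q = quantize(X, scales)                       # integer activations

Y_ref = matmul(X, W)         # full-precision reference
Y_q = matmul(X_q, W_folded)  # no per-channel rescaling inside the matmul

max_err = max(abs(a - b)
              for ref_row, q_row in zip(Y_ref, Y_q)
              for a, b in zip(ref_row, q_row))
print("max abs error:", max_err)
```

Because each channel has its own scale, the outlier channel does not force a coarse step size onto the small-magnitude channels, and the matrix product needs no rescaling along the inner dimension at inference time.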

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques

Methods: Focus · OPT-IML