AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Ji Lin; Jiaming Tang; Haotian Tang; Shang Yang; Wei-Ming Chen; Wei-Chen Wang; Guangxuan Xiao; Xingyu Dang; Chuang Gan; Song Han

arXiv:2306.00978·cs.CL·April 28, 2026·74 cites

TL;DR

AWQ is an activation-aware, weight-only quantization method for low-bit LLM compression and acceleration. It protects the small fraction of salient weights and enables efficient on-device deployment across various models and modalities.

Contribution

AWQ introduces a hardware-friendly method that identifies salient weight channels from the activation distribution and scales them up before quantization. It outperforms existing quantization techniques without requiring retraining.

Findings

01

AWQ greatly reduces quantization error by protecting only the 1% of weights that are salient.

02

AWQ outperforms existing methods on language and domain-specific benchmarks.

03

The TinyChat framework achieves more than a 3x speedup for 4-bit LLM inference over the Hugging Face FP16 implementation.

Abstract

Large language models (LLMs) have transformed numerous AI applications. On-device LLM deployment is becoming increasingly important: running LLMs locally on edge devices can reduce cloud computing costs and protect users' privacy. However, the astronomical model size and the limited hardware resources pose significant deployment challenges. We propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for LLM low-bit weight-only quantization. AWQ finds that not all weights in an LLM are equally important. Protecting only the 1% of weights that are salient can greatly reduce quantization error. To identify salient weight channels, we should refer to the activation distribution, not the weights. To avoid hardware-inefficient mixed-precision quantization, we mathematically derive that scaling up the salient channels can reduce the quantization error. AWQ employs an equivalent transformation to…
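
The scaling idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the layer sizes, the synthetic data, the per-row symmetric round-to-nearest quantizer `quantize_rtn`, the helper `output_error`, and the grid search over an exponent `alpha` are all assumptions made for illustration. The sketch multiplies each input channel of the weight matrix by a scale `s` derived from the average activation magnitude, quantizes the scaled weights to 4 bits, and divides the activations by `s`, so the layer output is unchanged before quantization is applied.

```python
import numpy as np


def quantize_rtn(w, n_bits=4):
    """Symmetric round-to-nearest quantization, one step size per output row.

    Hypothetical helper for this sketch; returns de-quantized float weights.
    """
    qmax = 2 ** (n_bits - 1) - 1
    delta = np.abs(w).max(axis=1, keepdims=True) / qmax
    return np.clip(np.round(w / delta), -qmax - 1, qmax) * delta


def output_error(w, x, s, n_bits=4):
    """Mean squared output error when W*s is quantized and x is divided by s."""
    y_ref = x @ w.T
    y_q = (x / s) @ quantize_rtn(w * s, n_bits).T
    return float(np.mean((y_ref - y_q) ** 2))


# Synthetic layer (assumed sizes): 64 input channels, 32 output channels.
rng = np.random.default_rng(0)
n_tokens, d_in, d_out = 256, 64, 32
x = rng.normal(size=(n_tokens, d_in))
x[:, :1] *= 30.0  # make one input channel an activation outlier
w = rng.normal(size=(d_out, d_in))

# Salience is read from the activations, not from the weights.
act_mag = np.abs(x).mean(axis=0)

# Assumed search: s = act_mag ** alpha, with alpha = 0 meaning "no scaling".
alphas = np.linspace(0.0, 1.0, 11)
errors = [output_error(w, x, act_mag ** a) for a in alphas]
base_err = errors[0]
best = int(np.argmin(errors))
s_best = act_mag ** alphas[best]

print(f"unscaled 4-bit error: {base_err:.6f}")
print(f"best alpha: {alphas[best]:.1f}, error: {errors[best]:.6f}")
```

Because `alpha = 0` gives `s = 1`, which is plain quantization, the best error found by this search can never exceed the unscaled baseline. How much it improves depends on the data, and the numbers from this synthetic setup say nothing about the results reported in the paper.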

Code & Models

Videos

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration [MLSys'24 Best Paper]· youtube

TinyChatEngine running Llama2-7B on MacBook Pro (M1, 2021)· youtube

TinyChat: An Efficient and Lightweight System for LLMs on the Edge· youtube