TL;DR
AWQ (Activation-aware Weight Quantization) is a low-bit, weight-only quantization method for LLMs that uses activation statistics to decide which weight channels to protect. It improves compression and inference speed, enabling efficient on-device deployment across a range of models and modalities.
Contribution
It introduces a hardware-friendly method that identifies salient weight channels from the activation distribution and scales them up before quantization. The method outperforms existing quantization techniques without requiring retraining.
Findings
Protecting only the 1% most salient weights greatly reduces quantization error.
AWQ outperforms existing methods on language modeling and domain-specific benchmarks.
The TinyChat framework achieves more than 3x speedup over the Hugging Face FP16 implementation for 4-bit LLM inference.
Abstract
Large language models (LLMs) have transformed numerous AI applications. On-device LLM is becoming increasingly important: running LLMs locally on edge devices can reduce the cloud computing cost and protect users' privacy. However, the astronomical model size and the limited hardware resource pose significant deployment challenges. We propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for LLM low-bit weight-only quantization. AWQ finds that not all weights in an LLM are equally important. Protecting only 1% salient weights can greatly reduce quantization error. To identify salient weight channels, we should refer to the activation distribution, not weights. To avoid the hardware-inefficient mix-precision quantization, we mathematically derive that scaling up the salient channels can reduce the quantization error. AWQ employs an equivalent transformation to…
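The scaling idea in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the helper names (`quantize_row`, `channel_salience`, `salient_scales`, `awq_style_quantize`), the toy weights and activations, the per-row quantization grid, and the fixed scale factor of 2 are all assumptions chosen for illustration. The sketch ranks input channels by mean absolute activation, scales up the most salient channel before round-to-nearest quantization, and folds the inverse scale back in so that the result applies to the original input.

```python
def quantize_row(row, n_bits=4):
    """Symmetric round-to-nearest quantization with one step size per row."""
    qmax = 2 ** (n_bits - 1) - 1
    max_abs = max(abs(w) for w in row)
    if max_abs == 0.0:
        return list(row)
    delta = max_abs / qmax
    # Round to the integer grid, clamp, then map back to floating point.
    return [delta * max(-qmax, min(qmax, round(w / delta))) for w in row]


def channel_salience(calib_inputs):
    """Mean absolute activation per input channel over calibration samples."""
    n = len(calib_inputs)
    width = len(calib_inputs[0])
    return [sum(abs(x[j]) for x in calib_inputs) / n for j in range(width)]


def salient_scales(salience, top_k=1, s=2.0):
    """Assumed rule: scale the top_k most salient channels by s, others by 1."""
    ranked = sorted(range(len(salience)), key=lambda j: salience[j], reverse=True)
    top = set(ranked[:top_k])
    return [s if j in top else 1.0 for j in range(len(salience))]


def awq_style_quantize(weight, scales, n_bits=4):
    """Quantize W * diag(s), then divide by s so the result applies to the raw x.

    Dividing the dequantized weights by s is mathematically the same as
    dividing the input activations by s, which is the equivalent
    transformation the abstract refers to.
    """
    out = []
    for row in weight:
        scaled = [w * s for w, s in zip(row, scales)]
        quantized = quantize_row(scaled, n_bits)
        out.append([q / s for q, s in zip(quantized, scales)])
    return out


def matvec(weight, x):
    return [sum(w * xj for w, xj in zip(row, x)) for row in weight]


def output_error(weight, weight_q, x):
    """Sum of absolute output differences between full-precision and quantized W."""
    ref, got = matvec(weight, x), matvec(weight_q, x)
    return sum(abs(a - b) for a, b in zip(ref, got))


# Toy data (hypothetical): channel 0 carries much larger activations.
weight = [[0.24, 0.7, -0.33], [-0.16, 0.42, 0.7]]
calib_inputs = [[10.0, 1.0, -0.5], [-8.0, 0.5, 1.0]]
x_test = [9.0, -1.0, 0.5]

salience = channel_salience(calib_inputs)
scales = salient_scales(salience, top_k=1, s=2.0)

plain = [quantize_row(row) for row in weight]
scaled = awq_style_quantize(weight, scales)

print("salience:", salience)
print("scales:", scales)
print("error without scaling:", output_error(weight, plain, x_test))
print("error with scaling:", output_error(weight, scaled, x_test))
```

In this toy case the scaled channel's weights stay below each row's maximum, so the quantization step size is unchanged and the rounding error on the salient channel shrinks by roughly the scale factor. AWQ itself searches for per-channel scales rather than using a fixed value, so the fixed factor here only shows the direction of the effect.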
Code & Models
- 🤗 feanors/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-AWQ-INT4 · model · 8.9k dl · ♡ 6
- 🤗 dahara1/ELYZA-japanese-Llama-2-7b-instruct-AWQ · model · 11 dl · ♡ 1
- 🤗 mmnga/Xwin-LM-7B-AWQ-calib-ja-100k · model · 12 dl · ♡ 2
- 🤗 mmnga/ELYZA-japanese-Llama-2-7b-fast-instruct-AWQ-calib-ja-100k · model · 4 dl · ♡ 1
- 🤗 TRAC-MTRY/traclm-v2-7b-instruct-AWQ · model
- 🤗 mmnga/japanese-stablelm-instruct-gamma-7b-AWQ-calib-ja-1k · model · 3 dl · ♡ 1
- 🤗 mmnga/japanese-stablelm-base-gamma-7b-AWQ-calib-ja-1k · model · 10 dl
- 🤗 internlm/internlm2-chat-7b-4bits · model · 85 dl · ♡ 4
- 🤗 internlm/internlm2-chat-20b-4bits · model · 1.1k dl · ♡ 7
- 🤗 disi-unibo-nlp/pmc-llama-13b-awq · model · 4 dl
Videos
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration [MLSys'24 Best Paper]· youtube
TinyChatEngine running Llama2-7B on MacBook Pro (M1, 2021)· youtube
TinyChat: An Efficient and Lightweight System for LLMs on the Edge· youtube
