Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge
Xuan Shen, Peiyan Dong, Lei Lu, Zhenglun Kong, Zhengang Li, Ming Lin, Chao Wu, Yanzhi Wang

TL;DR
Agile-Quant introduces an activation-guided quantization framework and an accelerator for large language models, enabling faster inference on edge devices with minimal performance loss through 4-bit and 8-bit quantization.
Contribution
The paper presents an activation-guided quantization method and an efficient end-to-end accelerator for LLMs, achieving significant speedups on edge devices while maintaining accuracy.
Findings
Achieves up to 2.55x speedup on edge devices with 4-bit and 8-bit quantization.
Maintains task performance comparable to weight-only quantization methods.
Successfully applies quantization to LLaMA, OPT, and BLOOM models.
Abstract
Large Language Models (LLMs) stand out for their impressive performance in intricate language modeling tasks. However, their demanding computational and memory needs pose obstacles for broad use on edge devices. Quantization is then introduced to boost LLMs' on-device efficiency. Recent works show that 8-bit or lower weight quantization is feasible with minimal impact on end-to-end task performance, while the activation is still not quantized. On the other hand, mainstream commodity edge devices still struggle to execute these sub-8-bit quantized networks effectively. In this paper, we propose Agile-Quant, an activation-guided quantization framework for popular Large Language Models (LLMs), and implement an end-to-end accelerator on multiple edge devices for faster inference. Considering the hardware profiling and activation analysis, we first introduce a basic activation quantization…
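To make the 4-bit versus 8-bit trade-off mentioned above concrete, the following is a minimal sketch of generic symmetric per-tensor quantization applied to a handful of activation values. It is not the Agile-Quant method, whose details are not given in this summary; the function names (`quantize_symmetric`, `dequantize`, `max_abs_error`) and the sample values are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def quantize_symmetric(values: List[float], bits: int) -> Tuple[List[int], float]:
    """Map floats to signed integers in [-qmax, qmax] using one shared scale.

    Generic textbook scheme, assumed here for illustration only.
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(v) for v in values)
    # Fall back to a scale of 1.0 when every value is zero.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level, then clamp to the representable range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the integer codes."""
    return [x * scale for x in q]


def max_abs_error(values: List[float], bits: int) -> float:
    """Largest round-trip error introduced by quantizing at the given bit width."""
    q, scale = quantize_symmetric(values, bits)
    restored = dequantize(q, scale)
    return max(abs(a - b) for a, b in zip(values, restored))


# Made-up activation values, not taken from any model.
activations = [0.03, -1.20, 0.45, 2.10, -0.07, 0.88]

err8 = max_abs_error(activations, 8)
err4 = max_abs_error(activations, 4)
print(f"8-bit max error: {err8:.4f}")
print(f"4-bit max error: {err4:.4f}")
```

With a single shared scale, the round-trip error is bounded by half the step size, so dropping from 8 to 4 bits coarsens the grid and increases the worst-case error. This is one reason low-bit activation quantization typically needs more care than the naive scheme sketched here.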
Taxonomy
Topics: Topic Modeling · Ferroelectric and Negative Capacitance Devices · Natural Language Processing Techniques
Methods: Pruning · BLOOM · OPT
