Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge

Xuan Shen; Peiyan Dong; Lei Lu; Zhenglun Kong; Zhengang Li; Ming Lin; Chao Wu; Yanzhi Wang

arXiv:2312.05693·cs.LG·April 22, 2025·1 cites

TL;DR

Agile-Quant introduces an activation-guided quantization framework and an end-to-end accelerator for large language models, using 4-bit and 8-bit quantization to speed up inference on edge devices with minimal loss in task performance.

Contribution

The paper presents a novel activation-aware quantization method and an efficient accelerator for LLMs, achieving significant speedups on edge devices while maintaining accuracy.

Findings

01. Achieves up to 2.55x speedup on edge devices with 4-bit and 8-bit quantization.

02. Maintains task performance comparable to weight-only quantization methods.

03. Successfully applies quantization to LLaMA, OPT, and BLOOM models.

Abstract

Large Language Models (LLMs) stand out for their impressive performance in intricate language modeling tasks. However, their demanding computational and memory needs pose obstacles for broad use on edge devices. Quantization is then introduced to boost LLMs' on-device efficiency. Recent works show that 8-bit or lower weight quantization is feasible with minimal impact on end-to-end task performance, while the activation is still not quantized. On the other hand, mainstream commodity edge devices still struggle to execute these sub-8-bit quantized networks effectively. In this paper, we propose Agile-Quant, an activation-guided quantization framework for popular Large Language Models (LLMs), and implement an end-to-end accelerator on multiple edge devices for faster inference. Considering the hardware profiling and activation analysis, we first introduce a basic activation quantization…
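As a point of reference for the 4-bit and 8-bit settings mentioned above, the following is a minimal sketch of generic symmetric uniform quantization applied to a small vector of activation values. It is not the paper's activation-guided method, which the truncated abstract does not specify; the function names `quantize_symmetric` and `dequantize` and the sample values are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def quantize_symmetric(values: List[float], bits: int = 8) -> Tuple[List[int], float]:
    """Map floats to signed integers in [-qmax, qmax] using one shared scale.

    Generic per-tensor symmetric quantization (an assumption for illustration),
    not the scheme proposed in the paper.
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    max_abs = max((abs(v) for v in values), default=0.0)
    # Avoid division by zero when every value is 0.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in quantized]


# Hypothetical activation values, used only to exercise the functions.
activations = [0.5, -1.0, 0.25, 2.0]
q8, s8 = quantize_symmetric(activations, bits=8)
q4, s4 = quantize_symmetric(activations, bits=4)

for name, q, s in (("int8", q8, s8), ("int4", q4, s4)):
    recon = dequantize(q, s)
    worst = max(abs(a - r) for a, r in zip(activations, recon))
    print(f"{name}: codes={q} scale={s:.4f} max_abs_error={worst:.4f}")
```

With fewer bits the shared scale grows, so each integer step covers a wider range of real values and the rounding error per element can be larger; this is the basic trade-off any 4-bit versus 8-bit scheme has to manage.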

Code & Models

Repositories

shawnricecake/agile-quant (PyTorch, official)

Videos

Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge · Underline

Taxonomy

Topics: Topic Modeling · Ferroelectric and Negative Capacitance Devices · Natural Language Processing Techniques

Methods: Pruning · BLOOM · OPT