I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models
Xing Hu, Yuan Cheng, Dawei Yang, Zhihang Yuan, Jiangyong Yu, Chen Xu, Sifan Zhou

TL;DR
This paper introduces I-LLM, a novel integer-only quantization framework for large language models that reduces floating-point operations, enabling efficient deployment on edge and cloud devices without significant accuracy loss.
Contribution
I-LLM is the first framework to achieve fully integer-only quantization for LLMs, addressing activation fluctuations with new smoothing and dynamic quantization techniques.
Findings
Achieves comparable accuracy to floating-point models at W4A4 quantization.
Outperforms existing non-integer quantization methods.
Enables efficient integer-only inference for large language models.
Abstract
Post-training quantization (PTQ) serves as a potent technique to accelerate the inference of large language models (LLMs). Nonetheless, existing works still necessitate a considerable number of floating-point (FP) operations during inference, including additional quantization and de-quantization, as well as non-linear operators such as RMSNorm and Softmax. This limitation hinders the deployment of LLMs on edge and cloud devices. In this paper, we identify that the primary obstacle to integer-only quantization for LLMs lies in the large fluctuation of activations across channels and tokens in both linear and non-linear operations. To address this issue, we propose I-LLM, a novel integer-only fully-quantized PTQ framework tailored for LLMs. Specifically, (1) we develop Fully-Smooth Block-Reconstruction (FSBR) to aggressively smooth inter-channel variations of all activations and weights.…
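Illustrative sketch (not taken from the paper): the abstract notes that existing methods still rely on floating-point quantization and de-quantization steps between layers. One common, generic way to avoid them is dyadic requantization, where the floating-point rescale factor is approximated offline by an integer multiplier and a bit shift. The Python below shows that idea for a single dot product. The function names, the 24-bit shift, and the output scale `sy` are assumptions made for illustration; this is not I-LLM's FSBR or its dynamic quantization scheme.

```python
def quantize(values, n_bits=8):
    """Symmetric per-tensor quantization (done offline): ints plus a float scale."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def to_dyadic(multiplier, shift=24):
    """Approximate a positive float multiplier as m / 2**shift with integer m."""
    return round(multiplier * (1 << shift)), shift


def int_dot_requant(x_q, w_q, m, shift, n_bits=8):
    """Integer dot product followed by an integer-only rescale.

    Only integer multiply, add, shift, and clamp are used at inference time.
    """
    qmax = 2 ** (n_bits - 1) - 1
    acc = sum(a * b for a, b in zip(x_q, w_q))        # integer accumulation
    y_q = (acc * m + (1 << (shift - 1))) >> shift     # rounding right shift
    return max(-qmax, min(qmax, y_q))


# Hypothetical inputs, chosen only for illustration.
x = [0.5, -1.25, 2.0, 0.75]
w = [1.0, 0.5, -0.25, 2.0]
y_ref = sum(a * b for a, b in zip(x, w))              # floating-point reference

x_q, sx = quantize(x)
w_q, sw = quantize(w)
sy = 0.05                                             # assumed output scale from offline calibration
m, shift = to_dyadic(sx * sw / sy)                    # computed once, before inference

y_q = int_dot_requant(x_q, w_q, m, shift)
print(y_q, y_q * sy, y_ref)
```

The floating-point arithmetic here is confined to the offline steps (`quantize` and `to_dyadic`); the inference-time function `int_dot_requant` touches only integers. The final multiplication by `sy` is shown solely to compare against the floating-point reference.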
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
Methods: Softmax · Root Mean Square Layer Normalization
