I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models

Xing Hu; Yuan Cheng; Dawei Yang; Zhihang Yuan; Jiangyong Yu; Chen Xu; Sifan Zhou

arXiv:2405.17849 · cs.LG · June 6, 2024 · 1 citation

Open Access

TL;DR

This paper introduces I-LLM, a novel integer-only quantization framework for large language models that reduces floating-point operations, enabling efficient deployment on edge and cloud devices without significant accuracy loss.

Contribution

I-LLM is the first framework to achieve fully integer-only quantization for LLMs, addressing activation fluctuations with new smoothing and dynamic quantization techniques.

Findings

1. Achieves accuracy comparable to floating-point models at W4A4 quantization (4-bit weights, 4-bit activations).

2. Outperforms existing non-integer quantization methods.

3. Enables efficient integer-only inference for large language models.
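To make the W4A4 setting concrete, the following is a minimal sketch of symmetric per-tensor quantization to signed 4-bit integers followed by an integer dot product. It is a generic illustration, not the paper's method: the helper name `quantize_symmetric`, the per-tensor max-abs scale, and the sample values are all assumptions chosen for this example.

```python
def quantize_symmetric(values, bits=4):
    """Map floats to signed integers in [-(2**(bits-1)), 2**(bits-1) - 1].

    Hypothetical helper: uses one max-abs scale for the whole list.
    """
    qmax = 2 ** (bits - 1) - 1
    qmin = -(2 ** (bits - 1))
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [min(qmax, max(qmin, round(v / scale))) for v in values]
    return q, scale


# Illustrative values only (not from the paper).
weights = [0.7, -0.35, 0.1, 0.0]
acts = [1.4, 0.2, -0.6, 2.1]

qw, sw = quantize_symmetric(weights)  # "W4": 4-bit weights
qa, sa = quantize_symmetric(acts)     # "A4": 4-bit activations

# The dot product itself runs on integers only.
int_acc = sum(w * a for w, a in zip(qw, qa))

# Mapping the accumulator back to real values still needs a
# floating-point multiply by the two scales.
approx = int_acc * sw * sa
exact = sum(w * a for w, a in zip(weights, acts))

print(qw, qa, int_acc)
print(approx, exact)
```

Note that the final rescale by `sw * sa` is a floating-point step; removing steps of this kind is what an integer-only pipeline is after.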

Abstract

Post-training quantization (PTQ) is a potent technique for accelerating the inference of large language models (LLMs). Nonetheless, existing works still require a considerable number of floating-point (FP) operations during inference, including additional quantization and de-quantization steps as well as non-linear operators such as RMSNorm and Softmax. This limitation hinders the deployment of LLMs on edge and cloud devices. In this paper, we identify that the primary obstacle to integer-only quantization for LLMs lies in the large fluctuation of activations across channels and tokens in both linear and non-linear operations. To address this issue, we propose I-LLM, a novel integer-only fully-quantized PTQ framework tailored for LLMs. Specifically, (1) we develop Fully-Smooth Block-Reconstruction (FSBR) to aggressively smooth inter-channel variations of all activations and weights.…
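One common way to avoid floating-point rescaling between quantized layers is to approximate the real-valued multiplier by a dyadic number m / 2^s, so that the rescale becomes an integer multiply followed by a bit shift. The sketch below illustrates that generic idea only; it is not drawn from the abstract above, and the names `to_dyadic` and `requantize`, the 16-bit shift, and the sample multiplier are assumptions made for this example.

```python
def to_dyadic(multiplier, shift_bits=16):
    """Approximate a positive real multiplier as m / 2**shift_bits.

    Hypothetical helper: this conversion is done once, offline,
    so floating point is not needed at inference time.
    """
    m = round(multiplier * (1 << shift_bits))
    return m, shift_bits


def requantize(acc, m, shift):
    """Rescale an integer accumulator with integer multiply, add, and shift.

    Adding 2**(shift - 1) before the shift rounds to nearest
    (ties round upward) instead of truncating.
    """
    rounding = 1 << (shift - 1)
    return (acc * m + rounding) >> shift


# Illustrative multiplier (for example, a product of layer scales).
m, s = to_dyadic(0.0123)

for acc in (1000, -1000, 0):
    print(acc, requantize(acc, m, s), acc * 0.0123)
```

A larger `shift_bits` gives a closer approximation of the multiplier at the cost of a wider intermediate product, which is the usual trade-off when choosing the shift.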

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques

Methods: Softmax · Root Mean Square Layer Normalization