Integer Scale: A Free Lunch for Faster Fine-grained Quantization of LLMs

Qingyuan Li; Ran Meng; Yiduo Li; Bo Zhang; Yifan Lu; Yerui Sun; Lin Ma; Yuchen Xie

arXiv:2405.14597·cs.LG·May 29, 2024

Open Access · 4 Reviews

TL;DR

Integer Scale is a post-training quantization method that significantly accelerates large language model inference without additional calibration or fine-tuning, achieving over 2x speedups with minimal accuracy loss.

Contribution

It introduces a plug-and-play quantization scheme that enhances existing methods, enabling faster inference for large language models without extra costs.

Findings

01. Up to 2.31x speedup on LLaMA-3 models

02. Negligible performance degradation with Integer Scale

03. Compatible with most fine-grained quantization methods

Abstract

We introduce Integer Scale, a novel post-training quantization scheme for large language models that effectively resolves the inference bottleneck in current fine-grained quantization approaches while maintaining similar accuracies. Integer Scale is a free lunch because it requires no extra calibration or fine-tuning, which would otherwise incur additional costs. It can be used plug-and-play with most fine-grained quantization methods. Its integration yields up to a 1.85x end-to-end speed boost over the original counterpart with comparable accuracy. Additionally, by combining the proposed Integer Scale with fine-grained quantization, we resolve the quantization difficulty for Mixtral-8x7B and LLaMA-3 models with negligible performance degradation, with end-to-end speed boosts of 2.13x and 2.31x, respectively, over their FP16 versions.

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 3 · Confidence 4

Strengths

The article is well-written and includes numerous diagrams, enhancing its clarity and facilitating understanding. The proposed method is simple and straightforward. Extensive experiments have been conducted, comparing both accuracy and speed against various alternative methods.

Weaknesses

This paper lacks novelty, as the concept of Integer Scale has been discussed in numerous quantization studies [1, 2, 3]. The absence of speed and accuracy comparisons with previous full integer quantization methods renders the findings unconvincing. Additionally, the proposed method of using a scale amplifier to convert floating-point scale to integer scale aligns with the widely used round-to-nearest fixed-point quantization technique.

[1] Training High-Performance and Large-Scale Deep Neural Net

Reviewer 02 · Rating 6 · Confidence 3

Strengths

Integer Scale addresses inference bottlenecks in fine-grained quantization methods while maintaining accuracy. Integer Scale achieves up to a 1.85x end-to-end speed boost compared to FP16 precision and outperforms existing methods such as W4A16 and W4A8, all while requiring no additional calibration or fine-tuning—making it a "free lunch" solution for implementation. The comprehensive evaluation covers a variety of model architectures, including Llama models and the MoE model (Mixtral 8x7B). In

Weaknesses

The Integer Scale formulation appears to build on concepts introduced in VSQuant, which suggested the use of FP16 scaling factors for model quantization, followed by quantizing these factors to integers. Additionally, the implementation seems to draw inspiration from established frameworks such as FastGEMM and Atom. This overlap may make it somewhat challenging to distinguish the original contributions of this paper.
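
As a rough illustration of the float-to-integer scale conversion this review refers to, the sketch below multiplies per-group floating-point scales by an amplifier and rounds them to integers. The function names, the sample scales, the amplifier value `2**10`, and the clamp to a minimum of 1 are assumptions made for illustration; they are not details confirmed by the paper or by VSQuant.

```python
# Hypothetical sketch: convert float group scales to integer scales.
# The amplifier value (2**10) and the clamp to >= 1 are illustrative assumptions.

def to_integer_scales(float_scales, amplifier=2**10):
    """Multiply each float scale by the amplifier and round to an integer.

    Python's round() rounds halves to even; none of the sample values
    below land exactly on a half, so this acts as plain round-to-nearest.
    """
    return [max(1, round(s * amplifier)) for s in float_scales]


def relative_errors(float_scales, int_scales, amplifier=2**10):
    """Relative error introduced by replacing s with int_scale / amplifier."""
    return [abs(i / amplifier - s) / s for s, i in zip(float_scales, int_scales)]


float_scales = [0.0123, 0.0457, 0.0031, 0.0209]  # made-up group scales
int_scales = to_integer_scales(float_scales)      # [13, 47, 3, 21]
errors = relative_errors(float_scales, int_scales)
```

In this toy example the smallest scale suffers the largest relative rounding error, which is why the choice of amplifier matters: a larger amplifier reduces rounding error but produces larger integer scales.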

Reviewer 03 · Rating 1 · Confidence 5

Strengths

The method is simple.

Weaknesses

1. The proposed method in this paper is rather simplistic, essentially functioning as a straightforward parameter manipulation technique. The approach lacks the depth or complexity one might expect in advanced quantization methods for large language models.

2. The paper’s structure and presentation are also suboptimal, deviating from standard conventions in academic writing. This lack of clarity and cohesion detracts from its readability and scholarly rigor.

3. Furthermore, the comparative ana

Reviewer 04 · Rating 5 · Confidence 5

Strengths

a. Originality: The paper offers a reasonably original contribution by addressing the performance issues inherent in per-group quantization for low-bit weight-only quantization methods. By introducing an integer scaling method in this context, it directly and effectively resolves the problem where grouped matrix multiplication results require extensive type conversions (I32toF32), which previously diminished any speed advantages.

b. Quality: The paper demonstrates thorough experimentation. It val
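
To make the type-conversion point concrete, here is a minimal sketch of a single grouped dot product computed two ways: with float group scales (one integer-to-float conversion per group) and with integer group scales (one conversion at the end). The function names, the toy values, and the amplifier of `2**10` are illustrative assumptions, and the activation scale is omitted for brevity; this is not the paper's kernel.

```python
# Hypothetical sketch of one output element of a per-group quantized GEMM.
# Names, toy data, and the amplifier value are assumptions for illustration.

def group_partial(x_q, w_q, start, group_size):
    """Integer dot product over one group of quantized values."""
    end = start + group_size
    return sum(a * b for a, b in zip(x_q[start:end], w_q[start:end]))


def dot_float_scale(x_q, w_q, scales, group_size):
    """Baseline: convert every group's integer partial sum to float, then scale."""
    acc = 0.0
    for g, s in enumerate(scales):
        partial = group_partial(x_q, w_q, g * group_size, group_size)
        acc += float(partial) * s  # one int-to-float conversion per group
    return acc


def dot_integer_scale(x_q, w_q, int_scales, group_size, amplifier):
    """Integer scales: accumulate in integers, convert once at the end."""
    acc = 0
    for g, s in enumerate(int_scales):
        partial = group_partial(x_q, w_q, g * group_size, group_size)
        acc += partial * s  # stays in integer arithmetic
    return acc / amplifier  # single conversion and de-amplification


x_q = [3, -2, 5, 1, -4, 2, 0, 7]   # toy quantized activations
w_q = [1, -3, 2, 7, -8, 4, 6, -5]  # toy quantized weights
scales = [0.0123, 0.0457]          # toy float scales, one per group of 4
amplifier = 2**10
int_scales = [round(s * amplifier) for s in scales]

y_float = dot_float_scale(x_q, w_q, scales, group_size=4)
y_int = dot_integer_scale(x_q, w_q, int_scales, group_size=4, amplifier=amplifier)
```

The two results differ only by the rounding error of the integer scales; the number of conversions drops from one per group to one per output element.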

Weaknesses

a. Limitations of the Integer Scale Method and Potential for Overflow: The Integer Scale method proposed in the paper essentially involves converting floating-point scales to fixed-point integers. While this approach effectively reduces the need for type conversions (I32toF32) and improves computation speed in the cases presented, it may not be robust in scenarios where weight scales have larger values. If the float scales are significantly large, directly multiplying by the amplification factor
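
The overflow concern raised here can be checked with a simple worst-case bound. The sketch below assumes 8-bit activations, 4-bit weights, and a 32-bit signed accumulator; the function names and the layer dimensions are hypothetical, and the bound is deliberately pessimistic because it assumes every product takes its maximum magnitude.

```python
# Hypothetical worst-case overflow check for an integer-scaled accumulator.
# Bit widths, layer sizes, and function names are illustrative assumptions.

INT32_MAX = 2**31 - 1


def worst_case_accumulator(in_features, group_size, max_int_scale,
                           act_bits=8, weight_bits=4):
    """Upper bound on |accumulator| if every term hits its maximum magnitude."""
    max_act = 2 ** (act_bits - 1)        # 128 for signed 8-bit
    max_weight = 2 ** (weight_bits - 1)  # 8 for signed 4-bit
    n_groups = in_features // group_size
    return n_groups * group_size * max_act * max_weight * max_int_scale


def fits_in_int32(in_features, group_size, max_int_scale):
    """True if even the worst case stays within a signed 32-bit accumulator."""
    return worst_case_accumulator(in_features, group_size, max_int_scale) <= INT32_MAX


small_scale_ok = fits_in_int32(4096, 128, max_int_scale=47)    # e.g. 0.0457 * 1024
large_scale_ok = fits_in_int32(4096, 128, max_int_scale=1024)  # e.g. 1.0 * 1024
```

Under these assumptions a small integer scale leaves ample headroom, while a float scale near 1.0 amplified by 1024 can exceed the 32-bit range in the worst case. Real activations and weights rarely hit that bound, so this check is conservative.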

Taxonomy

Topics: Medical Imaging Techniques and Applications · Radiomics and Machine Learning in Medical Imaging

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings