Integer Scale: A Free Lunch for Faster Fine-grained Quantization of LLMs
Qingyuan Li, Ran Meng, Yiduo Li, Bo Zhang, Yifan Lu, Yerui Sun, Lin Ma, Yuchen Xie

TL;DR
Integer Scale is a post-training quantization method that accelerates large language model inference without additional calibration or fine-tuning. It yields up to a 1.85x end-to-end speedup over the fine-grained quantization methods it plugs into, and up to 2.31x over FP16, with minimal accuracy loss.
Contribution
It introduces a plug-and-play quantization scheme that can be applied to most existing fine-grained quantization methods, enabling faster large language model inference without extra calibration or fine-tuning cost.
Findings
Up to 2.31x end-to-end speedup over FP16 on LLaMA-3 models
Negligible performance degradation with Integer Scale
Compatible with most fine-grained quantization methods
Abstract
We introduce Integer Scale, a novel post-training quantization scheme for large language models that effectively resolves the inference bottleneck in current fine-grained quantization approaches while maintaining similar accuracy. Integer Scale is a free lunch, as it requires no extra calibration or fine-tuning, which would otherwise incur additional costs. It can be used plug-and-play with most fine-grained quantization methods. Its integration results in at most a 1.85x end-to-end speed boost over the original counterpart with comparable accuracy. Additionally, by combining the proposed Integer Scale with fine-grained quantization, we resolve the quantization difficulty for Mixtral-8x7B and LLaMA-3 models with negligible performance degradation, with end-to-end speed boosts of 2.13x and 2.31x, respectively, compared with their FP16 versions.
Peer Reviews
Decision·ICLR 2025 Conference Withdrawn Submission
The article is well-written and includes numerous diagrams, enhancing its clarity and facilitating understanding. The proposed method is simple and straightforward. Extensive experiments have been conducted, comparing both accuracy and speed against various alternative methods.
This paper lacks novelty, as the concept of Integer Scale has been discussed in numerous quantization studies[1,2,3]. The absence of speed and accuracy comparisons with previous full integer quantization methods renders the findings unconvincing. Additionally, the proposed method of using a scale amplifier to convert floating-point scale to integer scale aligns with the widely used round-to-nearest fixed-point quantization technique. [1] Training High-Performance and Large-Scale Deep Neural Net
Integer Scale addresses inference bottlenecks in fine-grained quantization methods while maintaining accuracy. Integer Scale achieves up to a 1.85x end-to-end speed boost compared to FP16 precision and outperforms existing methods such as W4A16 and W4A8, all while requiring no additional calibration or fine-tuning—making it a "free lunch" solution for implementation. The comprehensive evaluation covers a variety of model architectures, including Llama models and the MoE model (Mixtral 8x7B). In
The Integer Scale formulation appears to build on concepts introduced in VSQuant, which suggested the use of FP16 scaling factors for model quantization, followed by quantizing these factors to integers. Additionally, the implementation seems to draw inspiration from established frameworks such as FastGEMM and Atom. This overlap may make it somewhat challenging to distinguish the original contributions of this paper.
The method of this work was simple.
1. The proposed method in this paper is rather simplistic, essentially functioning as a straightforward parameter manipulation technique. The approach lacks the depth or complexity one might expect in advanced quantization methods for large language models.
2. The paper’s structure and presentation are also suboptimal, deviating from standard conventions in academic writing. This lack of clarity and cohesion detracts from its readability and scholarly rigor.
3. Furthermore, the comparative ana
a. Originality: The paper offers a reasonably original contribution by addressing the performance issues inherent in per-group quantization for low-bit weight-only quantization methods. By introducing an integer scaling method in this context, it directly and effectively resolves the problem where grouped matrix multiplication results require extensive type conversions (I32toF32), which previously diminished any speed advantages.
b. Quality: The paper demonstrates thorough experimentation. It val
a. Limitations of the Integer Scale Method and Potential for Overflow: The Integer Scale method proposed in the paper essentially involves converting floating-point scales to fixed-point integers. While this approach effectively reduces the need for type conversions (I32toF32) and improves computation speed in the cases presented, it may not be robust in scenarios where weight scales have larger values. If the float scales are significantly large, directly multiplying by the amplification factor
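The fixed-point conversion and the overflow concern raised in the reviews above can be illustrated with a small sketch. This is not the paper's kernel: it is plain Python that assumes symmetric 4-bit group quantization of weights, already-quantized integer activations, and an amplifier of 1024 chosen purely for illustration. All function names are hypothetical. The float-scale path converts every group's integer partial sum to float, while the integer-scale path keeps the accumulator in integers and converts once at the end.

```python
import random

INT32_MAX = 2**31 - 1

def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one weight group."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0.0:                                # all-zero group
        scale = 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def group_partials(q_groups, x_groups):
    """Integer dot product of each weight group with its activations."""
    return [sum(qi * xi for qi, xi in zip(q, x))
            for q, x in zip(q_groups, x_groups)]

def dot_float_scale(q_groups, scales, x_groups):
    """Baseline: one int-to-float conversion per group."""
    acc = 0.0
    for partial, s in zip(group_partials(q_groups, x_groups), scales):
        acc += float(partial) * s
    return acc

def dot_integer_scale(q_groups, scales, x_groups, amplifier=1024):
    """Integer-scale variant: scales become integers, one conversion at the end."""
    int_scales = [round(s * amplifier) for s in scales]
    acc = 0                                         # stays an integer
    for partial, si in zip(group_partials(q_groups, x_groups), int_scales):
        acc += partial * si
    return acc / amplifier, acc

def worst_case_accumulator(scales, group_size, amplifier=1024,
                           w_absmax=8, x_absmax=127):
    """Upper bound on |accumulator|, to compare against INT32_MAX."""
    per_group = group_size * w_absmax * x_absmax
    return sum(per_group * round(s * amplifier) for s in scales)

random.seed(0)
groups, group_size = 4, 128
w = [[random.uniform(-1, 1) for _ in range(group_size)] for _ in range(groups)]
x = [[random.randint(-127, 127) for _ in range(group_size)] for _ in range(groups)]
quantized = [quantize_group(g) for g in w]
q = [p[0] for p in quantized]
s = [p[1] for p in quantized]

ref = dot_float_scale(q, s, x)
approx, raw = dot_integer_scale(q, s, x)
print("float-scale result:  ", ref)
print("integer-scale result:", approx)
print("fits in int32 (worst case):",
      worst_case_accumulator(s, group_size) <= INT32_MAX)
```

Rounding each scale to an integer changes it by at most 0.5 / amplifier, so the two results differ by at most that amount times the summed magnitude of the group partial sums. The `worst_case_accumulator` helper makes the reviewers' concern concrete: the bound grows linearly with both the float scale and the amplifier, so large scales can push a 32-bit accumulator past its range.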
Taxonomy
Topics: Medical Imaging Techniques and Applications · Radiomics and Machine Learning in Medical Imaging
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
