LeanQuant: Accurate and Scalable Large Language Model Quantization with Loss-error-aware Grid
Tianyi Zhang, Anshumali Shrivastava

TL;DR
LeanQuant introduces a loss-error-aware grid-based quantization method for large language models, improving accuracy and scalability while maintaining compatibility with popular frameworks, enabling efficient quantization of models like Llama-3.1 405B.
Contribution
It proposes a novel loss-error-aware grid approach that overcomes limitations of prior min-max affine grids, enhancing model quality and framework compatibility.
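For context, the min-max affine grid named above can be sketched in a few lines of NumPy. This is a minimal illustration of the baseline only, not LeanQuant's method; the function name `minmax_affine_quantize` and the toy weights are assumptions made for the example. It shows how a single outlier stretches the grid range, which is the kind of limitation a loss-error-aware grid is meant to address.

```python
import numpy as np

def minmax_affine_quantize(w, bits=4):
    """Round weights to a uniform grid spanning [min(w), max(w)]."""
    levels = 2 ** bits - 1                      # highest integer code
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / levels            # spacing between grid points
    if scale == 0.0:                            # constant input: avoid dividing by zero
        scale = 1.0
    codes = np.clip(np.round((w - w_min) / scale), 0, levels).astype(np.int64)
    dequant = codes * scale + w_min             # map codes back to real values
    return codes, dequant

# Toy weights (hypothetical): mostly small values plus one large outlier.
rng = np.random.default_rng(0)
w = rng.normal(size=256)
w[0] = 8.0                                      # the outlier sets the top of the grid
codes, w_hat = minmax_affine_quantize(w, bits=4)
print("mean squared error:", np.mean((w - w_hat) ** 2))
```

Because the grid endpoints are fixed by the extreme values, every weight is treated as equally important and most of the 16 levels can end up covering a sparsely populated range.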
Findings
Achieves high accuracy in quantizing large models like Llama-3.1 405B.
Demonstrates scalability by quantizing a 405B-parameter model in 21 hours.
Outperforms existing methods in model quality and versatility.
Abstract
Large language models (LLMs) have shown immense potential across various domains, but their high memory requirements and inference costs remain critical challenges for deployment. Post-training quantization (PTQ) has emerged as a promising technique to reduce memory requirements and decoding latency. However, recent accurate quantization methods often depend on specialized computations or custom data formats to achieve better model quality. This limits their compatibility with popular frameworks, because they require dedicated inference kernels tailored to specific hardware and software platforms, which hinders wider adoption. Furthermore, many competitive methods have high resource requirements and computational overhead for quantizing models, making it challenging to scale them to hundreds of billions of parameters. In response to these challenges, we propose LeanQuant (Loss-Error-Aware…
Peer Reviews
Decision: ICLR 2025 Poster
- They present an algorithm to tune the quantization grid using the layerwise loss-aware objective
- They present both non-uniform and uniform methods (the non-uniform method is flexible and leverages clustering, whereas the uniform method performs a constrained search over potential scale factors)
- They provide accelerated GPU kernels to solve the objective for their affine quantization approach
- They provide detailed analysis of their method against prior work in both uniform and non-uni
- Although the method of determining weight sensitivity using the layerwise loss is distinct, the approach of performing K-Means clustering in equation (6) to derive non-uniform datatypes is the same as in prior work (e.g., the sensitivity-weighted clustering approach to derive non-uniform datatypes in SqueezeLLM)
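The sensitivity-weighted clustering that the reviewer refers to can be illustrated with a small weighted 1-D k-means. This is only a sketch under stated assumptions: `weighted_kmeans_1d` is a hypothetical helper, and the per-weight sensitivities `s` are random positive placeholders rather than the Hessian-derived quantities used in LeanQuant or SqueezeLLM. It minimizes the weighted squared error between each weight and its assigned centroid, so grid points are pulled toward the weights marked as more sensitive.

```python
import numpy as np

def weighted_kmeans_1d(w, s, k=16, iters=50):
    """Lloyd's algorithm in 1-D minimizing sum_i s[i] * (w[i] - c[a_i])**2.

    w: weights to quantize; s: strictly positive per-weight sensitivities.
    """
    # Initialize centroids at evenly spaced quantiles of the weights.
    centroids = np.quantile(w, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        # Assignment step: nearest centroid (s[i] > 0 does not change the argmin).
        assign = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
        # Update step: each centroid becomes the sensitivity-weighted mean.
        for j in range(k):
            mask = assign == j
            if mask.any():                      # leave empty clusters where they are
                centroids[j] = np.average(w[mask], weights=s[mask])
    centroids = np.sort(centroids)
    assign = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
    return centroids, assign

# Toy data (hypothetical): random weights and placeholder sensitivities.
rng = np.random.default_rng(0)
w = rng.normal(size=1024)
s = rng.uniform(0.1, 10.0, size=1024)
centroids, assign = weighted_kmeans_1d(w, s, k=16)
print("weighted error:", np.sum(s * (w - centroids[assign]) ** 2))
```

With `k=16` the centroids form a 4-bit non-uniform grid; each weight is stored as a 4-bit index into `centroids`.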
The idea is simple and effective, compatible with iterative loss-error-based quantization approaches such as GPTQ. The calibration process is relatively lightweight, allowing for the quantization of models up to 405B. The experimental design of this paper considers the latest model series and different model sizes, which is comprehensive.
LeanQuant is built on the core idea of using the "inverse Hessian diagonals" metric to quantify the importance of weights. Similar ideas have also been introduced in works such as SqueezeLLM [1], which uses the "Hessian diagonal" metric, derived directly from the quantization loss function, to identify important weights. Compared to the metric in SqueezeLLM, I think the derivation of the metric in LeanQuant is relatively heuristic. Although LeanQuant beats SqueezeLLM in the final performance, it would b
1. LeanQuant calibrates quantization grids, which is a novel approach in the post-training quantization of LLMs.
2. The fused CUDA kernels make the calibration process efficient, which makes the complex grid learning procedure more accessible for broad application.
Overall, several details and evaluation results are missing, which makes it difficult to fully understand and be convinced of the effectiveness of LeanQuant.
1. A detailed explanation of how LeanQuant can be extended to non-uniform quantization is missing.
2. Although Section 4 (Experiments) states that the proposed method will be compared with AWQ, GPTQ, and OmniQuant, the evaluation results for AWQ are not presented in the main text (some results for AWQ are included in the appendix only).
3.
Code & Models
- 🤗 xmadai/Llama-3.1-405B-Instruct-xMADai-INT4 · model · 64 dl · ♡ 6
- 🤗 xmadai/Mistral-Large-Instruct-2407-xMADai-INT4 · model · 63 dl · ♡ 7
- 🤗 xmadai/gemma-2-9b-it-xMADai-INT4 · model · 5 dl · ♡ 4
- 🤗 xmadai/Llama-3.1-Nemotron-70B-Instruct-xMADai-INT4 · model · 7 dl · ♡ 4
- 🤗 xmadai/Llama-3.1-70B-Instruct-xMADai-INT4 · model · 1 dl
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
