Adaptive Layer-Wise Transformations for Post-Training Quantization of Large Language Models
Cuong Pham, Hoang Anh Dung, Cuong C. Nguyen, Trung Le, Gustavo Carneiro, Jianfei Cai, Thanh-Toan Do

TL;DR
This paper introduces an adaptive, layer-wise transformation method for post-training quantization of large language models, significantly improving performance by selecting optimal transformations per layer based on weight distribution characteristics.
Contribution
It proposes a novel adaptive transformation selection framework that uses weight distribution kurtosis to efficiently determine the best transformation type for each layer in LLMs.
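A minimal sketch of kurtosis-driven selection, assuming NumPy. The summary above only says that weight kurtosis guides the per-layer choice; the function names, the `threshold` value, and the mapping from heavy-tailed layers to "rotation" and light-tailed layers to "affine" are all hypothetical placeholders, not the authors' rule.

```python
import numpy as np


def excess_kurtosis(weights):
    """Excess kurtosis of a flattened weight tensor (0 for a Gaussian)."""
    w = np.asarray(weights, dtype=np.float64).ravel()
    centered = w - w.mean()
    var = np.mean(centered ** 2)
    if var == 0.0:
        # A constant tensor has no tails; report 0 instead of dividing by zero.
        return 0.0
    return float(np.mean(centered ** 4) / var ** 2 - 3.0)


def select_transformations(layer_weights, threshold=1.0,
                           heavy_tail_choice="rotation",
                           light_tail_choice="affine"):
    """Pick one transformation label per layer from its weight kurtosis.

    ASSUMPTION: the threshold and the direction of the mapping are
    illustrative only; the source does not state the actual rule.
    """
    return {
        name: heavy_tail_choice if excess_kurtosis(w) > threshold
        else light_tail_choice
        for name, w in layer_weights.items()
    }


# Toy "layers": one Gaussian (light tails), one Laplace (heavy tails).
rng = np.random.default_rng(0)
layers = {
    "gauss": rng.normal(size=(200, 500)),
    "laplace": rng.laplace(size=(200, 500)),
}
choice = select_transformations(layers)
```

The point of the sketch is only that the statistic is cheap: one pass over each layer's weights, with no calibration data or gradient steps.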
Findings
Achieves up to a 4.58-point perplexity improvement.
Gains 2.11% in zero-shot accuracy across six tasks.
Outperforms existing fixed transformation methods.
Abstract
Large language models require significant computational resources for deployment, making quantization essential for practical applications. However, the main obstacle to effective quantization lies in systematic outliers in activations and weights, which cause substantial LLM performance degradation, especially at low-bit settings. While existing transformation-based methods like affine and rotation transformations successfully mitigate outliers, they apply a homogeneous transformation setting, i.e., the same transformation type across all layers, ignoring the heterogeneous distribution characteristics within LLMs. In this paper, we propose an adaptive transformation selection framework that systematically determines optimal transformations on a per-layer basis. To this end, we first formulate transformation selection as a differentiable optimization problem to achieve the…
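To illustrate why a rotation can mitigate outliers without changing the layer's output, here is a small sketch assuming NumPy. It uses a normalized Hadamard matrix as one example of an orthogonal rotation; whether the paper's rotations are fixed Hadamard matrices or learned is not stated in this summary, and the sizes and the outlier value are made up.

```python
import numpy as np


def hadamard(n):
    """Normalized Hadamard matrix (Sylvester construction), n a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.ones((1, 1))
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)  # scaling makes the matrix orthogonal


rng = np.random.default_rng(0)
n = 64
x = rng.normal(size=n)
x[3] = 50.0                      # one synthetic outlier channel
w = rng.normal(size=(n, 16))     # toy weight matrix

q = hadamard(n)
x_rot = x @ q                    # rotated activation
w_rot = q.T @ w                  # weights absorb the inverse rotation

# Because q @ q.T is the identity, x_rot @ w_rot equals x @ w up to
# floating-point error, while the outlier's energy is spread over all
# channels of x_rot, shrinking the range a quantizer must cover.
```

Since `(x @ q) @ (q.T @ w)` equals `x @ w`, the rotation is free in full precision and only matters once `x_rot` and `w_rot` are quantized.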
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The empirical results demonstrate significant performance improvements compared to previous methods.
**Limited Novelty:** The proposed method closely resembles a combination of *FlatQuant* and a rotation-based method, which reduces the perceived novelty of the contribution. **Clarity and Organization:** The paper is difficult to follow, and the connections between the different contributions are not clearly articulated. For example, what is meant by a rotation-based transformation: is it a Hadamard transform, or is it also a learnable transform? **Limited Experimental Scope:**
1. Good Compatibility: It can be easily integrated with existing methods. 2. Rigorous Experiments: The paper provides detailed parameter analysis and correlation analysis, among others. 3. The paper is well-written and easy to follow.
1. Limited Innovation: The approach seems to be a simple integration of existing quantization methods through manually defined heuristic rules. The overall potential for improvement is limited. 2. Limited Applicability: The analysis and experimental conclusions are focused on the Llama model family. It is unclear whether the framework is effective on other model families, such as Qwen. Would the heuristic strategy still be effective? 3. Limited Improvement: In the W3A3K3V3 and W4A4K4V4 settings,
1. The proposed heuristic metric is verified thoroughly by various empirical studies. 2. The proposed method is well assessed via experiments. 3. The paper is overall well-structured, making it easy to follow.
1. Lack of illustrations about how to implement the mixed method. As far as I know, FlatQuant and orthogonal transformation (if keeping computational invariance) are global across layers, how do we combine both of them without conflicts? 2. Lack of implementation details regarding the hardware part. There is no description of how to obtain the statistics in Table 5. In addition, why is the QuaRot method slower than the proposed method, since it is a mixed one that may involve more operations and
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
