Layer-Wise High-Impact Parameter Ratio Optimization in Post-Training Quantization for Large Language Models
Cuong Pham, Hoang Anh Dung, Cuong C. Nguyen, Trung Le, Gustavo Carneiro, Thanh-Toan Do

TL;DR
This paper introduces a layer-wise optimization method for post-training quantization of large language models. It balances accuracy and efficiency by setting a separate ratio of high-impact parameters for each layer instead of one fixed ratio for the whole model.
Contribution
It proposes a quadratic optimization framework that determines layer-specific ratios of high-impact parameters, aiming to improve accuracy over fixed-ratio allocation at low bit-widths.
Findings
Achieves better accuracy than fixed-ratio methods.
Reduces computational overhead in quantization.
Maintains high performance at low bit-widths.
Abstract
Large language models (LLMs) have significantly advanced natural language processing, but their massive parameter counts create substantial computational and memory challenges during deployment. Post-training quantization (PTQ) has emerged as a promising approach to mitigate these challenges with minimal overhead. While existing PTQ methods can effectively quantize LLMs, they experience substantial accuracy loss at extremely low bit-widths, primarily due to high-impact parameters that significantly influence quantization performance. Several approaches address these issues by identifying and retaining the high-impact parameters in FP16 format. However, they apply fixed ratios of high-impact parameters across all layers, overlooking layer-wise sensitivity variations. In this paper, we propose a quadratic optimization framework that determines layer-specific ratios of high-impact…
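To make the idea of layer-specific ratios concrete, here is a minimal sketch of a discrete allocation problem solved by exhaustive search. It is not the paper's formulation: the abstract shown here is truncated, so the toy objective `sum_l sensitivity[l] * (1 - r_l)**2`, the function name `allocate_ratios`, the ratio grid, and all numbers below are assumptions chosen only for illustration.

```python
import itertools
import numpy as np

def allocate_ratios(sensitivity, layer_sizes, budget_ratio, grid):
    """Pick one high-impact ratio per layer from `grid` by brute force.

    Toy objective (an assumption, not the paper's equations):
        minimize   sum_l sensitivity[l] * (1 - r_l)**2
        subject to sum_l layer_sizes[l] * r_l <= budget_ratio * sum(layer_sizes)
    """
    sensitivity = np.asarray(sensitivity, dtype=float)
    layer_sizes = np.asarray(layer_sizes, dtype=float)
    budget = budget_ratio * layer_sizes.sum()
    best_cost, best_ratios = float("inf"), None
    # Enumerate every combination of grid values; only viable for a few layers.
    for combo in itertools.product(grid, repeat=len(sensitivity)):
        r = np.array(combo, dtype=float)
        if layer_sizes @ r > budget + 1e-9:  # skip allocations over budget
            continue
        cost = float(sensitivity @ (1.0 - r) ** 2)
        if cost < best_cost:
            best_cost, best_ratios = cost, r
    return best_ratios, best_cost

# Hypothetical three-layer model with made-up sensitivity scores.
sens = [4.0, 1.0, 0.25]
sizes = [100, 100, 100]
grid = [0.0, 0.05, 0.10, 0.15]

ratios, cost = allocate_ratios(sens, sizes, budget_ratio=0.05, grid=grid)
# Baseline: the same 5% ratio applied to every layer.
fixed_cost = float(np.sum(np.array(sens) * (1.0 - 0.05) ** 2))
print("layer-wise ratios:", ratios, "cost:", cost)
print("fixed-ratio cost:", fixed_cost)
```

With these made-up numbers, the search spends the whole budget on the most sensitive layer and reaches a lower toy cost than the uniform 5% allocation. Exhaustive search grows exponentially with the number of layers, so a real model would need a proper solver or a heuristic.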
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
1. Empirically strong. The gains in the W2A16/W2A16g64 configurations are substantial and consistent across LLaMA-2-7B/13B and multiple benchmarks.
2. Reasonable ablations and interpretability. Ablations disentangle the effect of layer-wise ratio optimization from the hybrid quantizer choice. Visualizations of learned ratios provide interpretable evidence that different layers receive different high-precision budgets, which aligns with the sensitivity analysis.
3. Sharp problem focus (extreme l
1. Limited detail on the optimization solver and scalability. While the quadratic objective and constraints are described, the paper does not fully specify how the discrete quadratic optimization is solved in practice (exact solver vs. heuristic, convergence behavior, any approximations).
2. Limited model and task diversity. Experiments focus on LLaMA-2-7B/13B and standard language modeling and commonsense reasoning benchmarks. There are no experiments on other types of models (e.g., the Qwen series)
- Significance (Valid Problem Formulation): The paper provides strong motivation by empirically demonstrating the non-uniform distribution of parameter sensitivity (using Fisher information in Fig. 1 & A.1). This clearly shows the limitations of a fixed-ratio allocation and establishes a strong need for the proposed layer-specific approach.
- Originality (Principled Optimization Framework): The paper reformulates the layer-wise ratio selection as a quadratic optimization problem (Eq. 5-9), deriv
1. Issues with Experimental Fairness and Validity. The SOTA comparison results are insufficient to clearly prove the method's superiority.
- Reliance on Cited Results: Most SOTA results in Tables 1 and 2 (for GPTQ, OmniQuant, CBQ, etc.) are cited from other papers, not reproduced by the authors in a controlled environment. This makes a fair comparison difficult due to potential differences in calibration datasets, preprocessing, and implementation details.
- Unfair Average Bit (Avg. bit) Comparis
1. The motivation is clear: the method uses channel-wise mixed-precision quantization and sets the high-precision proportion through layer-wise sensitivity.
2. The background introduction is comprehensive.
1. The experimental results are weak. This paper focuses on weight-only quantization and should be compared with more recent state-of-the-art methods, such as EfficientQAT [1], DB-LLM [2], QuIP# [3], and ParetoQ [4].
2. The inference efficiency of the proposed mixed-precision quantization should be measured.
3. It should be compared with other mixed-precision quantization methods, such as SqueezeLLM, at the same average bits.
[1] EfficientQAT: Efficient quantization-aware training for large language models, ACL 2025 [
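The "same average bits" condition raised in the review above can be made concrete with simple storage arithmetic. The sketch below assumes a layer where a fraction of weights stays in FP16 and the rest are stored at a low bit-width, optionally with one scale (and zero point) per group. The function name `average_bits` and its parameters are hypothetical, and the accounting ignores any index overhead for locating the FP16 weights, which may differ from how the paper reports its Avg. bit column.

```python
def average_bits(ratio_fp16, low_bits, group_size=None, scale_bits=16, zero_bits=0):
    """Average storage bits per weight for a mixed FP16 / low-bit layer.

    ratio_fp16: fraction of weights kept in FP16 (0.0 to 1.0).
    low_bits:   bit-width of the remaining quantized weights.
    group_size: if given, each group of this many quantized weights
                also stores `scale_bits + zero_bits` of metadata.
    """
    if not 0.0 <= ratio_fp16 <= 1.0:
        raise ValueError("ratio_fp16 must lie in [0, 1]")
    overhead = 0.0
    if group_size is not None:
        # Per-weight share of the per-group scale and zero point.
        overhead = (scale_bits + zero_bits) / group_size
    return ratio_fp16 * 16 + (1.0 - ratio_fp16) * (low_bits + overhead)

# Pure 2-bit weights, no FP16 weights and no group metadata.
print(average_bits(0.0, 2))
# 2-bit weights with 5% of weights kept in FP16.
print(average_bits(0.05, 2))
# 2-bit weights with one FP16 scale per group of 64 weights.
print(average_bits(0.0, 2, group_size=64))
```

Two methods that both report "2-bit" weights can therefore differ in effective storage once FP16 weights and group metadata are counted, which is why a matched average bit-width matters for the comparison.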
Combining AdaRound only for high-impact parameters is an efficient and good idea. It seems reasonable to apply distinct quantization strategies to important parameters and regular ones. Empirical results demonstrate improved perplexity over OmniQuant and CBQ, particularly under 2–3 bit quantization.
Although assigning different quantization strategies to important and regular parameters is an interesting direction, the proposed approach seems relatively complex. Moreover, the paper does not discuss whether the AdaRound and OmniQuant strategies might conflict or interfere with each other. The main experimental results (Tables 1, 2, and 3) are primarily based on the LLaMA models. Whether the proposed method generalizes well to other models remains to be further investigated. While efficienc
Taxonomy
Topics: Advanced Neural Network Applications · Topic Modeling · Natural Language Processing Techniques
