SliderQuant: Accurate Post-Training Quantization for LLMs
Shigeng Wang, Chao Li, Yangyuxuan Kang, Jiawei Fan, Zhonghong Ou, Anbang Yao

TL;DR
This paper introduces SliderQuant, a novel post-training quantization framework for LLMs that adaptively adjusts quantization across layers, significantly reducing errors and outperforming existing methods.
Contribution
We propose SliderQuant, a layer-sensitive PTQ method with adaptive sliding quantization, improving accuracy for various LLMs over existing techniques.
Findings
Outperforms existing PTQ methods on multiple LLM benchmarks.
Effectively reduces quantization errors across different layers.
Works well with weight-only and weight-activation quantization.
Abstract
In this paper, we address post-training quantization (PTQ) for large language models (LLMs) from an overlooked perspective: given a pre-trained high-precision LLM, the predominant sequential quantization framework treats different layers equally, but this may not be optimal in challenging bit-width settings. We empirically study the quantization impact of different layers on model accuracy, and observe that: (1) shallow/deep layers are usually more sensitive to quantization than intermediate layers; (2) among shallow/deep layers, the most sensitive one is the first/last layer, which exhibits significantly larger quantization error than others. These empirical observations imply that the quantization design for different layers of LLMs is required on multiple levels instead of a single level shared by all layers. Motivated by this, we propose a new PTQ framework termed Sliding-layer…
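The kind of per-layer sensitivity measurement the abstract describes can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: it uses random weights in place of a real LLM, plain symmetric round-to-nearest quantization, and the relative output error of a single linear map as the sensitivity proxy. The names `quantize_rtn` and `layer_output_error` are hypothetical helpers introduced here.

```python
import numpy as np


def quantize_rtn(w, bits):
    """Fake-quantize w with symmetric per-row round-to-nearest (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid dividing by zero on all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale


def layer_output_error(w, x, bits):
    """Relative Frobenius error of x @ w.T after quantizing only the weights."""
    ref = x @ w.T
    out = x @ quantize_rtn(w, bits).T
    return float(np.linalg.norm(out - ref) / np.linalg.norm(ref))


# Stand-in data: random activations and four random "layers" (assumption).
rng = np.random.default_rng(0)
x = rng.standard_normal((32, 64))
weights = [rng.standard_normal((64, 64)) for _ in range(4)]

# errors[bits][i] is the proxy sensitivity of layer i at that bit-width.
errors = {
    bits: [layer_output_error(w, x, bits) for w in weights]
    for bits in (2, 4, 8)
}
```

With real checkpoints, `weights` and `x` would come from the model and a calibration set. Random layers show only that error grows as the bit-width shrinks, not the shallow/deep pattern reported in the abstract.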
Peer Reviews
Decision: ICLR 2026 Poster
- The depth-aware sliding window makes early and late layers easier to quantize, instead of treating all layer depths the same.
- Inter-layer and intra-layer sliding reinforce each other, yielding denser cross-layer synergy than a fixed window.
- On MoE models (Table 4), it improves over OmniQuant at every bit setting, which supports the claim that the method generalizes rather than being tuned for one model.
- The generation results in Table 5 are especially strong: 2-bit OmniQuant nearly collapses on DeepSeek-R1 distilled models,
- The method adds several schedule knobs (expand depth, contract depth, window size, γ), and robustness to non-ideal choices is not fully demonstrated in the paper.
- Comparisons are mostly against fixed-window, non-rotated post-training quantization techniques. It is unclear how much of the gain remains against the strongest rotation/equivalent methods.
1. Clear and strong motivation: The paper is motivated by an empirically grounded observation on layer-wise sensitivity to quantization in LLMs. The motivation is clearly presented and addresses an overlooked aspect in post-training quantization.
2. Comprehensive experiments: The evaluation covers multiple model families and various bit-width settings, demonstrating the generality of the proposed framework.
3. Intuitive and well-written method: The proposed sliding-layer quantization framework
1. Uneven optimization frequency of middle layers: According to Figure 1 and the default hyperparameter setting, the 4th and 5th layers appear to be quantized only once. This means that some middle layers receive fewer optimization passes than their neighbors. Could this uneven optimization frequency introduce instability or suboptimal performance? In particular, when the middle-layer window size is larger than two, how do you ensure that all middle layers are optimized an equal number of times?
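The reviewer's question about optimization frequency can be checked mechanically for any window schedule by counting how many windows cover each layer. The sketch below is an assumption-laden illustration: `fixed_windows` is a hypothetical fixed-size, fixed-stride schedule and does not reproduce the paper's Figure 1 or its default hyperparameters. `coverage_counts` accepts any list of half-open `(start, end)` layer ranges, so the actual schedule could be substituted.

```python
from collections import Counter


def fixed_windows(num_layers, window, stride):
    """Hypothetical schedule: slide a fixed-size window over layer indices."""
    if num_layers < 1 or window < 1 or stride < 1:
        raise ValueError("num_layers, window and stride must be positive")
    last_start = max(num_layers - window, 0)
    starts = list(range(0, last_start + 1, stride))
    if starts[-1] != last_start:
        starts.append(last_start)  # extra window so the final layers are covered
    return [(s, min(s + window, num_layers)) for s in starts]


def coverage_counts(windows, num_layers):
    """Number of windows in which each layer is jointly optimized."""
    counts = Counter()
    for start, end in windows:
        counts.update(range(start, end))
    return [counts[i] for i in range(num_layers)]


# Window of 3 layers moved 2 layers at a time over 8 layers (illustrative values).
schedule = fixed_windows(num_layers=8, window=3, stride=2)
passes_per_layer = coverage_counts(schedule, num_layers=8)
```

In this toy schedule `passes_per_layer` is `[1, 1, 2, 1, 2, 2, 2, 1]`, so a stride that does not divide the window evenly already leaves some middle layers with fewer passes than their neighbors.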
1. The authors identify varying sensitivities of different layers to quantization and improve the quantization performance of layers with different sensitivities through a sliding-window design, rather than directly adopting a mixed-precision approach. This provides a novel and interesting perspective.
2. The writing is clear and well-structured, the experiments are thorough, and the figures and tables are elegantly designed.
1. The description of intra-layer sliding quantization is the main weakness of the paper. Although it is one of the core innovations, its explanation is too brief, which makes it confusing. Does it mean that the weight/activation matrices are also partitioned and quantized sequentially within each layer?
2. I am concerned about whether the effectiveness of the learnable low-rank matrices A and B will be affected by quantization, because they have been integrated into weights before quantization during infe
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
