SliderQuant: Accurate Post-Training Quantization for LLMs

Shigeng Wang; Chao Li; Yangyuxuan Kang; Jiawei Fan; Zhonghong Ou; Anbang Yao

arXiv:2603.25284·cs.AI·March 27, 2026

TL;DR

This paper introduces SliderQuant, a novel post-training quantization framework for LLMs that adaptively adjusts quantization across layers, significantly reducing errors and outperforming existing methods.

Contribution

We propose SliderQuant, a layer-sensitive PTQ method with adaptive sliding quantization, improving accuracy for various LLMs over existing techniques.

Findings

1. Outperforms existing PTQ methods on multiple LLM benchmarks.
2. Effectively reduces quantization errors across different layers.
3. Works well with weight-only and weight-activation quantization.

Abstract

In this paper, we address post-training quantization (PTQ) for large language models (LLMs) from an overlooked perspective: given a pre-trained high-precision LLM, the predominant sequential quantization framework treats different layers equally, but this may not be optimal in challenging bit-width settings. We empirically study the quantization impact of different layers on model accuracy, and observe that: (1) shallow/deep layers are usually more sensitive to quantization than intermediate layers; (2) among shallow/deep layers, the most sensitive one is the first/last layer, which exhibits significantly larger quantization error than others. These empirical observations imply that the quantization design for different layers of LLMs is required on multiple levels instead of a single level shared by all layers. Motivated by this, we propose a new PTQ framework termed Sliding-layer…

Peer Reviews

Decision: ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- Depth-aware sliding window makes early and late layers easier to quantize, instead of treating all layer depths the same.
- Inter-layer and intra-layer sliding reinforce each other, yielding denser cross-layer synergy than a fixed window.
- On MoE models (Table 4), it improves over OmniQuant at every bit setting, which supports the claim that the method generalizes rather than being tuned for one model.
- The generation results in Table 5 are especially strong: 2-bit OmniQuant nearly collapses on DeepSeek-R1 distilled models,

Weaknesses

- The method adds several schedule knobs (expand depth, contract depth, window size, γ), and robustness to non-ideal choices is not fully presented in the paper.
- Comparisons are mostly against fixed-window, non-rotated post-training quantization techniques. It is unclear how much of the gains remain against the strongest rotation/equivalent methods.

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. Clear and strong motivation: The paper is motivated by an empirically grounded observation on layer-wise sensitivity to quantization in LLMs. The motivation is clearly presented and addresses an overlooked aspect in post-training quantization.
2. Comprehensive experiments: The evaluation covers multiple model families and various bit-width settings, demonstrating the generality of the proposed framework.
3. Intuitive and well-written method: The proposed sliding-layer quantization framework

Weaknesses

1. Uneven optimization frequency of middle layers: According to Figure 1 and the default hyperparameter setting, the 4th and 5th layers appear to be quantized only once. This means that some middle layers receive fewer optimization passes than their neighbors. Could this uneven optimization frequency introduce instability or suboptimal performance? In particular, when the middle-layer window size is larger than two, how do you ensure that all middle layers are optimized an equal number of times?
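The concern about uneven optimization frequency can be made concrete by counting how many windows cover each layer. The sketch below is a minimal illustration, not the paper's schedule: the helper names `sliding_windows` and `coverage_counts`, the fixed window size, the stride, and the rule of adding one final window to reach the last layer are all assumptions made for this example.

```python
def sliding_windows(num_layers, window, stride=1):
    """Return (start, end) layer-index pairs, end exclusive.

    Hypothetical fixed-size schedule used only for illustration;
    it is not the schedule defined in the paper.
    """
    if window < 1 or stride < 1 or window > num_layers:
        raise ValueError("need 1 <= window <= num_layers and stride >= 1")
    starts = list(range(0, num_layers - window + 1, stride))
    # Assumption: if the stride skips the tail, add one final window
    # so that the last layer is still covered.
    if starts[-1] + window < num_layers:
        starts.append(num_layers - window)
    return [(s, s + window) for s in starts]


def coverage_counts(windows, num_layers):
    """Count how many windows include each layer index."""
    counts = [0] * num_layers
    for start, end in windows:
        for layer in range(start, end):
            counts[layer] += 1
    return counts


if __name__ == "__main__":
    num_layers = 8
    for window, stride in [(2, 2), (3, 2), (3, 1)]:
        windows = sliding_windows(num_layers, window, stride)
        print(window, stride, coverage_counts(windows, num_layers))
```

Under these assumptions, a window of 2 with stride 2 visits every layer exactly once, while a window of 3 with stride 2 leaves some interior layers with fewer passes than their neighbors. That is the kind of imbalance the question asks the authors to address.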

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The authors identify varying sensitivities of different layers to quantization and improve the quantization performance of layers with different sensitivities through a sliding-window design, rather than directly adopting a mixed-precision approach. This provides a novel and interesting perspective.
2. The writing is clear and well-structured, the experiments are thorough, and the figures and tables are elegantly designed.

Weaknesses

1. The description of intra-layer sliding quantization is the main weakness of the paper. As one of the core innovations, its explanation is too brief, which makes it confusing. Does it mean that the weight/activation matrices are also partitioned and quantized sequentially within each layer?
2. I am concerned that the effectiveness of the learnable low-rank matrices A and B may be affected by quantization, because they have been integrated into weights before quantization during infe

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques