ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference
Yesheng Liang, Haisheng Chen, Zihan Zhang, Song Han, Zhijian Liu

TL;DR
ParoQuant introduces a novel pairwise rotation quantization technique that effectively reduces outliers in large language models, improving inference accuracy with minimal overhead, thereby enabling more efficient deployment of reasoning LLMs.
Contribution
The paper proposes ParoQuant, a new PTQ method combining Givens rotations and channel-wise scaling to address outliers and improve accuracy in LLM inference.
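For reference, the building block named above can be written out explicitly. This is a sketch of the standard Givens rotation and of the invariance that makes such a transform usable for quantization. The symbols $R$, $D$, $s$ and the order in which scaling and rotation are composed are assumptions chosen for illustration, not notation taken from the paper.

```latex
% A Givens rotation mixes exactly two channels i and j by an angle \theta:
\[
\begin{pmatrix} w_i' \\ w_j' \end{pmatrix}
=
\begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}
\begin{pmatrix} w_i \\ w_j \end{pmatrix}
\]
% Rotations over disjoint pairs commute, so their product R is orthogonal
% (R^\top R = I). With D = \mathrm{diag}(s) a positive channel-wise scaling
% (assumed composition order):
\[
W x = \bigl(W D R^\top\bigr)\bigl(R D^{-1} x\bigr)
\]
% The transformed weight W D R^\top is what gets quantized; the inverse
% transform R D^{-1} is applied to the activations at inference time.
```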
Findings
Achieves 2.4% accuracy improvement over AWQ on reasoning tasks.
Maintains less than 10% inference overhead.
Matches state-of-the-art quantization accuracy.
Abstract
Post-training quantization (PTQ) compresses the weights and activations of large language models (LLMs) into low-precision representations to reduce memory footprint and accelerate inference. However, the presence of outliers in weights and activations often leads to large quantization errors and severe accuracy degradation, especially in recent reasoning LLMs where errors accumulate across long chains of thought. Existing PTQ methods either fail to sufficiently suppress outliers or introduce significant overhead during inference. In this paper, we propose Pairwise Rotation Quantization (ParoQuant), a PTQ method that combines hardware-efficient and optimizable independent Givens rotations with channel-wise scaling to even out the magnitudes across channels and narrow the dynamic range within each quantization group, effectively addressing the outlier issue. We further co-design the…
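The mechanism described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names `rotate_pairs` and `quantize_groups`, the fixed 45-degree angle, the hand-picked scales, and the round-to-nearest group quantizer are all hypothetical choices for illustration (in the paper the rotations are optimized rather than fixed). The sketch shows that rotating an outlier channel against a small one narrows the dynamic range of a quantization group while leaving the layer output unchanged.

```python
import numpy as np

def rotate_pairs(m, pairs, thetas):
    """Apply independent Givens rotations to disjoint column pairs of m.

    m: (rows, channels) matrix, either weights or activations.
    pairs: list of (i, j) channel indices; each index is used at most once,
           so the rotations do not interact with each other.
    thetas: one angle per pair (fixed here purely for illustration).
    """
    out = m.copy()
    for (i, j), t in zip(pairs, thetas):
        c, s = np.cos(t), np.sin(t)
        mi, mj = m[:, i], m[:, j]
        out[:, i] = c * mi - s * mj
        out[:, j] = s * mi + c * mj
    return out

def quantize_groups(w, bits=4, group_size=4):
    """Round-to-nearest quantization with one (scale, offset) per group.

    Assumes the number of channels is divisible by group_size.
    """
    rows, cols = w.shape
    g = w.reshape(rows, cols // group_size, group_size)
    lo = g.min(axis=-1, keepdims=True)
    hi = g.max(axis=-1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / (2**bits - 1)
    q = np.round((g - lo) / scale)
    return (q * scale + lo).reshape(rows, cols)

# One output row whose first input channel is an outlier.
w = np.array([[8.0, 0.0, 0.5, -0.5]])
x = np.array([[1.0, 2.0, 3.0, 4.0]])

# Hypothetical channel-wise scales: multiply weights, divide activations.
s = np.array([1.0, 1.0, 2.0, 2.0])
w_s, x_s = w * s, x / s

# Rotate the outlier channel 0 against channel 1 by 45 degrees.
pairs, thetas = [(0, 1)], [np.pi / 4]
w_t = rotate_pairs(w_s, pairs, thetas)
x_t = rotate_pairs(x_s, pairs, thetas)

y_ref = x @ w.T      # original layer output
y_t = x_t @ w_t.T    # output after the transform, before quantization

range_before = w.max() - w.min()
range_after = w_t.max() - w_t.min()
w_q = quantize_groups(w_t, bits=4, group_size=4)
```

Because the same rotation is applied to both operands and the scales cancel, `y_t` equals `y_ref` up to floating-point error, while `range_after` is smaller than `range_before`. The per-group quantization step is proportional to that range, so a narrower range means a finer grid for the same bit width.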
Peer Reviews
Decision: ICLR 2026 Poster
The authors convincingly argue that reasoning LLMs are especially sensitive to accumulated quantization errors, providing strong justification for the proposed method’s focus on accuracy stability during long generation. ParoQuant achieves higher reasoning-task accuracy than AWQ and matches the state-of-the-art QTIP while being significantly faster. The paper thoughtfully co-designs the quantization algorithm and CUDA implementation.
Please see my questions.
1. The paper clearly identifies and addresses a critical, forward-looking problem: the poor performance of efficient quantization methods on reasoning tasks that require long chains of thought. This focus on error accumulation in generative tasks is timely and important. 2. The proposed "scaled pairwise rotation" is a novel and elegant solution. The insight that a full rotation matrix is redundant and can be effectively approximated by a series of independent, parallelizable Givens rotations is
1. The greedy pair selection strategy outlined in Algorithm A1, while effective and intuitive, may not be globally optimal. It would be beneficial for the authors to discuss the potential limitations of this greedy approach. 2. In Section 3, when discussing quantization degradation on reasoning tasks, the authors should cite other recent works that have also identified this specific problem (e.g., QSPEC) to better contextualize their motivation. 3. In Figure 3, some text labels in the right-mo
1. Clear and Well-Founded Motivation: The paper observes that rotating only the top 10% of the most significant weight channel pairs can achieve nearly the same reduction in quantization error as performing a full rotation. This insight eliminates a large amount of redundant computation from full matrix multiplications, leading to a much more efficient quantization process. 2. Methodology with GPU-Aware Design: Building on this motivation, the authors propose a three-step design for the scaled
Overall, I found the paper well-written and technically solid. The following are just minor curiosities rather than critical weaknesses: 1. The 4-bit performance gains appear somewhat modest for certain model sizes and tasks (e.g., perplexity and AIME). It would be interesting to see whether ParoQuant delivers more substantial improvements at lower bitwidths, such as 3-bit or 2-bit quantization. 2. Do you have any insight into why E-QAT performs particularly poorly on AIME, given that ParoQuan
Code & Models
- 🤗 z-lab/Qwen3.5-27B-PARO · model · 2.0k dl · ♡ 13
- 🤗 z-lab/Qwen3.5-35B-A3B-PARO · model · 62 dl · ♡ 3
- 🤗 z-lab/Qwen3.5-4B-PARO · model · 16k dl · ♡ 14
- 🤗 z-lab/Meta-Llama-3-8B-PARO · model · 187 dl · ♡ 1
- 🤗 z-lab/Llama-3.1-8B-Instruct-PARO · model · 215 dl · ♡ 1
- 🤗 z-lab/Llama-2-7b-hf-PARO · model · 320 dl · ♡ 1
- 🤗 z-lab/Qwen3-0.6B-PARO · model · 436 dl · ♡ 1
- 🤗 z-lab/Qwen3-1.7B-PARO · model · 278 dl · ♡ 1
- 🤗 z-lab/Qwen3-4B-PARO · model · 625 dl · ♡ 1
- 🤗 z-lab/Qwen3-8B-PARO · model · 561 dl · ♡ 1
Taxonomy
Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling
