ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference

Yesheng Liang; Haisheng Chen; Zihan Zhang; Song Han; Zhijian Liu

arXiv:2511.10645·cs.CL·February 17, 2026



TL;DR

ParoQuant is a post-training quantization method that suppresses outliers in large language models with pairwise (Givens) rotations and channel-wise scaling, improving the accuracy of quantized reasoning LLMs while keeping inference overhead below 10%.

Contribution

The paper proposes ParoQuant, a new PTQ method combining Givens rotations and channel-wise scaling to address outliers and improve accuracy in LLM inference.

Findings

1. Achieves 2.4% accuracy improvement over AWQ on reasoning tasks.
2. Maintains less than 10% inference overhead.
3. Matches state-of-the-art quantization accuracy.

Abstract

Post-training quantization (PTQ) compresses the weights and activations of large language models (LLMs) into low-precision representations to reduce memory footprint and accelerate inference. However, the presence of outliers in weights and activations often leads to large quantization errors and severe accuracy degradation, especially in recent reasoning LLMs where errors accumulate across long chains of thought. Existing PTQ methods either fail to sufficiently suppress outliers or introduce significant overhead during inference. In this paper, we propose Pairwise Rotation Quantization (ParoQuant), a PTQ method that combines hardware-efficient and optimizable independent Givens rotations with channel-wise scaling to even out the magnitudes across channels and narrow the dynamic range within each quantization group, effectively addressing the outlier issue. We further co-design the…
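The transform described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `rotate_pairs` and `quantize_groups`, the fixed angles, the scale heuristic, and the min-max quantizer are all assumptions chosen for illustration, whereas the paper optimizes the rotations and scales. The sketch only shows that disjoint ("independent") Givens rotations plus channel-wise scaling can be folded into the weights while the inverse transform on the activations leaves the layer output unchanged.

```python
import numpy as np


def rotate_pairs(M, pairs, thetas):
    """Apply Givens rotations to disjoint column pairs of M (computes M @ R)."""
    pairs = np.asarray(pairs)
    thetas = np.asarray(thetas, dtype=float)
    # Independence: each channel appears in at most one pair, so the
    # rotations commute and can all be applied in parallel.
    assert len(np.unique(pairs)) == pairs.size, "pairs must be disjoint"
    i, j = pairs[:, 0], pairs[:, 1]
    c, s = np.cos(thetas), np.sin(thetas)
    out = np.array(M, dtype=float)      # copy; the input is left untouched
    mi, mj = out[:, i], out[:, j]       # fancy indexing returns copies
    out[:, i] = c * mi - s * mj
    out[:, j] = s * mi + c * mj
    return out


def quantize_groups(W, bits=4, group_size=4):
    """Illustrative asymmetric min-max quantizer over groups of input channels."""
    rows, cols = W.shape
    G = W.reshape(rows, cols // group_size, group_size)
    lo = G.min(axis=-1, keepdims=True)
    hi = G.max(axis=-1, keepdims=True)
    step = np.maximum(hi - lo, 1e-12) / (2**bits - 1)
    q = np.round((G - lo) / step)       # integer codes in [0, 2**bits - 1]
    return (q * step + lo).reshape(rows, cols)   # dequantized weights


rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))             # rows: outputs, columns: input channels
W[:, 0] *= 20.0                         # inject an outlier input channel
x = rng.normal(size=8)

# Hypothetical fixed parameters; the paper learns these instead.
scale = 1.0 / np.sqrt(np.abs(W).max(axis=0))
pairs = [(0, 1), (2, 3)]
thetas = [np.pi / 4, 0.3]

# Fold scaling and rotation into the weights: W_t = W @ diag(scale) @ R.
W_t = rotate_pairs(W * scale, pairs, thetas)
# Apply the inverse transform to the activation: x_t = R.T @ (x / scale).
x_t = rotate_pairs((x / scale)[None, :], pairs, thetas)[0]

W_q = quantize_groups(W_t, bits=4, group_size=4)

# A 45-degree rotation spreads one large entry across both channels of a pair.
spread = rotate_pairs(np.array([[8.0, 0.0]]), [(0, 1)], [np.pi / 4])
```

Because `R` is orthogonal and the scaling is inverted on the activation side, `W_t @ x_t` equals `W @ x` up to floating-point error, so any accuracy change comes only from quantizing `W_t` instead of `W`. The `spread` example shows the mechanism in miniature: the row `(8, 0)` becomes roughly `(5.66, 5.66)`, which narrows the range a quantization group has to cover.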

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

The authors convincingly argue that reasoning LLMs are especially sensitive to accumulated quantization errors, providing strong justification for the proposed method’s focus on accuracy stability during long generation. ParoQuant achieves higher reasoning-task accuracy than AWQ and matches the state-of-the-art QTIP while being significantly faster. The paper thoughtfully co-designs the quantization algorithm and CUDA implementation.

Weaknesses

Please see my questions.

Reviewer 02 · Rating 8 · Confidence 5

Strengths

1. The paper clearly identifies and addresses a critical, forward-looking problem: the poor performance of efficient quantization methods on reasoning tasks that require long chains of thought. This focus on error accumulation in generative tasks is timely and important.
2. The proposed "scaled pairwise rotation" is a novel and elegant solution. The insight that a full rotation matrix is redundant and can be effectively approximated by a series of independent, parallelizable Givens rotations is

Weaknesses

1. The greedy pair selection strategy outlined in Algorithm A1, while effective and intuitive, may not be globally optimal. It would be beneficial for the authors to discuss the potential limitations of this greedy approach.
2. In Section 3, when discussing quantization degradation on reasoning tasks, the authors should cite other recent works that have also identified this specific problem (e.g., QSPEC) to better contextualize their motivation.
3. In Figure 3, some text labels in the right-mo

Reviewer 03 · Rating 6 · Confidence 5

Strengths

1. Clear and Well-Founded Motivation: The paper observes that rotating only the top 10% of the most significant weight channel pairs can achieve nearly the same reduction in quantization error as performing a full rotation. This insight eliminates a large amount of redundant computation from full matrix multiplications, leading to a much more efficient quantization process.
2. Methodology with GPU-Aware Design: Building on this motivation, the authors propose a three-step design for the scaled

Weaknesses

Overall, I found the paper well-written and technically solid. The following are just minor curiosities rather than critical weaknesses:
1. The 4-bit performance gains appear somewhat modest for certain model sizes and tasks (e.g., Perplexity and AIME). It would be interesting to see whether ParoQuant delivers more substantial improvements at lower bitwidths, such as 3-bit or 2-bit quantization.
2. Do you have any insight into why E-QAT performs particularly poorly on AIME, given that ParoQuan


Taxonomy

Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling