Quantization Meets Reasoning: Exploring and Mitigating Degradation of Low-Bit LLMs in Mathematical Reasoning

Zhen Li; Yupeng Su; Songmiao Wang; Runming Yang; Congkai Xie; Aofan Liu; Ming Li; Jiannong Cao; Yuan Xie; Ngai Wong; Hongxia Yang

arXiv:2505.11574·cs.LG·January 21, 2026

3 Reviews

TL;DR

This paper investigates how low-bit quantization degrades mathematical reasoning in large language models, identifies early failure points, and proposes a targeted fine-tuning method to mitigate these issues effectively.

Contribution

It uncovers regularities in quantization-induced errors and introduces a lightweight, model-agnostic intervention to restore reasoning accuracy with minimal data and compute.

Findings

01

Early failure points cause cascading errors in reasoning.

02

Restoring local token-level margins improves reasoning accuracy.

03

Few-shot tuning recovers near full-precision performance.
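The "token-level margin" in Finding 02 can be illustrated with a small sketch. It assumes the common definition of margin as the gap between the largest and second-largest logit at one decoding position; the paper's exact definition is not shown in this summary. The function names and the noise values standing in for quantization error are hypothetical.

```python
def top2_margin(logits):
    """Return (argmax index, gap between the largest and second-largest logit)."""
    if len(logits) < 2:
        raise ValueError("need at least two logits")
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    best, runner_up = order[0], order[1]
    return best, logits[best] - logits[runner_up]


def flips_under_perturbation(logits, noise):
    """True if adding `noise` element-wise changes which token has the top logit."""
    before, _ = top2_margin(logits)
    after, _ = top2_margin([l + n for l, n in zip(logits, noise)])
    return before != after


# Hypothetical logits for three candidate tokens at one decoding step.
narrow = [2.10, 2.00, -1.0]   # small margin between the top two tokens
wide = [3.00, 2.00, -1.0]     # large margin between the top two tokens
noise = [-0.08, 0.07, 0.0]    # assumed stand-in for quantization error

narrow_flips = flips_under_perturbation(narrow, noise)
wide_flips = flips_under_perturbation(wide, noise)
```

Under these assumed numbers, the same perturbation changes the chosen token only where the margin is narrow, which is one way to read why restoring local margins could matter.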

Abstract

Low-bit post-training quantization (PTQ) is a practical route to deploy reasoning-capable LLMs under tight memory and latency budgets, yet it can markedly impair mathematical reasoning (drops up to 69.81% in our harder settings). We address two deployment-critical questions with process-level precision: Where along a step-structured solution does degradation first arise? How to mitigate it while staying in the low-bit regime? Across widely used PTQ methods (AWQ, GPTQ, SmoothQuant), open-source model families (Qwen, LLaMA; 0.5–7B), and math reasoning benchmarks (GSM8K, MATH, AIME), we perform format-aligned chain-of-thought with step-aligned attribution and uncover two robust regularities: (i) PTQ disproportionately elevates method and execution errors relative to high-level conceptual mistakes; and (ii) failures emerge early, with the first vulnerable step flipping and cascading to the…
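The question of where degradation first arises in a step-structured solution can be made concrete with a naive first-mismatch locator. This is only a sketch: it compares steps by normalized string equality, which is an assumption made here and not the step-aligned attribution procedure used in the paper. The function name and the example steps are hypothetical.

```python
def first_divergent_step(reference_steps, candidate_steps):
    """Return the index of the first step where the two solutions differ, or None."""

    def norm(step):
        # Collapse whitespace and ignore case so trivial formatting differences do not count.
        return " ".join(step.split()).lower()

    for i, (ref, cand) in enumerate(zip(reference_steps, candidate_steps)):
        if norm(ref) != norm(cand):
            return i
    # One solution stopped early: the divergence is at the first missing step.
    if len(reference_steps) != len(candidate_steps):
        return min(len(reference_steps), len(candidate_steps))
    return None


# Hypothetical full-precision and quantized solutions to the same problem.
full_precision = ["12 * 4 = 48", "48 + 7 = 55", "Answer: 55"]
quantized = ["12 * 4 = 48", "48 + 7 = 56", "Answer: 56"]

first_bad = first_divergent_step(full_precision, quantized)
```

In this toy case the execution slip at the second step is the first divergence, and the wrong final answer follows from it, mirroring the cascading pattern the abstract describes.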

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The paper is well-written and easy to follow. 2. The work addresses a highly practical problem. The deployment of LLMs on resource-constrained hardware is a significant bottleneck, and understanding and rectifying performance degradation from essential compression techniques like quantization is of great importance to the field. 3. A key contribution is the detailed, step-aligned error analysis. The finding that quantization disproportionately increases method errors and execution errors, and

Weaknesses

1. The definitions for the four high-level error types (Conceptual, Method, Execution, Reasoning) appear to have some overlap, which could lead to subjective classification. For example, misusing a formula in an unsuitable context could be interpreted as a Conceptual Error (misunderstanding the problem's constraints) or a Method Error (choosing an inappropriate method).

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The paper tackles the highly relevant challenge of deploying powerful LLMs in resource-constrained environments by making them efficient through quantization, while simultaneously preserving their reasoning capabilities. 2. The paper covers widely used PTQ methods (AWQ, GPTQ, SmoothQuant) and mathematical benchmarks (GSM8K, MATH, AIME). 3. The paper shows that significant recovery of mathematical reasoning accuracy can be achieved with as few as 332 curated examples and 3-5 minutes of GPU compute, w

Weaknesses

1. While the paper claims its framework is quantizer- and architecture-agnostic and applicable to "broader domains," the experiments are exclusively focused on mathematical reasoning tasks and a specific set of LLM families and PTQ methods. Further evidence would strengthen the broader generalizability claim. 2. The paper studies a limited variety of models, and for mathematical reasoning tasks, long CoT models are more mainstream but were not investigated. Furthermore, there was no analysis of

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The studied problem, the performance degradation from quantization in math reasoning, is important. The finding that such degradations are non-trivial, especially for smaller LLMs, is interesting. 2. A fine-grained error analysis is designed and conducted, which sheds light on why and how quantization can lead to performance degradation. It shows that quantization predominantly impairs the model’s ability to perform procedural operations and arithmetic execution. 3. The proposed training me

Weaknesses

1. It would be helpful to present the (potential) performance degradation from quantization on tasks other than math reasoning (apart from MMLU), to show whether, and by how much, the degradation on math reasoning is larger. 2. The paper only experimented with models of up to 7B parameters. Stronger and larger models should be included to make the empirical evaluation of the proposed methods more comprehensive.
