TL;DR
This paper investigates how low-bit quantization degrades mathematical reasoning in large language models, identifies early failure points, and proposes a targeted fine-tuning method to mitigate these issues effectively.
Contribution
It uncovers regularities in quantization-induced errors and introduces a lightweight, model-agnostic intervention to restore reasoning accuracy with minimal data and compute.
Findings
Early failure points cause cascading errors in reasoning.
Restoring local token-level margins improves reasoning accuracy.
Few-shot tuning recovers near full-precision performance.
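The findings above refer to token-level margins and to an early step that flips. As a minimal sketch of what those terms could mean, assume the margin is the logit gap between a reference token and its strongest competitor, and that a "flip" is the first step where that gap turns non-positive under quantization. The function names `token_margin` and `first_flip` and the toy logits are hypothetical illustrations, not the paper's implementation.

```python
from typing import Optional, Sequence


def token_margin(logits: Sequence[float], target: int) -> float:
    """Logit gap between the target token and the best competing token."""
    best_other = max(v for i, v in enumerate(logits) if i != target)
    return logits[target] - best_other


def first_flip(
    fp_steps: Sequence[Sequence[float]],
    q_steps: Sequence[Sequence[float]],
) -> Optional[int]:
    """Index of the first step where the quantized model stops preferring
    the token chosen by the full-precision model, or None if it never does.

    Assumption: each entry holds one decoding step's logits over the same
    (toy) vocabulary for the full-precision and quantized models.
    """
    for step, (fp, q) in enumerate(zip(fp_steps, q_steps)):
        # Reference token = the full-precision model's greedy choice.
        target = max(range(len(fp)), key=fp.__getitem__)
        if token_margin(q, target) <= 0:
            return step
    return None


# Toy logits (made up for illustration): three steps, vocabulary of three tokens.
fp_steps = [[2.0, 0.5, 0.1], [1.0, 0.9, 0.2], [3.0, 0.0, 0.0]]
q_steps = [[1.8, 0.6, 0.1], [0.8, 1.0, 0.2], [2.9, 0.1, 0.0]]
flip_at = first_flip(fp_steps, q_steps)
```

In this toy case the full-precision margin at the second step is already small, so a modest perturbation is enough to change the preferred token there, which is the kind of vulnerable step the findings describe.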
Abstract
Low-bit post-training quantization (PTQ) is a practical route to deploy reasoning-capable LLMs under tight memory and latency budgets, yet it can markedly impair mathematical reasoning (drops up to 69.81% in our harder settings). We address two deployment-critical questions with process-level precision: Where along a step-structured solution does degradation first arise? How to mitigate it while staying in the low-bit regime? Across widely used PTQ methods (AWQ, GPTQ, SmoothQuant), open-source model families (Qwen, LLaMA; 0.5--7B), and math reasoning benchmarks (GSM8K, MATH, AIME), we perform format-aligned chain-of-thought with step-aligned attribution and uncover two robust regularities: (i) PTQ disproportionately elevates method and execution errors relative to high-level conceptual mistakes; and (ii) failures emerge early, with the first vulnerable step flipping and cascading to the…
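To make "low-bit" concrete, the sketch below uses plain symmetric round-to-nearest quantization with a single per-tensor scale. This is a deliberately simple stand-in and is not AWQ, GPTQ, or SmoothQuant; the function names and the example weights are assumptions chosen for illustration only.

```python
from typing import List, Sequence


def quantize_rtn(weights: Sequence[float], bits: int) -> List[float]:
    """Symmetric round-to-nearest quantization followed by dequantization.

    Assumption: one scale for the whole tensor, signed integer grid
    in [-qmax, qmax] with qmax = 2**(bits - 1) - 1.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    dequantized = []
    for w in weights:
        q = max(-qmax, min(qmax, round(w / scale)))  # clamp to the integer grid
        dequantized.append(q * scale)
    return dequantized


def max_abs_error(weights: Sequence[float], bits: int) -> float:
    """Largest absolute difference between original and dequantized weights."""
    deq = quantize_rtn(weights, bits)
    return max(abs(a - b) for a, b in zip(weights, deq))


# Made-up weights for illustration.
weights = [0.9, -0.35, 0.12, 0.5]
err_4bit = max_abs_error(weights, 4)
err_8bit = max_abs_error(weights, 8)
```

With fewer bits the grid is coarser, so the per-weight rounding error grows; each weight's error is bounded by half the scale.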
Peer Reviews
Decision·Submitted to ICLR 2026
1. The paper is well-written and easy to follow. 2. The work addresses a highly practical problem. The deployment of LLMs on resource-constrained hardware is a significant bottleneck, and understanding and rectifying performance degradation from essential compression techniques like quantization is of great importance to the field. 3. A key contribution is the detailed, step-aligned error analysis. The finding that quantization disproportionately increases method errors and execution errors, and
1. The definitions for the four high-level error types (Conceptual, Method, Execution, Reasoning) appear to have some overlap, which could lead to subjective classification. For example, misusing a formula in an unsuitable context could be interpreted as a Conceptual Error (misunderstanding the problem's constraints) or a Method Error (choosing an inappropriate method).
1. The paper tackles the highly relevant challenge of deploying powerful LLMs in resource-constrained environments by making them efficient through quantization, while simultaneously preserving their reasoning capabilities. 2. The paper covers widely used PTQ methods (AWQ, GPTQ, SmoothQuant) and mathematical benchmarks (GSM8K, MATH, AIME). 3. The paper shows that significant recovery of mathematical reasoning accuracy can be achieved with as few as 332 curated examples and 3-5 minutes of GPU compute, w
1. While the paper claims its framework is quantizer- and architecture-agnostic and applicable to "broader domains," the experiments are exclusively focused on mathematical reasoning tasks and a specific set of LLM families and PTQ methods. Further evidence would strengthen the broader generalizability claim. 2. The paper studies a limited variety of models, and for mathematical reasoning tasks, long CoT models are more mainstream but were not investigated. Furthermore, there was no analysis of
1. The studied problem, the performance degradation from quantization in math reasoning, is important. The finding that such degradations are non-trivial, especially for smaller LLMs, is interesting. 2. A fine-grained error analysis is designed and conducted, which sheds light on why and how quantization can lead to performance degradation. It shows that quantization predominantly impairs the model’s ability to perform procedural operations and arithmetic execution. 3. The proposed training me
1. It would be helpful to present the (potential) performance degradation from quantization on tasks other than math reasoning (apart from MMLU), to show whether, and by how much, the degradation on math reasoning is larger. 2. The paper only experimented with models of up to 7B parameters. Stronger and larger models should be included to make the empirical evaluation of the proposed methods more comprehensive.
