A Case Study of Selected PTQ Baselines for Reasoning LLMs on Ascend NPU
Yuchen Luo, Fangyue Zhu, Ruining Zhou, Mingzhe Huang, Jian Zhu, Fanyu Fan, Wei Shao

TL;DR
This paper evaluates various post-training quantization methods for reasoning large language models on Ascend NPU, revealing platform-specific sensitivities and practical deployment insights.
Contribution
It provides a comprehensive case study of PTQ algorithms on reasoning LLMs for Ascend NPU, highlighting platform sensitivities and deployment challenges.
Findings
4-bit weight-only quantization is viable for large models.
Aggressive 4-bit weight-activation quantization causes calibration instability.
8-bit quantization remains stable and practical.
Abstract
Post-Training Quantization (PTQ) is crucial for efficient model deployment, yet its effectiveness on Ascend NPU remains under-explored compared to GPU architectures. This paper presents a case study of representative PTQ baselines applied to reasoning-oriented models such as DeepSeek-R1-Distill-Qwen series (1.5B/7B/14B) and QwQ-32B. We evaluate four distinct algorithms, including AWQ, GPTQ, SmoothQuant, and FlatQuant, to cover the spectrum from weight-only compression to advanced rotation-based methods. Our empirical results reveal significant platform sensitivity. While 4-bit weight-only quantization proves viable for larger models, aggressive 4-bit weight-activation schemes suffer from layer-wise calibration instability on the NPU, leading to logic collapse in long-context reasoning tasks. Conversely, standard 8-bit quantization remains numerically stable. Furthermore, a real-world…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsParallel Computing and Optimization Techniques · Embedded Systems Design Techniques · Advanced Neural Network Applications
