A Case Study of Selected PTQ Baselines for Reasoning LLMs on Ascend NPU

Yuchen Luo; Fangyue Zhu; Ruining Zhou; Mingzhe Huang; Jian Zhu; Fanyu Fan; Wei Shao

arXiv:2602.17693·cs.LG·February 23, 2026

TL;DR

This paper evaluates four post-training quantization (PTQ) methods (AWQ, GPTQ, SmoothQuant, and FlatQuant) on reasoning-oriented large language models running on Ascend NPU, and reports platform-specific sensitivities that affect practical deployment.

Contribution

A case study of representative PTQ baselines on reasoning LLMs (the DeepSeek-R1-Distill-Qwen series at 1.5B/7B/14B and QwQ-32B) on Ascend NPU, a platform where PTQ effectiveness is less explored than on GPUs.

Findings

01

4-bit weight-only quantization is viable for the larger models.

02

Aggressive 4-bit weight-activation quantization suffers from layer-wise calibration instability on the NPU, leading to logic collapse in long-context reasoning tasks.

03

Standard 8-bit quantization remains numerically stable.

Abstract

Post-Training Quantization (PTQ) is crucial for efficient model deployment, yet its effectiveness on Ascend NPU remains under-explored compared to GPU architectures. This paper presents a case study of representative PTQ baselines applied to reasoning-oriented models, namely the DeepSeek-R1-Distill-Qwen series (1.5B/7B/14B) and QwQ-32B. We evaluate four distinct algorithms (AWQ, GPTQ, SmoothQuant, and FlatQuant) to cover the spectrum from weight-only compression to advanced rotation-based methods. Our empirical results reveal significant platform sensitivity. While 4-bit weight-only quantization proves viable for larger models, aggressive 4-bit weight-activation schemes suffer from layer-wise calibration instability on the NPU, leading to logic collapse in long-context reasoning tasks. Conversely, standard 8-bit quantization remains numerically stable. Furthermore, a real-world…
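
To make the 4-bit versus 8-bit contrast in the abstract concrete, the following is a minimal sketch of plain symmetric round-to-nearest (RTN) weight-only quantization. It is not AWQ, GPTQ, SmoothQuant, or FlatQuant, and it is not the paper's code; the function names, the group size of 128, and the synthetic Gaussian weights are all assumptions chosen for illustration. It only shows why a 4-bit grid introduces a much larger rounding error than an 8-bit grid on the same weights.

```python
import random


def quantize_rtn(weights, bits, group_size):
    """Symmetric round-to-nearest quantization with one scale per group.

    Hypothetical helper for illustration: returns the dequantized weights,
    i.e. the values the model would effectively compute with.
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    dequantized = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Avoid division by zero for an all-zero group.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        for w in group:
            q = max(-qmax, min(qmax, round(w / scale)))  # integer code
            dequantized.append(q * scale)
    return dequantized


def mean_abs_error(a, b):
    """Mean absolute difference between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Synthetic weights (assumption): zero-mean Gaussian with a small std.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(1024)]

err4 = mean_abs_error(weights, quantize_rtn(weights, bits=4, group_size=128))
err8 = mean_abs_error(weights, quantize_rtn(weights, bits=8, group_size=128))
print(f"4-bit MAE: {err4:.6f}, 8-bit MAE: {err8:.6f}")
```

The methods the paper evaluates all aim to reduce this kind of error beyond what plain rounding achieves, for example by rescaling or rotating weights and activations before quantizing; this sketch covers only the weight-only rounding step and says nothing about activation quantization or NPU-specific behavior.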

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques · Advanced Neural Network Applications