Post-Training Quantization of OpenPangu Models for Efficient Deployment on Atlas A2
Yilun Luo, Huaqing Zheng, Haoqian Meng, Wenyuan Liu, Peng Zhang

TL;DR
This paper presents a low-bit quantization framework for openPangu models that enables efficient deployment of Chain-of-Thought reasoning on Atlas A2 hardware, with minimal accuracy loss and faster inference.
Contribution
It introduces a unified low-bit inference framework supporting INT8 and W4A8 quantization for openPangu models, optimizing their deployment on Ascend NPUs.
Findings
INT8 quantization preserves over 90% of FP16 accuracy
Achieves 1.5x prefill speedup on Atlas A2
W4A8 reduces memory consumption with moderate accuracy trade-off
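To make the memory finding concrete, here is a rough weights-only estimate. The parameter counts are nominal values read off the model names (an assumption, not figures from the paper), and the calculation ignores activations, the KV cache, quantization scales, and any layers kept at higher precision.

```python
# Weights-only memory estimate at different bit widths.
# Parameter counts are nominal assumptions taken from the model names.

def weight_memory_gib(num_params, bits_per_weight):
    """Return the storage needed for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


NOMINAL_PARAMS = {
    "openPangu-Embedded-1B": 1e9,   # assumed nominal size
    "openPangu-Embedded-7B": 7e9,   # assumed nominal size
}

for name, params in NOMINAL_PARAMS.items():
    for label, bits in (("FP16", 16), ("INT8", 8), ("INT4", 4)):
        print(f"{name} {label}: {weight_memory_gib(params, bits):.2f} GiB")
```

Under these assumptions, each halving of the weight bit width halves the weight footprint, which is the main source of the memory saving when moving from W8A8 to W4A8.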
Abstract
Huawei's openPangu-Embedded-1B and openPangu-Embedded-7B are variants of the openPangu large language model, designed for efficient deployment on Ascend NPUs. The 7B variant supports three distinct Chain-of-Thought (CoT) reasoning paradigms, namely slow_think, auto_think, and no_think, while the 1B variant operates exclusively in the no_think mode, which employs condensed reasoning for higher efficiency. Although CoT reasoning enhances capability, the generation of extended reasoning traces introduces substantial memory and latency overheads, posing challenges for practical deployment on Ascend NPUs. This paper addresses these computational constraints by leveraging low-bit quantization, which transforms FP16 computations into more efficient integer arithmetic. We introduce a unified low-bit inference framework, supporting INT8 (W8A8) and W4A8 quantization, specifically optimized for…
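The idea of replacing FP16 computation with integer arithmetic can be sketched as follows. This is a minimal illustration assuming plain symmetric quantization with one scale per vector; the function names are hypothetical, and the paper's actual calibration scheme, scale granularity, and Ascend kernels are not described in this summary.

```python
# Minimal sketch of symmetric low-bit quantization (an illustrative
# assumption, not the paper's actual scheme or kernels).

def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers using a single scale for the vector."""
    qmax = 2 ** (num_bits - 1) - 1            # 127 for INT8, 7 for INT4
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def quantized_dot(weights, activations, weight_bits=8, act_bits=8):
    """Approximate a float dot product with integer multiply-accumulate."""
    qw, sw = quantize_symmetric(weights, weight_bits)
    qa, sa = quantize_symmetric(activations, act_bits)
    acc = sum(w * a for w, a in zip(qw, qa))  # integer accumulation
    return acc * sw * sa                      # rescale the result to float


weights = [0.12, -0.53, 0.91, -0.07]
activations = [1.4, -2.0, 0.25, 3.0]

exact = sum(w * a for w, a in zip(weights, activations))
w8a8 = quantized_dot(weights, activations, weight_bits=8, act_bits=8)
w4a8 = quantized_dot(weights, activations, weight_bits=4, act_bits=8)

print(f"exact={exact:.4f}  W8A8={w8a8:.4f}  W4A8={w4a8:.4f}")
```

On this toy input the 4-bit weights give a visibly larger error than the 8-bit weights, which mirrors the accuracy versus memory trade-off the summary reports for W4A8.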
Taxonomy
Topics: Big Data and Digital Economy · Natural Language Processing Techniques · Multimodal Machine Learning Applications
