Post-Training Quantization of OpenPangu Models for Efficient Deployment on Atlas A2

Yilun Luo; Huaqing Zheng; Haoqian Meng; Wenyuan Liu; Peng Zhang

arXiv:2512.23367 · cs.LG · January 9, 2026

TL;DR

This paper presents a low-bit quantization framework for openPangu models that enables efficient deployment of Chain-of-Thought reasoning on Atlas A2 hardware, with minimal accuracy loss and faster inference.

Contribution

The paper introduces a unified low-bit inference framework that supports INT8 (W8A8) and W4A8 quantization for openPangu models and is optimized for deployment on Ascend NPUs.

Findings

01. INT8 quantization preserves over 90% of FP16 accuracy

02. Achieves 1.5x prefill speedup on Atlas A2

03. W4A8 reduces memory consumption with moderate accuracy trade-off
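
To make the memory finding concrete, the following back-of-the-envelope sketch estimates weight storage at different bit widths. It is an illustration only: the helper `weight_memory_gib` is hypothetical, the parameter counts are assumed to be exactly 1e9 and 7e9 (the nominal "1B" and "7B" sizes), and quantization scales, activations, and the KV cache are ignored. None of these numbers come from the paper.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate weight storage in GiB, ignoring scales and other metadata."""
    return num_params * bits_per_weight / 8 / 2**30

# Assumed nominal parameter counts, not figures reported by the paper.
models = {"1B": 1_000_000_000, "7B": 7_000_000_000}

for name, num_params in models.items():
    for label, bits in (("FP16", 16), ("INT8", 8), ("INT4", 4)):
        gib = weight_memory_gib(num_params, bits)
        print(f"{name} {label}: {gib:.2f} GiB")
```

Under these assumptions, 8-bit weights take half the storage of FP16 and 4-bit weights a quarter, which is the source of the memory saving that W4A8 trades against accuracy.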

Abstract

Huawei's openPangu-Embedded-1B and openPangu-Embedded-7B are variants of the openPangu large language model, designed for efficient deployment on Ascend NPUs. The 7B variant supports three distinct Chain-of-Thought (CoT) reasoning paradigms, namely slow_think, auto_think, and no_think, while the 1B variant operates exclusively in the no_think mode, which employs condensed reasoning for higher efficiency. Although CoT reasoning enhances capability, the generation of extended reasoning traces introduces substantial memory and latency overheads, posing challenges for practical deployment on Ascend NPUs. This paper addresses these computational constraints by leveraging low-bit quantization, which transforms FP16 computations into more efficient integer arithmetic. We introduce a unified low-bit inference framework, supporting INT8 (W8A8) and W4A8 quantization, specifically optimized for…
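
As a rough illustration of how FP16 computation can be replaced by integer arithmetic, the sketch below quantizes a weight vector and an activation vector with a symmetric, per-tensor scale and computes their dot product with integer accumulation. The scheme, the function names `quantize_symmetric` and `int_dot`, and the sample values are all assumptions made for this example; the excerpt does not describe the paper's actual calibration or kernel design.

```python
def quantize_symmetric(values, bits):
    """Quantize floats to signed integers with one shared scale (symmetric, per-tensor)."""
    qmax = 2 ** (bits - 1) - 1          # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def int_dot(q_w, s_w, q_a, s_a):
    """Accumulate in integers, then rescale once back to floating point."""
    acc = sum(w * a for w, a in zip(q_w, q_a))
    return acc * s_w * s_a

# Hypothetical sample data, chosen only for illustration.
weights = [0.42, -1.30, 0.07, 0.88]
activations = [1.5, -0.25, 3.0, 0.6]

reference = sum(w * a for w, a in zip(weights, activations))
q_a, s_a = quantize_symmetric(activations, 8)   # 8-bit activations in both modes

results = {}
for name, weight_bits in (("W8A8", 8), ("W4A8", 4)):
    q_w, s_w = quantize_symmetric(weights, weight_bits)
    results[name] = int_dot(q_w, s_w, q_a, s_a)
    print(name, results[name], "abs error:", abs(results[name] - reference))
```

In W8A8 both operands use 8 bits, while W4A8 stores weights in 4 bits and keeps activations at 8 bits. Real deployments typically use finer-grained scales (per-channel or per-group) and calibration data to limit the error that this per-tensor sketch exposes.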

Taxonomy

Topics: Big Data and Digital Economy · Natural Language Processing Techniques · Multimodal Machine Learning Applications