Accelerating OpenPangu Inference on NPU via Speculative Decoding

Yuntao Dai; Jing Wu; Hang Gu; Teng Wang

arXiv:2603.03383·cs.DC·March 5, 2026

Open Access

TL;DR

This paper proposes a speculative decoding acceleration scheme for OpenPangu-7B that speeds up inference on NPU hardware, targeting the memory-wall bottleneck and the lack of native support for speculative decoding algorithms on that hardware.

Contribution

It introduces an end-to-end speculative inference acceleration method tailored to OpenPangu-7B on NPU hardware, where mainstream speculative decoding algorithms lack native support.

Findings

01 Significant speedup in inference time on NPU hardware

02 Effective mitigation of the memory-wall bottleneck

03 Compatibility with mainstream speculative decoding algorithms

Abstract

To mitigate the memory-wall bottleneck that Large Language Models (LLMs) encounter during inference on NPU hardware, and to address the scarcity of native support for mainstream speculative decoding algorithms on domestic infrastructure, this study presents an end-to-end speculative inference acceleration scheme for OpenPangu-7B.
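The abstract does not spell out the algorithm, so the following is only a minimal sketch of generic greedy draft-and-verify speculative decoding, not the paper's scheme. The "models" are hypothetical stand-ins (plain functions that return the next token for a sequence), and every name here (`speculative_decode`, `greedy_decode`, `toy_target`, `toy_draft`, `k`) is an assumption made for illustration.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token sequence to the single
# next token it would pick greedily.
NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one target-model call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Greedy draft-and-verify decoding; output matches greedy_decode(target, ...)."""
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit:
        # 1. The cheap draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, limit - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target model checks every proposed position. A real system
        #    would do this in one batched forward pass; a loop is used here
        #    only for clarity.
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if expected != tok:
                # 3. First mismatch: keep the agreed prefix plus the
                #    target's own token, discard the rest of the draft.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            # Every drafted token was accepted.
            out.extend(proposal)
    return out


def toy_target(seq: List[int]) -> int:
    # Toy "large" model: counts upward modulo 10.
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    # Toy "small" model: agrees with the target except at every third length.
    return (seq[-1] + 1) % 10 if len(seq) % 3 else 0


if __name__ == "__main__":
    print(speculative_decode(toy_target, toy_draft, [1], 8))
```

In this greedy variant the result is identical to plain greedy decoding with the target model. Any benefit comes from verifying several drafted tokens per target pass, which helps when decoding is limited by memory bandwidth rather than compute.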

Taxonomy

Topics: Machine Learning and Algorithms · Generative Adversarial Networks and Image Synthesis · Natural Language Processing Techniques