ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference

Wangsong Yin; Daliang Xu; Mengwei Xu; Gang Huang; Xuanzhe Liu

arXiv:2508.16703 · cs.PF · April 9, 2026

TL;DR

ShadowNPU introduces shadowAttn, a system-algorithm co-designed sparse attention module that computes attention on only a small fraction of tokens, minimizing CPU/GPU fallback during NPU-centric on-device LLM inference.

Contribution

The paper presents a novel sparse attention method with NPU-based token importance estimation, improving efficiency and accuracy over existing frameworks.

Findings

01

shadowAttn achieves high accuracy with minimal CPU/GPU resources.

02

It outperforms state-of-the-art frameworks in on-device LLM inference.

03

The system reduces CPU/GPU fallback, improving user experience.

Abstract

Running Large Language Models (LLMs) on-device is now a critical enabler of user privacy. We observe that, in state-of-the-art frameworks, the attention operator falls back from the special-purpose NPU to the general-purpose CPU/GPU because of its quantization sensitivity. This fallback degrades the user experience and complicates system scheduling. To this end, this paper presents shadowAttn, a system-algorithm co-designed sparse attention module that minimizes reliance on the CPU/GPU by computing attention on only a tiny portion of tokens. The key idea is to hide the overhead of estimating the important tokens with an NPU-based pilot compute. Further, shadowAttn proposes techniques such as NPU compute graph bucketing, a head-wise NPU-CPU/GPU pipeline, and a per-head fine-grained sparsity ratio to achieve high accuracy and efficiency.…
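The idea in the abstract, estimating which tokens matter with a cheap pilot pass and then computing exact attention on only those tokens, can be illustrated with a minimal single-query sketch. This is not the authors' implementation. The symmetric int8-style quantization, the top-k selection rule, and all function names (quantize, pilot_scores, sparse_attention) are assumptions made for illustration, and the pilot pass here runs in plain Python rather than on an NPU.

```python
import math

def quantize(vec, levels=127):
    """Symmetric int8-style quantization of one vector (assumed scheme)."""
    scale = max(abs(x) for x in vec) / levels or 1.0  # avoid a zero scale
    return [round(x / scale) for x in vec], scale

def pilot_scores(q, keys):
    """Low-precision estimate of q.k per key; stands in for the NPU pilot compute."""
    q_int, q_scale = quantize(q)
    scores = []
    for k in keys:
        k_int, k_scale = quantize(k)
        dot = sum(a * b for a, b in zip(q_int, k_int))
        scores.append(dot * q_scale * k_scale)
    return scores

def top_k_indices(scores, k):
    """Indices of the k largest estimated scores, returned in token order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def sparse_attention(q, keys, values, k):
    """Full-precision softmax attention restricted to the k selected tokens."""
    keep = top_k_indices(pilot_scores(q, keys), k)
    d = len(q)
    logits = [sum(a * b for a, b in zip(q, keys[i])) / math.sqrt(d) for i in keep]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, i in zip(weights, keep):
        for j, v in enumerate(values[i]):
            out[j] += (w / total) * v
    return out, keep

# Hypothetical toy input: 4 cached tokens, 2-dimensional heads.
q = [1.0, 0.0]
keys = [[10.0, 0.0], [0.0, 10.0], [9.0, 0.0], [-10.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 0.0]]
out, keep = sparse_attention(q, keys, values, k=2)
```

In this toy input the pilot pass selects tokens 0 and 2, so the exact softmax runs over two tokens instead of four. The selected count k plays the role of a sparsity ratio; how the paper sets that ratio per head is not described in the abstract and is not modeled here.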
