ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference
Wangsong Yin, Daliang Xu, Mengwei Xu, Gang Huang, Xuanzhe Liu

TL;DR
ShadowNPU introduces shadowAttn, a system-algorithm co-designed sparse attention module that minimizes CPU/GPU reliance for efficient on-device LLM inference, enhancing privacy and user experience.
Contribution
It presents a novel sparse attention method with NPU-based token importance estimation, improving efficiency and accuracy over existing frameworks.
Findings
shadowAttn achieves high accuracy while using minimal CPU/GPU resources.
It outperforms state-of-the-art frameworks in on-device LLM inference.
The system reduces CPU/GPU fallback, improving user experience.
Abstract
Running Large Language Models (LLMs) on device is now a critical enabler for preserving user privacy. We observe that, in state-of-the-art frameworks, the attention operator falls back from the special-purpose NPU to the general-purpose CPU/GPU because of its quantization sensitivity. This fallback degrades the user experience and complicates system scheduling. To this end, this paper presents shadowAttn, a system-algorithm co-designed sparse attention module that minimizes reliance on the CPU/GPU by computing attention only on a tiny portion of tokens. The key idea is to hide the overhead of estimating the important tokens with an NPU-based pilot compute. Further, shadowAttn proposes techniques such as NPU compute graph bucketing, a head-wise NPU-CPU/GPU pipeline, and a per-head fine-grained sparsity ratio to achieve high accuracy and efficiency.…
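The estimate-then-attend idea in the abstract can be illustrated with a minimal single-head, single-query sketch. Everything here is an assumption made for illustration rather than the paper's implementation: the INT8 per-vector quantization stands in for the NPU pilot compute, plain top-k selection stands in for the paper's token selection, and the names `quantize_int8`, `estimate_scores`, `attend`, `sparse_attention`, and `keep_ratio` are hypothetical.

```python
import math

def quantize_int8(vec):
    """Symmetric INT8 quantization of one vector; returns (ints, scale)."""
    scale = max(max(abs(x) for x in vec), 1e-8) / 127.0
    return [max(-127, min(127, round(x / scale))) for x in vec], scale

def estimate_scores(q, keys):
    """Low-precision estimate of q . k for every key.
    Assumption: this stands in for the NPU-side pilot compute."""
    q_i8, q_scale = quantize_int8(q)
    scores = []
    for key in keys:
        k_i8, k_scale = quantize_int8(key)
        acc = sum(a * b for a, b in zip(q_i8, k_i8))  # integer accumulation
        scores.append(acc * q_scale * k_scale)        # rescale to float
    return scores

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values, idx):
    """Full-precision attention restricted to the token positions in idx."""
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(a * b for a, b in zip(q, keys[i])) for i in idx]
    weights = softmax(logits)
    dim = len(values[0])
    return [sum(w * values[i][j] for w, i in zip(weights, idx))
            for j in range(dim)]

def sparse_attention(q, keys, values, keep_ratio):
    """Pick the top keep_ratio fraction of tokens by estimated score,
    then run exact attention on those tokens only."""
    n = len(keys)
    k = max(1, math.ceil(keep_ratio * n))
    est = estimate_scores(q, keys)
    ranked = sorted(range(n), key=lambda i: est[i], reverse=True)
    top = sorted(ranked[:k])  # restore positional order
    return attend(q, keys, values, top), top

# Usage: keep 25% of four tokens for one query.
q = [1.0, 0.0]
keys = [[0.1, 0.0], [5.0, 0.0], [-3.0, 1.0], [0.2, 0.3]]
values = [[1.0], [2.0], [3.0], [4.0]]
out, kept = sparse_attention(q, keys, values, keep_ratio=0.25)
```

In this sketch the cheap quantized pass only ranks tokens, so quantization error affects which tokens are kept but not the precision of the final attention output. A per-head sparsity ratio, as the abstract mentions, would correspond to passing a different `keep_ratio` for each head.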
