NPUEval: Optimizing NPU Kernels with LLMs and Open Source Compilers
Sarunas Kalade, Graham Schelle

TL;DR
NPUEval is a benchmark for NPU kernel optimization that evaluates LLM-generated code on real hardware. Results show that automating efficient kernel development remains hard: the best model exceeds 50% vectorization on some kernels, but the dataset-wide average is about 10%.
Contribution
The paper presents NPUEval, a new benchmark dataset and evaluation framework for NPU kernel code generation using LLMs, addressing the lack of specialized benchmarks in this domain.
Findings
DeepSeek R1 achieves over 50% vectorization on some kernels.
The average vectorization score across the dataset is about 10%.
Open source tools enable evaluation of both functional correctness and efficiency.
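The first two findings are compatible: a few well-vectorized kernels can sit alongside a low dataset-wide average. The sketch below illustrates that arithmetic. It assumes the vectorization score is the fraction of a kernel's cycles spent in vector instructions, which this summary does not define. The kernel names and cycle counts are hypothetical placeholders, not results from the paper.

```python
from statistics import mean


def vectorization_score(vector_cycles: int, total_cycles: int) -> float:
    """Fraction of cycles spent in vector instructions (assumed definition)."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return vector_cycles / total_cycles


def summarize(profiles: dict[str, tuple[int, int]]) -> tuple[float, str, float]:
    """Return (average score, best kernel name, best kernel score)."""
    scores = {
        name: vectorization_score(vec, total)
        for name, (vec, total) in profiles.items()
    }
    best = max(scores, key=scores.get)
    return mean(scores.values()), best, scores[best]


# Hypothetical (vector_cycles, total_cycles) per kernel, for illustration only.
profiles = {
    "kernel_a": (600, 1000),  # mostly vectorized
    "kernel_b": (0, 1000),    # fully scalar
    "kernel_c": (0, 1000),
    "kernel_d": (50, 1000),
    "kernel_e": (0, 1000),
}

avg, best, best_score = summarize(profiles)
print(f"average={avg:.2f} best={best} best_score={best_score:.2f}")
```

With these placeholder numbers one kernel scores above 50% while the mean stays far lower, because most kernels contain little or no vector code.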
Abstract
Neural processing units (NPUs) are gaining prominence in power-sensitive devices like client devices, with AI PCs being defined by their inclusion of these specialized processors. Running AI workloads efficiently on these devices requires libraries of optimized kernels. Creating efficient kernels demands expertise in domain-specific C++ with vector intrinsics and in-depth knowledge of the target architecture. Unlike GPU programming, which has had years to mature, NPU programming is new, with smaller and more fragmented developer communities across hardware platforms. This fragmentation poses a challenge when utilizing LLMs to assist in writing NPU kernels, as domain-specific optimized code examples are underrepresented in LLM pre-training data. In this paper we introduce NPUEval -- a benchmark for writing and evaluating NPU kernels, consisting of 102 common operators for machine…
