Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs

Afsara Benazir; Felix Xiaozhu Lin

arXiv:2604.18788 · cs.LG · April 22, 2026

TL;DR

This paper introduces NPUMoE, a runtime engine that accelerates Mixture-of-Experts LLM inference on Apple Silicon NPUs by optimizing expert routing and execution, reducing latency by up to 5.55x and improving energy efficiency by up to 7.37x.

Contribution

NPUMoE enables efficient offloading of MoE inference to Apple Silicon NPUs, overcoming dynamic routing and irregular operator challenges with novel static and load-aware techniques.

Findings

01  NPUMoE reduces latency by up to 5.55x.

02  Energy efficiency improves by up to 7.37x.

03  CPU-cycle usage decreases by up to 5.54x.

Abstract

The Apple Neural Engine (ANE) is a dedicated neural processing unit (NPU) present in every Apple Silicon chip. Mixture-of-Experts (MoE) LLMs improve inference efficiency via sparse activation but are challenging for NPUs in three ways: expert routing is unpredictable and introduces dynamic tensor shapes that conflict with the shape-specific constraints of NPUs; several irregular operators, such as top-k and scatter/gather, are not NPU-friendly; and launching many small expert kernels incurs substantial dispatch and synchronization overhead. NPUs are designed to offload AI compute from the CPU and GPU; our goal is to enable such offloading for MoE inference, particularly during prefill, where long-context workloads consume substantial system resources. This paper presents NPUMoE, a runtime inference engine that accelerates MoE execution on Apple Silicon by offloading dense, static computation…
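
The first challenge named in the abstract, routing-induced dynamic shapes, can be illustrated with a small sketch. This is not NPUMoE's implementation: the function names, the random router scores, and the fixed-capacity padding workaround are all assumptions chosen for illustration. It shows only that top-k routing gives each expert a token count that depends on the input, and that padding each expert's batch to a fixed capacity is one generic way to obtain static shapes.

```python
import random

def top_k_route(scores, k):
    """Return the indices of the k highest-scoring experts for one token."""
    return sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]

def group_tokens_by_expert(router_scores, k):
    """Map each expert to the list of token indices routed to it."""
    num_experts = len(router_scores[0])
    groups = [[] for _ in range(num_experts)]
    for token, scores in enumerate(router_scores):
        for expert in top_k_route(scores, k):
            groups[expert].append(token)
    return groups

def pad_to_capacity(groups, capacity, pad_id=-1):
    """Give every expert a fixed-length token list (hypothetical workaround).

    Tokens beyond `capacity` are dropped here for simplicity; pad_id marks
    empty slots that a real engine would have to mask out.
    """
    return [(g[:capacity] + [pad_id] * capacity)[:capacity] for g in groups]

def random_scores(num_tokens, num_experts, seed):
    """Stand-in for a learned router: uniform random scores per token."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(num_experts)] for _ in range(num_tokens)]

if __name__ == "__main__":
    for seed in (0, 1):
        groups = group_tokens_by_expert(random_scores(8, 4, seed), k=2)
        # Per-expert token counts depend on the input: a dynamic shape.
        print("tokens per expert:", [len(g) for g in groups])
        # With capacity equal to the token count, nothing is dropped.
        print("padded lengths:   ", [len(p) for p in pad_to_capacity(groups, capacity=8)])
```

Padding trades wasted compute on empty slots for shape stability, which is why the choice of capacity matters in practice; how NPUMoE itself handles this trade-off is not stated in the truncated abstract above.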
