Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs
Afsara Benazir, Felix Xiaozhu Lin

TL;DR
This paper introduces NPUMoE, a runtime engine that accelerates Mixture-of-Experts LLM inference on Apple Silicon NPUs by optimizing expert routing and execution, significantly improving latency and energy efficiency.
Contribution
NPUMoE enables efficient offloading of MoE inference to Apple Silicon NPUs, overcoming dynamic routing and irregular operator challenges with novel static and load-aware techniques.
Findings
NPUMoE reduces latency by up to 5.55x
Energy efficiency improves by up to 7.37x
CPU-cycle usage decreases by up to 5.54x
Abstract
The Apple Neural Engine (ANE) is a dedicated neural processing unit (NPU) present in every Apple Silicon chip. Mixture-of-Experts (MoE) LLMs improve inference efficiency via sparse activation but are challenging for NPUs in three ways: expert routing is unpredictable and introduces dynamic tensor shapes that conflict with the shape-specific constraints of NPUs; several irregular operators, such as top-k and scatter/gather, are not NPU-friendly; and launching many small expert kernels incurs substantial dispatch and synchronization overhead. NPUs are designed to offload AI compute from the CPU and GPU; our goal is to enable such offloading for MoE inference, particularly during prefill, where long-context workloads consume substantial system resources. This paper presents NPUMoE, a runtime inference engine that accelerates MoE execution on Apple Silicon by offloading dense, static computation…
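To make the first challenge concrete, the following is a minimal sketch of why top-k routing yields dynamic shapes and how a fixed per-expert capacity can restore static ones. It is an illustration only, not NPUMoE's actual mechanism: the function names, the random router scores, the `CAPACITY` value, and the padding-with-overflow scheme are all assumptions introduced here.

```python
import random

def top_k_route(scores, k):
    """Return the indices of the k highest-scoring experts for one token."""
    return sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]

def group_by_expert(router_scores, k):
    """Map each expert to the token ids routed to it; list lengths vary per input."""
    num_experts = len(router_scores[0])
    groups = [[] for _ in range(num_experts)]
    for tok, scores in enumerate(router_scores):
        for e in top_k_route(scores, k):
            groups[e].append(tok)
    return groups

def pad_to_capacity(groups, capacity, pad_id=-1):
    """Give every expert exactly `capacity` slots (hypothetical scheme).

    Unused slots are filled with pad_id; tokens beyond the capacity are
    returned as (expert, token) pairs so a fallback path could handle them.
    """
    padded, overflow = [], []
    for e, toks in enumerate(groups):
        kept = toks[:capacity]
        overflow.extend((e, t) for t in toks[capacity:])
        padded.append(kept + [pad_id] * (capacity - len(kept)))
    return padded, overflow

# Illustrative sizes and random scores; none of these values come from the paper.
random.seed(0)
NUM_TOKENS, NUM_EXPERTS, TOP_K, CAPACITY = 16, 4, 2, 8
router_scores = [[random.random() for _ in range(NUM_EXPERTS)]
                 for _ in range(NUM_TOKENS)]

groups = group_by_expert(router_scores, TOP_K)
padded, overflow = pad_to_capacity(groups, CAPACITY)

print("tokens per expert (dynamic):", [len(g) for g in groups])
print("slots per expert (static):  ", [len(p) for p in padded])
print("overflow assignments:       ", len(overflow))
```

The per-expert token counts depend on the input, which is what conflicts with shape-specific NPU execution. Padding to a fixed capacity trades some wasted compute on padded slots for a shape that is known ahead of time.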
