Characterizing Mobile SoC for Accelerating Heterogeneous LLM Inference
Le Chen, Dahu Feng, Erhu Feng, Yingrui Wang, Rong Zhao, Yubin Xia, Pinjie Xu, Haibo Chen

TL;DR
This paper introduces HeteroInfer, a novel mobile LLM inference engine that efficiently utilizes both GPU and NPU accelerators, achieving significant speedups by exploiting heterogeneous processing and memory bandwidth.
Contribution
The paper presents a comprehensive characterization of mobile SoC heterogeneity and proposes mechanisms for joint GPU-NPU utilization, improving LLM inference performance on mobile devices.
Findings
HeteroInfer achieves up to 6.02x speedup over existing engines.
The proposed synchronization mechanism reduces GPU-NPU synchronization overhead.
HeteroInfer uses resources efficiently, with negligible impact on other apps.
Abstract
With the rapid advancement of artificial intelligence technologies such as ChatGPT, AI agents, and video generation, contemporary mobile systems have begun integrating these AI capabilities on local devices to enhance privacy and reduce response latency. To meet the computational demands of AI tasks, current mobile SoCs are equipped with diverse AI accelerators, including GPUs and Neural Processing Units (NPUs). However, there has not been a comprehensive characterization of these heterogeneous processors, and existing designs typically only leverage a single AI accelerator for LLM inference, leading to suboptimal use of computational resources and memory bandwidth. In this paper, we first summarize key performance characteristics of heterogeneous processors, SoC memory bandwidth, etc. Drawing on these observations, we propose different heterogeneous parallel mechanisms to fully…
Taxonomy
Topics: Machine Learning in Materials Science · Natural Language Processing Techniques · Parallel Computing and Optimization Techniques
