Fast On-device LLM Inference with NPUs

Daliang Xu; Hao Zhang; Liming Yang; Ruiqi Liu; Gang Huang; Mengwei Xu; Xuanzhe Liu

arXiv:2407.05858·cs.AI·December 17, 2024·1 cites

TL;DR

This paper introduces llm.npu, a system that offloads LLM prefill to on-device NPUs, cutting prefill latency and energy use for mobile-sized LLMs.

Contribution

It reconstructs the prompt and the model at three levels (prompt, tensor, and block) so that LLM inference on mobile devices can be offloaded to the NPU efficiently.

Findings

01 22.4× faster prefill speed

02 30.7× energy savings

03 Over 1,000 tokens/sec prefilling for billion-sized models

Abstract

On-device inference for Large Language Models (LLMs), driven by increasing privacy concerns and advancements of mobile-sized models, has gained significant interest. However, even mobile-sized LLMs (e.g., Gemma-2B) encounter unacceptably high inference latency, often bottlenecked by the prefill stage in tasks like screen UI understanding. We present llm.npu, the first LLM inference system utilizing on-device Neural Processing Unit (NPU) offloading to reduce prefill latency. llm.npu enhances NPU offloading efficiency by re-constructing the prompt and model in three levels: (1) At prompt level, it divides variable-length prompts into multiple fixed-sized chunks while maintaining data dependencies; (2) At tensor level, it identifies and extracts significant outliers to run on the CPU/GPU in parallel with minimal overhead; (3) At block level, it schedules Transformer blocks in an…
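The prompt-level and tensor-level steps described in the abstract can be illustrated with a small sketch. This is not the llm.npu implementation: the function names (`split_prompt`, `visible_keys`, `split_outliers`), the padding token, the chunk size, and the clipping threshold are all hypothetical, and clipping with a sparse residual is only one simple way to separate outliers from a dense tensor.

```python
from typing import Dict, List, Tuple


def split_prompt(tokens: List[int], chunk_size: int,
                 pad_id: int = 0) -> Tuple[List[List[int]], List[int]]:
    """Split a variable-length prompt into fixed-size chunks.

    Returns the padded chunks and the number of real tokens in each,
    so every chunk has the same static shape.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks: List[List[int]] = []
    valid: List[int] = []
    for start in range(0, len(tokens), chunk_size):
        piece = tokens[start:start + chunk_size]
        valid.append(len(piece))
        # Pad only the final, shorter chunk up to the fixed size.
        chunks.append(piece + [pad_id] * (chunk_size - len(piece)))
    return chunks, valid


def visible_keys(chunk_index: int, offset: int, chunk_size: int) -> int:
    """Number of positions a token may attend to under causal attention.

    A token sees every token of all earlier chunks plus the tokens up to
    and including itself in its own chunk; this is the data dependency
    that chunking has to preserve.
    """
    return chunk_index * chunk_size + offset + 1


def split_outliers(values: List[float],
                   threshold: float) -> Tuple[List[float], Dict[int, float]]:
    """Separate large-magnitude entries from a dense tensor (assumed scheme).

    The dense part is clipped to [-threshold, threshold]; the sparse part
    stores the residual, so dense[i] + outliers.get(i, 0.0) == values[i].
    """
    dense: List[float] = []
    outliers: Dict[int, float] = {}
    for i, v in enumerate(values):
        clipped = max(-threshold, min(threshold, v))
        dense.append(clipped)
        if clipped != v:
            outliers[i] = v - clipped
    return dense, outliers


if __name__ == "__main__":
    chunks, valid = split_prompt(list(range(1, 11)), chunk_size=4)
    print(chunks, valid)
    print(visible_keys(chunk_index=2, offset=1, chunk_size=4))
    print(split_outliers([0.5, -9.0, 2.0, 12.0], threshold=3.0))
```

In this sketch the fixed chunk shape stands in for the static shapes an accelerator typically prefers, and the dense and sparse parts could in principle be processed on different processors and summed afterwards, since a linear layer distributes over the sum.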

Code & Models

Repositories

ubiquitouslearning/mllm
pytorch · Official

Taxonomy

Topics: Power Line Communications and Noise · VLSI and Analog Circuit Testing

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Softmax · Byte Pair Encoding · Layer Normalization · Label Smoothing · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam