Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs

Luchang Li; Sheng Qian; Jie Lu; Lunxi Yuan; Rui Wang; Qin Xie

arXiv:2403.20041 · cs.CL · July 8, 2024 · 1 citation

Open Access

TL;DR

Transformer-Lite introduces four optimization techniques to enable high-efficiency deployment of large language models on mobile phone GPUs, significantly improving inference speed and user experience.

Contribution

The paper presents a novel mobile inference engine, Transformer-Lite, with four key optimizations for fast LLM inference on device GPUs, compatible with Qualcomm and MTK processors.

Findings

1. Achieved 10x speedup over CPU-based FastLLM.

2. Attained 2-3x faster decoding speeds compared to GPU-based MLC-LLM.

3. Supported LLMs from 2B to 14B parameters with high efficiency.

Abstract

The Large Language Model (LLM) is widely employed for tasks such as intelligent assistants, text summarization, translation, and multi-modality on mobile phones. However, the current methods for on-device LLM deployment maintain slow inference speed, which causes poor user experience. To facilitate high-efficiency LLM deployment on device GPUs, we propose four optimization techniques: (a) a symbolic expression-based approach to support dynamic shape model inference; (b) operator optimizations and execution priority setting to enhance inference speed and reduce phone lagging; (c) an FP4 quantization method termed M0E4 to reduce dequantization overhead; (d) a sub-tensor-based technique to eliminate the need for copying KV cache after LLM inference. Furthermore, we implement these methods in our mobile inference engine, Transformer-Lite, which is compatible with both Qualcomm and MTK…
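
Technique (d) in the abstract, writing new keys and values into a sub-tensor of the KV cache so that nothing has to be copied after inference, can be illustrated with a small sketch. This is not the paper's implementation, which targets phone GPUs and is not shown on this page. It is a CPU-only toy in plain Python, assuming a single attention head and a fixed maximum context length. The class `PreallocatedKVCache` and its methods `next_slots`, `commit`, and `valid` are hypothetical names introduced for illustration.

```python
from array import array


class PreallocatedKVCache:
    """Toy single-head KV cache backed by one preallocated buffer per tensor.

    Hypothetical illustration only: the names and layout are assumptions,
    not the Transformer-Lite API.
    """

    def __init__(self, max_tokens: int, head_dim: int) -> None:
        self.max_tokens = max_tokens
        self.head_dim = head_dim
        self.length = 0  # number of token rows currently valid
        # Allocate the full-capacity buffers once, up front.
        self._k = memoryview(array("f", [0.0] * (max_tokens * head_dim)))
        self._v = memoryview(array("f", [0.0] * (max_tokens * head_dim)))

    def next_slots(self, n_tokens: int):
        """Return writable views covering the rows for the next n_tokens."""
        if self.length + n_tokens > self.max_tokens:
            raise ValueError("KV cache capacity exceeded")
        start = self.length * self.head_dim
        stop = (self.length + n_tokens) * self.head_dim
        # Slicing a memoryview yields a view into the same memory, not a copy.
        return self._k[start:stop], self._v[start:stop]

    def commit(self, n_tokens: int) -> None:
        """Mark n_tokens more rows as valid; no data is moved."""
        self.length += n_tokens

    def valid(self):
        """Return views over all rows written so far (also zero-copy)."""
        stop = self.length * self.head_dim
        return self._k[:stop], self._v[:stop]


cache = PreallocatedKVCache(max_tokens=8, head_dim=2)

# Prefill: a stand-in for the model writes 3 tokens directly into the cache.
k_out, v_out = cache.next_slots(3)
k_out[:] = array("f", [1.0, 1.0, 2.0, 2.0, 3.0, 3.0])
v_out[:] = array("f", [0.5] * 6)
cache.commit(3)

# Decode: one more token, again written in place.
k_out, v_out = cache.next_slots(1)
k_out[:] = array("f", [4.0, 4.0])
v_out[:] = array("f", [0.25, 0.25])
cache.commit(1)

k_all, v_all = cache.valid()
print(k_all.tolist())
```

The point of the design is that the attention step's output location and the cache's storage location are the same memory, so extending the cache after each step is only a counter update. A concatenation-based cache would instead allocate a larger tensor and copy every earlier row on each step.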

Taxonomy

Topics: Topic Modeling · Speech and dialogue systems · Context-Aware Activity Recognition Systems

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings