MNN-LLM: A Generic Inference Engine for Fast Large Language Model Deployment on Mobile Devices
Zhaode Wang, Jingbang Yang, Xinyu Qian, Shiwen Xing, Xiaotang Jiang, Chengfei Lv, Shengyu Zhang

TL;DR
MNN-LLM is an inference engine for deploying large language models on mobile devices. It reduces memory usage through model quantization and DRAM-Flash hybrid storage, and speeds up inference with hardware-aware optimizations for mobile CPUs and GPUs.
Contribution
The paper introduces MNN-LLM, a novel framework that combines model quantization, hybrid storage, and hardware-aware optimizations to improve LLM inference speed on mobile devices.
Findings
Achieves up to 8.6x speedup over existing frameworks.
Reduces memory usage through model quantization and hybrid storage.
Enhances performance with hardware-aware weight and input rearrangement.
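To make the memory claim concrete, here is a minimal sketch of symmetric per-row int8 weight quantization. It is an illustration only: the summary does not state which quantization scheme or bit widths MNN-LLM uses, and the names `quantize_row` and `dequantize_row` are hypothetical, not MNN APIs.

```python
def quantize_row(row, bits=8):
    """Symmetric quantization of one weight row to signed integers (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for int8
    max_abs = max(abs(w) for w in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # one float scale per row
    q = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return q, scale


def dequantize_row(q, scale):
    """Recover approximate float weights from integers and the row scale."""
    return [v * scale for v in q]


# Toy weight matrix; the values are made up for illustration.
weights = [[0.12, -0.50, 0.33, 0.07],
           [1.20, -0.80, 0.05, -1.10]]

quantized = [quantize_row(r) for r in weights]
restored = [dequantize_row(q, s) for q, s in quantized]

# Worst-case reconstruction error across all weights.
max_err = max(abs(a - b)
              for ra, rb in zip(weights, restored)
              for a, b in zip(ra, rb))

# Storage: 4 bytes per fp32 weight vs. 1 byte per int8 weight plus one fp32 scale per row.
fp32_bytes = sum(len(r) for r in weights) * 4
int8_bytes = sum(len(r) for r in weights) * 1 + len(weights) * 4
```

In this toy case the per-row scale overhead is large relative to four weights per row; for realistic row lengths the scale cost becomes negligible and int8 storage approaches one quarter of fp32.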
Abstract
Large language models (LLMs) have demonstrated exceptional performance across a variety of tasks. However, their substantial scale leads to significant computational resource consumption during inference, resulting in high costs. Consequently, edge device inference presents a promising solution. The primary challenges of edge inference include memory usage and inference speed. This paper introduces MNN-LLM, a framework specifically designed to accelerate the deployment of large language models on mobile devices. MNN-LLM addresses the runtime characteristics of LLMs through model quantization and DRAM-Flash hybrid storage, effectively reducing memory usage. It rearranges weights and inputs based on mobile CPU instruction sets and GPU characteristics while employing strategies such as multicore load balancing, mixed-precision floating-point operations, and geometric computations to…
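The idea behind DRAM-Flash hybrid storage can be sketched as keeping a large, sparsely accessed table on flash and reading only the rows that are needed. The sketch below assumes the table is an embedding-like matrix accessed one row per token and uses a memory-mapped file to stand in for flash. Both choices are assumptions for illustration; the abstract does not say which tensors MNN-LLM places in flash or how it reads them, and `write_table` and `read_row` are hypothetical helpers.

```python
import mmap
import os
import struct
import tempfile

ROWS, DIM = 1000, 8          # hypothetical table shape
ROW_BYTES = DIM * 4          # fp32 rows


def write_table(path):
    """Write a toy table to disk; row r is filled with the value float(r)."""
    with open(path, "wb") as f:
        for r in range(ROWS):
            f.write(struct.pack(f"<{DIM}f", *[float(r)] * DIM))


def read_row(mm, token_id):
    """Read a single row from the mapped file without loading the whole table."""
    start = token_id * ROW_BYTES
    return struct.unpack(f"<{DIM}f", mm[start:start + ROW_BYTES])


# A temporary file plays the role of flash storage.
fd, path = tempfile.mkstemp()
os.close(fd)
write_table(path)

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    row = read_row(mm, 42)   # only this row's bytes are copied into process memory
    mm.close()

os.remove(path)
```

The trade-off is latency for capacity: each lookup touches storage that is slower than DRAM, which is only acceptable for data that is read in small pieces per inference step.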
