MNN-LLM: A Generic Inference Engine for Fast Large Language Model Deployment on Mobile Devices

Zhaode Wang; Jingbang Yang; Xinyu Qian; Shiwen Xing; Xiaotang Jiang; Chengfei Lv; Shengyu Zhang

arXiv:2506.10443 · cs.LG · June 13, 2025

TL;DR

MNN-LLM is a versatile inference engine that significantly accelerates large language model deployment on mobile devices by optimizing memory usage and computational efficiency, enabling faster inference with reduced resource consumption.

Contribution

The paper introduces MNN-LLM, a novel framework that combines model quantization, hybrid storage, and hardware-aware optimizations to improve LLM inference speed on mobile devices.

Findings

01 Achieves up to 8.6x speedup over existing frameworks.

02 Reduces memory usage through model quantization and hybrid storage.

03 Enhances performance with hardware-aware weight and input rearrangement.

Abstract

Large language models (LLMs) have demonstrated exceptional performance across a variety of tasks. However, their substantial scale leads to significant computational resource consumption during inference, resulting in high costs. Consequently, edge device inference presents a promising solution. The primary challenges of edge inference include memory usage and inference speed. This paper introduces MNN-LLM, a framework specifically designed to accelerate the deployment of large language models on mobile devices. MNN-LLM addresses the runtime characteristics of LLMs through model quantization and DRAM-Flash hybrid storage, effectively reducing memory usage. It rearranges weights and inputs based on mobile CPU instruction sets and GPU characteristics while employing strategies such as multicore load balancing, mixed-precision floating-point operations, and geometric computations to…
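To make the memory argument concrete, the following is a minimal sketch of weight quantization in general, not MNN-LLM's actual scheme. It assumes symmetric per-row quantization to signed integers. The function names, the example row, the bit widths, and the 1-billion-parameter count are all illustrative assumptions, not values taken from the paper.

```python
from typing import List, Tuple


def quantize_row(row: List[float], bits: int = 8) -> Tuple[List[int], float]:
    """Symmetric quantization of one weight row to signed `bits`-bit integers.

    Hypothetical helper for illustration; not an MNN-LLM API.
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return q, scale


def dequantize_row(q: List[int], scale: float) -> List[float]:
    """Recover approximate floating-point weights from integers and a scale."""
    return [v * scale for v in q]


def weight_bytes(num_params: int, bits: int) -> float:
    """Storage for the weights alone, ignoring scales and other metadata."""
    return num_params * bits / 8


# Illustrative weights (assumed values).
row = [0.4, -1.0, 0.25, 0.0]
q8, s8 = quantize_row(row, bits=8)
restored = dequantize_row(q8, s8)
max_err = max(abs(a - b) for a, b in zip(row, restored))

# Assumed model size of 1 billion parameters, for the arithmetic only.
fp16_bytes = weight_bytes(1_000_000_000, 16)
int4_bytes = weight_bytes(1_000_000_000, 4)
print(f"max round-trip error: {max_err:.6f}")
print(f"fp16 / int4 weight storage ratio: {fp16_bytes / int4_bytes:.1f}")
```

The round-trip error is bounded by half a quantization step per weight, which is the usual trade-off: fewer bits per weight shrink storage proportionally but coarsen the representable values. Real engines typically also store per-group scales, which adds a small overhead that this sketch ignores.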
