LLM as a System Service on Mobile Devices
Wangsong Yin, Mengwei Xu, Yuanchun Li, and Xuanzhe Liu

TL;DR
This paper introduces LLMS, a system that enables efficient on-device execution of large language models by optimizing memory management and KV cache handling, significantly reducing context switching latency on mobile devices.
Contribution
It proposes a novel memory management framework for LLMs on mobile devices, including three techniques: tolerance-aware compression, IO-recompute pipelining, and chunk lifecycle management.
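To make the first technique more concrete, here is a minimal sketch of chunk-wise compression in which each chunk's bit width depends on a tolerance score. It is an illustration only, not the paper's implementation: the tolerance values in [0, 1], the thresholds in `bits_for_tolerance`, the uniform min-max quantizer, and all function names are assumptions made for this example.

```python
from typing import List, Tuple

def bits_for_tolerance(tolerance: float) -> int:
    """Hypothetical policy: chunks that tolerate more error get fewer bits."""
    if tolerance < 0.33:
        return 8
    if tolerance < 0.66:
        return 4
    return 2

def quantize_chunk(values: List[float], bits: int) -> Tuple[List[int], float, float]:
    """Uniform min-max quantization of one chunk; returns (codes, offset, scale)."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Clamp to guard against floating-point overshoot at the upper edge.
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, lo, scale

def dequantize_chunk(codes: List[int], lo: float, scale: float) -> List[float]:
    """Reconstruct approximate values from quantized codes."""
    return [lo + c * scale for c in codes]

def compress_cache(cache: List[float], chunk_size: int, tolerances: List[float]):
    """Split a flat cache into fixed-size chunks and quantize each one separately."""
    chunks = [cache[i:i + chunk_size] for i in range(0, len(cache), chunk_size)]
    if len(chunks) != len(tolerances):
        raise ValueError("need exactly one tolerance score per chunk")
    compressed = []
    for chunk, tol in zip(chunks, tolerances):
        bits = bits_for_tolerance(tol)
        codes, lo, scale = quantize_chunk(chunk, bits)
        compressed.append({"bits": bits, "codes": codes, "lo": lo, "scale": scale})
    return compressed
```

A chunk with a low tolerance score keeps 8 bits per value and a high-tolerance chunk drops to 2 bits, so the memory saving is concentrated where the assumed accuracy cost is smallest. How the tolerance is actually measured is not covered by this sketch.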
Findings
Reduces context switching latency by up to 100x.
Achieves efficient KV cache compression with minimal accuracy loss.
Demonstrates effectiveness on various edge devices and traces.
Abstract
Being more powerful and intrusive into user-device interactions, LLMs are eager for on-device execution to better preserve user privacy. In this work, we propose a new paradigm of mobile AI: LLM as a system service on mobile devices (LLMaaS). Unlike traditional DNNs that execute in a stateless manner, such a system service is stateful: LLM execution often needs to maintain persistent states (mainly the KV cache) across multiple invocations. To minimize the LLM context switching overhead under a tight device memory budget, this work presents LLMS, which decouples the memory management of app and LLM contexts with a key idea of fine-grained, chunk-wise, globally-optimized KV cache compression and swapping. By fully leveraging KV cache's unique characteristics, it proposes three novel techniques: (1) Tolerance-Aware Compression: it compresses chunks based on their measured accuracy tolerance to…
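The idea of keeping KV cache chunks within a memory budget by swapping them out can be sketched as follows. This is a simplified stand-in, not the system described in the paper: it uses plain least-recently-used eviction, models flash storage as an in-memory dictionary, and the `ChunkStore` class with its method names is hypothetical.

```python
from collections import OrderedDict

class ChunkStore:
    """Keeps chunks in memory up to a byte budget; swaps out the oldest first."""

    def __init__(self, budget_bytes: int):
        self.budget = budget_bytes
        self.resident = OrderedDict()  # chunk_id -> bytes, least recently used first
        self.swapped = {}              # assumption: stand-in for on-device storage
        self.used = 0

    def put(self, chunk_id: str, data: bytes) -> None:
        """Insert or overwrite a chunk, then evict if over budget."""
        if chunk_id in self.resident:
            self.used -= len(self.resident.pop(chunk_id))
        self.swapped.pop(chunk_id, None)
        self.resident[chunk_id] = data
        self.used += len(data)
        self._evict()

    def get(self, chunk_id: str) -> bytes:
        """Return a chunk, swapping it back in if needed (KeyError if unknown)."""
        if chunk_id in self.resident:
            self.resident.move_to_end(chunk_id)
            return self.resident[chunk_id]
        data = self.swapped.pop(chunk_id)
        self.resident[chunk_id] = data
        self.used += len(data)
        self._evict()
        return data

    def _evict(self) -> None:
        # Always keep the most recently touched chunk resident.
        while self.used > self.budget and len(self.resident) > 1:
            victim, data = self.resident.popitem(last=False)
            self.swapped[victim] = data
            self.used -= len(data)
```

In this sketch the cost of a context switch is the time spent in `get` swapping chunks back in, which is the overhead a smarter eviction and compression policy would aim to reduce.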
Taxonomy
Topics: Business Process Modeling and Analysis · Wireless Sensor Networks for Data Analysis · Advanced Computational Techniques and Applications
