Elastic On-Device LLM Service
Wangsong Yin, Rongjie Yi, Daliang Xu, Gang Huang, Mengwei Xu, Xuanzhe Liu

TL;DR
This paper introduces an elastic on-device LLM service that dynamically adjusts model and prompt configurations to meet diverse latency SLOs, improving accuracy and efficiency on smartphones.
Contribution
It proposes a novel elasticization method for on-device LLMs combining neuron reordering and a dual-head tiny model, enabling flexible SLO adherence with minimal overhead.
Findings
The system outperforms baselines in accuracy by up to 14.83%.
The system maintains less than 1% switching overhead.
The implementation is feasible on commercial smartphones.
Abstract
On-device Large Language Models (LLMs) are transforming mobile AI, catalyzing applications like UI automation without privacy concerns. The common practice today is to deploy a single, powerful LLM as a general task solver for multiple requests. We identify a key system challenge in this paradigm: current LLMs lack the elasticity to serve requests that have diversified Service-Level Objectives (SLOs) on inference latency. To tackle this, we present an on-device LLM service that elasticizes both the model and the prompt dimension of a full LLM. It incorporates (1) a one-shot neuron-reordering method, which leverages the intrinsic permutation consistency in transformer models to generate high-quality elasticized sub-models with minimal runtime switching overhead; (2) a dual-head tiny language model, which efficiently and effectively refines the prompt and orchestrates the…
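The permutation consistency mentioned in the abstract can be illustrated on a single feed-forward block: reordering hidden neurons (rows of the up-projection together with the matching columns of the down-projection) leaves the output unchanged. The sketch below is a minimal pure-Python illustration, not the paper's implementation. The importance score (a weight-norm product), the helper names, and the idea of taking a contiguous prefix as the sub-model are all assumptions made for this example.

```python
import random

def mlp(x, w_up, w_down):
    """Two-layer ReLU block: y = W_down * relu(W_up * x), weights as row-major lists."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_up]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_down]

def permute_neurons(w_up, w_down, order):
    """Reorder hidden neurons: rows of w_up and the matching columns of w_down."""
    new_up = [w_up[i] for i in order]
    new_down = [[row[i] for i in order] for row in w_down]
    return new_up, new_down

def prefix_submodel(w_up, w_down, k):
    """Keep only the first k hidden neurons, i.e. a contiguous slice of each matrix."""
    return w_up[:k], [row[:k] for row in w_down]

random.seed(0)
d_in, d_hidden, d_out = 4, 8, 3
w_up = [[random.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_hidden)]
w_down = [[random.uniform(-1, 1) for _ in range(d_hidden)] for _ in range(d_out)]
x = [random.uniform(-1, 1) for _ in range(d_in)]

# Hypothetical importance score (an assumption, not the paper's criterion):
# product of the L2 norms of a neuron's incoming and outgoing weights.
importance = [
    sum(w * w for w in w_up[i]) ** 0.5 * sum(row[i] ** 2 for row in w_down) ** 0.5
    for i in range(d_hidden)
]
order = sorted(range(d_hidden), key=lambda i: -importance[i])

# One-off reordering: most important neurons first.
p_up, p_down = permute_neurons(w_up, w_down, order)

# The permuted full model computes the same function, up to float rounding.
y_full = mlp(x, w_up, w_down)
y_perm = mlp(x, p_up, p_down)
max_diff = max(abs(a - b) for a, b in zip(y_full, y_perm))

# A smaller sub-model is then just a prefix of the reordered weights.
sub_up, sub_down = prefix_submodel(p_up, p_down, 4)
y_sub = mlp(x, sub_up, sub_down)
```

Under these assumptions, switching between sub-model sizes only changes the slice length `k` and copies no weights, which is one plausible reading of why the abstract reports minimal runtime switching overhead.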
Taxonomy
Topics: Neural Networks and Applications · Topic Modeling
