Elastic On-Device LLM Service

Wangsong Yin; Rongjie Yi; Daliang Xu; Gang Huang; Mengwei Xu; Xuanzhe Liu

arXiv:2409.09071 · cs.DC · October 7, 2025

TL;DR

This paper introduces an elastic on-device LLM service that dynamically adjusts model and prompt configurations to meet diverse latency SLOs, improving accuracy and efficiency on smartphones.

Contribution

It proposes a novel elasticization method for on-device LLMs combining neuron reordering and a dual-head tiny model, enabling flexible SLO adherence with minimal overhead.

Findings

01

The system outperforms baselines in accuracy by up to 14.83%.

02

The system maintains less than 1% switching overhead.

03

The implementation is feasible on commercial smartphones.

Abstract

On-device Large Language Models (LLMs) are transforming mobile AI, catalyzing applications like UI automation without privacy concerns. Nowadays, the common practice is to deploy a single yet powerful LLM as a general task solver for multiple requests. We identify a key system challenge in this paradigm: current LLMs lack the elasticity to serve requests that have diversified Service-Level Objectives (SLOs) on inference latency. To tackle this, we present an on-device LLM service that elasticizes both the model and the prompt dimensions of a full LLM. It incorporates (1) a one-shot neuron-reordering method, which leverages the intrinsic permutation consistency in transformer models to generate high-quality elasticized sub-models with minimal runtime switching overhead; (2) a dual-head tiny language model, which efficiently and effectively refines the prompt and orchestrates the…
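
The neuron-reordering idea in the abstract can be illustrated with a minimal sketch. It assumes "permutation consistency" refers to the fact that permuting the hidden neurons of a feed-forward block (rows of the up-projection together with the matching columns of the down-projection) leaves the output unchanged. The importance score used here (L1 norm of outgoing weights) and all function names are hypothetical placeholders, not the paper's method; the excerpt does not say how neurons are ranked.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mlp(W_up, W_down, x):
    # y = W_down @ relu(W_up @ x)
    # W_up has shape (hidden, d_in); W_down has shape (d_out, hidden).
    return matvec(W_down, relu(matvec(W_up, x)))

def reorder(W_up, W_down, order):
    # Permute hidden neurons: rows of W_up and the matching columns of W_down.
    W_up_p = [W_up[i] for i in order]
    W_down_p = [[row[i] for i in order] for row in W_down]
    return W_up_p, W_down_p

def sub_model(W_up, W_down, k):
    # After reordering, a sub-model is a contiguous prefix of k neurons,
    # so selecting one is a slice rather than a gather over scattered indices.
    return W_up[:k], [row[:k] for row in W_down]

rng = random.Random(0)
d_in, hidden, d_out = 4, 8, 3
W_up = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(hidden)]
W_down = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(d_out)]
x = [rng.uniform(-1, 1) for _ in range(d_in)]

# Hypothetical importance score (an assumption, not taken from the paper):
# L1 norm of each hidden neuron's outgoing weights.
importance = [sum(abs(row[j]) for row in W_down) for j in range(hidden)]
order = sorted(range(hidden), key=lambda j: -importance[j])

W_up_r, W_down_r = reorder(W_up, W_down, order)

# The reordered full model matches the original up to floating-point rounding.
y_full = mlp(W_up, W_down, x)
y_reordered = mlp(W_up_r, W_down_r, x)
max_diff = max(abs(a - b) for a, b in zip(y_full, y_reordered))

# A half-width sub-model keeps only the first hidden // 2 neurons.
W_up_half, W_down_half = sub_model(W_up_r, W_down_r, hidden // 2)
y_half = mlp(W_up_half, W_down_half, x)
```

Because the reordering is done once, each smaller sub-model in this sketch is just a shorter prefix of the same weight arrays, which is one plausible reading of why switching between sub-models at runtime could be cheap.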


Taxonomy

Topics: Neural Networks and Applications · Topic Modeling
