FlexServe: A Fast and Secure LLM Serving System for Mobile Devices with Flexible Resource Isolation

Yinpeng Wu; Yitong Chen; Lixiang Wang; Jinyu Gu; Zhichao Hua; Yubin Xia

arXiv:2603.09046·cs.CR·April 23, 2026

TL;DR

FlexServe is an LLM serving system that enables fast, secure, and resource-efficient inference on mobile devices through flexible isolation of memory and the NPU within ARM TrustZone. It reports a 10.05× speedup in Time to First Token over strawman designs and a 24.30× end-to-end speedup in multi-model workflows.

Contribution

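FlexServe introduces a Flexible Resource Isolation mechanism, which constructs Flexible Secure Memory (Flex-Mem) and a Flexible Secure NPU (Flex-NPU), together with a comprehensive LLM inference framework within TrustZone, so that protected inference on mobile devices is both secure and efficient.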

Findings

01

Achieves 10.05× speedup in Time to First Token over strawman designs.

02

Attains 24.30× end-to-end speedup in multi-model workflows.

03

Demonstrates these performance improvements with a prototype implementation.
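
To make the speedup figures above concrete, here is a minimal sketch of how such a factor relates a baseline latency to an improved one, assuming the usual definition (baseline time divided by improved time). The function names and the 20.1-second baseline are hypothetical and chosen only for illustration; they are not measurements or code from the paper.

```python
def speedup(baseline_seconds: float, improved_seconds: float) -> float:
    """Return how many times faster the improved run is than the baseline."""
    if baseline_seconds <= 0 or improved_seconds <= 0:
        raise ValueError("latencies must be positive")
    return baseline_seconds / improved_seconds


def latency_after_speedup(baseline_seconds: float, factor: float) -> float:
    """Return the latency implied by applying a speedup factor to a baseline."""
    if baseline_seconds <= 0 or factor <= 0:
        raise ValueError("baseline and factor must be positive")
    return baseline_seconds / factor


if __name__ == "__main__":
    # Hypothetical baseline Time to First Token, NOT a number from the paper.
    hypothetical_baseline_ttft = 20.1
    # 10.05 is the Time to First Token speedup factor reported in the findings.
    improved = latency_after_speedup(hypothetical_baseline_ttft, 10.05)
    print(f"Implied TTFT: {improved:.2f} s")
    print(f"Check: {speedup(hypothetical_baseline_ttft, improved):.2f}x")
```

Read this way, a 10.05× factor means the improved latency is roughly one tenth of the baseline, whatever the baseline happens to be on a given device and model.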

Abstract

Device-side Large Language Models (LLMs) have witnessed explosive growth, offering higher privacy and availability compared to cloud-side LLMs. During LLM inference, both model weights and user data are valuable, and attackers may even compromise the OS kernel to steal them. ARM TrustZone is the de facto hardware-based isolation technology on mobile devices, used to protect sensitive applications from a compromised OS. However, protecting LLM inference with TrustZone incurs significant overhead due to its inflexible isolation of memory and the NPU. To address these challenges, this paper introduces FlexServe, a fast and secure LLM serving system for mobile devices. It first introduces a Flexible Resource Isolation mechanism to construct Flexible Secure Memory (Flex-Mem) and Flexible Secure NPU (Flex-NPU). Both memory pages and the NPU can be efficiently switched between unprotected and…
