FlexServe: A Fast and Secure LLM Serving System for Mobile Devices with Flexible Resource Isolation
Yinpeng Wu, Yitong Chen, Lixiang Wang, Jinyu Gu, Zhichao Hua, Yubin Xia

TL;DR
FlexServe is an LLM serving system for mobile devices that protects inference with ARM TrustZone while using flexible isolation of memory and the NPU to reduce overhead. It reports a 10.05× speedup in Time to First Token over strawman designs and a 24.30× end-to-end speedup in multi-model workflows.
Contribution
FlexServe introduces a Flexible Resource Isolation mechanism, used to construct Flexible Secure Memory (Flex-Mem) and a Flexible Secure NPU (Flex-NPU), together with a comprehensive LLM inference framework within TrustZone.
Findings
Achieves a 10.05× speedup in Time to First Token over strawman designs.
Attains a 24.30× end-to-end speedup in multi-model workflows.
Demonstrates these improvements with a prototype implementation.
Abstract
Device-side Large Language Models (LLMs) have witnessed explosive growth, offering higher privacy and availability compared to cloud-side LLMs. During LLM inference, both model weights and user data are valuable, and attackers may even compromise the OS kernel to steal them. ARM TrustZone is the de facto hardware-based isolation technology on mobile devices, used to protect sensitive applications from a compromised OS. However, protecting LLM inference with TrustZone incurs significant overhead due to its inflexible isolation of memory and the NPU. To address these challenges, this paper introduces FlexServe, a fast and secure LLM serving system for mobile devices. It first introduces a Flexible Resource Isolation mechanism to construct Flexible Secure Memory (Flex-Mem) and Flexible Secure NPU (Flex-NPU). Both memory pages and the NPU can be efficiently switched between unprotected and…
