EdgeFlow: Fast Cold Starts for LLMs on Mobile Devices

Yongsheng Yan; Jiacheng Shen; Xuchuan Luo; Yangfan Zhou

arXiv:2604.09083 · cs.OS · April 13, 2026

TL;DR

EdgeFlow is a mobile LLM inference framework that reduces cold-start latency by up to 4.07x by adaptively adjusting parameter precisions and optimizing data formats for NPUs.

Contribution

EdgeFlow introduces an NPU-aware adaptive quantization algorithm and a synergistic pipeline to mitigate cold-start latency in mobile LLM inference.

Findings

01. Reduces cold-start latency by up to 4.07x

02. Outperforms state-of-the-art frameworks in latency with comparable accuracy

03. Employs adaptive precision quantization tailored for NPUs

Abstract

Deploying large language models (LLMs) on mobile devices is an emerging trend to enable data privacy and offline accessibility of LLM applications. Modern mobile neural processing units (NPUs) make such deployment increasingly feasible. However, existing mobile LLM inference frameworks suffer from high start-up latency due to their inevitable cold starts, i.e., launching LLM inference when the model is not resident in device memory. In this paper, we identify the key bottleneck of mobile LLM cold starts as the waste of flash bandwidth on unimportant model parameters. We design EdgeFlow, a mobile LLM inference framework that mitigates the cold start issue by adaptively adjusting the precisions of LLM parameters. Specifically, EdgeFlow leverages 1) an NPU-aware adaptive quantization algorithm that assigns different precisions to weights at a finer granularity according to their importance…
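
The abstract describes importance-driven precision assignment only at a high level. As a rough illustration of the general idea, and not EdgeFlow's actual algorithm, the sketch below greedily assigns bit widths to weight groups under a fixed bit budget, so that fewer bits are read from flash for less important groups. The names `BIT_CHOICES`, `allocate_bits`, and `quantize_group`, the candidate precisions, and the importance scores are all hypothetical; the paper may define importance, granularity, and NPU constraints quite differently.

```python
from typing import List, Sequence, Tuple

# Hypothetical candidate precisions, in bits per weight (an assumption).
BIT_CHOICES = (2, 4, 8)


def allocate_bits(importance: Sequence[float], group_size: int,
                  budget_bits: int) -> List[int]:
    """Greedy mixed-precision allocation (illustrative only).

    Every group starts at the lowest precision. Groups are then upgraded
    one precision level at a time, in descending order of importance,
    as long as the total stays within budget_bits.
    """
    n = len(importance)
    bits = [BIT_CHOICES[0]] * n
    used = n * group_size * BIT_CHOICES[0]
    if used > budget_bits:
        raise ValueError("budget too small even at the lowest precision")

    order = sorted(range(n), key=lambda i: importance[i], reverse=True)
    for level in BIT_CHOICES[1:]:
        for i in order:
            extra = (level - bits[i]) * group_size
            if extra > 0 and used + extra <= budget_bits:
                bits[i] = level
                used += extra
    return bits


def quantize_group(weights: Sequence[float], bits: int) -> Tuple[List[int], float]:
    """Symmetric uniform quantization of one group to signed integers."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


if __name__ == "__main__":
    # Four groups of four weights each, with made-up importance scores.
    scores = [0.9, 0.1, 0.5, 0.3]
    plan = allocate_bits(scores, group_size=4, budget_bits=80)
    print(plan)
    q, scale = quantize_group([1.0, -0.4, 0.2, 0.0], plan[0])
    print(q, scale)
```

In this toy setup, lowering `budget_bits` leaves the least important groups at 2 bits first, which is the intuition behind spending flash bandwidth on the parameters that matter most.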
