EdgeFlow: Fast Cold Starts for LLMs on Mobile Devices
Yongsheng Yan, Jiacheng Shen, Xuchuan Luo, Yangfan Zhou

TL;DR
EdgeFlow is a mobile LLM inference framework that significantly reduces cold-start latency by adaptively adjusting the precision of model parameters and optimizing data formats for NPUs.
Contribution
EdgeFlow introduces an NPU-aware adaptive quantization algorithm and a synergistic pipeline to mitigate cold-start latency in mobile LLM inference.
Findings
Reduces cold-start latency by up to 4.07x
Outperforms state-of-the-art frameworks in latency with comparable accuracy
Employs adaptive precision quantization tailored for NPUs
Abstract
Deploying large language models (LLMs) on mobile devices is an emerging trend to enable data privacy and offline accessibility of LLM applications. Modern mobile neural processing units (NPUs) make such deployment increasingly feasible. However, existing mobile LLM inference frameworks suffer from high start-up latency due to their inevitable cold starts, i.e., launching LLM inferences when the model is not hosted in device memory. In this paper, we identify the key bottleneck of mobile LLM cold starts as the waste of flash bandwidth on unimportant model parameters. We design EdgeFlow, a mobile LLM inference framework that mitigates the cold start issue by adaptively adjusting the precisions of LLM parameters. Specifically, EdgeFlow leverages 1) an NPU-aware adaptive quantization algorithm that assigns different precisions to weights in a finer granularity according to their importance…
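To make the idea of importance-based precision assignment concrete, here is a minimal Python sketch. It is not EdgeFlow's algorithm, which the abstract does not spell out. It is a generic greedy allocator under stated assumptions: weights are split into groups, each group already has an importance score (how that score is computed is left open), and the goal is to stay within an average bits-per-parameter budget so that less flash bandwidth is spent on unimportant groups. The names `assign_bits` and `quantize_group`, the bit-width choices (2, 4, 8), and the budget parameter are all hypothetical.

```python
from typing import List, Sequence


def assign_bits(importance: Sequence[float], sizes: Sequence[int],
                avg_bits_budget: float,
                choices: Sequence[int] = (2, 4, 8)) -> List[int]:
    """Greedy mixed-precision allocation (illustrative, not EdgeFlow's method).

    Every group starts at the lowest bit-width. Groups are then visited from
    most to least important, and each is upgraded to the highest bit-width
    that still fits the total bit budget.
    """
    levels = sorted(choices)
    bits = [levels[0]] * len(importance)
    total_params = sum(sizes)
    budget = avg_bits_budget * total_params   # total bits we may load
    used = levels[0] * total_params           # bits used at lowest precision
    order = sorted(range(len(importance)),
                   key=lambda i: importance[i], reverse=True)
    for i in order:
        for b in reversed(levels[1:]):        # try the highest precision first
            extra = (b - bits[i]) * sizes[i]
            if used + extra <= budget:
                used += extra
                bits[i] = b
                break
    return bits


def quantize_group(weights: Sequence[float], bits: int) -> List[float]:
    """Symmetric uniform fake-quantization of one weight group."""
    qmax = 2 ** (bits - 1) - 1                # e.g. 127 for 8 bits, 1 for 2 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)
    scale = max_abs / qmax
    # Round to the nearest integer level, clamp, then map back to floats.
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


if __name__ == "__main__":
    # Three equally sized groups with hypothetical importance scores.
    plan = assign_bits([0.9, 0.1, 0.5], [100, 100, 100], avg_bits_budget=4.0)
    print(plan)  # [8, 2, 2]
```

In this toy run the most important group takes 8 bits and the other two stay at 2 bits, which averages exactly 4 bits per parameter. A real system would also have to respect the data layouts and bit-widths the target NPU supports, which this sketch ignores.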
