EdgeLoRA: An Efficient Multi-Tenant LLM Serving System on Edge Devices
Zheyu Shen, Yexiao He, Ziyao Wang, Yuning Zhang, Guoheng Sun, Wanghao Ye, Ang Li

TL;DR
EdgeLoRA is a system for serving LoRA-adapted large language models on resource-limited edge devices in multi-tenant environments. It optimizes adapter selection, memory management, and batch inference to reduce latency and increase throughput.
Contribution
The paper introduces EdgeLoRA, a novel system that combines adaptive adapter selection, heterogeneous memory management, and batch inference to enhance LLM serving efficiency on edge devices.
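To make the memory-management idea concrete, here is a minimal sketch of a cache that keeps only a bounded number of adapters resident and evicts the least recently used one when a new adapter must be loaded. This illustrates the general technique of limiting adapter swapping and is not EdgeLoRA's actual implementation. The class `AdapterCache`, the `load_fn` callback, and the LRU eviction policy are all assumptions made for this example.

```python
from collections import OrderedDict


class AdapterCache:
    """Hypothetical LRU cache holding at most `capacity` adapters in fast memory."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn          # assumed callback: adapter_id -> weights
        self._resident = OrderedDict()  # adapter_id -> weights, oldest first
        self.loads = 0                  # number of (slow) loads performed

    def get(self, adapter_id):
        # Cache hit: mark the adapter as most recently used and return it.
        if adapter_id in self._resident:
            self._resident.move_to_end(adapter_id)
            return self._resident[adapter_id]
        # Cache miss: evict the least recently used adapter if the cache is full.
        if len(self._resident) >= self.capacity:
            self._resident.popitem(last=False)
        weights = self.load_fn(adapter_id)
        self.loads += 1
        self._resident[adapter_id] = weights
        return weights

    def resident_ids(self):
        # Adapter ids ordered from least to most recently used.
        return list(self._resident)
```

With a capacity of two, requesting adapters `a`, `b`, `a`, `c` in that order evicts `b` rather than `a`, because `a` was touched more recently.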
Findings
Up to 4x increase in throughput compared to existing solutions.
Able to serve several orders of magnitude more adapters simultaneously.
Significant reduction in latency and memory overhead.
Abstract
Large Language Models (LLMs) have gained significant attention due to their versatility across a wide array of applications. Fine-tuning LLMs with parameter-efficient adapters, such as Low-Rank Adaptation (LoRA), enables these models to efficiently adapt to downstream tasks without extensive retraining. Deploying fine-tuned LLMs on multi-tenant edge devices offers substantial benefits, such as reduced latency, enhanced privacy, and personalized responses. However, serving LLMs efficiently on resource-constrained edge devices presents critical challenges, including the complexity of adapter selection for different tasks and memory overhead from frequent adapter swapping. Moreover, given the multiple requests in multi-tenant settings, processing requests sequentially results in underutilization of computational resources and increased latency. This paper introduces EdgeLoRA, an efficient…
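The abstract contrasts sequential processing with batching. As a rough illustration of that contrast, the sketch below groups pending requests that share an adapter so each group can go through one batched forward pass. It is a simplified stand-in: the paper's batch inference may combine requests differently, and the function name `group_by_adapter`, the `(request_id, adapter_id)` tuple format, and the `max_batch_size` parameter are hypothetical.

```python
from collections import defaultdict


def group_by_adapter(requests, max_batch_size):
    """Group (request_id, adapter_id) pairs into per-adapter batches.

    Returns a list of (adapter_id, [request_ids]) tuples, each holding at most
    `max_batch_size` requests. Adapters appear in first-seen order.
    """
    grouped = defaultdict(list)
    for request_id, adapter_id in requests:
        grouped[adapter_id].append(request_id)

    batches = []
    for adapter_id, request_ids in grouped.items():
        # Split large groups so no batch exceeds the size limit.
        for start in range(0, len(request_ids), max_batch_size):
            batches.append((adapter_id, request_ids[start:start + max_batch_size]))
    return batches
```

Four requests spread over two adapters with a batch limit of two would yield three batches instead of four sequential passes. The saving grows as more tenants share the same adapter.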
