PerLLM: Personalized Inference Scheduling with Edge-Cloud Collaboration for Diverse LLM Services
Zheming Yang, Yuanhao Yang, Chang Zhao, Qi Guo, Wenkai He, and Wen Ji

TL;DR
PerLLM is a personalized edge-cloud inference scheduling framework for diverse large language model services that optimizes resource allocation, reduces energy costs, and improves throughput under various constraints.
Contribution
It introduces a novel scheduling framework using an upper confidence bound algorithm for efficient edge-cloud collaboration in LLM services.
Findings
Achieves 2.2x, 2.1x, and 1.6x throughput improvements
Reduces energy costs by over 50%
Effectively meets personalized processing time requirements
Abstract
With the rapid growth in the number of large language model (LLM) users, it is difficult for bandwidth-constrained cloud servers to simultaneously process massive LLM services in real-time. Recently, edge-cloud infrastructures have been used to improve the processing efficiency of large-scale LLM services. However, the diversity of task requirements and the dynamics of resources pose great challenges to inference scheduling, leading to the wastage of many resources. In this paper, we present PerLLM, a personalized inference scheduling framework with edge-cloud collaboration designed for diverse LLM services. For the complexity of multiple constraints and the decision-making process of edge-cloud collaboration, we integrate the upper confidence bound algorithm based on the constraint satisfaction mechanism in PerLLM. For diverse LLM services, PerLLM can optimize service scheduling and…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsPrivacy-Preserving Technologies in Data · Scientific Computing and Data Management · IoT and Edge/Fog Computing
