Quality-of-Service Aware LLM Routing for Edge Computing with Multiple Experts
Jin Yang, Qiong Wu, Zhiying Feng, Zhi Zhou, Deke Guo, Xu Chen

TL;DR
This paper introduces a deep reinforcement learning (DRL) framework for routing user requests to edge LLMs, optimizing quality-of-service (QoS) and resource efficiency under heterogeneous LLM services and dynamic workloads.
Contribution
It presents a novel DRL-based routing approach with dynamic state abstraction and impact estimation for stable QoS in edge LLM services.
Findings
Significant QoS improvement over baselines
Enhanced resource efficiency in LLM routing
Effective handling of workload heterogeneity
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities, leading to a significant increase in user demand for LLM services. However, cloud-based LLM services often suffer from high latency, unstable responsiveness, and privacy concerns. Therefore, multiple LLMs are usually deployed at the network edge to boost real-time responsiveness and protect data privacy, particularly for many emerging smart mobile and IoT applications. Given the varying response quality and latency of LLM services, a critical issue is how to route user requests from mobile and IoT devices to an appropriate LLM service (i.e., edge LLM expert) to ensure acceptable quality-of-service (QoS). Existing routing algorithms fail to simultaneously address the heterogeneity of LLM services, the interference among requests, and dynamic workloads, all of which must be handled to maintain long-term stable QoS. To meet these…
Taxonomy
Topics: IoT and Edge/Fog Computing · Big Data and Digital Economy · Mobile Crowdsensing and Crowdsourcing
