Efficient Routing of Inference Requests across LLM Instances in Cloud-Edge Computing
Shibo Yu, Mohammad Goudarzi, and Adel Nadjaran Toosi

TL;DR
This paper presents an adaptive, multi-objective routing algorithm using NSGA-II to efficiently distribute LLM inference requests across heterogeneous cloud-edge environments, balancing quality, latency, and cost.
Contribution
It introduces a novel multi-objective optimization approach for LLM request routing that adapts to workload heterogeneity and resource diversity, improving scalability and efficiency.
Findings
Preserves 95.2% of cloud-only response quality
Reduces inference cost by 34.9%
Incurs only a slight increase in latency
Abstract
The rising demand for Large Language Model (LLM) inference services has intensified pressure on computational resources, resulting in latency and cost challenges. This paper introduces a novel routing algorithm based on the Non-dominated Sorting Genetic Algorithm II (NSGA-II) to distribute inference requests across heterogeneous LLM instances in a cloud-edge computing environment. Formulated as a multi-objective optimization problem, the algorithm balances response quality, response time, and inference cost, adapting to request heterogeneity (e.g., varying complexity and prompt lengths) and node diversity (e.g., edge vs. cloud resources). This adaptive routing algorithm optimizes performance under dynamic workloads. We benchmark the approach using a testbed with datasets including Stanford Question Answering Dataset (SQuAD), Mostly Basic Python Problems (MBPP), Hella Situations With…
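The core idea behind treating routing as a multi-objective problem can be illustrated with a Pareto-dominance check, which is the building block of NSGA-II's non-dominated sorting. The sketch below is not the authors' implementation: the `Candidate` type, node names, and all numeric values are hypothetical, and it only extracts the first non-dominated front for a single request (no crowding distance, crossover, or mutation).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    """One hypothetical routing option for a request (all values are made up)."""
    node: str
    quality: float    # response quality, higher is better
    latency_s: float  # response time in seconds, lower is better
    cost: float       # inference cost in arbitrary units, lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on every objective and strictly better on one."""
    no_worse = (a.quality >= b.quality
                and a.latency_s <= b.latency_s
                and a.cost <= b.cost)
    strictly_better = (a.quality > b.quality
                       or a.latency_s < b.latency_s
                       or a.cost < b.cost)
    return no_worse and strictly_better


def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Keep only candidates that no other candidate dominates."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


# Illustrative candidates: one cloud instance and three edge instances.
candidates = [
    Candidate("cloud-large", quality=0.92, latency_s=1.8, cost=1.00),
    Candidate("edge-small", quality=0.78, latency_s=0.6, cost=0.20),
    Candidate("edge-medium", quality=0.85, latency_s=1.1, cost=0.45),
    Candidate("edge-slow", quality=0.75, latency_s=1.5, cost=0.50),
]

front = pareto_front(candidates)
front_nodes = [c.node for c in front]
print(front_nodes)
```

In this toy example, `edge-slow` is dropped because `edge-small` is at least as good on all three objectives. The remaining options represent genuine trade-offs among quality, latency, and cost, and a router would still need some policy to pick one of them.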
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Big Data and Digital Economy
