Efficient Routing of Inference Requests across LLM Instances in Cloud-Edge Computing

Shibo Yu, Mohammad Goudarzi, and Adel Nadjaran Toosi

arXiv:2507.15553 · cs.DC · January 15, 2026

Open Access

TL;DR

This paper presents an adaptive, multi-objective routing algorithm using NSGA-II to efficiently distribute LLM inference requests across heterogeneous cloud-edge environments, balancing quality, latency, and cost.

Contribution

It introduces a novel multi-objective optimization approach for LLM request routing that adapts to workload heterogeneity and resource diversity, improving scalability and efficiency.

Findings

01 Preserves 95.2% of cloud-only response quality

02 Reduces inference cost by 34.9%

03 Incurs only a slight increase in latency

Abstract

The rising demand for Large Language Model (LLM) inference services has intensified pressure on computational resources, resulting in latency and cost challenges. This paper introduces a novel routing algorithm based on the Non-dominated Sorting Genetic Algorithm II (NSGA-II) to distribute inference requests across heterogeneous LLM instances in a cloud-edge computing environment. Formulated as a multi-objective optimization problem, the algorithm balances response quality, response time, and inference cost, adapting to request heterogeneity (e.g., varying complexity and prompt lengths) and node diversity (e.g., edge vs. cloud resources). This adaptive routing algorithm optimizes performance under dynamic workloads. We benchmark the approach using a testbed with datasets including Stanford Question Answering Dataset (SQuAD), Mostly Basic Python Problems (MBPP), Hella Situations With…
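To make the multi-objective framing concrete, the sketch below shows Pareto dominance over the three objectives named in the abstract (response quality, response time, inference cost). Non-dominated filtering is the building block that NSGA-II's sorting step relies on; this is not the paper's implementation, and it omits the genetic operators and crowding distance entirely. The `Candidate` type, the instance names, and all numeric values are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """One hypothetical routing option for a single request."""
    instance: str
    quality: float   # higher is better
    latency: float   # seconds, lower is better
    cost: float      # arbitrary units, lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is no worse than `b` on every objective and strictly better on at least one."""
    no_worse = a.quality >= b.quality and a.latency <= b.latency and a.cost <= b.cost
    strictly_better = a.quality > b.quality or a.latency < b.latency or a.cost < b.cost
    return no_worse and strictly_better


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Return the candidates that no other candidate dominates, in input order."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Illustrative values only; they are not measurements from the paper.
candidates = [
    Candidate("edge-small", quality=0.70, latency=0.4, cost=1.0),
    Candidate("edge-large", quality=0.80, latency=0.9, cost=2.0),
    Candidate("cloud-large", quality=0.95, latency=1.5, cost=5.0),
    Candidate("cloud-slow", quality=0.75, latency=2.0, cost=6.0),
]
front = pareto_front(candidates)
front_names = [c.instance for c in front]
```

In this toy data, `cloud-slow` drops out because `edge-large` is better on all three objectives, while the remaining three options each represent a different trade-off. A router would still need a policy for picking one point from the front, which this sketch does not attempt.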

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Big Data and Digital Economy