Dynamic Quality-Latency Aware Routing for LLM Inference in Wireless Edge-Device Networks

Rui Bao; Nan Xue; Yaping Sun; Zhiyong Chen

arXiv:2508.11291 · cs.IT · August 18, 2025

TL;DR

This paper introduces a dynamic routing framework that balances inference quality and latency in wireless edge-device networks, optimizing LLM deployment between mobile devices and edge servers.

Contribution

The paper proposes a quality-latency aware routing framework with separate cost models for single-turn and multi-turn queries, lowering response latency and reducing the number of large-model invocations.

Findings

1. Cuts average response latency by 5-15%.

2. Reduces large model invocations by 10-20%.

3. Effective on multiple benchmarks, including MMLU, GSM8K, and MT-Bench-101.

Abstract

The integration of wireless communications and Large Language Models (LLMs) is poised to unlock ubiquitous intelligent services, yet deploying them in wireless edge-device collaborative environments presents a critical trade-off between inference quality and end-to-end latency. A fundamental mismatch exists between task complexity and resource allocation: offloading simple queries invites prohibitive latency, while on-device models lack the capacity for demanding computations. To address this challenge, we propose a dynamic, quality-latency aware routing framework that orchestrates inference between a lightweight model on the mobile device and a powerful model on the edge server. Our framework employs two distinct cost models: for single-turn queries, it fuses a BERT-predicted semantic score with communication and computation overheads; for multi-turn dialogues, it further quantifies…
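
The abstract does not spell out the cost formula, so the following is only a minimal sketch of how a single-turn router might weigh a predicted quality score against communication and computation overheads. The names `QueryEstimate`, `edge_latency`, `route`, and `latency_weight`, as well as the linear trade-off itself, are assumptions made for illustration and are not the paper's cost model. The quality score would come from a predictor such as the BERT-based scorer the abstract mentions; here it is passed in as a plain number.

```python
from dataclasses import dataclass


@dataclass
class QueryEstimate:
    """Hypothetical per-query estimates (all fields are illustrative assumptions)."""
    quality_gain: float      # predicted benefit of the large model over the small one, in [0, 1]
    uplink_s: float          # time to send the query to the edge server, in seconds
    downlink_s: float        # time to return the response to the device, in seconds
    edge_compute_s: float    # large-model inference time on the edge server, in seconds
    device_compute_s: float  # small-model inference time on the device, in seconds


def edge_latency(q: QueryEstimate) -> float:
    """End-to-end latency of offloading: uplink + edge inference + downlink."""
    return q.uplink_s + q.edge_compute_s + q.downlink_s


def route(q: QueryEstimate, latency_weight: float = 0.5) -> str:
    """Return "edge" if the predicted quality gain outweighs the extra latency.

    Assumed linear trade-off: score = quality_gain - latency_weight * extra_latency.
    """
    extra_latency = edge_latency(q) - q.device_compute_s
    score = q.quality_gain - latency_weight * extra_latency
    return "edge" if score > 0 else "device"


# A hard query with a fast link is offloaded; an easy query with a slow link stays local.
hard_query = QueryEstimate(quality_gain=0.8, uplink_s=0.1, downlink_s=0.1,
                           edge_compute_s=0.5, device_compute_s=0.4)
easy_query = QueryEstimate(quality_gain=0.05, uplink_s=0.2, downlink_s=0.2,
                           edge_compute_s=0.5, device_compute_s=0.3)
print(route(hard_query))  # edge
print(route(easy_query))  # device
```

In this sketch, `latency_weight` sets how many units of predicted quality one second of extra latency is worth; raising it keeps more queries on the device. A multi-turn variant would need additional terms, which the abstract only begins to describe.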
