Dynamic Quality-Latency Aware Routing for LLM Inference in Wireless Edge-Device Networks
Rui Bao, Nan Xue, Yaping Sun, Zhiyong Chen

TL;DR
This paper introduces a dynamic routing framework that balances inference quality against latency in wireless edge-device networks by deciding, per query, whether to run a lightweight model on the mobile device or a more powerful model on the edge server.
Contribution
It proposes a quality-latency aware routing framework with separate cost models for single-turn queries and multi-turn dialogues, reducing both response latency and the number of large-model invocations.
Findings
Cuts average response latency by 5-15%.
Reduces large model invocations by 10-20%.
Effective on multiple benchmarks including MMLU, GSM8K, and MT-Bench-101.
Abstract
The integration of wireless communications and Large Language Models (LLMs) is poised to unlock ubiquitous intelligent services, yet deploying them in wireless edge-device collaborative environments presents a critical trade-off between inference quality and end-to-end latency. A fundamental mismatch exists between task complexity and resource allocation: offloading simple queries invites prohibitive latency, while on-device models lack the capacity for demanding computations. To address this challenge, we propose a dynamic, quality-latency aware routing framework that orchestrates inference between a lightweight model on the mobile device and a powerful model on the edge server. Our framework employs two distinct cost models: for single-turn queries, it fuses a BERT-predicted semantic score with communication and computation overheads; for multi-turn dialogues, it further quantifies…
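To make the single-turn routing idea concrete, here is a minimal Python sketch. It is not the paper's cost model, whose exact form is not given in this excerpt. It assumes a hypothetical `difficulty` score in [0, 1] standing in for the BERT-predicted semantic score, a hypothetical `LatencyEstimate` record for the communication and computation overheads, and an assumed weighted-sum combination controlled by `latency_weight` and `latency_scale_s`.

```python
from dataclasses import dataclass


@dataclass
class LatencyEstimate:
    """Hypothetical latency inputs, all in seconds (not from the paper)."""
    uplink_s: float          # time to send the query to the edge server
    downlink_s: float        # time to return the response to the device
    edge_compute_s: float    # large-model inference time on the edge server
    device_compute_s: float  # small-model inference time on the device


def route_single_turn(
    difficulty: float,
    est: LatencyEstimate,
    latency_weight: float = 0.5,   # assumed trade-off knob in [0, 1]
    latency_scale_s: float = 5.0,  # assumed latency that counts as "worst case"
) -> str:
    """Return "edge" or "device" for one single-turn query.

    `difficulty` stands in for a predicted semantic score: higher means the
    small on-device model is more likely to answer poorly.
    """
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must be in [0, 1]")
    if not 0.0 <= latency_weight <= 1.0:
        raise ValueError("latency_weight must be in [0, 1]")

    # End-to-end latency of each option.
    device_latency = est.device_compute_s
    edge_latency = est.uplink_s + est.edge_compute_s + est.downlink_s

    # Normalize latency to [0, 1] so it is comparable with the quality term.
    device_lat_norm = min(device_latency / latency_scale_s, 1.0)
    edge_lat_norm = min(edge_latency / latency_scale_s, 1.0)

    # Assumption: the small model's expected quality loss grows with
    # difficulty, while the large model's quality loss is treated as zero.
    quality_weight = 1.0 - latency_weight
    device_cost = quality_weight * difficulty + latency_weight * device_lat_norm
    edge_cost = latency_weight * edge_lat_norm

    return "edge" if edge_cost < device_cost else "device"


if __name__ == "__main__":
    est = LatencyEstimate(
        uplink_s=0.8, downlink_s=0.2, edge_compute_s=0.4, device_compute_s=0.5
    )
    print(route_single_turn(0.1, est))  # easy query
    print(route_single_turn(0.9, est))  # hard query
```

Under these assumptions, an easy query stays on the device because the extra transmission time outweighs the small quality gain, while a hard query is offloaded because the expected quality loss dominates. Raising `latency_weight` pushes more queries toward the on-device model.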
