ConsRoute: Consistency-Aware Adaptive Query Routing for Cloud-Edge-Device Large Language Models
Haoyu Qiao, Hao Zhang, Shanwen Mao, Siyao Cheng, and Jie Liu

TL;DR
ConsRoute is a semantic-aware adaptive query routing framework for cloud-edge-device LLM inference that improves efficiency and reduces latency by directly assessing response consistency and dynamically balancing quality and cost.
Contribution
It introduces a novel reranker-based semantic consistency assessment and cluster-specific routing thresholds, enhancing routing accuracy and efficiency over prior methods.
Findings
Achieves at least 95% of the cloud model's inference quality.
Reduces latency and inference cost by nearly 40%.
Outperforms existing routing baselines in multiple metrics.
Abstract
Large language models (LLMs) deliver impressive capabilities but incur substantial inference latency and cost, which hinders their deployment in latency-sensitive and resource-constrained scenarios. Cloud-edge-device collaborative inference has emerged as a promising paradigm by dynamically routing queries to models of different capacities across tiers. In this paper, we propose ConsRoute, a lightweight, semantic-aware, and adaptive routing framework that significantly improves inference efficiency while minimizing impact on response quality. Unlike prior routing methods that rely on predicting coarse-grained output quality gaps, ConsRoute leverages a reranker to directly assess the semantic consistency between responses generated by models at different tiers, yielding fine-grained soft supervision signals for routing. To minimize device-side overhead, ConsRoute reuses hidden states…
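The routing idea described above (a predicted consistency score compared against a cluster-specific threshold) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract is truncated, so the tier order, the `route_query` function, the `RoutingDecision` type, the threshold values, and the fallback `default_threshold` are all hypothetical names and numbers chosen for illustration. The sketch assumes some upstream predictor has already estimated, for each lower tier, how semantically consistent that tier's response would be with the cloud response.

```python
from dataclasses import dataclass
from typing import Dict

# Assumed tier order, cheapest first. The cloud tier is the final fallback.
TIERS = ("device", "edge", "cloud")


@dataclass(frozen=True)
class RoutingDecision:
    tier: str          # tier chosen to answer the query
    score: float       # predicted consistency with the cloud response
    threshold: float   # threshold the score was compared against


def route_query(
    predicted_consistency: Dict[str, float],
    cluster_id: int,
    thresholds: Dict[int, Dict[str, float]],
    default_threshold: float = 0.5,  # hypothetical fallback, not from the paper
) -> RoutingDecision:
    """Pick the cheapest tier whose predicted consistency clears the
    threshold of the query's semantic cluster; otherwise use the cloud."""
    cluster_thresholds = thresholds.get(cluster_id, {})
    for tier in TIERS[:-1]:  # try device, then edge
        score = predicted_consistency[tier]
        threshold = cluster_thresholds.get(tier, default_threshold)
        if score >= threshold:
            return RoutingDecision(tier, score, threshold)
    # The cloud response is the reference, so it is trivially self-consistent.
    return RoutingDecision("cloud", 1.0, 0.0)


# Illustrative thresholds: cluster 0 is stricter than cluster 1.
example_thresholds = {
    0: {"device": 0.8, "edge": 0.6},
    1: {"device": 0.5, "edge": 0.4},
}
example_scores = {"device": 0.7, "edge": 0.65}

decision_strict = route_query(example_scores, 0, example_thresholds)
decision_lenient = route_query(example_scores, 1, example_thresholds)
```

With these made-up numbers, the same predicted scores send the query to the edge tier under the stricter cluster and keep it on the device under the more lenient one, which is the kind of quality-cost trade-off that per-cluster thresholds are meant to express.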
Taxonomy
Topics: Advanced Neural Network Applications · Big Data and Digital Economy · IoT and Edge/Fog Computing
