ConsRoute: Consistency-Aware Adaptive Query Routing for Cloud-Edge-Device Large Language Models

Haoyu Qiao, Hao Zhang, Shanwen Mao, Siyao Cheng, and Jie Liu

arXiv:2603.21237 · cs.AI · March 24, 2026

TL;DR

ConsRoute is a semantic-aware adaptive query routing framework for cloud-edge-device LLM inference that improves efficiency and reduces latency by directly assessing response consistency and dynamically balancing quality and cost.

Contribution

ConsRoute introduces a reranker-based semantic consistency assessment and cluster-specific routing thresholds, which improve routing accuracy and efficiency over prior methods.
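
As a rough illustration of what cluster-specific routing thresholds might look like in practice, the sketch below assigns a query embedding to its nearest cluster centroid and keeps the query on the local tier only if a predicted consistency score clears that cluster's threshold. The centroids, thresholds, score values, and the names `nearest_cluster` and `route` are all hypothetical. The summary here does not specify how ConsRoute forms clusters or sets thresholds, so this should be read as a minimal sketch under those assumptions rather than the authors' method.

```python
import math

def nearest_cluster(embedding, centroids):
    """Return the index of the centroid closest to the embedding (Euclidean)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(embedding, centroids[i]))

def route(embedding, consistency_score, centroids, thresholds):
    """Keep the query local if the score clears its cluster's threshold."""
    cluster = nearest_cluster(embedding, centroids)
    return "local" if consistency_score >= thresholds[cluster] else "escalate"

# Hypothetical values: two clusters, the second with a stricter threshold.
CENTROIDS = [(0.0, 0.0), (1.0, 1.0)]
THRESHOLDS = [0.6, 0.9]

# The same predicted consistency score leads to different decisions
# depending on which cluster the query falls into.
decision_a = route((0.1, -0.1), 0.7, CENTROIDS, THRESHOLDS)
decision_b = route((0.9, 1.2), 0.7, CENTROIDS, THRESHOLDS)
print(decision_a, decision_b)
```

The point of a per-cluster threshold, as opposed to one global cutoff, is that query types where the smaller model tends to diverge from the cloud model can be held to a stricter bar.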

Findings

01. Achieves at least 95% of cloud performance in inference quality.
02. Reduces latency and inference cost by nearly 40%.
03. Outperforms existing routing baselines on multiple metrics.

Abstract

Large language models (LLMs) deliver impressive capabilities but incur substantial inference latency and cost, which hinders their deployment in latency-sensitive and resource-constrained scenarios. Cloud-edge-device collaborative inference has emerged as a promising paradigm by dynamically routing queries to models of different capacities across tiers. In this paper, we propose ConsRoute, a lightweight, semantic-aware, and adaptive routing framework that significantly improves inference efficiency while minimizing impact on response quality. Unlike prior routing methods that rely on predicting coarse-grained output quality gaps, ConsRoute leverages a reranker to directly assess the semantic consistency between responses generated by models at different tiers, yielding fine-grained soft supervision signals for routing. To minimize device-side overhead, ConsRoute reuses hidden states…
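
To make the idea of "soft supervision" concrete, the sketch below trains against a continuous target rather than a hard 0/1 label. It assumes the reranker's consistency score has been normalized to [0, 1] and uses binary cross-entropy with soft targets. Both choices are assumptions made for illustration: the abstract as excerpted here does not state the loss function or the score range, and the names `soft_bce` and `mean_soft_bce` are hypothetical.

```python
import math

def soft_bce(predicted, target, eps=1e-7):
    """Binary cross-entropy between a predicted probability and a soft target in [0, 1]."""
    p = min(max(predicted, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def mean_soft_bce(predictions, targets):
    """Average soft-label loss over a batch of (prediction, target) pairs."""
    losses = [soft_bce(p, t) for p, t in zip(predictions, targets)]
    return sum(losses) / len(losses)

# Hypothetical reranker scores used as soft targets for a router's predictions.
targets = [0.8, 0.2]
close_loss = mean_soft_bce([0.8, 0.2], targets)  # predictions match the targets
far_loss = mean_soft_bce([0.3, 0.7], targets)    # predictions far from the targets
print(close_loss < far_loss)
```

With a soft target, the loss is minimized when the prediction equals the target score, so the router is pushed to reproduce the degree of consistency and not just a binary route/escalate label.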

Taxonomy

Topics: Advanced Neural Network Applications · Big Data and Digital Economy · IoT and Edge/Fog Computing